Cloud Dataproc is a Google Cloud Platform (GCP) service that manages Hadoop and Spark clusters in the cloud and can be used to create large clusters quickly. The Google Dataproc provisioner simply calls the Cloud Dataproc APIs to create and delete clusters in your GCP account. The provisioner exposes several configuration settings that control what type of cluster is created.
Version compatibility
Problem: The version of your CDAP environment might not be compatible with the version of your Dataproc cluster.
Recommended: Upgrade to the latest CDAP version and use one of the supported Dataproc versions.
Earlier versions of CDAP are only compatible with unsupported versions of Dataproc. Dataproc does not provide updates and support for clusters created with these versions. Although you can continue running a cluster that was created with an unsupported version, we recommend replacing it with one created with a supported version.
CDAP version | Dataproc version |
---|---|
6.10 | 2.1, 2.0 * |
6.9 | 2.1, 2.0, 1.5 * |
6.7, 6.8 | 2.0, 1.5 * |
6.4-6.6 | 2.0 *, 1.3 ** |
6.1-6.3 | 1.3 ** |
* CDAP versions 6.4 and later are compatible with supported versions of Dataproc. Unless specific OS features are needed, the recommended practice is to specify the major.minor image version.
** CDAP versions 6.1 to 6.6 are compatible with unsupported Dataproc version 1.3.
Best Practices
Configurations
Recommended: When you create a static cluster for your pipelines, use the following configurations.
Parameter | Description |
---|---|
| Retains YARN logs. |
| Enables YARN to check for physical memory limits and kill containers if they exceed physical memory. |
| Enables YARN to check for virtual memory limits and kill containers if they exceed virtual memory. |
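The property names in this table are omitted above; the following sketch assumes the standard YARN NodeManager settings that match these descriptions, written in Dataproc's `prefix:property=value` form (the 86400-second retention value is illustrative, not a stated recommendation):

```
yarn:yarn.nodemanager.delete.debug-delay-sec=86400
yarn:yarn.nodemanager.pmem-check-enabled=true
yarn:yarn.nodemanager.vmem-check-enabled=true
```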
Account Information
Project ID
...
The number of master nodes in your cluster. Master nodes run the YARN ResourceManager and the HDFS NameNode, and are the nodes that CDAP connects to when executing jobs. Must be set to either 1 or 3.
Default is 1.
Master Machine Type
The type of master machine to use. Select one of the following machine types:
n1
n2
n2d
e2
Default is e2 (CDAP 6.7.2 and later).
Default is n2 (CDAP 6.7.1).
Default is n1 (CDAP 6.7.0 and earlier).
Master Cores
The number of virtual cores to allocate to each master node.
Default is 2.
Master Memory (GB)
The amount of memory in gigabytes to allocate to each master node.
...
Standard Persistent Disk
SSD Persistent Disk
Default is Standard Persistent Disk.
Worker Nodes Settings
Number of Primary Workers
Worker nodes contain a YARN NodeManager and an HDFS DataNode.
...
Default is 2.
Number of Secondary Workers
Secondary worker nodes contain a YARN NodeManager, but not an HDFS DataNode. This is normally set to zero unless an autoscale policy requires it to be higher.
Worker Machine Type
The type of worker machine to use. Select one of the following machine types:
n1
n2
n2d
e2
Default is e2 (CDAP 6.7.2 and later).
Default is n2 (CDAP 6.7.1).
Default is n1 (CDAP 6.7.0 and earlier).
Worker Cores
The number of virtual cores to allocate for each worker node.
Default is 2.
Worker Memory (GB)
The amount of memory in gigabytes to allocate to each worker node.
Default is 8 GB.
Worker Disk Size (GB)
The disk size in gigabytes to allocate for each worker node.
...
Standard Persistent Disk
SSD Persistent Disk
Default is Standard Persistent Disk.
Number of Cluster Workers Settings
Enable Predefined Dataproc Autoscaling
Note: Enable Predefined Dataproc Autoscaling was introduced in CDAP 6.6.0.
Specifies whether to enable Dataproc autoscaling with a predefined autoscaling policy provided by CDAP.
When you enable predefined autoscaling:
The Number of Primary Workers, Number of Secondary Workers, and Autoscaling Policy properties are not considered.
The worker machine type and configuration are the same as the chosen profile's.
Turning off the Use Predefined Autoscaling toggle disables predefined autoscaling and restores the profile's original behavior.
You can also enable predefined autoscaling for a single pipeline run by setting the runtime argument system.profile.properties.enablePredefinedAutoScaling to true.
Number of Primary Workers
Worker nodes contain a YARN NodeManager and an HDFS DataNode.
Default is 2.
Number of Secondary Workers
Secondary worker nodes contain a YARN NodeManager, but not an HDFS DataNode. This is normally set to zero unless an autoscale policy requires it to be higher.
Autoscaling Policy
Specify the Autoscaling Policy ID (name) or the resource URI.
For information about configuring and using Dataproc autoscaling to automatically and dynamically resize clusters to meet workload demands, see the Autoscaling clusters guide.
Recommended: Use autoscaling policies for increasing the cluster size, not for decreasing the size. Decreasing the cluster size with autoscaling removes nodes that hold intermediate data, which might cause your pipelines to run slowly or fail.
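As a sketch of this recommendation, an autoscaling policy can add secondary workers without ever removing nodes by setting `scaleDownFactor` to 0. Field names follow the Dataproc `AutoscalingPolicy` resource; the instance counts and cooldown period below are illustrative:

```yaml
workerConfig:
  minInstances: 2
  maxInstances: 2          # keep primary workers fixed
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 10         # allow scale-up via secondary workers
basicAlgorithm:
  cooldownPeriod: 120s
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 0.0   # never remove nodes during a run
    gracefulDecommissionTimeout: 0s
```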
Cluster Metadata Settings
...
Assign Network tags to apply firewall rules to the specific nodes of a cluster. Network tags must start with a lowercase letter and can contain lowercase letters, numbers, and hyphens. Tags must end with a lowercase letter or number.
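These naming rules can be checked with a short regular expression. This is an illustrative helper, not part of CDAP; the 63-character cap reflects GCP's general limit on network tag length and is the only constraint not stated above:

```python
import re

# Start with a lowercase letter; contain only lowercase letters,
# numbers, and hyphens; end with a lowercase letter or number.
# 1 + 61 + 1 characters enforces the 63-character maximum.
TAG_PATTERN = re.compile(r"^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$")

def is_valid_network_tag(tag: str) -> bool:
    """Check a network tag against the naming rules above."""
    return bool(TAG_PATTERN.fullmatch(tag))
```

For example, `is_valid_network_tag("web-server-1")` passes, while `"Web-server"`, `"1server"`, and `"server-"` all fail.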
Shielded VMs
Note: Shielded VM settings were introduced in CDAP 6.5.0.
Enable Secure Boot
Defines whether the Dataproc VMs have Secure Boot enabled.
...
Dataproc image URI. If the URI is not specified, it will be inferred from the Image Version.
...
Staging Bucket
The Google Cloud Storage bucket used to stage job dependencies and config files for running pipelines in Cloud Dataproc.
Temp Bucket
Note: Temp Bucket config was introduced in CDAP 6.9.2.
The Google Cloud Storage bucket used to store ephemeral cluster and job data, such as Spark and MapReduce history files, in Cloud Dataproc.
Encryption Key Name
The GCP customer managed encryption key (CMEK) name used by Cloud Dataproc.
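A CMEK name is the fully qualified Cloud KMS key resource name; the project, location, key ring, and key below are placeholders:

```
projects/my-project/locations/us-central1/keyRings/my-keyring/cryptoKeys/my-key
```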
Autoscaling Policy
Specify the Autoscaling Policy ID (name) or the resource URI.
For information about configuring and using Dataproc autoscaling to automatically and dynamically resize clusters to meet workload demands, see the Autoscaling clusters guide.
...
OAuth Scopes
Note: OAuth Scopes config was introduced in CDAP 6.9.2.
The OAuth 2.0 scopes that you might need to request to access Google APIs, depending on the level of access you need. Google Cloud Platform Scope is always included.
Initialization Actions
A list of scripts to run during cluster initialization. Initialization actions must be stored in Google Cloud Storage.
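For illustration (the bucket and script names are hypothetical), initialization actions are referenced by their Cloud Storage URIs:

```
gs://my-bucket/init-actions/install-libs.sh
gs://my-bucket/init-actions/configure-proxy.sh
```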
...
Cluster properties used to override default configuration properties for the Hadoop services. The applicable key-value pairs can be found here.
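Dataproc expects these key-value pairs in a `prefix:property=value` form, where the prefix selects the configuration file the property belongs to (for example, `spark` for spark-defaults.conf, `hdfs` for hdfs-site.xml); the values below are illustrative:

```
spark:spark.executor.memory=4g
hdfs:dfs.replication=2
yarn:yarn.log-aggregation-enable=true
```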
Common Labels
Note: Labels were introduced in CDAP 6.5.0.
A label is a key-value pair that helps you organize your Cloud Dataproc clusters and jobs. You can attach a label to each resource, and then filter the resources based on their labels. Information about labels is forwarded to the billing system, so you can break down your billing charges by label.
Specifies labels for the Dataproc clusters and jobs being created.
Cluster Labels
Note: Labels (now Cluster Labels) were introduced in CDAP 6.5.0.
Specifies labels for the Dataproc cluster being created.
...
Configure Dataproc to delete the cluster if it has been idle for longer than this many minutes. Clusters are normally deleted directly after a run ends, but this deletion can fail in rare situations, for example, if permissions are revoked in the middle of a run or if there is a Dataproc outage. Use this setting to ensure that clusters are eventually deleted even if the instance is unable to delete the cluster for any reason.
Default is 30 minutes (starting in CDAP 6.6.0)
Skip Cluster Delete
Whether to skip cluster deletion at the end of a run. Clusters will need to be deleted manually. This should only be used when debugging a failed run.
...
The number of seconds to wait between polls for cluster status.
Default is 2.
Dataproc profile UI properties mapped to JSON properties
Dataproc profile UI property name | Dataproc profile JSON property name |
---|---|
Profile Label | label |
Profile Name | name |
Description | description |
Project ID | projectId |
Creator Service Account Key | accountKey |
Region | region |
Zone | zone |
Network | network |
Network Host Project ID | networkHostProjectId |
Subnet | subnet |
Runner Service Account | serviceAccount |
Number of masters | masterNumNodes |
Master Machine Type | masterMachineType |
Master Cores | masterCPUs |
Master Memory (GB) | masterMemoryMB |
Master Disk Size (GB) | masterDiskGB |
Master Disk Type | masterDiskType |
Number of Primary Workers | workerNumNodes |
Number of Secondary Workers | secondaryWorkerNumNodes |
Worker Machine Type | workerMachineType |
Worker Cores | workerCPUs |
Worker Memory (GB) | workerMemoryMB |
Worker Disk Size (GB) | workerDiskGB |
Worker Disk Type | workerDiskType |
Metadata | clusterMetaData |
Network Tags | networkTags |
Enable Secure Boot | secureBootEnabled |
Enable vTPM | vTpmEnabled |
Enable Integrity Monitoring | integrityMonitoringEnabled |
Image Version | imageVersion |
Custom Image URI | customImageUri |
GCS Bucket | gcsBucket |
Encryption Key Name | encryptionKeyName |
Autoscaling Policy | autoScalingPolicy |
Initialization Actions | initActions |
Cluster Properties | clusterProperties |
Labels | clusterLabels |
Max Idle Time | idleTTL |
Skip Cluster Delete | skipDelete |
Enable Stackdriver Logging Integration | stackdriverLoggingEnabled |
Enable Stackdriver Monitoring Integration | stackdriverMonitoringEnabled |
Enable Component Gateway | componentGatewayEnabled |
Prefer External IP | preferExternalIP |
Create Poll Delay | pollCreateDelay |
Create Poll Jitter | pollCreateJitter |
Delete Poll Delay | pollDeleteDelay |
Poll Interval | pollInterval |
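The JSON property names above appear in a profile's provisioner properties. The fragment below is a hypothetical sketch: the property names come from the table, but the overall envelope (the `gcp-dataproc` provisioner name and the `properties` list shape) and all values are assumptions for illustration:

```json
{
  "name": "dataproc-etl",
  "label": "Dataproc ETL",
  "description": "Profile for a statically sized Dataproc cluster",
  "provisioner": {
    "name": "gcp-dataproc",
    "properties": [
      { "name": "projectId", "value": "my-project" },
      { "name": "region", "value": "us-central1" },
      { "name": "masterNumNodes", "value": "1" },
      { "name": "workerNumNodes", "value": "2" },
      { "name": "workerMachineType", "value": "n2" },
      { "name": "idleTTL", "value": "30" },
      { "name": "stackdriverLoggingEnabled", "value": "true" }
    ]
  }
}
```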