Cloud Dataproc is a Google Cloud Platform (GCP) service that manages Hadoop and Spark clusters in the cloud and can be used to create large clusters quickly. The Google Dataproc provisioner simply calls the Cloud Dataproc APIs to create and delete clusters in your GCP account. The provisioner exposes several configuration settings that control what type of cluster is created.
Version compatibility
Problem: The version of your CDAP environment might not be compatible with the version of your Dataproc cluster.
Recommended: Upgrade to the latest CDAP version and use one of the supported Dataproc versions.
Earlier versions of CDAP are only compatible with unsupported versions of Dataproc. Dataproc does not provide updates and support for clusters created with these versions. Although you can continue running a cluster that was created with an unsupported version, we recommend replacing it with one created with a supported version.
CDAP version | Dataproc version |
---|---|
6.10 | 2.1, 2.0 * |
6.9 | 2.1, 2.0, 1.5 * |
6.7, 6.8 | 2.0, 1.5 * |
6.4-6.6 | 2.0 *, 1.3 ** |
6.1-6.3 | 1.3 ** |
* CDAP versions 6.4 and later are compatible with supported versions of Dataproc. Unless specific OS features are needed, the recommended practice is to specify the major.minor image version.
** CDAP versions 6.1 to 6.6 are compatible with unsupported Dataproc version 1.3.
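The compatibility table above can be expressed as a simple lookup. The following is a minimal sketch in Python, not part of CDAP. The names `COMPATIBLE_DATAPROC`, `compatible_dataproc_versions`, and `is_compatible` are hypothetical, and the data is transcribed from the table.

```python
from typing import List

# Compatibility data transcribed from the table above.
# Keys are (major, minor) CDAP versions; values are Dataproc image versions.
COMPATIBLE_DATAPROC = {
    (6, 10): ["2.1", "2.0"],
    (6, 9): ["2.1", "2.0", "1.5"],
    (6, 8): ["2.0", "1.5"],
    (6, 7): ["2.0", "1.5"],
    (6, 6): ["2.0", "1.3"],
    (6, 5): ["2.0", "1.3"],
    (6, 4): ["2.0", "1.3"],
    (6, 3): ["1.3"],
    (6, 2): ["1.3"],
    (6, 1): ["1.3"],
}


def compatible_dataproc_versions(cdap_version: str) -> List[str]:
    """Return the Dataproc versions listed for a CDAP version such as '6.9.2'."""
    # Only the major and minor components matter for the table lookup.
    major, minor = (int(part) for part in cdap_version.split(".")[:2])
    return COMPATIBLE_DATAPROC.get((major, minor), [])


def is_compatible(cdap_version: str, dataproc_version: str) -> bool:
    """Check a major.minor Dataproc version against the table."""
    return dataproc_version in compatible_dataproc_versions(cdap_version)
```

A CDAP version that is not in the table returns an empty list, so `is_compatible` reports `False` for it rather than guessing.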
Best Practices
Configurations
Recommended: When you create a static cluster for your pipelines, use the following configurations.
Parameters | |
---|---|
| Retains YARN logs. |
| Enables YARN to check for physical memory limits and kill containers if they go beyond physical memory. |
| Enables YARN to check for virtual memory limits and kill containers if they go beyond virtual memory. |
Account Information
Project ID
...
The type of master machine to use. Select one of the following machine types:
n1
n2
n2d
e2
Default is e2 (6.7.2)
Default is n2 (6.7.1)
Default is n1 (6.7.0 and earlier)
Master Cores
The number of virtual cores to allocate to each master node.
...
The type of worker machine to use. Select one of the following machine types:
n1
n2
n2d
e2
Default is e2 (6.7.2)
Default is n2 (6.7.1)
Default is n1 (6.7.0 and earlier)
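The version-dependent defaults listed for the master and worker machine types can be summarized in a small helper. This is an illustrative Python sketch only. The function name `default_machine_type` is hypothetical, and it assumes that releases after 6.7.2 keep the e2 default, which the list above does not state explicitly.

```python
def default_machine_type(cdap_version: str) -> str:
    """Return the default machine type family for a CDAP version string."""
    parts = tuple(int(part) for part in cdap_version.split("."))
    # Pad short versions such as "6.4" to three components: (6, 4, 0).
    version = (parts + (0, 0, 0))[:3]

    if version >= (6, 7, 2):
        # Assumption: versions later than 6.7.2 also default to e2.
        return "e2"
    if version == (6, 7, 1):
        return "n2"
    # 6.7.0 and earlier.
    return "n1"
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "6.10" would sort before "6.7".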
Worker Cores
The number of virtual cores to allocate for each worker node.
...
Dataproc image URI. If the URI is not specified, it will be inferred from the Image Version.
...
Staging Bucket
The Google Cloud Storage bucket used by Cloud Dataproc to stage job dependencies and config files for running pipelines.
Temp Bucket
Note: Temp Bucket config was introduced in CDAP 6.9.2.
Google Cloud Storage bucket used to store ephemeral cluster and jobs data, such as Spark and MapReduce history files in Google Cloud Dataproc.
Encryption Key Name
The GCP customer managed encryption key (CMEK) name used by Cloud Dataproc.
OAuth Scopes
Note: OAuth Scopes config was introduced in CDAP 6.9.2.
The OAuth 2.0 scopes that you might need to request to access Google APIs, depending on the level of access you need. Google Cloud Platform Scope is always included.
Initialization Actions
A list of scripts to be executed during initialization of the cluster. Init actions should be placed on Google Cloud Storage.
...
Cluster properties used to override default configuration properties for the Hadoop services. The applicable key-value pairs are listed in the Cloud Dataproc cluster properties documentation.
Common Labels
Note: Labels were introduced in CDAP 6.5.0.
A label is a key-value pair that helps you organize your Google Cloud Dataproc clusters and jobs. You can attach a label to each resource, and then filter the resources based on their labels. Information about labels is forwarded to the billing system, so you can break down your billing charges by label.
Specifies labels for the Dataproc clusters and jobs being created.
Cluster Labels
Note: Labels (now Cluster Labels) were introduced in CDAP 6.5.0.
Specifies labels for the Dataproc cluster being created.
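To illustrate how key-value labels support filtering, here is a minimal Python sketch. It does not call any Dataproc API. The `filter_by_labels` helper, the cluster names, and the label values are all hypothetical and exist only for this example.

```python
from typing import Dict, List


def filter_by_labels(resources: List[dict], required: Dict[str, str]) -> List[dict]:
    """Keep resources whose labels contain every required key-value pair."""
    return [
        resource
        for resource in resources
        if all(
            resource.get("labels", {}).get(key) == value
            for key, value in required.items()
        )
    ]


# Hypothetical clusters, for illustration only.
clusters = [
    {"name": "cluster-a", "labels": {"team": "analytics", "env": "prod"}},
    {"name": "cluster-b", "labels": {"team": "analytics", "env": "dev"}},
    {"name": "cluster-c", "labels": {"team": "ingest", "env": "prod"}},
]

# Select only the production clusters that belong to the analytics team.
prod_analytics = filter_by_labels(clusters, {"team": "analytics", "env": "prod"})
```

The same grouping by label is what makes a per-label breakdown of billing charges possible.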
...
Dataproc profile UI properties mapped to JSON properties
Dataproc profile UI property name | Dataproc profile JSON property name |
---|---|
Profile Label | label |
Profile Name | name |
Description | description |
Project ID | projectId |
Creator Service Account Key | accountKey |
Region | region |
Zone | zone |
Network | network |
Network Host Project ID | networkHostProjectId |
Subnet | subnet |
Runner Service Account | serviceAccount |
Number of masters | masterNumNodes |
Master Machine Type | masterMachineType |
Master Cores | masterCPUs |
Master Memory (GB) | masterMemoryMB |
Master Disk Size (GB) | masterDiskGB |
Master Disk Type | masterDiskType |
Number of Primary Workers | workerNumNodes |
Number of Secondary Workers | secondaryWorkerNumNodes |
Worker Machine Type | workerMachineType |
Worker Cores | workerCPUs |
Worker Memory (GB) | workerMemoryMB |
Worker Disk Size (GB) | workerDiskGB |
Worker Disk Type | workerDiskType |
Metadata | clusterMetaData |
Network Tags | networkTags |
Enable Secure Boot | secureBootEnabled |
Enable vTPM | vTpmEnabled |
Enable Integrity Monitoring | integrityMonitoringEnabled |
Image Version | imageVersion |
Custom Image URI | customImageUri |
GCS Bucket | gcsBucket |
Encryption Key Name | encryptionKeyName |
Autoscaling Policy | autoScalingPolicy |
Initialization Actions | initActions |
Cluster Properties | clusterProperties |
Labels | clusterLabels |
Max Idle Time | idleTTL |
Skip Cluster Delete | skipDelete |
Enable Stackdriver Logging Integration | stackdriverLoggingEnabled |
Enable Stackdriver Monitoring Integration | stackdriverMonitoringEnabled |
Enable Component Gateway | componentGatewayEnabled |
Prefer External IP | preferExternalIP |
Create Poll Delay | pollCreateDelay |
Create Poll Jitter | pollCreateJitter |
Delete Poll Delay | pollDeleteDelay |
Poll Interval | pollInterval |
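When a profile is written as JSON rather than configured in the UI, the mapping table can be applied mechanically. The sketch below, in Python, translates a handful of UI property names into their JSON names. The `UI_TO_JSON` dictionary copies only a subset of rows from the table, the `to_json_properties` helper is hypothetical, and the example values are invented. How these properties are nested inside a full profile JSON document is not covered here.

```python
from typing import Any, Dict

# Subset of the UI-to-JSON mapping, copied from the table above.
UI_TO_JSON = {
    "Region": "region",
    "Zone": "zone",
    "Master Machine Type": "masterMachineType",
    "Master Cores": "masterCPUs",
    "Worker Machine Type": "workerMachineType",
    "Worker Cores": "workerCPUs",
    "Image Version": "imageVersion",
    "Max Idle Time": "idleTTL",
    "Skip Cluster Delete": "skipDelete",
}


def to_json_properties(ui_settings: Dict[str, Any]) -> Dict[str, Any]:
    """Rename UI property names to their JSON property names."""
    unknown = sorted(set(ui_settings) - set(UI_TO_JSON))
    if unknown:
        # Fail loudly instead of silently dropping unmapped settings.
        raise KeyError(f"No JSON mapping for UI properties: {unknown}")
    return {UI_TO_JSON[name]: value for name, value in ui_settings.items()}


# Hypothetical settings, for illustration only.
json_properties = to_json_properties(
    {"Region": "us-east1", "Worker Cores": 4, "Image Version": "2.1"}
)
```

Raising on an unknown name is a deliberate choice in this sketch: a typo in a UI property name surfaces immediately rather than producing a profile that is missing a setting.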