Google Dataproc
Cloud Dataproc is a Google Cloud Platform (GCP) service that manages Hadoop and Spark clusters in the cloud and can be used to create large clusters quickly. The Google Dataproc provisioner simply calls the Cloud Dataproc APIs to create and delete clusters in your GCP account. The provisioner exposes several configuration settings that control what type of cluster is created.
Version compatibility
Problem: The version of your CDAP environment might not be compatible with the version of your Dataproc cluster.
Recommended: Upgrade to the latest CDAP version and use one of the supported Dataproc versions.
Earlier versions of CDAP are only compatible with unsupported versions of Dataproc. Dataproc does not provide updates and support for clusters created with these versions. Although you can continue running a cluster that was created with an unsupported version, we recommend replacing it with one created with a supported version.
CDAP version | Dataproc version |
---|---|
6.10 | 2.1, 2.0 * |
6.9 | 2.1, 2.0, 1.5 * |
6.7, 6.8 | 2.0, 1.5 * |
6.4-6.6 | 2.0 *, 1.3 ** |
6.1-6.3 | 1.3 ** |
* CDAP versions 6.4 and later are compatible with supported versions of Dataproc. Unless specific OS features are needed, the recommended practice is to specify the major.minor image version.
** CDAP versions 6.1 to 6.6 are compatible with unsupported Dataproc version 1.3.
Best Practices
Configurations
Recommended: When you create a static cluster for your pipelines, use the following configurations.
Parameter | Description |
---|---|
yarn.nodemanager.delete.debug-delay-sec | Retains YARN logs. |
yarn.nodemanager.pmem-check-enabled | Enables YARN to check for physical memory limits and kill containers if they go beyond physical memory. |
yarn.nodemanager.vmem-check-enabled | Enables YARN to check for virtual memory limits and kill containers if they go beyond virtual memory. |
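One way to express these settings is as Dataproc cluster properties when the static cluster is created. A minimal sketch follows: the `yarn:` prefix is Dataproc's convention for targeting yarn-site.xml, and the values shown are illustrative assumptions, not mandated defaults.

```python
# Sketch: the recommended YARN settings expressed as Dataproc cluster
# properties. The "yarn:" prefix targets yarn-site.xml; the values below
# are illustrative assumptions, not mandated defaults.
static_cluster_properties = {
    "yarn:yarn.nodemanager.delete.debug-delay-sec": "86400",  # keep YARN logs for a day (assumed value)
    "yarn:yarn.nodemanager.pmem-check-enabled": "true",       # enforce physical memory limits
    "yarn:yarn.nodemanager.vmem-check-enabled": "true",       # enforce virtual memory limits
}
```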
Account Information
Project ID
A GCP project ID must be provided. This will be the project that the Cloud Dataproc cluster is created in. The project must have the Cloud Dataproc APIs enabled.
Creator Service Account Key
The service account key provided to the provisioner must have rights to access the Cloud Dataproc APIs and the Google Compute Engine APIs. Since your account key is sensitive, we recommend that you provide it through CDAP Secure Storage by adding a secure key with the Microservices. After you create the secure key, you can add it to a namespace or system compute profile. For a namespace compute profile, click the shield icon and select the secure key. For a system compute profile, type the name of the key in the Secure Account Key field to add the secure key to the compute profile.
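As a rough sketch, a secure key can be created with the CDAP Secure Storage microservices endpoint. The host, namespace, key name, and key file path below are hypothetical placeholders, and the request shape should be verified against your CDAP version's Microservices reference.

```python
import json
import urllib.request

# Sketch: store a creator service account key in CDAP Secure Storage via the
# Microservices. cdap_host, namespace, key_name, and the key file path are
# hypothetical placeholders; verify the endpoint against your CDAP version.
cdap_host = "http://cdap.example.com:11015"        # hypothetical router address
namespace = "default"
key_name = "dataproc-creator-sa-key"               # hypothetical key name

with open("service-account-key.json") as f:        # assumed local key file
    body = json.dumps({
        "description": "Creator service account key for Dataproc",
        "data": f.read(),
        "properties": {},
    }).encode()

request = urllib.request.Request(
    f"{cdap_host}/v3/namespaces/{namespace}/securekeys/{key_name}",
    data=body,
    method="PUT",
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(request)
```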
General Settings
Region
A region is a specific geographical location where you can host resources, like the compute nodes for your Cloud Dataproc cluster.
Zone
A zone is an isolated location within a region.
Network
This is the VPC network in your GCP project that will be used when creating a Cloud Dataproc cluster.
Network Host Project ID
Google Cloud Project ID, which uniquely identifies the project where the network resides. This can be left blank if the network resides in the same project as the one specified in Project ID. In the case of a shared VPC, this must be set to the host project ID where the network resides.
Subnet
Subnet to use when creating clusters. The subnet must be within the given network and it must be for the region that the zone is in. If this is left blank, a subnet will automatically be chosen based on the network and zone.
Runner Service Account
Name of the service account of the Dataproc virtual machines (VMs) that are used to run programs. If none is given, the default Compute Engine service account is used.
Master Nodes Settings
Size settings control the size of the cluster that is created.
Number of Masters
The number of master nodes to have in your cluster. Master nodes contain the YARN ResourceManager and the HDFS NameNode, and are the nodes that CDAP connects to when executing jobs. Must be set to either 1 or 3.
Default is 1.
Master Machine Type
The type of master machine to use. Select one of the following machine types:
n1
n2
n2d
e2
Default is e2 (CDAP 6.7.2 and later).
Default is n2 (CDAP 6.7.1).
Default is n1 (CDAP 6.7.0 and earlier).
Master Cores
The number of virtual cores to allocate to each master node.
Default is 2.
Master Memory
The amount of memory in gigabytes to allocate to each master node.
Default is 8 GB.
Master Disk Size (GB)
The disk size in gigabytes to allocate for the master node.
Default is 1000 GB.
Master Disk Type
Type of boot disk for the master node:
Standard Persistent Disk
SSD Persistent Disk
Default is Standard Persistent Disk.
Worker Nodes Settings
Worker Machine Type
The type of worker machine to use. Select one of the following machine types:
n1
n2
n2d
e2
Default is e2 (CDAP 6.7.2 and later).
Default is n2 (CDAP 6.7.1).
Default is n1 (CDAP 6.7.0 and earlier).
Worker Cores
The number of virtual cores to allocate for each worker node.
Default is 2.
Worker Memory (GB)
The amount of memory in gigabytes to allocate to each worker node.
Default is 8 GB.
Worker Disk Size (GB)
The disk size in gigabytes to allocate for each worker node.
Default is 1000 GB.
Worker Disk Type
Type of boot disk for the worker node:
Standard Persistent Disk
SSD Persistent Disk
Default is Standard Persistent Disk.
Number of Cluster Workers Settings
Enable Predefined Dataproc Autoscaling
Enable Predefined Dataproc Autoscaling was introduced in CDAP 6.6.0.
Defines whether to enable Dataproc autoscaling with a predefined autoscaling policy provided by CDAP.
When you enable predefined autoscaling:
The Number of Primary Workers, Number of Secondary Workers, and Autoscaling Policy properties are not considered.
The worker machine type and configuration are the same as the chosen profile's.
Turning off the Use predefined Autoscaling toggle disables predefined autoscaling and restores the original behavior of the profile.
You can also use a runtime argument to enable autoscaling for a pipeline run: set system.profile.properties.enablePredefinedAutoScaling = true, as in the sketch below.
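A minimal sketch of passing that runtime argument when starting a batch pipeline through the CDAP lifecycle microservices; the host, namespace, and pipeline name are hypothetical placeholders.

```python
import json
import urllib.request

# Sketch: enable predefined autoscaling for one run by passing the documented
# runtime argument when starting the pipeline's workflow. The host, namespace,
# and pipeline name are hypothetical; DataPipelineWorkflow is the workflow
# name CDAP uses for batch data pipelines.
runtime_args = {"system.profile.properties.enablePredefinedAutoScaling": "true"}

request = urllib.request.Request(
    "http://cdap.example.com:11015/v3/namespaces/default"
    "/apps/myPipeline/workflows/DataPipelineWorkflow/start",
    data=json.dumps(runtime_args).encode(),
    method="POST",
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(request)
```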
Number of Primary Workers
Worker nodes contain a YARN NodeManager and an HDFS DataNode.
Default is 2.
Number of Secondary Workers
Secondary worker nodes contain a YARN NodeManager, but not an HDFS DataNode. This is normally set to zero unless an autoscaling policy requires it to be higher.
Autoscaling Policy
Specify the Autoscaling Policy ID (name) or the resource URI.
For information about configuring and using Dataproc autoscaling to automatically and dynamically resize clusters to meet workload demands, see the Autoscaling clusters guide.
Recommended: Use autoscaling policies for increasing the cluster size, not for decreasing the size. Decreasing the cluster size with autoscaling removes nodes that hold intermediate data, which might cause your pipelines to run slowly or fail.
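For example, a policy that follows this recommendation sets scaleDownFactor to 0.0 so that autoscaling only ever adds nodes. The sketch below uses the field names of the Dataproc AutoscalingPolicy API; the policy ID and instance bounds are illustrative assumptions.

```python
# Sketch of a Dataproc AutoscalingPolicy that only scales up: scaleDownFactor
# 0.0 means autoscaling never removes nodes, so intermediate data is preserved.
# The policy ID and instance bounds are illustrative assumptions.
scale_up_only_policy = {
    "id": "scale-up-only",                     # hypothetical policy ID
    "basicAlgorithm": {
        "cooldownPeriod": "120s",
        "yarnConfig": {
            "scaleUpFactor": 1.0,              # add all capacity YARN requests
            "scaleDownFactor": 0.0,            # never remove nodes automatically
            "gracefulDecommissionTimeout": "0s",
        },
    },
    "workerConfig": {"minInstances": 2, "maxInstances": 2},            # primaries stay fixed
    "secondaryWorkerConfig": {"minInstances": 0, "maxInstances": 10},  # grow secondaries only
}
```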
Cluster Metadata Settings
Metadata
Additional metadata for instances that run in your cluster. You can typically use it for tracking billing and chargebacks. For more information, see Cluster metadata.
Network Tags
Assign Network tags to apply firewall rules to the specific nodes of a cluster. Network tags must start with a lowercase letter and can contain lowercase letters, numbers, and hyphens. Tags must end with a lowercase letter or number.
Shielded VMs
Shielded VM settings were introduced in CDAP 6.5.0.
Enable Secure Boot
Defines whether the Dataproc VMs have Secure Boot enabled.
Default is False.
Enable vTPM
Defines whether the Dataproc VMs have the virtual Trusted Platform Module (vTPM) enabled.
Default is False.
Enable Integrity Monitoring
Defines whether Dataproc VMs have integrity monitoring enabled.
Default is False.
Advanced Settings
Image Version
The Dataproc image version. If none is given, one is automatically chosen. If a custom image URI is specified, this field is ignored.
Custom Image URI
Dataproc image URI. If the URI is not specified, it will be inferred from the Image Version.
Staging Bucket
Google Cloud Storage bucket used to stage job dependencies and config files for running pipelines in Google Cloud Dataproc.
Temp Bucket
Note: Temp Bucket config was introduced in CDAP 6.9.2.
Google Cloud Storage bucket used to store ephemeral cluster and jobs data, such as Spark and MapReduce history files in Google Cloud Dataproc.
Encryption Key Name
The GCP customer managed encryption key (CMEK) name used by Cloud Dataproc.
OAuth Scopes
The OAuth 2.0 scopes that you might need to request to access Google APIs, depending on the level of access you need. Google Cloud Platform Scope is always included.
Initialization Actions
A list of scripts to be executed during initialization of the cluster. Initialization actions must be stored in Google Cloud Storage.
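A sketch of how such a list might look; the bucket and script names are hypothetical.

```python
# Sketch: initialization actions are Cloud Storage paths to scripts that run
# during cluster startup. Bucket and script names are hypothetical.
init_actions = [
    "gs://my-bucket/init/install-monitoring-agent.sh",
    "gs://my-bucket/init/configure-proxy.sh",
]
```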
Cluster Properties
Cluster properties used to override default configuration properties for the Hadoop services. The applicable key-value pairs can be found in the Dataproc cluster properties documentation.
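As an illustration, overrides use Dataproc's file-prefix convention (for example yarn:, spark:, hdfs:); the specific properties and values below are assumptions chosen for demonstration.

```python
# Sketch: Cluster Properties override Hadoop service defaults using the
# "<file_prefix>:<property>" convention. These properties and values are
# illustrative assumptions.
cluster_properties = {
    "hdfs:dfs.replication": "2",                 # written to hdfs-site.xml
    "spark:spark.task.maxFailures": "8",         # written to spark-defaults.conf
    "yarn:yarn.log-aggregation-enable": "true",  # written to yarn-site.xml
}
```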
Common Labels
A label is a key-value pair that helps you organize your Google Cloud Dataproc clusters and jobs. You can attach a label to each resource, and then filter the resources based on their labels. Information about labels is forwarded to the billing system, so you can break down your billing charges by label.
Specifies labels for the Dataproc clusters and jobs being created.
Cluster Labels
Specifies labels for the Dataproc cluster being created.
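A small sketch of labels; the keys and values are hypothetical examples.

```python
# Sketch: labels are lowercase key-value pairs used for filtering and billing
# breakdowns. Keys and values here are hypothetical.
common_labels = {
    "team": "data-platform",
    "env": "dev",
    "cost-center": "1234",
}
```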
Max Idle Time
Configure Dataproc to delete the cluster if it has been idle for longer than this many minutes. Clusters are normally deleted directly after a run ends, but this deletion can fail in rare situations, for example, if permissions are revoked in the middle of a run or if there is a Dataproc outage. Use this setting to ensure that clusters are eventually deleted even if the instance is unable to delete the cluster for any reason.
Default is 30 minutes (starting in CDAP 6.6.0).
Skip Cluster Delete
Whether to skip cluster deletion at the end of a run. Clusters will need to be deleted manually. This should only be used when debugging a failed run.
Default is False.
Enable Stackdriver Logging Integration
Enable or disable Stackdriver logging integration.
Default is True.
Enable Stackdriver Monitoring Integration
Enable or disable Stackdriver monitoring integration.
Default is True.
Enable Component Gateway
Enable Component Gateway to allow access to cluster UIs like the YARN ResourceManager and Spark HistoryServer.
Default is False.
Prefer External IP
When the system is running on Google Cloud Platform in the same network as the cluster, it will normally use the internal IP when communicating with the cluster. Set to True to always use the external IP.
Default is False.
Polling Settings
Polling settings control how often cluster status should be polled when creating and deleting clusters. You may want to change these settings if you have a lot of pipelines scheduled to run at the same time using the same GCP account.
Create Poll Delay
The number of seconds to wait after creating a cluster to begin polling to see if the cluster has been created.
Default is 60 seconds.
Create Poll Jitter
Maximum amount of random jitter in seconds to add to the create poll delay. This is used to prevent a lot of simultaneous API calls against your GCP account when you have a lot of pipelines that are scheduled to run at the exact same time.
Default is 20 seconds.
Delete Poll Delay
The number of seconds to wait after deleting a cluster to begin polling to see if the cluster has been deleted.
Default is 30 seconds.
Poll Interval
The number of seconds to wait in between polls for cluster status.
Default is 2 seconds.
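For example, a profile that serves many simultaneously scheduled pipelines might loosen the cadence. This sketch uses the JSON property names from the mapping table in the next section; the values are illustrative assumptions.

```python
# Sketch: loosened polling cadence for a profile with many concurrent runs.
# Property names come from the Dataproc profile JSON mapping; values are
# illustrative assumptions.
polling_overrides = {
    "pollCreateDelay": "90",    # wait 90 s before the first create-status poll
    "pollCreateJitter": "45",   # spread simultaneous API calls over 45 s
    "pollDeleteDelay": "30",
    "pollInterval": "5",
}
```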
Dataproc profile UI properties mapped to JSON properties
Dataproc profile UI property name | Dataproc profile JSON property name |
---|---|
Profile Label | label |
Profile Name | name |
Description | description |
Project ID | projectId |
Creator Service Account Key | accountKey |
Region | region |
Zone | zone |
Network | network |
Network Host Project ID | networkHostProjectId |
Subnet | subnet |
Runner Service Account | serviceAccount |
Number of masters | masterNumNodes |
Master Machine Type | masterMachineType |
Master Cores | masterCPUs |
Master Memory (GB) | masterMemoryMB |
Master Disk Size (GB) | masterDiskGB |
Master Disk Type | masterDiskType |
Number of Primary Workers | workerNumNodes |
Number of Secondary Workers | secondaryWorkerNumNodes |
Worker Machine Type | workerMachineType |
Worker Cores | workerCPUs |
Worker Memory (GB) | workerMemoryMB |
Worker Disk Size (GB) | workerDiskGB |
Worker Disk Type | workerDiskType |
Metadata | clusterMetaData |
Network Tags | networkTags |
Enable Secure Boot | secureBootEnabled |
Enable vTPM | vTpmEnabled |
Enable Integrity Monitoring | integrityMonitoringEnabled |
Image Version | imageVersion |
Custom Image URI | customImageUri |
GCS Bucket | gcsBucket |
Encryption Key Name | encryptionKeyName |
Autoscaling Policy | autoScalingPolicy |
Initialization Actions | initActions |
Cluster Properties | clusterProperties |
Labels | clusterLabels |
Max Idle Time | idleTTL |
Skip Cluster Delete | skipDelete |
Enable Stackdriver Logging Integration | stackdriverLoggingEnabled |
Enable Stackdriver Monitoring Integration | stackdriverMonitoringEnabled |
Enable Component Gateway | componentGatewayEnabled |
Prefer External IP | preferExternalIP |
Create Poll Delay | pollCreateDelay |
Create Poll Jitter | pollCreateJitter |
Delete Poll Delay | pollDeleteDelay |
Poll Interval | pollInterval |
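Putting the mapping to use, a compute profile body might look like the sketch below. The wrapper shape (a provisioner named gcp-dataproc with a name/value property list) follows CDAP's profile format, but verify it against your CDAP version; all values are hypothetical.

```python
# Sketch of a Dataproc compute profile using the JSON property names above.
# The provisioner name and the name/value property-list shape follow CDAP's
# profile format; verify against your CDAP version. All values are hypothetical.
profile = {
    "label": "Dataproc (example)",
    "description": "Example Dataproc compute profile",
    "provisioner": {
        "name": "gcp-dataproc",
        "properties": [
            {"name": "projectId", "value": "my-gcp-project"},  # hypothetical project
            {"name": "region", "value": "us-central1"},
            {"name": "zone", "value": "us-central1-a"},
            {"name": "masterNumNodes", "value": "1"},
            {"name": "workerNumNodes", "value": "2"},
            {"name": "idleTTL", "value": "30"},
        ],
    },
}
```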