Cloud Dataproc is a Google Cloud Platform (GCP) service that manages Hadoop and Spark clusters in the cloud and can be used to create large clusters quickly. The Google Dataproc provisioner simply calls the Cloud Dataproc APIs to create and delete clusters in your GCP account. The provisioner exposes several configuration settings that control what type of cluster is created.
Version compatibility
Problem: The version of your CDAP environment might not be compatible with the version of your Dataproc cluster.
Recommended: Upgrade to the latest CDAP version and use one of the supported Dataproc versions.
Earlier versions of CDAP are only compatible with unsupported versions of Dataproc. Dataproc does not provide updates and support for clusters created with these versions. Although you can continue running a cluster that was created with an unsupported version, we recommend replacing it with one created with a supported version.
CDAP version | Dataproc version |
---|---|
6.10 | 2.1, 2.0 * |
6.9 | 2.1, 2.0, 1.5 * |
6.7, 6.8 | 2.0, 1.5 * |
6.4-6.6 | 2.0 *, 1.3 ** |
6.1-6.3 | 1.3 ** |
* CDAP versions 6.4 and later are compatible with supported versions of Dataproc. Unless specific OS features are needed, the recommended practice is to specify the major.minor image version.
** CDAP versions 6.1 to 6.6 are compatible with unsupported Dataproc version 1.3.
Best Practices
Configurations
Recommended: When you create a static cluster for your pipelines, use the following configurations.
Parameter | Description |
---|---|
| Retains YARN logs. |
| Enables YARN to check for physical memory limits and kill containers if they exceed physical memory. |
| Enables YARN to check for virtual memory limits and kill containers if they exceed virtual memory. |
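The property names in this table are omitted above; the following sketch assumes the standard YARN NodeManager settings that match these descriptions, written in Dataproc's `prefix:property=value` form (the 86400-second retention value is illustrative, not a stated recommendation):

```
yarn:yarn.nodemanager.delete.debug-delay-sec=86400
yarn:yarn.nodemanager.pmem-check-enabled=true
yarn:yarn.nodemanager.vmem-check-enabled=true
```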
Account Information
Project ID
...
The number of master nodes in your cluster. Master nodes run the YARN ResourceManager and the HDFS NameNode, and are the nodes that CDAP connects to when executing jobs. Must be set to either 1 or 3.
Default is 1.
Master Machine Type
The type of master machine to use. Select one of the following machine types:
n1
n2
n2d
e2
Default is e2 (CDAP 6.7.2 and later).
Default is n2 (CDAP 6.7.1).
Default is n1 (CDAP 6.7.0 and earlier).
Master Cores
The number of virtual cores to allocate to each master node.
Default is 2.
Master Memory (GB)
The amount of memory in gigabytes to allocate to each master node.
...
Standard Persistent Disk
SSD Persistent Disk
Default is Standard Persistent Disk.
Worker Nodes Settings
Number of Primary Workers
Worker nodes contain a YARN NodeManager and an HDFS DataNode.
...
Default is 2.
Number of Secondary Workers
Secondary worker nodes contain a YARN NodeManager, but not an HDFS DataNode. This is normally set to zero unless an autoscale policy requires it to be higher.
Worker Machine Type
The type of worker machine to use. Select one of the following machine types:
n1
n2
n2d
e2
Default is e2 (CDAP 6.7.2 and later).
Default is n2 (CDAP 6.7.1).
Default is n1 (CDAP 6.7.0 and earlier).
Worker Cores
The number of virtual cores to allocate for each worker node.
Default is 2.
Worker Memory (GB)
The amount of memory in gigabytes to allocate to each worker node.
Default is 8 GB.
Worker Disk Size (GB)
The disk size in gigabytes to allocate for each worker node.
...
Standard Persistent Disk
SSD Persistent Disk
Default is Standard Persistent Disk.
Number of Cluster Workers Settings
Enable Predefined Dataproc Autoscaling
Note: Enable Predefined Dataproc Autoscaling was introduced in CDAP 6.6.0.
Specifies whether to enable Dataproc autoscaling with a predefined autoscaling policy provided by CDAP.
When you enable predefined autoscaling:
The Number of Primary Workers, Number of Secondary Workers, and Autoscaling Policy properties are not considered.
The worker machine type and configuration are the same as the chosen profile's.
Turning off the Use Predefined Autoscaling toggle disables predefined autoscaling and restores the profile's original behavior.
You can also enable predefined autoscaling for a single pipeline run by setting the runtime argument system.profile.properties.enablePredefinedAutoScaling to true.
Number of Primary Workers
Worker nodes contain a YARN NodeManager and an HDFS DataNode.
Default is 2.
Number of Secondary Workers
Secondary worker nodes contain a YARN NodeManager, but not an HDFS DataNode. This is normally set to zero unless an autoscale policy requires it to be higher.
Autoscaling Policy
Specify the Autoscaling Policy ID (name) or the resource URI.
For information about configuring and using Dataproc autoscaling to automatically and dynamically resize clusters to meet workload demands, see the Autoscaling clusters guide.
Recommended: Use autoscaling policies for increasing the cluster size, not for decreasing the size. Decreasing the cluster size with autoscaling removes nodes that hold intermediate data, which might cause your pipelines to run slowly or fail.
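As a sketch of this recommendation, an autoscaling policy can add secondary workers without ever removing nodes by setting `scaleDownFactor` to 0. Field names follow the Dataproc `AutoscalingPolicy` resource; the instance counts and cooldown period below are illustrative:

```yaml
workerConfig:
  minInstances: 2
  maxInstances: 2          # keep primary workers fixed
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 10         # allow scale-up via secondary workers
basicAlgorithm:
  cooldownPeriod: 120s
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 0.0   # never remove nodes during a run
    gracefulDecommissionTimeout: 0s
```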
Cluster Metadata Settings
...
Assign Network tags to apply firewall rules to the specific nodes of a cluster. Network tags must start with a lowercase letter and can contain lowercase letters, numbers, and hyphens. Tags must end with a lowercase letter or number.
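These naming rules can be checked with a short regular expression. This is an illustrative helper, not part of CDAP; the 63-character cap reflects GCP's general limit on network tag length and is the only constraint not stated above:

```python
import re

# Start with a lowercase letter; contain only lowercase letters,
# numbers, and hyphens; end with a lowercase letter or number.
# 1 + 61 + 1 characters enforces the 63-character maximum.
TAG_PATTERN = re.compile(r"^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$")

def is_valid_network_tag(tag: str) -> bool:
    """Check a network tag against the naming rules above."""
    return bool(TAG_PATTERN.fullmatch(tag))
```

For example, `is_valid_network_tag("web-server-1")` passes, while `"Web-server"`, `"1server"`, and `"server-"` all fail.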
Shielded VMs
Note: Shielded VM settings were introduced in CDAP 6.5.0.
Enable Secure Boot
Defines whether the Dataproc VMs have Secure Boot enabled.
...
Dataproc image URI. If the URI is not specified, it will be inferred from the Image Version.
...
Staging Bucket
The Google Cloud Storage bucket used to stage job dependencies and config files for running pipelines in Cloud Dataproc.
Temp Bucket
Note: Temp Bucket config was introduced in CDAP 6.9.2.
The Google Cloud Storage bucket used to store ephemeral cluster and job data, such as Spark and MapReduce history files, in Cloud Dataproc.
Encryption Key Name
The GCP customer managed encryption key (CMEK) name used by Cloud Dataproc.
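A CMEK name is the fully qualified Cloud KMS key resource name; the project, location, key ring, and key below are placeholders:

```
projects/my-project/locations/us-central1/keyRings/my-keyring/cryptoKeys/my-key
```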
Autoscaling Policy
Specify the Autoscaling Policy ID (name) or the resource URI.
For information about configuring and using Dataproc autoscaling to automatically and dynamically resize clusters to meet workload demands, see the Autoscaling clusters guide.
...
OAuth Scopes
Note: OAuth Scopes config was introduced in CDAP 6.9.2.
The OAuth 2.0 scopes that you might need to request to access Google APIs, depending on the level of access you need. Google Cloud Platform Scope is always included.
Initialization Actions
A list of scripts to run during cluster initialization. Initialization actions must be stored in Google Cloud Storage.
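For illustration (the bucket and script names are hypothetical), initialization actions are referenced by their Cloud Storage URIs:

```
gs://my-bucket/init-actions/install-libs.sh
gs://my-bucket/init-actions/configure-proxy.sh
```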
...
Cluster properties used to override default configuration properties for the Hadoop services. The applicable key-value pairs can be found here.
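Dataproc expects these key-value pairs in a `prefix:property=value` form, where the prefix selects the configuration file the property belongs to (for example, `spark` for spark-defaults.conf, `hdfs` for hdfs-site.xml); the values below are illustrative:

```
spark:spark.executor.memory=4g
hdfs:dfs.replication=2
yarn:yarn.log-aggregation-enable=true
```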
Common Labels
Note: Labels were introduced in CDAP 6.5.0.
A label is a key-value pair that helps you organize your Cloud Dataproc clusters and jobs. You can attach a label to each resource, and then filter the resources based on their labels. Information about labels is forwarded to the billing system, so you can break down your billing charges by label.
Specifies labels for the Dataproc clusters and jobs being created.
Cluster Labels
Note: Labels (now Cluster Labels) were introduced in CDAP 6.5.0.
Specifies labels for the Dataproc cluster being created.
...
Configure Dataproc to delete the cluster if it has been idle for longer than this many minutes. Clusters are normally deleted directly after a run ends, but this deletion can fail in rare situations, for example, if permissions are revoked in the middle of a run or if there is a Dataproc outage. Use this setting to ensure that clusters are eventually deleted even if the instance is unable to delete the cluster for any reason.
Default is 30 minutes (starting in CDAP 6.6.0)
Skip Cluster Delete
Whether to skip cluster deletion at the end of a run. Clusters will need to be deleted manually. This should only be used when debugging a failed run.
...
The number of seconds to wait between polls for cluster status.
Default is 2.
Dataproc profile UI properties mapped to JSON properties
Dataproc profile UI property name | Dataproc profile JSON property name |
---|---|
Profile Label | label |
Profile Name | name |
Description | description |
Project ID | projectId |
Creator Service Account Key | accountKey |
Region | region |
Zone | zone |
Network | network |
Network Host Project ID | networkHostProjectId |
Subnet | subnet |
Runner Service Account | serviceAccount |
Number of masters | masterNumNodes |
Master Machine Type | masterMachineType |
Master Cores | masterCPUs |
Master Memory (GB) | masterMemoryMB |
Master Disk Size (GB) | masterDiskGB |
Master Disk Type | masterDiskType |
Number of Primary Workers | workerNumNodes |
Number of Secondary Workers | secondaryWorkerNumNodes |
Worker Machine Type | workerMachineType |
Worker Cores | workerCPUs |
Worker Memory (GB) | workerMemoryMB |
Worker Disk Size (GB) | workerDiskGB |
Worker Disk Type | workerDiskType |
Metadata | clusterMetaData |
Network Tags | networkTags |
Enable Secure Boot | secureBootEnabled |
Enable vTPM | vTpmEnabled |
Enable Integrity Monitoring | integrityMonitoringEnabled |
Image Version | imageVersion |
Custom Image URI | customImageUri |
GCS Bucket | gcsBucket |
Encryption Key Name | encryptionKeyName |
Autoscaling Policy | autoScalingPolicy |
Initialization Actions | initActions |
Cluster Properties | clusterProperties |
Labels | clusterLabels |
Max Idle Time | idleTTL |
Skip Cluster Delete | skipDelete |
Enable Stackdriver Logging Integration | stackdriverLoggingEnabled |
Enable Stackdriver Monitoring Integration | stackdriverMonitoringEnabled |
Enable Component Gateway | componentGatewayEnabled |
Prefer External IP | preferExternalIP |
Create Poll Delay | pollCreateDelay |
Create Poll Jitter | pollCreateJitter |
Delete Poll Delay | pollDeleteDelay |
Poll Interval | pollInterval |
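The JSON property names above appear in a profile's provisioner properties. The fragment below is a hypothetical sketch: the property names come from the table, but the overall envelope (the `gcp-dataproc` provisioner name and the `properties` list shape) and all values are assumptions for illustration:

```json
{
  "name": "dataproc-etl",
  "label": "Dataproc ETL",
  "description": "Profile for a statically sized Dataproc cluster",
  "provisioner": {
    "name": "gcp-dataproc",
    "properties": [
      { "name": "projectId", "value": "my-project" },
      { "name": "region", "value": "us-central1" },
      { "name": "masterNumNodes", "value": "1" },
      { "name": "workerNumNodes", "value": "2" },
      { "name": "workerMachineType", "value": "n2" },
      { "name": "idleTTL", "value": "30" },
      { "name": "stackdriverLoggingEnabled", "value": "true" }
    ]
  }
}
```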