...

Problem: The version of your CDAP environment might not be compatible with the version of your Dataproc cluster.

Recommended: Upgrade to the latest CDAP version 6.4 or later and use one of the supported Dataproc versions.

CDAP versions before 6.4 are only compatible with unsupported versions of Dataproc. Dataproc does not provide updates and support for clusters created with these versions. Although you can continue running a cluster that was created with an unsupported version, we recommend replacing it with one created with a supported version.

CDAP version    Dataproc version

6.10            2.1, 2.0 *
6.9             2.1, 2.0, 1.5 *
6.7, 6.8        2.0, 1.5 *
6.4-6.6         2.0 *, 1.3 **
6.1-6.3         1.3 **

* CDAP versions 6.4 and later are compatible with supported versions of Dataproc. Unless specific OS features are needed, the recommended practice is to specify the major.minor image version.

** CDAP versions 6.1 to 6.6 are compatible with unsupported Dataproc version 1.3. You don't need additional components to make them compatible. CDAP uses HDFS and Spark.
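The compatibility matrix can also be expressed as a small lookup, which is handy when a script needs to pick an image version for a given CDAP release. The following is a minimal sketch, not a CDAP API: the `COMPATIBILITY` list and the `dataproc_versions_for` helper are hypothetical names, and the version data is copied from the table and footnotes in this section.

```python
# Hypothetical lookup built from the compatibility table; not part of CDAP.
# Each entry: (lowest CDAP major.minor, highest CDAP major.minor, Dataproc versions).
COMPATIBILITY = [
    ((6, 10), (6, 10), ["2.1", "2.0"]),
    ((6, 9), (6, 9), ["2.1", "2.0", "1.5"]),
    ((6, 7), (6, 8), ["2.0", "1.5"]),
    ((6, 4), (6, 6), ["2.0", "1.3"]),  # 1.3 is an unsupported Dataproc version
    ((6, 1), (6, 3), ["1.3"]),         # 1.3 is an unsupported Dataproc version
]


def dataproc_versions_for(cdap_version):
    """Return the Dataproc versions listed for a CDAP version such as '6.9.2'."""
    major, minor = (int(part) for part in cdap_version.split(".")[:2])
    for low, high, images in COMPATIBILITY:
        if low <= (major, minor) <= high:
            return images
    raise ValueError("No entry in the table for CDAP " + cdap_version)
```

For example, `dataproc_versions_for("6.9.2")` returns the row for CDAP 6.9. Versions that do not appear in the table raise an error rather than guessing.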

Best Practices

Configurations

Recommended: When you create a static cluster for your pipelines, use the following configurations.

Parameters

yarn.nodemanager.delete.debug-delay-sec

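Number of seconds that YARN retains an application's local files and logs on each node after the application finishes, so that they remain available for debugging.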
Recommended value: 86400 (equivalent to one day)

yarn.nodemanager.pmem-check-enabled

Enables YARN to check for physical memory limits and kill containers if they go beyond physical memory.
Recommended value: false

yarn.nodemanager.vmem-check-enabled

Enables YARN to check for virtual memory limits and kill containers if they go beyond virtual memory.
Recommended value: false
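When these settings are passed to Dataproc as cluster properties, each key is written in `prefix:property` form, where the `yarn` prefix targets `yarn-site.xml`. The following is a minimal sketch of building that form and a `--properties` argument for `gcloud dataproc clusters create`. The helper names `to_dataproc_properties` and `to_properties_flag` are hypothetical and exist only for illustration.

```python
# Recommended YARN settings from the list above.
YARN_SETTINGS = {
    "yarn.nodemanager.delete.debug-delay-sec": "86400",  # one day, in seconds
    "yarn.nodemanager.pmem-check-enabled": "false",
    "yarn.nodemanager.vmem-check-enabled": "false",
}


def to_dataproc_properties(settings, prefix="yarn"):
    """Add the Dataproc file prefix (here 'yarn' for yarn-site.xml) to each key."""
    return {prefix + ":" + key: value for key, value in settings.items()}


def to_properties_flag(properties):
    """Render the properties as a comma-separated --properties argument."""
    pairs = [key + "=" + value for key, value in sorted(properties.items())]
    return "--properties=" + ",".join(pairs)


if __name__ == "__main__":
    print(to_properties_flag(to_dataproc_properties(YARN_SETTINGS)))
```

The same prefixed keys can be entered as key-value pairs in the Cluster Properties setting of a Dataproc profile.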

Account Information

Project ID

...

The type of master machine to use. Select one of the following machine types:

  • n1

  • n2

  • n2d

  • e2

Default is e2 (6.7.2)

Default is n2 (6.7.1)

Default is n1 (6.7.0 and earlier)
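The version-dependent defaults can be summarized in a few lines of code. This is a sketch only: `default_machine_type` is a hypothetical helper, it expects purely numeric versions such as `6.7.1`, and treating releases after 6.7.2 as also defaulting to e2 is an assumption that the list above does not state.

```python
def default_machine_type(cdap_version):
    """Return the default machine type series for a numeric CDAP version string."""
    parts = tuple(int(part) for part in cdap_version.split("."))
    # Pad "6.7" to (6, 7, 0) and ignore anything beyond major.minor.patch.
    parts = (parts + (0,) * (3 - len(parts)))[:3]
    if parts >= (6, 7, 2):
        return "e2"  # stated for 6.7.2; assumed for later releases
    if parts == (6, 7, 1):
        return "n2"
    return "n1"      # 6.7.0 and earlier
```

The same defaults are listed for the worker machine type, so one helper covers both settings.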

Master Cores

The number of virtual cores to allocate to each master node.

...

The type of worker machine to use. Select one of the following machine types:

  • n1

  • n2

  • n2d

  • e2

Default is e2 (6.7.2)

Default is n2 (6.7.1)

Default is n1 (6.7.0 and earlier)

Worker Cores

The number of virtual cores to allocate for each worker node.

...

Dataproc image URI. If the URI is not specified, it will be inferred from the Image Version.

...

Staging Bucket

The Google Cloud Storage bucket used to stage job dependencies and config files for running pipelines in Google Cloud Dataproc.

Temp Bucket

Note: Temp Bucket config was introduced in CDAP 6.9.2.

Google Cloud Storage bucket used to store ephemeral cluster and jobs data, such as Spark and MapReduce history files in Google Cloud Dataproc.

Encryption Key Name

The GCP customer-managed encryption key (CMEK) name used by Cloud Dataproc.

OAuth Scopes

Note: OAuth Scopes config was introduced in CDAP 6.9.2.

The OAuth 2.0 scopes that you might need to request to access Google APIs, depending on the level of access you need. Google Cloud Platform Scope is always included.
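The "always included" behavior can be pictured as a merge of the requested scopes with the Google Cloud Platform scope. The sketch below only models that description. `effective_scopes` is a hypothetical helper, and how CDAP combines the scopes internally is an assumption.

```python
# The Google Cloud Platform scope URL.
CLOUD_PLATFORM_SCOPE = "https://www.googleapis.com/auth/cloud-platform"


def effective_scopes(requested):
    """Return the requested scopes with the Cloud Platform scope always present."""
    scopes = [CLOUD_PLATFORM_SCOPE]
    for scope in requested:
        scope = scope.strip()
        # Skip blanks and duplicates so each scope appears once.
        if scope and scope not in scopes:
            scopes.append(scope)
    return scopes
```

Passing an empty list yields only the Cloud Platform scope, which matches the default described above.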

Initialization Actions

A list of scripts to be executed during initialization of the cluster. Init actions should be placed on Google Cloud Storage.

...

Cluster properties used to override default configuration properties for the Hadoop services. The applicable key-value pairs can be found here.

Common Labels

Note: Labels were introduced in CDAP 6.5.0.

A label is a key-value pair that helps you organize your Google Cloud Dataproc clusters and jobs. You can attach a label to each resource, and then filter the resources based on their labels. Information about labels is forwarded to the billing system, so you can break down your billing charges by label.

Specifies labels for the Dataproc clusters and jobs being created.
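To illustrate filtering by label, the sketch below selects resources whose labels contain every key-value pair in a selector. The resource records, label names, and the `filter_by_labels` helper are all hypothetical. Real filtering is done through the Google Cloud console or APIs.

```python
def filter_by_labels(resources, selector):
    """Keep resources whose labels include every key-value pair in the selector."""
    return [
        resource
        for resource in resources
        if all(resource["labels"].get(key) == value for key, value in selector.items())
    ]


# Hypothetical clusters with hypothetical labels, for illustration only.
EXAMPLE_CLUSTERS = [
    {"name": "cluster-a", "labels": {"team": "analytics", "env": "prod"}},
    {"name": "cluster-b", "labels": {"team": "analytics", "env": "dev"}},
    {"name": "cluster-c", "labels": {"team": "ingest", "env": "prod"}},
]

if __name__ == "__main__":
    for cluster in filter_by_labels(EXAMPLE_CLUSTERS, {"team": "analytics"}):
        print(cluster["name"])
```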

Cluster Labels

Note: Labels (now Cluster Labels) were introduced in CDAP 6.5.0.

Specifies labels for the Dataproc cluster being created.

...

Dataproc profile UI properties mapped to JSON properties

Dataproc profile UI property name            Dataproc profile JSON property name

Profile Label                                name
Profile Name                                 label
Description                                  description
Project ID                                   projectId
Creator Service Account Key                  accountKey
Region                                       region
Zone                                         zone
Network                                      network
Network Host Project ID                      networkHostProjectId
Subnet                                       subnet
Runner Service Account                       serviceAccount
Number of masters                            masterNumNodes
Master Machine Type                          masterMachineType
Master Cores                                 masterCPUs
Master Memory (GB)                           masterMemoryMB
Master Disk Size (GB)                        masterDiskGB
Master Disk Type                             masterDiskType
Number of Primary Workers                    workerNumNodes
Number of Secondary Workers                  secondaryWorkerNumNodes
Worker Machine Type                          workerMachineType
Worker Cores                                 workerCPUs
Worker Memory (GB)                           workerMemoryMB
Worker Disk Size (GB)                        workerDiskGB
Worker Disk Type                             workerDiskType
Metadata                                     clusterMetaData
Network Tags                                 networkTags
Enable Secure Boot                           secureBootEnabled
Enable vTPM                                  vTpmEnabled
Enable Integrity Monitoring                  integrityMonitoringEnabled
Image Version                                imageVersion
Custom Image URI                             customImageUri
GCS Bucket                                   gcsBucket
Encryption Key Name                          encryptionKeyName
Autoscaling Policy                           autoScalingPolicy
Initialization Actions                       initActions
Cluster Properties                           clusterProperties
Labels                                       clusterLabels
Max Idle Time                                idleTTL
Skip Cluster Delete                          skipDelete
Enable Stackdriver Logging Integration       stackdriverLoggingEnabled
Enable Stackdriver Monitoring Integration    stackdriverMonitoringEnabled
Enable Component Gateway                     componentGatewayEnabled
Prefer External IP                           preferExternalIP
Create Poll Delay                            pollCreateDelay
Create Poll Jitter                           pollCreateJitter
Delete Poll Delay                            pollDeleteDelay
Poll Interval                                pollInterval
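As an illustration of how this mapping might be applied, the sketch below translates a few settings keyed by UI name into their JSON property names. It covers only a subset of the rows, `to_json_properties` is a hypothetical helper, and converting the memory fields from GB to MB by multiplying by 1024 is an assumption based on the unit suffixes in the names.

```python
# Subset of the UI-to-JSON name mapping, copied from the table.
UI_TO_JSON = {
    "Region": "region",
    "Zone": "zone",
    "Number of masters": "masterNumNodes",
    "Master Cores": "masterCPUs",
    "Master Memory (GB)": "masterMemoryMB",
    "Worker Cores": "workerCPUs",
    "Worker Memory (GB)": "workerMemoryMB",
    "Image Version": "imageVersion",
}

# UI fields shown in GB whose JSON property name ends in MB.
GB_TO_MB_FIELDS = {"Master Memory (GB)", "Worker Memory (GB)"}


def to_json_properties(ui_values):
    """Translate settings keyed by UI name into JSON property names."""
    result = {}
    for ui_name, value in ui_values.items():
        json_name = UI_TO_JSON[ui_name]  # raises KeyError for unmapped names
        if ui_name in GB_TO_MB_FIELDS:
            value = int(value * 1024)    # assumption: 1 GB = 1024 MB
        result[json_name] = value
    return result
```

For example, a UI value of 8 for Master Memory (GB) becomes `masterMemoryMB` with a value of 8192 under the stated assumption.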