...

Advanced Settings

Image Version

The Dataproc image version. If none is given, one is chosen automatically. If a custom image URI is specified, this field is ignored.

Custom Image URI

The Dataproc image URI. If the URI is not specified, it is inferred from the Image Version.
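
The precedence between Image Version and Custom Image URI described above can be sketched as follows. This is an illustration only: `resolve_image` and its return shape are hypothetical names, not part of the provisioner.

```python
from typing import Optional


def resolve_image(image_version: Optional[str],
                  custom_image_uri: Optional[str]) -> dict:
    """Report which image setting takes effect (hypothetical helper)."""
    if custom_image_uri:
        # A custom image URI takes precedence; the image version is ignored.
        return {"source": "custom_image_uri", "value": custom_image_uri}
    if image_version:
        # No custom URI: the image is inferred from the image version.
        return {"source": "image_version", "value": image_version}
    # Neither is set: an image version is chosen automatically.
    return {"source": "auto", "value": None}
```

Calling `resolve_image("2.1", None)` reports `image_version` as the source, while passing any non-empty URI as the second argument makes the version irrelevant.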

GCS Bucket

The Cloud Storage bucket used by Cloud Dataproc to read and write cluster and job data.

Encryption Key Name

The name of the GCP customer-managed encryption key (CMEK) used by Cloud Dataproc.

Autoscaling Policy

Specify the Autoscaling Policy ID (name) or the resource URI.

For information about configuring and using Dataproc autoscaling to automatically and dynamically resize clusters to meet workload demands, see the Autoscaling clusters guide.

Recommended: Use autoscaling policies for increasing the cluster size, not for decreasing the size. Decreasing the cluster size with autoscaling removes nodes that hold intermediate data, which might cause your pipelines to run slowly or fail.
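
Because the field accepts either a policy ID or a resource URI, a caller may want to normalize both forms to one. The sketch below assumes a resource path of the form `projects/<project>/locations/<region>/autoscalingPolicies/<id>`; this template and the helper name `autoscaling_policy_uri` are assumptions for illustration and should be checked against the Dataproc documentation.

```python
def autoscaling_policy_uri(policy: str, project: str, region: str) -> str:
    """Expand a bare policy ID into a resource path (hypothetical helper).

    The path template is an assumption, not taken from this page.
    """
    policy = policy.strip()
    if not policy:
        raise ValueError("policy must not be empty")
    if "/" in policy:
        # Already looks like a resource URI or path; pass it through unchanged.
        return policy
    return f"projects/{project}/locations/{region}/autoscalingPolicies/{policy}"
```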

Initialization Actions

A list of scripts to run while the cluster initializes. Initialization action scripts should be stored in Google Cloud Storage.

Cluster Properties

Cluster properties used to override default configuration properties for the Hadoop services. For example, the default Spark parallelism can be overridden by setting a value for spark:spark.default.parallelism. For more information, see Cluster properties.
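
Each property key has the form `prefix:property.name`, where the prefix selects the configuration being overridden (`spark` in the example above). A minimal sketch of that key format follows; the value `"200"` and the helper `split_cluster_property` are illustrative assumptions, not recommended settings or provisioner code.

```python
from typing import Tuple

# Example override map; the parallelism value is arbitrary.
cluster_properties = {
    "spark:spark.default.parallelism": "200",
}


def split_cluster_property(key: str) -> Tuple[str, str]:
    """Split 'prefix:property.name' into (prefix, property name)."""
    prefix, sep, name = key.partition(":")
    if not sep or not prefix or not name:
        raise ValueError(f"expected 'prefix:property', got {key!r}")
    return prefix, name


for key, value in cluster_properties.items():
    prefix, name = split_cluster_property(key)
    print(f"{prefix}: {name} = {value}")
```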

...

Whether to skip cluster deletion at the end of a run. If enabled, clusters must be deleted manually. Use this only when debugging a failed run.

Default is False.

Enable Stack Driver Logging Integration

...

Enable Component Gateway to allow access to cluster UIs like the YARN ResourceManager and Spark HistoryServer.

Default is False.

Prefer External IP

When the system is running on Google Cloud Platform in the same network as the cluster, it will normally use the internal IP when communicating with the cluster. Set to True to always use the external IP.

Default is False.

Polling Settings

Polling settings control how often cluster status should be polled when creating and deleting clusters. You may want to change these settings if you have a lot of pipelines scheduled to run at the same time using the same GCP account.
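
To illustrate why these settings matter when many pipelines start together, the sketch below computes when status checks would be issued for one cluster. Adding a random offset to the first check spreads requests from concurrent runs so they do not all hit the same GCP account at once. The parameter names (`initial_delay_s`, `interval_s`, `max_jitter_s`) are hypothetical and do not correspond to the provisioner's actual setting names.

```python
import random
from typing import List, Optional


def poll_schedule(initial_delay_s: float,
                  interval_s: float,
                  max_jitter_s: float,
                  polls: int,
                  rng: Optional[random.Random] = None) -> List[float]:
    """Return the offsets (in seconds) at which status would be polled."""
    rng = rng or random.Random()
    # The random offset applies once, to the first poll, so runs started
    # at the same moment drift apart instead of polling in lockstep.
    first = initial_delay_s + rng.uniform(0, max_jitter_s)
    return [first + i * interval_s for i in range(polls)]
```

With many pipelines scheduled at the same time, a larger jitter or interval lowers the peak request rate at the cost of noticing cluster state changes slightly later.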

...