Reusing Dataproc clusters

Starting in CDAP 6.5.0, you can reuse Dataproc clusters between runs to improve processing time.

For chains of more than three pipelines, cluster reuse can also reduce cost (see Cost savings considerations below).

Cluster reuse follows a model similar to connection pooling or thread pooling: a cluster is kept running for a specified time after a run finishes. When a new run starts, it looks for an idle cluster that matches the configuration of the compute profile. If one is found, the run uses it; otherwise, a new cluster is started.
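The pooling behavior described above can be pictured with a small in-memory model. This is only an illustrative sketch, not CDAP's implementation: the ClusterPool class, its method names, and the tuple used to represent profile settings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    name: str
    profile_config: tuple               # hypothetical stand-in for the compute profile settings
    idle_since: Optional[float] = None  # None while a run is in progress


class ClusterPool:
    """Toy model: reuse an idle cluster with a matching configuration, else start a new one."""

    def __init__(self, max_idle_seconds):
        self.max_idle_seconds = max_idle_seconds
        self.clusters = []
        self._counter = 0

    def acquire(self, profile_config, now):
        # Forget clusters that have been idle longer than the allowed idle time.
        self.clusters = [
            c for c in self.clusters
            if c.idle_since is None or now - c.idle_since <= self.max_idle_seconds
        ]
        # Reuse an idle cluster only if its configuration matches exactly.
        for c in self.clusters:
            if c.idle_since is not None and c.profile_config == profile_config:
                c.idle_since = None
                return c
        # No match: start a new cluster.
        self._counter += 1
        cluster = Cluster(f"cluster-{self._counter}", profile_config)
        self.clusters.append(cluster)
        return cluster

    def release(self, cluster, now):
        # The run finished; the cluster stays up and becomes idle.
        cluster.idle_since = now


pool = ClusterPool(max_idle_seconds=1800)
config = (("workers", 2),)
first = pool.acquire(config, now=0)
pool.release(first, now=600)
second = pool.acquire(config, now=900)  # idle and matching, so it is reused
third = pool.acquire(config, now=950)   # the first cluster is busy, so a new one starts
print(second.name, third.name)
```

In this model a busy cluster is never handed out, which mirrors the rule that a cluster serves a single pipeline run at a time.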

Frequently Asked Questions

Are clusters shared?

Clusters are not shared. Similar to the regular ephemeral cluster provisioning model, a cluster runs a single pipeline run at a time. A cluster is reused only if it is idle.

What if I have many parallel pipeline runs?

If you enable cluster reuse for all your runs, as many clusters as are needed to process those runs are created on demand. As with the ephemeral Dataproc provisioner, there is no direct control over the number of clusters created. You can still use Google Cloud quotas to manage resources.

For example, if you execute 100 runs with at most 7 running in parallel, you will have up to 7 clusters at any given time.
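The example above can be checked with a short simulation. This is a sketch under simplifying assumptions that are not from the source: every run takes the same time, a new run starts as soon as a parallel slot frees up, and idle clusters do not expire during the simulation. The function name peak_clusters is hypothetical.

```python
import heapq


def peak_clusters(num_runs, max_parallel, run_seconds=600):
    """Return the largest number of clusters alive at once in this toy model."""
    idle = 0    # idle clusters waiting to be reused
    busy = []   # min-heap of finish times, one entry per running run
    now = 0.0
    peak = 0
    for _ in range(num_runs):
        if len(busy) == max_parallel:
            # Wait for the earliest run to finish; its cluster becomes idle.
            now = heapq.heappop(busy)
            idle += 1
        if idle > 0:
            idle -= 1  # reuse an idle cluster
        # Otherwise a new cluster is started for this run.
        heapq.heappush(busy, now + run_seconds)
        peak = max(peak, len(busy) + idle)
    return peak


print(peak_clusters(100, 7))  # 7
print(peak_clusters(3, 7))    # 3
```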

Will clusters be reused across different pipelines and pipeline runs?

Clusters are reused between different pipelines as long as those pipelines use the same profile and share the same profile settings.

If profile customization is used, clusters will still be reused, but only if customizations are exactly the same, including all cluster settings like cluster labeling.
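To make the "exactly the same" rule concrete, the sketch below compares two sets of effective settings. The setting names (workerNumNodes, labels) and the can_reuse helper are hypothetical placeholders; the actual comparison CDAP performs is not described in this document.

```python
def can_reuse(cluster_settings, run_settings):
    """Toy rule: a cluster qualifies only if every effective setting is identical."""
    return cluster_settings == run_settings


# Hypothetical effective settings (profile plus customizations) of an idle cluster.
idle_cluster = {"profile": "dataproc-small", "workerNumNodes": "2", "labels": {"team": "etl"}}

same_settings = {"profile": "dataproc-small", "workerNumNodes": "2", "labels": {"team": "etl"}}
other_labels = {"profile": "dataproc-small", "workerNumNodes": "2", "labels": {"team": "ml"}}

print(can_reuse(idle_cluster, same_settings))  # True
print(can_reuse(idle_cluster, other_labels))   # False: only the label differs
```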

Cost savings considerations

When cluster reuse is enabled, there are two main cost considerations:

  1. Fewer resources are used for cluster startup and initialization.

  2. More resources are used by clusters that sit idle between pipeline runs and after the last pipeline run.

While it’s hard to predict the cost effect of cluster reuse, you can employ a strategy to get maximum savings: identify the critical path of your chained pipelines and enable cluster reuse for that path. This ensures that the cluster is reused immediately, no idle time is wasted, and the performance benefit is maximized.
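The trade-off between the two considerations can be sketched as simple arithmetic over cluster-minutes. The model and all numbers below are made up for illustration (a 2-minute startup, 10-minute runs, no gap between chained runs, a 5-minute Max Idle Time); they are not measurements, and real costs depend on cluster size and pricing.

```python
def cluster_minutes(n_pipelines, run_min, startup_min, gap_min, max_idle_min, reuse):
    """Total minutes clusters are alive for a chain of pipelines run one after another."""
    if not reuse:
        # Every run pays for its own cluster startup.
        return n_pipelines * (startup_min + run_min)
    # One startup, then idle gaps between runs and a final idle period before deletion.
    return (startup_min + n_pipelines * run_min
            + (n_pipelines - 1) * gap_min + max_idle_min)


for n in (3, 4):
    without = cluster_minutes(n, run_min=10, startup_min=2, gap_min=0, max_idle_min=5, reuse=False)
    with_reuse = cluster_minutes(n, run_min=10, startup_min=2, gap_min=0, max_idle_min=5, reuse=True)
    print(n, without, with_reuse)
# 3 36 37
# 4 48 47
```

With these assumed numbers, reuse starts to pay off at four chained pipelines, because the startups saved outweigh the final idle period. Longer gaps between runs or a larger Max Idle Time shift the break-even point upward.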

How to enable cluster reuse

To enable cluster reuse, you must complete both Step 1 and Step 2. If you only complete Step 1, the cluster continues running after the pipeline run is finished, but it will not be reused.

Step 1. Set Max Idle Time and Skip Cluster Delete in the compute profile

This can be done when creating the compute profile:

Or with compute profile customization in the Compute Config section of deployed pipeline configuration:

For Max Idle Time, weigh cost against the availability of clusters for reuse. The higher the value of Max Idle Time, the more clusters will sit idle, ready for a run.

Step 2. Set system.profile.properties.clusterReuseEnabled runtime argument to true
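As a minimal sketch, the runtime arguments for a run can be thought of as a map of string keys to string values. Only the clusterReuseEnabled key below comes from this page; the with_cluster_reuse helper and the input.path argument are hypothetical, and how the arguments are submitted (UI, schedule, or API) is not shown.

```python
import json

REUSE_KEY = "system.profile.properties.clusterReuseEnabled"


def with_cluster_reuse(runtime_args):
    """Return a copy of the runtime arguments with cluster reuse switched on."""
    merged = dict(runtime_args)
    merged[REUSE_KEY] = "true"  # assumption: the value is passed as a string
    return merged


# "input.path" is a made-up pipeline argument used only for illustration.
args = with_cluster_reuse({"input.path": "gs://example-bucket/in"})
print(json.dumps(args, indent=2))
```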