...

Each cluster has totals of 40 CPUs, 120 GB of memory, and 20 TB of disk. Each cluster executed the pipeline in roughly the same amount of time. Cost is roughly the same as well, since pricing is generally linear in the number of cores and the amount of memory and disk in use. For more information, see https://cloud.google.com/compute/vm-instance-pricing.
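
As an illustration only, here is one hypothetical worker layout that adds up to those totals, ignoring the master node. The cluster name, region, machine type, worker count, and disk size below are arbitrary choices for the sketch, not recommendations:

    gcloud dataproc clusters create example-cluster \
        --region=us-central1 \
        --num-workers=10 \
        --worker-machine-type=custom-4-12288 \
        --worker-boot-disk-size=2048GB

    # 10 workers x 4 vCPUs   = 40 CPUs
    # 10 workers x 12 GB RAM = 120 GB of memory
    # 10 workers x 2 TB disk = 20 TB of disk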

Autoscale

If you are running on Dataproc clusters, you can enable Dataproc autoscaling to automatically size your cluster depending on the workload. See https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling for more information about Dataproc autoscaling.

Autoscaling divides a cluster into two types of worker nodes: primary and secondary. Primary nodes run both HDFS (storage) and YARN (computation), while secondary nodes run only YARN (computation). Primary nodes should not be allowed to autoscale, because removing them causes problems with HDFS; only secondary nodes should be allowed to scale.
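
For example, a cluster can be created with a fixed number of primary workers while secondary workers are left to an autoscaling policy. The cluster name, region, counts, and policy name below are placeholders:

    gcloud dataproc clusters create example-cluster \
        --region=us-central1 \
        --num-workers=2 \
        --num-secondary-workers=0 \
        --autoscaling-policy=example-policy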

Scaling Down

It is simplest to use an autoscaling policy that allows the cluster to scale up, but never to scale down. Scaling down in the middle of a job will often cause Spark tasks to fail, resulting in pipeline delays or outright failure. A policy that never scales down matches well with the ephemeral nature of the Dataproc provisioner, which creates a cluster for each pipeline run and tears it down after the run. When possible, it is advised to use this type of policy.
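
A minimal sketch of such a scale-up-only policy, using the Dataproc autoscaling policy format described at the link above (the instance counts and factors are illustrative, not tuned recommendations):

    workerConfig:
      # Primary workers (HDFS + YARN) stay at a fixed size.
      minInstances: 2
      maxInstances: 2
    secondaryWorkerConfig:
      # Secondary workers (YARN only) may scale up as load increases.
      minInstances: 0
      maxInstances: 50
    basicAlgorithm:
      cooldownPeriod: 2m
      yarnConfig:
        scaleUpFactor: 0.05
        # A scaleDownFactor of 0.0 means the cluster never scales down.
        scaleDownFactor: 0.0
        gracefulDecommissionTimeout: 0s

A policy file like this can be registered with gcloud dataproc autoscaling-policies import and then attached to a cluster at creation time.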

If you are using a static cluster, you may want an autoscaling policy that does scale down, in order to reduce costs during periods of low activity. Scale-down policies are more complicated to configure: you will want to set the cooldown duration to a value large enough that the cluster does not scale down in the middle of a pipeline run. Alternatively, you can use Dataproc's Enhanced Flexibility Mode to make scaling down a safer operation.
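
As a sketch, a policy that does scale down would differ mainly in its basicAlgorithm block. The durations below are illustrative; set them comfortably longer than a typical pipeline run:

    basicAlgorithm:
      # Wait well beyond a typical pipeline run between scaling evaluations.
      cooldownPeriod: 30m
      yarnConfig:
        scaleUpFactor: 0.05
        # Allow idle secondary workers to be removed, but drain them
        # gracefully so that running tasks have a chance to finish.
        scaleDownFactor: 1.0
        gracefulDecommissionTimeout: 1h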

Enhanced Flexibility Mode (EFM)

EFM allows you to specify that only primary worker nodes be involved in shuffling data. Because secondary workers are no longer responsible for intermediate shuffle data, removing them from a cluster does not cause Spark jobs to run into delays or errors. Since primary workers are never scaled down, this makes cluster scale-down more stable and efficient. If you are running pipelines with shuffles on a static cluster, it is recommended that you use EFM.

See https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/flex for more information on EFM.
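
As a sketch (assuming the primary-worker shuffle mode described in the EFM documentation above; the cluster name and region are placeholders), EFM is enabled through a cluster property at creation time:

    gcloud dataproc clusters create efm-cluster \
        --region=us-central1 \
        --properties=dataproc:efm.spark.shuffle=primary-worker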