...

If your pipeline shuffles a lot of data, disk performance will make a difference. If you are using Dataproc, it is recommended that you use disk sizes of at least 1 TB, because disk performance scales with disk size. For information about disk performance, see https://cloud.google.com/compute/docs/disks/performance. To illustrate the performance difference, we executed a pipeline that joins 500 GB of data on clusters with different disk sizes, and therefore different disk speeds.
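As a minimal sketch of how you might provision such disks, the example below uses the google-cloud-dataproc Python client to create a cluster whose master and worker nodes each get a 1 TB boot disk. The project, region, cluster name, machine types, and worker count are placeholder assumptions; size the cluster to your own workload.

    from google.cloud import dataproc_v1

    project_id = "my-project"   # placeholder project
    region = "us-central1"      # placeholder region

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": project_id,
        "cluster_name": "pipeline-cluster",  # placeholder name
        "config": {
            "master_config": {
                "num_instances": 1,
                "machine_type_uri": "n1-standard-4",
                "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 1024},
            },
            "worker_config": {
                "num_instances": 10,
                "machine_type_uri": "n1-standard-4",
                # At least 1 TB per disk so shuffle-heavy stages get full disk throughput.
                "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 1024},
            },
        },
    }

    operation = client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    operation.result()  # block until the cluster is ready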

...

Each cluster has a total of 40 vCPUs, 120 GB of memory, and 20 TB of disk. Each cluster executed the pipeline in roughly the same amount of time. Cost is also roughly the same, because pricing is generally linear in the number of cores and the amount of memory and disk in use. For more information, see https://cloud.google.com/compute/vm-instance-pricing.
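As a rough illustration of that linearity, the sketch below compares two hypothetical cluster shapes that both add up to 40 vCPUs, 120 GB of memory, and 20 TB of disk. The per-unit prices are placeholders, not actual Compute Engine rates; real rates are on the pricing page linked above.

    def hourly_cost(workers, vcpus, mem_gb, disk_gb, vcpu_price, mem_price, disk_price):
        """Per-hour cost when pricing is linear in vCPUs, memory, and disk."""
        return workers * (vcpus * vcpu_price + mem_gb * mem_price + disk_gb * disk_price)

    # Placeholder unit prices (USD/hour), used only to show the shape of the math.
    P_CPU, P_MEM, P_DISK = 0.03, 0.004, 0.00005

    # 10 workers x (4 vCPU, 12 GB, 2 TB)  vs  5 workers x (8 vCPU, 24 GB, 4 TB)
    small_nodes = hourly_cost(10, 4, 12, 2048, P_CPU, P_MEM, P_DISK)
    large_nodes = hourly_cost(5, 8, 24, 4096, P_CPU, P_MEM, P_DISK)

    # Both shapes total 40 vCPUs, 120 GB, and 20 TB, so the cost comes out the same.
    assert abs(small_nodes - large_nodes) < 1e-9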

Autoscale

If you are running on Dataproc clusters, you can enable Dataproc autoscaling to automatically resize your cluster based on the workload. For more information, see https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling.
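As a sketch of what enabling autoscaling can look like programmatically, the example below uses the google-cloud-dataproc Python client to create a basic autoscaling policy; a cluster can then reference it through its autoscaling configuration. The policy id, instance bounds, scaling factors, and timeouts are placeholder values to tune for your workload.

    from google.cloud import dataproc_v1
    from google.protobuf import duration_pb2

    project_id = "my-project"   # placeholder project
    region = "us-central1"      # placeholder region

    client = dataproc_v1.AutoscalingPolicyServiceClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    policy = dataproc_v1.AutoscalingPolicy(
        id="pipeline-autoscaling-policy",  # placeholder policy id
        worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
            min_instances=2, max_instances=20
        ),
        basic_algorithm=dataproc_v1.BasicAutoscalingAlgorithm(
            yarn_config=dataproc_v1.BasicYarnAutoscalingConfig(
                scale_up_factor=0.5,
                scale_down_factor=0.5,
                graceful_decommission_timeout=duration_pb2.Duration(seconds=3600),
            ),
            cooldown_period=duration_pb2.Duration(seconds=600),
        ),
    )

    created = client.create_autoscaling_policy(
        parent=f"projects/{project_id}/regions/{region}", policy=policy
    )

    # A cluster opts in by setting config.autoscaling_config.policy_uri to
    # created.name when the cluster is created or updated.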

...