...

If you are using Dataproc, the memory available for YARN containers will be roughly 75% of the VM memory. The minimum YARN container size is also adjusted depending on the size of the worker VMs. Some common worker sizes and their corresponding YARN settings are given in the table below.

| Worker CPU | Worker memory (gb) | YARN node memory (gb) | YARN min allocation memory (mb) |
| --- | --- | --- | --- |
| 1 | 4 | 3 | 256 |
| 2 | 8 | 6 | 512 |
| 4 | 16 | 12 | 1024 |
| 8 | 32 | 24 | 1024 |
| 16 | 64 | 51 | 1024 |
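
For machine sizes not listed above, a rough rule of thumb (an approximation only; Dataproc computes the exact values when the cluster is created) is that about 75% of the VM memory ends up available to YARN. A minimal sketch, with an illustrative function name:

```python
def approx_yarn_node_memory_gb(worker_memory_gb):
    """Rough estimate of the memory YARN can hand out on a Dataproc worker.

    Dataproc reserves memory for the OS and its own daemons, so only about
    75% of the VM memory is left for YARN containers. The exact figure
    depends on the machine type (see the table above).
    """
    return 0.75 * worker_memory_gb

print(approx_yarn_node_memory_gb(32))  # 24.0, matching the 8 cpu / 32gb row above
```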

Keep in mind that Spark will request more memory than the executor memory set for the pipeline, and that YARN will round the requested amount up. For example, suppose you have set your executor memory to 2048mb and have not given a value for spark.yarn.executor.memoryOverhead, which means the default of 384mb is used. Spark will therefore request 2048mb + 384mb for each executor, and YARN will round that up to an exact multiple of the YARN minimum allocation. When running on an 8gb worker node, where the YARN minimum allocation is 512mb, the request gets rounded up to 2.5gb. This means each worker can run two containers, using up all of the available CPUs but leaving 1gb of YARN memory (6gb - 2.5gb - 2.5gb) unused. In this case the worker node could be sized a little smaller, or the executors could be given a little more memory. When running on a 16gb worker node, where the YARN minimum allocation is 1024mb, 2048mb + 384mb gets rounded up to 3gb. This means each worker node can run four containers, with all CPUs and YARN memory in use.
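
If you want to check this arithmetic for other executor sizes, the sketch below reproduces the rounding described above. It is only an illustration: the function name is made up, and the default overhead is modeled as max(384mb, 10% of executor memory), which is how recent Spark versions compute spark.yarn.executor.memoryOverhead (the examples above use the 384mb floor).

```python
import math

def yarn_container_size_mb(executor_memory_mb, yarn_min_allocation_mb,
                           memory_overhead_mb=None):
    """Estimate the YARN container size Spark requests for one executor.

    Spark asks for executor memory plus spark.yarn.executor.memoryOverhead,
    and YARN rounds the request up to a multiple of its minimum allocation.
    """
    if memory_overhead_mb is None:
        # Recent Spark versions default the overhead to max(384mb, 10% of
        # executor memory); the examples in this section use the 384mb floor.
        memory_overhead_mb = max(384, int(0.10 * executor_memory_mb))
    requested_mb = executor_memory_mb + memory_overhead_mb
    return math.ceil(requested_mb / yarn_min_allocation_mb) * yarn_min_allocation_mb

# 8gb worker: 6gb of YARN memory, 512mb minimum allocation
print(yarn_container_size_mb(2048, 512))   # 2560 (2.5gb) -> two containers fit in 6gb
# 16gb worker: 12gb of YARN memory, 1024mb minimum allocation
print(yarn_container_size_mb(2048, 1024))  # 3072 (3gb) -> four containers fit in 12gb
```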

This can be rather confusing to put together, so below is a table of recommended worker sizes given some common executor sizes.

| Executor CPU | Executor Memory (mb) | Worker CPU | Worker Memory (gb) |
| --- | --- | --- | --- |
| 1 | 2048 | 4 | 15 |
| 1 | 3072 | 4 | 21 |
| 1 | 4096 | 4 | 26 |
| 2 | 8192 | 4 | 26 |

For example, a 26gb worker node translates to roughly 20gb of memory usable for running YARN containers. With executor memory set to 4gb, about 1gb is added for overhead and rounding, which means a 5gb YARN container for each executor. This means the worker can run four containers without any extra resources left over. You can also scale the worker size up proportionally. For example, if executor memory is set to 4096mb, a worker with 8 cpus and 52gb of memory would also work well.
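
To pick a worker size for an executor shape that is not in the table, you can run the same arithmetic in reverse. The sketch below is an approximation under the assumptions already described (384mb of overhead before rounding, roughly 75% of VM memory usable by YARN); the function name is illustrative, and the result can differ from the table by a gigabyte or so, as the 27gb output for the last table row shows.

```python
import math

def recommended_worker_memory_gb(executor_memory_mb, executors_per_worker,
                                 yarn_min_allocation_mb=1024,
                                 memory_overhead_mb=384,
                                 yarn_fraction=0.75):
    """Rough worker memory (gb) needed to pack N executor containers on one worker."""
    requested_mb = executor_memory_mb + memory_overhead_mb
    container_mb = math.ceil(requested_mb / yarn_min_allocation_mb) * yarn_min_allocation_mb
    yarn_needed_mb = container_mb * executors_per_worker
    # Only about 75% of the VM memory is available to YARN, so scale back up.
    return math.ceil(yarn_needed_mb / yarn_fraction / 1024)

print(recommended_worker_memory_gb(4096, 4))  # 27, close to the 26gb row in the table
```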

...

If your pipeline is shuffling a lot of data, disk performance will make a difference. If you are using Dataproc, it is recommended that you use disk sizes of at least 1tb, as disk performance scales up with disk size. For information about disk performance, see https://cloud.google.com/compute/docs/disks/performance. To give an idea of the performance difference, we executed a pipeline joining 500gb of data on clusters with different disk sizes, and therefore different disk speeds.

| Disk Size (gb) | Total Time | Source + Shuffle Write Time | Shuffle Read + Join Time |
| --- | --- | --- | --- |
| 256 | 119 min | 78 min | 37 min |
| 512 | 63 min | 45 min | 16 min |
| 1000 | 52 min | 42 min | 8 min |
| 2048 | 45 min | 35 min | 8 min |

The 256gb disks take around twice as much time as the 512gb disks. The 1000gb disks also perform significantly better than the 512gb disks, but the gains start to level off after that. This is because the disks are fast enough to no longer be the major bottleneck for the join.

For heavy shuffle pipelines, you may also see a performance boost from using local SSDs instead of HDDs, though you will need to decide whether the extra cost is worth it. To give an idea of the performance difference, we executed a pipeline joining 500gb of data on clusters with and without local SSDs. Each cluster had 10 workers, using n1-standard-4 machines (4 cpu, 15gb memory).

| Local SSDs | Total Time | Source + Shuffle Write Time | Shuffle Read + Join Time |
| --- | --- | --- | --- |
| 0 | 52 min | 42 min | 8 min |
| 4 (one per cpu) | 35 min | 26 min | 7 min |
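
If you create the cluster programmatically rather than through the UI, the boot disk size and local SSD count are part of the worker configuration. Below is a minimal sketch using the google-cloud-dataproc Python client; the project, region, and cluster name are placeholders, and the machine, disk, and SSD settings simply mirror the clusters benchmarked above.

```python
from google.cloud import dataproc_v1

# Placeholders: substitute your own project, region, and cluster name.
project, region, cluster_name = "my-project", "us-central1", "shuffle-heavy-cluster"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-standard-4",
        },
        "worker_config": {
            "num_instances": 10,
            "machine_type_uri": "n1-standard-4",  # 4 cpu, 15gb memory
            "disk_config": {
                "boot_disk_size_gb": 1000,        # at least 1tb for heavy shuffles
                "num_local_ssds": 4,              # optional: one local SSD per cpu
            },
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is ready
```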

Number of Workers

To minimize execution time, ensure that your cluster is large enough to run as much of the pipeline in parallel as possible. For example, if your pipeline source reads data using 100 splits, make sure the cluster is large enough to run 100 executors at once.
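
A rough way to size this is to divide the number of splits by the number of executors each worker can hold, which is limited by both CPUs and YARN memory. The sketch below is illustrative; the function name and the example numbers are assumptions drawn from the tables earlier in this document.

```python
import math

def workers_needed(num_splits, executor_cpu, worker_cpu,
                   container_mb, worker_yarn_mb):
    """Rough number of workers needed to run every source split at once."""
    executors_per_worker = min(worker_cpu // executor_cpu,
                               worker_yarn_mb // container_mb)
    return math.ceil(num_splits / executors_per_worker)

# 100 splits, 1-cpu executors in 3gb containers, 4 cpu workers with ~12gb for YARN
print(workers_needed(100, 1, 4, 3072, 12288))  # 25 workers
```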

...

One question people sometimes have is whether it is better to scale up the size of each worker or to add more worker machines. For example, is it better to have 50 nodes with 16gb of memory each, or 100 nodes with 8gb of memory each? The answer is that it normally does not matter much. The following table shows execution times for a pipeline that performs an evenly distributed 500gb join on a few different clusters. The overall cluster CPU, memory, and disk are the same; the only difference is the number of workers.

| Worker Nodes | Worker CPU | Worker Memory | Worker Disk | Total Time | Source + Shuffle Write Time | Shuffle Read + Join Time |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | 8 | 30gb | 4000gb | 42 min | 33 min | 8 min |
| 10 | 4 | 15gb | 2048gb | 45 min | 35 min | 8 min |
| 20 | 2 | 7.5gb | 1000gb | 44 min | 34 min | 7.5 min |

Each cluster has a total of 40 CPUs, 150gb of memory, and 20tb of disk, and each executed the pipeline in roughly the same amount of time. Cost is roughly the same as well, since pricing is generally linear in the number of cores and the amount of memory and disk in use. For more information, see https://cloud.google.com/compute/vm-instance-pricing.

...