...

If you are using Dataproc, the memory available for YARN containers will be roughly 75% of the VM memory. The minimum YARN container size is also adjusted depending on the size of the worker VMs. Some common worker sizes and their corresponding YARN settings are given in the table below.

| Worker CPU | Worker memory (gb) | YARN node memory (gb) | YARN min allocation memory (mb) |
| --- | --- | --- | --- |
| 1 | 4 | 3 | 256 |
| 2 | 8 | 6 | 512 |
| 4 | 16 | 12 | 1024 |
| 8 | 32 | 24 | 1024 |
| 16 | 64 | 51 | 1024 |
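
For machine sizes not listed above, a rough rule of thumb (an approximation only; Dataproc computes the exact values when the cluster is created) is that about 75% of the VM memory ends up available to YARN. A minimal sketch, with an illustrative function name:

```python
def approx_yarn_node_memory_gb(worker_memory_gb):
    """Rough estimate of the memory YARN can hand out on a Dataproc worker.

    Dataproc reserves memory for the OS and its own daemons, so only about
    75% of the VM memory is left for YARN containers. The exact figure
    depends on the machine type (see the table above).
    """
    return 0.75 * worker_memory_gb

print(approx_yarn_node_memory_gb(32))  # 24.0, matching the 8 cpu / 32gb row above
```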

Keep in mind that Spark will request more memory than the executor memory set for the pipeline, and that YARN will round the requested amount up. For example, suppose you have set your executor memory to 2048mb and have not given a value for spark.yarn.executor.memoryOverhead, which means the default of 384mb is used. Spark will therefore request 2048mb + 384mb for each executor, and YARN will round that up to an exact multiple of the YARN minimum allocation. When running on an 8gb worker node, where the YARN minimum allocation is 512mb, the request gets rounded up to 2.5gb. This means each worker can run two containers, using up all of the available CPUs but leaving 1gb of YARN memory (6gb - 2.5gb - 2.5gb) unused. In this case the worker node could be sized a little smaller, or the executors could be given a little more memory. When running on a 16gb worker node, where the YARN minimum allocation is 1024mb, 2048mb + 384mb gets rounded up to 3gb. This means each worker node can run four containers, with all CPUs and YARN memory in use.
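
If you want to check this arithmetic for other executor sizes, the sketch below reproduces the rounding described above. It is only an illustration: the function name is made up, and the default overhead is modeled as max(384mb, 10% of executor memory), which is how recent Spark versions compute spark.yarn.executor.memoryOverhead (the examples above use the 384mb floor).

```python
import math

def yarn_container_size_mb(executor_memory_mb, yarn_min_allocation_mb,
                           memory_overhead_mb=None):
    """Estimate the YARN container size Spark requests for one executor.

    Spark asks for executor memory plus spark.yarn.executor.memoryOverhead,
    and YARN rounds the request up to a multiple of its minimum allocation.
    """
    if memory_overhead_mb is None:
        # Recent Spark versions default the overhead to max(384mb, 10% of
        # executor memory); the examples in this section use the 384mb floor.
        memory_overhead_mb = max(384, int(0.10 * executor_memory_mb))
    requested_mb = executor_memory_mb + memory_overhead_mb
    return math.ceil(requested_mb / yarn_min_allocation_mb) * yarn_min_allocation_mb

# 8gb worker: 6gb of YARN memory, 512mb minimum allocation
print(yarn_container_size_mb(2048, 512))   # 2560 (2.5gb) -> two containers fit in 6gb
# 16gb worker: 12gb of YARN memory, 1024mb minimum allocation
print(yarn_container_size_mb(2048, 1024))  # 3072 (3gb) -> four containers fit in 12gb
```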

This can be rather confusing to put together, so below is a table of recommended worker sizes given some common executor sizes.

| Executor CPU | Executor Memory (mb) | Worker CPU | Worker Memory (gb) |
| --- | --- | --- | --- |
| 1 | 2048 | 4 | 15 |
| 1 | 3072 | 4 | 21 |
| 1 | 4096 | 4 | 26 |
| 2 | 8192 | 4 | 26 |

For example, a 26gb worker node translates to roughly 20gb of memory usable for running YARN containers. With executor memory set to 4gb, about 1gb is added for overhead and rounding, which means a 5gb YARN container for each executor. This means the worker can run four containers without any extra resources left over. You can also scale the worker size up proportionally. For example, if executor memory is set to 4096mb, a worker with 8 cpus and 52gb of memory would also work well.
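
To pick a worker size for an executor shape that is not in the table, you can run the same arithmetic in reverse. The sketch below is an approximation under the assumptions already described (384mb of overhead before rounding, roughly 75% of VM memory usable by YARN); the function name is illustrative, and the result can differ from the table by a gigabyte or so, as the 27gb output for the last table row shows.

```python
import math

def recommended_worker_memory_gb(executor_memory_mb, executors_per_worker,
                                 yarn_min_allocation_mb=1024,
                                 memory_overhead_mb=384,
                                 yarn_fraction=0.75):
    """Rough worker memory (gb) needed to pack N executor containers on one worker."""
    requested_mb = executor_memory_mb + memory_overhead_mb
    container_mb = math.ceil(requested_mb / yarn_min_allocation_mb) * yarn_min_allocation_mb
    yarn_needed_mb = container_mb * executors_per_worker
    # Only about 75% of the VM memory is available to YARN, so scale back up.
    return math.ceil(yarn_needed_mb / yarn_fraction / 1024)

print(recommended_worker_memory_gb(4096, 4))  # 27, close to the 26gb row in the table
```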

...

If your pipeline is shuffling a lot of data, disk performance will make a difference. If you are using Dataproc, it is recommended that you use disk sizes of at least 1tb, as disk performance scales up with disk size. For information about disk performance, see https://cloud.google.com/compute/docs/disks/performance. To give an idea of the performance difference, we executed a pipeline joining 500gb of data on clusters with different disk sizes, and therefore different disk speeds.

| Disk Size (gb) | Total Time | Source + Shuffle Write Time | Shuffle Read + Join Time |
| --- | --- | --- | --- |
| 256 | 119 min | 78 min | 37 min |
| 512 | 63 min | 45 min | 16 min |
| 1000 | 52 min | 42 min | 8 min |
| 2048 | 45 min | 35 min | 8 min |

The 256gb disks take around twice as much time as the 512gb disks. The 1000gb disks also perform significantly better than the 512gb disks, but the gains start to level off after that. This is because the disks are fast enough to no longer be the major bottleneck for the join.

For heavy shuffle pipelines, you may also see a performance boost from using local SSDs instead of HDDs, though you will need to decide whether the extra cost is worth it. To give an idea of the performance difference, we executed a pipeline joining 500gb of data on clusters with and without local SSDs. Each cluster had 10 workers, using n1-standard-4 machines (4 cpu, 15gb memory).

| Local SSDs | Total Time | Source + Shuffle Write Time | Shuffle Read + Join Time |
| --- | --- | --- | --- |
| 0 | 52 min | 42 min | 8 min |
| 4 (one per cpu) | 35 min | 26 min | 7 min |
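
If you create the cluster programmatically rather than through the UI, the boot disk size and local SSD count are part of the worker configuration. Below is a minimal sketch using the google-cloud-dataproc Python client; the project, region, and cluster name are placeholders, and the machine, disk, and SSD settings simply mirror the clusters benchmarked above.

```python
from google.cloud import dataproc_v1

# Placeholders: substitute your own project, region, and cluster name.
project, region, cluster_name = "my-project", "us-central1", "shuffle-heavy-cluster"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-standard-4",
        },
        "worker_config": {
            "num_instances": 10,
            "machine_type_uri": "n1-standard-4",  # 4 cpu, 15gb memory
            "disk_config": {
                "boot_disk_size_gb": 1000,        # at least 1tb for heavy shuffles
                "num_local_ssds": 4,              # optional: one local SSD per cpu
            },
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is ready
```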

Number of Workers

To minimize execution time, ensure that your cluster is large enough to run as much of the pipeline in parallel as possible. For example, if your pipeline source reads data using 100 splits, make sure the cluster is large enough to run 100 executors at once.
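
A rough way to size this is to divide the number of splits by the number of executors each worker can hold, which is limited by both CPUs and YARN memory. The sketch below is illustrative; the function name and the example numbers are assumptions drawn from the tables earlier in this document.

```python
import math

def workers_needed(num_splits, executor_cpu, worker_cpu,
                   container_mb, worker_yarn_mb):
    """Rough number of workers needed to run every source split at once."""
    executors_per_worker = min(worker_cpu // executor_cpu,
                               worker_yarn_mb // container_mb)
    return math.ceil(num_splits / executors_per_worker)

# 100 splits, 1-cpu executors in 3gb containers, 4 cpu workers with ~12gb for YARN
print(workers_needed(100, 1, 4, 3072, 12288))  # 25 workers
```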

...

One question people sometimes have is whether it is better to scale up the size of each worker or to add more worker machines. For example, is it better to have 50 nodes with 16gb of memory each, or 100 nodes with 8gb of memory each? The answer is that it normally does not matter much. The following table shows execution times for a pipeline that performs an evenly distributed 500gb join on a few different clusters. The overall cluster CPU, memory, and disk are the same; the only difference is the number of workers.

| Worker Nodes | Worker CPU | Worker Memory | Worker Disk | Total Time | Source + Shuffle Write Time | Shuffle Read + Join Time |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | 8 | 30gb | 4000gb | 42 min | 33 min | 8 min |
| 10 | 4 | 15gb | 2048gb | 45 min | 35 min | 8 min |
| 20 | 2 | 7.5gb | 1000gb | 44 min | 34 min | 7.5 min |

Each cluster has a total of 40 CPUs, 150gb of memory, and 20tb of disk, and each executed the pipeline in roughly the same amount of time. Cost is roughly the same as well, since pricing is generally linear in the number of cores and the amount of memory and disk in use. For more information, see https://cloud.google.com/compute/vm-instance-pricing.

...