Pipelines allow you to specify the CPUs and memory that should be given to the Spark driver and to each Spark executor.
Since the driver does not do much work, the default of 1 CPU and 2 GB memory is normally enough to run most pipelines. You may need to increase the memory for pipelines that contain many stages or large schemas. As mentioned in the join section, if the pipeline is performing in-memory joins, the in-memory dataset(s) also need to fit in the driver's memory.
The number of CPUs assigned to an executor determines the number of tasks the executor can run in parallel. Each partition of data requires one task to process. In most cases, it is simplest to set the number of CPUs to one and just concentrate on adjusting memory.
For most pipelines, 4 GB of executor memory is enough to successfully run the pipeline. Heavily skewed, multi terabyte joins have been completed with 4gb executors. It is possible to improve execution speed by increasing the amount of memory, but this requires a good understanding of both your data and your pipeline.
Spark divides memory into several groups. One section is reserved for Spark internal usage. The other is used for execution and storage.
By default, the storage and execution section is roughly 60% of the total memory. This percentage is controlled by Spark's spark.memory.fraction config property (defaults to 0.6). This performs well for most workloads and should not normally be adjusted.
The storage and execution section is further divided into a separate storage and separate execution section. By default, they are the same size, but can be adjusted by setting spark.memory.storageFraction (defaults to 0.5) to control what percentage of the space is reserved for storage. The diagram below illustrates the full picture with default settings.
The storage space is used to store cached data. The execution space is used to store shuffle, join, sort, and aggregation data. If there is extra space in the execution section, Spark can use some of it as storage data. However, execution data will never use any of the storage space.
If you know your pipeline is not caching any data, you can reduce the storage fraction to leave more room for execution requirements.
YARN Container Memory
The executor memory setting controls the amount of heap memory given to the executors. Spark adds an additional amount of memory for off-heap memory, which is controlled by the spark.executor.memoryOverhead setting, which defaults to 384m. This means the amount of memory YARN reserves for each executor will be higher than the number set in the pipeline resources configuration. For example, if you set executor memory to 2048m, Spark will add 384m to that number and request YARN for a 2432m container. On top of that, YARN will round the request number up to a multiple of yarn.scheduler.increment-allocation-mb, which defaults to the value of yarn.scheduler.minimum-allocation-mb. So if it is set to 512, YARN will round the 2432m up to 2560m. If the value is set to 1024, YARN will round up the 2432m to 3072m. This is useful to keep in mind when determining the size of each worker node in your cluster.