Configuring a pipeline for aggregation and join use cases

This topic provides a set of memory configuration recommendations for aggregation and join use cases.

Memory configurations

  • Set system.resources.reserved.memory.override to 1024 (MB). This is needed because YARN can kill the container as a false positive for increased physical memory use: Java heap usage is capped, but permgen is not, and it grows with thread and classloader usage. So even if a Java process has its heap set to 16 GB, its physical memory usage can exceed that because of permgen. Setting the reserved memory to 1024 MB gives the process some headroom before YARN kills the container.

  • Set the Spark driver memory to 4 GB under Resources > Driver in the pipeline's Configure section; see the sketch after this list for setting both values as runtime arguments.
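
Both settings can also be supplied as runtime arguments when starting a deployed pipeline. Below is a minimal sketch using the CDAP lifecycle REST API from Python; the host, namespace, pipeline name, and the exact runtime-argument key for driver memory are assumptions to verify against your CDAP or Data Fusion version.

    import requests

    # Minimal sketch: start a deployed batch pipeline with memory-related
    # runtime arguments via the CDAP lifecycle REST API.
    CDAP_HOST = "http://localhost:11015"   # assumption: CDAP sandbox router endpoint
    PIPELINE = "agg_join_pipeline"         # assumption: name of the deployed pipeline

    runtime_args = {
        # Reserved (non-heap) memory in MB, per the recommendation above.
        "system.resources.reserved.memory.override": "1024",
        # Driver memory in MB; assumption: equivalent to Resources > Driver = 4 GB.
        "task.driver.system.resources.memory": "4096",
    }

    # Batch data pipelines run as the DataPipelineWorkflow program.
    resp = requests.post(
        f"{CDAP_HOST}/v3/namespaces/default/apps/{PIPELINE}"
        "/workflows/DataPipelineWorkflow/start",
        json=runtime_args,
    )
    resp.raise_for_status()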

Number of Executors

Executors control the parallelism of Spark jobs. By default, Dataproc is configured for dynamic allocation, which scales the number of executors based on the workload.
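
For reference, dynamic allocation is controlled by standard Spark properties. The sketch below sets them on a standalone SparkSession to show their effect; in a pipeline they would typically be set as custom engine properties instead, and the min/max values here are illustrative assumptions.

    from pyspark.sql import SparkSession

    # Sketch: the standard Spark properties behind dynamic allocation.
    # Values are illustrative assumptions, not recommendations.
    spark = (
        SparkSession.builder
        .appName("agg-join-example")
        .config("spark.dynamicAllocation.enabled", "true")
        # Dynamic allocation on YARN needs the external shuffle service,
        # which Dataproc enables by default.
        .config("spark.shuffle.service.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "20")
        .getOrCreate()
    )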

Number of Worker nodes

By default, the number of worker nodes is set to 2. To increase parallelism for large workloads, or for a pipeline with multiple Deduplicate, aggregate, or Joiner plugins, configure a Dataproc compute profile with a larger number of workers.
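
As one way to provision such a cluster outside the UI, here is a minimal sketch using the Dataproc Python client; the project, region, cluster name, machine type, and worker count are all assumptions.

    from google.cloud import dataproc_v1

    # Sketch: create a Dataproc cluster with a larger worker count.
    # All names and sizes below are illustrative assumptions.
    region = "us-central1"
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": "my-project",
        "cluster_name": "agg-join-cluster",
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            # More workers means more containers available for executors.
            "worker_config": {"num_instances": 10, "machine_type_uri": "n1-standard-4"},
        },
    }

    operation = client.create_cluster(
        request={"project_id": "my-project", "region": region, "cluster": cluster}
    )
    operation.result()  # block until the cluster is ready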

Number of Partitions

By default, the number of partitions is not set in the Joiner, Deduplicate, and aggregate plugins. This lets the underlying framework (Spark) determine the partitioning. If you change the number of partitions manually, ensure that the number of partitions is less than the number of executors (in the case of dynamic allocation, the number of containers per node).
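
To make this concrete, the sketch below shows where Spark's default comes from when the plugins leave partitions unset, and what a manual override looks like in PySpark; the dataset and partition count are illustrative assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitions-example").getOrCreate()

    # When the plugins leave partitions unset, Spark's shuffle default applies:
    print(spark.conf.get("spark.sql.shuffle.partitions"))  # Spark's default is 200

    df = spark.range(1_000_000)       # stand-in for a pipeline dataset (assumption)
    df = df.repartition(8)            # manual override; keep below your executor count,
                                      # per the recommendation above
    print(df.rdd.getNumPartitions())  # -> 8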
