This topic provides a set of memory configuration recommendations for aggregation and join use cases.

Memory configurations

Number of Executors

Executors in Spark jobs control the parallelism. By default, Dataproc is configured for dynamic allocation, which scales the number of executors based on the workload.
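For reference, the sketch below shows the Spark properties that dynamic allocation corresponds to in a plain PySpark job. The minimum and maximum bounds are illustrative assumptions, not Dataproc defaults; Dataproc manages these properties for you when dynamic allocation is enabled.

```python
from pyspark.sql import SparkSession

# A minimal sketch of dynamic allocation settings; the numeric bounds
# below are illustrative assumptions, not Dataproc's defaults.
spark = (
    SparkSession.builder
    .appName("aggregation-pipeline")
    # Let Spark grow and shrink the executor count with the workload
    # instead of pinning a fixed number of executors.
    .config("spark.dynamicAllocation.enabled", "true")
    # Optional bounds on how far the executor count can scale.
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .getOrCreate()
)
```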

Number of Worker nodes

By default, the number of worker nodes is set to 2. To increase the parallelism for large workloads or for a pipeline with multiple Deduplicate, aggregate, or Joiner plugins, configure a Dataproc compute profile with a larger number of workers.
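If you manage clusters programmatically rather than through a compute profile, the following sketch shows one way to request a larger worker count with the Dataproc Python client. The project ID, region, cluster name, and machine types are hypothetical placeholders.

```python
from google.cloud import dataproc_v1

# Hypothetical identifiers for illustration only.
project_id = "my-project"
region = "us-central1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "etl-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        # A larger worker count increases the parallelism available to
        # shuffle-heavy plugins such as Joiner and Deduplicate.
        "worker_config": {"num_instances": 10, "machine_type_uri": "n1-standard-8"},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is created
```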

Number of Partitions

By default, the number of partitions is not set in the Joiner, Deduplicate, and aggregate plugins, which lets the underlying framework (Spark) determine the partitioning. If you change the number of partitions manually, ensure that the number of partitions is less than the number of executors (in the case of dynamic allocation, the number of containers per node).
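For intuition, the sketch below shows the analogous sizing in plain PySpark, assuming an illustrative executor count; the plugin property itself is set in the plugin configuration, not in code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative value; read the real executor count from the Spark UI
# or your compute profile rather than hard-coding it.
num_executors = 10

# Per the guidance above, keep a manually chosen partition count below
# the number of executors.
num_partitions = max(1, num_executors - 2)

# In plain Spark, shuffle-heavy operations (joins, aggregations,
# deduplication) fall back to this setting when no explicit partition
# count is supplied, mirroring the plugins' default behavior.
spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
```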