Pipeline resource configurations

This article provides best practices for configuring pipeline resources for Spark pipelines.

General Tips

  • Driver resources

    • Set the driver memory to 8 GB
      Pipelines that load inputs into memory or have a large number of nodes (20 or more) need more driver memory. If the driver memory is set too low, the driver can crash, which produces the following error: “Malformed reply from SOCKS server”.

    • Set the vCores to 1.

  • Executor resources

    • Set the executor memory to a minimum of 4 GB (4096 MB).
      Increase this value for use cases that involve joins and aggregations with high cardinality.

    • Set vCores to 1.
      A note on cluster worker memory: set the cluster worker memory according to the executor resources to ensure full cluster utilization. To use all cores of a worker, the worker needs enough memory to hold all of its executors, with about 2 GB of extra memory per executor.

      For example, if workers have 8 vCores and use a default setting of 4 GB and 1 vCore per executor, each worker should have at least (4 + 2) * 8 = 48 GB of memory to utilize all cores.
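
The worker memory arithmetic above can be sketched as a small calculation. This is only an illustrative sketch: the function and parameter names are hypothetical and are not part of any pipeline API, and the 2 GB per-executor extra is the approximate figure from this article, not an exact value.

```python
def min_worker_memory_gb(worker_vcores, executor_memory_gb=4,
                         executor_vcores=1, extra_gb_per_executor=2):
    """Estimate the memory a worker needs to keep all of its cores busy.

    Hypothetical helper: one executor is assumed per `executor_vcores`
    cores, and each executor is assumed to need its configured memory
    plus about 2 GB extra, as described in this article.
    """
    executors = worker_vcores // executor_vcores
    return executors * (executor_memory_gb + extra_gb_per_executor)


def usable_executors(worker_memory_gb, worker_vcores, executor_memory_gb=4,
                     executor_vcores=1, extra_gb_per_executor=2):
    """Estimate how many executors fit on a worker, limited by cores or memory."""
    by_cores = worker_vcores // executor_vcores
    by_memory = worker_memory_gb // (executor_memory_gb + extra_gb_per_executor)
    return min(by_cores, by_memory)


# The example from this article: 8 vCores, 4 GB and 1 vCore per executor.
print(min_worker_memory_gb(8))   # (4 + 2) * 8 = 48
# A worker with only 32 GB cannot use all 8 cores under the same settings.
print(usable_executors(32, 8))   # 32 // 6 = 5
```

Under these assumptions, a worker with less memory than the first function returns leaves some cores idle, which is what the second function estimates.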

For more information, see Pipeline Performance Tuning.

Created in 2020 by Google Inc.