Optimizing Joiner Performance

This section describes how to get the best performance from pipelines that join datasets with a Joiner.

Datasets with very large numbers of keys

Some datasets can have thousands or millions of key values. If one or more of your datasets has a large number of keys, there are a few things you can do to improve performance:

  • Use Wrangler to drop records with null values in key columns and to drop columns that are not needed in the join output schema. Add the Wrangler transformation to the pipeline and link it to the Joiner so that the Wrangler logic runs before the join. Reducing unnecessary records and columns before the join improves performance (a Spark-level sketch of this pre-join cleanup appears as the first example after this list).

  • Increase memory resources for executors to at least 8 GB. In the Pipeline Studio, click Configure > Resources and set Spark executor memory to at least 8 GB. You may need more than 8 GB if your data is heavily skewed, because all records with the same key are assigned to the same partition; if a partition does not fit in executor memory, Spark spills to disk during the shuffle and performance degrades.

  • Increase the number of partitions. The right number of partitions varies depending on your data. A common practice is to set the number of partitions to a macro, so that it can be adjusted after deployment or changed per run. For example, if you are joining 10 million records, you might need around 400 partitions.

  • Use Kryo serialization. Click Configure > Engine Config > Custom config. In the Name field, enter spark.serializer, and in the Value field, enter org.apache.spark.serializer.KryoSerializer. The second example after this list shows these engine settings expressed as Spark configuration.
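
Because the pipeline executes on Spark, the effect of the pre-join cleanup described in the first item can be pictured as ordinary DataFrame operations. The following Scala sketch is an illustration only, not part of the Pipeline Studio workflow; the input paths, column names (customer_id, name, order_total), and the join itself are hypothetical.

    import org.apache.spark.sql.SparkSession

    object PreJoinCleanupSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("pre-join-cleanup").getOrCreate()

        // Hypothetical inputs; in a pipeline these are the datasets feeding the Joiner.
        val customers = spark.read.parquet("/tmp/customers")
        val orders    = spark.read.parquet("/tmp/orders")

        // Drop records whose join key is null and keep only the columns needed in the
        // join output, which is what the Wrangler step does ahead of the Joiner.
        val slimCustomers = customers
          .filter(customers("customer_id").isNotNull)
          .select("customer_id", "name")

        val slimOrders = orders
          .filter(orders("customer_id").isNotNull)
          .select("customer_id", "order_total")

        // The join now shuffles far less data than it would with the raw inputs.
        val joined = slimCustomers.join(slimOrders, "customer_id")
        joined.show(10)

        spark.stop()
      }
    }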
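
The engine-level settings described above (executor memory, number of partitions, and Kryo serialization) can also be expressed directly as Spark properties. The sketch below only shows the property names and values; in Pipeline Studio you set them through Configure > Resources, the partitions setting (typically supplied as a macro), and Configure > Engine Config > Custom config, and the runtime passes them to Spark for you. The NUM_PARTITIONS environment variable stands in for a macro and is hypothetical, and spark.sql.shuffle.partitions is used on the assumption that the join runs as a DataFrame shuffle.

    import org.apache.spark.sql.SparkSession

    object JoinEngineConfigSketch {
      def main(args: Array[String]): Unit = {
        // Stand-in for a per-run macro; 400 matches the 10-million-record example above.
        val numPartitions = sys.env.getOrElse("NUM_PARTITIONS", "400")

        val spark = SparkSession.builder()
          .appName("join-engine-config")
          // At least 8 GB per executor, as set under Configure > Resources.
          .config("spark.executor.memory", "8g")
          // More partitions for the shuffle that backs the join.
          .config("spark.sql.shuffle.partitions", numPartitions)
          // The Custom config entry: spark.serializer = org.apache.spark.serializer.KryoSerializer.
          .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .getOrCreate()

        // Confirm the settings took effect for this session.
        println(spark.conf.get("spark.sql.shuffle.partitions"))
        println(spark.sparkContext.getConf.get("spark.serializer"))
        spark.stop()
      }
    }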
