...

  • Use Wrangler to drop records with null values in key columns and to drop columns that are not needed in the join output schema. Add the Wrangler transformation to the pipeline and link it to a Joiner so that the Wrangler logic executes before the join. Reducing unnecessary records and columns before the join improves performance.

  • Increase memory resources for executors to at least 8 GB. In the Pipeline Studio, click Configure > Resources and set Spark executor memory to at least 8 GB. You may need more than 8 GB if your data is heavily skewed. Records that have the same key are assigned to the same partition, and that partition must fit in the memory of a single executor. If there is not enough executor memory to hold the whole partition, Spark spills to disk and sorts on disk during the shuffle, which degrades performance.
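As a rough illustration of the Wrangler tip above, here is the same pruning expressed in plain Python. The record and column names are hypothetical, not taken from any real pipeline; Wrangler performs the equivalent filtering with its own directives.

```python
# Hypothetical input records; "customer_id" is the join key in this sketch.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 10.0, "notes": "gift"},
    {"order_id": 2, "customer_id": None, "amount": 5.0,  "notes": ""},
    {"order_id": 3, "customer_id": "c2", "amount": 7.5,  "notes": "rush"},
]

KEY_COLUMN = "customer_id"
OUTPUT_COLUMNS = {"order_id", "customer_id", "amount"}  # join output schema

# Drop records with a null join key, then drop columns the join does not need.
pruned = [
    {col: record[col] for col in OUTPUT_COLUMNS}
    for record in orders
    if record[KEY_COLUMN] is not None
]
```

Everything the Joiner never sees is data it never has to shuffle, which is why this pruning belongs upstream of the join.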
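Why skew matters for the memory tip: during a shuffle, Spark routes each record by a hash of its key, so every record sharing a key lands in the same partition, on the same executor. A minimal sketch of that routing, assuming simple modulo hashing rather than Spark's actual partitioner:

```python
NUM_PARTITIONS = 4

def partition_for(key) -> int:
    # Same key -> same hash -> same partition, every time.
    return hash(key) % NUM_PARTITIONS

records = [("us", 1), ("us", 2), ("eu", 3), ("us", 4), ("us", 5)]

partitions = {}
for key, value in records:
    partitions.setdefault(partition_for(key), []).append((key, value))

# A hot key ("us") concentrates records in one partition; the executor that
# owns it must hold all of them in memory, or spill and sort on disk.
```

With a heavily skewed key, adding executors does not help the hot partition; only more memory per executor (or reducing the skew) does.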

...

 

  • Increase the number of partitions. The exact number varies depending on your data. A common practice is to set the number of partitions to a macro so that it can be adjusted as needed after deployment, or changed per run. For example, if you are joining 10 million records, you might need 400 partitions.

  • Use Kryo serialization. Click Configure > Engine Config > Custom config. In the Name field, add spark.serializer and in the Value field, add org.apache.spark.serializer.KryoSerializer:
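Expressed as a single custom config entry, the name/value pair above is the standard Spark property:

```
spark.serializer=org.apache.spark.serializer.KryoSerializer
```

Kryo is generally faster and produces more compact serialized data than Spark's default Java serialization, which reduces shuffle I/O during the join.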
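The sizing in the partition example is simple arithmetic: the goal is to keep each partition small enough to process in executor memory. With the numbers above:

```python
total_records = 10_000_000   # records being joined (from the example above)
num_partitions = 400         # partition count (from the example above)

records_per_partition = total_records // num_partitions
# About 25,000 records per partition, assuming an even key distribution;
# skewed keys will make some partitions much larger than this average.
```

This is why the partition count is worth exposing as a macro: as record volumes grow, the count can be raised per run without editing the pipeline.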

...