Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Data skew is an important characteristic to consider when implementing joins. A skewed join happens when a significant percentage of input records in one dataset have the same key and therefore join to the same record in the second dataset. This is problematic due to the way the execution framework handles joins. At a high level, all records with matching keys are grouped into a partition, these partitions are distributed across the nodes in a cluster to perform the join operation. In a skewed join, one or more of these partitions will be significantly larger than the rest, which will result in a majority of the workers in the cluster remaining idle while a couple workers process the large partitions. This results in poor performance since the cluster is being under utilized.

Plugin version: 2.9.0

Solution 1: In-Memory Join (Spark only)

...