...

As with any shuffle, data skew hurts join performance. In the small example above, eggs are purchased much more often than chicken or milk, so the executor joining egg purchases does more work than the other executors. If you notice that a join is skewed, there are two ways to improve performance.
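To make the effect of skew concrete, the following is a minimal sketch in plain Python (not CDAP or Spark code) that assigns records to partitions by hashing the join key and counts how many land in each partition. The purchase counts, the number of partitions, and the helper names `partition_for` and `partition_sizes` are made up for illustration.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition from a deterministic hash of the join key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_sizes(keys, num_partitions):
    """Count how many records a key-based shuffle sends to each partition."""
    sizes = {p: 0 for p in range(num_partitions)}
    for key in keys:
        sizes[partition_for(key, num_partitions)] += 1
    return sizes

# Hypothetical data: eggs are bought far more often than milk or chicken.
purchases = ["eggs"] * 8 + ["milk"] * 2 + ["chicken"] * 1

sizes = partition_sizes(purchases, num_partitions=3)
print(sizes)
```

Every record with the same key lands in the same partition, so the partition that receives `eggs` holds at least 8 of the 11 records no matter how the other keys hash.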

Automatic Splitting of Skewed Partitions

Starting with CDAP 6.4.0, adaptive query execution handles heavy skew automatically. When a join produces some partitions that are much bigger than others, the oversized partitions are split into smaller ones. See the Autotuning section above to confirm that adaptive query execution is enabled.
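The idea of splitting oversized partitions can be sketched in plain Python. This is an illustration only, not the engine's actual algorithm: the rule used here (a partition is skewed when it is larger than `skew_factor` times the median partition size) and the function name `split_skewed_partitions` are assumptions chosen for the example.

```python
import math
from statistics import median

def split_skewed_partitions(partitions, skew_factor=5.0):
    """Split any partition much larger than the median into median-sized chunks."""
    sizes = [len(p) for p in partitions]
    target = max(1, math.ceil(median(sizes)))  # chunk size for split partitions
    result = []
    for part in partitions:
        if len(part) > skew_factor * target:
            # Skewed: break it into consecutive chunks of at most `target` records.
            for start in range(0, len(part), target):
                result.append(part[start:start + target])
        else:
            # Not skewed: keep the partition as it is.
            result.append(part)
    return result

# Hypothetical partition contents: one partition is far larger than the rest.
partitions = [list(range(100)), list(range(10)), list(range(8)), list(range(12))]

balanced = split_skewed_partitions(partitions)
print([len(p) for p in balanced])
```

After splitting, no partition holds more than 12 records, so the work can be spread across more tasks. In a real join, each chunk would still need to be matched against the corresponding rows from the other side of the join; the sketch leaves that out.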

In-Memory Joins

In-memory joins were added in version 6.1.3. An in-memory join can be performed if one side of the join is small enough to fit in memory. In this situation, the small dataset is loaded into memory, and then broadcast to every executor. This means the large dataset is not shuffled at all, removing the uneven partitions generated when shuffling on the join key.
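The mechanism described above can be sketched in plain Python as a hash join against an in-memory lookup table. This is a simplified illustration, not CDAP's implementation: the function name `broadcast_join`, the field names, and the sample data are hypothetical, and the "broadcast" is modeled as a single dictionary that every partition reads.

```python
def broadcast_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against an in-memory copy of the small side."""
    # Build the lookup table once; in a cluster this is what gets sent to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for partition in large_partitions:
        # The large side stays in its existing partitions: no shuffle on the join key.
        out = []
        for row in partition:
            for match in lookup.get(row[key], []):
                out.append({**match, **row})
        joined.append(out)
    return joined

# Hypothetical small side: one row per item.
items = [
    {"item": "eggs", "price": 3.0},
    {"item": "milk", "price": 1.5},
    {"item": "chicken", "price": 7.0},
]

# Hypothetical large side: skewed toward eggs, spread evenly across 3 partitions.
names = ["eggs"] * 8 + ["milk"] * 2 + ["chicken"]
purchases = [{"purchase_id": i, "item": name} for i, name in enumerate(names)]
large_partitions = [purchases[i::3] for i in range(3)]

joined = broadcast_join(large_partitions, items, key="item")
print([len(p) for p in joined])
```

Because the large dataset keeps its original, evenly sized partitions, the popularity of `eggs` no longer concentrates work on one executor. The trade-off is that the small side must fit in memory on every executor.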

...
