Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

In CDAP 6.1.3 and earlier, data is cached aggressively to try and prevent any type of reprocessing in the pipeline. Starting from CDAP 6.1.4, data is cached at fewer points, only to prevent sources from being read multiple times. This change was done because caching is often more expensive than re-computing a transformation. In the example above, this means 6.1.3 would cache the output of the Distinct plugin, while 6.1.4 would not, since Spark itself knows it can start from the intermediate shuffle data. However, it does mean that the metrics for the number of output records from the Distinct stage will be double the number of records processed on each branch, as the post-shuffle work is still computed twice.

...