Add native support for DataFrame APIs in data processing pipelines

Proposal

Currently, CDAP pipelines running on the Spark engine use the RDD API. We propose adding support for the Spark DataFrame/Dataset API for CDAP data processing.

Benefits of doing so

  • Performance benefits of Spark DataFrames (Tungsten, filter pushdown, serialization, and GC, to name a few)
  • Avoids the DataFrame/Dataset to and from RDD conversion in each plugin, which is a major overhead in pipeline runtime.
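
To make the second point concrete, here is a minimal sketch in plain Java that only counts conversions in a linear pipeline. It is a toy model, not CDAP or Spark code: the class ConversionCountSketch, the Stage type, and both counting methods are hypothetical names introduced for illustration. It assumes that a stage written against one API pays two conversions (one in, one out) whenever the engine hands it data in the other format.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model (hypothetical, not CDAP or Spark code): counts format conversions per pipeline run. */
final class ConversionCountSketch {

    /** Hypothetical stand-in for one pipeline stage (plugin). */
    static final class Stage {
        final String name;
        final boolean usesDataFrame; // true if the plugin logic is written against DataFrame/Dataset

        Stage(String name, boolean usesDataFrame) {
            this.name = name;
            this.usesDataFrame = usesDataFrame;
        }
    }

    /**
     * RDD-based engine: every stage receives and returns an RDD, so a stage that
     * works on a DataFrame internally converts on the way in and on the way out.
     */
    static int conversionsWithRddEngine(List<Stage> stages) {
        int conversions = 0;
        for (Stage stage : stages) {
            if (stage.usesDataFrame) {
                conversions += 2; // RDD -> DataFrame, then DataFrame -> RDD
            }
        }
        return conversions;
    }

    /**
     * DataFrame-based engine: stages exchange DataFrames directly, so only a stage
     * that still needs an RDD pays the two conversions.
     */
    static int conversionsWithDataFrameEngine(List<Stage> stages) {
        int conversions = 0;
        for (Stage stage : stages) {
            if (!stage.usesDataFrame) {
                conversions += 2; // DataFrame -> RDD, then RDD -> DataFrame
            }
        }
        return conversions;
    }

    public static void main(String[] args) {
        List<Stage> stages = Arrays.asList(
            new Stage("source", true),
            new Stage("filter", true),
            new Stage("legacy-transform", false),
            new Stage("sink", true));
        System.out.println("RDD engine conversions: " + conversionsWithRddEngine(stages));
        System.out.println("DataFrame engine conversions: " + conversionsWithDataFrameEngine(stages));
    }
}
```

For the four example stages in main (three DataFrame-based, one RDD-based), the model counts 6 conversions under the RDD-based engine and 2 under a DataFrame-based engine. It counts conversions only; the actual cost of each conversion depends on schema and data volume and would need to be measured.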

How to implement it?

Option 1: Drop support for RDD<StructuredRecord> and move to Dataset<Row> (Spark's DataFrame type), with the necessary changes in CDAP core.

Option 2: Add a new engine, "Spark Dataframe", and implement it, with the necessary changes in CDAP core.


Created in 2020 by Google Inc.