Proposal

Currently, CDAP's Spark engine uses the RDD API. We propose adding support for the Spark DataFrame/Dataset API for CDAP data processing.

Benefits of doing so:

Performance benefits of Spark DataFrames (Tungsten, filter pushdown, serialization, and GC, to name a few).
Converting between DataFrame/Dataset and RDD in each plugin is a major overhead in the pipeline runtime.
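
To make the conversion overhead concrete, here is a minimal sketch of what a plugin at an RDD boundary has to do when it wants to use DataFrame operations internally. It is an illustration under assumptions, not CDAP code: Spark's Row stands in for CDAP's StructuredRecord, and the object name, schema, and sample data are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical stand-in for a single pipeline plugin; not part of CDAP.
object ConversionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("conversion-sketch")
      .getOrCreate()

    // Assumed schema and data; Row is used here in place of StructuredRecord.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = false)
    ))
    val input: RDD[Row] =
      spark.sparkContext.parallelize(Seq(Row("alice", 34), Row("bob", 17)))

    // Step 1: the plugin receives an RDD and converts it to a DataFrame.
    val df: DataFrame = spark.createDataFrame(input, schema)

    // Step 2: the actual work, expressed with the DataFrame API.
    val adults: DataFrame = df.filter(col("age") >= 18)

    // Step 3: the plugin converts back to an RDD for the next stage.
    // The next plugin would repeat steps 1 and 3 around its own logic.
    val output: RDD[Row] = adults.rdd

    output.collect().foreach(println)
    spark.stop()
  }
}
```

If the stages exchanged DataFrames directly, steps 1 and 3 would disappear from every plugin, and Spark could optimize across stage boundaries instead of within each plugin in isolation.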

Option 1: Drop support for RDD<StructuredRecord> and move to Dataset<Row> (a DataFrame), with the necessary changes in CDAP core.

Option 2: Add a new engine, "Spark Dataframe", and implement it, with the necessary changes in CDAP core.