Add native support for Dataframe APIs in data processing pipeline
Proposal
Currently, CDAP with Spark engine uses RDD APIs, we propose to add support for Spark Dataframe/Dataset API for CDAP data processing.
Benefits of doing so!
- Performance benefits of Spark data frame (tungsten, filter pushdowns, serialization, GC to name few)
- Data frame/Dataset to and from RDD conversion in each plugin is a major overhead in the pipeline runtime.
How to implement it?
Option 1: We drop support of RDD<StructuredRecord> and move to Dataframe<Row> with the necessary change in CDAP core.
Option 2: Add a new Engine "Spark Dataframe" and add the implementation for it with the necessary change in CDAP core.
Created in 2020 by Google Inc.