Add native support for Dataframe APIs in data processing pipeline

Proposal

Currently, CDAP with the Spark engine uses the RDD APIs. We propose adding support for the Spark DataFrame/Dataset API in CDAP data processing pipelines.

Benefits of doing so

  • Performance benefits of Spark DataFrames (Tungsten, filter pushdown, serialization, and GC, to name a few).
  • Avoids the DataFrame/Dataset to and from RDD conversion in each plugin, which is a major overhead in the pipeline runtime.
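
The conversion overhead in the second point can be illustrated with a small counting model. This is a minimal sketch in plain Python, not CDAP or Spark code: the `Stage` class, the `count_conversions` function, and the stage names are hypothetical, and the model assumes that a plugin converts once on input and once on output whenever its internal API differs from the format in which stages exchange records. It counts conversions only; it does not measure runtime cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical pipeline stage (not a CDAP class)."""
    name: str
    internal_format: str  # "rdd" or "dataframe": the API the plugin logic is written against


def count_conversions(stages, exchange_format):
    """Count format conversions when stages exchange records in exchange_format.

    Assumption: a stage whose internal format differs from the exchange
    format converts once on input and once on output.
    """
    return sum(2 for stage in stages if stage.internal_format != exchange_format)


# Hypothetical three-plugin pipeline in which every plugin uses the DataFrame API.
pipeline = [
    Stage("parse", "dataframe"),
    Stage("filter", "dataframe"),
    Stage("aggregate", "dataframe"),
]

rdd_boundary = count_conversions(pipeline, "rdd")      # stages exchange RDDs
native = count_conversions(pipeline, "dataframe")      # stages exchange DataFrames
print(rdd_boundary, native)  # prints "6 0"
```

Under these assumptions the number of conversions grows linearly with the number of DataFrame-based plugins when stages exchange RDDs, and drops to zero when they exchange DataFrames directly.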

How to implement it?

Option 1: Drop support for RDD<StructuredRecord> and move to Dataset<Row> (a DataFrame), with the necessary changes in CDAP core.

Option 2: Add a new engine, "Spark Dataframe", and implement it with the necessary changes in CDAP core.
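
One way to compare the two options is to look at what happens to plugins already written against RDDs. The sketch below is a plain-Python model under stated assumptions, not CDAP code: the engine names `spark` and `spark_dataframe`, the registry layout, and all function names are hypothetical, and the model assumes a plugin can only run on an engine whose exchange format matches the API the plugin was written against.

```python
# Hypothetical registry mapping an engine name to the format in which
# its stages exchange records. Names and layout are illustrative only.
ENGINES_TODAY = {"spark": "rdd"}


def option_1(engines):
    """Option 1: the existing Spark engine switches from RDDs to DataFrames."""
    updated = dict(engines)
    updated["spark"] = "dataframe"
    return updated


def option_2(engines):
    """Option 2: keep the existing engine and register an additional one."""
    updated = dict(engines)
    updated["spark_dataframe"] = "dataframe"
    return updated


def runnable_engines(plugin_format, engines):
    """Engines whose exchange format matches the API a plugin was written against."""
    return sorted(name for name, fmt in engines.items() if fmt == plugin_format)


# An existing RDD-based plugin under each option.
print(runnable_engines("rdd", option_1(ENGINES_TODAY)))  # prints "[]"
print(runnable_engines("rdd", option_2(ENGINES_TODAY)))  # prints "['spark']"
```

In this model, Option 1 leaves a single engine to maintain but requires existing RDD-based plugins to be migrated, while Option 2 keeps them running at the cost of maintaining two engines.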


Created in 2020 by Google Inc.