Add native support for DataFrame APIs in data processing pipelines
Proposal
Currently, CDAP pipelines that run on the Spark engine use the RDD API. We propose adding support for the Spark DataFrame/Dataset API to CDAP data processing.
Benefits of doing so
- Performance gains from Spark DataFrames (Tungsten, filter pushdown, serialization, and GC, to name a few).
- No more converting between DataFrame/Dataset and RDD in each plugin, which is a major overhead in pipeline runtime.
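The conversion overhead in the second point can be sketched as follows. This is a minimal illustration under stated assumptions, not CDAP code: `Row` stands in for `StructuredRecord`, and the plugin functions (`filterAdultsRddPlugin`, `selectNameRddPlugin`, `filterAdults`, `selectName`) are hypothetical names invented for this example. Only the Spark calls themselves (`createDataFrame`, `filter`, `select`, `.rdd`) are real APIs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ConversionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("conversion-sketch")
      .getOrCreate()

    // Assumed example schema; Row is a stand-in for CDAP's StructuredRecord.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = false)))
    val nameSchema = StructType(Seq(StructField("name", StringType, nullable = false)))

    val input: RDD[Row] =
      spark.sparkContext.parallelize(Seq(Row("alice", 34), Row("bob", 17)))

    // RDD-based plugin contract (hypothetical): each plugin that wants
    // DataFrame operations must convert RDD -> DataFrame -> RDD itself.
    def filterAdultsRddPlugin(in: RDD[Row]): RDD[Row] =
      spark.createDataFrame(in, schema).filter(col("age") >= 18).rdd

    def selectNameRddPlugin(in: RDD[Row]): RDD[Row] =
      spark.createDataFrame(in, schema).select(col("name")).rdd

    // Two plugins, two round trips between RDD and DataFrame.
    val viaRdd: RDD[Row] = selectNameRddPlugin(filterAdultsRddPlugin(input))
    spark.createDataFrame(viaRdd, nameSchema).show()

    // DataFrame-based plugin contract (hypothetical): plugins exchange
    // DataFrames directly, so the whole pipeline is one optimizable plan.
    def filterAdults(df: DataFrame): DataFrame = df.filter(col("age") >= 18)
    def selectName(df: DataFrame): DataFrame = df.select(col("name"))

    // One conversion at the pipeline boundary, none between plugins.
    val viaDataFrame: DataFrame =
      selectName(filterAdults(spark.createDataFrame(input, schema)))
    viaDataFrame.show()

    spark.stop()
  }
}
```

Both variants produce the same rows. The difference is that in the RDD-based variant every plugin boundary forces the data out of Spark's internal row format and back in, whereas the DataFrame-based variant keeps the plugins in a single query plan.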
How to implement it?
Option 1: Drop support for RDD<StructuredRecord> and move to DataFrame (Dataset<Row>), with the necessary changes in CDAP core.
Option 2: Add a new engine, "Spark Dataframe", and implement it with the necessary changes in CDAP core.
Created in 2020 by Google Inc.