...

The general progression in a pipeline is:

  1. Pre-run operations: any actions required before the pipeline can actually run, such as preparing resources

  2. Data acquisition: obtaining data from a source or sources

  3. Data transformation: manipulating the data acquired from the sources

  4. Data publishing: saving the results of the transformation, either as additional data to a data sink or to a report

  5. Post-run operations: any actions required once the pipeline run has completed, such as emailing notifications or cleaning up resources, regardless of whether the pipeline run succeeded or failed

Different plugins are available to provide functionality for each stage.
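
As a rough illustration of this progression, the following Python sketch runs the five stages in order. It is not CDAP code: run_pipeline and all of its parameters are hypothetical names chosen for this example, and each stage is reduced to a plain callable. The point it shows is that post-run operations execute whether the run succeeds or fails.

```python
from typing import Callable, Iterable, List


def run_pipeline(
    pre_run: Callable[[], None],
    acquire: Callable[[], Iterable[dict]],
    transform: Callable[[dict], dict],
    publish: Callable[[List[dict]], None],
    post_run: Callable[[bool], None],
) -> bool:
    """Run the five stages in order; every name here is hypothetical."""
    succeeded = False
    try:
        pre_run()                                   # 1. pre-run operations
        records = acquire()                         # 2. data acquisition
        results = [transform(r) for r in records]   # 3. data transformation
        publish(results)                            # 4. data publishing
        succeeded = True
    except Exception:
        # A real pipeline would log or report the failure here.
        succeeded = False
    finally:
        post_run(succeeded)                         # 5. post-run, always executed
    return succeeded


if __name__ == "__main__":
    sink: List[dict] = []
    run_pipeline(
        pre_run=lambda: None,
        acquire=lambda: [{"n": 1}, {"n": 2}],
        transform=lambda r: {"n": r["n"] * 10},
        publish=sink.extend,
        post_run=lambda success: print("run succeeded:", success),
    )
    print(sink)
```

Placing the post-run call in a finally clause is one simple way to model "regardless of whether the run succeeded or failed"; the success flag lets the post-run step choose between, say, a success notification and a failure alert.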

...

The data flows of a pipeline can be either batch or real-time, and a variety of processing paradigms (MapReduce or Spark) can be used.

The pipelines are created from artifacts, either system artifacts (supplied as part of CDAP) or user artifacts (installed from the Hub or created by a developer).

Batch data pipelines

Batch data pipelines can be scheduled to run periodically using a cron expression and can read data from batch sources using a Spark or MapReduce job. The batch application then performs any of a number of optional transformations before writing to one or more batch sinks.

Real-time data pipelines

Real-time pipelines poll sources periodically to fetch the data, perform any optional transformations required, and then write the output to one or more real-time sinks.
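
The poll, transform, and write cycle described above can be sketched in a few lines of Python. This is a simplified model, not the CDAP implementation: run_realtime, its parameters, and the max_polls limit (added only so the example terminates) are assumptions made for illustration.

```python
import time
from typing import Callable, List


def run_realtime(
    poll: Callable[[], List[dict]],
    transforms: List[Callable[[dict], dict]],
    sinks: List[Callable[[List[dict]], None]],
    interval_seconds: float,
    max_polls: int,
) -> int:
    """Poll a source, apply optional transforms, and write to every sink.

    Returns the number of records written to each sink.
    All names are hypothetical; max_polls only keeps the sketch finite.
    """
    written = 0
    for i in range(max_polls):
        batch = poll()                         # fetch whatever the source has now
        for t in transforms:                   # transforms are optional (list may be empty)
            batch = [t(r) for r in batch]
        if batch:
            for write in sinks:                # one or more real-time sinks
                write(batch)
            written += len(batch)
        if i < max_polls - 1:
            time.sleep(interval_seconds)       # wait until the next poll
    return written
```

A caller would pass a poll function that returns the records that arrived since the last call, plus one write function per sink.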

Note: In real-time pipelines, CDAP supports at-least-once output of data into sinks, but it doesn't guarantee exactly-once delivery. If your use case requires exactly-once results, plan for occasional duplicate records in sinks, for example by making writes idempotent or deduplicating downstream.
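
One common way to tolerate at-least-once delivery is to make writes idempotent, so that a redelivered record has no additional effect. The sketch below assumes every record carries a stable, unique id field; that field and the IdempotentSink class are hypothetical and are not part of CDAP.

```python
from typing import Dict, Hashable, Iterable


class IdempotentSink:
    """Hypothetical sink that stores records by key, so duplicates collapse."""

    def __init__(self) -> None:
        self.rows: Dict[Hashable, dict] = {}

    def write(self, records: Iterable[dict], key_field: str = "id") -> int:
        """Upsert each record by its key; return how many keys were new."""
        new = 0
        for record in records:
            key = record[key_field]
            if key not in self.rows:
                new += 1
            self.rows[key] = record   # a redelivered record overwrites itself
        return new


if __name__ == "__main__":
    sink = IdempotentSink()
    sink.write([{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
    # Record 2 is delivered a second time, as can happen with at-least-once output.
    sink.write([{"id": 2, "v": "b"}, {"id": 3, "v": "c"}])
    print(len(sink.rows))
```

With keyed upserts, the sink ends up holding one row per key no matter how many times a record is delivered. If records have no natural key, deduplication has to happen in a later step instead.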

Data pipeline lifecycle

Like other CDAP applications, pipelines have a lifecycle and can be managed and controlled using the tools supplied by CDAP, such as the Pipelines Studio, the Wrangler, the CDAP CLI, command line tools, and the Lifecycle Microservices.

...