Introduction to data pipelines

CDAP is a self-service, reconfigurable, extensible framework for developing, running, automating, and operating data pipelines on Spark or Hadoop. It is completely open source and licensed under the Apache 2.0 license.

CDAP includes the Pipeline Studio, a visual click-and-drag interface for building data pipelines from a library of pre-built plugins.

CDAP provides an operational view of the resulting pipeline that allows for lifecycle control and monitoring of metrics, logs, and other runtime information. The pipeline can be run directly in CDAP with tools such as the Pipeline Studio, the CDAP CLI, or other command line tools.

Data pipelines

Pipelines are applications created from artifacts, built specifically for processing data flows.

An artifact is an "application template". CDAP creates a pipeline application from a configuration file that defines the desired application, together with the artifacts named in that configuration. Artifacts for creating data pipelines are supplied with CDAP.
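As an illustration only (the names and version below are hypothetical, and the exact fields depend on the CDAP release), a configuration file for a batch pipeline references the system data pipeline artifact and then lists the stages and connections that make up the application; the stages and connections are left empty here for brevity:

  {
    "name": "ExamplePipeline",
    "artifact": {
      "name": "cdap-data-pipeline",
      "version": "6.2.0",
      "scope": "SYSTEM"
    },
    "config": {
      "stages": [],
      "connections": []
    }
  }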

Stages and plugins

A pipeline can be viewed as a series of stages. Each stage is a use of a plugin, an extension to CDAP that provides specific functionality.

The configuration properties for a stage describe what that plugin is to do (read from a data source, write to a table, run a script) and depend on the particular plugin used.

All stages are connected together in a directed acyclic graph (or DAG), which is shown in the Pipeline Studio and in CDAP as a connected series of plugins.

The general progression in a pipeline is:

  1. Pre-run operations: any actions required before the pipeline can actually run, such as preparing resources

  2. Data acquisition: obtaining data from a source or sources

  3. Data transformation: manipulating the data acquired from the sources

  4. Data publishing: saving the results of the transformation, either as additional data to a data sink or to a report

  5. Post-run operations: any actions required once the pipeline run has completed, such as emailing notifications or cleaning up resources, regardless of whether the pipeline run succeeded or failed

Different plugins are available to provide functionality for each stage.
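As a sketch of how this progression can appear in a pipeline configuration (the stage names below are hypothetical), the connections of the DAG wire a pre-run action into a source, the source into a transform, the transform into a sink, and the sink into a post-run action:

  {
    "connections": [
      { "from": "PrepareResources", "to": "CustomerFiles" },
      { "from": "CustomerFiles", "to": "CleanRecords" },
      { "from": "CleanRecords", "to": "Warehouse" },
      { "from": "Warehouse", "to": "SendNotification" }
    ]
  }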

Note: Connections to Transformations, Analytics, and Error Handling nodes between Source and Sink nodes are shown as solid lines. Connections to Action nodes before Source nodes or after Sink nodes are shown as dotted lines.

Data and Control Flow

Processing in the pipeline is governed by the following aspects:

  • Data flow is the movement of data, in the form of records, from one stage of a pipeline to another. When data arrives at a stage, it triggers that stage's processing of the data and then the transfer of results (if any) to the next stage.

  • Control flow is a parallel process that triggers a stage based on the result from another process, independent of the pipeline. Currently, control flow can be applied only to the initial stages (before any data flow stages run) and final stages (after all other data flow stages run) of a pipeline. A post-run stage is available after each pipeline run, successful or otherwise.

Logical and physical pipelines

Within CDAP, there is the concept of logical and physical pipelines.

A logical pipeline is the view as seen in the Pipeline Studio. It shows the stages, but not the underlying technology used to actually manifest and run the pipeline.

A physical pipeline is the manifestation of a logical pipeline as a CDAP application, which is a collection of programs and services that read and write through the data abstraction layer in CDAP.

A planner is responsible for converting the logical pipeline to the physical pipeline. The planner analyzes the logical view of the pipeline and converts it to the CDAP application.

Types of pipelines

The data flows of a pipeline can be either batch or realtime, and a variety of processing paradigms (MapReduce or Spark) can be used.

The pipelines are created from artifacts, either system artifacts (supplied as part of CDAP) or user artifacts (installed from the Hub or created by a developer).

Batch data pipelines

Batch data pipelines can be scheduled to run periodically using a cron expression and can read data from batch sources using a Spark or MapReduce job. The batch application then performs any of a number of optional transformations before writing to one or more batch sinks.
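For example, the standard five-field cron expression 0 2 * * * (minute, hour, day of month, month, day of week) runs a pipeline every day at 02:00. Where the expression is supplied depends on how the pipeline is deployed and scheduled; a hypothetical schedule entry in a pipeline configuration might look like this:

  {
    "schedule": "0 2 * * *"
  }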

Realtime data pipelines

Realtime pipelines poll sources periodically to fetch the data, perform optional transformations, and then write the output to one or more realtime sinks.

Note: CDAP supports at-least-once delivery of data into sinks in realtime pipelines, but it doesn't guarantee exactly-once delivery. If your use case requires exactly-once semantics, plan for occasional duplication of data in sinks.

Data pipeline lifecycle

Similar to other CDAP applications, pipelines have a lifecycle and can be managed and controlled using the tools supplied by CDAP, such as the Pipeline Studio, the Wrangler, the CDAP CLI, command line tools, and the Lifecycle Microservices.

Plugins

Data sources, transformations (called transforms for short), and data sinks are generically referred to as plugins. Plugins provide a way to extend the functionality of existing artifacts. An application can be created with system plugins installed with CDAP, user plugins available in the Hub, or custom plugins.
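In a stage configuration, the plugin entry can also name the artifact that supplies the plugin, including whether it is a system artifact or a user artifact installed from the Hub. The plugin, artifact, and version below are illustrative only:

  {
    "name": "ParseLogs",
    "plugin": {
      "name": "LogParser",
      "type": "transform",
      "artifact": {
        "name": "example-transform-plugins",
        "version": "1.0.0",
        "scope": "USER"
      },
      "properties": {}
    }
  }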

For information about the capabilities and behavior of each plugin, see the Data Pipeline Plugin Reference.

Properties

Each stage in a pipeline represents the configuration of a specific plugin, and that configuration usually requires that certain properties be specified. At a minimum, a unique name for the stage and the plugin being used are required; any additional properties depend on the particular plugin used.
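For example, a stage using a file-based batch source might be configured with a unique stage name, the plugin name and type, and plugin-specific properties. The property names shown here are illustrative; see the plugin reference for the exact set each plugin requires:

  {
    "name": "CustomerFiles",
    "plugin": {
      "name": "File",
      "type": "batchsource",
      "properties": {
        "path": "/data/customers.csv",
        "format": "csv"
      }
    }
  }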

For information about the properties required and supported for each plugin, see the Data Pipeline Plugin Reference.

Schema

Each stage of a pipeline that emits data (essentially, all stages except for pre-run operations and data publishing) does so with a schema that is set for that stage. Schemas need to match appropriately from stage to stage, and controls within the CDAP Pipeline Studio allow the propagation of a schema to subsequent stages.

The schema lets you control which fields, and which types, are used in each stage of the pipeline. Certain plugins require specific schemas, and transform plugins are available to convert data to required formats and schemas.
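CDAP schemas follow an Avro-style JSON format. As a sketch (the record and field names are examples), a stage might emit records with a schema like the following, where a union with null marks a nullable field:

  {
    "type": "record",
    "name": "customer",
    "fields": [
      { "name": "id", "type": "long" },
      { "name": "name", "type": "string" },
      { "name": "email", "type": ["string", "null"] }
    ]
  }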
