A "behind-the-scenes" look at CDAP Pipelines

CDAP data pipelines are a capability of CDAP that combines a user interface with back-end services to enable building, deploying, and managing data pipelines. The capability has no dependencies outside of CDAP, and all pipelines run within a Hadoop cluster.

Architecture

CDAP data pipelines allow users to build data pipelines on Hadoop, ranging from simple ETL (extract-transform-load) pipelines to more complex data pipelines.

Unlike linear ETL pipelines, data pipelines are often not linear in nature and require more complex transformations, including forks and joins at the record and feed level. They can be configured to perform various functions at different times, including machine-learning algorithms and custom processing.

Data pipelines need to support the creation of complex processing workloads that are repeatable, highly available, and easily maintainable.

Logical versus Physical Pipelines

Logical and Physical Pipelines, converted by a Planner

Within CDAP, there is the concept of logical and physical pipelines, converted by a planner, and then run in an execution environment.

A logical pipeline is the view of the pipeline as seen in the CDAP Pipeline Studio. It is the view composed of sources, sinks, and other plugins, and does not show the underlying technology used to actually manifest and run the pipeline.

This view of a pipeline focuses on the functional requirements of the pipeline rather than the physical runtime, and is closer to how a user thinks about the processing. It isolates the user from the physical pipeline, which is volatile and can be run in different runtime environments.

A physical pipeline is the manifestation of a logical pipeline as a CDAP application, which is a collection of programs and services that read and write through the data abstraction layer in CDAP. Physical view elements are those that actually run during the execution of a data pipeline on a Hadoop cluster: MapReduce programs, Spark programs, Spark Streaming programs, workflows, and so on. The physical pipeline view depends on the particular underlying technologies used and, as such, can change dynamically.

The planner is responsible for converting a logical pipeline into a physical pipeline. The planner analyzes the logical view of the pipeline and converts it to a physical execution plan, performing optimizations and bundling functions into one or more jobs.
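To give a sense of what "bundling functions into one or more jobs" means, here is a highly simplified, hypothetical sketch in Java. It is not the actual CDAP planner (which operates on a full DAG of stages); all names are invented. It simply groups consecutive logical stages that can share the same execution engine into a single physical job:

    import java.util.ArrayList;
    import java.util.List;

    public class PlannerSketch {

        enum Engine { MAPREDUCE, SPARK }

        record Stage(String name, Engine engine) {}        // a logical pipeline stage
        record Job(Engine engine, List<Stage> stages) {}   // a bundled physical job

        /** Bundle consecutive stages that run on the same engine into one job. */
        static List<Job> plan(List<Stage> logicalPipeline) {
            List<Job> physicalPlan = new ArrayList<>();
            List<Stage> current = new ArrayList<>();
            Engine currentEngine = null;
            for (Stage stage : logicalPipeline) {
                if (currentEngine != null && stage.engine() != currentEngine) {
                    physicalPlan.add(new Job(currentEngine, current));
                    current = new ArrayList<>();
                }
                currentEngine = stage.engine();
                current.add(stage);
            }
            if (!current.isEmpty()) {
                physicalPlan.add(new Job(currentEngine, current));
            }
            return physicalPlan;
        }

        public static void main(String[] args) {
            List<Stage> logical = List.of(
                new Stage("File Source", Engine.MAPREDUCE),
                new Stage("Parse Transform", Engine.MAPREDUCE),
                new Stage("Train Model", Engine.SPARK),
                new Stage("Table Sink", Engine.MAPREDUCE));
            plan(logical).forEach(job ->
                System.out.println(job.engine() + " job: " + job.stages()));
        }
    }

In this sketch, the first two stages would be bundled into one MapReduce job, followed by a Spark job and a final MapReduce job; the real planner applies many more optimizations than this.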

Execution Environment

The execution environment is the actual runtime environment where all the components of the data pipeline are executed on the Hadoop cluster by CDAP. MapReduce, Spark, and Spark Streaming are the parts of this environment that execute the data pipeline. The planner maps the logical pipeline to a physical pipeline using the runtimes available in the environment.

Functional Components

These are the different functional components that are utilized within CDAP pipelines:

Functional Architecture of CDAP Data Pipelines

Application

An application is a standardized container framework for defining all services, and is responsible for managing the lifecycle of the programs within it. Each CDAP data pipeline is converted into a CDAP application, and is deployed and managed independently.

Application Template

An application template is a user-defined, reusable, reconfigurable pattern of an application. It is parameterized by a configuration that can be reconfigured upon deployment. It provides a generic version of an application which can be repurposed, instead of requiring the ongoing creation of specialized applications. The re-configurability and modularization of the application are exposed through plugins. CDAP provides its own system-defined application templates, though new user-defined ones can be added that use the DAG interface of the Pipeline Studio. Application templates are configured using the Pipeline Studio and deployed as applications into a Hadoop cluster.

An application template consists of a definition of its different components, processing, and workflow, in the form of a configuration. Once a configuration is passed to the template, a CDAP application is constructed by combining the necessary pieces to form an executable pipeline. An application template consists of the components described in the sections that follow.

Plugin

A plugin is a customizable module, exposed and used by an application template. It simplifies adding new features or extending the capabilities of an application. Plugin implementations are based on interfaces exposed by the application templates. CDAP data pipeline application templates expose Source, Transform, and Sink interfaces, each of which has multiple implementations.
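As a minimal sketch, a transform plugin might look like the following. It assumes the CDAP plugin and pipeline APIs under the io.cdap.cdap packages (older CDAP releases used co.cask.cdap); the plugin name, configuration field, and class names are illustrative, not part of CDAP itself:

    import io.cdap.cdap.api.annotation.Description;
    import io.cdap.cdap.api.annotation.Name;
    import io.cdap.cdap.api.annotation.Plugin;
    import io.cdap.cdap.api.data.format.StructuredRecord;
    import io.cdap.cdap.api.data.schema.Schema;
    import io.cdap.cdap.api.plugin.PluginConfig;
    import io.cdap.cdap.etl.api.Emitter;
    import io.cdap.cdap.etl.api.Transform;

    @Plugin(type = Transform.PLUGIN_TYPE)
    @Name("Uppercase")
    @Description("Uppercases the value of a configurable string field.")
    public class UppercaseTransform extends Transform<StructuredRecord, StructuredRecord> {

      private final Config config;

      public UppercaseTransform(Config config) {
        this.config = config;
      }

      @Override
      public void transform(StructuredRecord input, Emitter<StructuredRecord> emitter) {
        // Copy every field, uppercasing the configured one, and emit the result.
        StructuredRecord.Builder builder = StructuredRecord.builder(input.getSchema());
        for (Schema.Field field : input.getSchema().getFields()) {
          Object value = input.get(field.getName());
          if (field.getName().equals(config.field) && value instanceof String) {
            value = ((String) value).toUpperCase();
          }
          builder.set(field.getName(), value);
        }
        emitter.emit(builder.build());
      }

      /** The plugin's configuration, supplied when the pipeline is configured in the Studio. */
      public static class Config extends PluginConfig {
        @Description("Name of the string field to uppercase.")
        private String field;

        public Config(String field) {
          this.field = field;
        }
      }
    }

The annotations identify the plugin type and name to CDAP, and the Config class is what surfaces the plugin's properties in the Pipeline Studio.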

Artifact

An artifact is a versioned packaging format used to aggregate applications, custom datasets, and plugins along with associated metadata. It is a JAR (Java Archive) containing Java classes and resources.

Pipeline Studio

Pipeline Studio is a visual development environment for building data pipelines on Hadoop. It has a click-and-drag interface for building and configuring data pipelines, and supports developing, running, automating, and operating data pipelines from within the CDAP UI. The pipeline interface integrates with the CDAP interface, allowing drill-down debugging of pipelines and the building of metrics dashboards to closely monitor pipelines through CDAP. The Pipeline Studio also integrates with other CDAP capabilities.

Testing and Automation Framework

An end-to-end JUnit framework (written in Java) is available in CDAP that allows developers to test their application templates and plugins during development. It is built as a modular framework that allows for the testing of individual components. It runs in-memory in CDAP, as abstracting to in-memory structures makes for easier debugging (shorter stack traces). The tests can be integrated with continuous integration (CI) tools such as Bamboo, Jenkins, and TeamCity.
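As a minimal sketch of this JUnit style, the following test exercises the hypothetical UppercaseTransform from the Plugin section. It assumes the MockEmitter helper from CDAP's ETL mock/test module (package names can vary between CDAP releases); larger end-to-end tests would instead extend CDAP's in-memory test base classes:

    import io.cdap.cdap.api.data.format.StructuredRecord;
    import io.cdap.cdap.api.data.schema.Schema;
    import io.cdap.cdap.etl.mock.common.MockEmitter;
    import org.junit.Assert;
    import org.junit.Test;

    public class UppercaseTransformTest {

      private static final Schema SCHEMA = Schema.recordOf(
          "record",
          Schema.Field.of("name", Schema.of(Schema.Type.STRING)));

      @Test
      public void testUppercasesConfiguredField() throws Exception {
        // Hypothetical plugin and config from the Plugin section above.
        UppercaseTransform.Config config = new UppercaseTransform.Config("name");
        UppercaseTransform transform = new UppercaseTransform(config);

        StructuredRecord input = StructuredRecord.builder(SCHEMA)
            .set("name", "cdap")
            .build();

        // Collect the emitted records in-memory and assert on the result.
        MockEmitter<StructuredRecord> emitter = new MockEmitter<>();
        transform.transform(input, emitter);

        Assert.assertEquals("CDAP", emitter.getEmitted().get(0).get("name"));
      }
    }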

Implementation of CDAP Pipelines

CDAP data pipelines are built as a CDAP capability, with three major components:

The Pipeline Studio interfaces with CDAP using Microservices.

The application templates, Batch Data Pipeline and Realtime Data Pipeline (which uses Spark Streaming), are available by default in the Pipeline Studio.

The Batch Data Pipeline and Realtime Data Pipeline application templates expose three plugin types: source, transform, and sink. The Batch Data Pipeline application template exposes additional plugin types, such as aggregate, compute, and model.

Many different plugins implementing each of these types are available "out-of-the-box" in CDAP. New plugins can be implemented using the public APIs exposed by the application templates. When an application template or a plugin is deployed within CDAP, it is referred to as an artifact. CDAP provides capabilities to manage different versions of both application templates and plugins.
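Because the Pipeline Studio talks to CDAP through the Microservices (RESTful) API, the same calls can be made directly. The following is a minimal sketch, assuming the standard application-creation endpoint PUT /v3/namespaces/<namespace>/apps/<app-name>; the host and port, artifact version, and (here empty) pipeline configuration are placeholders, so consult the CDAP Microservices documentation for the full payload format:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class DeployPipeline {
      public static void main(String[] args) throws Exception {
        // The request body names the application template artifact (here, the system
        // Batch Data Pipeline artifact) and passes the pipeline configuration that
        // the Pipeline Studio would otherwise generate (stages and connections).
        String body = """
            {
              "artifact": { "name": "cdap-data-pipeline", "version": "6.9.1", "scope": "SYSTEM" },
              "config": {
                "stages": [ ],
                "connections": [ ]
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:11015/v3/namespaces/default/apps/MyPipeline"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
      }
    }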

Internals of CDAP Data Pipelines

Building a Data Pipeline

Here is how the Pipeline Studio works with CDAP to build a data pipeline, beginning with a user creating a new data pipeline in the Pipeline Studio. First, the components of the Pipeline Studio:

Pipeline Studio, showing different UI components