...

CDAP data pipelines is a capability of CDAP and combines a user interface with back-end services to enable the building, deploying, and managing of data pipelines. It has no dependencies outside of CDAP, and all pipelines run within a Hadoop cluster.

Architecture

CDAP data pipelines allow users to build data pipelines on Hadoop, ranging from simple ETL (extract-transform-load) pipelines to more complicated data pipelines.

Unlike linear ETL pipelines, data pipelines are often non-linear and require more complex transformations, including forks and joins at the record and feed level. They can be configured to perform various functions at different times, including machine-learning algorithms and custom processing.

Data pipelines need to support the creation of complex processing workloads that are repeatable, highly available, and easily maintainable.
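The difference between a linear ETL pipeline and a pipeline with forks and joins can be sketched as a directed graph of stages. The following is a minimal illustration only, not CDAP code: the stage names and the helper functions `execution_order` and `is_linear` are hypothetical, and the sketch relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical stage names. Each key maps a stage to the set of stages it reads from.
linear_etl = {
    "transform": {"source"},
    "sink": {"transform"},
}

# A non-linear pipeline: "source" forks into two branches that are later joined.
forked_pipeline = {
    "clean": {"source"},
    "enrich": {"source"},
    "join": {"clean", "enrich"},
    "sink": {"join"},
}


def execution_order(pipeline):
    """Return one valid stage ordering that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


def is_linear(pipeline):
    """A pipeline is linear if no stage has more than one input or more than one output."""
    output_counts = {}
    for stage, inputs in pipeline.items():
        if len(inputs) > 1:  # more than one input means a join
            return False
        for upstream in inputs:
            output_counts[upstream] = output_counts.get(upstream, 0) + 1
    # more than one output means a fork
    return all(count <= 1 for count in output_counts.values())
```

In this sketch, `is_linear(linear_etl)` holds while `is_linear(forked_pipeline)` does not, and `execution_order` always places a join after both of its input branches.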

...

Pipeline Studio is a visual development environment for building data pipelines on Hadoop. It has a click-and-drag interface for building and configuring data pipelines, and it lets users develop, run, automate, and operate data pipelines from within the CDAP UI. The pipeline interface integrates with the CDAP interface, allowing drill-down debugging of pipelines and the building of metrics dashboards to closely monitor pipelines through CDAP. The Pipeline Studio integrates with other capabilities.

...

The application templates, Data Pipeline - Batch and Data Pipeline - Realtime (which uses Spark Streaming), are available by default from within the Pipeline Studio.

The Batch Data Pipeline and Realtime Data Pipeline application templates expose three plugin types: source, transform, and sink. The Batch Data Pipeline application template exposes three additional plugin types: aggregate, compute, and model. Additional plugin types can be created and will be added in upcoming releases.

There are many different plugins that implement each of these types available "out-of-the-box" in CDAP. New plugins can be implemented using the public APIs exposed by the application templates. When an application template or a plugin is deployed within CDAP, it is referred to as an artifact. CDAP provides capabilities to manage the different versions of both the application templates and the plugins.
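The relationship between application templates and the plugin types they expose can be pictured as a simple lookup. The sketch below is illustrative only: the dictionary and the `plugin_types` helper are hypothetical and do not correspond to a CDAP API; only the template names and plugin type names come from the text above.

```python
# Plugin types exposed by both templates, as listed in the text above.
COMMON_PLUGIN_TYPES = ["source", "transform", "sink"]

# Hypothetical lookup table; CDAP itself serves this information at runtime.
PLUGIN_TYPES_BY_TEMPLATE = {
    "Batch Data Pipeline": COMMON_PLUGIN_TYPES + ["aggregate", "compute", "model"],
    "Realtime Data Pipeline": list(COMMON_PLUGIN_TYPES),
}


def plugin_types(template):
    """Return the plugin types exposed by a template, or raise if it is unknown."""
    try:
        return PLUGIN_TYPES_BY_TEMPLATE[template]
    except KeyError:
        raise ValueError(f"unknown application template: {template!r}") from None
```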

...

  • User Selects an Application Template

    A user building a pipeline within the Pipeline Studio will select a pipeline type, which is essentially picking an application template.

  • Retrieve the Plugin types supported by the selected Application Template

    Once a user has selected an application template, the Pipeline Studio makes a request to CDAP for the different plugin types supported by the application template. In the case of the Batch Data Pipeline, CDAP will return Source, Transform, and Sink as plugin types. This allows the Pipeline Studio to construct the selection drawer in the left sidebar of the UI.

  • Retrieve the Plugin definitions for each Plugin type

    Pipeline Studio then makes a request to CDAP for each plugin type, requesting all plugin implementations available for each plugin type.

  • User Builds the CDAP Pipeline

    The user then uses the Pipeline Studio canvas to create a pipeline with the available plugins.

  • Validation of the CDAP Pipeline

    The user can request at any point that the pipeline be validated. This request is translated into a Microservices call to CDAP, which passes it to the application template, which in turn checks whether the pipeline is valid.

  • Application Template Configuration Generation

    As the user is building a pipeline, the Pipeline Studio is building a JSON configuration that, when completed, will be passed to the application template to configure and create an application that is deployed to the cluster.

  • Converting a logical Pipeline into a physical Pipeline and registering the Application

    When the user publishes the pipeline, the configuration generated by the Pipeline Studio is passed to the application template as part of the creation of the application. The application template takes the configuration, passes it through a planner to create a physical layout, generates an application specification, and registers the specification with CDAP as an application.

  • Managing the physical Pipeline

    Once the application is registered with CDAP, the pipeline is ready to be started. If it was scheduled, the schedule is ready to be enabled. The CDAP UI then uses the CDAP Microservices to manage the pipeline's lifecycle. The pipeline can be managed from CDAP through the CDAP UI, by using the CDAP CLI, or by using the Microservices.

  • Monitoring the physical Pipeline

    Because CDAP pipelines run as CDAP applications, their logs and metrics are aggregated by the CDAP system and are available through the Microservices.
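The configuration-building and validation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Pipeline Studio or application template implementation: the functions `build_pipeline_config` and `validate` are hypothetical, the JSON field names (`artifact`, `config`, `stages`, `connections`, `plugin`) are assumed for illustration, and the local checks merely stand in for the validation that the application template performs on the server side.

```python
import json


def build_pipeline_config(name, template, stages, connections):
    """Assemble a JSON-serializable pipeline configuration (assumed layout).

    stages: list of (stage_name, plugin_name, plugin_type, properties) tuples.
    connections: list of (from_stage, to_stage) tuples.
    """
    return {
        "name": name,
        "artifact": {"name": template},  # assumed field: identifies the application template
        "config": {
            "stages": [
                {
                    "name": stage_name,
                    "plugin": {"name": plugin, "type": ptype, "properties": props},
                }
                for stage_name, plugin, ptype, props in stages
            ],
            "connections": [{"from": src, "to": dst} for src, dst in connections],
        },
    }


def validate(config, supported_types):
    """Return a list of problems found; an empty list means the sketch considers it valid."""
    errors = []
    stages = config["config"]["stages"]
    names = [stage["name"] for stage in stages]
    if len(set(names)) != len(names):
        errors.append("stage names must be unique")
    for stage in stages:
        ptype = stage["plugin"]["type"]
        if ptype not in supported_types:
            errors.append(f"stage {stage['name']!r} uses unsupported plugin type {ptype!r}")
    for conn in config["config"]["connections"]:
        for end in ("from", "to"):
            if conn[end] not in names:
                errors.append(f"connection references unknown stage {conn[end]!r}")
    return errors


# Hypothetical three-stage pipeline; plugin names and properties are made up.
example = build_pipeline_config(
    name="example-pipeline",
    template="Batch Data Pipeline",
    stages=[
        ("read", "ExampleSource", "source", {"path": "/tmp/in"}),
        ("clean", "ExampleTransform", "transform", {}),
        ("write", "ExampleSink", "sink", {"path": "/tmp/out"}),
    ],
    connections=[("read", "clean"), ("clean", "write")],
)
example_json = json.dumps(example, indent=2)  # what would be sent on publish
```

Calling `validate(example, {"source", "transform", "sink"})` returns an empty list for this configuration. A connection to a stage that does not exist, or a plugin type the template does not expose, would each add an entry to the returned list.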