Pipeline Studio overview

Pipeline Studio is a visual tool that helps you design and build data pipelines.

A data pipeline lets you visually design data and control flows that extract, transform, blend, aggregate, and load data, with connectivity to a variety of on-premises and cloud data sources. Data pipelines help users such as ETL developers, business data analysts, data scientists, data engineers, and other developers solve their data ingestion, integration, and migration problems. They also help users build and manage data lakes.

The Pipeline Studio has built-in validation logic in each stage of the data pipeline to ensure your transformations will execute successfully.

At the top left of the Pipeline Studio, you can select whether to create a batch or realtime data pipeline.

Below this option is the plugin palette, which lists all the available plugins you can use to build pipelines.

The Pipeline Studio also helps you with other ETL development tasks such as:

Debugging pipelines by using Preview
Deploying pipelines
Saving pipelines
Configuring pipeline resources
Scheduling pipeline jobs
Importing and exporting pipelines

Plugins

The objects in a pipeline are called plugins. For example, sources, transformations, and sinks are all plugins. From the plugin palette, you can click on any plugin to add it to a pipeline.
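To make the idea of plugins as pipeline objects concrete, the following is a minimal sketch of a pipeline represented as a list of stages plus the connections between them. The layout is loosely modeled on the JSON that an exported pipeline contains, but every plugin name, type string, and property key here is a hypothetical placeholder; compare it with a pipeline you export yourself before relying on any field.

```python
import json


def make_stage(name, plugin_type, plugin_name, properties):
    """Build one stage entry; each stage wraps exactly one plugin."""
    return {
        "name": name,
        "plugin": {
            "name": plugin_name,
            "type": plugin_type,
            "properties": properties,
        },
    }


# All plugin names, types, and property keys below are illustrative assumptions.
stages = [
    make_stage("ReadOrders", "batchsource", "ExampleSource", {"path": "/tmp/orders.csv"}),
    make_stage("CleanOrders", "transform", "ExampleTransform", {"dropEmpty": "true"}),
    make_stage("WriteOrders", "batchsink", "ExampleSink", {"table": "orders_clean"}),
]

# Connections refer to stages by name, in the direction the data flows.
connections = [
    {"from": "ReadOrders", "to": "CleanOrders"},
    {"from": "CleanOrders", "to": "WriteOrders"},
]

pipeline = {
    "name": "orders_pipeline",
    "config": {"stages": stages, "connections": connections},
}


def dangling_connections(pipeline):
    """Return connections that mention a stage name that does not exist."""
    names = {stage["name"] for stage in pipeline["config"]["stages"]}
    return [
        conn
        for conn in pipeline["config"]["connections"]
        if conn["from"] not in names or conn["to"] not in names
    ]


print(json.dumps(pipeline, indent=2))
print("dangling connections:", dangling_connections(pipeline))
```

The dangling_connections helper is not part of the product; it only illustrates the kind of consistency check that becomes possible once a pipeline is treated as data.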

Each plugin has a few required fields, which are marked with an asterisk (*). The configuration section of each plugin also contains a Label field. Use the label to describe what the plugin does in the pipeline as clearly as possible.
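The required-field rule can be illustrated with a small check outside the Studio. This is a sketch only: the function missing_required and the field names referenceName and path are assumptions chosen for the example, and it does not reproduce the validation that the Pipeline Studio itself performs.

```python
def missing_required(properties, required):
    """Return the required field names that are absent or blank."""
    missing = []
    for field in required:
        value = properties.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Hypothetical required fields for an imaginary source plugin.
required = ["referenceName", "path"]

# A plugin configuration with a descriptive label but a blank path.
config = {
    "label": "Read daily orders",
    "referenceName": "orders",
    "path": "   ",
}

print(missing_required(config, required))
```

A value made up only of whitespace is treated as missing here, which is a design choice of the sketch rather than a documented behavior.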

Depending on the plugin you are using, you might see an Input Schema on the left, a Configuration section in the middle, and an Output Schema on the right.

For plugins that have an input schema, the schema lists the fields that are passed into the plugin for processing. After the plugin processes them, the outgoing data described by the output schema is sent to the next plugin in the data pipeline or, in the case of a sink, written to a dataset. Source plugins don’t have an input schema, and sink plugins don’t have an output schema.

In addition to the configuration settings available in each plugin, you can also edit the output schema. You can allow or disallow nulls for each field, change the data type of any field, and remove unnecessary fields from the output schema.
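The three output schema edits described above (allowing nulls, changing a data type, removing a field) can be sketched as plain functions over a schema. The schema below uses an Avro-style record layout in which a nullable field is written as a union with "null"; that representation, the field names, and the helper functions are assumptions made for illustration, not the Studio's own API.

```python
import copy

# Hypothetical output schema in an Avro-style record layout.
schema = {
    "type": "record",
    "name": "orders",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "amount", "type": "string"},
        {"name": "notes", "type": ["string", "null"]},
        {"name": "debug_flag", "type": "boolean"},
    ],
}


def is_nullable(field_type):
    """A field is nullable when its type is a union that includes "null"."""
    return isinstance(field_type, list) and "null" in field_type


def base_type(field_type):
    """Return the non-null type of a field, whether or not it is nullable."""
    if isinstance(field_type, list):
        return [t for t in field_type if t != "null"][0]
    return field_type


def set_nullable(schema, name, nullable):
    """Allow or disallow nulls for one field; returns an edited copy."""
    out = copy.deepcopy(schema)
    for field in out["fields"]:
        if field["name"] == name:
            t = base_type(field["type"])
            field["type"] = [t, "null"] if nullable else t
    return out


def set_type(schema, name, new_type):
    """Change the data type of one field, preserving its nullability."""
    out = copy.deepcopy(schema)
    for field in out["fields"]:
        if field["name"] == name:
            nullable = is_nullable(field["type"])
            field["type"] = [new_type, "null"] if nullable else new_type
    return out


def drop_field(schema, name):
    """Remove one field from the schema; returns an edited copy."""
    out = copy.deepcopy(schema)
    out["fields"] = [f for f in out["fields"] if f["name"] != name]
    return out


edited = set_nullable(schema, "id", True)
edited = set_type(edited, "amount", "double")
edited = drop_field(edited, "debug_flag")

for field in edited["fields"]:
    print(field["name"], field["type"])
```

Each helper returns a copy, so the original schema stays available for comparison.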

Each plugin has embedded documentation. Click the Documentation tab to review configuration information for the plugin.

Data pipeline modes

Data pipelines have two modes:

Draft
Deployed

Draft mode

From within the Pipeline Studio, you can save a pipeline you are working on at any time as a draft. The pipeline configuration is saved, and you can resume editing later.

To create a draft, give your data pipeline a unique name, and then click the Save button.

The pipeline appears under Drafts on the List page. To continue working on the pipeline, click its name under Drafts to open it in the Pipeline Studio canvas.

Deployed mode

After you finish designing and debugging a pipeline and are satisfied with the data you see in Preview, you are ready to deploy the pipeline. Deploying converts the pipeline into an immutable entity, which is not versioned. Once a pipeline is deployed, you cannot edit it. You can run it, delete it, or duplicate it; the duplicate opens in draft mode in the Pipeline Studio canvas.
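The relationship between the two modes can be summarized as a small state model. This is a conceptual sketch under stated assumptions: the Pipeline class and the edit, deploy, and duplicate functions are hypothetical names invented for the example and do not correspond to any product API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical model of a pipeline and its mode."""
    name: str
    stages: tuple
    mode: str = "draft"  # either "draft" or "deployed"


def edit(pipeline, stages):
    """Editing is only allowed while the pipeline is a draft."""
    if pipeline.mode != "draft":
        raise ValueError("a deployed pipeline cannot be edited")
    return replace(pipeline, stages=tuple(stages))


def deploy(pipeline):
    """Deploying turns a draft into an immutable, deployed pipeline."""
    if pipeline.mode != "draft":
        raise ValueError("only a draft can be deployed")
    return replace(pipeline, mode="deployed")


def duplicate(pipeline, new_name):
    """Duplicating any pipeline yields a new draft with the same stages."""
    return replace(pipeline, name=new_name, mode="draft")


draft = Pipeline("orders", ("source", "sink"))
draft = edit(draft, ["source", "transform", "sink"])
deployed = deploy(draft)

try:
    edit(deployed, ["source", "sink"])
except ValueError as err:
    print("edit rejected:", err)

copy_of_deployed = duplicate(deployed, "orders_v2")
print(copy_of_deployed.mode, copy_of_deployed.stages)
```

To change a deployed pipeline in this model, you duplicate it, edit the resulting draft, and deploy that draft under its own name.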

Overview of the data pipeline development process

1. Create a pipeline. Add sources, transformations, and sinks.
2. Preview output. You might need to modify the pipeline to get the desired output.
3. Deploy the pipeline.
4. Configure pipeline resources.
5. Run the pipeline.
6. Validate the output in the sinks.
