...

Abstraction of the actual representation of data in storage.

Data Pipeline

A type of pipeline, often not linear in nature, that requires complex transformations, including forks and joins at the record and feed level. Data pipelines can be configured to perform various functions at different times, including machine-learning algorithms and custom processing. CDAP provides an easy method of configuring data pipelines using a visual editor, called Pipeline Studio. You click and drag sources, transformations, and sinks, configuring a pipeline within minutes. It provides an operational view of the resulting pipeline that allows for monitoring of metrics, logs, and other run-time information.

Dataset

Datasets store and retrieve data and are a high-level abstraction of the underlying data storage with generic reusable implementations of common data patterns.

...

Abbreviation for extract, transform, and load of data.

ETL

Refers to the extract, transform, and load of data.

Exploring

Datasets in CDAP can be explored through ad-hoc SQL-like queries. To enable exploration, you must set several properties when creating the dataset, and the files in a dataset must meet certain requirements.

...

A namespace is a logical grouping of applications, data, and their metadata in CDAP. Conceptually, namespaces can be thought of as a partitioning of a CDAP instance. Any application or data (referred to here as an “entity”) can exist independently in multiple namespaces at the same time. The data and metadata of an entity are stored independently of another instance of the same entity in a different namespace. The primary motivation for namespaces in CDAP is to achieve application and data isolation.
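
The isolation described above can be pictured with a small Python sketch. It is an illustration only and not CDAP's API: the `NamespaceStore` class, its methods, and the `dev` / `prod` / `purchases` names are all hypothetical. It keys each entity by namespace and name, so two entities with the same name in different namespaces never share data.

```python
class NamespaceStore:
    """Hypothetical in-memory model of namespace isolation (not a CDAP API)."""

    def __init__(self):
        # Maps (namespace, entity_name) -> entity data, so each namespace
        # holds its own independent copy of an entity.
        self._entities = {}

    def put(self, namespace, name, data):
        # Copy the data so callers cannot mutate the stored entity by accident.
        self._entities[(namespace, name)] = dict(data)

    def get(self, namespace, name):
        return self._entities[(namespace, name)]


store = NamespaceStore()
store.put("dev", "purchases", {"rows": 10})
store.put("prod", "purchases", {"rows": 5000})
```

Reading `purchases` from `dev` and from `prod` returns two unrelated values, which is the property the definition calls application and data isolation.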

...

A physical pipeline is the manifestation of a logical pipeline as a CDAP application, which is a collection of programs and services that read and write through the data abstraction layer in CDAP.

Pipeline

CDAP provides an easy method of configuring pipelines using a visual editor, called CDAP Studio. You click and drag sources, transformations, and sinks, configuring a pipeline within minutes. It provides an operational view of the resulting pipeline that allows for monitoring of metrics, logs, and other run-time information.

Pipeline

A pipeline is a series of stages—linked usages of individual programs—configured together into an application.
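
As a rough illustration of "a series of stages configured together", the Python sketch below chains plain functions. It does not use any CDAP API; the function names (`run_pipeline`, `strip_whitespace`, `to_upper`, `drop_empty`) are hypothetical and stand in for whatever stages a real pipeline would link.

```python
from functools import reduce


def run_pipeline(stages, records):
    """Pass records through each stage in order; each stage maps a list to a list."""
    return reduce(lambda data, stage: stage(data), stages, records)


# Hypothetical stages: a cleanup step, a transformation, and a filter.
def strip_whitespace(records):
    return [r.strip() for r in records]


def to_upper(records):
    return [r.upper() for r in records]


def drop_empty(records):
    return [r for r in records if r]


result = run_pipeline([strip_whitespace, to_upper, drop_empty], [" a ", "", "b"])
```

The order of the list is the order of execution, so reordering or adding stages changes the pipeline without touching the stages themselves.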

Plugin

A plugin extends an application by implementing an interface expected by the application. Plugins are packaged in an artifact.

Plugin

A plugin extends an application template by implementing an interface expected by the application template. Plugins are packaged in an artifact.

Real-time Pipeline

A type of CDAP Pipeline that runs continuously, performing actions on a distinct set of data.

Route Config

See route configuration.

Route Configuration

Also known as a route config, a map that allocates requests for a service between different versions of the service.
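
One way to picture such a map is a dictionary from service version to a share of requests. The Python sketch below is an assumption-laden illustration and not CDAP's implementation: the `pick_version` function, the version strings, and the rule that the shares are percentages adding up to 100 are all assumptions made for the example.

```python
import random


def pick_version(route_config, rng=random):
    """Choose a service version according to percentage weights (assumed format)."""
    # Assumption for this sketch: weights are percentages that total 100.
    if sum(route_config.values()) != 100:
        raise ValueError("route percentages must add up to 100")
    versions = list(route_config)
    weights = [route_config[v] for v in versions]
    # random.choices performs a weighted random selection.
    return rng.choices(versions, weights=weights, k=1)[0]


# Hypothetical route config: most requests go to the older version.
route_config = {"1.0.0": 80, "2.0.0": 20}
version = pick_version(route_config)
```

Each call picks one version for one request, so over many requests the split approaches the configured shares.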

...

For datasets, a storage provider is the underlying system that CDAP uses for persistence. Examples include HDFS, HBase, and Hive.

Structured Record

The data format used to exchange events between most of the pre-built CDAP ETL plugins.

Structured Record

A data format, defined in CDAP, that can be used to exchange events between plugins. Used by many of the CDAP pipeline plugins included in CDAP.
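
The idea of a record that carries its own schema can be sketched in Python as follows. This is a simplified stand-in and not the CDAP class: the `Record` type, its two fields, and the validation rules are hypothetical, chosen only to show why a shared, schema-checked format makes it easier for plugins to exchange events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a structured record: a schema plus values."""
    schema: dict  # field name -> expected Python type
    values: dict  # field name -> value

    def __post_init__(self):
        # Reject records that do not match their schema, so a downstream
        # plugin can rely on every declared field being present and typed.
        for name, expected in self.schema.items():
            if name not in self.values:
                raise ValueError(f"missing field: {name}")
            if not isinstance(self.values[name], expected):
                raise TypeError(f"field {name} is not {expected.__name__}")


schema = {"user": str, "amount": float}
record = Record(schema, {"user": "alice", "amount": 9.5})
```

A record missing a declared field, or holding a value of the wrong type, fails at construction time instead of inside a later stage.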

...