Glossary

Apache Hadoop

See Hadoop.

Apache Spark

See Spark.

Application

A collection of programs and services that read and write through the data abstraction layer in CDAP.

Application Abstraction

Application abstraction allows the same application to run in multiple environments without modification.

Application Template

An artifact that, with the addition of a configuration file, can be used to create manifestations of applications.

Artifact

A JAR file containing Java classes and resources required to create and run an Application. Multiple applications can be created from the same artifact.

Avro

Refers to the Apache Avro™ project, which is a data serialization system that provides rich data structures and a compact, fast, binary data format.

Batch Pipeline

A type of CDAP Pipeline that runs on a schedule, performing actions on a distinct set of data.

CDAP

The Cask Data Application Platform; refers to both the platform and an installed instance of it.

CDAP Application

See Application.

CDAP CLI

See Command Line Interface.

CDAP Console

See CDAP UI.

CDAP Pipeline

A CDAP application created from an application template, generally one of the system artifacts shipped with CDAP. It defines a source to read from, zero or more transformations or other steps to perform on the data read from the source, and one or more sinks to write the transformed data to.

CDAP Pipeline Plugin

A plugin of type BatchSource, RealtimeSource, BatchSink, RealtimeSink, or Transformation, packaged as a JAR file, for use in a CDAP pipeline.

CDAP Sandbox

A version of the Cask Data Application Platform, supplied as a downloadable archive, that runs on a single machine in a single Java Virtual Machine (JVM). It provides all of the CDAP APIs without requiring a Hadoop cluster, using alternative, fully-functional implementations of CDAP features. For example, application containers are implemented as Java threads instead of YARN containers. Formerly known as the Standalone CDAP.

CDAP UI

The CDAP UI is a web-based application used to deploy CDAP applications, create pipelines using the CDAP Studio, and query and manage the Cask Data Application Platform instance.

Command Line Interface

The Command Line Interface (CLI) provides methods to interact with CDAP from within a shell, similar to the HBase or bash shells.

DAG

A directed acyclic graph. A Pipeline is displayed as a DAG in the CDAP UI.
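
As a minimal sketch (not CDAP code), a pipeline can be modeled as a DAG whose nodes are stages and whose edges are connections between them. The stage names and the `topological_order` helper are hypothetical; the example uses Kahn's algorithm to derive one valid execution order and to reject graphs that contain a cycle.

```python
from collections import deque


def topological_order(edges):
    """Return the nodes of a directed graph in dependency order (Kahn's algorithm).

    `edges` maps each node to the list of nodes it feeds into.
    Raises ValueError if the graph contains a cycle, i.e. is not a DAG.
    """
    # Collect every node, including ones that only appear as targets.
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)

    # Count incoming edges for each node.
    in_degree = {node: 0 for node in nodes}
    for targets in edges.values():
        for target in targets:
            in_degree[target] += 1

    # Start from the nodes with no incoming edges (the sources).
    ready = deque(sorted(node for node in nodes if in_degree[node] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in edges.get(node, []):
            in_degree[target] -= 1
            if in_degree[target] == 0:
                ready.append(target)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


# Hypothetical pipeline: one source, two transformations, two sinks.
pipeline = {
    "source": ["parse"],
    "parse": ["aggregate", "raw_sink"],
    "aggregate": ["report_sink"],
}
order = topological_order(pipeline)
```

Every stage appears in `order` after all of the stages that feed into it, which is the property that makes a DAG a natural representation for a pipeline.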

Data Abstraction

Abstraction of the actual representation of data in storage.

Data Pipeline

CDAP provides an easy method of configuring data pipelines using a visual editor, called Pipeline Studio. You click and drag sources, transformations, and sinks, configuring a pipeline within minutes. It provides an operational view of the resulting pipeline that allows for monitoring of metrics, logs, and other run-time information.

Dataset (Deprecated)

Note: Datasets are deprecated and will be removed in CDAP 7.0.0.

Datasets store and retrieve data and are a high-level abstraction of the underlying data storage with generic reusable implementations of common data patterns.

Distributed CDAP

A version of the Cask Data Application Platform, supplied as either Yum .rpm or APT .deb packages, that runs on a Hadoop cluster. Packages are available for Ubuntu 12 and CentOS 6.

ELT execution

ELT (Extract, Load, Transform) is shorthand for a pattern of execution that pushes the transformations defined in a data pipeline down to BigQuery. Because the transformations are pushed to the data rather than the data being moved to a processing engine, this pattern can yield significant performance benefits.

ETL

Abbreviation for the extraction, transformation, and loading of data.

ETL execution

ETL (Extract, Transform, Load) is shorthand for a pattern of execution that involves extracting data out of a storage engine into a staging area, processing it with a scalable data processing engine (such as Spark), and then loading it into the appropriate destination.

Exploring

Datasets in CDAP can be explored through ad-hoc SQL-like queries. To enable exploration, you must set several properties when creating the dataset, and the files in a dataset must meet certain requirements.

FileSet

A dataset composed of a collection of files in the file system that share common attributes, such as format and schema. It abstracts away the underlying file system interfaces.

Hadoop

Refers to the Apache Hadoop® project, which describes itself as:

"The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures."

Hybrid ETL/ELT

Hybrid ETL/ELT is a pattern of execution in which the pipeline execution planner pushes down as many transformations as possible to BigQuery (ELT), then stages the data to execute pushdown-incompatible transformations in a scalable processing engine like Spark (ETL), and then pushes down the rest of the pipeline to BigQuery (ELT).
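
The segmentation described above can be sketched as follows. This is a simplified illustration, not the actual planner: it assumes a linear pipeline (a real pipeline is a DAG), and the `plan_segments` helper, the stage names, and the compatibility flags are all hypothetical.

```python
from itertools import groupby


def plan_segments(stages):
    """Group consecutive stages by whether they can be pushed down.

    `stages` is an ordered list of (name, pushdown_compatible) pairs for a
    linear pipeline. Returns a list of (mode, [stage names]) tuples, where
    mode is "ELT" for runs that are pushed down and "ETL" for runs that are
    staged and executed in a separate processing engine.
    """
    plan = []
    for compatible, group in groupby(stages, key=lambda stage: stage[1]):
        mode = "ELT" if compatible else "ETL"
        plan.append((mode, [name for name, _ in group]))
    return plan


# Hypothetical pipeline with one pushdown-incompatible stage in the middle.
stages = [
    ("filter", True),
    ("join", True),
    ("custom_udf", False),
    ("aggregate", True),
]
plan = plan_segments(stages)
```

Here `plan` contains an ELT segment, an ETL segment for the incompatible stage, and a final ELT segment, mirroring the ELT, ETL, ELT sequence in the definition.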

Logical Pipeline

A view of a pipeline that shows the sources, sinks, and other plugins it is composed of, but not the underlying technology used to actually manifest and run the pipeline.

MapReduce

MapReduce is a processing model used to process data in batch. MapReduce programs can be written as in a conventional Apache Hadoop system. CDAP datasets can be accessed from MapReduce programs as both input and output.

Master Services

CDAP system services that run in YARN containers, such as the Transaction Service, Dataset Executor, Log Saver, and Metrics Processor.

Namespace

A namespace is a logical grouping of applications, data, and their metadata in CDAP. Conceptually, namespaces can be thought of as a partitioning of a CDAP instance. Any application or data (referred to as an “entity”) can exist independently in multiple namespaces at the same time. The data and metadata of an entity are stored independently of another instance of the same entity in a different namespace. The primary motivation for namespaces in CDAP is to achieve application and data isolation.
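
A minimal sketch of this isolation, assuming a purely in-memory model: the `CdapInstanceSketch` class and the namespace and entity names are hypothetical and do not correspond to any CDAP API.

```python
class CdapInstanceSketch:
    """Hypothetical in-memory stand-in for a CDAP instance (not a CDAP API)."""

    def __init__(self):
        # Entities are keyed by (namespace, name), so the same name can
        # exist independently in several namespaces.
        self._entities = {}

    def put(self, namespace, name, data):
        self._entities[(namespace, name)] = data

    def get(self, namespace, name):
        return self._entities[(namespace, name)]


instance = CdapInstanceSketch()
instance.put("dev", "purchases", {"rows": 10})
instance.put("prod", "purchases", {"rows": 5000})
```

The entity named `purchases` exists in both namespaces, and writing to one copy leaves the other untouched.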

Physical Pipeline

A physical pipeline is the manifestation of a logical pipeline as a CDAP application, which is a collection of programs and services that read and write through the data abstraction layer in CDAP.

Pipeline Studio

A visual editor, part of the CDAP UI, for creating and configuring pipelines. You click and drag sources, transformations, and sinks, and can name and configure the pipelines. It provides an operational view of the resulting pipeline that allows for monitoring of metrics, logs, and other run-time information.

Plugin

A plugin extends an application by implementing an interface expected by the application. Plugins are packaged in an artifact.

Pushdown-incompatible sources

Transformations or aggregations in the pipeline that cannot be pushed down to a given pushdown engine. For such a node, Data Fusion’s execution plan temporarily stages the input, executes the node's logic as ETL (using the relevant engine), temporarily stages the output, and then potentially resumes processing as ELT.

Pushdown log

Statements in the pipeline log that indicate which portion of the pipeline is pushed down to one or more SQL-compatible storage systems corresponding to sources or sinks in the pipeline. These statements include the SQL query, as well as the nodes in the pipeline that are not pushed down. If transformation pushdown is a business-critical requirement, these statements help users ascertain which portion of the pipeline is pushed down. They can also help users adjust the pipeline, where possible, so that a larger portion of it is pushed down.

Real-time Pipeline

A type of CDAP Pipeline that runs continuously, performing actions on a distinct set of data.

Route Configuration

Also known as a route config, a map that allocates requests for a service between different versions of the service.
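
As an illustration only, the allocation can be sketched as a mapping from version names to percentages. The percentage-based format, the version names, and the `choose_version` helper are assumptions made for this example, not a documented CDAP interface.

```python
import random


def choose_version(route_config, roll):
    """Pick a service version for one request.

    `route_config` maps version names to integer percentages that sum to 100.
    `roll` is a number in [0, 100); a request is sent to the version whose
    cumulative percentage range contains the roll.
    """
    if sum(route_config.values()) != 100:
        raise ValueError("percentages must sum to 100")
    if not 0 <= roll < 100:
        raise ValueError("roll must be in [0, 100)")
    upper = 0
    for version, percentage in route_config.items():
        upper += percentage
        if roll < upper:
            return version
    # Unreachable when the checks above pass; kept as a safeguard.
    raise ValueError("no version selected")


# Hypothetical route config: 30% of requests to v1, 70% to v2.
route_config = {"v1": 30, "v2": 70}
version = choose_version(route_config, random.randrange(100))
```

With this config, rolls from 0 to 29 select `v1` and rolls from 30 to 99 select `v2`, so over many requests roughly 30% go to `v1`.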

Secure Key

An identifier or an alias for an entry in Secure Storage. An entry in Secure Storage can be referenced and retrieved with its Secure Key, either programmatically or through Microservices.

Secure Storage

Encrypted storage for sensitive data using Secure Keys. CDAP supports File-backed (for CDAP Sandbox) as well as Apache Hadoop KMS-backed (for Distributed CDAP) Secure Storage.

Service

Services can be run in a Cask Data Application Platform (CDAP) application to serve data to external clients. Services run in containers, and the number of running service instances can be dynamically scaled. Developers can implement custom services to interface with a legacy system and perform additional processing beyond the CDAP processing paradigms. Examples include running an IP-to-Geo lookup and serving user profiles.

Spark

Spark is a fast and general processing engine, compatible with Hadoop data, used for in-memory cluster computing. It lets you load large sets of data into memory and query them repeatedly, making it suitable for both iterative and interactive programs. Similar to MapReduce, Spark can access datasets as both input and output. Spark programs in CDAP can be written in either Java or Scala.

Standalone CDAP

See CDAP Sandbox.

Storage Provider

For datasets, a storage provider is the underlying system that CDAP uses for persistence. Examples include HDFS, HBase, and Hive.

Structured Record

A data format, defined in CDAP, that can be used to exchange events between plugins. Used by many of the CDAP pipeline plugins included in CDAP.

System Artifact

An application template shipped with CDAP that, with the addition of a configuration file, can be used to create manifestations of applications.

Time-partitioned FileSet

A FileSet dataset that uses a timestamp as the partitioning key to split the data into individual files. Though it is not required that data in each partition be organized by time, each partition is assigned a logical time. Typically written to in batch mode, at a set time interval.
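
To make the idea of a logical partition time concrete, the sketch below maps a timestamp to a relative partition path. The `partition_path` helper and the date/hour-minute directory layout are illustrative assumptions, not necessarily the layout CDAP uses on disk.

```python
from datetime import datetime, timezone


def partition_path(epoch_millis):
    """Map a logical partition time to a relative directory path.

    The time is truncated to the minute, so all timestamps within the same
    minute map to the same partition. The layout is an assumption made for
    this example only.
    """
    moment = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d/%H-%M")


# 1_700_000_000_000 ms is 2023-11-14 22:13:20 UTC.
path = partition_path(1_700_000_000_000)
```

A batch job that runs at a set interval would write each run's output under the path derived from that run's logical time.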

Timeseries Dataset

A dataset in which time is the primary means of organizing the data, and in which both the data model and the schema that represents the data are optimized for querying and aggregating over time ranges.

Transformation pushdown

This configuration determines whether transformations in an ETL pipeline should be pushed down to BigQuery. It is also a container for more specific settings regarding transformation pushdown.

Worker

Workers are typically long-running background programs that can be used to execute tasks. Each instance of a worker runs either in its own YARN container (Distributed CDAP mode) or in a single thread (CDAP Sandbox or in-memory mode). The number of instances can be updated via Microservices or the CLI. Datasets can be accessed from inside workers.

Workflow

A workflow is used to execute a series of MapReduce programs, with an optional schedule to run itself periodically.