...

A version of the Cask Data Application Platform, supplied as either RPM packages (installed with Yum) or Debian packages (installed with APT), that runs on a Hadoop cluster. Packages are available for Ubuntu 12 and CentOS 6.

ELT execution

ELT (Extract Load Transform) is shorthand for a pattern of execution in which transformations defined in a data pipeline are pushed down to BigQuery. Because it moves the transformations to the data rather than the data to the transformations, it can yield significant performance benefits.

ETL

Abbreviation for the extraction, transformation, and loading of data.

ETL execution

ETL (Extract Transform Load) is shorthand for a pattern of execution that involves extracting data out of a storage engine into a staging area, transforming it using a scalable data processing engine (such as Spark), and then loading it into the appropriate destination.

Exploring

Datasets in CDAP can be explored through ad-hoc SQL-like queries. To enable exploration, you must set several properties when creating the dataset, and the files in a dataset must meet certain requirements.
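
For illustration, a minimal sketch of issuing such an ad-hoc query over JDBC, assuming the CDAP Explore JDBC driver is on the classpath, that a dataset named "purchases" was created with exploration enabled, and that explorable datasets are exposed as tables named dataset_<name>; the driver class name and connection URL shown are illustrative assumptions and may differ by CDAP version.

    // Sketch: ad-hoc SQL-like query against an explorable CDAP dataset via JDBC.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ExploreQueryExample {
      public static void main(String[] args) throws Exception {
        // Assumed driver class and URL format; adjust to your CDAP version and host.
        Class.forName("io.cdap.cdap.explore.jdbc.ExploreDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:cdap://localhost:11015");
             Statement stmt = conn.createStatement();
             // Explorable datasets are exposed as tables named dataset_<name>.
             ResultSet rs = stmt.executeQuery("SELECT * FROM dataset_purchases LIMIT 10")) {
          while (rs.next()) {
            System.out.println(rs.getString(1));
          }
        }
      }
    }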

...

"The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures."

Hybrid ETL/ELT

Hybrid ETL/ELT is a pattern of execution in which the pipeline execution planner pushes down as many transformations as possible to BigQuery (ELT), then stages the data to execute pushdown-incompatible transformations in a scalable processing engine like Spark (ETL), and then pushes down the rest of the pipeline to BigQuery (ELT).

Logical Pipeline

A view of a pipeline composed of sources, sinks, and other plugins that does not show the underlying technology used to actually manifest and run the pipeline.

...

A plugin extends an application by implementing an interface expected by the application. Plugins are packaged in an artifact.
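
As an illustration, a minimal sketch of a transform plugin, assuming the CDAP pipeline (ETL) Java API; the io.cdap.cdap package names shown correspond to CDAP 6.x and may differ in earlier releases. The compiled class is packaged into a JAR artifact and deployed to CDAP.

    // Sketch: a plugin implementing the Transform interface expected by pipelines.
    import io.cdap.cdap.api.annotation.Name;
    import io.cdap.cdap.api.annotation.Plugin;
    import io.cdap.cdap.api.data.format.StructuredRecord;
    import io.cdap.cdap.etl.api.Emitter;
    import io.cdap.cdap.etl.api.Transform;

    @Plugin(type = Transform.PLUGIN_TYPE)
    @Name("PassThrough")
    public class PassThroughTransform extends Transform<StructuredRecord, StructuredRecord> {
      @Override
      public void transform(StructuredRecord input, Emitter<StructuredRecord> emitter) {
        // Implement the interface the application expects; here the record is emitted unchanged.
        emitter.emit(input);
      }
    }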

Pushdown-incompatible sources

These are transformations or aggregations in the pipeline that cannot be pushed down to a given pushdown engine. Data Fusion's execution plan stages the input to such a node temporarily, executes the node's logic as ETL (using the relevant processing engine), stages the output temporarily, and then, where possible, resumes processing as ELT.

Pushdown log

These are statements in the pipeline log that indicate which portion of the pipeline was pushed down to one or more SQL-compatible storage systems corresponding to sources or sinks in the pipeline. The statements include the SQL queries that were executed, as well as the nodes in the pipeline that were not pushed down. If transformation pushdown is a business-critical requirement, these statements help users ascertain how much of the pipeline was pushed down and, where possible, adjust the pipeline so that a larger portion of it is pushed down.

Real-time Pipeline

A type of CDAP Pipeline that runs continuously, performing actions on a distinct set of data.

...

A dataset in which time is the primary means of organizing the data, and in which both the data model and the schema that represents the data are optimized for querying and aggregating over time ranges.

Transformation pushdown

This configuration determines whether transformations in an ETL pipeline should be pushed down to BigQuery. It is also a container for more specific settings regarding transformation pushdown.

Worker

Workers are typically long-running background programs that can be used to execute tasks. Each instance of a worker runs either in its own YARN container (Distributed CDAP mode) or in a single thread (CDAP Sandbox or in-memory mode), and the number of instances can be updated via Microservices or the CLI. Datasets can be accessed from inside workers.
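
A minimal sketch of a worker, assuming the AbstractWorker base class of the CDAP Java API (io.cdap.cdap package names as of CDAP 6.x); exact class and method names may vary by version. The run() method loops in the background until stop() is called.

    // Sketch: a long-running background worker.
    import io.cdap.cdap.api.worker.AbstractWorker;

    public class HeartbeatWorker extends AbstractWorker {
      private volatile boolean running;

      @Override
      public void configure() {
        setName("HeartbeatWorker");
        setDescription("Logs a heartbeat in the background until stopped.");
      }

      @Override
      public void run() {
        running = true;
        while (running) {
          // Long-running background task goes here; datasets can also be
          // accessed from within the worker via its runtime context.
          try {
            Thread.sleep(1000);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            running = false;
          }
        }
      }

      @Override
      public void stop() {
        running = false;
      }
    }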

...