CDAP Modes and Components

Runtime Modes

The Cask Data Application Platform (CDAP) can be run in three different runtime modes:

  • Distributed CDAP for staging and production.

  • CDAP Sandbox for testing and development on a developer's laptop.

  • In-Memory CDAP for unit testing and continuous integration pipelines.

Regardless of the runtime mode, CDAP is fully functional and the code you develop never changes. However, performance and scale are limited in the In-Memory and Sandbox modes. CDAP applications are developed against the CDAP APIs, which makes switching between modes seamless: an application developed in one mode can run in another without code changes.
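The idea that application code stays the same while the backing runtime changes can be sketched in plain Java. Everything named here (CounterStore, HeapCounterStore, FileCounterStore, WordCountApp) is hypothetical and is not part of the CDAP API; the sketch only assumes that the application depends on an interface and never on a concrete backend.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for a platform API; not a real CDAP interface.
interface CounterStore {
  void increment(String key, long delta);
  long get(String key);
}

// Non-persistent backend, comparable in spirit to an in-memory mode.
class HeapCounterStore implements CounterStore {
  private final Map<String, Long> counts = new HashMap<>();
  public void increment(String key, long delta) { counts.merge(key, delta, Long::sum); }
  public long get(String key) { return counts.getOrDefault(key, 0L); }
}

// Persistent backend on the local file system, comparable in spirit to a sandbox mode.
class FileCounterStore implements CounterStore {
  private final Path file;
  FileCounterStore(Path file) { this.file = file; }

  // Helper that creates an empty temporary file for the demo.
  static Path newTempFile() {
    try {
      return Files.createTempFile("counts", ".properties");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  private Properties load() {
    Properties p = new Properties();
    try (InputStream in = Files.newInputStream(file)) {
      p.load(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return p;
  }

  public void increment(String key, long delta) {
    Properties p = load();
    long next = Long.parseLong(p.getProperty(key, "0")) + delta;
    p.setProperty(key, Long.toString(next));
    try (OutputStream out = Files.newOutputStream(file)) {
      p.store(out, null);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public long get(String key) { return Long.parseLong(load().getProperty(key, "0")); }
}

// "Application" code: written once against the interface, unaware of the backend.
class WordCountApp {
  static void process(String line, CounterStore store) {
    for (String word : line.toLowerCase().split("\\s+")) {
      if (!word.isEmpty()) {
        store.increment(word, 1);
      }
    }
  }
}

class ModeSwitchDemo {
  public static void main(String[] args) {
    CounterStore[] backends = {
      new HeapCounterStore(),
      new FileCounterStore(FileCounterStore.newTempFile())
    };
    for (CounterStore store : backends) {
      WordCountApp.process("to be or not to be", store);
      System.out.println(store.get("to")); // prints 2 for each backend
    }
  }
}
```

Only the line that constructs the store differs between the two runs; WordCountApp is identical in both, which is the property the runtime modes are described as preserving.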

Distributed CDAP

The Distributed CDAP runs in fully distributed mode. In addition to the CDAP system components, it includes distributed and highly available (HA) deployments of the underlying Hadoop infrastructure. Production applications should always be run on a Distributed CDAP.

See the instructions for either a distribution-specific or generic Apache Hadoop cluster for more information.

Features:

  • A production, staging, and QA mode that runs in a distributed environment

  • Uses Apache HBase and HDFS as the underlying storage technology

  • Uses Apache YARN Containers as the processing abstraction (via Apache™ Twill®)

CDAP Sandbox

The CDAP Sandbox allows you to run the entire CDAP stack in a single Java Virtual Machine on your local machine and includes a local version of the CDAP UI. The underlying Big Data infrastructure is emulated on top of your local file system. All data is persisted.

For information on how to start and manage your CDAP Sandbox, see Getting Started Developing and CDAP Sandbox.

Features:

  • Designed to run in a sandbox environment, for development and testing

  • Uses LevelDB/Local File System as the storage technology

  • Uses Java Threads as the processing abstraction (via Apache™ Twill®)

In-Memory CDAP

The In-Memory CDAP allows you to easily run CDAP for use in unit tests and continuous integration (CI) pipelines. In this mode, the underlying Big Data infrastructure is emulated using in-memory data structures and there is no persistence. The CDAP UI is not available in this mode.

For information and examples on using this mode, see Testing a CDAP Application.

Features:

  • Purpose-built for writing unit tests and CI pipelines

  • Mimics storage technologies as in-memory data structures, for example, Java NavigableMap

  • Uses Java Threads as the processing abstraction (via Apache™ Twill®)
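To illustrate how a sorted, row-keyed store can be mimicked with a Java NavigableMap, here is a minimal sketch. The InMemoryTable class and its method names are hypothetical and do not reflect CDAP's internal classes; the sketch assumes string row keys and half-open range scans.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical in-memory table; illustrates the idea only.
class InMemoryTable {
  // Rows are kept sorted by key, as a row-ordered store would keep them.
  private final NavigableMap<String, String> rows = new ConcurrentSkipListMap<>();

  void put(String row, String value) { rows.put(row, value); }

  String get(String row) { return rows.get(row); }

  // Returns the row keys in [startRow, stopRow), in sorted order.
  List<String> scan(String startRow, String stopRow) {
    return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).keySet());
  }
}

class InMemoryTableDemo {
  public static void main(String[] args) {
    InMemoryTable table = new InMemoryTable();
    table.put("user:2", "bob");
    table.put("order:1", "book");
    table.put("user:1", "alice");
    table.put("user:3", "carol");

    // ';' is the character after ':' in ASCII, so this range covers every "user:" key.
    System.out.println(table.scan("user:", "user;")); // prints [user:1, user:2, user:3]
  }
}
```

Nothing is written to disk, so all rows are lost when the JVM exits, which matches the lack of persistence described for this mode.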

Components

This diagram illustrates the components that comprise Distributed CDAP and shows some of their interactions, with CDAP system components in orange and non-CDAP system components in yellow and grey:

CDAP consists chiefly of these components:

  • The Router is the only public access point into CDAP for external clients. It forwards client requests to the appropriate system service or application. In a secure setup, the router also performs authentication; in that case, it is complemented by an authentication service that allows clients to obtain access tokens for CDAP.

  • The Master controls and manages all services and applications.

  • System Services provide vital platform features such as datasets, transactions, service discovery, logging, and metrics collection. System services run in application containers.

  • Application Containers provide abstraction and isolation for execution of application code (and, as a special case, system services). Application containers scale linearly and elastically with the underlying infrastructure.
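The router's role described in the list above (a single entry point that optionally checks an access token and then forwards the request to a matching service) can be sketched as follows. This is a conceptual illustration under stated assumptions, not CDAP's router implementation: the RouterSketch class, the path prefixes, the token strings, and the status strings are all hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical router: one entry point, prefix-based forwarding, optional token check.
class RouterSketch {
  private final Map<String, Function<String, String>> routes = new LinkedHashMap<>();
  private final Set<String> validTokens; // null means security is disabled

  RouterSketch(Set<String> validTokens) { this.validTokens = validTokens; }

  // Registers a backend service for every path that starts with the given prefix.
  void register(String pathPrefix, Function<String, String> service) {
    routes.put(pathPrefix, service);
  }

  String handle(String path, String token) {
    // In a secure setup, reject requests that lack a known access token.
    if (validTokens != null && (token == null || !validTokens.contains(token))) {
      return "401 Unauthorized";
    }
    // Forward to the first registered service whose prefix matches.
    for (Map.Entry<String, Function<String, String>> route : routes.entrySet()) {
      if (path.startsWith(route.getKey())) {
        return "200 " + route.getValue().apply(path);
      }
    }
    return "404 Not Found";
  }
}

class RouterSketchDemo {
  public static void main(String[] args) {
    // The token would normally be issued by a separate authentication service.
    RouterSketch router = new RouterSketch(Collections.singleton("token-123"));
    router.register("/metrics", path -> "metrics service");
    router.register("/apps", path -> "application service");

    System.out.println(router.handle("/metrics/query", "token-123")); // prints 200 metrics service
    System.out.println(router.handle("/metrics/query", null));        // prints 401 Unauthorized
    System.out.println(router.handle("/unknown", "token-123"));       // prints 404 Not Found
  }
}
```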

As described above, in a Hadoop environment, application containers are implemented as YARN containers, and datasets use HBase and HDFS for storage. In other environments, the implementation can differ. For example, in CDAP Sandbox, all services run in a single JVM, application containers are implemented as threads, and data is stored in the local file system.
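A thread-based "container" of the kind described for the Sandbox can be sketched with the standard Java executor classes. The ContainerLauncher interface and ThreadContainerLauncher class are hypothetical names used for illustration; they are not the Apache Twill or CDAP APIs. The sketch assumes that a task is a plain Runnable and that a launcher runs a requested number of instances to completion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical abstraction: the task does not know how its instances are hosted.
interface ContainerLauncher {
  // Runs the given number of instances of the task and waits for all of them.
  void launch(Runnable task, int instances);
}

// Single-JVM implementation: each "container" is a thread.
class ThreadContainerLauncher implements ContainerLauncher {
  public void launch(Runnable task, int instances) {
    ExecutorService pool = Executors.newFixedThreadPool(instances);
    try {
      List<Future<?>> running = new ArrayList<>();
      for (int i = 0; i < instances; i++) {
        running.add(pool.submit(task));
      }
      for (Future<?> instance : running) {
        instance.get(); // wait for this instance to finish
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}

class ContainerDemo {
  public static void main(String[] args) {
    AtomicInteger completed = new AtomicInteger();
    ContainerLauncher launcher = new ThreadContainerLauncher();
    launcher.launch(completed::incrementAndGet, 4);
    System.out.println(completed.get()); // prints 4
  }
}
```

A distributed implementation of the same interface would request cluster containers instead of threads, while the task passed to launch would stay unchanged.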

Created in 2020 by Google Inc.