CDAP Components and Functional Responsibilities

Infrastructure components used by Cask Data Application Platform (CDAP)

CDAP, and the applications running in it, rely on the following infrastructure components. They are listed in no particular order.

  • HDFS

  • HBase

  • Hive

  • Kafka

  • YARN

  • ZooKeeper

  • KMS

  • Sentry ???

Functional use of infrastructure components

This section describes what CDAP uses each of these components for.

HDFS

  • CDAP Stream

  • Apache Tephra WAL

  • Deployed Application Artifact and Dataset Artifact

  • Aggregated Logs

  • CDAP Fileset Dataset

  • YARN distributed cache 

  • Coprocessor jars 

HBase

  • CDAP system data and metadata (e.g., Preferences, Application, Namespace, Artifact…)

  • Metrics Cube

  • Lineage

  • Workflow Statistics

  • Run Record and Statistics

  • Checkpoint information

  • CDAP Table Dataset

Kafka

  • Logs

  • Metrics

  • Audit Logs (Will be moved to HBase in 4.0)

  • Metadata updates (Will be moved to HBase in 4.0)

  • Notifications (Will be moved to HBase in 4.x)

YARN

  • System Services

  • User applications

ZooKeeper

  • Routing Tables

  • Coordination

  • Secret keys 

    • Auth keys

Hive

  • Dataset integration 

    • Schema

    • Properties

    • SerDe

KMS

  • User secrets (e.g., passwords, access tokens)
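
For quick lookups, the mapping above can be captured in a small table. The following is a minimal sketch in Python that only transcribes the lists in this section. The names COMPONENT_USES and components_for are hypothetical and are not part of any CDAP API.

```python
# Hypothetical lookup table transcribed from the lists in this section.
# It is an illustration only and does not correspond to any CDAP API.
COMPONENT_USES = {
    "HDFS": [
        "CDAP Stream",
        "Apache Tephra WAL",
        "Deployed Application Artifact and Dataset Artifact",
        "Aggregated Logs",
        "CDAP Fileset Dataset",
        "YARN distributed cache",
        "Coprocessor jars",
    ],
    "HBase": [
        "CDAP system data and metadata",
        "Metrics Cube",
        "Lineage",
        "Workflow Statistics",
        "Run Record and Statistics",
        "Checkpoint information",
        "CDAP Table Dataset",
    ],
    "Kafka": ["Logs", "Metrics", "Audit Logs", "Metadata updates", "Notifications"],
    "YARN": ["System Services", "User applications"],
    "ZooKeeper": ["Routing Tables", "Coordination", "Secret keys"],
    "Hive": ["Dataset integration"],
    "KMS": ["User Secrets"],
}


def components_for(keyword):
    """Return, sorted by name, the components whose listed uses mention keyword."""
    needle = keyword.lower()
    return sorted(
        name
        for name, uses in COMPONENT_USES.items()
        if any(needle in use.lower() for use in uses)
    )


if __name__ == "__main__":
    # Which components are involved in handling logs, according to the lists?
    print(components_for("logs"))
```

A table like this makes it easy to answer questions such as "which components need attention when investigating a metrics problem" without rereading every list.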

Created in 2020 by Google Inc.