CDAP Components and Functional Responsibilities

Infrastructure components used by Cask Data Application Platform (CDAP)

The following infrastructure components are used by CDAP, by applications running in CDAP, or both. They are listed in no particular order.
  • HDFS
  • HBase
  • Hive
  • Kafka
  • YARN
  • ZooKeeper
  • KMS
  • Sentry ???

Functional use of infrastructure components

This section describes what each of the components listed above is used for.

HDFS

  • CDAP Stream
  • Apache Tephra WAL
  • Deployed Application Artifact and Dataset Artifact
  • Aggregated Logs
  • CDAP Fileset Dataset
  • YARN distributed cache 
  • Coprocessor jars 

HBase

  • CDAP system data/metadata (e.g., Preferences, Application, Namespace, Artifact…)
  • Metrics Cube
  • Lineage
  • Workflow Statistics
  • Run Record and Statistics
  • Checkpoint information
  • CDAP Table Dataset

Kafka

  • Logs
  • Metrics
  • Audit Logs (will be moved to HBase in 4.0)
  • Metadata updates (will be moved to HBase in 4.0)
  • Notifications (will be moved to HBase in 4.x)

YARN

  • System Services
  • User applications

ZooKeeper

  • Routing Tables
  • Coordination
  • Secret keys 
    • Auth keys

Hive

  • Dataset integration 
    • Schema
    • Properties
    • Serde

KMS

  • User secrets (e.g., passwords, access tokens)
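
The mapping above can also be kept as a small lookup table, which helps when asking which components are affected by a given kind of data. The following is a minimal sketch in Python, not part of CDAP: the names COMPONENT_USES and components_for are hypothetical, and the entries only restate the lists in this section.

```python
# Illustrative summary of the lists in this section.
# COMPONENT_USES and components_for are hypothetical names, not a CDAP API.
COMPONENT_USES = {
    "HDFS": [
        "CDAP Stream",
        "Apache Tephra WAL",
        "Deployed Application Artifact and Dataset Artifact",
        "Aggregated Logs",
        "CDAP Fileset Dataset",
        "YARN distributed cache",
        "Coprocessor jars",
    ],
    "HBase": [
        "CDAP system data/metadata",
        "Metrics Cube",
        "Lineage",
        "Workflow Statistics",
        "Run Record and Statistics",
        "Checkpoint information",
        "CDAP Table Dataset",
    ],
    "Kafka": ["Logs", "Metrics", "Audit Logs", "Metadata updates", "Notifications"],
    "YARN": ["System Services", "User applications"],
    "ZooKeeper": ["Routing Tables", "Coordination", "Secret keys (auth keys)"],
    "Hive": ["Dataset integration (schema, properties, SerDe)"],
    "KMS": ["User secrets"],
}


def components_for(keyword):
    """Return, sorted by name, the components whose listed uses mention keyword.

    The match is a case-insensitive substring test against the use descriptions.
    """
    needle = keyword.lower()
    return sorted(
        name
        for name, uses in COMPONENT_USES.items()
        if any(needle in use.lower() for use in uses)
    )


if __name__ == "__main__":
    # Which components hold log data?
    print(components_for("logs"))
```

For example, components_for("secret") returns both KMS and ZooKeeper, since user secrets and auth keys are stored in different places.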

Created in 2020 by Google Inc.