Before installing the CDAP components, you must first install (or have access to) a Hadoop cluster with HBase, HDFS, YARN, and ZooKeeper. Hive and Spark are optional components: Hive is required to enable CDAP's ad-hoc querying capabilities (CDAP Explore), and Spark is required if a CDAP application uses the Spark program.
All CDAP components can be installed on the same boxes as your Hadoop cluster, or on separate boxes that can connect to the Hadoop services.
...
HBase: For system runtime storage and queues
HDFS: The backing file system for distributed storage
YARN: For running system services in containers on cluster NodeManagers
MapReduce2: For batch operations in workflows and data exploration (included with YARN)
ZooKeeper: For service discovery and leader election
...
Hive: For data exploration using SQL queries via the CDAP Explore system service
Spark: For running Spark programs within CDAP applications
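Before installing CDAP, it can help to confirm that the Hadoop services listed above are reachable from the hosts where CDAP will run. A minimal sketch, assuming the stock default ports for each service (your distribution may configure different ones):

```python
import socket

# Assumed default ports for the Hadoop services CDAP depends on;
# adjust these to match your distribution's configuration.
SERVICE_PORTS = {
    "HDFS NameNode": 8020,
    "YARN ResourceManager": 8032,
    "HBase Master": 16000,
    "ZooKeeper": 2181,
}

def check_services(host="localhost", timeout=2.0):
    """Return a mapping of service name -> True if its port accepts a TCP connection."""
    results = {}
    for name, port in SERVICE_PORTS.items():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            results[name] = s.connect_ex((host, port)) == 0
    return results

if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'reachable' if up else 'not reachable'}")
```

A closed port does not necessarily mean the service is absent (it may be bound to another interface or port), but a quick connectivity pass like this catches the most common misconfigurations early.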
Hadoop/HBase Environment
For a Distributed CDAP cluster, version 6.2.0 and later, you must install these Hadoop components (see notes following the table):
Component | Source | Supported Versions
---|---|---
Hadoop | various | 2.0 and higher
HBase | Apache | 0.98.x and 1.2
HBase | Amazon Hadoop (EMR) | 4.6 through 4.8 (with Apache HBase)
HDFS | Apache Hadoop | 2.0.2-alpha through 2.6
HDFS | Amazon Hadoop (EMR) | 4.6 through 4.8
Spark | Apache | Versions 2.4+ running on Scala 2.12
Spark | Amazon Hadoop (EMR) | 4.6 through 4.8
YARN and MapReduce2 | Apache Hadoop | 2.0.2-alpha through 2.7
YARN and MapReduce2 | Amazon Hadoop (EMR) | 4.6 through 4.8
ZooKeeper | Apache | Version 3.4.3 through 3.4
ZooKeeper | Amazon Hadoop (EMR) | 4.6 through 4.8
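The version ranges in the table above can be checked mechanically before installation. A minimal sketch of comparing a detected dotted version string against a supported range (the helper names and sample values are illustrative, not part of CDAP):

```python
def parse_version(v):
    """Split a dotted version string into a tuple of ints ('2.6.0' -> (2, 6, 0)).

    Non-numeric suffixes such as '-alpha' are ignored for comparison.
    """
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if digits:
            parts.append(int(digits))
    return tuple(parts)

def in_range(version, low, high):
    """True if low <= version <= high, comparing dotted versions numerically.

    The upper bound is compared only to its own precision, so '2.6' accepts
    any 2.6.x release.
    """
    v = parse_version(version)
    hi = parse_version(high)
    return parse_version(low) <= v and v[:len(hi)] <= hi

print(in_range("2.6.0", "2.0.2-alpha", "2.6"))  # True: within the HDFS range
```

Truncating the detected version to the precision of the upper bound is what makes an open-ended row like "2.0.2-alpha through 2.6" accept patch releases of 2.6.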
...