Manual Installation using Packages

The overall installation process is:

  1. Prepare the cluster.

  2. Download and distribute packages.

  3. Install CDAP services.

  4. Start CDAP services.

  5. Verify the installation.

This section describes installing CDAP on Hadoop clusters that are:

  • Generic Apache Hadoop distributions;

  • CDH (Cloudera Distribution of Apache Hadoop) clusters not managed with Cloudera Manager; or

  • HDP (Hortonworks Data Platform) clusters not managed with Apache Ambari.

Clusters managed with Cloudera Manager (CDH) or Apache Ambari (HDP) should instead be installed using the instructions specific to those distributions.

  • As CDAP depends on HDFS, YARN, HBase, ZooKeeper, and (optionally) Hive and Spark, it must be installed on cluster host(s) with full client configurations for these dependent services.

  • The CDAP Master Service must be co-located on a cluster host with an HDFS client, a YARN client, an HBase client, and, optionally, Hive or Spark clients.

  • Note that these clients are redundant if you are co-locating the CDAP Master on a cluster host (or hosts, in the case of a deployment with high availability) with actual services, such as the HDFS NameNode, the YARN resource manager, or the HBase Master.

  • You can download the Hadoop client and HBase client libraries, and then install them on the hosts running CDAP services. No Hadoop or HBase services need be running.

  • All services run as the cdap user installed by the package manager. See “Create the cdap User” below.

  • If you are installing CDAP with the intention of using replication, see these instructions on CDAP Replication before installing or starting CDAP.
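As a quick preflight on a prospective CDAP host, a check like the following can confirm that the client commands are at least on the PATH. This is a minimal sketch: the command names (hadoop, hdfs, yarn, hbase) are the usual client entry points but are an assumption about your distribution, and missing_clients is a hypothetical helper, not part of CDAP. Finding a command does not prove that its client configuration is complete.

```shell
# Print the name of every given client command that is not on the PATH.
# Usage: missing_clients CMD...
missing_clients() {
    for mc_cmd in "$@"; do
        # command -v exits non-zero when the command cannot be found
        command -v "$mc_cmd" >/dev/null 2>&1 || printf '%s\n' "$mc_cmd"
    done
}

# Assumed client command names; add hive or spark-submit if you use them.
missing=$(missing_clients hadoop hdfs yarn hbase | tr '\n' ' ')
if [ -n "$missing" ]; then
    echo "Missing client commands: $missing"
else
    echo "All required client commands found"
fi
```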

Advanced Topics

Enabling Security

Cask Data Application Platform (CDAP) supports securing clusters using perimeter security, authorization, impersonation and secure storage.

Network (or cluster) perimeter security limits outside access, providing a first level of security. However, perimeter security itself does not provide the safeguards of authentication, authorization and service request management that a secure Hadoop cluster provides.

Authorization provides a way of enforcing access control on CDAP entities.

Impersonation ensures that programs inside CDAP are run as configured users at the namespace level. When enabled, it guarantees that all actions on datasets, streams and other resources happen as the configured user.

We recommend that CDAP security always be used in conjunction with a secure Hadoop cluster. Where secure Hadoop is not or cannot be used, the cluster is inherently insecure and any applications running on it are effectively “trusted”. Although there is still value in having perimeter security, authorization enforcement and secure storage in that situation, whenever possible a secure Hadoop cluster should be employed with CDAP security.

For instructions on enabling CDAP Security, see CDAP Security.

Enabling Kerberos

When running CDAP on top of a secure Hadoop cluster (using Kerberos authentication), the CDAP processes will need to obtain Kerberos credentials in order to authenticate with Hadoop, HBase, ZooKeeper, and (optionally) Hive. In this case, the setting for hdfs.user in cdap-site.xml will be ignored and the CDAP processes will be identified by the default authenticated Kerberos principal.

Note: CDAP support for secure Hadoop clusters is limited to the latest versions of CDH, HDP, and Apache BigTop; currently, Amazon EMR is not supported on secure Hadoop clusters.

  1. In order to configure CDAP for Kerberos authentication:

    1. Create a Kerberos principal for the user running CDAP. Name each principal in the form username/hostname@REALM, and create a separate principal for each host where a CDAP service will run. This prevents simultaneous login attempts from multiple hosts from being mistaken for a replay attack by the Kerberos KDC.

    2. Generate a keytab file for each CDAP Master Kerberos principal, and place the file as /etc/security/keytabs/cdap.service.keytab on the corresponding CDAP Master host. The file should be readable only by the user running the CDAP Master service.

    3. Edit /etc/cdap/conf/cdap-site.xml on each host running a CDAP service, substituting the Kerberos primary (user) for <cdap-principal>, and your Kerberos authentication realm for EXAMPLE.COM, when adding these two properties:

      <property>
        <name>cdap.master.kerberos.keytab</name>
        <value>/etc/security/keytabs/cdap.service.keytab</value>
      </property>
      <property>
        <name>cdap.master.kerberos.principal</name>
        <value><cdap-principal>/_HOST@EXAMPLE.COM</value>
      </property>

    4. The <cdap-principal> is shown in the commands that follow as cdap; however, you are free to use a different appropriate name.

    5. The /cdap directory needs to be owned by the <cdap-principal>; you can set that by running the following command as the hdfs user (change the ownership in the command from cdap to whatever is the <cdap-principal>):

      $ su hdfs && hadoop fs -mkdir -p /cdap && hadoop fs -chown cdap /cdap

    6. When running on a secure HBase cluster, as the hbase user, issue the command:

      $ echo "grant 'cdap', 'RWCA'" | hbase shell
    7. When CDAP Master is started, it will log in using the configured keytab file and principal.
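The principal and keytab steps above could be scripted along these lines on an MIT Kerberos installation. This is a sketch under assumptions, not the required procedure: it only prints the kadmin commands for review, the realm and host names are placeholders, the per-host keytab file names are hypothetical, and print_kerberos_setup is an illustrative helper.

```shell
# Print (do not run) the kadmin commands that create one principal and one
# keytab per CDAP host. User, realm, and host names are placeholders.
# Usage: print_kerberos_setup USER REALM HOST...
print_kerberos_setup() {
    ks_user=$1
    ks_realm=$2
    shift 2
    for ks_host in "$@"; do
        ks_principal="$ks_user/$ks_host@$ks_realm"
        # -randkey generates a random key instead of prompting for a password
        echo "kadmin -q \"addprinc -randkey $ks_principal\""
        # xst (an alias of ktadd) extracts the principal's key into a keytab
        echo "kadmin -q \"xst -k cdap.$ks_host.keytab $ks_principal\""
    done
}

print_kerberos_setup cdap EXAMPLE.COM master1.example.com master2.example.com
```

Each generated keytab would then be copied to its own host at the path configured in cdap.master.kerberos.keytab, with ownership and permissions restricted to the user running CDAP Master.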

  2. In order to configure YARN for secure Hadoop, the <cdap-principal> user must be able to launch YARN containers, either by adding it to the YARN allowed.system.users whitelist (preferred) or by adjusting the YARN min.user.id to include the <cdap-principal> user.
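To see why either setting works, the rule can be sketched as a small function: a user passes if its UID is at least min.user.id, or if its name appears in the comma-separated allowed.system.users list. This is a simplified model for illustration, assuming the usual behavior of the YARN container executor (it ignores the separate banned-users check); can_launch_containers and the UID values are hypothetical.

```shell
# Simplified model of the user check described above.
# Usage: can_launch_containers USER UID MIN_USER_ID ALLOWED_SYSTEM_USERS
can_launch_containers() {
    cl_user=$1
    cl_uid=$2
    cl_min_uid=$3
    cl_allowed=$4
    # Users at or above min.user.id are always accepted
    if [ "$cl_uid" -ge "$cl_min_uid" ]; then
        return 0
    fi
    # Otherwise the user must appear in the comma-separated whitelist
    case ",$cl_allowed," in
        *",$cl_user,"*) return 0 ;;
    esac
    return 1
}

# Hypothetical values: cdap has UID 499 and min.user.id is 1000.
if can_launch_containers cdap 499 1000 "hdfs,yarn,cdap"; then
    echo "cdap can launch containers"
else
    echo "cdap cannot launch containers"
fi
```

Whitelisting is the narrower change: lowering min.user.id admits every account at or above the new threshold, not only the CDAP user.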

  3. In order to configure CDAP Explore Service for secure Hadoop:

    a. To allow CDAP to act as a Hive client, it must be given proxyuser permissions and allowed from all hosts. For example, in the configuration file core-site.xml, set the property hadoop.proxyuser.cdap.groups to cdap and the property hadoop.proxyuser.cdap.hosts to *, where cdap is a system group of which the cdap user is a member.

    b. To execute Hive queries on a secure cluster, the cluster must be running the MapReduce JobHistoryServer service. Consult your distribution documentation on the proper configuration of this service.

    c. To execute Hive queries on a secure cluster using the CDAP Explore Service, the Hive MetaStore service must be configured for Kerberos authentication. Consult your distribution documentation on the proper configuration of the Hive MetaStore service.

With all these properties set, the CDAP Explore Service will run on secure Hadoop clusters.
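For the proxyuser permissions in step 3a, the core-site.xml entries would plausibly use the standard Hadoop hadoop.proxyuser.<user>.groups and hadoop.proxyuser.<user>.hosts properties. The sketch below prints such a fragment for review rather than editing any file; the user and group name cdap and the wildcard host value are assumptions drawn from the description above, and print_proxyuser_properties is a hypothetical helper.

```shell
# Print the proxyuser properties for a user and group so they can be
# reviewed and merged into core-site.xml by hand.
# Usage: print_proxyuser_properties USER GROUP
print_proxyuser_properties() {
    cat <<EOF
<property>
  <name>hadoop.proxyuser.$1.groups</name>
  <value>$2</value>
</property>
<property>
  <name>hadoop.proxyuser.$1.hosts</name>
  <value>*</value>
</property>
EOF
}

print_proxyuser_properties cdap cdap
```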

Enabling CDAP HA

In addition to having a cluster architecture that supports HA (high availability), the following configuration steps are required:

CDAP Components

For each of the CDAP components listed below (Master, Router, Kafka, UI, Authentication Server), these comments apply:

  • Sync the configuration files (such as cdap-site.xml and cdap-security.xml) on all the nodes.

  • The default bind.address settings (0.0.0.0, used for app.bind.address, data.tx.bind.address, router.bind.address, and so on) can be synced across hosts. If you customize them to a particular IP address, however, they will differ from host to host and must be set on each host individually.

  • Starting services is described in Starting CDAP Services.
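One way to keep the configuration files in sync is to push the configuration directory from a reference node to the others. The following is a sketch only: it prints rsync commands instead of running them, it assumes the configuration lives in /etc/cdap/conf and that rsync over SSH is available, and the host names and the print_sync_commands helper are hypothetical.

```shell
# Print (do not run) one rsync command per target host.
# Usage: print_sync_commands CONF_DIR HOST...
print_sync_commands() {
    sync_dir=$1
    shift
    for sync_host in "$@"; do
        # -a preserves permissions and timestamps; the trailing slashes copy
        # the directory contents rather than the directory itself
        echo "rsync -a $sync_dir/ $sync_host:$sync_dir/"
    done
}

print_sync_commands /etc/cdap/conf cdap2.example.com cdap3.example.com
```

A blanket copy like this is only appropriate while the bind.address settings remain at their wildcard default; host-specific values would be overwritten.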

CDAP Master

The CDAP Master service primarily performs coordination tasks and can be scaled for redundancy. The instances coordinate amongst themselves, electing one as a leader at all times.

  • Install the cdap-master package on different nodes.

  • Ensure they are configured identically (/etc/cdap/conf/cdap-site.xml).

  • Start the cdap-master service on each node.

CDAP Router

The CDAP Router service is a stateless API endpoint for CDAP, and simply routes requests to the appropriate service. It can be scaled horizontally for performance. A load balancer, if desired, can be placed in front of the nodes running the service.

  • Install the cdap-gateway package on different nodes.

  • The router.bind.address may need to be customized on each node if it is not set to the default wildcard address (0.0.0.0).

  • Start the cdap-router service on each node.

CDAP Kafka

  • Install the cdap-kafka package on different nodes.

  • Two properties need to be set in the cdap-site.xml files on each node:

    • The Kafka seed brokers list is a comma-separated list of hosts, followed by /${root.namespace}:

      kafka.seed.brokers: myhost.example.com:9092,.../${root.namespace}

      Substitute appropriate addresses for myhost.example.com in the above example.

    • The replication factor is used to replicate Kafka messages across multiple machines to prevent data loss in the event of a hardware failure:

      kafka.server.default.replication.factor: 2

  • The recommended setting is to run at least two Kafka brokers with a minimum replication factor of two; set this property to the maximum number of tolerated machine failures plus one (assuming you have that number of machines). For example, if you were running five Kafka brokers, and would tolerate two of those failing, you would set the replication factor to three. The number of Kafka brokers listed should always be equal to or greater than the replication factor.

  • Start the cdap-kafka service on each node.
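The replication factor rule above (tolerated machine failures plus one, never more than the number of brokers) can be written as a short calculation. This is an illustrative sketch; kafka_replication_factor is a hypothetical helper, not a CDAP or Kafka tool.

```shell
# Replication factor = tolerated broker failures + 1, which must not
# exceed the number of brokers.
# Usage: kafka_replication_factor BROKERS TOLERATED_FAILURES
kafka_replication_factor() {
    rf_brokers=$1
    rf_factor=$(( $2 + 1 ))
    if [ "$rf_factor" -gt "$rf_brokers" ]; then
        echo "need at least $rf_factor brokers, have $rf_brokers" >&2
        return 1
    fi
    echo "$rf_factor"
}

# Five brokers, two tolerated failures: prints 3
kafka_replication_factor 5 2
```

The printed value is what would go into kafka.server.default.replication.factor.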

CDAP UI

  • Install the cdap-ui package on different nodes.

  • Start the cdap-ui service on each node.

CDAP Authentication Server

  • Install the cdap-security package (the CDAP Authentication Server) on different nodes.

  • Start the cdap-security service on each node.

  • Note that when an unauthenticated request is made in a secure HA setup, a list of all running authentication endpoints will be returned in the body of the response.

Hive Execution Engines

CDAP Explore supports additional execution engines such as Apache Spark and Apache Tez. Details on specifying these engines and configuring CDAP are in the Developer Manual section on Data Exploration, Hive Execution Engines.

Created in 2020 by Google Inc.