The overall installation process is:

  1. Prepare the cluster.

  2. Download and distribute packages.

  3. Install CDAP services.

  4. Start CDAP services.

  5. Verify the installation.

Notes

Enabling Security

Cask Data Application Platform (CDAP) supports securing clusters using perimeter security, authorization, impersonation and secure storage.

Network (or cluster) perimeter security limits outside access, providing a first level of security. However, perimeter security itself does not provide the safeguards of authentication, authorization and service request management that a secure Hadoop cluster provides.

Authorization provides a way of enforcing access control on CDAP entities.

Impersonation ensures that programs inside CDAP are run as configured users at the namespace level. When enabled, it guarantees that all actions on datasets, streams and other resources happen as the configured user.

We recommend that CDAP security always be used in conjunction with a secure Hadoop cluster. Where secure Hadoop is not or cannot be used, the cluster is inherently insecure and any applications running on it are effectively "trusted". Although there is still value in having perimeter security, authorization enforcement and secure storage in that situation, a secure Hadoop cluster should be employed with CDAP security whenever possible.

For instructions on enabling CDAP Security, see CDAP Security.

CDAP Security is configured by setting the appropriate properties in Ambari for your environment.
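
One such property is security.enabled in cdap-site.xml, which turns on perimeter security (authentication). A minimal sketch of the setting, as entered through Ambari's CDAP configuration:

<property>
  <name>security.enabled</name>
  <value>true</value>
</property>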

Enabling Kerberos

Kerberos support in CDAP is automatically enabled when enabling Kerberos security on your cluster via Ambari. Consult the appropriate Ambari documentation for instructions on enabling Kerberos support for your cluster.

The cdap user must be able to launch YARN containers. As Ambari does not have a mechanism for setting the YARN allowed.system.users property (the preferred method of enabling the cdap user, as it is more precise and limited), you must instead adjust the YARN min.user.id (to 500) so that it includes the cdap user.
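
As a sketch, the resulting line in container-executor.cfg, which Ambari manages as part of the YARN configuration (this assumes the cdap user's UID is 500 or greater):

# container-executor.cfg: allow UIDs of 500 and above to launch containers
min.user.id=500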

  1. If you are adding CDAP to an existing Kerberos cluster, configure CDAP for Kerberos authentication as follows:

    1. The <cdap-principal> is shown in the commands that follow as cdap; however, you are free to use a different appropriate name.

    2. When running on a secure HBase cluster, as the hbase user, issue the command:

      $ echo "grant 'cdap', 'RWCA'" | hbase shell
      
    3. To configure the CDAP Explore Service for secure Hadoop:

      1. To allow CDAP to act as a Hive client, it must be given proxyuser permissions and allowed from all hosts. For example, set the following properties in the configuration file core-site.xml, where cdap is a system group of which the cdap user is a member:

        <property>
          <name>hadoop.proxyuser.hive.groups</name>
          <value>cdap,hadoop,hive</value>
        </property>
        <property>
          <name>hadoop.proxyuser.hive.hosts</name>
          <value>*</value>
        </property>
        
      2. To execute Hive queries on a secure cluster, the cluster must be running the MapReduce JobHistoryServer service. Consult your distribution documentation on the proper configuration of this service.
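
        To quickly confirm that the JobHistoryServer is running, you can query its REST API (a sketch: 19888 is the default web port, and the hostname is a placeholder):

        $ curl -s http://<historyserver-host>:19888/ws/v1/history/info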

      3. To execute Hive queries on a secure cluster using the CDAP Explore Service, the Hive MetaStore service must be configured for Kerberos authentication. Consult your distribution documentation on the proper configuration of the Hive MetaStore service.
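
        Kerberos authentication for the Hive MetaStore typically involves hive-site.xml properties along these lines (a sketch only; the realm and keytab path are placeholders, and your distribution's documentation is authoritative):

        <property>
          <name>hive.metastore.sasl.enabled</name>
          <value>true</value>
        </property>
        <property>
          <name>hive.metastore.kerberos.principal</name>
          <value>hive/_HOST@EXAMPLE.COM</value>
        </property>
        <property>
          <name>hive.metastore.kerberos.keytab.file</name>
          <value>/etc/security/keytabs/hive.service.keytab</value>
        </property>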

      With all these properties set, the CDAP Explore Service will run on secure Hadoop clusters.

  2. If you are adding Kerberos to an existing cluster, configure CDAP for Kerberos authentication as follows:

    1. The /cdap directory needs to be owned by the <cdap-principal>; you can set that by running the following command as the hdfs user (substituting your <cdap-principal> for cdap):

      $ su hdfs -c "hadoop fs -mkdir -p /cdap && hadoop fs -chown cdap /cdap"
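
      To verify the result, list the directory and check its owner (a quick sketch):

      $ hadoop fs -ls -d /cdap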
      
    2. When converting an existing CDAP cluster to being Kerberos-enabled, you may run into YARN usercache directory permission problems. A non-Kerberos cluster with default settings will run CDAP containers as the user yarn. A Kerberos cluster will run them as the user cdap. When converting, the usercache directory that YARN creates will already exist and be owned by a different user. On all datanodes, run this command, substituting in the correct value of the YARN parameter yarn.nodemanager.local-dirs:

      $ rm -rf <YARN.NODEMANAGER.LOCAL-DIRS>/usercache/cdap
      

      (As yarn.nodemanager.local-dirs can be a comma-separated list of directories, you may need to run this command multiple times, once for each entry.)

      If, for example, the setting for yarn.nodemanager.local-dirs is /yarn/nm, you would use:

      $ rm -rf /yarn/nm/usercache/cdap
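
      If multiple directories are configured, a small shell loop saves repetition (a sketch; LOCAL_DIRS stands in for your actual yarn.nodemanager.local-dirs value):

      $ LOCAL_DIRS="/yarn/nm1,/yarn/nm2"
      $ for dir in ${LOCAL_DIRS//,/ }; do rm -rf "${dir}/usercache/cdap"; done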
      

      Restart CDAP after removing the usercache(s).

Enabling CDAP HA

In addition to having a cluster architecture that supports HA (high availability), the following additional configuration steps must be completed:

CDAP Components

For each of the CDAP components listed below (Master, Router, Kafka, UI, Authentication Server), the following notes apply:

CDAP Master

The CDAP Master service primarily performs coordination tasks and can be scaled for redundancy. The instances coordinate amongst themselves, electing one as a leader at all times.

CDAP Router

The CDAP Router service is a stateless API endpoint for CDAP, and simply routes requests to the appropriate service. It can be scaled horizontally for performance. A load balancer, if desired, can be placed in front of the nodes running the service.
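
The router exposes a status endpoint that a load balancer can use for health checks. A sketch, assuming the default router bind port of 11015 and a placeholder hostname:

$ curl -s -o /dev/null -w "%{http_code}\n" http://<router-host>:11015/ping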

CDAP Kafka

CDAP UI

CDAP Authentication Server

Hive Execution Engines

CDAP Explore has support for additional execution engines such as Apache Spark and Apache Tez. For details on specifying these engines and configuring CDAP, see Hive Execution Engines.
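
As an illustration, the engine Hive uses is conventionally selected with the hive.execution.engine property in hive-site.xml (a sketch; see Hive Execution Engines for the CDAP-specific settings):

<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>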

Enabling Spark2

In order to use Spark2, you must first install Spark2 on your cluster. If both Spark1 and Spark2 are installed, you must modify cdap-env to set SPARK_MAJOR_VERSION and SPARK_HOME:

export SPARK_MAJOR_VERSION=2
export SPARK_HOME=/usr/hdp/{{hdp_version}}/spark2

When Spark2 is in use, Spark1 programs cannot be run. Similarly, when Spark1 is in use, Spark2 programs cannot be run.

When CDAP starts up, it detects the Spark version and uploads the corresponding pipeline system artifacts. If you have already started CDAP with Spark1, you will also need to delete the pipeline system artifacts and then reload them in order to get the Spark2 versions. After CDAP has been restarted with Spark2, use the Microservices:

DELETE /v3/namespaces/system/artifacts/cdap-data-pipeline/versions/6.2.0
DELETE /v3/namespaces/system/artifacts/cdap-data-streams/versions/6.2.0
POST /v3/namespaces/system/artifacts
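
Issued with curl against the CDAP Router, these calls would look like the following sketch (the router host and port are placeholders; adjust the artifact versions to match your CDAP version):

$ curl -X DELETE "http://<router-host>:11015/v3/namespaces/system/artifacts/cdap-data-pipeline/versions/6.2.0"
$ curl -X DELETE "http://<router-host>:11015/v3/namespaces/system/artifacts/cdap-data-streams/versions/6.2.0"
$ curl -X POST "http://<router-host>:11015/v3/namespaces/system/artifacts"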