Step 3. Installing CDAP Services (Manual)

Package Installation

Install the CDAP packages by using one of the following methods. Do this on each machine that will run CDAP components; we recommend a minimum of two machines.

This will download and install the latest version of CDAP with all of its dependencies.

To install the optional CDAP CLI on a node, add the cdap-cli package to the list of packages in the commands below.

Using Chef

If you are using Chef to install CDAP, an official cookbook is available.

To install the optional CDAP CLI on a node, use the fullstack recipe.

On RPM using Yum

$ sudo yum install cdap-gateway cdap-kafka cdap-master cdap-security cdap-ui

On Debian using APT

$ sudo apt-get install cdap-gateway cdap-kafka cdap-master cdap-security cdap-ui
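The Yum and APT commands above take the same package list, so adding the optional CLI amounts to appending cdap-cli to it. A minimal sketch, assuming a hypothetical helper cdap_packages and the illustrative variables WITH_CLI and PKG_TOOL (none of these are part of CDAP). It only prints the install command so that it can be reviewed before being run:

```shell
# Hypothetical helper (not part of CDAP): print the package list,
# appending the optional cdap-cli package when the first argument is "yes".
cdap_packages() {
  pkgs="cdap-gateway cdap-kafka cdap-master cdap-security cdap-ui"
  if [ "$1" = "yes" ]; then
    pkgs="$pkgs cdap-cli"
  fi
  printf '%s\n' "$pkgs"
}

# Illustrative switches for this sketch only.
WITH_CLI="${WITH_CLI:-yes}"
PKG_TOOL="${PKG_TOOL:-yum}"   # use "apt-get" on Debian-based systems

# Print (rather than run) the resulting install command.
echo "sudo $PKG_TOOL install $(cdap_packages "$WITH_CLI")"
```

Printing the command first keeps the sketch harmless on a machine where the CDAP repository has not yet been configured.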

Using Tar

Having previously downloaded and unpacked the appropriate tar file to a directory $dir, use:

$ sudo yum localinstall $dir/*.rpm

Create Required Directories

To prepare your cluster so that CDAP can write to its default namespace, create a top-level /cdap directory in HDFS, owned by the HDFS user yarn.

In the CDAP packages, the default property hdfs.namespace is /cdap and the default property hdfs.user is yarn.

Also create a tx.snapshot subdirectory under it, /cdap/tx.snapshot.

If you have customized (or will be customizing) the property data.tx.snapshot.dir in your CDAP configuration, use that value instead of /cdap/tx.snapshot.

If your cluster is not set up with these defaults, you'll need to edit your CDAP configuration prior to starting services.
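The directory creation described in this section could be scripted along the following lines. This is a sketch under stated assumptions: the run wrapper, the make_hdfs_dir helper, and the variable names are hypothetical; it assumes an hdfs superuser account reachable through sudo; and it assumes the snapshot directory should be owned by the same HDFS user. By default it only prints the hadoop fs commands.

```shell
# Defaults taken from the text; override them if your cdap-site.xml sets
# hdfs.namespace, hdfs.user or data.tx.snapshot.dir to something else.
HDFS_NAMESPACE="${HDFS_NAMESPACE:-/cdap}"
HDFS_USER="${HDFS_USER:-yarn}"
TX_SNAPSHOT_DIR="${TX_SNAPSHOT_DIR:-$HDFS_NAMESPACE/tx.snapshot}"

# Hypothetical wrapper: with DRY_RUN=1 (the default in this sketch) commands
# are printed instead of executed, so nothing touches the cluster by accident.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create a directory in HDFS and hand ownership to the CDAP HDFS user.
# Assumes the "hdfs" superuser account exists and sudo is permitted.
make_hdfs_dir() {
  run sudo -u hdfs hadoop fs -mkdir -p "$1"
  run sudo -u hdfs hadoop fs -chown "$HDFS_USER" "$1"
}

make_hdfs_dir "$HDFS_NAMESPACE"
make_hdfs_dir "$TX_SNAPSHOT_DIR"
```

Setting DRY_RUN=0 in the environment makes the same script execute the commands instead of printing them.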

CDAP Configuration

This section describes how to configure the CDAP components so they work with your existing Hadoop cluster. Certain Hadoop components may need changes, as described below, for CDAP to run successfully.

  1. CDAP packages utilize a central configuration, stored by default in /etc/cdap.

    When you install the CDAP base package, a default configuration is placed in /etc/cdap/conf.dist. The cdap-site.xml file is a placeholder where you can define your specific configuration for all CDAP components. The cdap-site.xml.example file shows the properties that usually require customization for all installations.

    Similar to Hadoop, CDAP utilizes the alternatives framework to allow you to easily switch between multiple configurations. The alternatives system is used for ease of management and allows you to choose between different directories to fulfill the same purpose.

    Simply copy the contents of /etc/cdap/conf.dist into a directory of your choice (such as /etc/cdap/conf.mycdap) and make all of your customizations there. Then run the alternatives command to point the /etc/cdap/conf symlink to your custom directory /etc/cdap/conf.mycdap:

  2. Configure the cdap-site.xml after you have installed the CDAP packages.

    To configure your particular installation, modify cdap-site.xml, using cdap-site.xml.example as a model. (See the appendix for a listing of cdap-site.xml.example, the minimal cdap-site.xml file required.)

    Customize your configuration by creating (or editing, if it already exists) the file conf/cdap-site.xml and setting the appropriate properties.

  3. If necessary, customize the file cdap-env.sh after you have installed the CDAP packages.

    Environment variables set in the cdap-env.sh file, usually at /etc/cdap/conf/cdap-env.sh, are included in the environment used when launching CDAP.

    This is only necessary if you need to customize the environment in which CDAP is launched, such as described below under Local Storage Configuration.

  4. Depending on your installation, you may need to set these properties:

    1. Check that the zookeeper.quorum property in conf/cdap-site.xml is set to the ZooKeeper quorum string, a comma-delimited list of fully-qualified domain names for the ZooKeeper quorum:

    2. Check that the router.server.address property in conf/cdap-site.xml is set to the hostname of the CDAP Router. The CDAP UI uses this property to connect to the Router:

    3. Check that there exists in HDFS a user directory for the hdfs.user property of conf/cdap-site.xml. By default, the HDFS user is yarn. If necessary, create the directory:

    4. If you want to use an HDFS directory with a name other than /cdap:

      1. Create the HDFS directory you want to use, such as /myhadoop/myspace.

      2. Create an hdfs.namespace property for the HDFS directory in conf/cdap-site.xml:

      3. Check that the default HDFS user yarn owns that HDFS directory.

    5. If you want to use an HDFS user other than yarn, such as my_username:

      1. Check that there is—and create if necessary—a corresponding user on all machines in the cluster on which YARN is running (typically, all of the machines).

      2. Create an hdfs.user property for that user in conf/cdap-site.xml:

      3. Check that the HDFS user owns the HDFS directory described by hdfs.namespace on all machines.

      4. Check that there exists in HDFS a /user/ directory for that HDFS user, as described above, such as:

      5. If you use an HDFS user other than yarn, you must either use a secure cluster or use the LinuxContainerExecutor instead of the DefaultContainerExecutor. (Because of how DefaultContainerExecutor works, other containers will launch as yarn rather than the specified hdfs.user.) On Kerberos-enabled clusters, you must use LinuxContainerExecutor, as the DefaultContainerExecutor will not work correctly.

    6. To use the ad-hoc querying capabilities of CDAP, ensure the cluster has a compatible version of Hive installed. See the section on Hadoop Compatibility. To use this feature on secure Hadoop clusters, please see the instructions on configuring secure Hadoop.

      Note: Some versions of Hive contain a bug that may prevent the CDAP Explore Service from starting up. If the CDAP Explore Service fails to start and you see a javax.jdo.JDODataStoreException: Communications link failure in the log, try adding this property to the Hive hive-site.xml file:

    7. If Hive is not going to be installed, disable the CDAP Explore Service in conf/cdap-site.xml (by default, it is enabled):

    8. CDAP can publish notifications of metadata updates to an external Apache Kafka instance. For details on the configuration settings and an example output, see Audit logging.
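Several of the configuration steps above can be sketched in a short script. Everything not named in the text is an assumption made for illustration: the run, setup_conf_dir, site_property and ensure_hdfs_user_dir helpers are hypothetical, the alternatives link name cdap-conf and priority 10 are guesses, and the host names and ZooKeeper port are placeholders. Commands are printed rather than executed by default, and the property elements are written to standard output for pasting into conf/cdap-site.xml.

```shell
# Illustrative values only; replace them with those of your cluster.
CONF_NAME="${CONF_NAME:-conf.mycdap}"
HDFS_USER="${HDFS_USER:-yarn}"

# Hypothetical wrapper: prints commands when DRY_RUN=1 (the default here).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: copy the default configuration and point /etc/cdap/conf at the copy.
# The link name "cdap-conf" and the priority 10 are assumptions of this sketch.
# RPM-based systems may provide "alternatives" instead of "update-alternatives".
setup_conf_dir() {
  run sudo cp -r /etc/cdap/conf.dist "/etc/cdap/$CONF_NAME"
  run sudo update-alternatives --install /etc/cdap/conf cdap-conf "/etc/cdap/$CONF_NAME" 10
}

# Step 4: render one <property> element for cdap-site.xml.
# Values are inserted verbatim; no XML escaping is performed.
site_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# Step 4.3: make sure the HDFS user has a /user/<name> directory in HDFS.
# Assumes the "hdfs" superuser account exists and sudo is permitted.
ensure_hdfs_user_dir() {
  run sudo -u hdfs hadoop fs -mkdir -p "/user/$1"
  run sudo -u hdfs hadoop fs -chown "$1" "/user/$1"
}

setup_conf_dir
# Placeholder hosts; 2181 is ZooKeeper's usual client port.
site_property zookeeper.quorum "zk1.example.com:2181,zk2.example.com:2181"
site_property router.server.address "router.example.com"
site_property hdfs.namespace "/myhadoop/myspace"
site_property hdfs.user "$HDFS_USER"
ensure_hdfs_user_dir "$HDFS_USER"
```

The hdfs.namespace and hdfs.user elements are only needed when you depart from the defaults of /cdap and yarn.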

ULIMIT Configuration

When you install the CDAP packages, the ulimit settings for the CDAP user are specified in the /etc/security/limits.d/cdap.conf file. On Ubuntu, they won't take effect unless you make changes to the /etc/pam.d/common-session file. You can check this setting with the command ulimit -n when logged in as the CDAP user. For more information, refer to the ulimit discussion in the Apache HBase Reference Guide.
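The check described above can be sketched as follows. The text does not say which change to /etc/pam.d/common-session is required; this sketch assumes it is the usual one of enabling the pam_limits.so module with a session line. The pam_limits_enabled helper and the PAM_FILE variable are hypothetical names used for illustration.

```shell
# Succeeds when the given PAM file has an uncommented "session" line that
# loads pam_limits.so, the module that applies /etc/security/limits.d/*.conf.
pam_limits_enabled() {
  grep -Eq '^[[:space:]]*session[[:space:]]+[^#]*pam_limits\.so' "$1"
}

PAM_FILE="${PAM_FILE:-/etc/pam.d/common-session}"   # Ubuntu location from the text
if [ -r "$PAM_FILE" ] && pam_limits_enabled "$PAM_FILE"; then
  echo "pam_limits is enabled in $PAM_FILE"
else
  echo "pam_limits not found in $PAM_FILE; limits from cdap.conf may not apply"
fi

# Run this while logged in as the CDAP user to see the effective limit.
echo "open-file limit for $(id -un): $(ulimit -n)"
```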

Local Storage Configuration

CDAP uses several local storage directories, which vary by distribution, for deploying applications and for operating CDAP.

The CDAP user (the cdap system user) must be able to write to all of these directories.

  • List of local storage directories

    • Properties specified in the cdap-site.xml file, as described in the Appendix: cdap-site.xml, cdap-default.xml:

      • app.temp.dir (default: /tmp)

      • kafka.server.log.dirs (default: /tmp/kafka-logs)

      • local.data.dir (default: data; if set to an absolute path instead, that path must be writable)

    • Additional directories:

      • /var/cdap/run (used as a PID directory, created by the packages)

      • /var/log/cdap (used as log directory, created by the packages)

      • /var/run/cdap (default CDAP user's home directory, created by the packages)

      • /var/tmp/cdap (default LOCAL_DIR—see below—defined and created in the CDAP init scripts)

  • Note that local.data.dir, which defines the directory for program jar storage when deploying to YARN, is set in cdap-site.xml and defaults to the relative path data. A relative value is placed under LOCAL_DIR, giving for example /var/tmp/cdap/data. An absolute path is used as given, which lets you easily move this directory elsewhere.

  • The CDAP Master service is governed by environment variables, which set the directories it uses:

    • TEMP_DIR (default: /tmp): The directory serving as the java.io.tmpdir directory

    • LOCAL_DIR (default: /var/tmp/cdap): The directory serving as the user directory for CDAP Master

    These variables can be set in the file /etc/cdap/conf/cdap-env.sh and will be included in the environment when launching CDAP. See “CDAP Configuration” for details of the central configuration used by CDAP and how to implement this.

  • As in all installations, the directories named in kafka.server.log.dirs may need to be created locally. If you set kafka.server.log.dirs (or any of the other settable parameters) to a particular directory or directories, make sure that the directories exist and that they are writable by the CDAP user.
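The rules above can be checked with a short script run as the CDAP user. This is a minimal sketch using the defaults listed in this section; the resolve_local_data_dir and report_unwritable helpers are hypothetical names, and the relative-versus-absolute handling simply restates the description of local.data.dir rather than reproducing CDAP's own code.

```shell
# Defaults from the text; in practice these come from /etc/cdap/conf/cdap-env.sh.
TEMP_DIR="${TEMP_DIR:-/tmp}"
LOCAL_DIR="${LOCAL_DIR:-/var/tmp/cdap}"

# Resolve local.data.dir as the text describes: a relative value is
# placed under LOCAL_DIR, an absolute value is used unchanged.
resolve_local_data_dir() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)  printf '%s/%s\n' "$LOCAL_DIR" "$1" ;;
  esac
}

# Print every listed directory that is missing or not writable by the
# current user; return non-zero if any was printed.
report_unwritable() {
  status=0
  for dir in "$@"; do
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
      echo "not writable: $dir"
      status=1
    fi
  done
  return "$status"
}

# /tmp is the default for both TEMP_DIR and app.temp.dir;
# /tmp/kafka-logs is the default for kafka.server.log.dirs.
report_unwritable "$TEMP_DIR" /tmp/kafka-logs "$(resolve_local_data_dir data)" \
  /var/cdap/run /var/log/cdap /var/run/cdap "$LOCAL_DIR" || true
```

If you have changed any of these properties in cdap-site.xml, pass the configured paths to report_unwritable instead of the defaults.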

 

Created in 2020 by Google Inc.