Step 3. Installing CDAP Services (Manual)
Package Installation
Install the CDAP packages by using one of the following methods. Do this on each of the boxes that are being used for the CDAP components; our recommended installation is a minimum of two boxes.
This will download and install the latest version of CDAP with all of its dependencies.
To install the optional CDAP CLI on a node, add the cdap-cli package to the list of packages in the commands below.
Using Chef
If you are using Chef to install CDAP, an official cookbook is available.
To install the optional CDAP CLI on a node, use the fullstack recipe.
On RPM using Yum
$ sudo yum install cdap-gateway cdap-kafka cdap-master cdap-security cdap-ui
On Debian using APT
$ sudo apt-get install cdap-gateway cdap-kafka cdap-master cdap-security cdap-ui
Using Tar
Having previously downloaded and unpacked the appropriate tar file to a directory $dir, use:
Create Required Directories
To prepare your cluster so that CDAP can write to its default namespace, create a top-level /cdap directory in HDFS, owned by the HDFS user yarn:
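A minimal sketch of this step, assuming the HDFS superuser is named hdfs and that sudo is available on the node (both are assumptions about your cluster). The run helper is hypothetical: it only prints each command unless DRY_RUN=0 is set, so the sketch can be reviewed before anything is changed.

```shell
# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Create /cdap in HDFS and hand ownership to the yarn user.
# Assumption: 'hdfs' is the HDFS superuser on this cluster.
run sudo -u hdfs hadoop fs -mkdir -p /cdap
run sudo -u hdfs hadoop fs -chown yarn /cdap
```

Run it with DRY_RUN=0 on a node that has the Hadoop client installed to execute the commands.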
In the CDAP packages, the default value of the hdfs.namespace property is /cdap and the default value of the hdfs.user property is yarn.
Also, create a tx.snapshot subdirectory:
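A sketch of creating the snapshot directory, under the same assumptions (an HDFS superuser named hdfs, sudo available). TX_SNAPSHOT_DIR is a hypothetical shell variable used here only so that a customized data.tx.snapshot.dir value can be substituted. Giving ownership to yarn is also an assumption, since the text states ownership only for /cdap.

```shell
# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Default location; override with your data.tx.snapshot.dir value if customized.
TX_SNAPSHOT_DIR="${TX_SNAPSHOT_DIR:-/cdap/tx.snapshot}"

# Assumption: the directory should be owned by the same HDFS user as /cdap.
run sudo -u hdfs hadoop fs -mkdir -p "$TX_SNAPSHOT_DIR"
run sudo -u hdfs hadoop fs -chown yarn "$TX_SNAPSHOT_DIR"
```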
If you have customized (or will be customizing) the property data.tx.snapshot.dir in your CDAP configuration, use that value instead of /cdap/tx.snapshot.
If your cluster is not set up with these defaults, you'll need to edit your CDAP configuration prior to starting services.
CDAP Configuration
This section describes how to configure the CDAP components so they work with your existing Hadoop cluster. Certain Hadoop components may need changes, as described below, for CDAP to run successfully.
CDAP packages utilize a central configuration, stored by default in /etc/cdap.
When you install the CDAP base package, a default configuration is placed in /etc/cdap/conf.dist. The cdap-site.xml file is a placeholder where you can define your specific configuration for all CDAP components. The cdap-site.xml.example file shows the properties that usually require customization for all installations.
Similar to Hadoop, CDAP utilizes the alternatives framework to allow you to easily switch between multiple configurations. The alternatives system is used for ease of management and allows you to choose between different directories to fulfill the same purpose.
Simply copy the contents of /etc/cdap/conf.dist into a directory of your choice (such as /etc/cdap/conf.mycdap) and make all of your customizations there. Then run the alternatives command to point the /etc/cdap/conf symlink to your custom directory /etc/cdap/conf.mycdap.
Configure the cdap-site.xml after you have installed the CDAP packages.
To configure your particular installation, modify cdap-site.xml, using cdap-site.xml.example as a model. (See the appendix for a listing of cdap-site.xml.example, the minimal cdap-site.xml file required.)
Customize your configuration by creating (or editing, if it already exists) the file conf/cdap-site.xml and setting the appropriate properties.
If necessary, customize the file cdap-env.sh after you have installed the CDAP packages.
Environment variables to be included in the environment used when launching CDAP can be set in the cdap-env.sh file, usually at /etc/cdap/conf/cdap-env.sh.
This is only necessary if you need to customize the environment launching CDAP, such as described below under Local Storage Configuration.
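The copy-and-switch step described above might look like the following sketch. The alternatives name cdap-conf and the priority 10 are assumptions, as is the use of update-alternatives (on some RPM-based systems the command is named alternatives). The run helper is hypothetical and only prints each command unless DRY_RUN=0 is set.

```shell
# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Copy the default configuration to a custom directory.
# Note: if /etc/cdap/conf.mycdap already exists, cp -r nests conf.dist inside it.
run sudo cp -r /etc/cdap/conf.dist /etc/cdap/conf.mycdap

# Point the /etc/cdap/conf symlink at the custom directory.
# Assumptions: the alternatives name 'cdap-conf' and the priority '10'.
run sudo update-alternatives --install /etc/cdap/conf cdap-conf /etc/cdap/conf.mycdap 10

# Show which directory the symlink currently resolves to.
run sudo update-alternatives --display cdap-conf
```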
Depending on your installation, you may need to set these properties:
Check that the zookeeper.quorum property in conf/cdap-site.xml is set to the ZooKeeper quorum string, a comma-delimited list of fully-qualified domain names for the ZooKeeper quorum.
Check that the router.server.address property in conf/cdap-site.xml is set to the hostname of the CDAP Router. The CDAP UI uses this property to connect to the Router.
Check that there exists in HDFS a user directory for the hdfs.user property of conf/cdap-site.xml. By default, the HDFS user is yarn. If necessary, create the directory.
If you want to use an HDFS directory with a name other than /cdap:
Create the HDFS directory you want to use, such as /myhadoop/myspace.
Create an hdfs.namespace property for the HDFS directory in conf/cdap-site.xml.
Check that the default HDFS user yarn owns that HDFS directory.
If you want to use an HDFS user other than yarn, such as my_username:
Check that a corresponding user exists (creating it if necessary) on all machines in the cluster on which YARN is running (typically, all of the machines).
Create an hdfs.user property for that user in conf/cdap-site.xml.
Check that the HDFS user owns the HDFS directory described by hdfs.namespace on all machines.
Check that there exists in HDFS a /user/ directory for that HDFS user, as described above.
If you use an HDFS user other than yarn, you must either use a secure cluster or use the LinuxContainerExecutor instead of the DefaultContainerExecutor. (Because of how DefaultContainerExecutor works, other containers will launch as yarn rather than the specified hdfs.user.) On Kerberos-enabled clusters, you must use LinuxContainerExecutor, as the DefaultContainerExecutor will not work correctly.
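As an illustration of the properties listed above, the following sketch prints a cdap-site.xml fragment in the usual Hadoop-style property format. The cdap_property helper is hypothetical, and every value is a placeholder: the example.com hostnames and the ZooKeeper client port 2181 are assumptions, while /myhadoop/myspace and my_username are the sample values used in the text.

```shell
# Hypothetical helper: print one Hadoop-style <property> element.
# Values are emitted verbatim; XML special characters are not escaped.
cdap_property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Print a fragment that could be merged into conf/cdap-site.xml.
echo '<configuration>'
# Placeholder hosts; 2181 is ZooKeeper's default client port.
cdap_property zookeeper.quorum 'zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181'
# Placeholder hostname of the CDAP Router.
cdap_property router.server.address 'router.example.com'
# Only needed when not using the defaults (/cdap and yarn).
cdap_property hdfs.namespace '/myhadoop/myspace'
cdap_property hdfs.user 'my_username'
echo '</configuration>'
```

Review the printed fragment and copy the properties you need into your own cdap-site.xml rather than overwriting the file.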
To use the ad-hoc querying capabilities of CDAP, ensure the cluster has a compatible version of Hive installed. See the section on Hadoop Compatibility. To use this feature on secure Hadoop clusters, please see the instructions on configuring secure Hadoop.
Note: Some versions of Hive contain a bug that may prevent the CDAP Explore Service from starting up. If the CDAP Explore Service fails to start and you see a javax.jdo.JDODataStoreException: Communications link failure in the log, try adding this property to the Hive hive-site.xml file:
If Hive is not going to be installed, disable the CDAP Explore Service in conf/cdap-site.xml (by default, it is enabled).
If you'd like to publish metadata updates to an external Apache Kafka instance, CDAP has the capability of publishing notifications upon metadata updates. For details on the configuration settings and an example output, see Audit logging.
ULIMIT Configuration
When you install the CDAP packages, the ulimit settings for the CDAP user are specified in the /etc/security/limits.d/cdap.conf file. On Ubuntu, they won't take effect unless you make changes to the /etc/pam.d/common-session file. You can check this setting with the command ulimit -n when logged in as the CDAP user. For more information, refer to the ulimit discussion in the Apache HBase Reference Guide.
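A small sketch of the check described above, intended to be run as the CDAP user. It assumes that the change needed in /etc/pam.d/common-session is an active session line loading pam_limits.so, the PAM module that applies the limits files; the text does not spell out the change, so treat that as an assumption. The pam_limits_enabled helper is hypothetical.

```shell
# Report the open-file limit of the current shell.
echo "open files limit: $(ulimit -n)"

# Hypothetical helper: succeeds if the given PAM file contains an
# uncommented 'session' line that references pam_limits.so.
pam_limits_enabled() {
  grep -Eq '^[[:space:]]*session[[:space:]]+.*pam_limits\.so' "$1"
}

pam_file=/etc/pam.d/common-session
if [ -f "$pam_file" ]; then
  if pam_limits_enabled "$pam_file"; then
    echo "pam_limits.so is enabled in $pam_file"
  else
    echo "pam_limits.so not found in $pam_file; limits.d settings may be ignored"
  fi
fi
```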
Local Storage Configuration
Local storage directories, depending on the distribution, are utilized by CDAP for deploying applications and operating CDAP.
The CDAP user (the cdap system user) must be able to write to all of these directories.
List of local storage directories
Properties specified in the cdap-site.xml file, as described in the Appendix: cdap-site.xml, cdap-default.xml:
app.temp.dir (default: /tmp)
kafka.server.log.dirs (default: /tmp/kafka-logs)
local.data.dir (default: data; if this is instead an absolute path, needs to be writable)
Additional directories:
/var/cdap/run (used as a PID directory, created by the packages)
/var/log/cdap (used as log directory, created by the packages)
/var/run/cdap (default CDAP user's home directory, created by the packages)
/var/tmp/cdap (default LOCAL_DIR, see below; defined and created in the CDAP init scripts)
Note that local.data.dir, which defines the directory for program jar storage when deploying to YARN, is set in the cdap-site.xml and defaults to the relative path data. If the value of local.data.dir is relative, it is put under LOCAL_DIR, such as /var/tmp/cdap/data. However, if instead it is an absolute path, that alone is used as the value. This is desirable so you can easily configure this directory to be elsewhere.
The CDAP Master service is governed by environment variables, which set the directories it uses:
TEMP_DIR (default: /tmp): The directory serving as the java.io.tmpdir directory
LOCAL_DIR (default: /var/tmp/cdap): The directory serving as the user directory for CDAP Master
These variables can be set in the file /etc/cdap/conf/cdap-env.sh and will be included in the environment when launching CDAP. See “CDAP Configuration” for details of the central configuration used by CDAP and how to implement this.
As in all installations, the kafka.server.log.dirs may need to be created locally. If you configure kafka.server.log.dirs (or any of the other settable parameters) to a particular directory or directories, you need to make sure that the directories exist and that they are writable by the CDAP user.
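To make the relative-versus-absolute rule for local.data.dir concrete, here is a small sketch. The two export lines show the form that entries in /etc/cdap/conf/cdap-env.sh could take, using the defaults given above. The resolve_local_data_dir helper is hypothetical and only mirrors the rule as described; CDAP performs this resolution itself.

```shell
# Entries of this form could be placed in /etc/cdap/conf/cdap-env.sh
# (the values shown are the defaults listed above).
export TEMP_DIR=/tmp
export LOCAL_DIR=/var/tmp/cdap

# Hypothetical helper mirroring the rule described above: a relative
# local.data.dir is placed under LOCAL_DIR; an absolute one is used as-is.
resolve_local_data_dir() {
  local_data_dir="$1"
  base_dir="${2:-/var/tmp/cdap}"
  case "$local_data_dir" in
    /*) printf '%s\n' "$local_data_dir" ;;
    *)  printf '%s/%s\n' "$base_dir" "$local_data_dir" ;;
  esac
}

resolve_local_data_dir data "$LOCAL_DIR"             # prints /var/tmp/cdap/data
resolve_local_data_dir /data/cdap/jars "$LOCAL_DIR"  # prints /data/cdap/jars
```

Whichever directory results, it must exist and be writable by the CDAP user.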
Created in 2020 by Google Inc.