The overall installation process is:

1. Prepare the cluster.
2. Download and distribute packages.
3. Install CDAP services.
4. Start CDAP services.
5. Verify the installation.
Notes
Apache Ambari can only be used to add CDAP to an existing Hadoop cluster, one that already has the required services (Hadoop: HDFS, YARN, HBase, ZooKeeper, and, optionally, Hive and Spark) installed.
Ambari is designed for setting up HDP (Hortonworks Data Platform) on bare clusters; it cannot be used to manage clusters where HDP was originally installed without Ambari.
A number of features are currently planned to be added, including select CDAP metrics and a full smoke test of CDAP functionality after installation.
If you are installing CDAP with the intention of using replication, see these instructions on CDAP Replication before installing or starting CDAP.
Cask Data Application Platform (CDAP) supports securing clusters using perimeter security, authorization, impersonation and secure storage.
Network (or cluster) perimeter security limits outside access, providing a first level of security. However, perimeter security itself does not provide the safeguards of authentication, authorization and service request management that a secure Hadoop cluster provides.
Authorization provides a way of enforcing access control on CDAP entities.
Impersonation ensures that programs inside CDAP are run as configured users at the namespace level. When enabled, it guarantees that all actions on datasets, streams and other resources happen as the configured user.
We recommend that CDAP security always be used in conjunction with a secure Hadoop cluster. Where secure Hadoop is not or cannot be used, the cluster is inherently insecure and any applications running on it are effectively "trusted". Although there is still value in having perimeter security, authorization enforcement, and secure storage in that situation, a secure Hadoop cluster should be employed with CDAP security whenever possible.
For instructions on enabling CDAP Security, see CDAP Security.
CDAP Security is configured by setting the appropriate settings under Ambari for your environment.
Kerberos support in CDAP is automatically enabled when enabling Kerberos security on your cluster via Ambari. Consult the appropriate Ambari documentation for instructions on enabling Kerberos support for your cluster.
The cdap user must be able to launch YARN containers, which can be accomplished by adjusting the YARN min.user.id setting (to 500) to include the cdap user. (Setting the YARN allowed.system.users property would be the preferred method of enabling the cdap user, as it is more precise and limited; however, as Ambari does not have a mechanism for setting it, min.user.id needs to be used instead.)
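The effect of this setting can be sketched as a simple UID comparison; check_uid below is a hypothetical helper for illustration, not a YARN tool:

```shell
# Sketch: does a user's UID clear the YARN min.user.id threshold?
check_uid() {
  uid=$1
  min=$2
  [ "$uid" -ge "$min" ]
}

# With min.user.id lowered to 500, a cdap user with UID 502 is allowed:
if check_uid 502 500; then
  echo "allowed"
else
  echo "blocked"
fi
```

You can confirm the actual UID of the cdap user on a node with `id -u cdap`.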
If you are adding CDAP to an existing Kerberos cluster, in order to configure CDAP for Kerberos authentication:
The <cdap-principal> is shown in the commands that follow as cdap; however, you are free to use a different appropriate name.
When running on a secure HBase cluster, issue the following command as the hbase user:

$ echo "grant 'cdap', 'RWCA'" | hbase shell
In order to configure CDAP Explore Service for secure Hadoop:
To allow CDAP to act as a Hive client, it must be given proxyuser permissions and allowed from all hosts. For example, set the following properties in the configuration file core-site.xml, where cdap is a system group of which the cdap user is a member:
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>cdap,hadoop,hive</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value>
</property>
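To sanity-check such settings before restarting services, you can extract a property value from the file with standard tools. This is only a sketch: get_prop is a hypothetical helper, and the file path is an example, not your real core-site.xml.

```shell
# Sketch: pull one property value out of a Hadoop-style XML config.
# Note: tr strips newlines and spaces, so values containing spaces
# would be mangled; fine for a quick check of simple values.
get_prop() {
  # $1 = config file, $2 = property name
  tr -d '\n ' < "$1" \
    | sed -n "s|.*<name>$2</name><value>\([^<]*\)</value>.*|\1|p"
}

# Example against a throwaway file (path is an assumption):
cat > /tmp/core-site-example.xml <<'EOF'
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>cdap,hadoop,hive</value>
</property>
EOF
get_prop /tmp/core-site-example.xml hadoop.proxyuser.hive.groups
```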
To execute Hive queries on a secure cluster, the cluster must be running the MapReduce JobHistoryServer service. Consult your distribution documentation on the proper configuration of this service.
To execute Hive queries on a secure cluster using the CDAP Explore Service, the Hive MetaStore service must be configured for Kerberos authentication. Consult your distribution documentation on the proper configuration of the Hive MetaStore service.
With all these properties set, the CDAP Explore Service will run on secure Hadoop clusters.
If you are adding Kerberos to an existing cluster, in order to configure CDAP for Kerberos authentication:
The /cdap directory needs to be owned by the <cdap-principal>; you can set that by running the following commands as the hdfs user (change the ownership in the command from cdap to whatever is the <cdap-principal>):

$ su - hdfs
$ hadoop fs -mkdir -p /cdap && hadoop fs -chown cdap /cdap
When converting an existing CDAP cluster to being Kerberos-enabled, you may run into YARN usercache directory permission problems. A non-Kerberos cluster with default settings will run CDAP containers as the user yarn; a Kerberos cluster will run them as the user cdap. When converting, the usercache directory that YARN creates will already exist and be owned by a different user. On all datanodes, run this command, substituting in the correct value of the YARN parameter yarn.nodemanager.local-dirs:

$ rm -rf <YARN.NODEMANAGER.LOCAL-DIRS>/usercache/cdap
(As yarn.nodemanager.local-dirs can be a comma-separated list of directories, you may need to run this command multiple times, once for each entry.)

If, for example, the setting for yarn.nodemanager.local-dirs is /yarn/nm, you would use:

$ rm -rf /yarn/nm/usercache/cdap
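When the setting holds several entries, the per-entry removal can be scripted as a loop. This is a sketch: clean_usercache is a hypothetical helper, the example paths are assumptions, and it only prints the commands until you remove the echo.

```shell
# Sketch: emit the removal command for each entry of a comma-separated
# yarn.nodemanager.local-dirs value.
clean_usercache() {
  for dir in $(echo "$1" | tr ',' ' '); do
    echo rm -rf "$dir/usercache/cdap"   # drop 'echo' to actually delete
  done
}

clean_usercache "/yarn/nm1,/yarn/nm2"
```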
Restart CDAP after removing the usercache(s).
In addition to having a cluster architecture that supports HA (high availability), these additional configuration steps need to be followed and completed:
For each of the CDAP components listed below (Master, Router, Kafka, UI, Authentication Server), these comments apply:
Sync the configuration files (such as cdap-site.xml and cdap-security.xml) on all the nodes.

While the default bind.address settings (0.0.0.0, used for app.bind.address, data.tx.bind.address, router.bind.address, and so on) can be synced across hosts, if you customize them to a particular IP address, they will, as a result, be different on different hosts. This can be controlled by the settings for an individual Role Instance.
The CDAP Master service primarily performs coordination tasks and can be scaled for redundancy. The instances coordinate amongst themselves, electing one as a leader at all times.
Using the Ambari UI, add additional hosts for the CDAP Master Service to additional machines.
The CDAP Router service is a stateless API endpoint for CDAP, and simply routes requests to the appropriate service. It can be scaled horizontally for performance. A load balancer, if desired, can be placed in front of the nodes running the service.
Using the Ambari UI, add additional hosts for the CDAP Router Service to additional machines.

Start each CDAP Router Service role.
Using the Ambari UI, add additional hosts for the CDAP Kafka Service to additional machines.
Two properties govern the Kafka setting in the cluster:
The list of Kafka seed brokers is generated automatically, but the replication factor (kafka.server.default.replication.factor) is not set automatically; it needs to be set manually.
The replication factor is used to replicate Kafka messages across multiple machines to prevent data loss in the event of a hardware failure.
The recommended setting is to run at least two Kafka brokers with a minimum replication factor of two; set this property to the maximum number of tolerated machine failures plus one (assuming you have that number of machines). For example, if you were running five Kafka brokers, and would tolerate two of those failing, you would set the replication factor to three. The number of Kafka brokers listed should always be equal to or greater than the replication factor.
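The arithmetic above can be sketched directly; the tolerated-failure count is an example value:

```shell
# Sketch: replication factor = tolerated broker failures + 1.
tolerated_failures=2   # brokers you can afford to lose (example value)
replication_factor=$((tolerated_failures + 1))
echo "kafka.server.default.replication.factor=$replication_factor"
```

With five brokers and two tolerated failures, this prints a replication factor of 3, which satisfies the rule that the broker count (5) is at least the replication factor (3).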
Using the Ambari UI, add additional hosts for the CDAP UI Service to additional machines.

Using the Ambari UI, add additional hosts for the CDAP Security Auth Service (the CDAP Authentication Server) to additional machines.
Note that when an unauthenticated request is made in a secure HA setup, a list of all running authentication endpoints will be returned in the body of the response.
CDAP Explore has support for additional execution engines such as Apache Spark and Apache Tez. For details on specifying these engines and configuring CDAP, see Hive Execution Engines.
In order to use Spark2, you must first install Spark2 on your cluster. If both Spark1 and Spark2 are installed, you must modify cdap-env to set SPARK_MAJOR_VERSION and SPARK_HOME:

export SPARK_MAJOR_VERSION=2
export SPARK_HOME=/usr/hdp/{{hdp_version}}/spark2
When Spark2 is in use, Spark1 programs cannot be run. Similarly, when Spark1 is in use, Spark2 programs cannot be run.
When CDAP starts up, it detects the Spark version and uploads the corresponding pipeline system artifacts. If you have already started CDAP with Spark1, you will need to delete the pipeline system artifacts and then reload them in order to use the Spark2 versions. After CDAP has been restarted with Spark2, use the Microservices:

$ DELETE /v3/namespaces/system/artifacts/cdap-data-pipeline/versions/6.2.0
$ DELETE /v3/namespaces/system/artifacts/cdap-data-streams/versions/6.2.0
$ POST /v3/namespaces/system/artifacts
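These calls could be issued with curl; the sketch below only builds the URLs. artifact_url is a hypothetical helper, and the host and port are assumptions; substitute the address of your CDAP Router.

```shell
# Sketch: build the Microservices URL for a system artifact version.
artifact_url() {
  # $1 = host, $2 = port, $3 = artifact name, $4 = artifact version
  echo "http://$1:$2/v3/namespaces/system/artifacts/$3/versions/$4"
}

# Usage (host/port assumed), e.g.:
#   curl -X DELETE "$(artifact_url localhost 11015 cdap-data-pipeline 6.2.0)"
artifact_url localhost 11015 cdap-data-pipeline 6.2.0
```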