Installing CDAP on Kubernetes
CDAP installation on Kubernetes was introduced in CDAP 6.2.3.
This document describes how to install CDAP on a Kubernetes cluster.
Dependencies
This section describes the infrastructure and software dependencies for operating CDAP in Kubernetes.
Kubernetes cluster
CDAP supports using Kubernetes (k8s) as the distributed resource manager. When CDAP is deployed to a k8s cluster, it spawns multiple Deployments and StatefulSets for running various CDAP services. The following diagram shows each of the CDAP services in the Kubernetes cluster:
The CDAP operator is responsible for deploying and managing all the CDAP services inside the cluster. The CDAP operator also supports managing multiple CDAP instances within the same k8s cluster. If multiple CDAP instances are deployed to the same k8s cluster, it is recommended to deploy them to different namespaces for better isolation.
Limitations
Currently, CDAP only supports running one replica (pod) per service, except for the Preview Runner. Failure resiliency is handled by k8s, which restarts pods upon failure. For pods created by StatefulSets, CDAP relies on the infrastructure to re-mount persistent volumes to the new pod, which could potentially be on a different machine.
Another limitation of operating CDAP in Kubernetes is that it does not support the native compute profile. This means all user program executions are external to the Kubernetes cluster and require a Hadoop cluster for program execution.
PostgreSQL database
CDAP needs shared storage for its own metadata, such as deployed artifacts and applications, run histories, preferences, and lineage information. Currently, CDAP supports both PostgreSQL and HBase as the metadata store. When running CDAP in Kubernetes, we recommend using PostgreSQL.
Elasticsearch
CDAP supports metadata search, backed by either Elasticsearch or HBase. In the Kubernetes environment, Elasticsearch is recommended. You can either configure CDAP to use an existing Elasticsearch cluster or run Elasticsearch in Kubernetes by using the Elasticsearch Operator.
Hadoop Compatible File System (HCFS)
CDAP stores artifacts and runtime information through the HDFS API. Any HCFS implementation is supported.
Installation
This section describes the steps to deploy CDAP on Kubernetes.
Prerequisites
An operational Kubernetes cluster.
For production deployments, 64 GB of memory and 20 available virtual CPUs are recommended.
For better security, the Kubernetes cluster should have RBAC enabled.
Have kubectl set up to connect to the Kubernetes cluster.
A PostgreSQL database that is reachable from the Kubernetes cluster.
An Elasticsearch instance that is reachable from the Kubernetes cluster.
Refer to the Appendix section on how to set up an Elasticsearch instance inside the Kubernetes cluster.
Deploy CDAP Operator
CDAP provides a CDAP operator for easy deployment and management of CDAP in Kubernetes. You can deploy the following YAML to create all the necessary resources to have the operator running in the Kubernetes cluster, inside the cdap-system namespace.
Create RBAC Roles and RoleBinding
CDAP interacts with Kubernetes for configuration, service discovery, and also workload management. Deploying the following YAML file will create the necessary set of RBAC Roles and RoleBinding to the service account called cdap.
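Since the authoritative manifest ships with the CDAP operator release, the sketch below only illustrates the shape of those resources; the resource names and rule set here are assumptions, not the actual file:

```yaml
# Illustrative only: the real Role/RoleBinding come with the CDAP operator
# release and may grant a different set of rules.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cdap-role
rules:
- apiGroups: [""]
  resources: ["configmaps", "pods", "services"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cdap-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cdap-role
subjects:
- kind: ServiceAccount
  name: cdap
```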
Prepare the secret token for CDAP
We need to set up a secret in Kubernetes to provide the cdap-security.xml file to CDAP, which contains the PostgreSQL and Elasticsearch passwords. The following commands assume the database username and password are in the environment variables DB_USER and DB_PASS, respectively, and that the Elasticsearch username and password come from the ES_USER and ES_PASS environment variables.
# Create the content of the cdap-security.xml
export CDAP_SECURITY=$(cat << EOF | base64 | tr -d '\n'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>data.storage.sql.jdbc.username</name>
    <value>${DB_USER}</value>
  </property>
  <property>
    <name>data.storage.sql.jdbc.password</name>
    <value>${DB_PASS}</value>
  </property>
  <property>
    <name>metadata.elasticsearch.credentials.username</name>
    <value>${ES_USER}</value>
  </property>
  <property>
    <name>metadata.elasticsearch.credentials.password</name>
    <value>${ES_PASS}</value>
  </property>
</configuration>
EOF
)
# Create the secret
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: cdap-security
type: Opaque
data:
  cdap-security.xml: $CDAP_SECURITY
EOF
Deploy CDAP
Finally, we are ready to deploy CDAP into the Kubernetes cluster. The following YAML provides a simple example. You will need to replace the locationURI with an HCFS-compatible file system (e.g. HDFS, Google Cloud Storage, or Amazon S3). Also, the data.storage.sql.jdbc.connection.url should be configured to point to a PostgreSQL database. Refer to cdap-default.xml for an explanation of the configurations.
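A minimal sketch of such a CDAP custom resource is shown below. The field names follow the CDAP operator's CDAPMaster CRD; the image tag, hosts, and URIs are placeholders you must replace with your own values:

```yaml
# Illustrative sketch of a CDAPMaster resource; hosts and URIs are placeholders.
apiVersion: cdap.cdap.io/v1alpha1
kind: CDAPMaster
metadata:
  name: cdap
spec:
  # Pin to a specific CDAP release image in practice
  image: gcr.io/cdapio/cdap:latest
  # Replace with your HCFS location (HDFS, GCS, S3, ...)
  locationURI: hdfs://<namenode-host>/cdap
  # The Secret created in the previous step, mounted for cdap-security.xml
  securitySecret: cdap-security
  config:
    data.storage.sql.jdbc.connection.url: "jdbc:postgresql://<db-host>:5432/cdap"
```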
You can also configure each of the CDAP services with different cpu, memory, storage, and environments. The following is a simple example that shows how to change the memory and disk size for the appFabric service.
appFabric:
  env:
  - name: OPTS
    value: -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError
  resources:
    requests:
      memory: 8000Mi
  storageSize: 200Gi
Refer to the Custom Resource Definition (CRD) for all the supported settings.
You can verify CDAP is running correctly by listing out the pods in the Kubernetes cluster.
kubectl get pods -l cdap.instance
NAME READY STATUS RESTARTS AGE
cdap-cdap-appfabric-0 1/1 Running 0 114s
cdap-cdap-logs-0 1/1 Running 0 2m6s
cdap-cdap-messaging-0 1/1 Running 0 2m6s
cdap-cdap-metadata-54db5876dc-kplkw 1/1 Running 0 2m6s
cdap-cdap-metrics-0 1/1 Running 0 2m6s
cdap-cdap-preview-0 1/1 Running 0 119s
cdap-cdap-preview-runner-a590df9d-6673-4c-4ead35346f-0 1/1 Running 0 79s
cdap-cdap-preview-runner-a590df9d-6673-4c-4ead35346f-1 1/1 Running 0 79s
cdap-cdap-preview-runner-a590df9d-6673-4c-4ead35346f-2 1/1 Running 0 79s
cdap-cdap-preview-runner-a590df9d-6673-4c-4ead35346f-3 1/1 Running 0 79s
cdap-cdap-preview-runner-a590df9d-6673-4c-4ead35346f-4 1/1 Running 0 79s
cdap-cdap-router-987b785c9-dfcqx 1/1 Running 0 2m6s
cdap-cdap-runtime-0 1/1 Running 0 2m6s
cdap-cdap-service-system-pipeline-studio-6e8722c4-a064-4b2gx2tc 1/1 Running 0 17s
cdap-cdap-userinterface-877f4555d-qvqmw 1/1 Running 0 2m5s
After CDAP is fully up and running, both the UI and the REST APIs can be accessed via the user-interface and router services exposed by CDAP.
kubectl get service -l cdap.instance
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cdap-cdap-router NodePort 10.192.14.102 <none> 11015:31426/TCP 4m41s
cdap-cdap-userinterface NodePort 10.192.14.106 <none> 11011:30504/TCP 4m41s
For quick testing, you can use kubectl port-forward to provide access to the CDAP service. For example, you can expose the user interface and then access it through localhost:11011 from the browser.
kubectl port-forward service/cdap-cdap-userinterface 11011
For production use cases, it is better to expose the CDAP services through a load balancer. Consult with your Kubernetes provider for how to deploy a load balancer.
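On providers that support the LoadBalancer service type, a sketch of such a Service might look like the following; the selector labels are illustrative and must match the labels on the router pods in your deployment:

```yaml
# Illustrative LoadBalancer Service for the CDAP router; adjust the selector
# to match the actual labels on your router pods.
apiVersion: v1
kind: Service
metadata:
  name: cdap-router-lb
spec:
  type: LoadBalancer
  selector:
    cdap.container: router
  ports:
  - port: 11015
    targetPort: 11015
```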
Enable Authentication Service
To enable the Authentication Service in a K8s environment to provide Perimeter Security, extra configuration is needed in the CDAP YAML file.
Set the following configurations in the CDAP YAML file "config:" section.
# Enable perimeter security
security.enabled: "true"
# A key file generated by the AuthenticationTool that is mapped into the pod
# via a k8s secret (see below for instructions)
security.data.keyfile.path: "/etc/cdap/auth/auth.key"
# Disable kerberos (it defaults to true when security.enabled is true)
kerberos.auth.enabled: "false"
Add configurations for the authentication handler based on Configuring Managed Authentication under the "config:" section.
Use the CDAP docker image to generate an "auth.key" file.
docker run -it --rm \
  --mount type=bind,source=$(pwd),target=/auth \
  gcr.io/cdapio/cdap:latest \
  io.cdap.cdap.security.tools.AuthenticationTool -g /auth/auth.key
Create a k8s secret from the "auth.key" file.
kubectl create secret generic cdap-auth --from-file=auth.key
Map the secret into the CDAP pods by adding a "secretVolumes" section to the CDAP YAML file (at the same level as other options, such as "config").
config:
  ....
secretVolumes:
  cdap-auth: "/etc/cdap/auth"
Now you can start CDAP with security enabled, without needing ZooKeeper.
Running CDAP Programs
Starting in CDAP 6.7.0, you can run CDAP programs on Kubernetes using Spark.
Note: MapReduce and Spark Streaming engines are not supported.
To run CDAP programs on Kubernetes, the following service account and role binding must first be created, as required by Spark.
Create service account
kubectl create serviceaccount spark
Create role binding
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
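The same binding can also be expressed declaratively; this manifest is equivalent to the kubectl create clusterrolebinding command above:

```yaml
# Grants the built-in "edit" ClusterRole to the spark service account
# in the default namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-role
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default
```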
Verify by running a pipeline
Run a pipeline using the CDAP UI.
Limitations
coming soon
Appendix
This section describes how to create the resources required for the CDAP installation using Google Cloud Platform.
Preparation
We will be using the standard bash shell and the gcloud command-line tool to perform the setup. Install the Google Cloud SDK before you proceed.
Set up the following environment variables for using the gcloud command:
export PROJECT=<project-name>
export REGION=<region>
export NAME=<cluster-name>
export PROJECT_NUM=$(gcloud projects describe ${PROJECT} \
--format="value(projectNumber)")
export DB_USER=cdap
export DB_PASS=$(openssl rand -base64 14)
export DB_NAME=cdap
Kubernetes
Create a Google Kubernetes Engine (GKE) cluster to serve as the Kubernetes cluster. Make sure the GKE API is enabled before executing the following commands.
# Create a 6-node regional GKE cluster in the default network
gcloud container clusters create "${NAME}" --region "${REGION}" \
--no-enable-basic-auth --machine-type "e2-standard-4" --num-nodes "2" --scopes \
"https://www.googleapis.com/auth/cloud-platform" --enable-ip-alias --subnetwork \
"projects/${PROJECT}/regions/${REGION}/subnetworks/default" --release-channel \
"regular" --project "${PROJECT}"
# Connect to the GKE cluster
gcloud container clusters get-credentials "${NAME}" --region "${REGION}" \
--project "${PROJECT}"
PostgreSQL Database
Create a Google Cloud SQL instance to serve as the PostgreSQL database. Make sure the Cloud SQL API is enabled before executing the following commands:
# Create a private PostgreSQL CloudSQL instance in the default network
gcloud beta sql instances create "${NAME}" --database-version=POSTGRES_12 \
--cpu=1 --memory=4GiB --region="${REGION}" --no-assign-ip --network="default" \
--project "${PROJECT}"
# Create a PostgreSQL user
gcloud sql users create "${DB_USER}" --instance="${NAME}" --password="${DB_PASS}" \
--project "${PROJECT}"
# Create a PostgreSQL database
gcloud sql databases create "${DB_NAME}" --instance="${NAME}" --project="${PROJECT}"
# Export the DB IP address
export DB_IP=`gcloud sql instances describe "${NAME}" --project="${PROJECT}" \
--format="value(ipAddresses.ipAddress)"`
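With DB_IP and DB_NAME set, the JDBC connection URL that goes into the CDAP configuration (data.storage.sql.jdbc.connection.url) can be assembled as follows; 5432 is the default PostgreSQL port, so adjust it if your instance differs:

```shell
# Assemble the PostgreSQL JDBC URL for data.storage.sql.jdbc.connection.url.
# 5432 is the default PostgreSQL port.
export JDBC_URL="jdbc:postgresql://${DB_IP}:5432/${DB_NAME}"
echo "${JDBC_URL}"
```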
Elasticsearch in Kubernetes
We are using the Elasticsearch Operator to operate an Elasticsearch instance inside the Kubernetes cluster. You can deploy the following YAML to create all the necessary resources to have the operator running in the Kubernetes cluster, inside the elastic-system namespace:
After deploying the Elasticsearch operator, you can deploy the following custom resource to start an Elasticsearch instance inside the Kubernetes cluster.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es-cluster
spec:
  version: 6.8.13
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
You can validate that the Elasticsearch instance is up and running correctly by observing that the Elasticsearch pod is in the Running state.
kubectl get pods -l elasticsearch.k8s.elastic.co/cluster-name
NAME READY STATUS RESTARTS AGE
es-cluster-es-default-0 1/1 Running 0 5m
After the Elasticsearch instance is up, you need to get the default user password from the secret created by the operator. This password is needed in the cdap-security.xml file for CDAP to authenticate itself to Elasticsearch.
export ES_USER=elastic
export ES_PASS=$(kubectl get secret es-cluster-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
Created in 2020 by Google Inc.