Installing CDAP on Kubernetes

CDAP installation on Kubernetes was introduced in CDAP 6.2.3.

This document describes how to install CDAP on a Kubernetes cluster.

Dependencies

This section describes the infrastructure and software dependencies for operating CDAP in Kubernetes.

Kubernetes cluster

CDAP supports using Kubernetes (k8s) as the distributed resource manager. When CDAP is deployed to a k8s cluster, it spawns multiple Deployments and StatefulSets for running various CDAP services. The following diagram shows each of the CDAP services in the Kubernetes cluster:

The CDAP operator is responsible for deploying and managing all the CDAP services inside the cluster. The CDAP operator also supports managing multiple CDAP instances within the same k8s cluster. If multiple CDAP instances are deployed to the same k8s cluster, it is recommended to deploy them to different namespaces to provide better isolation.

Limitations

Currently, CDAP supports running only one replica (pod) per service, except for the Preview Runner. Failure resiliency is handled by k8s, which restarts pods upon failure. For pods created by StatefulSets, this relies on the infrastructure being able to re-mount persistent volumes to the new pod, which could potentially be scheduled on a different machine.

Another limitation of operating CDAP in Kubernetes is that it does not support the native compute profile. This means all user program executions are external to the Kubernetes cluster and require a Hadoop cluster for program executions.

PostgreSQL database

CDAP needs a shared storage for its own metadata, such as deployed artifacts and applications, run histories, preferences, lineage information, and many more. Currently, CDAP supports both PostgreSQL and HBase as the metadata store. When running CDAP in Kubernetes, we recommend using PostgreSQL.
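For reference, pointing CDAP at PostgreSQL is done through the JDBC settings in the CDAP configuration. A minimal sketch is shown below; the host, port, and database name are placeholders to replace with your own, and the property names follow cdap-default.xml as the author understands it:

```
# In the "config:" section of the CDAP YAML file
data.storage.implementation: "postgresql"
data.storage.sql.jdbc.connection.url: "jdbc:postgresql://<db-host>:5432/cdap"
```

The database username and password are not placed here; they are supplied separately through the cdap-security.xml secret described in the Installation section.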

Elasticsearch

CDAP has support for metadata search, and it is backed by either Elasticsearch or HBase. In the Kubernetes environment, Elasticsearch is recommended. You can either configure CDAP to use an existing Elasticsearch cluster or run an Elasticsearch in Kubernetes by using the Elasticsearch Operator.

Hadoop Compatible File System (HCFS)

CDAP stores artifacts and runtime information through the HDFS API. Any HCFS implementation is supported.

Installation

This section describes the steps to deploy CDAP on Kubernetes.

Prerequisites

  • An operational Kubernetes cluster.

    • For production deployments, 64 GB of memory and 20 available virtual CPUs are recommended.

    • For better security, the Kubernetes cluster should have RBAC enabled.

    • Have kubectl set up to connect to the Kubernetes cluster.

  • A PostgreSQL database that is reachable from the Kubernetes cluster.

  • An Elasticsearch instance that is reachable from the Kubernetes cluster.

    • Refer to the Appendix section on how to set up an Elasticsearch instance inside the Kubernetes cluster.

Deploy CDAP Operator

CDAP provides a CDAP operator for easy deployment and management of CDAP in Kubernetes. You can deploy the following YAML to create all the necessary resources to have the operator running in the Kubernetes cluster, inside the cdap-system namespace.
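If you have the operator manifest available locally, the deployment is a single kubectl apply; the file name below is a placeholder for the manifest referenced above:

```shell
# Apply the CDAP operator manifest (file name is a placeholder; use the
# YAML referenced above)
kubectl apply -f cdap-operator.yaml

# Verify the operator is running in the cdap-system namespace
kubectl get pods -n cdap-system
```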

Create RBAC Roles and RoleBinding

CDAP interacts with Kubernetes for configuration, service discovery, and also workload management. Deploying the following YAML file will create the necessary set of RBAC Roles and RoleBinding to the service account called cdap.
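In case you need to author the RBAC objects by hand, a minimal sketch of their shape is shown below. The exact rules CDAP needs come from the YAML file referenced above; the verbs and resources here are illustrative assumptions only:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cdap
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cdap
rules:
  # Illustrative only: the authoritative rule set comes from the
  # YAML file referenced above.
  - apiGroups: [""]
    resources: ["configmaps", "pods", "services"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cdap
subjects:
  - kind: ServiceAccount
    name: cdap
roleRef:
  kind: Role
  name: cdap
  apiGroup: rbac.authorization.k8s.io
```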

Prepare the secret token for CDAP

We need to set up a secret in Kubernetes to provide the cdap-security.xml file to CDAP, which will contain the PostgreSQL and Elasticsearch passwords. The following command assumes the database username and password are in the environment variables DB_USER and DB_PASS respectively. For Elasticsearch authentication, it expects that the username and password come from the ES_USER and ES_PASS environment variables.

# Create the content of the cdap-security.xml
export CDAP_SECURITY=$(cat << EOF | base64 | tr -d '\n'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>data.storage.sql.jdbc.username</name>
    <value>${DB_USER}</value>
  </property>
  <property>
    <name>data.storage.sql.jdbc.password</name>
    <value>${DB_PASS}</value>
  </property>
  <property>
    <name>metadata.elasticsearch.credentials.username</name>
    <value>${ES_USER}</value>
  </property>
  <property>
    <name>metadata.elasticsearch.credentials.password</name>
    <value>${ES_PASS}</value>
  </property>
</configuration>
EOF
)

# Create the secret
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: cdap-security
type: Opaque
data:
  cdap-security.xml: $CDAP_SECURITY
EOF

Deploy CDAP

Finally, we are ready to deploy CDAP into the Kubernetes cluster. The following YAML provides a simple example. You will need to replace the locationURI with an HCFS-compatible file system (e.g., HDFS, Google Cloud Storage, or Amazon S3). Also, the data.storage.sql.jdbc.connection.url should be configured to point to a PostgreSQL database. Refer to cdap-default.xml for an explanation of the configurations.
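A sketch of such a custom resource is shown below. The apiVersion, kind, and field names follow the CDAP operator's CRD as the author understands it; the image tag, bucket, and database host are placeholders to replace with your own values:

```yaml
apiVersion: cdap.cdap.io/v1alpha1
kind: CDAPMaster
metadata:
  name: cdap
spec:
  image: gcr.io/cdapio/cdap:latest    # placeholder image tag
  locationURI: gs://<bucket>/cdap     # an HCFS location, e.g., GCS or HDFS
  securitySecret: cdap-security       # the secret created above
  config:
    data.storage.implementation: "postgresql"
    data.storage.sql.jdbc.connection.url: "jdbc:postgresql://<db-host>:5432/cdap"
```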

You can also configure each of the CDAP services with different cpu, memory, storage, and environments. The following is a simple example that shows how to change the memory and disk size for the appFabric service.

appFabric:
  env:
    - name: OPTS
      value: "-XX:+UseG1GC -XX:+ExitOnOutOfMemoryError"
  resources:
    requests:
      memory: 8000Mi
  storageSize: 200Gi

Refer to the Custom Resource Definition (CRD) for all the supported settings.

You can verify CDAP is running correctly by listing out the pods in the Kubernetes cluster.

kubectl get pods -l cdap.instance
NAME                                                              READY   STATUS    RESTARTS   AGE
cdap-cdap-appfabric-0                                             1/1     Running   0          114s
cdap-cdap-logs-0                                                  1/1     Running   0          2m6s
cdap-cdap-messaging-0                                             1/1     Running   0          2m6s
cdap-cdap-metadata-54db5876dc-kplkw                               1/1     Running   0          2m6s
cdap-cdap-metrics-0                                               1/1     Running   0          2m6s
cdap-cdap-preview-0                                               1/1     Running   0          119s
cdap-cdap-preview-runner-a590df9d-6673-4c-4ead35346f-0            1/1     Running   0          79s
cdap-cdap-preview-runner-a590df9d-6673-4c-4ead35346f-1            1/1     Running   0          79s
cdap-cdap-preview-runner-a590df9d-6673-4c-4ead35346f-2            1/1     Running   0          79s
cdap-cdap-preview-runner-a590df9d-6673-4c-4ead35346f-3            1/1     Running   0          79s
cdap-cdap-preview-runner-a590df9d-6673-4c-4ead35346f-4            1/1     Running   0          79s
cdap-cdap-router-987b785c9-dfcqx                                  1/1     Running   0          2m6s
cdap-cdap-runtime-0                                               1/1     Running   0          2m6s
cdap-cdap-service-system-pipeline-studio-6e8722c4-a064-4b2gx2tc   1/1     Running   0          17s
cdap-cdap-userinterface-877f4555d-qvqmw                           1/1     Running   0          2m5s

After CDAP is fully up and running, both the UI and the REST API can be accessed via the userinterface and router services exposed by CDAP.

kubectl get service -l cdap.instance
NAME                      TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
cdap-cdap-router          NodePort   10.192.14.102   <none>        11015:31426/TCP   4m41s
cdap-cdap-userinterface   NodePort   10.192.14.106   <none>        11011:30504/TCP   4m41s

For quick testing, you can use kubectl port-forward to provide access to the CDAP services. For example, you can expose the user interface and then access it through localhost:11011 from the browser.

kubectl port-forward service/cdap-cdap-userinterface 11011

For production use cases, it is better to expose the CDAP services through a load balancer. Consult your Kubernetes provider's documentation for how to deploy a load balancer.
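As one illustration, on providers that support the LoadBalancer service type, a front end for the UI could be sketched as follows. The selector label below is an assumption; copy the actual selector from the existing service with kubectl get service cdap-cdap-userinterface -o yaml:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: cdap-userinterface-lb
spec:
  type: LoadBalancer
  ports:
    - port: 11011
      targetPort: 11011
  selector:
    # Assumed label; copy the real selector from the
    # cdap-cdap-userinterface service created by the operator.
    cdap.container: userinterface
```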

Enable Authentication Service

To enable the Authentication Service in a Kubernetes environment to provide Perimeter Security, extra configurations are needed in the CDAP YAML file.

  1. Set the following configurations in the "config:" section of the CDAP YAML file.

    # Enable perimeter security
    security.enabled: "true"
    # A key file generated by the AuthenticationTool that is mapped into the pod
    # via a k8s secret (see below for instructions)
    security.data.keyfile.path: "/etc/cdap/auth/auth.key"
    # Disable kerberos (it defaults to true when security.enabled is true)
    kerberos.auth.enabled: "false"
  2. Add configurations for the authentication handler based on Configuring Managed Authentication under the "config:" section.

  3. Use the CDAP docker image to generate an "auth.key" file.

    docker run -it --rm \
      --mount type=bind,source=$(pwd),target=/auth \
      gcr.io/cdapio/cdap:latest \
      io.cdap.cdap.security.tools.AuthenticationTool -g /auth/auth.key
  4. Create a k8s secret from the "auth.key" file.

    kubectl create secret generic cdap-auth --from-file=auth.key
  5. Add the secret to the CDAP YAML file to map the secret into the CDAP pods by adding a "secretVolumes" section (at the same level as other options, such as "config").

    config:
      ....
    secretVolumes:
      cdap-auth: "/etc/cdap/auth"

Now, you can start CDAP with security enabled, without needing ZooKeeper.

Running CDAP Programs

Starting in CDAP 6.7.0, you can run CDAP programs on Kubernetes using Spark.

Note: MapReduce and Spark Streaming engines are not supported.

To run CDAP programs on Kubernetes, the following service account and role binding must be created as a prerequisite, as required by Spark.

  • Create service account

kubectl create serviceaccount spark

  • Create role binding

kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default

Verify by running a pipeline

Run a pipeline using the CDAP UI to verify the setup:

Limitations

coming soon

Appendix

This section describes how to create the resources required for the CDAP installation using Google Cloud Platform.

Preparation

We will be using the standard bash shell and the gcloud command-line tool to perform the setup. Install the Google Cloud SDK before you proceed.

Set up the following environment variables for using the gcloud command:

export PROJECT=<project-name>
export REGION=<region>
export NAME=<cluster-name>
export PROJECT_NUM=$(gcloud projects describe ${PROJECT} \
  --format="value(projectNumber)")
export DB_USER=cdap
export DB_PASS=$(openssl rand -base64 14)
export DB_NAME=cdap

Kubernetes

Create a Google Kubernetes Engine (GKE) cluster to serve as the Kubernetes cluster. Make sure the GKE API is enabled before executing the following commands.

# Create a 6-node regional GKE cluster in the default network
gcloud container clusters create "${NAME}" --region "${REGION}" \
  --no-enable-basic-auth --machine-type "e2-standard-4" --num-nodes "2" \
  --scopes "https://www.googleapis.com/auth/cloud-platform" --enable-ip-alias \
  --subnetwork "projects/${PROJECT}/regions/${REGION}/subnetworks/default" \
  --release-channel "regular" --project "${PROJECT}"

# Connect to the GKE cluster
gcloud container clusters get-credentials "${NAME}" --region "${REGION}" \
  --project "${PROJECT}"

PostgreSQL Database

Create a Google Cloud SQL instance to serve as the PostgreSQL database. Make sure the Cloud SQL API is enabled before executing the following commands:

# Create a private PostgreSQL Cloud SQL instance in the default network
gcloud beta sql instances create "${NAME}" --database-version=POSTGRES_12 \
  --cpu=1 --memory=4GiB --region="${REGION}" --no-assign-ip --network="default" \
  --project "${PROJECT}"

# Create a PostgreSQL user
gcloud sql users create "${DB_USER}" --instance="${NAME}" --password="${DB_PASS}" \
  --project "${PROJECT}"

# Create a PostgreSQL database
gcloud sql databases create "${DB_NAME}" --instance="${NAME}" --project="${PROJECT}"

# Export the DB IP address
export DB_IP=$(gcloud sql instances describe "${NAME}" --project="${PROJECT}" \
  --format="value(ipAddresses.ipAddress)")

Elasticsearch in Kubernetes

We are using the Elasticsearch Operator to operate an Elasticsearch instance inside the Kubernetes cluster. You can deploy the following YAML to create all the necessary resources to have the operator running in the Kubernetes cluster, inside the elastic-system namespace:
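The Elasticsearch operator (ECK) publishes an all-in-one manifest that can be applied directly. The version in the URL below is an assumption; check the Elastic Cloud on Kubernetes documentation for the current release:

```shell
# Install the Elasticsearch operator (ECK); the version is a placeholder
kubectl apply -f https://download.elastic.co/downloads/eck/1.3.0/all-in-one.yaml

# Verify the operator pod in the elastic-system namespace
kubectl get pods -n elastic-system
```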

After deploying the Elasticsearch operator, you can deploy the following custom resource to start an Elasticsearch instance inside the Kubernetes cluster.

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es-cluster
spec:
  version: 6.8.13
  nodeSets:
    - name: default
      count: 1
      config:
        node.store.allow_mmap: false

You can validate that the Elasticsearch instance is up and running correctly by observing that the Elasticsearch pod is in the Running state.

kubectl get pods -l elasticsearch.k8s.elastic.co/cluster-name
NAME                      READY   STATUS    RESTARTS   AGE
es-cluster-es-default-0   1/1     Running   0          5m

After the Elasticsearch instance is running, you need to retrieve the default user password from the secret created by the operator. This password is needed in the cdap-security.xml file for CDAP to authenticate itself to Elasticsearch.

export ES_USER=elastic
export ES_PASS=$(kubectl get secret es-cluster-es-elastic-user \
  -o go-template='{{.data.elastic | base64decode}}')

 

Created in 2020 by Google Inc.