Instance ID generation in Kubernetes

Summary

Among the program types supported by CDAP, both Worker and Service types support having multiple instances to run concurrently in the same run. The idea behind this is to support horizontal scaling by partitioning the job (Worker case) or load balancing requests (Service case). The number of instances can be scaled up or down dynamically without restarting the running program. Each of the running instances has access to a unique instance ID and also the total number of instances through the context object provided by the CDAP platform.

In Hadoop YARN environment, the instance ID for each container is assigned by the Application Master, which is trivial to implement since it is a single process. The total number of instances is communicated through the ZooKeeper between the Application Master and YARN containers.

When a program runs in the Kubernetes environment, it is more complicated since there is no Application Master nor ZooKeeper. We need a different approach than in Hadoop YARN to achieve this.

Design

In Kubernetes, resource objects are stored in the API server, backed by etcd, which provides strong consistency. We can leverage this property by to provide unique instance ID assignment that is consistent among all running instances. Also, CDAP always deploys with higher-level resources, such as Deployment, StatefulSet, or Job. All of them carry the number of replicas / instances status. By watching for changes in those higher-level resources, we can always reflect the latest number of instances.

Assignment of Instance ID

Once the total number of instances is known, each pod can try to acquire a unique instance ID in the range of [0, instanceCount). Each pod can try to take an instance ID and assign it to itself. The updated assignment will be stored in the annotation of the owner object by using the Kubernetes replace API. If multiple pods are trying to update the owner object at the same time, only one will succeed. Those who failed can retry the process by fetching the existing annotation and perform the instance ID assignment for itself again. Each pod will repeat this process until it successfully assigned an instance ID and updated the annotation in the owner object.

The assignment is stored in the annotation as a JSON serialized map, with the key as the instance ID and the value as the pod name. For example, if we run multiple instances for the data pipeline studio service, the assignment map will look something like this:

{ "0":"cdap-cdap-service-system-pipeline-studio-5d9dedc7-779hfpst", "1":"cdap-cdap-service-system-pipeline-studio-5d9dedc7-779hghyz", "2":"cdap-cdap-service-system-pipeline-studio-5d9dedc7-786ujxzd", }

With the above assignment map, we can get a complete view of what is the instance ID owned by each pod. This helps in the future when we need to implement inspection tools or for directly communicating with the pod based on instance ID (e.g. implementation of instance restart).

Annotation Maximum Size

In Kubernetes, annotations has a maximum size limit of 256KB. Given the maximum length of pod name is 253 characters, we can roughly support ~1000 instances. The is far more sufficient than what we are expecting to support.

Handling of Pod restarts

When a pod restarted, the pod name may be changed. When it tries to assign itself an instance ID, it will fail due to no more empty slot is available in the assignment map. All staled pods in the assignment must be removed to give room for the restarted pod. We can insert a logic to query for all active pods and remove all inactive ones from the assignment map if the assignment is full.

Created in 2020 by Google Inc.