Configuring cluster resources for Replication Pipeline

Replication pipelines are deployed on a Dataproc cluster by default. This guide explains how replication pipelines request resources, so that users can configure their clusters accordingly.

Cluster sizing for Spark-based pipelines in CDAP is described in https://cdap.atlassian.net/wiki/spaces/DOCS/pages/818970653. Since replication pipelines are not executed as Spark programs, there are a few differences:

  • spark.yarn.executor.memoryOverhead is not applicable to tasks (YARN executors) in replication pipelines.

  • 500 MB is required for the single Application Master (AM) process per pipeline.

This document primarily focuses on the sizing of worker nodes; refer to https://cdap.atlassian.net/wiki/spaces/DOCS/pages/818970653 for recommendations on master node sizing.

Parameters that affect the resource configuration:

  • Task

    • Number of tasks

    • Task memory: the memory allocated to each task

  • Cluster

    • Number of workers in the cluster

    • Memory of each worker

Calculating resource requirements

The resources requested by the tasks should fit in the given cluster.

Basically:

no of tasks * task memory + 500 MB < 0.75 * memory of each worker * no of workers

The 0.75 factor reflects that only about 75% of each worker's memory is usable by the pipeline; the rest is reserved for the operating system and cluster services.

Example:

no of tasks = 10

task memory = 4 GB

Here we will have 10 YARN executors in the cluster, each with 4 GB of memory. Overall, we need at least 40.5 GB of memory in the cluster (10 * 4 GB + 0.5 GB for the AM) to accommodate the 10 tasks and 1 Application Master (AM). Note that it is not enough for the cluster to have the required memory overall; each task and the application master must also be able to fit on a single node.

So for this scenario, we can configure the following worker nodes:

cpu of each worker = 2

memory of each worker = 12 GB

no of workers = 5

The cluster has 60 GB of memory overall, of which 45 GB is usable (75% of 60 GB), enough to fit all 10 tasks plus the AM (40.5 GB). Each worker node has 9 GB of usable memory (75% of 12 GB) for the pipeline, so each node can accommodate 2 tasks (8 GB) while still leaving room for the Application Master.
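The sizing check above can be expressed in a few lines of code. The following is a minimal Python sketch (not part of CDAP; the function and parameter names are illustrative) that applies both the cluster-wide and per-node checks:

    def cluster_fits(num_tasks, task_memory_gb, num_workers, worker_memory_gb,
                     am_memory_gb=0.5, yarn_fraction=0.75):
        """Return True if the tasks and the AM fit in the cluster's usable memory."""
        required = num_tasks * task_memory_gb + am_memory_gb
        usable = yarn_fraction * worker_memory_gb * num_workers
        if required > usable:  # cluster-wide check
            return False
        # Per-node check: the largest single process must fit on one worker.
        usable_per_node = yarn_fraction * worker_memory_gb
        return max(task_memory_gb, am_memory_gb) <= usable_per_node

    # The worked example above: 10 tasks x 4 GB on 5 workers with 12 GB each.
    print(cluster_fits(num_tasks=10, task_memory_gb=4,
                       num_workers=5, worker_memory_gb=12))  # True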

YARN Cluster Memory Allocation

In case of unavailability of resources in the cluster:

  • Only the processes that the cluster can accommodate will run; the rest will fail after a timeout, causing the pipeline to fail.

  • As of CDAP 6.7, no error is thrown in such cases; the job shuts down after this timeout (around 6-7 minutes) without reporting any error.

 

Each of these parameters and concepts, along with how to configure them, is explained in detail below.

Task

A task can be thought of as a separate process that replicates some or all of the tables in a replication job. The role of tasks is to increase the parallelism of the job for faster processing.

For example: if we have a total of 100 tables and configure our job to have 10 tasks, then each task is assigned 10 tables and processes them independently. All tasks run in parallel; a sketch of this division follows.
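As an illustration only, here is a minimal Python sketch of dividing tables evenly across tasks; the actual assignment strategy used by CDAP may differ:

    def assign_tables(tables, num_tasks):
        """Distribute tables across tasks round-robin (illustrative only)."""
        assignments = [[] for _ in range(num_tasks)]
        for index, table in enumerate(tables):
            assignments[index % num_tasks].append(table)
        return assignments

    tables = ["table_%d" % i for i in range(100)]
    assignments = assign_tables(tables, num_tasks=10)
    print(len(assignments[0]))  # 10 tables per task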

How to configure the number of tasks?

The number of tasks can be configured while creating the replication pipeline in the UI, in the Configure advanced properties section.

 

 

  1. [Left image] One way is to select the amount of data flowing in. In this case, based on your selection, the number of tasks will vary from 1 to 18.

  2. [Right image] Alternatively, the number of tasks can be set manually (based on the number of tables and the amount of data).

 

Where to view the number of tasks for a pipeline?

On the replication job UI, click View details and check Number of tasks.

 

Note that the number of tasks can be configured only once per replication job.

Task Memory

The amount of memory, in GB, to allocate to each task. By default it is 2 GB.

This value can be configured on the replication pipeline page UI under Configure.

 

This value generally needs to be changed only if the tables are large and more memory is required to handle them.

 

Cluster

The default cluster type is Dataproc, and this guide focuses on it; however, the concepts and calculations are similar for other customized clusters. A cluster consists of multiple machines, called workers, where the actual processing happens.

For a replication job, a new cluster is created based on the given configurations. Below are the important configurations related to this guide:

Worker memory and cores

The memory allocated to each machine/worker of the cluster is the worker memory; the number of cores allocated to each machine/worker is the worker cores. These parameters can be configured via the replication pipeline page > Configure > Customize for the desired cluster (generally Dataproc) > Worker node configuration > Worker Cores and Worker Memory.

 

Note about cores:

It is generally safe to assume a single core can handle 2 processes. So, in the sizing calculation, it is better to keep the total number of cores (no of workers * worker cores) close to, or at least more than, half the total number of tasks; see the sketch below.
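This rule of thumb can be checked alongside the earlier memory sketch. Again, this is illustrative Python, not part of CDAP; the 2-processes-per-core figure comes from the note above:

    def enough_cores(num_tasks, num_workers, worker_cores, processes_per_core=2):
        """Return True if total cores can handle all tasks (rule of thumb)."""
        return num_workers * worker_cores * processes_per_core >= num_tasks

    # The worked example: 10 tasks on 5 workers with 2 cores each.
    print(enough_cores(num_tasks=10, num_workers=5, worker_cores=2))  # True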

Number of Workers

The number of workers can be configured in a similar manner: replication pipeline page > Configure > Customize for the desired cluster (generally Dataproc) > Number of cluster workers.
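To pick a starting value, the sizing formula from Calculating resource requirements can be rearranged to derive the minimum number of workers for a given task configuration. This is a minimal, illustrative Python sketch under the same assumptions (500 MB for the AM, 75% of worker memory usable):

    import math

    def min_workers(num_tasks, task_memory_gb, worker_memory_gb,
                    am_memory_gb=0.5, yarn_fraction=0.75):
        """Minimum workers whose usable memory covers all tasks plus the AM."""
        required = num_tasks * task_memory_gb + am_memory_gb
        usable_per_worker = yarn_fraction * worker_memory_gb
        return math.ceil(required / usable_per_worker)

    # The worked example: 10 tasks x 4 GB with 12 GB workers -> 5 workers.
    print(min_workers(num_tasks=10, task_memory_gb=4, worker_memory_gb=12))  # 5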

 
