Configuring cluster resources for Replication Pipeline
Replication pipelines are deployed on a Dataproc cluster by default. This guide describes how resources for replication pipelines are requested, so that users can configure their clusters accordingly.
Cluster sizing for Spark-based pipelines in CDAP is described in Cluster Sizing. Since replication pipelines are not executed as Spark programs, there are a few differences:
spark.yarn.executor.memoryOverhead is not applicable for tasks (YARN executors) in replication pipelines.
500 MB is required for the one Application Master (AM) process per pipeline.
This document primarily focuses on sizing of worker nodes; refer to Cluster Sizing for recommendations on master node sizing.
Parameters that affect the resource configuration:
Task
Number of tasks
Task memory: memory allocated to each task
Cluster
Number of workers in the cluster
Memory of each worker
Calculating resource requirements
The resources requested by the tasks should fit in the given cluster.
Basically:
no of tasks * task memory + 500 MB < 0.75 * memory of each worker * no of workers
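The same check can be expressed as a small script. This is only an illustrative sketch of the formula above; the helper name and constants are hypothetical and not part of any CDAP or Dataproc API.

```python
# Sizing check for a replication pipeline (illustrative sketch only).
AM_MEMORY_GB = 0.5       # one Application Master (AM) per pipeline, ~500 MB
USABLE_FRACTION = 0.75   # roughly 75% of worker memory is usable for the pipeline

def cluster_fits(num_tasks, task_memory_gb, worker_memory_gb, num_workers):
    """Return True if the requested tasks plus the AM fit in the cluster."""
    requested_gb = num_tasks * task_memory_gb + AM_MEMORY_GB
    usable_gb = USABLE_FRACTION * worker_memory_gb * num_workers
    return requested_gb < usable_gb
```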
Example:
no of tasks = 10
task memory = 4 GB
Here we will have 10 YARN executors in the cluster, and each executor should have 4 GB of memory. Overall we need at least 40.5 GB of memory in the cluster to accommodate the 10 tasks and 1 Application Master (AM). Note that it is not enough for the cluster to have the required memory overall; each task and the Application Master must also be able to fit on a node.
So for this scenario, we can configure the following worker nodes:
cpu of each worker = 2
memory of each worker = 12 GB
no of workers = 5
The cluster has 60 GB of memory overall, of which 45 GB is usable (75% of 60 GB), which fits all 10 tasks plus the AM. Each worker node has 9 GB of usable memory (75% of 12 GB) for the pipeline, so each node can accommodate 2 tasks while still leaving enough room to fit the Application Master as well.
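Continuing the sketch above, the worked example can be checked directly; the per-node check mirrors the note that each task and the AM must also fit on a single node.

```python
# 10 tasks x 4 GB each on 5 workers with 12 GB each.
print(cluster_fits(num_tasks=10, task_memory_gb=4,
                   worker_memory_gb=12, num_workers=5))  # True: 40.5 GB < 45 GB

# Per-node check: 9 GB usable per worker fits 2 tasks (8 GB) plus the 0.5 GB AM.
usable_per_node_gb = 0.75 * 12
print(usable_per_node_gb >= 2 * 4 + 0.5)                 # True: 9 GB >= 8.5 GB
```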
In case the cluster does not have enough resources:
Only the processes the cluster can accommodate will run; the remaining processes will fail after a timeout, which fails the pipeline.
As of CDAP 6.7 no error is thrown in such cases; the job shuts down after this timeout of around 6-7 minutes without any error.
The following sections explain each of these parameters and concepts in detail, along with how to configure them.
Task
A task can be considered a separate process that replicates some or all of the tables in a replication job. The role of tasks is to increase the parallelism of the job for faster processing.
For example: if we have a total of 100 tables and configure the job to have 10 tasks, each task gets 10 tables and processes them independently. All tasks run in parallel.
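As a rough illustration of this parallelism, the snippet below splits tables evenly across tasks. The actual table-to-task assignment strategy inside CDAP may differ; the round-robin split here is only an assumption for the example.

```python
# 100 tables split evenly across 10 tasks -> 10 tables per task (illustrative only).
tables = [f"table_{i}" for i in range(100)]
num_tasks = 10

assignments = {task: [] for task in range(num_tasks)}
for i, table in enumerate(tables):
    assignments[i % num_tasks].append(table)

print(len(assignments[0]))  # 10 tables handled by the first task
```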
How to configure the number of tasks?
The number of tasks can be configured while creating the replication pipeline in the UI, in the Configure advanced properties section.
[Left image] One way is to select the amount of data flowing in; based on this selection, the number of tasks will vary from 1 to 18.
[Right image] Alternatively, the number of tasks can be set manually (based on the number of tables and the amount of data).
Where to view the number of tasks for a pipeline?
On the replication job UI, click View details and check Number of tasks.
The number of tasks can be configured only once for a replication job.
Task Memory
The amount of memory, in GB, to allocate to each task. By default it is 2 GB.
This value can be configured on the replication pipeline page in the UI, under Configure.
The value generally needs to be increased only when the tables are large and more memory is required to handle them.
Cluster
The default cluster type is Dataproc, and this guide focuses on it, but the concepts and calculations are similar for other customized clusters. A cluster consists of multiple machines, called workers, where the actual processing happens.
For a replication job, a new cluster is created based on the given configurations. Below are the important configurations related to this guide:
Worker memory and cores
The memory allocated to each machine/worker of the cluster is the worker memory, and the number of cores allocated to each machine/worker is the worker cores. These parameters can be configured on the replication pipeline page > Configure > Customize for the desired cluster (generally Dataproc) > Worker node configuration > Worker Cores and Worker memory.
Note about cores:
It is generally safe to assume a single core can handle 2 processes. So, in the calculation, it is better to keep the overall number of cores (no of workers * worker cores) close to, or at least more than, half of the total number of tasks.
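A quick sanity check of this rule of thumb might look like the following; the helper is hypothetical and only encodes the guideline above.

```python
# Rule of thumb: total cores (no of workers * worker cores) should be
# at least half the number of tasks, assuming one core handles ~2 processes.
def enough_cores(num_workers, worker_cores, num_tasks):
    return num_workers * worker_cores >= num_tasks / 2

print(enough_cores(num_workers=5, worker_cores=2, num_tasks=10))  # True
```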
Number of Workers
The number of workers can be configured in a similar manner: replication pipeline page > Configure > Customize for the desired cluster (generally Dataproc) > Number of cluster workers.