
This document describes how you can tune your pipelines to achieve better performance. Performance depends on the size and characteristics of the data being processed, the structure of the pipeline, and the plugins being used. Once you understand a little about how a pipeline is executed, you will know which settings can be adjusted and what impact they will have.

Terminology

There are a number of terms used in pipeline processing. This section gives short definitions of the main terms and notes synonyms used by different software. More details can be found in the following sections.

  • Splits or partitions. A data set is divided into splits (also called partitions) so that it can be processed in parallel. If you don't have enough splits, you can't utilize the whole cluster.

  • Nodes (driver and worker). Physical or virtual machines that do the processing. Driver nodes run driver processes that coordinate the work; worker nodes run executors that process the data. Each node has machine characteristics: the amount of memory and the number of vCores available to processes.

  • vCores, cores, or CPUs. The resource that does the computing. Your nodes provide a certain number of cores, and each executor requests one or a few of them. You want cores and memory to be balanced; otherwise you will end up underutilizing your cluster.

  • Executor processes. The processes that do the actual data processing. You define how much memory and how many CPUs each executor needs, and they are scheduled onto your nodes accordingly. You want enough splits to keep all of the executors busy (see the sketch after this list).
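
These terms map onto concrete settings once a pipeline runs on its execution engine. The sketch below is a minimal illustration, assuming a Spark-based engine driven through PySpark; the application name, input path, and the specific memory, core, and partition numbers are placeholders, not recommendations.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("tuning-example")                   # placeholder application name
        # Executor processes: how much memory and how many vCores each one requests.
        .config("spark.executor.memory", "4g")
        .config("spark.executor.cores", "2")
        # How many executor processes to request from the cluster.
        .config("spark.executor.instances", "4")
        .getOrCreate()
    )

    # Placeholder input path.
    df = spark.read.csv("gs://example-bucket/input.csv", header=True)

    # Splits / partitions: 4 executors x 2 cores = 8 task slots, so fewer than
    # 8 partitions would leave part of the cluster idle.
    print("partitions:", df.rdd.getNumPartitions())
    df = df.repartition(8)

With this balance, each of the eight partitions can be processed by one core at the same time; with only two partitions, six of the eight requested cores would sit idle.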

