Terminology

Pipeline processing uses a number of different terms. This section gives short definitions of the main terms and notes the synonyms used in different software. More details can be found in the following sections.

  • Splits or partitions. A dataset is divided into splits (also called partitions) so that it can be processed in parallel. If you don’t have enough splits, you can’t utilize the whole cluster.

  • Nodes (master and worker). Physical or virtual machines that can do processing. Master nodes usually coordinate work. Worker nodes run executors that process data. Each node has its own machine characteristics (the amount of memory and the number of vCores available for processes).

  • vCores, Cores or CPUs. The resource that does the computing. Usually your nodes provide a certain number of cores and each executor requests one or a few of them. You want this request to be balanced against the memory request; otherwise you will end up underutilizing your cluster.

  • Executor processes. The processes that do the actual data processing. You define how much memory and how many CPUs each one needs, and they are scheduled on your nodes accordingly. You want enough splits to keep all of the executors busy.
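
To make the relationship between these terms concrete, here is a minimal Python sketch of the arithmetic. It is not tied to any particular framework: the names Node, ExecutorSpec, executors_per_node and cluster_summary are hypothetical, and the model is deliberately simplified. It assumes every executor has the same size, that an executor works on one split at a time, and that no memory is reserved for the operating system or the resource manager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A worker node and the resources it offers (hypothetical model)."""
    memory_gb: int
    vcores: int


@dataclass(frozen=True)
class ExecutorSpec:
    """Resources requested by each executor process (hypothetical model)."""
    memory_gb: int
    vcores: int


def executors_per_node(node: Node, spec: ExecutorSpec) -> int:
    # An executor needs both its memory and its vCores, so the scarcer
    # of the two resources decides how many executors fit on the node.
    return min(node.memory_gb // spec.memory_gb, node.vcores // spec.vcores)


def cluster_summary(workers: list[Node], spec: ExecutorSpec, num_splits: int) -> dict:
    total_executors = sum(executors_per_node(n, spec) for n in workers)
    return {
        "executors": total_executors,
        # Resources that no executor can claim because the request is unbalanced.
        "idle_vcores": sum(n.vcores for n in workers) - total_executors * spec.vcores,
        "idle_memory_gb": sum(n.memory_gb for n in workers) - total_executors * spec.memory_gb,
        # Simplifying assumption: one split per executor at a time,
        # so fewer splits than executors leaves some executors without work.
        "busy_executors": min(total_executors, num_splits),
    }


if __name__ == "__main__":
    # Made-up cluster: three workers with 64 GB and 16 vCores each.
    workers = [Node(memory_gb=64, vcores=16)] * 3
    spec = ExecutorSpec(memory_gb=8, vcores=1)
    print(cluster_summary(workers, spec, num_splits=10))
```

With these made-up numbers, memory is the limiting resource: only 8 executors fit on each node, so half of the vCores stay idle, and with 10 splits fewer than half of the 24 executors have anything to do. Requesting 2 vCores per executor, or producing more splits, would change that balance.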