Pipelines are executed on clusters of machines. They achieve high throughput by splitting up the work that needs to be done and then running that work in parallel on multiple executors spread out across the cluster. In general, the greater the number of splits (also called partitions), the faster the pipeline can run. The level of parallelism in your pipeline is determined by the sources and shuffle stages in the pipeline.
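
To see this concretely, here is a minimal PySpark sketch (the file path and column name are hypothetical) showing how partitions bound parallelism: each partition is processed by a single task, so more partitions allow more executors to work at once.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()

    # The source determines the initial partition count.
    df = spark.read.csv("gs://my-bucket/input.csv", header=True)  # hypothetical path
    print(df.rdd.getNumPartitions())

    # A shuffle stage (groupBy, join, and so on) repartitions the data; the
    # partition count after the shuffle is controlled by the Spark settings
    # discussed later in this section.
    grouped = df.groupBy("country").count()  # hypothetical column
    print(grouped.rdd.getNumPartitions())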

Autotuning

With each release, CDAP does more and more tuning for you. To get the most benefit from it, you need both the latest CDAP and the latest compute stack. For example, if you are using Dataproc, CDAP 6.4.0 supports Dataproc image version 2.0 and provides adaptive execution tuning. Similarly, adaptive execution can be enabled on Hadoop clusters running Spark 3.

For details on how to set the Dataproc image version, see the Dataproc documentation. Set it to “2.0-debian10” to run Dataproc version 2.0 on Debian 10.
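
For reference, here is a minimal sketch (the project, region, and cluster names are hypothetical) of creating such a cluster with the Dataproc Python client; when CDAP provisions an ephemeral cluster for you, the image version is set in the compute profile instead.

    from google.cloud import dataproc_v1

    region = "us-central1"  # hypothetical region
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    cluster = {
        "project_id": "my-project",    # hypothetical project
        "cluster_name": "my-cluster",  # hypothetical cluster name
        # Run Dataproc image version 2.0 on Debian 10.
        "config": {"software_config": {"image_version": "2.0-debian10"}},
    }
    operation = client.create_cluster(
        request={"project_id": "my-project", "region": region, "cluster": cluster}
    )
    operation.result()  # block until the cluster is ready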

With adaptive execution tuning, you generally need only to specify the range of partitions to use, not the exact partition number. The exact partition number, even if set in the pipeline configuration, is ignored when this feature is enabled.

If you are using an ephemeral Dataproc cluster, CDAP sets the proper configuration automatically. For static Dataproc or Hadoop clusters, the following configuration parameters can be set:

  • spark.default.parallelism - Set it to the total number of vCores available in the cluster. This ensures the cluster is not underloaded, and defines the lower bound for the number of partitions.

  • spark.sql.adaptive.coalescePartitions.initialPartitionNum - Set it to 32x the number of vCores available in the cluster. This defines the upper bound for the number of partitions.

  • spark.sql.adaptive.enabled - Ensure it is set to true to enable the optimizations. Dataproc sets it automatically, but if you are using a generic Hadoop cluster, you need to ensure it is enabled.

These parameters can be set in the Engine configuration of a specific pipeline or in the cluster properties of a static Dataproc cluster, as shown in the sketch below.
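
As a sketch, assuming a static cluster with 4 workers of 16 vCores each (64 vCores total, a hypothetical sizing), the equivalent Spark session configuration would look like this; in a pipeline, the same keys and values are entered as Engine config properties instead.

    from pyspark.sql import SparkSession

    vcores = 64  # total vCores in the hypothetical cluster (4 workers x 16 vCores)

    spark = (
        SparkSession.builder
        .appName("adaptive-tuning-demo")
        # Lower bound: one partition per vCore keeps the cluster fully loaded.
        .config("spark.default.parallelism", vcores)
        # Upper bound: 32x the vCore count, i.e. 2048 initial partitions here.
        .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", 32 * vcores)
        # Enable adaptive query execution (Dataproc sets this automatically).
        .config("spark.sql.adaptive.enabled", "true")
        .getOrCreate()
    )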

You can find more details at https://spark.apache.org/docs/latest/sql-performance-tuning.html

Clusters

Pipelines are executed on Hadoop clusters. Hadoop is a collection of open source technologies that read, process, and store data in a distributed fashion. Understanding a bit about Hadoop clusters will help you understand how pipelines are executed in parallel. The Hadoop ecosystem includes many different systems, but pipelines use the two most fundamental ones: HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator). HDFS is used to store intermediate data, and YARN is used to coordinate workloads across the cluster.

...