...

For each pipeline, you can choose Spark or MapReduce as the execution engine; Spark is the default. You can also add custom configurations, which are additional properties passed to the execution engine as Name/Value pairs. Custom configurations typically apply to Spark and are used less frequently with MapReduce.
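Outside the CDAP UI, the same Name/Value pairs correspond to ordinary Spark properties. A minimal sketch, assuming a standalone PySpark environment rather than the pipeline UI (the property and value shown are illustrative, not CDAP defaults):

    from pyspark import SparkConf

    # Each Name/Value pair from the engine config maps to a Spark
    # property set on the job's configuration.
    conf = SparkConf().set("spark.executor.memory", "4g")  # example only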

Spark engine configs

Here are some examples of custom configurations that are commonly used for Spark:

...

To improve pipeline performance, add spark.serializer as the Name and org.apache.spark.serializer.KryoSerializer as the Value.
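Kryo serialization is generally faster and more compact than Spark's default Java serialization, which is why this setting tends to improve performance. For reference, a sketch of the equivalent property set directly on a Spark job:

    from pyspark import SparkConf

    # Use Kryo instead of the default Java serializer for data that
    # Spark shuffles or caches.
    conf = SparkConf().set("spark.serializer",
                           "org.apache.spark.serializer.KryoSerializer")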

...

To work around a bug in Spark versions prior to 2.4, enter spark.maxRemoteBlockSizeFetchToMem as the Name and 2147483135 as the Value.
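The value 2147483135 sits just below the 2 GB signed-integer limit, so remote blocks larger than the threshold are streamed to disk instead of being fetched into memory, avoiding the fetch failures seen in older Spark versions. A sketch of the equivalent property outside the CDAP UI:

    from pyspark import SparkConf

    # Stream remote blocks larger than the threshold to disk rather
    # than memory, avoiding the 2 GB fetch limit in Spark < 2.4.
    conf = SparkConf().set("spark.maxRemoteBlockSizeFetchToMem", "2147483135")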

...

If you don’t want Spark to retry upon failure, enter spark.yarn.maxAppAttempts as the Name and 1 as the Value. If you want Spark to retry after a failure, set the value to the maximum number of application attempts. Note that the value includes the initial attempt, so a value of 3 allows up to two retries.
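A sketch of the equivalent property applied directly to a Spark-on-YARN job:

    from pyspark import SparkConf

    # A single application attempt: YARN will not resubmit the
    # application after a failure.
    conf = SparkConf().set("spark.yarn.maxAppAttempts", "1")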

...

To turn off auto-caching in Spark, enter spark.cdap.pipeline.autocache.enable as the Name and false as the Value. By default, pipelines cache intermediate data to prevent Spark from recomputing it. Caching requires a substantial amount of memory, so pipelines that process large amounts of data often need to turn it off.
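Note that spark.cdap.pipeline.autocache.enable appears to be a CDAP-specific property rather than a built-in Spark setting; Spark accepts arbitrary spark.* keys, and this one is read by the pipeline rather than by Spark itself. A sketch of how it would look set programmatically (illustrative only, not the CDAP mechanism):

    from pyspark import SparkConf

    # CDAP-specific flag: disables automatic caching of intermediate
    # pipeline data. Spark itself ignores this key; the pipeline
    # driver reads it.
    conf = SparkConf().set("spark.cdap.pipeline.autocache.enable", "false")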

To disable pipeline task retries, enter spark.task.maxFailures as the Name and 1 as the Value.
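A sketch of the equivalent Spark property; by default, Spark allows four attempts per task before failing the job:

    from pyspark import SparkConf

    # Fail the job on the first task failure instead of retrying the
    # task (Spark's default allows 4 attempts per task).
    conf = SparkConf().set("spark.task.maxFailures", "1")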

...

For common Spark settings, see Parallel Processing.

MapReduce engine configs

Here’s an example of a custom configuration used for MapReduce:

...