...
For each pipeline, you can choose Spark or MapReduce as the execution engine; Spark is the default. You can also add custom configurations, which pass additional settings to the engine. Custom configurations typically apply to Spark and are used less frequently with MapReduce.
Spark engine configs
Here are some examples of custom configurations that are commonly used for Spark:
...
To improve pipeline performance, add spark.serializer as the Name and org.apache.spark.serializer.KryoSerializer as the Value.
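Expressed as a plain Spark properties entry (shown here only for reference; in a pipeline you set it through the Name/Value fields), this setting looks like:

```properties
# Use Kryo serialization instead of the default Java serialization;
# Kryo is generally faster and produces more compact serialized data.
spark.serializer=org.apache.spark.serializer.KryoSerializer
```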
...
To work around a bug in Spark versions prior to 2.4, enter spark.maxRemoteBlockSizeFetchToMem as the Name and 2147483135 as the Value.
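In properties form, the same setting is:

```properties
# Remote blocks larger than this threshold (just under 2 GiB) are
# fetched to disk instead of memory, avoiding a large-block fetch bug
# in Spark versions before 2.4.
spark.maxRemoteBlockSizeFetchToMem=2147483135
```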
...
If you don’t want Spark to retry upon failure, enter spark.yarn.maxAppAttempts as the Name and 1 as the Value. If you want Spark to retry, set the Value to the maximum number of attempts you want Spark to make; note that this count includes the initial attempt, not only the retries.
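As a properties entry, disabling retries looks like:

```properties
# Allow only a single application attempt on YARN (no retries on failure).
spark.yarn.maxAppAttempts=1
```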
...
To turn off auto-caching in Spark, enter spark.cdap.pipeline.autocache.enable as the Name and false as the Value. By default, pipelines cache intermediate data to prevent Spark from recomputing it. Caching requires a substantial amount of memory, so pipelines that process large amounts of data often need to turn it off.
To disable pipeline task retries, enter spark.task.maxFailures as the Name and 1 as the Value.
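As a properties entry:

```properties
# Fail the job after a single task failure, effectively disabling
# task retries (the Spark default allows 4 failures per task).
spark.task.maxFailures=1
```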
...
For common Spark settings, see Parallel Processing.
MapReduce engine configs
Here’s an example of a custom configuration commonly used with MapReduce:
...