Support for arbitrary Spark configs in streaming pipelines

Description

There are a lot of knobs you can use in Spark Streaming (http://spark.apache.org/docs/latest/configuration.html). Hydrator pipelines should let advanced users set any of these.

Release Notes

Added the ability to set properties in the SparkConf or Hadoop Configuration of a pipeline run.

Activity

Albert Shau
March 24, 2017, 6:31 PM

One possibility is to add another section to the config for spark properties.

Like other things in the configuration, these settings would be constant and could never change between runs. This would avoid the Spark Streaming checkpointing issue where the conf is frozen after the first run.

Any approach involving runtime arguments would not work for realtime pipelines, because Spark serializes the config.

Terence Yim
April 13, 2017, 10:54 PM
Edited

It doesn't matter whether the properties come from runtime arguments or pipeline configs, because in the end Spark Streaming checkpoints (serializes) the DStream state, which contains the configuration, right?

Albert Shau
April 17, 2017, 10:58 PM

'would not work' was probably the wrong phrase. It would 'work' for the very first run, but subsequent runs would ignore anything in the runtime args because the SparkConf is serialized in the checkpoint. Therefore, it seems better to make it part of the app config, at least for now, since the config does not change between runs.

Albert Shau
April 18, 2017, 6:36 PM
Edited

Seems like we should support two new capabilities. The first capability is to set engine specific properties at the pipeline level. This translates to settings in the SparkConf, or settings in the MapReduce Configuration, and would be used by advanced users who want to tweak the properties of their runs. To support this, I propose that we add a "properties" map to the app config:
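A minimal sketch of what that could look like; the surrounding fields and example values are illustrative only, not a final schema:

    {
      "config": {
        "properties": {
          "system.spark.spark.executor.memory": "2g",
          "system.mapreduce.mapreduce.map.memory.mb": "2048"
        },
        "stages": [ ... ],
        "connections": [ ... ]
      }
    }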

Properties prefixed with 'system.spark' will be added to the SparkConf. Properties prefixed with 'system.mapreduce' will be added to the Hadoop Configuration.

The second capability is to allow plugins to set these engine specific properties. For example, the Kafka Streaming Source should be able to rate limit the number of records read per partition by setting the spark.streaming.kafka.maxRatePerPartition property. To support this, I propose we add a couple methods to PipelineConfigurer:
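A sketch of what those methods might look like; the names and signatures are illustrative, not a committed API:

    // Hypothetical additions to PipelineConfigurer (illustrative names):

    // The engine the pipeline will run on, so a plugin can decide which
    // properties apply. Uses the proposed Engine enum in cdap-etl-api.
    Engine getEngine();

    // Set engine properties for every run of the pipeline. Keys are raw
    // engine keys, e.g. "spark.streaming.kafka.maxRatePerPartition".
    void setPipelineProperties(java.util.Map<String, String> properties);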

Each plugin will be able to access these at configure time. If multiple plugins set different values for the same key, pipeline creation will fail. This would also require creating an Engine enum in cdap-etl-api and deprecating the one in cdap-etl-proto in favor of the new one in the api module.
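For illustration, a streaming source could then set its rate limit at configure time along these lines, again using the hypothetical methods above:

    @Override
    public void configurePipeline(PipelineConfigurer pipelineConfigurer) {
      // Only meaningful when the pipeline runs on Spark Streaming.
      if (pipelineConfigurer.getEngine() == Engine.SPARK) {
        pipelineConfigurer.setPipelineProperties(
            java.util.Collections.singletonMap(
                "spark.streaming.kafka.maxRatePerPartition", "1000"));
      }
    }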

Albert Shau
April 19, 2017, 8:58 PM
Fixed

Assignee

Albert Shau

Reporter

Albert Shau

Labels

None

Docs Impact

None

UX Impact

None

Components

Fix versions

Priority

Major