Added the ability to set properties in the SparkConf or Hadoop Configuration of a pipeline run.
One possibility is to add another section to the config for Spark properties.
Like other settings in the configuration, these properties would be constant and could never change between runs. This would avoid the Spark Streaming checkpointing issue, where the conf is frozen after the first run.
Any approach involving runtime arguments would not work for realtime pipelines, because Spark serializes the config.
It doesn't matter whether the properties come from runtime arguments or pipeline configs, because in the end Spark Streaming checkpoints (serializes) the DStream state, which contains the configuration, right?
'would not work' was probably the wrong phrase. It would 'work' for the very first run, but subsequent runs would ignore anything in the runtime arguments because the SparkConf is serialized in the checkpoint. It therefore seems better to make it part of the app config, at least for now, since the config does not change between runs.
Seems like we should support two new capabilities. The first is to set engine-specific properties at the pipeline level. This translates to settings in the SparkConf, or settings in the MapReduce Configuration, and would be used by advanced users who want to tweak the properties of their runs. To support this, I propose that we add a "properties" map to the app config:
Properties prefixed with 'system.spark' will be added to the SparkConf. Properties prefixed with 'system.mapreduce' will be added to the Hadoop Configuration.
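For example, the "properties" map in the app config might look like the following (the specific property keys are illustrative, and I'm assuming the prefix is stripped before the remainder is handed to the SparkConf or Hadoop Configuration):

```json
{
  "properties": {
    "system.spark.spark.executor.memory": "2g",
    "system.mapreduce.mapreduce.map.memory.mb": "2048"
  }
}
```

Here 'system.spark.spark.executor.memory' would become 'spark.executor.memory' in the SparkConf, and 'system.mapreduce.mapreduce.map.memory.mb' would become 'mapreduce.map.memory.mb' in the Hadoop Configuration.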
The second capability is to allow plugins to set these engine-specific properties. For example, the Kafka Streaming Source should be able to rate limit the number of records read per partition by setting the spark.streaming.kafka.maxRatePerPartition property. To support this, I propose we add a couple methods to PipelineConfigurer:
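A minimal sketch of what such configure-time methods could look like; the method name, the Engine enum shape, and the recording implementation below are assumptions for illustration, not the final API:

```java
import java.util.HashMap;
import java.util.Map;

// Assumed: an Engine enum in cdap-etl-api distinguishing execution engines.
enum Engine { SPARK, MAPREDUCE }

// Hypothetical additions to PipelineConfigurer (method name is an assumption).
interface PipelineConfigurer {
  // Set one engine-specific property, e.g. a SparkConf entry.
  void setPipelineProperty(Engine engine, String key, String value);
}

// Toy implementation that just records properties per engine, only to show
// how a plugin would call the API from its configurePipeline() method.
class RecordingConfigurer implements PipelineConfigurer {
  final Map<Engine, Map<String, String>> props = new HashMap<>();

  @Override
  public void setPipelineProperty(Engine engine, String key, String value) {
    props.computeIfAbsent(engine, e -> new HashMap<>()).put(key, value);
  }
}

public class Demo {
  public static void main(String[] args) {
    RecordingConfigurer configurer = new RecordingConfigurer();
    // e.g. the Kafka streaming source rate-limiting each partition:
    configurer.setPipelineProperty(Engine.SPARK,
        "spark.streaming.kafka.maxRatePerPartition", "1000");
    System.out.println(configurer.props.get(Engine.SPARK));
  }
}
```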
Each plugin will be able to access these at configure time. If multiple plugins set different values for the same key, pipeline creation will fail. This would also require creating an Engine enum in cdap-etl-api, and deprecating the one in cdap-etl-proto in favor of the one in api.
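The conflict rule above can be sketched as a merge step that fails pipeline creation when two plugins set different values for the same key (a hypothetical illustration, not the actual planner code; duplicate keys with identical values are allowed):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the conflict check: merge the engine properties set
// by each plugin, failing when two plugins disagree on a key's value.
public class PropertyMerger {

  // perPlugin maps plugin name -> properties that plugin set at configure time.
  static Map<String, String> merge(Map<String, Map<String, String>> perPlugin) {
    Map<String, String> merged = new HashMap<>();
    for (Map.Entry<String, Map<String, String>> plugin : perPlugin.entrySet()) {
      for (Map.Entry<String, String> prop : plugin.getValue().entrySet()) {
        String existing = merged.putIfAbsent(prop.getKey(), prop.getValue());
        if (existing != null && !existing.equals(prop.getValue())) {
          throw new IllegalArgumentException(String.format(
              "Plugin '%s' sets property '%s' to '%s', which conflicts with "
              + "an earlier value '%s'; failing pipeline creation.",
              plugin.getKey(), prop.getKey(), prop.getValue(), existing));
        }
      }
    }
    return merged;
  }
}
```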