Pipelines with many sinks may exceed the variable substitution depth in the Hadoop conf

Description

If a pipeline has many sinks that can all be written by a single mapper/reducer, then CDAP stores the configuration for all the outputs of that MapReduce in a single property of the Hadoop configuration, as a stringified JSON list. Each of these output configurations repeats the entire Hadoop configuration (it is unclear why), and because that configuration can contain variables of the form ${property.name}, those variables get repeated for each sink in the JSON string. This can easily exceed the hardcoded number of substitutions allowed for a property in the Hadoop conf, leading to pipeline failure with Hadoop's "Variable substitution depth too large" error.

There does not seem to be an easy workaround, because the error is thrown in Hadoop's Configuration class, which hardcodes the number of allowed substitutions to 20.

Note that the error is slightly misleading: the depth of substitutions here is really only 1, but Hadoop's implementation simply counts the total number of substitutions regardless of nesting.
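As a rough illustration, the failure can be reproduced with Hadoop's Configuration alone. This is a minimal sketch, assuming the Hadoop 2.x substitution behavior described above; the property name multi.output.configs is hypothetical and stands in for whatever property CDAP actually uses:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class SubstitutionDepthRepro {
  public static void main(String[] args) {
    // Skip loading default resources; define one variable to reference.
    Configuration conf = new Configuration(false);
    conf.set("some.var", "value");

    // Mimic a stringified JSON list of output configs that embeds a
    // ${...} reference once per sink; more than 20 references in a
    // single value is enough to trip the limit.
    StringBuilder json = new StringBuilder();
    for (int i = 0; i < 25; i++) {
      json.append("{\"path\":\"${some.var}\"},");
    }
    conf.set("multi.output.configs", json.toString());

    // get() runs variable substitution. Hadoop counts every substitution
    // in the value against the hardcoded MAX_SUBST of 20, so this throws
    // IllegalStateException even though the nesting depth is only 1.
    conf.get("multi.output.configs");
  }
}
{code}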

The fix might be to store these output configurations in a different way, or to retrieve them from the Hadoop conf without substitution (using getRaw() instead of get()).
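As a sketch of the second option, again using the hypothetical property name from above, reading the value back with getRaw() bypasses substitution entirely:

{code:java}
// get() expands ${...} variables and is subject to the 20-substitution
// limit; getRaw() returns the stored string verbatim, so the JSON list
// can be read back without any substitution taking place.
String json = conf.getRaw("multi.output.configs");
{code}

Any variables embedded in the individual output configurations would then be substituted later, when each output's configuration is actually used.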

Release Notes

Fixed an issue that caused pipelines with too many macros to fail when running in MapReduce.

Activity

Albert Shau December 3, 2018 at 6:29 PM

Need to cherry-pick the fix onto develop and verify there aren't other places in the code that use get() where getRaw() is required.

Andreas Neumann July 30, 2018 at 6:29 PM

The same issue would happen outside of pipelines if a map/reduce job uses MultiOutput with a large number of outputs.

Fixed

Details

Created July 30, 2018 at 6:25 PM
Updated February 26, 2019 at 11:21 PM
Resolved February 26, 2019 at 11:21 PM