StreamingSource cannot evaluate macros with checkpointing

Description

This is due to two limitations in Spark Streaming: how Spark implements checkpointing, and how Spark builds its DAG.

With checkpointing enabled, Spark serializes all code into its checkpoint. This means the StreamingSource that gets serialized is one whose macros have already been substituted. If you run the pipeline once and a macro evaluates to XYZ, it is frozen at that value forever, because that is what was serialized into the checkpoint. If you then stop the pipeline, update the runtime arguments so that the macro should evaluate to ABC, and run the pipeline again, Spark simply picks up the StreamingSource from the checkpoint and no macro evaluation occurs.

We work around this in other plugin types by wrapping them in classes that dynamically instantiate the plugin and evaluate macros at runtime (see the sketch below). However, this does not work for a StreamingSource because of how Spark builds its DAG.
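A minimal sketch of that wrapper trick, for contrast. The class name MacroEvaluatingTransform is hypothetical, and the newPluginInstance call is a simplification of how CDAP instantiates plugins and evaluates their macros at runtime:

{code:scala}
import co.cask.cdap.api.data.format.StructuredRecord
import co.cask.cdap.etl.api.{Emitter, Transform, TransformContext}

// Hypothetical wrapper around a batch-pipeline transform. Only the plugin id
// is serialized; the delegate (with its macro-substituted config) is rebuilt
// on every run.
class MacroEvaluatingTransform(pluginId: String)
    extends Transform[StructuredRecord, StructuredRecord] {

  @transient private var delegate: Transform[StructuredRecord, StructuredRecord] = _

  override def initialize(context: TransformContext): Unit = {
    // Instantiate the real plugin at runtime; macro evaluation happens here,
    // against the current run's runtime arguments, not the first run's.
    delegate = context.newPluginInstance(pluginId)
    delegate.initialize(context)
  }

  override def transform(input: StructuredRecord,
                         emitter: Emitter[StructuredRecord]): Unit = {
    delegate.transform(input, emitter)
  }
}
{code}

This works for batch plugins because nothing forces the real plugin to exist before initialize runs.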

When you create an InputDStream (Spark's class), its constructor adds it to the DAG held by the streaming context. This means there is no way to use a "dynamic" version of the input DStream that performs the same trick of a wrapper class that instantiates the actual class and evaluates macros at runtime; by the time the wrapper could act, the real stream has already registered itself. It might be possible to hack into Spark's internals and rewrite the DAG after the fact, but that would be complex and very brittle.
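For reference, here is a trimmed paraphrase of Spark's InputDStream (most members elided; comments added). The registration call sits in the constructor body, so it runs the moment any subclass is instantiated:

{code:scala}
package org.apache.spark.streaming.dstream

import scala.reflect.ClassTag
import org.apache.spark.streaming.StreamingContext

// Trimmed paraphrase of Spark's InputDStream, not the full source.
abstract class InputDStream[T: ClassTag](_ssc: StreamingContext)
    extends DStream[T](_ssc) {

  // Constructor body: every InputDStream registers itself with the
  // DStreamGraph as soon as it is created, before any wrapper code
  // could intervene.
  ssc.graph.addInputStream(this)

  def start(): Unit  // called when streaming starts
  def stop(): Unit   // called when streaming stops
}
{code}

Because registration happens inside the constructor, there is no seam where a wrapper could defer creation of the real stream until after macros are evaluated.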

If it can't be fixed, we should check for macros at configure time and fail pipeline creation if any are present, so users get an immediate error instead of a silently stale value. A sketch of such a check follows.
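A minimal sketch of that configure-time check, assuming a config class extending CDAP's PluginConfig, whose containsMacro method reports whether a raw property value still contains an unevaluated macro. The source class and property names here are hypothetical:

{code:scala}
import co.cask.cdap.api.data.format.StructuredRecord
import co.cask.cdap.api.plugin.PluginConfig
import co.cask.cdap.etl.api.PipelineConfigurer
import co.cask.cdap.etl.api.streaming.{StreamingContext, StreamingSource}
import org.apache.spark.streaming.api.java.JavaDStream

// Hypothetical config; in a real plugin these would be @Macro-annotated
// properties.
class ExampleSourceConfig extends PluginConfig {
  val brokers: String = null
  val topic: String = null
}

class ExampleStreamingSource(config: ExampleSourceConfig)
    extends StreamingSource[StructuredRecord] {

  override def configurePipeline(configurer: PipelineConfigurer): Unit = {
    // Fail pipeline creation up front: any macro here would be frozen at its
    // first-run value once Spark writes the checkpoint.
    for (prop <- Seq("brokers", "topic") if config.containsMacro(prop)) {
      throw new IllegalArgumentException(
        s"Property '$prop' must not contain a macro: a StreamingSource " +
          "cannot re-evaluate macros once a checkpoint exists.")
    }
  }

  override def getStream(context: StreamingContext): JavaDStream[StructuredRecord] = {
    // Stream construction elided; not relevant to the configure-time check.
    throw new UnsupportedOperationException("elided for brevity")
  }
}
{code}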

Release Notes

None

Activity

Bhooshan Mogal
February 14, 2017, 12:31 AM

Added the fixVersion to 4.2, for consideration during prioritization.

Albert Shau
February 14, 2017, 12:34 AM

I want to reiterate that this likely cannot be fixed, due to Spark limitations.

Resolution

Fixed

Assignee

Albert Shau

Reporter

Albert Shau

Labels

Docs Impact

None

UX Impact

None

Components

Priority

Critical