Lifecycle methods should be able use the same or a different transaction

Description

(Batch) Programs like Spark and MapReduce have a beforeSubmit() and a inFinish() method.

In some use cases, it makes sense to run beforeSubmit() in the same transaction. For example if a MR processes a range in a stream, and needs to persist a "high watermark" of how far it has read. If beforeSubmit() persists that in its own transaction, and the MR fails subsequently, then the persisted mark is wrong. What we would want is that the mark only gets updated in case of success.

However, it can also make sense to persists this in its own transaction. For example, if we expect concurrent runs of the same MR (say it runs for 20 minutes, but the next run is scheduled 15 minutes later), then the next run should see the updated mark, so that it processes a disjoint range of events.

We can find similar reasoning for onFinish().

Conclusion is that these methods should support running either in their own transaction, or in the same transaction as the main program. Or we could introduce two pairs of methods, either or each of which can be implemented by a program.

Release Notes

None

Activity

Show:

Bhooshan Mogal September 9, 2015 at 11:08 PM

Moving this out of 3.2 because of a lack of bandwidth.

Andreas Neumann August 5, 2015 at 1:38 AM

We have seen multiple use cases where this was needed. Upping it to Critical

Won't Fix
Pinned fields
Click on the next to a field label to start pinning.

Details

Assignee

Reporter

Labels

Affects versions

Components

Priority

Created July 21, 2015 at 7:05 PM
Updated June 9, 2020 at 1:27 AM
Resolved June 9, 2020 at 1:27 AM