DynamicPartitioner should have a way to close a writer when it is known to be done

Description

DynamicPartitioner will open multiple record writers, one for each dynamic partition it is creating. Some output formats (for example, ORC) require a lot of memory when writing, and having multiple writers open can cause out-of-memory issues. To avoid that, there should be an option to keep at most one writer open at any given time. However, that will only work if the reducer (or mapper for a map-only job) emits all records for a partition consecutively. This cannot be generally assumed. Thus:

  • add an option to specify that only one writer should be open at once

  • close the current partition as soon as a key for a different partition is seen

  • if a partition is closed and we see another key for that partition, throw an error
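
The three points above can be sketched as follows. This is only an illustration of the proposed policy, not the actual DynamicPartitioner code: the class `SingleWriterSketch`, the in-memory `PartitionWriter` stand-in, and all method names are hypothetical, and partition keys are assumed to be plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the "at most one open writer" policy; not the CDAP implementation. */
class SingleWriterSketch {

  /** Stand-in for a record writer; a real one would hold file handles and buffers. */
  static final class PartitionWriter {
    final String partition;
    final List<String> records = new ArrayList<>();
    boolean closed;

    PartitionWriter(String partition) {
      this.partition = partition;
    }

    void write(String record) {
      if (closed) {
        throw new IllegalStateException("Writer for '" + partition + "' is closed");
      }
      records.add(record);
    }

    void close() {
      closed = true;
    }
  }

  // Partitions whose writer has already been closed.
  private final Set<String> closedPartitions = new HashSet<>();
  // The single writer that may be open at any given time (null if none).
  private PartitionWriter current;

  /** Writes a record, switching writers when the partition key changes. */
  public void write(String partition, String record) {
    if (current == null || !current.partition.equals(partition)) {
      // A key for an already-closed partition means the input was not grouped.
      if (closedPartitions.contains(partition)) {
        throw new IllegalStateException(
            "Partition '" + partition + "' was already closed; records must be grouped by partition");
      }
      // A key for a different partition was seen: close the current writer first.
      closeCurrent();
      current = new PartitionWriter(partition);
    }
    current.write(record);
  }

  /** Closes the last open writer, for example at the end of the task. */
  public void close() {
    closeCurrent();
  }

  private void closeCurrent() {
    if (current != null) {
      current.close();
      closedPartitions.add(current.partition);
      current = null;
    }
  }

  public int openWriterCount() {
    return current == null ? 0 : 1;
  }

  public boolean isClosed(String partition) {
    return closedPartitions.contains(partition);
  }
}
```

With this sketch, writing keys in the order `a, a, b` keeps one writer open throughout and closes `a` when `b` arrives. A later record for `a` is rejected with an `IllegalStateException` instead of silently reopening the partition.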

Release Notes

DynamicPartitioner can now limit the number of open RecordWriters to one, if the output partition keys are grouped.

Activity

Ali Anwar
November 1, 2016, 10:28 PM
Edited

Instead of limiting it to just one, perhaps the maximum number of open writers could be configurable.
This would help when the user's data is nearly, but not completely, ordered.

Edit: There doesn't seem to be a strong use case for this, and it would be difficult to guarantee that the data is ordered in a way that makes this useful, so I'm not pursuing it.

Ali Anwar
November 3, 2016, 9:14 PM
Fixed

Assignee

Ali Anwar

Reporter

Andreas Neumann

Labels

Docs Impact

None

UX Impact

None

Components

Fix versions

Affects versions

Priority

Major