Metrics for MapReduce jobs are incorrect when multiple sinks are used

Description

When using multiple sinks, specifically the ones that use OutputFormats (e.g., S3), the mapper output metrics are incorrect. The map input records counter shows the right number, but the map output records counter does not.

To reproduce, create an ETL batch pipeline that reads from a stream and writes to both S3 and a Table.

Release Notes

Fix a bug where certain MapReduce metrics were not being properly emitted when using multiple outputs.

Activity

Ali Anwar
December 22, 2015, 11:23 PM

The DataCleansing example application can also be used to reproduce this, as that MapReduce job has two output datasets.

Ali Anwar
January 6, 2016, 7:29 PM

Normally, the user calls context#write(key, value); the context in this case is a Hadoop class, which increments the output record counter as well as writing the record. In the case of multiple outputs, the user calls context#write(outputName, key, value); here the context is a CDAP class, and that call never translates into a call to Hadoop's context#write. Because of that, the metrics for output records are not automatically incremented.
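To make the two paths concrete, here is a minimal sketch from the mapper's perspective. The MultiOutputContext interface and the cdapContext field are hypothetical stand-ins for CDAP's actual multiple-outputs context, shown only to illustrate which path bypasses Hadoop's counting:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TwoSinkMapper extends Mapper<LongWritable, Text, Text, Text> {

  // Hypothetical stand-in for CDAP's multiple-outputs context; the real
  // class lives in CDAP and is only sketched here for illustration.
  interface MultiOutputContext {
    void write(String outputName, Text key, Text value)
        throws IOException, InterruptedException;
  }

  private MultiOutputContext cdapContext; // provided by CDAP at runtime

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Single-output path: Hadoop's context#write hands the record to the
    // RecordWriter and also increments the map output records counter.
    context.write(new Text("all"), line);

    // Multi-output path: the named write is routed by the CDAP context
    // directly to that output's RecordWriter. Hadoop's context#write is
    // never invoked, so no output record counter is incremented.
    cdapContext.write("s3Sink", new Text("all"), line);
  }
}
```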

To fix this, use a MeteredRecordWriter. Merged https://github.com/caskdata/cdap/pull/4799
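For reference, a minimal sketch of the wrapper approach, assuming the writer delegates to the real RecordWriter and increments a counter on every write; the class shape and counter wiring here are illustrative, not the actual implementation from the merged PR:

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Wraps any RecordWriter and meters the records flowing through it, so the
// count is taken at the one point both write paths converge on.
public class MeteredRecordWriter<K, V> extends RecordWriter<K, V> {
  private final RecordWriter<K, V> delegate;
  private final Counter outputRecords; // obtained from the task context by the caller

  public MeteredRecordWriter(RecordWriter<K, V> delegate, Counter outputRecords) {
    this.delegate = delegate;
    this.outputRecords = outputRecords;
  }

  @Override
  public void write(K key, V value) throws IOException, InterruptedException {
    // Every record that reaches the underlying writer is counted, whether it
    // arrived via Hadoop's context#write or CDAP's named-output write.
    outputRecords.increment(1);
    delegate.write(key, value);
  }

  @Override
  public void close(TaskAttemptContext context) throws IOException, InterruptedException {
    delegate.close(context);
  }
}
```

Because the metering sits inside the RecordWriter rather than the mapper context, the counts stay correct no matter which write API produced the record.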

Assignee

Ali Anwar

Reporter

Sreevatsan Raman

Labels

None

Docs Impact

None

UX Impact

None

Components

Fix versions

Priority

Major