PartitionConsumer should have a way to limit the life time of its scanner
Description
Release Notes
Activity

Andreas Neumann June 24, 2017 at 12:59 AM
Tested this on a cluster by modifying SportResults.ScoreCounter to do this in initialize():
I added 1005 partitions and ran the MapReduce. With default settings, this caches 1000 items and sleeps 1000 × 200 ms = 200 seconds before it hits the region server again, at which point it gets the expected exception:
Next, I updated the dataset to set hbase.client.scanner.caching to 50. In CLI:
And now, running again (with system.data.tx.timeout=600, to prevent the transaction timeout), it passes and successfully starts the job.
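The timing in this test can be checked with a back-of-envelope sketch. The 60-second scanner timeout used below is an assumed typical default for hbase.client.scanner.timeout.period, not a value taken from this issue:

```java
// Back-of-envelope check of the caching arithmetic in the test above.
// The 60 s scanner timeout is an assumption for illustration.
public class ScannerTimeoutMath {

  // Time between two region server contacts: the client drains its local
  // cache of `caching` rows, spending `perItemMillis` on each (here, the
  // Get plus the artificial 200 ms sleep in the test).
  static long millisBetweenRegionServerCalls(int caching, long perItemMillis) {
    return caching * perItemMillis;
  }

  public static void main(String[] args) {
    long scannerTimeoutMillis = 60_000; // assumed scanner lease timeout

    // Default cache of 1000 rows at 200 ms each: 200 s idle, lease expires.
    long idleDefault = millisBetweenRegionServerCalls(1000, 200);
    System.out.println("caching=1000: idle " + idleDefault + " ms, times out: "
        + (idleDefault > scannerTimeoutMillis)); // times out: true

    // Reduced cache of 50 rows: 10 s idle, well under the timeout.
    long idleReduced = millisBetweenRegionServerCalls(50, 200);
    System.out.println("caching=50: idle " + idleReduced + " ms, times out: "
        + (idleReduced > scannerTimeoutMillis)); // times out: false
  }
}
```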

Andreas Neumann June 20, 2017 at 2:59 PM
Not sure about Flow and TMS. But for other Table scans, user code may iterate over the results and do some processing on each one. Such a user needs to be able to configure the scan appropriately.

Terence Yim June 20, 2017 at 4:48 AM
This also applies to normal usages of Scanner (a pure scanner that doesn't do separate Get), e.g. in TMS and Flow queue?
When consuming partitions, we do a scan on an indexed table. This scan hardcodes the HBase scanner client cache to 1000, scans all 1000 items, and performs a Get for each of them, until the working set is full.
1000 Gets normally take only about 10 seconds, which is way below the typical timeout values. However, when HBase is under heavy load or responds slowly for other reasons (for example, slow HDFS, slow network, major compaction, HDFS rebalance), this can easily take minutes and exceed both the HBase RPC and scanner timeouts, as well as the transaction timeout.
Reducing the working set can mitigate this (it limits the number of Gets), but that is not acceptable in all scenarios.
PartitionedFileSet (or the partition consumer) should have a way to renew the scanner lease, or to close the scanner before it times out, to be resilient in such situations.
Possibly this should be an option for indexed tables, or even for scans on any Table.
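One shape the "close the scanner before it times out" idea could take is sketched below. All names here (Scanner, ScannerFactory, scanAll) are hypothetical stand-ins for illustration, not CDAP or HBase APIs; a real implementation would reopen the HBase scan just past the last row key processed:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch: proactively close the scanner after a fixed number of items and
// reopen it just past the last key seen, so no single scanner lives long
// enough to hit the lease timeout. Interfaces are hypothetical stand-ins.
public class RenewingScannerSketch {

  interface Scanner {
    String next();   // next row key, or null when the scan is exhausted
    void close();
  }

  interface ScannerFactory {
    Scanner open(String startAfterKey);  // null means "from the beginning"
  }

  /** Iterates all keys, replacing the scanner every maxItemsPerScanner items. */
  static void scanAll(ScannerFactory factory, int maxItemsPerScanner,
                      List<String> sink) {
    String lastKey = null;
    boolean exhausted = false;
    while (!exhausted) {
      Scanner scanner = factory.open(lastKey);
      try {
        for (int i = 0; i < maxItemsPerScanner; i++) {
          String key = scanner.next();
          if (key == null) { exhausted = true; break; }
          sink.add(key);      // stand-in for the per-item Get and processing
          lastKey = key;
        }
      } finally {
        scanner.close();      // closed well before any lease timeout
      }
    }
  }

  /** In-memory stand-in for an indexed-table scan over five row keys. */
  static List<String> demo(int maxItemsPerScanner) {
    List<String> rows = Arrays.asList("a", "b", "c", "d", "e");
    ScannerFactory factory = startAfter -> {
      int from = (startAfter == null) ? 0 : rows.indexOf(startAfter) + 1;
      Iterator<String> it = rows.subList(from, rows.size()).iterator();
      return new Scanner() {
        public String next() { return it.hasNext() ? it.next() : null; }
        public void close() { }
      };
    };
    List<String> seen = new ArrayList<>();
    scanAll(factory, maxItemsPerScanner, seen);
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(demo(2)); // all five keys, using three short-lived scanners
  }
}
```

The trade-off is extra round trips to restart the scan, but each scanner's lifetime is bounded regardless of how slowly the per-item processing runs.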