PartitionConsumer should have a way to limit the life time of its scanner
Description
Release Notes
Activity

Andreas Neumann June 24, 2017 at 12:59 AM
Tested this on a cluster by modifying SportResults.ScoreCounter to do this in initialize():
I added 1005 partitions and ran the MapReduce. With default settings, this caches 1000 items and sleeps 1000 × 200 ms = 200 seconds before it hits the region server again, at which point it gets the expected exception:
Next, I updated the dataset to set hbase.client.scanner.caching to 50. In CLI:
And now, running again (with system.data.tx.timeout=600, to prevent the transaction timeout), it passes and successfully starts the job.
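The timing in this test can be checked with a back-of-envelope sketch. The 60-second scanner timeout used below is an assumed typical default for hbase.client.scanner.timeout.period, not a value taken from this issue:

```java
// Back-of-envelope check of the caching arithmetic in the test above.
// The 60 s scanner timeout is an assumption for illustration.
public class ScannerTimeoutMath {

  // Time between two region server contacts: the client drains its local
  // cache of `caching` rows, spending `perItemMillis` on each (here, the
  // Get plus the artificial 200 ms sleep in the test).
  static long millisBetweenRegionServerCalls(int caching, long perItemMillis) {
    return caching * perItemMillis;
  }

  public static void main(String[] args) {
    long scannerTimeoutMillis = 60_000; // assumed scanner lease timeout

    // Default cache of 1000 rows at 200 ms each: 200 s idle, lease expires.
    long idleDefault = millisBetweenRegionServerCalls(1000, 200);
    System.out.println("caching=1000: idle " + idleDefault + " ms, times out: "
        + (idleDefault > scannerTimeoutMillis)); // times out: true

    // Reduced cache of 50 rows: 10 s idle, well under the timeout.
    long idleReduced = millisBetweenRegionServerCalls(50, 200);
    System.out.println("caching=50: idle " + idleReduced + " ms, times out: "
        + (idleReduced > scannerTimeoutMillis)); // times out: false
  }
}
```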

Andreas Neumann June 20, 2017 at 2:59 PM
Not sure about Flow and TMS. But for other Table scans, user code may iterate over the results and do some processing on each one. Such a user needs to be able to configure the scan appropriately.

Terence Yim June 20, 2017 at 4:48 AM
This also applies to normal usages of Scanner (a pure scanner that doesn't do separate Get), e.g. in TMS and Flow queue?
When consuming partitions, we do a scan on an indexed table. This scan hardcodes the HBase scanner client cache to 1000, scans all 1000 items, and performs a Get for each of them, until the working set is full.
1000 Gets normally take only about 10 seconds, which is way below the typical timeout values. However, when HBase is under heavy load or responds slowly for other reasons (for example, slow HDFS, slow network, major compaction, HDFS rebalance), this can easily take minutes and exceed both the HBase RPC and scanner timeouts, as well as the transaction timeout.
Reducing the working set can mitigate this (it limits the number of Gets), but that is not acceptable in all scenarios.
PartitionedFileSet (or the partition consumer) should have a way to renew the scanner lease, or to close the scanner before it times out, to be resilient in such situations.
Possibly this should be an option for indexed tables, or even for scans on any Table.
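One shape the "close the scanner before it times out" idea could take is sketched below. All names here (Scanner, ScannerFactory, scanAll) are hypothetical stand-ins for illustration, not CDAP or HBase APIs; a real implementation would reopen the HBase scan just past the last row key processed:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch: proactively close the scanner after a fixed number of items and
// reopen it just past the last key seen, so no single scanner lives long
// enough to hit the lease timeout. Interfaces are hypothetical stand-ins.
public class RenewingScannerSketch {

  interface Scanner {
    String next();   // next row key, or null when the scan is exhausted
    void close();
  }

  interface ScannerFactory {
    Scanner open(String startAfterKey);  // null means "from the beginning"
  }

  /** Iterates all keys, replacing the scanner every maxItemsPerScanner items. */
  static void scanAll(ScannerFactory factory, int maxItemsPerScanner,
                      List<String> sink) {
    String lastKey = null;
    boolean exhausted = false;
    while (!exhausted) {
      Scanner scanner = factory.open(lastKey);
      try {
        for (int i = 0; i < maxItemsPerScanner; i++) {
          String key = scanner.next();
          if (key == null) { exhausted = true; break; }
          sink.add(key);      // stand-in for the per-item Get and processing
          lastKey = key;
        }
      } finally {
        scanner.close();      // closed well before any lease timeout
      }
    }
  }

  /** In-memory stand-in for an indexed-table scan over five row keys. */
  static List<String> demo(int maxItemsPerScanner) {
    List<String> rows = Arrays.asList("a", "b", "c", "d", "e");
    ScannerFactory factory = startAfter -> {
      int from = (startAfter == null) ? 0 : rows.indexOf(startAfter) + 1;
      Iterator<String> it = rows.subList(from, rows.size()).iterator();
      return new Scanner() {
        public String next() { return it.hasNext() ? it.next() : null; }
        public void close() { }
      };
    };
    List<String> seen = new ArrayList<>();
    scanAll(factory, maxItemsPerScanner, seen);
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(demo(2)); // all five keys, using three short-lived scanners
  }
}
```

The trade-off is extra round trips to restart the scan, but each scanner's lifetime is bounded regardless of how slowly the per-item processing runs.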