Pending Events metric is always wrong

Description

The backend does not have a metric for pending events in a queue. The UI currently shows the rate of events emitted by the upstream flowlet.

This metric is not correct. Pending events accumulate over time and form an absolute count; they are not a current rate. Also, the metric used is system.process.events.out, which counts emitted events across all output queues of a flowlet rather than for a single queue.
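
To illustrate the difference between a rate and an absolute backlog, here is a minimal sketch. It is not CDAP code: the function name and the per-interval numbers are hypothetical, and it assumes we know how many events were emitted and consumed in each interval.

```python
from itertools import accumulate


def pending_over_time(emitted, consumed):
    """Return the backlog after each interval.

    emitted[i] and consumed[i] are hypothetical per-interval counts;
    the backlog is the running sum of their differences.
    """
    deltas = [e - c for e, c in zip(emitted, consumed)]
    return list(accumulate(deltas))


# The upstream flowlet emits at a constant rate of 5 events per interval,
# but the consumer falls behind, so the backlog keeps growing.
backlog = pending_over_time([5, 5, 5, 5], [5, 3, 2, 4])
```

With these made-up numbers the emit rate stays flat at 5 while the backlog changes from interval to interval, which is why the rate cannot stand in for pending events.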

We need to come up with a new way to do this.

Release Notes

None

Subtasks

100% Done

Linked issues

relates to

CDAP-2191

After deleting queues, the pending events for the queues are incorrect.

CDAP-1184

DELETE namespaces/namespace-id/queues API should clear queue metrics.

CDAP-2312

Pending Events in a queue should be implemented by the queue system

Activity

Sreevatsan Raman May 2, 2015 at 10:53 PM

All the dependent JIRAs are resolved. Resolving this.

Andreas Neumann April 26, 2015 at 7:04 AM
Edited

So this metric is implemented and the PR is out (see https://cdap.atlassian.net/browse/CDAP-2259#icft=CDAP-2259). Now I have one doubt: can this metric be queried as a time series? It seems that the metric system will only allow us to query the total value as an aggregate, but not over time. But that is what we would like to show in the UI: how the queue size (pending events in the queue) is evolving. If it keeps going up, that is an indication to scale out the flowlet to more instances.

@Alex Baranau can you advise how to do that best?

Andreas Neumann April 22, 2015 at 7:54 PM

OK let's try to get this done using a metric in 3.0.

Terence Yim April 22, 2015 at 7:48 PM

I have a suggestion on how to do it transactionally.

For the queue producer, along with enqueuing entries, increment a value per partition (basically groupId + instanceId) and store it in a separate row in the queue table (prefixed with something unique to avoid conflicts with queue entries).
- This records the number of entries enqueued for each partition.
- We can use readless increments to keep the performance impact minimal.
- Since writes to the queue table are transactional, the metric is always correct.
Similarly for the queue consumer, each consumer performs a readless increment on the number of dequeued entries, which gets stored in the same row but in a different column from the one used by the producer.
- The performance impact should be small, since it's just an extra write to HBase in the same batch (the queue consumer already writes to the queue table to record the CONSUMED state).
- The queue table is transactional, hence the metric is always correct.
To query the pending metric for a queue, simply subtract the sum of all consumer counts from the sum of all producer counts.
- The sums can be computed by a single transactional get of the metric row in the queue table.
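
The bookkeeping described above can be sketched as follows. This is only an in-memory model under stated assumptions: the class and method names are hypothetical, plain dictionaries stand in for the counter row in the queue table, and nothing here is transactional or talks to HBase.

```python
from collections import defaultdict


class QueueCounters:
    """Hypothetical stand-in for the per-queue counter row."""

    def __init__(self):
        # One counter per partition, keyed by (group_id, instance_id).
        self.enqueued = defaultdict(int)  # written by producers
        self.dequeued = defaultdict(int)  # written by consumers

    def record_enqueue(self, group_id, instance_id, count):
        # Models the producer's increment, done alongside the enqueue.
        self.enqueued[(group_id, instance_id)] += count

    def record_dequeue(self, group_id, instance_id, count):
        # Models the consumer's increment, done alongside the CONSUMED write.
        self.dequeued[(group_id, instance_id)] += count

    def pending(self):
        # Pending = sum of all producer counts - sum of all consumer counts.
        return sum(self.enqueued.values()) - sum(self.dequeued.values())


counters = QueueCounters()
counters.record_enqueue(group_id=1, instance_id=0, count=5)
counters.record_enqueue(group_id=1, instance_id=1, count=3)
counters.record_dequeue(group_id=1, instance_id=0, count=4)
remaining = counters.pending()
```

In the real design both kinds of counters would live in one row of the queue table, so a single transactional read would yield a consistent pair of sums.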

The change I mentioned above, however, is risky to do in the 3.0 timeframe (it involves modifying the queue producer, the consumer, and the queue eviction coprocessor). I think it makes sense to use the metrics system as an intermediate solution first, to bring the pending metric closer to the actual value in 3.0.

Andreas Neumann April 22, 2015 at 6:54 PM
Edited

Here is my suggested design:

We add a new metric for each queue. Since each queue name already has the producing and consuming flowlet in its name, this will be a unique metric for each consumer group (flowlet) of a queue.
The queue producer emits increments for this metric.
The queue consumer emits decrements for this metric.
When we clear the queues (reset queue state for a flow, for an app, or for a namespace), we reset the value of the metric to 0.
Hence this metric always represents the number of unconsumed events per consumer, and the UI can query that.

We need to make sure that the increments and decrements are emitted exactly once for each event. That can be a little tricky, because metrics emission is not transactional.
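
A minimal sketch of this design, and of why exactly-once emission matters, might look like the following. The class and method names are hypothetical and do not come from the CDAP metrics API; the counter is a plain in-memory integer.

```python
class PendingEventsMetric:
    """Hypothetical per-consumer-group counter of unconsumed events."""

    def __init__(self):
        self.value = 0

    def on_enqueue(self, count=1):
        # Emitted by the queue producer.
        self.value += count

    def on_dequeue(self, count=1):
        # Emitted by the queue consumer.
        self.value -= count

    def reset(self):
        # Called when the queues are cleared.
        self.value = 0


metric = PendingEventsMetric()
metric.on_enqueue(10)
metric.on_dequeue(3)
after_first_dequeue = metric.value

# Suppose the consumer's transaction is retried and the decrement for the
# same 3 events is emitted a second time: the counter now under-reports.
metric.on_dequeue(3)
after_duplicate = metric.value

metric.reset()
```

After the duplicate emission the counter reports fewer pending events than the 7 that are still unconsumed, which is the kind of drift that non-transactional emission can introduce.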

Fixed

Details

Assignee

Andreas Neumann

Reporter

Andreas Neumann

Priority

Major

Created April 20, 2015 at 9:18 PM

Updated August 4, 2015 at 7:57 PM

Resolved May 2, 2015 at 10:52 PM
