Metrics Processor shut down, without an apparent reason

Description

Right after CDAP master starts and launches metrics processor, the metrics processor shows 'shutdown triggered'. The container is not restarted, and no metrics are processed.
The program logs for the metrics processor container are attached.

HDP 2.4
Single-node CDAP with security enabled, and secure Hadoop
CDAP autobuilt off commit: 5a77a926637a3e0004f5e8a8d05b1011c10870ee

Release Notes

Fixed a race condition in ResourceCoordinator that prevented partition assignments from being performed in the correct order. It affected the metrics processor and the stream coordinator.

Activity

Shankar Selvam
April 13, 2016, 9:44 PM

The resource balancer service received a partitions-changed notification with an empty partition list, which caused the immediate shutdown. This could be a race condition; we need to investigate why we received an empty list. The above PR was not relevant, so it was closed.

Priyanka Nambiar
April 13, 2016, 9:49 PM

Moving to 3.5, but we should definitely do it in 3.5. Marking it a blocker.

Shankar Selvam
May 2, 2016, 10:23 PM

To clarify further:

Typically we get the partitions-changed messages in this order: the empty partition list first, then the non-empty one.

This is fine: the service is not started while the partition list is empty, and it is started once the partitions change to a non-empty list. This is the working scenario.

In the failure case, we get these partitions-changed messages in the reverse order.

In this case, the metrics processor is started on the first message, and when the empty partition list arrives next, we stop the running metrics.processor service.

Since we receive no further partitions-changed messages after this, the metrics.processor service remains stopped.

This is most likely a race condition in the delivery order of the partitions-changed messages.
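
The two delivery orders described above can be replayed with a minimal sketch. `PartitionChangeDemo`, `onPartitionsChanged`, and `replay` are hypothetical names made up for illustration; this is not the actual CDAP resource balancer code, only a stand-in that starts on a non-empty list and stops on an empty one.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the handler described above; not CDAP's actual code. */
public class PartitionChangeDemo {
  private boolean running = false;

  /** Starts the service on a non-empty assignment and stops it on an empty one. */
  void onPartitionsChanged(List<Integer> partitions) {
    running = !partitions.isEmpty();
  }

  boolean isRunning() {
    return running;
  }

  /** Delivers the notifications in the given order and reports the final state. */
  static boolean replay(List<List<Integer>> notifications) {
    PartitionChangeDemo handler = new PartitionChangeDemo();
    for (List<Integer> partitions : notifications) {
      handler.onPartitionsChanged(partitions);
    }
    return handler.isRunning();
  }

  public static void main(String[] args) {
    List<Integer> empty = Collections.emptyList();
    List<Integer> assigned = Arrays.asList(0, 1);
    // Working case: the empty list arrives first, then the real assignment.
    System.out.println("expected order, running = " + replay(Arrays.asList(empty, assigned)));
    // Failure case: the same two messages arrive in reverse order.
    System.out.println("reversed order, running = " + replay(Arrays.asList(assigned, empty)));
  }
}
```

With the same two messages, only the order decides whether the service ends up running, which matches the observation that the processor stays stopped until another notification arrives.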

Terence Yim
May 4, 2016, 12:20 AM

So there is a race condition in the usage of the `ZKExtOperations.setOrCreate` method that causes a later update to the ZK node to be overwritten by an earlier one.
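
To illustrate the kind of lost update described here, the following is a rough in-memory sketch. It uses neither ZooKeeper nor `ZKExtOperations`; `LostUpdateDemo`, `Versioned`, `setUnconditionally`, and `setIfVersion` are hypothetical names. The version check shown is one common way to reject stale writes and is an assumption, not necessarily the fix that was applied for this issue.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical in-memory stand-in for a versioned node; not the ZooKeeper API. */
public class LostUpdateDemo {
  /** Immutable snapshot of the node: its data plus a version bumped on every write. */
  static final class Versioned {
    final String data;
    final int version;

    Versioned(String data, int version) {
      this.data = data;
      this.version = version;
    }
  }

  private final AtomicReference<Versioned> node =
      new AtomicReference<>(new Versioned("", 0));

  Versioned read() {
    return node.get();
  }

  /** Unconditional write: whichever write lands last wins, even if it is stale. */
  void setUnconditionally(String data) {
    node.updateAndGet(cur -> new Versioned(data, cur.version + 1));
  }

  /** Conditional write: succeeds only if the node is still at the expected version. */
  boolean setIfVersion(String data, int expectedVersion) {
    Versioned cur = node.get();
    return cur.version == expectedVersion
        && node.compareAndSet(cur, new Versioned(data, cur.version + 1));
  }

  /** The later update lands first; the delayed earlier update then overwrites it. */
  static String unconditionalOutcome() {
    LostUpdateDemo zk = new LostUpdateDemo();
    zk.setUnconditionally("[0,1]"); // later update, delivered first
    zk.setUnconditionally("[]");    // earlier update, delivered last
    return zk.read().data;
  }

  /** Both writers saw version 0; the delayed stale write is rejected. */
  static String conditionalOutcome() {
    LostUpdateDemo zk = new LostUpdateDemo();
    int seen = zk.read().version;
    zk.setIfVersion("[0,1]", seen); // succeeds and bumps the version to 1
    zk.setIfVersion("[]", seen);    // rejected because the version has moved on
    return zk.read().data;
  }

  public static void main(String[] args) {
    System.out.println("unconditional: " + unconditionalOutcome());
    System.out.println("conditional:   " + conditionalOutcome());
  }
}
```

In the unconditional case the node ends up holding the stale empty list, which is the state that would trigger the shutdown described in the earlier comments.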

Terence Yim
May 6, 2016, 5:10 PM

Assignee

Terence Yim

Reporter

Ali Anwar

Labels

None

Docs Impact

None

UX Impact

None

Components

Fix versions

Priority

Blocker