Right after the CDAP master starts and launches the metrics processor, the metrics processor shows 'shutdown triggered'. The container is not restarted, and no metrics are processed.
Attached are the program logs for the metrics processor container.
Single-node CDAP with security, and secure Hadoop
CDAP autobuilt off commit: 5a77a926637a3e0004f5e8a8d05b1011c10870ee
Fixed a race condition bug in ResourceCoordinator that prevented performing partition assignment in the correct order. It affects the metrics processor and stream coordinator.
The resource balancer service received a partitions-changed notification with an empty partition list, which caused the immediate shutdown. This could be a race condition; we need to investigate why we received an empty list. The above PR was not relevant, so it has been closed.
Moving to 3.5, but we should definitely do it in 3.5. Marking it as a blocker.
To clarify further,
typically we get the partitions-changed messages in this order
This is fine, because the service is not started while the partition list is empty, and it is started once the partitions have changed and the list is non-empty. This is the working scenario.
In the failure case, we get these partitions-changed messages in the reverse order,
In this case, the metrics processor is started on the first message, and when the message with the empty partition list arrives next, we stop the running metrics.processor service.
Since we receive no further partitions-changed message after this, the metrics.processor service remains stopped.
This is most likely a race condition in the delivery order of the partitions-changed messages.
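To make the ordering problem concrete, here is a minimal sketch of the start/stop behaviour described above. `PartitionOrderDemo`, `onPartitionsChanged`, and `replay` are hypothetical names used for illustration only; this is not the actual CDAP handler code, just the rule "start on a non-empty list, stop on an empty one" replayed in both orders.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the handler behaviour described in this ticket; not CDAP code.
class PartitionOrderDemo {
  private boolean running = false;

  // Assumed rule: run the service only while the assigned partition list is non-empty.
  void onPartitionsChanged(List<Integer> partitions) {
    running = !partitions.isEmpty();
  }

  boolean isRunning() {
    return running;
  }

  // Feeds the notifications in the given order and reports the final service state.
  static boolean replay(List<List<Integer>> notifications) {
    PartitionOrderDemo handler = new PartitionOrderDemo();
    for (List<Integer> partitions : notifications) {
      handler.onPartitionsChanged(partitions);
    }
    return handler.isRunning();
  }

  public static void main(String[] args) {
    List<Integer> empty = Collections.emptyList();
    List<Integer> assigned = Arrays.asList(0, 1);

    // Working case: empty list first, then the real assignment.
    System.out.println("empty then assigned: running=" + replay(Arrays.asList(empty, assigned)));
    // Failure case: same two messages, reversed.
    System.out.println("assigned then empty: running=" + replay(Arrays.asList(assigned, empty)));
  }
}
```

With the same two messages, only the order decides whether the service ends up running, which matches the observation that the processor stays stopped when no later message arrives.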
So there is a race condition in the usage of the `ZKExtOperations.setOrCreate` method that causes a later update to be overwritten by an earlier update to the ZK node.
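The general shape of this kind of lost update can be sketched in memory. This does not reproduce `ZKExtOperations.setOrCreate` or any ZooKeeper API; `StaleWriteDemo`, `blindSet`, and `setIfNewer` are hypothetical names, and the sequence number is an assumption added for illustration. It shows an unconditional write letting an older update land last, and one common guard (rejecting writes that are not newer than the stored value).

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical in-memory model of a node holding data; not ZooKeeper or CDAP code.
class StaleWriteDemo {
  // Value plus the sequence number of the update that produced it.
  static final class Versioned {
    final long seq;
    final String data;

    Versioned(long seq, String data) {
      this.seq = seq;
      this.data = data;
    }
  }

  private final AtomicReference<Versioned> node =
      new AtomicReference<>(new Versioned(0, ""));

  // Unconditional write: whichever call lands last wins, even if it carries older data.
  void blindSet(long seq, String data) {
    node.set(new Versioned(seq, data));
  }

  // Guarded write: applied only when seq is newer than what is currently stored.
  boolean setIfNewer(long seq, String data) {
    while (true) {
      Versioned current = node.get();
      if (seq <= current.seq) {
        return false; // stale update, drop it
      }
      if (node.compareAndSet(current, new Versioned(seq, data))) {
        return true;
      }
      // Another writer got in first; re-read and re-check.
    }
  }

  String data() {
    return node.get().data;
  }

  public static void main(String[] args) {
    StaleWriteDemo blind = new StaleWriteDemo();
    blind.blindSet(2, "p0,p1"); // later update arrives first
    blind.blindSet(1, "");      // earlier update arrives last and overwrites it
    System.out.println("blind: '" + blind.data() + "'");

    StaleWriteDemo guarded = new StaleWriteDemo();
    guarded.setIfNewer(2, "p0,p1");
    guarded.setIfNewer(1, "");  // rejected as stale
    System.out.println("guarded: '" + guarded.data() + "'");
  }
}
```

Whether the actual fix takes this form is not stated in this ticket; the sketch only illustrates why write order matters when updates are applied unconditionally.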