Steps to reproduce -
1. Start Workflow.
2. Once the driver is running, stop the underlying HBase master/region server processes.
3. Restart the HBase processes before the timeout.
4. WorkflowDriver container will hang indefinitely.
Fixed an issue where the Workflow driver container hung indefinitely if an exception was thrown by a listener attached to its controller.
The following exceptions can be seen in the log -
This exception is the result of not retrying the remote operation service, which is tracked in https://issues.cask.co/browse/CDAP-8473. However, the retries may themselves eventually fail, in which case we would still see this issue.
jstack shows that a non-daemon thread is waiting -
The following seems to be the root cause of this issue -
1. When run in distributed mode, we attach two listeners to the Workflow program controller. One listener is attached in AbstractProgramTwillRunnable and the other is attached in WorkflowProgramRunner.
2. The listener attached in WorkflowProgramRunner is executed before the one in AbstractProgramTwillRunnable.
3. When HBase is down, the error() method on the controller gets called. In WorkflowProgramRunner, this error is used to update the store. However, since HBase is down, this update throws an exception.
4. Since this exception is not caught anywhere, the error callback in AbstractProgramTwillRunnable never gets called and the container keeps hanging.
One possible fix: where we call the error callback on the listeners (https://github.com/caskdata/cdap/blob/release/4.1/cdap-app-fabric/src/main/java/co/cask/cdap/internal/app/runtime/AbstractProgramController.java#L401), we can catch Exception, log it, and continue so that the next listener still gets called.
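A minimal sketch of the catch, log and continue idea follows. It is not the actual AbstractProgramController code: the class ListenerDispatchSketch, the ErrorListener interface and the notifyError method are hypothetical stand-ins, and java.util.logging is used only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-in for the controller's listener dispatch; not the real CDAP class. */
class ListenerDispatchSketch {
  private static final Logger LOG = Logger.getLogger(ListenerDispatchSketch.class.getName());

  /** Hypothetical listener type that exposes only the error callback. */
  interface ErrorListener {
    void error(Throwable cause);
  }

  /**
   * Calls error() on every listener in order. An exception thrown by one
   * listener is logged and swallowed so that the remaining listeners still run.
   * Returns the number of listeners that completed without throwing.
   */
  static int notifyError(List<ErrorListener> listeners, Throwable cause) {
    int succeeded = 0;
    for (ErrorListener listener : listeners) {
      try {
        listener.error(cause);
        succeeded++;
      } catch (Exception e) {
        LOG.log(Level.WARNING, "Listener failed in error callback; continuing with next listener", e);
      }
    }
    return succeeded;
  }

  public static void main(String[] args) {
    List<String> calls = new ArrayList<>();

    // Stands in for the WorkflowProgramRunner listener whose store update fails while HBase is down.
    ErrorListener storeUpdater = cause -> {
      calls.add("store");
      throw new IllegalStateException("store unavailable");
    };
    // Stands in for the AbstractProgramTwillRunnable listener that lets the container finish.
    ErrorListener containerReleaser = cause -> calls.add("container");

    notifyError(Arrays.asList(storeUpdater, containerReleaser), new RuntimeException("program failed"));
    System.out.println(calls); // prints [store, container]
  }
}
```

With the try/catch removed, the exception from the first listener would propagate out of the loop and the second listener would never be invoked, which matches the hang described above.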