WorkflowDriver container keeps hanging if the HBase master/region server is restarted.

Description

Steps to reproduce -
1. Start a Workflow.
2. Once the driver is running, stop the underlying HBase master/region server processes.
3. Start the HBase processes again before the timeout.
4. The WorkflowDriver container hangs indefinitely.

Release Notes

Fixed an issue where the Workflow driver container hung indefinitely if an exception was thrown by a listener attached to its controller.

Activity

Sagar Kapare
February 14, 2017, 1:07 AM
Edited

The following exceptions can be seen in the log -

This exception occurs because the remote operation service call is not retried, which is tracked in https://issues.cask.co/browse/CDAP-8473. However, the retries themselves may eventually fail, in which case we would still see this issue.

Sagar Kapare
February 14, 2017, 1:08 AM

jstack shows that a non-daemon thread is waiting -

Sagar Kapare
February 22, 2017, 8:38 PM

The following seems to be the root cause of this issue -
1. When run in distributed mode, we attach two listeners to the Workflow program controller. One listener is attached in AbstractProgramTwillRunnable and the other is attached in WorkflowProgramRunner.
2. The listener attached in WorkflowProgramRunner is executed before the one attached in AbstractProgramTwillRunnable.
3. When HBase is down, the error() method on the controller gets called. In WorkflowProgramRunner, the error callback is used to update the store. However, since HBase is down, this update throws an exception.
4. Since this exception is not caught anywhere, the error callback in AbstractProgramTwillRunnable never gets called and the container keeps hanging.

One possible fix: where we call the error callback on the listeners here - https://github.com/caskdata/cdap/blob/release/4.1/cdap-app-fabric/src/main/java/co/cask/cdap/internal/app/runtime/AbstractProgramController.java#L401, we can catch Exception, log it, and continue so that the next listener still gets called.
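
A minimal sketch of the "catch, log and continue" idea, not the actual CDAP code. ErrorListener, ListenerDispatch and notifyError are hypothetical names made up for illustration; the real dispatch lives in AbstractProgramController and may be structured differently. The first listener stands in for the store update that fails while HBase is down, and the second for the callback that lets the container shut down.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative only: ErrorListener and ListenerDispatch are hypothetical
 * stand-ins, not CDAP classes.
 */
public class ListenerDispatch {

  /** Hypothetical listener with a single error callback. */
  public interface ErrorListener {
    void error(Throwable cause);
  }

  /**
   * Calls the error callback on every listener in registration order.
   * A listener that throws is logged and skipped, so later listeners still run.
   *
   * @return the number of listeners whose callback completed normally
   */
  public static int notifyError(List<ErrorListener> listeners, Throwable cause) {
    int completed = 0;
    for (ErrorListener listener : listeners) {
      try {
        listener.error(cause);
        completed++;
      } catch (Exception e) {
        // Log and continue; without this catch the loop would stop here.
        System.err.println("Listener failed in error callback: " + e);
      }
    }
    return completed;
  }

  public static void main(String[] args) {
    List<String> calls = new ArrayList<>();
    List<ErrorListener> listeners = new ArrayList<>();

    // Stand-in for the store update that throws while HBase is down.
    listeners.add(cause -> {
      throw new IllegalStateException("store unavailable");
    });
    // Stand-in for the callback that lets the container terminate.
    listeners.add(cause -> calls.add("container shutdown"));

    int completed = notifyError(listeners, new RuntimeException("hbase down"));
    System.out.println(completed + " " + calls); // prints: 1 [container shutdown]
  }
}
```

With the try/catch removed, the IllegalStateException from the first listener would propagate out of the loop and the second listener would never be invoked, which matches the hang described in step 4 above.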

Sagar Kapare
February 23, 2017, 3:32 AM
Fixed

Assignee

Sagar Kapare

Reporter

Sagar Kapare

Labels

Docs Impact

None

UX Impact

None

Fix versions

Affects versions

Priority

Critical