
DefaultNamespaceEnsurer doesn't let CDAP Master shut down

Description

DefaultNamespaceEnsurer makes sure that the default namespace exists. If that fails, it silently retries, because the dataset service may not be ready yet but could become available within a few minutes. This can lead to infinite retries in the following cases:

  1. CDAP does not have permissions to create default namespace

  2. CDAP is stopped soon after startup (before all services have started successfully)

The infinite retries prevent CDAP Master from shutting down completely. As a result, CDAP Master can hang when shutting down, or when it becomes a follower after losing its ZooKeeper session.

Release Notes

Fixed an issue where DefaultNamespaceEnsurer sometimes prevented CDAP Master shutdown.

Activity

Ali Anwar
November 9, 2016, 11:10 PM

Allow RetryOnStartFailureService to be interrupted, instead of suppressing an interrupt and continuing to retry: https://github.com/caskdata/cdap/pull/7085
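
As a rough illustration of the idea in this comment (not the code from the linked pull request), a retry loop can treat an interrupt as a signal to stop instead of suppressing it. The class name `InterruptibleRetry` and the method `retryUntilSuccess` are hypothetical names used only for this sketch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class InterruptibleRetry {

  /**
   * Calls the task repeatedly until it succeeds.
   * Returns true on success, or false if the calling thread was interrupted.
   */
  public static boolean retryUntilSuccess(Callable<Void> task, long delayMillis) {
    while (true) {
      try {
        task.call();
        return true;
      } catch (InterruptedException e) {
        // Restore the interrupt flag and give up instead of retrying.
        Thread.currentThread().interrupt();
        return false;
      } catch (Exception e) {
        // Any other failure: fall through and retry after a delay.
      }
      try {
        Thread.sleep(delayMillis);
      } catch (InterruptedException e) {
        // Interrupted while waiting between attempts: stop retrying.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    AtomicInteger attempts = new AtomicInteger();
    // A task that never succeeds, standing in for a service that is not ready.
    Thread worker = new Thread(() -> retryUntilSuccess(() -> {
      attempts.incrementAndGet();
      throw new IllegalStateException("dataset service not ready");
    }, 50));
    worker.start();
    Thread.sleep(200);
    worker.interrupt();   // request shutdown
    worker.join(2000);
    System.out.println("worker alive after interrupt: " + worker.isAlive());
  }
}
```

The important detail is that both places where an `InterruptedException` can surface end the loop and restore the interrupt flag, so callers further up the stack can still observe the interrupt.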

Ali Anwar
December 12, 2016, 10:22 PM

I've observed the same issue still happening (on 4.0.0-SNAPSHOT), so reopening.

Leonid Fedotov
January 23, 2017, 7:06 PM

We should be able to explicitly kill the master services and the master process after a defined timeout (for example, 1-2 minutes).
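
One way to picture this suggestion is a bounded wait around the graceful stop, followed by a forced exit if the wait expires. This is only a sketch under assumptions: `BoundedShutdown` and `stopWithin` are hypothetical names, and nothing here reflects how CDAP Master actually manages its services.

```java
import java.util.concurrent.TimeUnit;

public class BoundedShutdown {

  /**
   * Interrupts the worker and waits up to the given timeout for it to exit.
   * Returns true if the worker stopped in time.
   */
  public static boolean stopWithin(Thread worker, long timeout, TimeUnit unit) {
    worker.interrupt();
    try {
      // join(0) would wait forever, so always wait at least one millisecond.
      worker.join(Math.max(1L, unit.toMillis(timeout)));
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return !worker.isAlive();
  }

  public static void main(String[] args) {
    // A worker that ignores interrupts, standing in for a stuck service.
    Thread stuck = new Thread(() -> {
      while (true) {
        try {
          Thread.sleep(1000);
        } catch (InterruptedException ignored) {
          // Swallowed on purpose to simulate the hang.
        }
      }
    });
    stuck.setDaemon(true);
    stuck.start();

    if (!stopWithin(stuck, 2, TimeUnit.SECONDS)) {
      System.err.println("Graceful stop timed out; forcing exit");
      Runtime.getRuntime().halt(1);
    }
  }
}
```

A forced exit like this is a last resort: it guarantees the process goes away, but it does not fix whatever kept the worker from stopping.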

Leonid Fedotov
January 23, 2017, 8:50 PM

Are there plans to backport to 3.5.3?

Ali Anwar
February 6, 2017, 11:46 PM

Even though we interrupt the thread and handle the interrupt appropriately in RetryOnStartFailureService.java, a class in Tephra catches and suppresses the InterruptedException, which is why the interrupt is not handled as expected:
https://github.com/apache/incubator-tephra/blob/release/0.10.0-incubating/tephra-core/src/main/java/org/apache/tephra/distributed/RetryWithBackoff.java#L67

Fixing this case by using a 'stopped' flag:
Against release/3.5 for 3.5.4: https://github.com/caskdata/cdap/pull/7826
Against release/4.0 for 4.0.2: https://github.com/caskdata/cdap/pull/7990
Fix brought into release/4.1 via: https://github.com/caskdata/cdap/pull/7997
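
To show why a flag helps when a lower layer swallows interrupts, here is a minimal sketch. It is not the code from the pull requests above: `StoppableRetry` and `sleepSwallowingInterrupt` are hypothetical names, and the helper only imitates the interrupt-suppressing behavior described in this comment.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class StoppableRetry {
  // volatile so that a stop() call from another thread is seen by the loop.
  private volatile boolean stopped;
  private final AtomicInteger attempts = new AtomicInteger();

  /** Hypothetical stand-in for a library call that suppresses interrupts. */
  private static void sleepSwallowingInterrupt(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      // Deliberately ignored to reproduce the problem: the interrupt is lost.
    }
  }

  public void stop() {
    stopped = true;
  }

  public int attempts() {
    return attempts.get();
  }

  /** Retries the action until it succeeds or stop() is called. */
  public void run(Runnable action, long delayMillis) {
    while (!stopped) {
      attempts.incrementAndGet();
      try {
        action.run();
        return;
      } catch (RuntimeException e) {
        sleepSwallowingInterrupt(delayMillis);
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    StoppableRetry retry = new StoppableRetry();
    Thread worker = new Thread(() -> retry.run(() -> {
      throw new IllegalStateException("cannot create default namespace");
    }, 20));
    worker.start();

    Thread.sleep(100);
    worker.interrupt();   // has no lasting effect: the helper swallows it
    Thread.sleep(100);
    System.out.println("alive after interrupt: " + worker.isAlive());

    retry.stop();         // the flag is checked on every loop iteration
    worker.join(2000);
    System.out.println("alive after stop(): " + worker.isAlive());
  }
}
```

Because the loop condition reads the flag directly, the loop ends on its next iteration regardless of whether any interrupt survived the call chain.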

Fixed

Assignee

Ali Anwar

Reporter

Ali Anwar

Labels

Docs Impact

None

UX Impact

None

Components

Fix versions

Affects versions

Priority

Major