DefaultNamespaceEnsurer makes sure that the default namespace exists. If it fails, it silently retries, because the dataset service may not be ready yet and may become available within a few minutes. This can lead to infinite retries in the following cases:
CDAP does not have permissions to create default namespace
CDAP is stopped soon after startup (before all services have started successfully)
The infinite retries prevent CDAP Master from shutting down completely. As a result, CDAP Master can hang during shutdown, or when it becomes a follower after losing its ZooKeeper session.
Fixed an issue where DefaultNamespaceEnsurer sometimes prevented CDAP Master shutdown.
Allow RetryOnStartFailureService to be interrupted, instead of suppressing an interrupt and continuing to retry: https://github.com/caskdata/cdap/pull/7085
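To illustrate the idea of a retry loop that can be interrupted rather than swallowing the interrupt, here is a minimal sketch. It is not the code from the pull request: the class InterruptibleRetry, the method retryUntilSuccess, and the fixed delay are hypothetical names and choices made only for this example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not the actual RetryOnStartFailureService.
final class InterruptibleRetry {

  private InterruptibleRetry() { }

  /**
   * Calls the task until it returns true and reports how many attempts were made.
   * An interrupt ends the loop by propagating InterruptedException to the caller.
   */
  static int retryUntilSuccess(Callable<Boolean> task, long delayMillis)
      throws InterruptedException {
    int attempts = 0;
    while (true) {
      // Check the interrupt flag before every attempt instead of ignoring it.
      if (Thread.interrupted()) {
        throw new InterruptedException("retry loop interrupted");
      }
      attempts++;
      try {
        if (task.call()) {
          return attempts;
        }
      } catch (InterruptedException e) {
        // Do not suppress: let the caller see the interrupt and stop.
        throw e;
      } catch (Exception e) {
        // Any other failure is treated as retryable in this sketch.
      }
      // sleep() also throws InterruptedException, which ends the loop.
      TimeUnit.MILLISECONDS.sleep(delayMillis);
    }
  }
}
```

With this shape, a shutdown that interrupts the retrying thread ends the loop, as long as nothing called inside the task catches and discards the interrupt.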
I've observed the same issue still happening (on 4.0.0-SNAPSHOT), so reopening.
We should be able to explicitly kill the master services and the master process after a defined timeout (for example, 1 to 2 minutes).
Are there plans to backport this to 3.5.3?
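One possible shape for such a bounded shutdown is sketched below, assuming the services run on an ExecutorService. The class BoundedShutdown, the method stopWithTimeout, and the forceKill callback are hypothetical; in a real process the callback might halt the JVM, but that choice is an assumption of this sketch and not something CDAP is known to do.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only.
final class BoundedShutdown {

  private BoundedShutdown() { }

  /**
   * Interrupts running tasks and waits up to the given timeout for them to finish.
   * Returns true on a clean stop. Otherwise runs forceKill and returns false.
   */
  static boolean stopWithTimeout(ExecutorService services, long timeout, TimeUnit unit,
                                 Runnable forceKill) throws InterruptedException {
    // shutdownNow() interrupts worker threads and drops tasks that have not started.
    services.shutdownNow();
    if (services.awaitTermination(timeout, unit)) {
      return true;
    }
    // Tasks ignored the interrupt: fall back to the caller-supplied kill action.
    forceKill.run();
    return false;
  }
}
```

The kill action is passed in so that the waiting logic stays separate from the decision about how the process is terminated.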
Even though we interrupt the thread and handle the interrupt appropriately in RetryOnStartFailureService.java, a class in Tephra catches and suppresses the InterruptedException, so the interrupt is not handled as expected:
https://github.com/apache/incubator-tephra/blob/release/0.10.0-incubating/tephra-core/src/main/java/org/apache/tephra/distributed/RetryWithBackoff.java#L67
Fixing this case by using a 'stopped' flag:
Against release/3.5 for 3.5.4: https://github.com/caskdata/cdap/pull/7826
Against release/4.0 for 4.0.2: https://github.com/caskdata/cdap/pull/7990
Fix brought into release/4.1 via: https://github.com/caskdata/cdap/pull/7997
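For readers unfamiliar with the 'stopped' flag approach mentioned above, the following is a minimal sketch of the general technique, not the code from the linked pull requests. The class StoppableRetry and its methods are hypothetical, and sleepSuppressingInterrupt is a stand-in for any library helper that swallows InterruptedException.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical class for illustration only.
final class StoppableRetry {

  // volatile so that a write from the stopping thread is visible to the retry thread.
  private volatile boolean stopped;

  /** Asks the retry loop to exit after its current attempt and sleep. */
  void stop() {
    stopped = true;
  }

  /** Returns true if an attempt succeeded, false if stop() was called first. */
  boolean run(BooleanSupplier attempt, long delayMillis) {
    while (!stopped) {
      if (attempt.getAsBoolean()) {
        return true;
      }
      sleepSuppressingInterrupt(delayMillis);
    }
    return false;
  }

  // Stand-in for a helper that catches and discards the interrupt.
  private static void sleepSuppressingInterrupt(long millis) {
    try {
      TimeUnit.MILLISECONDS.sleep(millis);
    } catch (InterruptedException e) {
      // Swallowed on purpose to mimic the problematic behavior.
    }
  }
}
```

Because the loop condition reads the flag on every iteration, the loop ends even when an interrupt is lost inside the helper. Interrupting the thread alone would not end it.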