With impersonation enabled, stopping or killing a program fails if the program was started more than X hours ago, where X is the Kerberos ticket lifetime.
The reason is that the YarnClient we use to launch the YARN application is the same one we later use to kill/stop it. That client is cached with the credentials it was created with, so once the ticket has expired its requests are no longer authenticated.
The fix is to avoid caching the YarnClient. The problem only occurs in namespaces with impersonation configured.
This can be reproduced by:
1. Set the principal's maxlife to a short duration (5 minutes). Restart CDAP, just to be sure the change is picked up.
2. Start a program (I used a Flow).
3. Wait 5+ minutes, then stop the flow. The above exception and stack trace will be in the master logs.
PR to fix this by avoiding caching the YarnClient and performing impersonation upon program stop/kill:
https://github.com/caskdata/cdap/pull/6926
Edit: pull/6926 was closed in favor of https://github.com/caskdata/cdap/pull/6931
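To illustrate why a cached client fails while a freshly created one works, here is a minimal, self-contained model. It is not Hadoop or CDAP code: TicketModel, Ticket, Client, newClient and simulate are hypothetical names, and the "ticket" is just an expiry timestamp standing in for the Kerberos credentials a real YarnClient would pick up when created under impersonation.

```java
import java.util.function.LongSupplier;

// Hypothetical model only: a client captures its credentials when it is
// created and cannot authenticate once those credentials have expired.
final class TicketModel {

    /** Stand-in for a Kerberos ticket: valid until a fixed point in time. */
    static final class Ticket {
        final long expiresAtMillis;

        Ticket(long expiresAtMillis) {
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    /** Stand-in for a YarnClient that keeps the ticket it was created with. */
    static final class Client {
        private final Ticket ticket;
        private final LongSupplier clock;

        Client(Ticket ticket, LongSupplier clock) {
            this.ticket = ticket;
            this.clock = clock;
        }

        /** Returns true if a kill request sent now would be authenticated. */
        boolean kill() {
            return clock.getAsLong() < ticket.expiresAtMillis;
        }
    }

    /** Stand-in for impersonating the user now and creating a new client. */
    static Client newClient(LongSupplier clock, long lifetimeMillis) {
        return new Client(new Ticket(clock.getAsLong() + lifetimeMillis), clock);
    }

    /**
     * Starts a program, waits, then tries to kill it with both strategies.
     * Returns {cachedClientSucceeded, freshClientSucceeded}.
     */
    static boolean[] simulate(long lifetimeMillis, long waitMillis) {
        long[] now = {0L};
        LongSupplier clock = () -> now[0];

        // Client created at program start and cached (the original behaviour).
        Client cached = newClient(clock, lifetimeMillis);

        // Time passes before the user asks to stop the program.
        now[0] += waitMillis;

        boolean cachedOk = cached.kill();
        // Client created at stop time, under a new impersonation (the fix).
        boolean freshOk = newClient(clock, lifetimeMillis).kill();
        return new boolean[] {cachedOk, freshOk};
    }
}
```

With a 5 minute lifetime and a 6 minute wait, as in the reproduction steps, the cached client is rejected and the fresh one is accepted. With a wait shorter than the lifetime both succeed, which is why the failure only shows up for long-running programs.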
When CDAP Master restarts, it creates a ProgramController for each running program. However, it does not do this under impersonation, so the YarnClient used in each ProgramController has the UGI of the cdap system principal.
If there's a YARN application that the cdap system principal does not have VIEW access to, this YarnClient will return an ApplicationReport whose getHost method returns "NA" and whose getPort method returns -1, although the application status is still correct (likely a gap in YARN security):
https://github.com/apache/hadoop/blob/branch-2.3/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java#L460-L528
Because of that, the URL that we create is malformed:
https://github.com/caskdata/cdap/blob/f80013ec41991eccc765290dea96ee1ba5dc1c83/cdap-app-fabric/src/main/java/org/apache/twill/yarn/YarnTwillController.java#L142
This results in the following logs periodically appearing for each such app. The fix would be to perform impersonation while creating these YarnClients.
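Separately from the impersonation fix, the placeholder values described above could be detected before a URL is built from them. The sketch below is only an illustration and not what YarnTwillController does: ReportUrls and buildResourcesUrl are hypothetical names, and the "/resources" path and "http" scheme are assumptions chosen for the example.

```java
import java.util.Optional;

// Hypothetical helper: refuses to build a URL from the placeholder values
// that YARN returns when the caller lacks VIEW access to the application.
final class ReportUrls {

    /** Placeholder host described in this issue. */
    static final String UNKNOWN_HOST = "NA";

    /**
     * Returns the URL string for the given host and port, or empty if the
     * report only contained placeholders ("NA" host or a negative port).
     */
    static Optional<String> buildResourcesUrl(String host, int port) {
        if (host == null || host.isEmpty() || UNKNOWN_HOST.equals(host) || port < 0) {
            return Optional.empty();
        }
        // The path is an assumption made for this example.
        return Optional.of(String.format("http://%s:%d/resources", host, port));
    }
}
```

A caller would skip the request (and the periodic log noise) when the result is empty. This only hides the symptom; creating the YarnClients under impersonation, as proposed above, addresses the cause.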