CDAP handling HDFS HA incorrectly

Description

In an HDFS HA setup, if nn1 is in standby mode, Hydrator fails to deploy a pipeline with the following error:
2017-02-08 23:48:56,047 - WARN [appfabric-executor-19:o.a.h.i.r.RetryInvocationHandler@217] - Exception while invoking ClientNamenodeProtocolTranslatorPB.getServerDefaults over null. Not retrying because try once and fail.
org.apache.hadoop.ipc.RemoteException: Operation category READ is not supported in state standby
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1979)

Failing the NameNode over fixes the problem.
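For context on the error: an HDFS HA client only fails over between NameNodes when it is addressed through the logical nameservice; a client pinned to a specific NameNode host gets no failover proxy, and a standby NameNode rejects read operations outright. A minimal sketch of that contrast, assuming a standard HA client configuration; the nameservice name and hostname below are placeholders, not values from the affected cluster:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientContrast {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // loads hdfs-site.xml with the HA settings

    // Logical nameservice URI: resolved through
    // dfs.client.failover.proxy.provider.<nameservice>, so the client
    // retries against whichever NameNode is currently active.
    FileSystem ha = FileSystem.get(URI.create("hdfs://mycluster/"), conf);
    ha.exists(new Path("/cdap"));

    // Host-pinned URI: no failover proxy applies to a plain host:port
    // authority; if nn1 is standby, this fails with
    // "Operation category READ is not supported in state standby".
    FileSystem pinned = FileSystem.get(URI.create("hdfs://nn1.example.com:8020/"), conf);
    pinned.exists(new Path("/cdap"));
  }
}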

Release Notes

None

Activity


Ali Anwar, February 20, 2017 at 11:01 PM

Updated the affectsVersions and the fixVersions, as those other versions are being tracked by the JIRAs mentioned in my previous comment.

Ali Anwar, February 16, 2017 at 9:36 PM

This is effectively a duplicate of CDAP-4739.
The task to backport it is tracked in CDAP-8343, which should then resolve this JIRA.

Ali Anwar, February 16, 2017 at 9:13 PM (edited)

I deployed PurchaseApp and was able to run PurchaseWorkflow consistently, whereas any workflows based on system artifacts (such as Hydrator pipelines) consistently fail to start.

I scanned 'artifact.meta' in HBase and found a difference in how the artifacts' URIs are stored:
PurchaseApp (deployed after enabling NameNode HA):

"locationURI":"hdfs://<dfs.nameservices>/cdap/namespaces/default/artifacts/Purchase/4.0.0-SNAPSHOT.14f8b638-2116-43d2-bad6-d15af75d93eb.jar"

System artifact (deployed before enabling NameNode HA):

"locationURI":"hdfs://<HOSTNAME-OF-NN1>:8020/cdap/namespaces/default/artifacts/hive-plugins/1.5.0.06631113-55b7-4d89-958f-956e64c69a7c.jar"

So, applications/artifacts deployed after enabling NameNode HA work fine.
A workaround for this issue is to delete the system artifacts and reload them.
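For illustration only (the actual fix is whatever CDAP-4739 / CDAP-8343 implement): a sketch of rewriting a stored locationURI from a host-pinned authority to the logical nameservice, which is the shape the post-HA entries above already have. The hostname and nameservice here are placeholders:

import java.net.URI;
import java.net.URISyntaxException;

public class LocationUriNormalizer {
  // Rewrites hdfs://<host>:<port>/path to hdfs://<nameservice>/path,
  // leaving non-HDFS URIs untouched.
  static URI toNameservice(URI stored, String nameservice) throws URISyntaxException {
    if (!"hdfs".equals(stored.getScheme())) {
      return stored;
    }
    return new URI("hdfs", nameservice, stored.getPath(), null, null);
  }

  public static void main(String[] args) throws Exception {
    // Placeholder URI shaped like the system-artifact entry above:
    URI stale = URI.create("hdfs://nn1.example.com:8020/cdap/namespaces/default/artifacts/hive-plugins/1.5.0.jar");
    // Prints hdfs://mycluster/cdap/namespaces/default/artifacts/hive-plugins/1.5.0.jar
    System.out.println(toNameservice(stale, "mycluster"));
  }
}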

Ali Anwar, February 10, 2017 at 2:12 AM (edited)

Oddly, stream create and stream ingest work against HDFS, but pipeline start fails:

2017-02-10 02:11:32,152 - WARN [appfabric-executor-1130:o.a.h.i.r.RetryInvocationHandler@217] - Exception while invoking ClientNamenodeProtocolTranslatorPB.getServerDefaults over null. Not retrying because try once and fail.
org.apache.hadoop.ipc.RemoteException: Operation category READ is not supported in state standby
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1979)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1345)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getServerDefaults(FSNamesystem.java:1657)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getServerDefaults(NameNodeRpcServer.java:700)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getServerDefaults(ClientNamenodeProtocolServerSideTranslatorPB.java:391)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1552) ~[hadoop-common-2.7.3.2.5.3.0-37.jar:na]
    at org.apache.hadoop.ipc.Client.call(Client.java:1496) ~[hadoop-common-2.7.3.2.5.3.0-37.jar:na]
    at org.apache.hadoop.ipc.Client.call(Client.java:1396) ~[hadoop-common-2.7.3.2.5.3.0-37.jar:na]
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) ~[hadoop-common-2.7.3.2.5.3.0-37.jar:na]
    at com.sun.proxy.$Proxy22.getServerDefaults(Unknown Source) ~[na:na]
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getServerDefaults(ClientNamenodeProtocolTranslatorPB.java:283) ~[hadoop-hdfs-2.7.3.2.5.3.0-37.jar:na]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_60]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_60]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_60]
    at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_60]
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278) [hadoop-common-2.7.3.2.5.3.0-37.jar:na]
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194) [hadoop-common-2.7.3.2.5.3.0-37.jar:na]
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176) [hadoop-common-2.7.3.2.5.3.0-37.jar:na]
    at com.sun.proxy.$Proxy23.getServerDefaults(Unknown Source) [na:na]
    at org.apache.hadoop.hdfs.DFSClient.getServerDefaults(DFSClient.java:1014) [hadoop-hdfs-2.7.3.2.5.3.0-37.jar:na]
    at org.apache.hadoop.fs.Hdfs.getServerDefaults(Hdfs.java:158) [hadoop-hdfs-2.7.3.2.5.3.0-37.jar:na]
    at org.apache.hadoop.fs.AbstractFileSystem.open(AbstractFileSystem.java:629) [hadoop-common-2.7.3.2.5.3.0-37.jar:na]
    at org.apache.hadoop.fs.FileContext$6.next(FileContext.java:797) [hadoop-common-2.7.3.2.5.3.0-37.jar:na]
    at org.apache.hadoop.fs.FileContext$6.next(FileContext.java:793) [hadoop-common-2.7.3.2.5.3.0-37.jar:na]
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) [hadoop-common-2.7.3.2.5.3.0-37.jar:na]
    at org.apache.hadoop.fs.FileContext.open(FileContext.java:793) [hadoop-common-2.7.3.2.5.3.0-37.jar:na]
    at org.apache.twill.filesystem.FileContextLocation.getInputStream(FileContextLocation.java:108) [org.apache.twill.twill-yarn-0.9.0.jar:0.9.0]
    at co.cask.cdap.common.io.Locations.linkOrCopy(Locations.java:227) [na:na]
    at co.cask.cdap.internal.app.runtime.distributed.DistributedProgramRuntimeService$1.call(DistributedProgramRuntimeService.java:127) [na:na]
    at co.cask.cdap.internal.app.runtime.distributed.DistributedProgramRuntimeService$1.call(DistributedProgramRuntimeService.java:124) [na:na]
    at co.cask.cdap.common.security.ImpersonationUtils$1.run(ImpersonationUtils.java:46) [na:na]
    at java.security.AccessController.doPrivileged(Native Method) [na:1.8.0_60]
    at javax.security.auth.Subject.doAs(Subject.java:422) [na:1.8.0_60]
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724) [hadoop-common-2.7.3.2.5.3.0-37.jar:na]
    at co.cask.cdap.common.security.ImpersonationUtils.doAs(ImpersonationUtils.java:43) [na:na]
    at co.cask.cdap.common.security.DefaultImpersonator.doAs(DefaultImpersonator.java:60) [na:na]
    at co.cask.cdap.internal.app.runtime.distributed.DistributedProgramRuntimeService.copyArtifact(DistributedProgramRuntimeService.java:124) [na:na]
    at co.cask.cdap.app.runtime.AbstractProgramRuntimeService.createPluginSnapshot(AbstractProgramRuntimeService.java:235) [na:na]
    at co.cask.cdap.app.runtime.AbstractProgramRuntimeService.run(AbstractProgramRuntimeService.java:116) [na:na]
    at co.cask.cdap.internal.app.services.ProgramLifecycleService.start(ProgramLifecycleService.java:322) [na:na]
    at co.cask.cdap.internal.app.services.ProgramLifecycleService.start(ProgramLifecycleService.java:295) [na:na]
    at co.cask.cdap.gateway.handlers.ProgramLifecycleHttpHandler.doPerformAction(ProgramLifecycleHttpHandler.java:348) [na:na]
    at co.cask.cdap.gateway.handlers.ProgramLifecycleHttpHandler.performAction(ProgramLifecycleHttpHandler.java:312) [na:na]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_60]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_60]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_60]
    at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_60]
    at co.cask.http.HttpMethodInfo.invoke(HttpMethodInfo.java:80) [co.cask.http.netty-http-0.16.0.jar:na]
    at co.cask.http.HttpDispatcher.messageReceived(HttpDispatcher.java:38) [co.cask.http.netty-http-0.16.0.jar:na]
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [io.netty.netty-3.6.6.Final.jar:na]
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [io.netty.netty-3.6.6.Final.jar:na]
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) [io.netty.netty-3.6.6.Final.jar:na]
    at org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43) [io.netty.netty-3.6.6.Final.jar:na]
    at org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67) [io.netty.netty-3.6.6.Final.jar:na]
    at org.jboss.netty.handler.execution.OrderedMemoryAwareThreadPoolExecutor$ChildExecutor.run(OrderedMemoryAwareThreadPoolExecutor.java:314) [io.netty.netty-3.6.6.Final.jar:na]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]

Log cleanup also has the same issue:

...
    at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2158)
    at org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:132)
    at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1171)
    at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1167)
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
    at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1167)
    at org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1632)
    at org.apache.twill.filesystem.FileContextLocation.exists(FileContextLocation.java:62)
    at co.cask.cdap.logging.write.LogCleanup.cleanupFiles(LogCleanup.java:179)
    at co.cask.cdap.logging.write.LogCleanup.run(LogCleanup.java:94)
...
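Both traces bottom out in org.apache.twill.filesystem.FileContextLocation, which resolves whatever URI was stored for the location. A minimal repro of the failing open under the stale-URI theory from the comment above; the hostname and jar path are placeholders, and it assumes an HA-enabled hdfs-site.xml on the classpath with nn1 in standby:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class StaleArtifactUriRepro {
  public static void main(String[] args) throws Exception {
    FileContext fc = FileContext.getFileContext(new Configuration());
    // Placeholder URI shaped like the system-artifact entry in artifact.meta:
    Path jar = new Path(URI.create(
        "hdfs://nn1.example.com:8020/cdap/namespaces/default/artifacts/hive-plugins/1.5.0.jar"));
    // The host-pinned authority bypasses the failover proxy, so this throws
    // RemoteException: Operation category READ is not supported in state standby.
    fc.open(jar);
  }
}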

Poorna Chandra, February 9, 2017 at 12:20 AM

The WARN in the description is an HDFS client error. Do you have any more logs or stack traces from CDAP?

Resolution: Duplicate

Details

Created February 8, 2017 at 11:59 PM
Updated May 2, 2017 at 10:53 PM
Resolved February 22, 2017 at 8:06 PM