Hadoop classes from hive-exec are leaked to programs in the SDK

Description

When we create the classloader used for programs, we filter out all classes except those in cdap-api and those whose names start with org.apache.hadoop (excluding everything under org.apache.hadoop.hbase). This is done to isolate programs from CDAP's own dependencies.
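A minimal sketch of that filtering idea, assuming a hypothetical isVisibleToPrograms method; the real logic lives in CDAP's program classloader / ProgramResources, and the package names here are illustrative:

// Illustrative sketch only, not the actual CDAP implementation.
public final class ProgramClassFilterSketch {
  static boolean isVisibleToPrograms(String className) {
    // Classes from cdap-api are always visible (CDAP tracks these as a precomputed set).
    if (className.startsWith("co.cask.cdap.api.")) {
      return true;
    }
    // Hadoop classes are passed through, except everything under org.apache.hadoop.hbase.
    return className.startsWith("org.apache.hadoop.")
        && !className.startsWith("org.apache.hadoop.hbase.");
  }
}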

However, in the SDK, hive-exec is included on the SDK classpath because Explore runs in the same JVM as everything else. This means that Hive classes are exposed to user programs in the SDK, which can cause problems if different versions are in use. For example, if somebody wants to use the ORC file format and has orc-mapreduce 1.2.0 as a dependency in their pom, they will get an error:

java.lang.Exception: java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch.getDataColumnCount()I
	at org.apache.hadoop.mapred.LocalJobRunnerWithFix$Job.runTasks(LocalJobRunnerWithFix.java:465) ~[co.cask.cdap.cdap-app-fabric-3.5.0.jar:na]
	at org.apache.hadoop.mapred.LocalJobRunnerWithFix$Job.run(LocalJobRunnerWithFix.java:524) ~[co.cask.cdap.cdap-app-fabric-3.5.0.jar:na]
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch.getDataColumnCount()I
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1073) ~[orc-core-1.2.0.jar:1.2.0]
	at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:69) ~[orc-mapreduce-1.2.0.jar:1.2.0]
	at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:87) ~[orc-mapreduce-1.2.0.jar:1.2.0]
	at co.cask.cdap.internal.app.runtime.batch.dataset.input.DelegatingRecordReader.nextKeyValue(DelegatingRecordReader.java:84) ~[co.cask.cdap.cdap-app-fabric-3.5.0.jar:na]
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.3.0.jar:na]
	at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.3.0.jar:na]
	at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.3.0.jar:na]
	at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91) ~[org.apache.hadoop.hadoop-mapreduce-client-core-2.3.0.jar:na]

This is because OrcMapreduceRecordReader uses VectorizedRowBatch. A class with this same fully qualified name exists in both hive-storage-api and hive-exec, and the one in hive-exec gets through the program filter because it starts with org.apache.hadoop, so it is the one that gets used instead of the one from the user's jar.
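A quick way to confirm which jar the conflicting class actually resolves from at runtime (a diagnostic sketch only; it could be dropped into a mapper's setup method, for example):

import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

public final class WhichJarSketch {
  public static void main(String[] args) {
    // Prints the location of the jar that VectorizedRowBatch was loaded from.
    // In the SDK this points at hive-exec rather than the user's hive-storage-api jar,
    // which is why getDataColumnCount() is missing at runtime.
    System.out.println(
        VectorizedRowBatch.class.getProtectionDomain().getCodeSource().getLocation());
  }
}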

Release Notes

Avoid leaking Hive classes to programs in the CDAP SDK.

Activity


Ali Anwar December 6, 2016 at 7:23 PM

Our ProgramResources class only exposes classes in packages prefixed by 'org.apache.hadoop.', which is why the Hive classes were being exposed.
So, we don't need to worry about transitive dependencies (outside of Hadoop packages).

Andreas Neumann December 3, 2016 at 5:22 PM

What about transitive dependencies that get pulled in via Hive?

Ali Anwar December 2, 2016 at 2:34 AM

Resolving https://cdap.atlassian.net/browse/CDAP-4151 would likely resolve this issue also.
However, filtering Hive classes from the program classloader (except if the user packages Hive themselves) is a much simpler fix: https://github.com/caskdata/cdap/pull/7232
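(A rough sketch of that idea, not the exact code in the PR: stop passing the Hive packages through the filter, so a user's own Hive/ORC jars win when they package them. Names are illustrative, not the actual CDAP code.)

public final class TightenedFilterSketch {
  static boolean isVisibleToPrograms(String className) {
    if (className.startsWith("org.apache.hadoop.hive.")) {
      return false; // no longer leaked; resolved from the user's jar if they package it
    }
    return className.startsWith("org.apache.hadoop.")
        && !className.startsWith("org.apache.hadoop.hbase.");
  }
}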

Jeff Dix December 1, 2016 at 9:59 PM

Yes, we are not excluding hive-storage-api, so it is getting pulled in. The distributed version works, so this JIRA is just to fix the SDK (and the unit test framework, if needed).

Ali Anwar December 1, 2016 at 9:44 PM

Thanks. That library has a dependency on 'hive-storage-api' v2.1.1-pre-orc. Are you packaging this with the application? Or how is this dependency satisfied in distributed CDAP?
Also, the fix for this JIRA is not expected to change any behavior in distributed CDAP, right?

Fixed

Details

Created September 13, 2016 at 6:27 PM
Updated December 6, 2016 at 7:49 PM
Resolved December 6, 2016 at 7:49 PM