Hadoop classes from hive-exec are leaked to programs in the SDK
Activity

Ali Anwar December 6, 2016 at 7:23 PM
Our ProgramResources class only exposes classes in packages prefixed with 'org.apache.hadoop.', which is why the Hive classes were being exposed.
So we don't need to worry about transitive dependencies outside of Hadoop packages.

Andreas Neumann December 3, 2016 at 5:22 PM
What about transitive dependencies that get pulled in via Hive?

Ali Anwar December 2, 2016 at 2:34 AM
Resolving the related issue would likely resolve this issue as well.
However, filtering Hive classes from the program classloader (except when the user packages Hive themselves) is a much simpler fix: https://github.com/caskdata/cdap/pull/7232
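A minimal sketch of that filtering idea, assuming a simple name-prefix predicate (names here are illustrative; the actual change is in the linked PR):

```java
// Illustrative sketch only; the real change is in the linked PR. The parent
// classloader stops exposing Hive classes, so a program that bundles Hive
// itself (e.g. hive-storage-api via orc-mapreduce) resolves them from its
// own jar instead of from the SDK's hive-exec.
static boolean isVisibleWithHiveFiltered(String className) {
  if (className.startsWith("org.apache.hadoop.hive.")) {
    return false; // no longer leaked from the SDK classpath
  }
  return className.startsWith("org.apache.hadoop.")
      && !className.startsWith("org.apache.hadoop.hbase.");
}
```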

Jeff Dix December 1, 2016 at 9:59 PM
Yes, we are not excluding hive-storage-api, so it is getting pulled in. The distributed version works, so this JIRA is just to fix the SDK (and the unit test framework, if needed).

Ali Anwar December 1, 2016 at 9:44 PM
Thanks. That library has a dependency on 'hive-storage-api' v2.1.1-pre-orc. Are you packaging this with the application? Or how is this dependency satisfied in distributed CDAP?
Also, the fix for this JIRA is expected not to change any behavior in distributed CDAP, right?
When we create the classloader used in programs, we filter out all classes except those in cdap-api and those whose names start with org.apache.hadoop (excluding everything under org.apache.hadoop.hbase). This is done to isolate programs from CDAP's own dependencies.
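A minimal sketch of that visibility rule, assuming the cdap-api classes live under a co.cask.cdap.api package prefix (illustrative only, not the actual CDAP filter code):

```java
// Illustrative sketch of the rule described above, not the actual CDAP
// implementation: programs see cdap-api classes plus Hadoop classes,
// minus everything under org.apache.hadoop.hbase.
static boolean isVisibleToProgram(String className) {
  if (className.startsWith("co.cask.cdap.api.")) { // assumed cdap-api prefix
    return true;
  }
  return className.startsWith("org.apache.hadoop.")
      && !className.startsWith("org.apache.hadoop.hbase.");
}
```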
However, in the SDK, hive-exec is included on the SDK classpath because Explore runs in the same JVM as everything else. This means that Hive classes are exposed to user programs in the SDK, which can cause problems if different versions are in use. For example, if somebody wants to use the ORC file format and has orc-mapreduce 1.2.0 as a dependency in their pom, they will get an error.
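For reference, such a user dependency would look something like this in the pom, assuming the standard org.apache.orc groupId for the orc-mapreduce artifact:

```xml
<!-- User-side dependency that triggers the conflict; groupId assumed. -->
<dependency>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-mapreduce</artifactId>
  <version>1.2.0</version>
</dependency>
```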
This is because OrcMapreduceRecordReader uses VectorizedRowBatch. That class exists under the same fully qualified name (org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch) in both hive-storage-api and hive-exec, and the copy in hive-exec gets through the program filter because its name starts with org.apache.hadoop, so it is the one used instead of the copy in the user's jar.
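A quick way to confirm which copy wins is to ask the JVM where the class was loaded from; a minimal, self-contained sketch:

```java
// Diagnostic sketch: prints the jar that VectorizedRowBatch was loaded from.
// Run inside a program: if the URL points at hive-exec rather than the
// user's bundled hive-storage-api, the leak described above is occurring.
public class WhichJar {
  public static void main(String[] args) throws ClassNotFoundException {
    Class<?> c = Class.forName(
        "org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch");
    java.security.CodeSource src = c.getProtectionDomain().getCodeSource();
    System.out.println(src != null ? src.getLocation() : "bootstrap classpath");
  }
}
```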