Common Dependencies with Hive cause issues

Description

Common dependencies between our code and Hive cause problems if the versions of those dependencies are incompatible. To reproduce:

1. Create a cluster with a Hive version that uses Guava 11 (CDH 5.1, for example).

2. Revert the fix at https://github.com/caskdata/cdap/pull/837.

3. Install the reverted cdap-master.

4. Deploy the Purchase example application.

5. Run the query "SELECT purchases FROM cdap_user_history".

The query should fail, with the YARN logs showing something like:

2014-12-15 23:05:52,887 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.IllegalAccessError: tried to access class com.google.common.hash.HashCodes from class co.cask.cdap.data2.datafabric.dataset.type.DistributedDatasetTypeClassLoaderFactory
    at co.cask.cdap.data2.datafabric.dataset.type.DistributedDatasetTypeClassLoaderFactory.create(DistributedDatasetTypeClassLoaderFactory.java:112)
    at co.cask.cdap.data2.datafabric.dataset.RemoteDatasetFramework.getDatasetType(RemoteDatasetFramework.java:274)
    at co.cask.cdap.data2.datafabric.dataset.RemoteDatasetFramework.getDataset(RemoteDatasetFramework.java:181)
    at co.cask.cdap.hive.datasets.DatasetAccessor.firstLoad(DatasetAccessor.java:207)
    at co.cask.cdap.hive.datasets.DatasetAccessor.instantiate(DatasetAccessor.java:186)
    at co.cask.cdap.hive.datasets.DatasetAccessor.instantiate(DatasetAccessor.java:157)
    at co.cask.cdap.hive.datasets.DatasetAccessor.getRecordScannable(DatasetAccessor.java:56)
    at co.cask.cdap.hive.datasets.DatasetInputFormat.getRecordReader(DatasetInputFormat.java:76)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:237)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:542)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:168)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:409)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
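The IllegalAccessError arises because com.google.common.hash.HashCodes is package-private in Guava 11 but public in the newer Guava that CDAP compiles against. A minimal sketch of the failure mode (the caller below is hypothetical, not the actual CDAP code):

import com.google.common.hash.HashCode;
// Public in the newer Guava (roughly 12-15) that CDAP builds against;
// package-private in Guava 11.
import com.google.common.hash.HashCodes;

public class GuavaMismatchDemo {
  public static void main(String[] args) {
    // This compiles fine against a newer Guava. If Guava 11 is first on the
    // runtime classpath (e.g. unpacked from job.jar), the JVM resolves
    // HashCodes to the package-private Guava 11 class and throws
    // java.lang.IllegalAccessError, as in the YARN log above.
    HashCode code = HashCodes.fromBytes(new byte[] {1, 2, 3, 4});
    System.out.println(code);
  }
}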

If you look at the classpath of the YARN container for the MR job that Hive runs, job.jar is the first entry on the classpath, and that jar includes Guava 11. This is because Hive creates the MR job conf by doing:

job = new JobConf(conf, ExecDriver.class);

in ExecDriver.initialize(). That constructor looks up the jar that contains ExecDriver.class and uses that jar as job.jar. Because of this, we have no control over which Guava version is used in the job: it will always pick up whatever is bundled in hive-exec.jar, which is a fat jar.
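To illustrate the mechanism, a minimal sketch (JobConf.class stands in for ExecDriver.class here, since ExecDriver's package differs across Hive versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class JobJarDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // JobConf(Configuration, Class) calls setJarByClass(cls), which scans the
    // classpath for the jar containing cls and registers that jar as job.jar.
    // Hive passes ExecDriver.class, so the fat hive-exec.jar (with its bundled
    // Guava 11) becomes job.jar and lands first on the task classpath.
    JobConf job = new JobConf(conf, JobConf.class);
    // Prints the jar that contains JobConf (null if loaded from a classes dir).
    System.out.println(job.getJar());
  }
}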

Release Notes

None

Activity

Albert Shau July 25, 2015 at 12:03 AM

This was fixed in

Albert Shau December 17, 2014 at 12:22 AM

One possible fix is to set "mapreduce.user.classpath.first" to false so that the MR framework does not load Hive's classes first. Then we could configure "mapreduce.application.classpath" and prepend the dependencies needed by Explore. See org.apache.hadoop.mapreduce.v2.util.MRApps.setClasspath and MRApps.setMRFrameworkClasspath.
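For illustration, a sketch of that approach (the Explore lib path is hypothetical; the MRJobConfig constant is the non-deprecated form of the property named above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.MRJobConfig;

public class ExploreClasspathFix {
  public static void applyTo(Configuration conf) {
    // Do not load the user's job.jar (hive-exec.jar) classes first; let the
    // framework/application classpath win instead.
    conf.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, false);

    // Prepend the dependencies Explore needs (path is hypothetical) so a
    // compatible Guava is found before hive-exec's bundled Guava 11.
    String defaultCp = conf.get(
        MRJobConfig.MAPREDUCE_APPLICATION_CLASSPATH,
        MRJobConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH);
    conf.set(MRJobConfig.MAPREDUCE_APPLICATION_CLASSPATH,
        "/opt/cdap/explore/lib/*," + defaultCp);
  }
}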

Resolution: Fixed

Details

Created December 17, 2014 at 12:20 AM
Updated July 25, 2015 at 12:03 AM
Resolved July 25, 2015 at 12:03 AM