MapReduce can deadlock
Description
Activity

Andreas Neumann September 20, 2018 at 10:15 PM

Andreas Neumann August 17, 2018 at 6:10 PM (edited)
This is a known issue in Hadoop 2.3. It was fixed in 2.5: https://jira.apache.org/jira/browse/HADOOP-10622
The right thing to do would probably be to update the Hadoop dependency to 2.5, where this issue is fixed. However, that requires considerable refactoring. For now, since this has only been observed in tests, a quick fix is to copy the Hadoop Shell class into src/test/java and patch it. I will open another Jira for updating Hadoop to a more recent version.

Andreas Neumann August 17, 2018 at 5:25 PM
It turns out that this can happen anytime we stop a MapReduce in a unit test.

Andreas Neumann August 10, 2018 at 6:13 PM
This is a bug in Hadoop and we cannot easily fix it. It only happens after some other, unrelated failure has caused a timeout and the hung build killer interrupts the running MapReduce.

Andreas Neumann August 10, 2018 at 4:32 PM
This is a bug in Hadoop's {{Shell.runCommand()}} (which is used to perform local file system operations). It starts a process and then starts a thread that reads that process's standard error.
The {{readLine()}} call will lock on the reader itself, then delegate the read to the input stream reader, which locks on the error stream itself.
The {{runCommand()}} method's finally clause, however, synchronizes on the error stream first and then closes the buffered reader, which synchronizes on itself.
In this case, if {{completed}} were false, we'd have a problem, because {{errThread.interrupt()}} does not interrupt the {{readLine()}}: it is blocking and not interruptible, and hence the error thread was not joined. So, it is still running and holding the lock when {{errReader.close()}} is called.
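The lock-order inversion described above can be reproduced in isolation. The following is only a minimal sketch, not Hadoop's code: the class {{LockOrderDemo}}, its method names, and the two {{ReentrantLock}} objects are hypothetical stand-ins for the buffered reader's lock and the error stream's lock. It uses timed {{tryLock()}} calls instead of {{synchronized}} so that the demo reports the inversion rather than actually hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class LockOrderDemo {
    // Hypothetical stand-ins for the two monitors involved in the deadlock.
    private static final ReentrantLock readerLock = new ReentrantLock(); // the buffered reader
    private static final ReentrantLock streamLock = new ReentrantLock(); // the error stream

    /** Returns true if each thread failed to get its second lock, i.e. the order is inverted. */
    static boolean inversionDetected() {
        CountDownLatch readerHeld = new CountDownLatch(1);
        CountDownLatch errAttemptDone = new CountDownLatch(1);
        CountDownLatch mainAttemptDone = new CountDownLatch(1);
        boolean[] errGotStream = new boolean[1];

        // Plays the role of the error thread: reader lock first, then stream lock.
        Thread errThread = new Thread(() -> {
            readerLock.lock();
            try {
                readerHeld.countDown();
                errGotStream[0] = tryAndRelease(streamLock);
                errAttemptDone.countDown();
                await(mainAttemptDone); // keep holding the reader lock until main has tried
            } finally {
                readerLock.unlock();
            }
        });

        // Plays the role of the finally clause: stream lock first, then reader lock.
        boolean mainGotReader;
        streamLock.lock();
        try {
            errThread.start();
            await(readerHeld);
            mainGotReader = tryAndRelease(readerLock);
            mainAttemptDone.countDown();
            await(errAttemptDone); // keep holding the stream lock until errThread has tried
        } finally {
            streamLock.unlock();
        }
        try {
            errThread.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return !errGotStream[0] && !mainGotReader;
    }

    // Tries to take the lock for 100 ms; releases it again if acquired.
    private static boolean tryAndRelease(ReentrantLock lock) {
        try {
            boolean acquired = lock.tryLock(100, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("inversion detected: " + inversionDetected());
    }
}
```

With plain {{synchronized}} blocks in place of the timed attempts, neither thread could back out, which matches the hang seen in the build.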
However, I believe that this can only happen if the thread running the Shell itself is interrupted, which means there was an earlier problem that caused a timeout or something similar. So, this is an issue in Hadoop's handling of interrupts, which can then cause tests to hang. But the actual test failure has a different root cause.
In a hung build, the following deadlock was detected:
This appears to be related to writing the transaction to a file, so that tasks can read it.
Opening as a blocker because it fails builds and impacts productivity.