Invalid transactions created by system services
Activity
Andreas Neumann October 12, 2017 at 8:12 PM
Because TEPHRA-257 was closed without fix, there is no remaining work here.
Andreas Neumann September 12, 2017 at 12:16 AM
Moved this to 5.0.0 to keep it as a tracker for the Tephra improvement.
Andreas Neumann September 12, 2017 at 12:16 AM
I opened https://issues.apache.org/jira/browse/TEPHRA-257 to address this in Tephra. That will not be possible in 0-13, though.
Andreas Neumann September 11, 2017 at 10:18 PM
In this particular case, it was caused by slow HDFS, which had a ripple effect:
1. A client makes a call to start a transaction.
2. The transaction service takes 70 sec to append to the transaction log.
3. That exceeds the Thrift RPC timeout of 30 sec, so the client times out the Thrift call.
4. The client closes its Thrift connection, opens a new one, and retries starting a transaction, which succeeds.
5. When the transaction service responds to the initial request, the connection is already closed, but the transaction has been started.
6. Nobody knows about this transaction, so it is never committed or aborted, and it times out after some time.
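The sequence above can be reproduced with a small toy model. This is only a sketch: `ToyTxService` and `startWithRetry` are hypothetical names, not the Tephra API, time is a plain counter in seconds, and the 30 sec transaction timeout is an assumed value. The 70 sec latency and the 30 sec RPC timeout are the numbers from this case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OrphanedTxSketch {

  /** Toy stand-in for the transaction service (hypothetical, not the Tephra API). */
  public static class ToyTxService {
    private long nextId = 1;
    public final Map<Long, Long> inProgress = new HashMap<>(); // tx id -> expiration time (sec)
    public final List<Long> invalid = new ArrayList<>();

    /** Starts a transaction. The state change happens even if nobody reads the reply. */
    public long start(long nowSec, long txTimeoutSec) {
      long id = nextId++;
      inProgress.put(id, nowSec + txTimeoutSec);
      return id;
    }

    public void commit(long id) {
      inProgress.remove(id);
    }

    /** Moves every expired in-progress transaction to the invalid list. */
    public void expire(long nowSec) {
      inProgress.entrySet().removeIf(e -> {
        if (e.getValue() <= nowSec) {
          invalid.add(e.getKey());
          return true;
        }
        return false;
      });
    }
  }

  /**
   * Client side: one start attempt per entry in serverLatenciesSec.
   * Returns the tx id the client ends up using, or -1 if every attempt times out.
   */
  public static long startWithRetry(ToyTxService svc, long[] serverLatenciesSec,
                                    long rpcTimeoutSec, long txTimeoutSec) {
    long now = 0;
    for (long latency : serverLatenciesSec) {
      // The server finishes the start eventually, whether or not the client is still waiting.
      long id = svc.start(now + latency, txTimeoutSec);
      if (latency <= rpcTimeoutSec) {
        return id; // the reply arrived in time, so the client owns this transaction
      }
      now += rpcTimeoutSec; // the client gave up, reconnects, and retries
    }
    return -1;
  }

  public static void main(String[] args) {
    ToyTxService svc = new ToyTxService();
    // First attempt takes 70 sec (slow log append), the retry takes 1 sec; RPC timeout is 30 sec.
    long used = startWithRetry(svc, new long[] {70, 1}, 30, 30);
    svc.commit(used);
    svc.expire(200); // well past the transaction timeout
    System.out.println("client used tx " + used + ", invalid list = " + svc.invalid);
    // prints: client used tx 2, invalid list = [1]
  }
}
```

The client only ever commits the transaction from the retry. The one started by the first attempt has no owner, so nothing can remove it from the invalid list once it expires.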
One way to fix this is to increase all the timeouts.
Another, better way is to fix the slow HDFS (which was suggested in this case).
However, temporary slowness of HDFS, even over an extended period of time, should not have a lasting effect on the performance of the system. After the slowness improves, CDAP should also recover, which it does not: it piles up more and more invalid transactions, and eventually performance degrades.
Terence Yim August 24, 2017 at 5:40 AM
I think most of them, if not all, are using `Transactional`. However, in a loaded system, could it be caused by repeated process termination? Do we have a chance to look at the container IDs of the system YARN apps to see if they are high?
In clusters that are slow and generally prone to timeouts, we see that a lot of invalid transactions are generated. Many of these are from system services, namely dataset service, dataset.executor, explore service, and log.saver.
It is not clear how that happens. The transaction is invalidated because it is a short transaction and reaches its timeout. At commit time, TransactionNotInProgressException is thrown, which fails the transaction. However, TransactionContext catches that, rolls back the changes, and then aborts the transaction, which removes it from the invalid list.
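To make the expected behavior concrete, here is a toy model of the two code paths. It is a sketch under assumptions: `ToyTxManager` and `runTimedOutTx` are hypothetical names, `IllegalStateException` stands in for the not-in-progress exception, and the rollback of writes is omitted. It does not use the Tephra API.

```java
import java.util.HashSet;
import java.util.Set;

public class AbortOnFailureSketch {

  /** Toy transaction manager (hypothetical, not the Tephra API). */
  public static class ToyTxManager {
    public final Set<Long> inProgress = new HashSet<>();
    public final Set<Long> invalid = new HashSet<>();
    private long nextId = 1;

    public long start() {
      long id = nextId++;
      inProgress.add(id);
      return id;
    }

    /** Simulates the short-transaction timeout: the tx moves to the invalid list. */
    public void timeOut(long id) {
      if (inProgress.remove(id)) {
        invalid.add(id);
      }
    }

    /** Fails if the transaction is no longer in progress. */
    public void commit(long id) {
      if (!inProgress.remove(id)) {
        throw new IllegalStateException("tx " + id + " is not in progress");
      }
    }

    /** Abort after rollback; this also removes the id from the invalid list. */
    public void abort(long id) {
      inProgress.remove(id);
      invalid.remove(id);
    }
  }

  /** Runs one transaction that times out before it reaches commit. */
  public static void runTimedOutTx(ToyTxManager mgr, boolean abortOnFailure) {
    long id = mgr.start();
    mgr.timeOut(id);
    try {
      mgr.commit(id);
    } catch (IllegalStateException e) {
      if (abortOnFailure) {
        // Rolling back the writes would happen here (omitted), followed by the abort.
        mgr.abort(id);
      }
      // Without the abort, the id stays in the invalid list indefinitely.
    }
  }

  public static void main(String[] args) {
    ToyTxManager withAbort = new ToyTxManager();
    runTimedOutTx(withAbort, true);
    ToyTxManager withoutAbort = new ToyTxManager();
    runTimedOutTx(withoutAbort, false);
    System.out.println("with abort: " + withAbort.invalid
        + ", without abort: " + withoutAbort.invalid);
    // prints: with abort: [], without abort: [1]
  }
}
```

If a service follows the second path, every timed-out transaction leaves a permanent entry in the invalid list, which would match what we see.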
Perhaps these services use a different way to execute transactions that does not attempt to roll back. This needs investigation.