Invalid transactions created by system services

Description

In clusters that experience slowness and are generally prone to timeouts, we see that many invalid transactions are generated. Many of these come from system services, namely the dataset service, dataset.executor, the explore service, and log.saver.

It is not clear how that happens. A transaction is invalidated when, as a short transaction, it reaches its timeout. At commit time, the client receives a TransactionNotInProgressException, which fails the transaction. However, TransactionContext catches that exception, rolls back the changes, and then aborts the transaction, which removes it from the invalid list.
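To illustrate, here is a minimal sketch of that flow. The types are simplified, hypothetical stand-ins (TxClient, TxAware, SimpleTxContext are not the actual Tephra classes); the point is only the catch, roll back, then abort sequence that should take a timed-out transaction off the invalid list again.

```java
// Simplified, hypothetical model of the flow described above (not the real
// Tephra API). A short transaction that times out before commit is caught
// here, rolled back, and aborted -- and the abort removes it from the
// invalid list again.

class TransactionNotInProgressException extends Exception { }

interface TxClient {                 // stand-in for the transaction service client
  long startShort();                 // start a short transaction with a timeout
  void commit(long txId) throws TransactionNotInProgressException;
  void abort(long txId);             // also removes the tx from the invalid list
}

interface TxAware {                  // stand-in for a dataset participating in the tx
  void rollback();                   // undo uncommitted changes
}

class SimpleTxContext {
  private final TxClient client;
  private final TxAware dataset;

  SimpleTxContext(TxClient client, TxAware dataset) {
    this.client = client;
    this.dataset = dataset;
  }

  void execute(Runnable work) {
    long txId = client.startShort();
    work.run();
    try {
      client.commit(txId);           // throws if the tx already timed out
    } catch (TransactionNotInProgressException e) {
      dataset.rollback();            // roll back the changes
      client.abort(txId);            // abort removes the tx from the invalid list
      throw new RuntimeException("Transaction " + txId + " timed out", e);
    }
  }
}
```

If a service bypasses a wrapper like this and commits without the rollback/abort on failure, the timed-out transaction would stay in the invalid list, which matches the suspicion below.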

Perhaps these services use a different way of executing transactions that does not attempt a rollback. This needs investigation.

Release Notes

None

Activity


Andreas Neumann October 12, 2017 at 8:12 PM

Because TEPHRA-257 was closed without a fix, there is no remaining work here.

Andreas Neumann September 12, 2017 at 12:16 AM

Moved this to 5.0.0 to keep it as a tracker for the Tephra improvement.

Andreas Neumann September 12, 2017 at 12:16 AM

I opened https://issues.apache.org/jira/browse/TEPHRA-257 to address this in Tephra. That will not be possible in 0.13, though.

Andreas Neumann September 11, 2017 at 10:18 PM

In this particular case, it was caused by slow HDFS, which had a ripple effect (see the sketch after this list):

  • A client makes a call to start a transaction.

  • The transaction service takes 70 seconds to append to the transaction log.

  • That exceeds the Thrift RPC timeout of 30 seconds, so the client times out the Thrift call.

  • The client closes its Thrift connection, opens a new one, and retries starting a transaction, which succeeds.

  • When the transaction service responds to the initial request, the connection is already closed, but the transaction has been started.

  • Nobody knows about this transaction, hence it is never aborted or committed, and it times out after some time.
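Below is a rough, self-contained sketch of that client-side race. All names are hypothetical (this is not the actual Thrift client); it only models how the 30-second RPC timeout plus a retry leaves the first, slowly started transaction unknown to any client.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the race above: the first startTransaction call is slow
// (70 s tx-log append), the client's 30 s RPC timeout fires, and the client
// retries on a fresh connection and succeeds. The result of the first call is
// discarded, so nobody ever commits or aborts that transaction; it later times
// out and lands in the invalid list.
class OrphanedTxSketch {
  private static final AtomicBoolean firstCall = new AtomicBoolean(true);

  // Stand-in for the remote startTransaction Thrift call.
  static long startTransactionRpc() throws Exception {
    if (firstCall.getAndSet(false)) {
      TimeUnit.SECONDS.sleep(70);    // slow HDFS append to the transaction log
    }
    return System.nanoTime();        // pretend this is the new transaction id
  }

  static long startWithTimeoutAndRetry(ExecutorService pool) throws Exception {
    while (true) {
      Future<Long> call = pool.submit(OrphanedTxSketch::startTransactionRpc);
      try {
        return call.get(30, TimeUnit.SECONDS);   // 30 s Thrift RPC timeout
      } catch (TimeoutException e) {
        // The client abandons this call and retries on a new connection.
        // The transaction the server eventually starts for it is orphaned.
      }
    }
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newCachedThreadPool();
    System.out.println("started tx " + startWithTimeoutAndRetry(pool));
    pool.shutdownNow();              // interrupts the abandoned first call
  }
}
```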

One way to fix this is to increase all the timeouts.
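If one were to go that route, it would be a configuration change roughly along these lines; note that the property names below are assumptions (to be verified against Tephra's TxConstants for the version in use) and the values are purely illustrative.

```java
import org.apache.hadoop.conf.Configuration;

// Hedged sketch of "increase all the timeouts". The property names are
// assumptions (check org.apache.tephra.TxConstants for the version in use);
// the values are illustrative only.
public class RaiseTxTimeouts {
  public static Configuration withLongerTimeouts(Configuration base) {
    Configuration conf = new Configuration(base);
    conf.setInt("data.tx.timeout", 120);            // assumed: short-transaction timeout, seconds
    conf.setInt("data.tx.client.timeout", 120000);  // assumed: Thrift client timeout, milliseconds
    return conf;
  }
}
```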

Another, better way is to fix the slow HDFS (which was suggested in this case).

However, temporary slowness of HDFS, even over an extended period of time, should not have a lasting effect on the performance of the system. After the slowness improves, CDAP should also recover... which it does not: it piles up more and more invalid transactions, and eventually performance degrades.

Terence Yim August 24, 2017 at 5:40 AM

I think most of them, if not all, are using `Transactional`. However, in a loaded system, could this be caused by repeated process termination? Do we have a chance to look at the system YARN app containers' IDs to see if they are high?

Resolution: Won't Fix

Details


Created August 24, 2017 at 1:28 AM
Updated October 12, 2017 at 8:13 PM
Resolved October 12, 2017 at 8:13 PM