Programs stop getting scheduled after a few hours

Description

I have 25 pipelines scheduled to run every minute with a max concurrent runs constraint of 1. The pipelines get scheduled correctly for about 12 hours. After that, the pipelines get stuck in the provisioning state, and no more runs get scheduled.

I have attached the stack trace and the logs of appfabric server (from 2019-03-24 06:20 to 2019-03-24 06:39), also the program log of the program that got stuck in provisioning state. The CDAP instance was started at 2019-03-23 18:22, the schedules stopped getting scheduled around 2019-03-24 06:30.

 

Release Notes

Fixed a race condition in the remote runtime SCP implementation that could cause the process to hang.

Attachments

3

Activity

Terence YimMay 2, 2019 at 6:12 AM

Root causes should have been fixed by

Terence YimApril 5, 2019 at 9:38 PM

Also fixed SCP hanging issue in

Terence YimApril 3, 2019 at 9:50 PM

Make program start be an async call https://github.com/cdapio/cdap/pull/11260

The ProgramLifecycleService thread shouldn't be blocked for long.

Poorna ChandraApril 1, 2019 at 9:49 PM

Setting the {{SO_TIMEOUT}} did not help in timing out hanging SCP connections.

I put in another fix to run the copy in a separate thread and then interrupt the thread if the copy takes longer than the timeout. I deployed this fix on Friday and tested it over the weekend. I don't see any hanging SSH connections anymore, and the schedules are running fine.
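
The approach described above (run the copy on its own thread, interrupt it on timeout) can be sketched roughly as follows. This is a minimal illustration, not the code from the actual fix: the class name `TimedCopy`, the method `runWithTimeout`, and the thread name are hypothetical, and the copy itself is represented by an arbitrary `Callable`. Whether a real copy stops when interrupted depends on the I/O call it is blocked in.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; names are illustrative and not taken from CDAP.
final class TimedCopy {

  /**
   * Runs copyTask on a separate thread and waits up to timeoutMillis.
   * Returns true if the task finished in time, false if it was interrupted
   * because the timeout elapsed.
   */
  static boolean runWithTimeout(Callable<Void> copyTask, long timeoutMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "copy-with-timeout");
      t.setDaemon(true); // a stuck copy must not keep the JVM alive
      return t;
    });
    try {
      Future<Void> future = executor.submit(copyTask);
      try {
        future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        future.cancel(true); // interrupts the worker thread
        return false;
      } catch (ExecutionException e) {
        // The copy itself failed; surface the underlying cause.
        throw new IllegalStateException(e.getCause());
      } catch (InterruptedException e) {
        // The caller was interrupted while waiting: stop the copy as well.
        future.cancel(true);
        Thread.currentThread().interrupt();
        return false;
      }
    } finally {
      executor.shutdownNow();
    }
  }
}
```

A caller would wrap the actual copy in the `Callable` and treat a `false` return as a timed-out transfer that needs to be retried or reported.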

 

Poorna ChandraMarch 28, 2019 at 9:50 PM

On further digging, I found that there are two timeout properties in JSch, the SSH library CDAP uses:

  1. A connection timeout value can be specified in method com.jcraft.jsch.Session.connect, and that also sets the underlying socket's {{SO_TIMEOUT}} value. However, this is not documented. I had to dig into the code to find this.

  2. ServerAliveInterval: an SSH config option that sends a keep-alive message when the server sends no response. After a configurable number of tries, the connection is disconnected.

To start with, I am trying to see whether setting the socket's {{SO_TIMEOUT}} value will time out the read and thus stop the thread from hanging.
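
For reference, this is what SO_TIMEOUT does on a plain Java socket: a blocking read throws `SocketTimeoutException` once the timeout elapses with no data. The sketch below uses only `java.net` against a local server that never writes. It does not involve JSch, and the class and method names are hypothetical, so it shows only the general mechanism, not how an SSH session behaves.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; not part of CDAP or JSch.
final class SoTimeoutDemo {

  /** Returns true if a read from a silent peer timed out after soTimeoutMillis. */
  static boolean readTimesOut(int soTimeoutMillis) {
    // The server socket is bound but never accepts or writes; the TCP
    // connection still completes through the listen backlog.
    try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
         Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
      client.setSoTimeout(soTimeoutMillis); // SO_TIMEOUT in milliseconds
      InputStream in = client.getInputStream();
      try {
        in.read(); // blocks because the peer never sends anything
        return false;
      } catch (SocketTimeoutException e) {
        return true; // the read gave up after SO_TIMEOUT elapsed
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```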

Fixed

Details

Created March 25, 2019 at 11:05 PM
Updated May 13, 2019 at 6:41 PM
Resolved May 2, 2019 at 6:12 AM