
Programs stop getting scheduled after a few hours

Description

I have 25 pipelines scheduled to run every minute with a max concurrent runs constraint of 1. The pipelines get scheduled correctly for about 12 hours. After that, the pipelines get stuck in the provisioning state and no more runs get scheduled.

I have attached the stack trace and the logs of the appfabric server (from 2019-03-24 06:20 to 2019-03-24 06:39), as well as the program log of the program that got stuck in the provisioning state. The CDAP instance was started at 2019-03-23 18:22; the schedules stopped running around 2019-03-24 06:30.

 

Release Notes

Fixed a race condition in the remote runtime SCP implementation that could cause the process to hang.

Activity

Poorna Chandra
March 28, 2019, 9:50 PM
Edited

On further digging, I found that there are two timeout properties in JSch, the SSH library CDAP uses:

  1. A connection timeout can be specified in the method com.jcraft.jsch.Session.connect, which also sets the underlying socket's SO_TIMEOUT value. However, this is not documented; I had to dig into the code to find it.

  2. ServerAliveInterval is an SSH config option that sends a keep-alive message if the server sends no response. After a configurable number of tries, the connection is disconnected.

To start off with, I am trying to see whether setting the socket's SO_TIMEOUT value will time out the read and thus stop the thread from hanging.
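For reference, here is a minimal sketch of what SO_TIMEOUT does on a plain java.net socket. It does not use JSch, and the class and method names (SoTimeoutDemo, readTimesOut) are made up for illustration. It only shows that a blocking read against a silent peer throws SocketTimeoutException once the timeout elapses; it says nothing about how JSch's SCP channel behaves.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SoTimeoutDemo {

  // Returns true if a blocking read on a silent peer is cut short by SO_TIMEOUT.
  static boolean readTimesOut(int timeoutMillis) {
    // Local listener that lets the client connect but never writes a byte.
    try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
         Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
      // SO_TIMEOUT: upper bound (in ms) on each blocking read from this socket.
      client.setSoTimeout(timeoutMillis);
      InputStream in = client.getInputStream();
      try {
        in.read();       // would block indefinitely with a timeout of 0
        return false;
      } catch (SocketTimeoutException e) {
        return true;     // only the read gave up; the socket itself is still open
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("read timed out: " + readTimesOut(200));
  }
}
```

Note that the timeout applies per read call, so a peer that trickles data could still keep a transfer alive far longer than the configured value.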

Poorna Chandra
April 1, 2019, 9:49 PM

Setting SO_TIMEOUT did not help in timing out hanging SCP connections.

I put in another fix to run the copy in a separate thread and then interrupt the thread if the copy takes longer than the timeout. I deployed this fix on Friday and tested it over the weekend. I don't see any hanging SSH connections anymore, and the schedules are running fine.
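The approach described above could look roughly like the following sketch. This is not the actual CDAP change: TimedCopy, runWithTimeout, and hangingCopy are hypothetical names, and the "copy" is simulated with a sleep. The idea is to submit the copy to an executor, wait on the Future with a timeout, and cancel with interruption if the wait expires.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedCopy {

  // Runs the task on a separate thread. Returns true if it finished in time,
  // false if it was interrupted because the timeout elapsed.
  static boolean runWithTimeout(Runnable copyTask, long timeoutMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<?> future = executor.submit(copyTask);
    try {
      future.get(timeoutMillis, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      future.cancel(true);   // true = interrupt the worker thread
      return false;
    } catch (ExecutionException e) {
      throw new RuntimeException("copy failed", e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      future.cancel(true);
      return false;
    } finally {
      executor.shutdownNow();
    }
  }

  // Hypothetical stand-in for a copy that never returns on its own.
  static void hangingCopy() {
    try {
      Thread.sleep(Long.MAX_VALUE);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();   // restore the flag and exit
    }
  }

  public static void main(String[] args) {
    System.out.println("hanging copy finished: " + runWithTimeout(TimedCopy::hangingCopy, 200));
    System.out.println("quick copy finished: " + runWithTimeout(() -> { }, 200));
  }
}
```

Interruption only unblocks a thread if the blocked call responds to it. The sleep here does; whether a real transfer does depends on the underlying I/O, so a production version may also need to close the channel or session.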

 

Terence Yim
April 3, 2019, 9:50 PM

Make program start be an async call https://github.com/cdapio/cdap/pull/11260

The ProgramLifecycleService thread shouldn't be blocked for long.
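As a rough illustration of the idea (not the code in the linked pull request), an async start can hand the slow work to an executor and return a future immediately, so the calling thread is free to continue. AsyncStartSketch, slowStart, and startAsync are hypothetical names, and the 500 ms sleep stands in for whatever slow step the real start performs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncStartSketch {

  // Hypothetical stand-in for a slow start step; not CDAP code.
  static String slowStart(String programId) {
    try {
      Thread.sleep(500);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return programId + ":started";
  }

  // Hands the slow work to the executor and returns right away.
  static CompletableFuture<String> startAsync(String programId, ExecutorService executor) {
    return CompletableFuture.supplyAsync(() -> slowStart(programId), executor);
  }

  public static void main(String[] args) {
    ExecutorService executor = Executors.newCachedThreadPool();
    try {
      CompletableFuture<String> run = startAsync("pipeline-1", executor);
      // The caller reaches this line without waiting for slowStart to finish.
      System.out.println("done immediately after submit? " + run.isDone());
      System.out.println(run.join());
    } finally {
      executor.shutdown();
    }
  }
}
```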

Terence Yim
April 5, 2019, 9:38 PM

Also fixed SCP hanging issue in

Terence Yim
May 2, 2019, 6:12 AM

Root causes should have been fixed by

Fixed

Assignee

Terence Yim

Reporter

Poorna Chandra

Labels

None

Docs Impact

None

UX Impact

None

Components

Fix versions

Affects versions

Priority

Critical