When DefaultRuntimeJob loses communication with CDAP, it should exit after a finite number of retries

Description

There can be situations in which networking between CDAP and the DefaultRuntimeJob running remotely (e.g. in a Dataproc cluster) is lost, or in which replication is disabled but the message to stop the existing replication job is lost. In such situations the DefaultRuntimeJob should stop after a finite time to free up resources. Currently, the RuntimeClientService exits after around 4 hours of failing to report runtime status to CDAP, but this does not stop the DefaultRuntimeJob. Perhaps DefaultRuntimeJob should listen to the state of the core services and exit gracefully if any of them fails, because a failed core service leaves the job in a state with undefined behaviour.
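A minimal sketch of the "listen to the state of the core services" idea, using only the Java standard library. This is not CDAP's implementation: `RuntimeJobWatchdog`, `CoreService`, and `stopAllWhenAnyEnds` are hypothetical names, and each core service is modelled as a `CompletableFuture` that completes when the service stops or fails.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch: stop every core service as soon as any one of them ends. */
final class RuntimeJobWatchdog {

  /** Hypothetical stand-in for a core service such as RuntimeClientService. */
  static final class CoreService {
    private final String name;
    // Completes normally on a clean stop and exceptionally on a failure.
    private final CompletableFuture<Void> completion = new CompletableFuture<>();

    CoreService(String name) {
      this.name = name;
    }

    String name() {
      return name;
    }

    void stop() {
      completion.complete(null); // no-op if the service has already ended
    }

    void fail(Throwable cause) {
      completion.completeExceptionally(cause); // no-op if already ended
    }

    boolean isDone() {
      return completion.isDone();
    }
  }

  /**
   * Registers a callback on every service. The first service to end (by stop
   * or failure) triggers a stop of all services. The returned future completes
   * with a short description of what ended first.
   */
  static CompletableFuture<String> stopAllWhenAnyEnds(List<CoreService> services) {
    CompletableFuture<String> exitReason = new CompletableFuture<>();
    for (CoreService service : services) {
      service.completion.whenComplete((ignored, failure) -> {
        String reason = service.name()
            + (failure == null ? " stopped" : " failed: " + failure.getMessage());
        // complete() returns true only for the first service that ends,
        // so the shutdown below is triggered exactly once.
        if (exitReason.complete(reason)) {
          services.forEach(CoreService::stop);
        }
      });
    }
    return exitReason;
  }

  public static void main(String[] args) {
    CoreService client = new CoreService("RuntimeClientService");
    CoreService other = new CoreService("OtherCoreService"); // hypothetical second service
    CompletableFuture<String> exitReason = stopAllWhenAnyEnds(List.of(client, other));

    // Simulate the client giving up after failing to reach CDAP.
    client.fail(new IllegalStateException("retries exhausted"));

    System.out.println("Exiting job: " + exitReason.join());
    System.out.println("Other service stopped: " + other.isDone());
  }
}
```

In a real job, the main thread would block on the returned future and then exit the process, so that a failure of any one service cannot leave the remaining services running on their own.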

 

Steps to repro:

  • Start a replication job or a very long running pipeline with Dataproc provisioner.

  • Bring down the Runtime Service in CDAP, for example by reducing its memory allocation so that it crash-loops.

  • Observe the logs of the Dataproc job. After around 4 hours, the RuntimeClientService exits, but the DefaultRuntimeJob keeps running indefinitely.
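The "finite retries" behaviour requested in the title can be illustrated with a bounded retry loop. This is a sketch under stated assumptions, not the retry strategy CDAP actually uses: `BoundedRetry`, `retry`, and the attempt and delay parameters are hypothetical. The caller is expected to exit the job when the method returns `false`.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: retry with capped exponential backoff, then give up. */
final class BoundedRetry {

  /**
   * Runs the attempt up to maxAttempts times, sleeping between attempts.
   * The delay doubles after each failure and is capped at maxDelay.
   * Returns true on the first success, false once the budget is exhausted
   * or the thread is interrupted.
   */
  static boolean retry(BooleanSupplier attempt, int maxAttempts,
                       Duration initialDelay, Duration maxDelay) {
    Duration delay = initialDelay;
    for (int i = 1; i <= maxAttempts; i++) {
      if (attempt.getAsBoolean()) {
        return true;
      }
      if (i == maxAttempts) {
        break; // no need to sleep after the final failed attempt
      }
      try {
        Thread.sleep(delay.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        return false;
      }
      Duration doubled = delay.multipliedBy(2);
      delay = doubled.compareTo(maxDelay) > 0 ? maxDelay : doubled;
    }
    return false;
  }

  public static void main(String[] args) {
    // Simulated status report that never reaches CDAP.
    boolean reported = retry(() -> false, 3, Duration.ofMillis(10), Duration.ofMillis(20));
    if (!reported) {
      System.out.println("Retry budget exhausted, the job should shut down now");
    }
  }
}
```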

Release Notes

Fixed an issue where Dataproc continued running a job when it could not communicate with the CDAP instance or when the replication job or pipeline had been deleted in CDAP.

Activity

Sumit Jain, April 19, 2023 at 4:56 AM

Merged to release branch with https://github.com/cdapio/cdap/pull/15076

Sumit Jain, April 3, 2023 at 3:10 PM (edited)

Develop PR merged

Will be cherry picked to release/6.9 branch after 6.9.0 is released.

Sumit Jain, March 10, 2023 at 6:29 AM

As per discussion with this will be targeted for 6.9.1

Sanket Sahu, March 6, 2023 at 12:34 PM


Does fix this issue ?

Fixed

Details

Triaged: Yes

Created December 26, 2022 at 6:54 AM
Updated May 25, 2023 at 7:24 AM
Resolved April 19, 2023 at 4:56 AM