Fixed
Details
Assignee
Sumit Jain
Reporter
Arjan Bal
Affects versions
Triaged
Yes
Components
Fix versions
Priority
Created December 26, 2022 at 6:54 AM
Updated May 25, 2023 at 7:24 AM
Resolved April 19, 2023 at 4:56 AM
There can be situations in which networking between CDAP and the DefaultRuntimeJob running remotely (e.g. in a Dataproc cluster) is lost, or in which replication is disabled while the message to stop the existing replication job is lost. In such situations the DefaultRuntimeJob should stop after some finite time to free up resources. At present, the RuntimeClientService exits after around 4 hours of failing to report runtime status to CDAP, but this doesn't stop the DefaultRuntimeJob. Maybe DefaultRuntimeJob should listen to the state of the core services and exit gracefully if any of them fails, because a failed core service leaves the job in a state with undefined behaviour.
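A rough sketch of the suggested behaviour, in plain Java and not based on the actual DefaultRuntimeJob code: the job waits until any core service terminates and then stops the rest. The names `CoreServiceWatchdog`, `CoreService`, `DemoService` and `awaitFirstTermination` are hypothetical and exist only for illustration; the real job would presumably hook into whatever service lifecycle API it already uses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class CoreServiceWatchdog {

  /** Hypothetical stand-in for a core service; not a CDAP interface. */
  public interface CoreService {
    String name();

    /** Completes normally on a clean stop, exceptionally on failure. */
    CompletableFuture<Void> termination();

    void stop();
  }

  /** Minimal in-memory service, used only to demonstrate the watchdog. */
  public static final class DemoService implements CoreService {
    private final String name;
    private final CompletableFuture<Void> termination = new CompletableFuture<>();
    public volatile boolean stopped;

    public DemoService(String name) {
      this.name = name;
    }

    @Override
    public String name() {
      return name;
    }

    @Override
    public CompletableFuture<Void> termination() {
      return termination;
    }

    @Override
    public void stop() {
      stopped = true;
      // No effect if the service has already terminated or failed.
      termination.complete(null);
    }
  }

  /**
   * Blocks until any one service terminates (cleanly or with an error),
   * then stops every service so the job can exit instead of running on.
   * Returns the name of the service that terminated first.
   */
  public static String awaitFirstTermination(List<CoreService> services) {
    CompletableFuture<String> first = new CompletableFuture<>();
    for (CoreService service : services) {
      // Only the first completion wins; later calls to complete() are ignored.
      service.termination().whenComplete((ok, err) -> first.complete(service.name()));
    }
    String name = first.join();
    for (CoreService service : services) {
      service.stop();
    }
    return name;
  }

  public static void main(String[] args) {
    DemoService client = new DemoService("RuntimeClientService");
    DemoService other = new DemoService("OtherCoreService");
    // Simulate the client giving up after failing to report status.
    client.termination().completeExceptionally(new RuntimeException("cannot reach CDAP"));
    String failed = awaitFirstTermination(Arrays.asList(client, other));
    System.out.println(failed + " terminated; stopping job");
  }
}
```

With something along these lines, the exit of the RuntimeClientService after its retry window would also bring down the remaining services and let the job process end, rather than leaving it running on the cluster.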
Steps to repro:
Start a replication job or a very long running pipeline with Dataproc provisioner.
Bring down the Runtime Service in CDAP, say by reducing its memory allocation, which causes it to crashloop.
Observe the logs of the Dataproc job. After around 4 hours, the RuntimeClientService will exit, but the DefaultRuntimeJob will keep running indefinitely.