Pipeline fails due to Twill Program failure when appfabric restarts as soon as pipeline execution reaches STARTING state

Description

When appfabric service is restarted during pipeline execution as soon as pipeline enters STARTING state, in some scenarios, pipeline failures are observed with the following logs in appfabric

I0313 10:12:51.698863 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,079 - WARN [program.status.9:i.c.c.i.a.s.AppMetadataStore@1568] - Ignoring unexpected transition of program run program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.workflow.DataPipelineWorkflow.8ed1e294-e15a-11ee-ab42-6a2b08ea5caf from run state FAILED to RUNNING. I0313 10:12:51.699081 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,088 - INFO [program.status.9:i.c.c.i.a.s.RunRecordMonitorService@179] - Removed request with runId program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.workflow.DataPipelineWorkflow.8ed1e294-e15a-11ee-ab42-6a2b08ea5caf. Counter has 10 concurrent launching requests. I0313 10:12:51.699251 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,099 - INFO [main:i.c.c.m.e.k.AbstractServiceMain@224] - Starting service MasterTaskExecutorService [NEW] for io.cdap.cdap.master.environment.k8s.AppFabricServiceMain I0313 10:12:51.699411 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,103 - INFO [main:i.c.c.m.e.k.AbstractServiceMain@227] - All services for io.cdap.cdap.master.environment.k8s.AppFabricServiceMain started I0313 10:12:51.699576 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,118 - DEBUG [master-env-task-PodKillerTask:i.c.c.m.e.k.PodKillerTask@52] - Terminating pods using selector cdap.container=cdap-prb-fi-e19beef1-preview I0313 10:12:51.699756 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,153 - DEBUG [program.status.9:i.c.c.i.a.s.ProgramNotificationSingleTopicSubscriberService@504] - Ignoring unexpected request to transition program run program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.workflow.DataPipelineWorkflow.8ed1e294-e15a-11ee-ab42-6a2b08ea5caf from FAILED state to program STARTING state. I0313 10:12:51.699938 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,157 - DEBUG [program.status.9:i.c.c.i.a.s.ProgramNotificationSingleTopicSubscriberService@504] - Ignoring unexpected request to transition program run program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.workflow.DataPipelineWorkflow.8ed1e294-e15a-11ee-ab42-6a2b08ea5caf from FAILED state to program STARTING state. I0313 10:12:51.700118 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,161 - DEBUG [program.status.9:i.c.c.i.a.s.ProgramNotificationSingleTopicSubscriberService@504] - Ignoring unexpected request to transition program run program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.workflow.DataPipelineWorkflow.8ed1e294-e15a-11ee-ab42-6a2b08ea5caf from FAILED state to program STARTING state. I0313 10:12:51.700275 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,165 - DEBUG [program.status.9:i.c.c.i.a.s.ProgramNotificationSingleTopicSubscriberService@504] - Ignoring unexpected request to transition program run program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.workflow.DataPipelineWorkflow.8ed1e294-e15a-11ee-ab42-6a2b08ea5caf from FAILED state to program STARTING state. I0313 10:12:51.700437 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,169 - DEBUG [program.status.9:i.c.c.i.a.s.ProgramNotificationSingleTopicSubscriberService@504] - Ignoring unexpected request to transition program run program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.workflow.DataPipelineWorkflow.8ed1e294-e15a-11ee-ab42-6a2b08ea5caf from FAILED state to program STARTING state. I0313 10:12:51.700598 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,173 - DEBUG [program.status.9:i.c.c.i.a.s.ProgramNotificationSingleTopicSubscriberService@504] - Ignoring unexpected request to transition program run program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.workflow.DataPipelineWorkflow.8ed1e294-e15a-11ee-ab42-6a2b08ea5caf from FAILED state to program STARTING state. I0313 10:12:51.700777 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,177 - DEBUG [program.status.9:i.c.c.i.a.s.ProgramNotificationSingleTopicSubscriberService@504] - Ignoring unexpected request to transition program run program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.workflow.DataPipelineWorkflow.8ed1e294-e15a-11ee-ab42-6a2b08ea5caf from FAILED state to program STARTING state. I0313 10:12:51.700947 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,180 - DEBUG [program.status.9:i.c.c.i.a.s.ProgramNotificationSingleTopicSubscriberService@504] - Ignoring unexpected request to transition program run program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.workflow.DataPipelineWorkflow.8ed1e294-e15a-11ee-ab42-6a2b08ea5caf from FAILED state to program STARTING state. I0313 10:12:51.701122 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,184 - DEBUG [program.status.9:i.c.c.i.a.s.ProgramNotificationSingleTopicSubscriberService@504] - Ignoring unexpected request to transition program run program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.workflow.DataPipelineWorkflow.8ed1e294-e15a-11ee-ab42-6a2b08ea5caf from FAILED state to program STARTING state. I0313 10:12:51.701300 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,191 - DEBUG [program.status.9:i.c.c.i.a.s.ProgramNotificationSingleTopicSubscriberService@504] - Ignoring unexpected request to transition program run program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.workflow.DataPipelineWorkflow.8ed1e294-e15a-11ee-ab42-6a2b08ea5caf from FAILED state to program STARTING state. I0313 10:12:51.702085 857 step.go:630] > bash [stdout]: 2024-03-13 17:01:31,196 - DEBUG [program.status.9:i.c.c.i.a.s.ProgramNotificationSingleTopicSubscriberService@504] - Ignoring unexpected request to transition program run program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.workflow.DataPipelineWorkflow.8ed1e294-e15a-11ee-ab42-6a2b08ea5caf from FAILED state to program STARTING state.

Eventually, the twill program fails:

I0313 10:12:51.715259 857 step.go:630] > bash [stdout]: 2024-03-13 17:06:27,848 - INFO [OkHttp https://10.70.240.1/...:i.c.c.i.a.r.d.AbstractTwillProgramController@77] - Twill program terminated: program_run:system.dataprep.-SNAPSHOT.service.service.2c6ef702-e15a-11ee-87c8-6a2b08ea5caf, twill runId: 2c6ef702-e15a-11ee-87c8-6a2b08ea5caf, status: FAILED

Later, we see that RunRecordCorrectorService modifies the run record:

I0313 10:12:51.717683 857 step.go:630] > bash [stdout]: 2024-03-13 17:12:27,432 - WARN [run-corrector:i.c.c.i.a.s.RunRecordCorrectorService@145] - Fixed RunRecord for program run program_run:default.DataFusionQuickStart.84af95f3-e15a-11ee-921f-6a2b08ea5caf.spark.phase-1.31cb88c1-e15b-11ee-a015-42010a960fc6 in STARTING state because it is actually not running I0313 10:12:51.721218 857 step.go:630] > bash [stdout]: 2024-03-13 17:12:27,472 - WARN [run-corrector:i.c.c.i.a.s.RunRecordCorrectorService@154] - Fixed 1 RunRecords with status in [STARTING, RUNNING, SUSPENDED], but the programs are not actually running

An issue of a similar nature was identified and fixed: https://github.com/cdapio/cdap/pull/15517. The current issue could be related to it.

Release Notes

None

Activity

Show:
Pinned fields
Click on the next to a field label to start pinning.

Details

Assignee

Reporter

Triaged

No

Components

Fix versions

Priority

Created March 15, 2024 at 9:08 AM
Updated April 15, 2024 at 6:59 AM