Optimize jar uploading in pipeline launch flow to reduce disk I/O

Description

Issue:

During pipeline launch on Dataproc, AppFab uploads plugin and pipeline jars to a staging GCS bucket for Dataproc to pick up. It appears that the same set of plugin jars is uploaded to the staging bucket twice: once as plugin jars and once as an archived plugin directory.

 

Sample debugging logs:

2022-07-01 22:51:36,366 - DEBUG [provisioning-context-0:i.c.c.r.s.r.DataprocRuntimeJobManager@335] - Uploading a file of size 277799659 bytes from file:/data/tmp/1656715881393-0/1656715882410-0/artifacts.jar to gs://df-13098568299871872022-fcxrbmnpuqi6zp44aizbbqaaaa/cdap-job/4e909db1-f990-11ec-8650-0ad4d37c81bd/artifacts_archive.jar
2022-07-01 22:51:36,366 - DEBUG [provisioning-context-7:i.c.c.r.s.r.DataprocRuntimeJobManager@335] - Uploading a file of size 82074 bytes from file:/data/tmp/1656715881393-0/cConf.xml to gs://df-13098568299871872022-fcxrbmnpuqi6zp44aizbbqaaaa/cdap-job/4e909db1-f990-11ec-8650-0ad4d37c81bd/cConf.xml
2022-07-01 22:51:36,366 - DEBUG [provisioning-context-1:i.c.c.r.s.r.DataprocRuntimeJobManager@335] - Uploading a file of size 277799659 bytes from file:/data/tmp/1656715881393-0/1656715882410-0/artifacts.jar to gs://df-13098568299871872022-fcxrbmnpuqi6zp44aizbbqaaaa/cdap-job/4e909db1-f990-11ec-8650-0ad4d37c81bd/artifacts
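
To check whether other launches show the same pattern, debug logs like the ones above can be scanned for local files that are uploaded to more than one target. The following is a minimal sketch, assuming each log entry sits on a single line and uses the "Uploading a file of size ... bytes from ... to ..." message format seen above. The class name DuplicateUploadFinder and the shortened sample paths are hypothetical and not part of CDAP.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for log analysis; not part of CDAP.
final class DuplicateUploadFinder {
  // Matches the upload message format seen in the debug logs:
  // group 1 = size in bytes, group 2 = local source, group 3 = upload target.
  private static final Pattern UPLOAD = Pattern.compile(
      "Uploading a file of size (\\d+) bytes from (\\S+) to (\\S+)");

  /** Groups upload targets by local source file, in order of first appearance. */
  static Map<String, List<String>> targetsBySource(List<String> logLines) {
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (String line : logLines) {
      Matcher m = UPLOAD.matcher(line);
      if (m.find()) {
        result.computeIfAbsent(m.group(2), k -> new ArrayList<>()).add(m.group(3));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Sample lines with shortened, made-up paths and bucket name.
    List<String> lines = List.of(
        "DEBUG - Uploading a file of size 277799659 bytes from file:/data/tmp/artifacts.jar to gs://example-bucket/run/artifacts_archive.jar",
        "DEBUG - Uploading a file of size 82074 bytes from file:/data/tmp/cConf.xml to gs://example-bucket/run/cConf.xml",
        "DEBUG - Uploading a file of size 277799659 bytes from file:/data/tmp/artifacts.jar to gs://example-bucket/run/artifacts");

    // Report every local file that was uploaded to more than one target.
    targetsBySource(lines).forEach((source, targets) -> {
      if (targets.size() > 1) {
        System.out.println(source + " uploaded " + targets.size() + " times: " + targets);
      }
    });
  }
}
```

Any source that maps to more than one target was read from local disk once per target, which is the duplication described in this ticket.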

 

Request:

Evaluate if this can be optimized to avoid duplicated jar uploading.

 

More info:

In the 6.5.1 codebase, updateProgramOptions [1] creates a new bundle (e.g. /data/tmp/1656715881393-0/1656715882410-0/artifacts.jar) containing a copy of all plugin jars. This jar is then uploaded twice to the staging GCS bucket under different names: “artifacts_archive.jar” and “artifacts”. This is suboptimal when plugin jars are large: it causes unnecessary disk I/O and slows down pipeline launches.

[1] io.cdap.cdap.internal.app.runtime.distributed.DistributedProgramRunner#updateProgramOptions
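
One possible direction, offered only as a sketch and not as what CDAP does today, is to read each local file once and create any additional target names with a copy inside the staging store, since GCS can copy an object without the client re-uploading its bytes. The example below assumes a hypothetical StagingStore interface with upload and copy operations. UploadPlanner, Request, and StagingStore are illustrative names, not CDAP or GCS client APIs. It is also sequential, whereas the logs show the real uploads running on separate provisioning threads.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types exist in CDAP.
final class UploadPlanner {
  /** Assumed staging-store abstraction: upload reads local disk, copy does not. */
  interface StagingStore {
    void upload(String localFile, String target);
    void copy(String existingTarget, String target);
  }

  /** One requested upload: a local source file and the name it should get in staging. */
  static final class Request {
    final String localFile;
    final String target;

    Request(String localFile, String target) {
      this.localFile = localFile;
      this.target = target;
    }
  }

  /**
   * Uploads each distinct local file once. Later requests for the same file
   * are served by copying the already-uploaded object to the new name.
   * Returns the number of uploads that read from local disk.
   */
  static int stage(List<Request> requests, StagingStore store) {
    Map<String, String> firstTargetByFile = new LinkedHashMap<>();
    int uploads = 0;
    for (Request r : requests) {
      String existing = firstTargetByFile.get(r.localFile);
      if (existing == null) {
        store.upload(r.localFile, r.target);
        firstTargetByFile.put(r.localFile, r.target);
        uploads++;
      } else {
        store.copy(existing, r.target);
      }
    }
    return uploads;
  }

  public static void main(String[] args) {
    // Recording store that only logs operations, standing in for a real bucket.
    List<String> ops = new ArrayList<>();
    StagingStore recording = new StagingStore() {
      @Override public void upload(String localFile, String target) {
        ops.add("upload " + localFile + " -> " + target);
      }
      @Override public void copy(String existingTarget, String target) {
        ops.add("copy " + existingTarget + " -> " + target);
      }
    };

    int uploads = stage(List.of(
        new Request("artifacts.jar", "artifacts_archive.jar"),
        new Request("cConf.xml", "cConf.xml"),
        new Request("artifacts.jar", "artifacts")), recording);

    System.out.println(uploads + " uploads from local disk");
    ops.forEach(System.out::println);
  }
}
```

Both target names still exist after staging, so a consumer that expands one and reads the other as-is would see the same layout. Whether a store-side copy is actually cheaper than a second upload in this flow would need to be measured.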

Release Notes

None

Activity

Masoud Saeida Ardekani, September 16, 2022 at 5:31 PM

Discussed with Terence the historical reasons why both artifacts and artifacts_archive.jar are needed. This is a side effect of the distributed program runner being environment-agnostic: in YARN, one of them is expanded and the other is not.

With artifact caching in GCS, this feature is no longer a priority.

Vitalii Tymchyshyn, July 6, 2022 at 3:39 PM

I think this is too big a change to take in 6.7.1. Moving to 6.8.

Wangyuan Zhang, July 2, 2022 at 12:09 AM

Assigning to Masoud so he can delegate to others if needed.

Details

Triaged: No

Created July 1, 2022 at 11:56 PM
Updated October 27, 2023 at 4:19 PM