Optimize jar uploading in pipeline launch flow to reduce disk I/O
Description
Issue:
When launching a pipeline on Dataproc, AppFab uploads plugin and pipeline jars to a staging GCS bucket for Dataproc to pick up. We appear to upload the same set of plugin jars to the staging bucket twice: once as plugin jars and once as an archived plugin directory.
Sample debugging logs:
2022-07-01 22:51:36,366 - DEBUG [provisioning-context-0:i.c.c.r.s.r.DataprocRuntimeJobManager@335] - Uploading a file of size 277799659 bytes from file:/data/tmp/1656715881393-0/1656715882410-0/artifacts.jar to gs://df-13098568299871872022-fcxrbmnpuqi6zp44aizbbqaaaa/cdap-job/4e909db1-f990-11ec-8650-0ad4d37c81bd/artifacts_archive.jar
2022-07-01 22:51:36,366 - DEBUG [provisioning-context-7:i.c.c.r.s.r.DataprocRuntimeJobManager@335] - Uploading a file of size 82074 bytes from file:/data/tmp/1656715881393-0/cConf.xml to gs://df-13098568299871872022-fcxrbmnpuqi6zp44aizbbqaaaa/cdap-job/4e909db1-f990-11ec-8650-0ad4d37c81bd/cConf.xml
2022-07-01 22:51:36,366 - DEBUG [provisioning-context-1:i.c.c.r.s.r.DataprocRuntimeJobManager@335] - Uploading a file of size 277799659 bytes from file:/data/tmp/1656715881393-0/1656715882410-0/artifacts.jar to gs://df-13098568299871872022-fcxrbmnpuqi6zp44aizbbqaaaa/cdap-job/4e909db1-f990-11ec-8650-0ad4d37c81bd/artifacts
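One way to confirm the duplication from logs like these is to group the upload messages by their local source file. The following is a minimal sketch, not CDAP code: the class `DuplicateUploadFinder` and its method are hypothetical names, and the regular expression assumes the message format shown in the DEBUG lines above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper for illustration only; it is not part of CDAP.
final class DuplicateUploadFinder {

  // Assumed message format, taken from the DataprocRuntimeJobManager DEBUG lines.
  private static final Pattern UPLOAD =
      Pattern.compile("Uploading a file of size (\\d+) bytes from (\\S+) to (\\S+)");

  // Returns, for each local file that was uploaded more than once,
  // the list of staging targets it was uploaded to (in log order).
  static Map<String, List<String>> findDuplicates(List<String> logLines) {
    Map<String, List<String>> targetsBySource = new LinkedHashMap<>();
    for (String line : logLines) {
      Matcher m = UPLOAD.matcher(line);
      // A single physical line may hold several messages if the log was flattened.
      while (m.find()) {
        targetsBySource
            .computeIfAbsent(m.group(2), k -> new ArrayList<>())
            .add(m.group(3));
      }
    }
    // Keep only sources that appear with two or more targets.
    targetsBySource.values().removeIf(targets -> targets.size() < 2);
    return targetsBySource;
  }
}
```

Run against the three lines above, this should report only artifacts.jar, with its two targets; cConf.xml is uploaded once and is filtered out.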
Request:
Evaluate whether this can be optimized to avoid uploading the same jar twice.
More info:
In the 6.5.1 codebase, updateProgramOptions[1] creates a new bundle (e.g. /data/tmp/1656715881393-0/1656715882410-0/artifacts.jar) containing a copy of all plugin jars. This jar is then uploaded twice to the staging GCS bucket under different names: “artifacts_archive.jar” and “artifacts”. This is suboptimal when plugin jars are large: it causes unnecessary disk I/O and slows down pipeline launches.
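One possible direction, sketched here only to make the request concrete: plan the staging steps so that each distinct local file is read from disk once, and any additional target name is produced by copying the object that is already staged. Everything in this sketch is an assumption rather than CDAP code. `StagingUploadPlanner` and `Step` are hypothetical names, and it assumes the staging store offers a copy operation that does not re-read the local file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; it is not part of CDAP.
final class StagingUploadPlanner {

  // One staging step: an upload from local disk, or a copy of an object already staged.
  static final class Step {
    final boolean upload;  // true = read the local file; false = copy within the staging bucket
    final String source;   // local file path for uploads, staged object name for copies
    final String target;   // object name to create in the staging bucket

    Step(boolean upload, String source, String target) {
      this.upload = upload;
      this.source = source;
      this.target = target;
    }

    @Override
    public String toString() {
      return (upload ? "UPLOAD " : "COPY ") + source + " -> " + target;
    }
  }

  // Takes a map of staging target name -> local file and returns steps in which
  // every distinct local file is uploaded only once; later targets become copies.
  static List<Step> plan(Map<String, String> targetToLocalFile) {
    Map<String, String> firstTargetForFile = new LinkedHashMap<>();
    List<Step> steps = new ArrayList<>();
    for (Map.Entry<String, String> entry : targetToLocalFile.entrySet()) {
      String target = entry.getKey();
      String localFile = entry.getValue();
      // putIfAbsent returns null the first time a local file is seen.
      String alreadyStaged = firstTargetForFile.putIfAbsent(localFile, target);
      if (alreadyStaged == null) {
        steps.add(new Step(true, localFile, target));
      } else {
        steps.add(new Step(false, alreadyStaged, target));
      }
    }
    return steps;
  }
}
```

For the case in this issue, mapping “artifacts_archive.jar” and “artifacts” to the same artifacts.jar would yield one upload step followed by one copy step. Note that the logs show uploads running on parallel threads, so a copy step would have to wait for its source upload to finish.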
Masoud Saeida Ardekani, September 16, 2022 at 5:31 PM
Discussed with Terence the historical reasons why both artifacts and artifacts_archive.jar are needed. This is a side effect of the distributed program runner being environment-agnostic: in YARN, one of the two gets expanded and the other does not.
With artifact caching in GCS, this feature is no longer a priority.
Vitalii Tymchyshyn, July 6, 2022 at 3:39 PM
I think this is too big a change to take in 6.7.1. Moving to 6.8.
Wangyuan Zhang, July 2, 2022 at 12:09 AM
Assigning to Masoud, who can delegate to others if needed.
[1] io.cdap.cdap.internal.app.runtime.distributed.DistributedProgramRunner#updateProgramOptions