Details
Assignee: Avinash Achar
Reporter: Adrika Gupta
Labels:
Triaged: Yes
Components:
Fix versions:
Priority: Major
Created: March 25, 2025 at 7:12 AM
Updated: March 26, 2025 at 5:48 AM
When we create a Dataproc cluster with the property core:fs.defaultFS set to GCS instead of the default HDFS file system, parallel runs break: CDAP copies files to a GCS folder, and once one Dataproc job completes, the files/folders are deleted and the pipeline fails with this error:
2025-03-20 12:13:39,809 - ERROR [SparkRunner-phase-6:i.c.c.i.a.r.ProgramControllerServiceAdapter@98] - Spark Program 'phase-6' failed.
java.util.concurrent.ExecutionException: java.io.FileNotFoundException: File not found: gs://<bucket>/cdap/framework/spark
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at io.cdap.cdap.app.runtime.spark.submit.AbstractSparkJobFuture.get(AbstractSparkJobFuture.java:119)
at io.cdap.cdap.app.runtime.spark.SparkRuntimeService.run(SparkRuntimeService.java:444)
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:52)
at io.cdap.cdap.app.runtime.spark.SparkRuntimeService.lambda$null$2(SparkRuntimeService.java:525)
at java.lang.Thread.run(Thread.java:750)
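For context, the core:fs.defaultFS cluster property ends up in core-site.xml, so the CDAP runtime resolves the default file system to GCS and qualifies the hardcoded framework directory against it. Below is a minimal sketch of that resolution using Hadoop's Configuration/FileSystem API; the class name and the example property values are illustrative, not actual CDAP code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: shows how the same relative framework path resolves to
// one shared GCS location for every cluster when fs.defaultFS points at GCS.
public final class DefaultFsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();              // picks up core-site.xml
    String defaultFs = conf.get("fs.defaultFS");           // e.g. "gs://<bucket>" or "hdfs://<cluster>-m:8020"
    FileSystem fs = FileSystem.get(conf);                   // GCS connector or HDFS, depending on defaultFS
    Path frameworkDir = fs.makeQualified(new Path("/cdap/framework/spark"));
    // With GCS this is the same path for every ephemeral cluster sharing the bucket,
    // so one job's cleanup deletes files another running job still needs.
    System.out.println(defaultFs + " -> " + frameworkDir);
  }
}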
When using HDFS, spark-submit uses:
spark.yarn.archive=hdfs://ckuster-m:8020/cdap/framework/spark/spark.archive-spark3_2.12-3.3.6.zip
When using GCS as the file system, spark-submit uses:
spark.yarn.archive=gs://<bucket>/cdap/framework/spark/spark.archive-spark3_2.12-3.3.6.zip
There is one shared GCS bucket for all ephemeral clusters. With HDFS, each ephemeral cluster has its own HDFS, which is why we do not encounter this issue there.
We currently hardcode the path cdap/framework/spark. We should add something unique to the path so that, when using GCS, each cluster gets its own folder. For example:
gs://<bucket>/<pipeline-id>${logicalStartTime(yyyy-MM-dd'T'HH-mm-ss)}/cdap/framework/spark/spark.archive-spark3_2.12-3.3.6.zip
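A minimal sketch of how such a per-run archive path could be built; the helper class, method, and parameters below are hypothetical and only illustrate the proposed uniqueness, not the actual CDAP change.

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public final class UniqueSparkArchivePath {

  // Matches the logicalStartTime pattern used in the example above.
  private static final DateTimeFormatter TS_FORMAT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH-mm-ss");

  /**
   * Builds a framework archive path that is unique per pipeline run, so parallel
   * runs sharing one GCS bucket no longer delete each other's files.
   */
  static String buildArchivePath(String bucket, String pipelineId, long logicalStartTimeMillis) {
    String ts = TS_FORMAT.format(
        ZonedDateTime.ofInstant(Instant.ofEpochMilli(logicalStartTimeMillis), ZoneOffset.UTC));
    return String.format(
        "gs://%s/%s%s/cdap/framework/spark/spark.archive-spark3_2.12-3.3.6.zip",
        bucket, pipelineId, ts);
  }

  public static void main(String[] args) {
    // e.g. gs://my-bucket/my-pipeline2025-03-20T12-13-39/cdap/framework/spark/...
    System.out.println(buildArchivePath("my-bucket", "my-pipeline", System.currentTimeMillis()));
  }
}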