Add a unique value in file system path for spark.yarn.archive property

Description

When a Dataproc cluster is created with the property core:fs.defaultFS set so that GCS replaces HDFS as the default file system, parallel pipeline runs interfere with each other.  
CDAP copies the Spark framework files to a folder in the GCS bucket, and once a Dataproc job completes, those files and folders are deleted. A run that still depends on them then fails with the error:


2025-03-20 12:13:39,809 - ERROR [SparkRunner-phase-6:i.c.c.i.a.r.ProgramControllerServiceAdapter@98] - Spark Program 'phase-6' failed.  
java.util.concurrent.ExecutionException: java.io.FileNotFoundException: File not found: gs://<bucket>/cdap/framework/spark  
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)  
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)  
    at io.cdap.cdap.app.runtime.spark.submit.AbstractSparkJobFuture.get(AbstractSparkJobFuture.java:119)  
    at io.cdap.cdap.app.runtime.spark.SparkRuntimeService.run(SparkRuntimeService.java:444)  
    at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:52)  
    at io.cdap.cdap.app.runtime.spark.SparkRuntimeService.lambda$null$2(SparkRuntimeService.java:525)  
    at java.lang.Thread.run(Thread.java:750)  

When using HDFS, spark submit uses:  
spark.yarn.archive=hdfs://ckuster-m:8020/cdap/framework/spark/spark.archive-spark3_2.12-3.3.6.zip

When using GCS as file system, spark submit uses:  
spark.yarn.archive=gs://\<bucket>/cdap/framework/spark/spark.archive-spark3_2.12-3.3.6.zip

All ephemeral clusters share a single GCS bucket. With HDFS, each ephemeral cluster has its own HDFS instance, which is why the issue does not occur there.
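The collision can be illustrated with a small local-filesystem analogy. This is only a sketch: a temporary directory stands in for the shared GCS bucket, and the class name, method names, archive file name, and the `run-a`/`run-b` prefixes are all hypothetical and not taken from CDAP.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Local analogy only: a temp directory plays the role of the shared bucket.
class SharedArchiveCollision {

    // Copy a (dummy) framework archive into the given folder, as a run would at startup.
    static Path stage(Path frameworkDir) throws IOException {
        Files.createDirectories(frameworkDir);
        Path archive = frameworkDir.resolve("spark.archive.zip"); // hypothetical file name
        Files.write(archive, new byte[] {1});                      // overwrites an existing copy
        return archive;
    }

    // Delete a folder and everything below it, as cleanup does after a job completes.
    static void deleteTree(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order so children are deleted before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        }
    }

    // Returns true if run B's archive still exists after run A has finished and cleaned up.
    static boolean secondRunSurvives(boolean uniquePerRun) {
        try {
            Path bucket = Files.createTempDirectory("bucket");
            String suffix = "cdap/framework/spark";
            Path dirA = uniquePerRun ? bucket.resolve("run-a").resolve(suffix) : bucket.resolve(suffix);
            Path dirB = uniquePerRun ? bucket.resolve("run-b").resolve(suffix) : bucket.resolve(suffix);

            stage(dirA);                 // run A stages its archive
            Path archiveB = stage(dirB); // run B starts in parallel
            deleteTree(dirA);            // run A completes first and cleans up

            boolean survives = Files.exists(archiveB);
            deleteTree(bucket);          // remove the temp "bucket"
            return survives;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("shared path, run B archive present: " + secondRunSurvives(false));
        System.out.println("unique path, run B archive present: " + secondRunSurvives(true));
    }
}
```

With the shared folder, cleanup by the first run removes the archive the second run still needs; with a per-run prefix, each run only deletes its own copy.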

The path cdap/framework/spark is currently hardcoded.
We should add a unique component to the path so that, on GCS, each cluster gets its own folder. E.g.: gs://\<bucket>/<pipeline-id>${logicalStartTime(yyyy-MM-dd'T'HH-mm-ss)}/cdap/framework/spark/spark.archive-spark3_2.12-3.3.6.zip
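A minimal sketch of how such a path could be assembled, assuming the pipeline id and logical start time are available where the archive location is computed. The class, method, and parameter names are hypothetical and not existing CDAP APIs; the hyphen between the pipeline id and the timestamp and the use of UTC are also assumptions added for readability.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of CDAP: builds a per-run location for the Spark archive.
class UniqueArchivePath {

    // Same timestamp pattern as in the example path above; UTC is an assumption.
    private static final DateTimeFormatter START_TIME_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH-mm-ss").withZone(ZoneOffset.UTC);

    static String build(String defaultFs, String pipelineId,
                        Instant logicalStartTime, String archiveName) {
        // Drop a trailing slash so the result does not contain "//" after the bucket.
        String base = defaultFs.endsWith("/")
            ? defaultFs.substring(0, defaultFs.length() - 1)
            : defaultFs;
        // Unique component: pipeline id plus formatted logical start time.
        String runPrefix = pipelineId + "-" + START_TIME_FORMAT.format(logicalStartTime);
        return base + "/" + runPrefix + "/cdap/framework/spark/" + archiveName;
    }

    public static void main(String[] args) {
        String path = build("gs://my-bucket", "pipelineA",
            Instant.parse("2025-03-20T12:13:39Z"),
            "spark.archive-spark3_2.12-3.3.6.zip");
        System.out.println("spark.yarn.archive=" + path);
    }
}
```

Two runs of the same pipeline that share a logical start time down to the second would still collide under this scheme; appending the run id would be one way to rule that out.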


Release Notes

None

Details

Triaged

Yes

Created March 25, 2025 at 7:12 AM
Updated March 26, 2025 at 5:48 AM