Details
Assignee: Avinash Achar
Reporter: Adrika Gupta
Labels:
Triaged: Yes
Components:
Fix versions:
Priority: Major
Created: March 25, 2025 at 7:12 AM
Updated: March 26, 2025 at 5:48 AM
When we create a Dataproc cluster with the property core:fs.defaultFS set to GCS instead of the default HDFS file system, parallel runs break: CDAP copies files to a GCS folder, and once one Dataproc job completes, the files/folders are deleted and the pipeline fails with this error:
2025-03-20 12:13:39,809 - ERROR [SparkRunner-phase-6:i.c.c.i.a.r.ProgramControllerServiceAdapter@98] - Spark Program 'phase-6' failed.
java.util.concurrent.ExecutionException: java.io.FileNotFoundException: File not found: gs://<bucket>/cdap/framework/spark
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at io.cdap.cdap.app.runtime.spark.submit.AbstractSparkJobFuture.get(AbstractSparkJobFuture.java:119)
at io.cdap.cdap.app.runtime.spark.SparkRuntimeService.run(SparkRuntimeService.java:444)
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:52)
at io.cdap.cdap.app.runtime.spark.SparkRuntimeService.lambda$null$2(SparkRuntimeService.java:525)
at java.lang.Thread.run(Thread.java:750)
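For context, the core:fs.defaultFS cluster property ends up in core-site.xml, so the CDAP runtime resolves the default file system to GCS and qualifies the hardcoded framework directory against it. Below is a minimal sketch of that resolution using Hadoop's Configuration/FileSystem API; the class name and the example property values are illustrative, not actual CDAP code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: shows how the same relative framework path resolves to
// one shared GCS location for every cluster when fs.defaultFS points at GCS.
public final class DefaultFsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();              // picks up core-site.xml
    String defaultFs = conf.get("fs.defaultFS");           // e.g. "gs://<bucket>" or "hdfs://<cluster>-m:8020"
    FileSystem fs = FileSystem.get(conf);                   // GCS connector or HDFS, depending on defaultFS
    Path frameworkDir = fs.makeQualified(new Path("/cdap/framework/spark"));
    // With GCS this is the same path for every ephemeral cluster sharing the bucket,
    // so one job's cleanup deletes files another running job still needs.
    System.out.println(defaultFs + " -> " + frameworkDir);
  }
}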
When using HDFS, spark-submit uses:
spark.yarn.archive=hdfs://ckuster-m:8020/cdap/framework/spark/spark.archive-spark3_2.12-3.3.6.zip
When using GCS as the file system, spark-submit uses:
spark.yarn.archive=gs://<bucket>/cdap/framework/spark/spark.archive-spark3_2.12-3.3.6.zip
There is one shared GCS bucket for all ephemeral clusters. With HDFS, each ephemeral cluster has its own HDFS, which is why we do not encounter this issue there.
We currently hardcode the path cdap/framework/spark. We should add something unique to the path so that, when using GCS, each cluster gets its own folder. For example:
gs://<bucket>/<pipeline-id>${logicalStartTime(yyyy-MM-dd'T'HH-mm-ss)}/cdap/framework/spark/spark.archive-spark3_2.12-3.3.6.zip
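A minimal sketch of how such a per-run archive path could be built; the helper class, method, and parameters below are hypothetical and only illustrate the proposed uniqueness, not the actual CDAP change.

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public final class UniqueSparkArchivePath {

  // Matches the logicalStartTime pattern used in the example above.
  private static final DateTimeFormatter TS_FORMAT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH-mm-ss");

  /**
   * Builds a framework archive path that is unique per pipeline run, so parallel
   * runs sharing one GCS bucket no longer delete each other's files.
   */
  static String buildArchivePath(String bucket, String pipelineId, long logicalStartTimeMillis) {
    String ts = TS_FORMAT.format(
        ZonedDateTime.ofInstant(Instant.ofEpochMilli(logicalStartTimeMillis), ZoneOffset.UTC));
    return String.format(
        "gs://%s/%s%s/cdap/framework/spark/spark.archive-spark3_2.12-3.3.6.zip",
        bucket, pipelineId, ts);
  }

  public static void main(String[] args) {
    // e.g. gs://my-bucket/my-pipeline2025-03-20T12-13-39/cdap/framework/spark/...
    System.out.println(buildArchivePath("my-bucket", "my-pipeline", System.currentTimeMillis()));
  }
}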