Temporary GCS buckets are not cleaned up when the BigQuerySink is used in the pipeline.


Empty buckets are left behind in the project after the pipeline completes.

Sagar Kapare
May 5, 2021, 5:43 PM

This looks like a problem with the google-cloud plugin artifacts from the hub. I am not able to reproduce the issue with 0.18.0-SNAPSHOT, but it is reproducible with the plugins from the hub, which are at version 0.17.1. The issue was supposed to be fixed in 0.17.1; however, for some reason the fix is missing from the hub artifacts.

Sagar Kapare
May 3, 2021, 5:30 PM

Saw it with 0.17.0 and 0.17.1 of the BQ plugin.

I think AbstractBigQuerySink#115 will be executed even if the bucket name is not provided. The reason is that bucketName is not null at that point: when it is null, it is initialized to a UUID here https://github.com/data-integrations/google-cloud/blob/develop/src/main/java/io/cdap/plugin/gcp/bigquery/sink/AbstractBigQuerySink.java#L243.

Albert Shau
May 3, 2021, 4:31 PM

Do you know which version of the plugin you saw this with? This was supposed to be fixed with PLUGIN-635.


AbstractBigQuerySink#115 is code that runs when the bucket is specified, not when the bucket is auto-created by the plugin.

Sagar Kapare
April 30, 2021, 7:31 PM

It looks like when the bucket name is not provided, it is initialized to a UUID (for example, `79bed01d-489b-4677-b6ce-aaa2b97bc938` in one run). This initialization happens here https://github.com/data-integrations/google-cloud/blob/develop/src/main/java/io/cdap/plugin/gcp/bigquery/sink/AbstractBigQuerySink.java#L99 during prepareRun, so the bucket is set to `gs://79bed01d-489b-4677-b6ce-aaa2b97bc938` at this point.

BigQuerySinkUtils then appends the same UUID to this bucket here https://github.com/data-integrations/google-cloud/blob/develop/src/main/java/io/cdap/plugin/gcp/bigquery/sink/BigQuerySinkUtils.java#L95, resulting in the path `gs://79bed01d-489b-4677-b6ce-aaa2b97bc938/79bed01d-489b-4677-b6ce-aaa2b97bc938/…`

Once the run is finished, we delete the GCS path here https://github.com/data-integrations/google-cloud/blob/develop/src/main/java/io/cdap/plugin/gcp/bigquery/sink/AbstractBigQuerySink.java#L115, i.e. `gs://79bed01d-489b-4677-b6ce-aaa2b97bc938/79bed01d-489b-4677-b6ce-aaa2b97bc938/…` in this case, leaving the empty top-level bucket `gs://79bed01d-489b-4677-b6ce-aaa2b97bc938` behind in the project.
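
To make the sequence above easier to follow, here is a minimal sketch of the suspected behavior using plain strings only. It is not the plugin's code: the class `TempBucketSketch` and the helpers `resolveBucket`, `temporaryPath`, and `bucketSurvivesPathDelete` are hypothetical names made up for illustration, and no GCS calls are made.

```java
import java.util.UUID;

public class TempBucketSketch {

    // Hypothetical: mimics defaulting the bucket name to the run's UUID
    // when the user did not configure one.
    public static String resolveBucket(String configuredBucket, String runId) {
        return configuredBucket == null ? runId : configuredBucket;
    }

    // Hypothetical: mimics appending the same run UUID as a directory
    // under the bucket to form the temporary path.
    public static String temporaryPath(String bucket, String runId) {
        return "gs://" + bucket + "/" + runId;
    }

    // Assumption: cleanup removes only the path it is given. The bucket itself
    // is removed only if the deleted path is exactly the bucket root.
    public static boolean bucketSurvivesPathDelete(String bucket, String deletedPath) {
        return !deletedPath.equals("gs://" + bucket);
    }

    public static void main(String[] args) {
        String runId = UUID.randomUUID().toString();

        // No bucket configured, so the bucket name becomes the run UUID.
        String bucket = resolveBucket(null, runId);

        // The same UUID is appended again as a sub-path.
        String tempPath = temporaryPath(bucket, runId);

        System.out.println("bucket:       gs://" + bucket);
        System.out.println("deleted path: " + tempPath);
        System.out.println("bucket left behind: " + bucketSurvivesPathDelete(bucket, tempPath));
    }
}
```

Under these assumptions, the cleanup targets `gs://<uuid>/<uuid>` rather than `gs://<uuid>`, so the auto-created bucket stays behind empty, which matches what we see in the project.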

