Temporary GCS buckets are not cleaned up when the BigQuerySink is used in the pipeline.

Description

Empty buckets are left behind in the project after the pipeline completes.

Release Notes

None

Activity

Sagar Kapare
May 5, 2021, 5:43 PM

This looks like a problem with the google-cloud plugin artifacts from the hub. I am not able to reproduce the issue with 0.18.0-SNAPSHOT, but it is reproducible with the plugins from the hub, which are at version 0.17.1. The issue was supposed to be fixed in 0.17.1; however, for some reason the fix is missing from the hub artifacts.

Sagar Kapare
May 3, 2021, 5:30 PM

Saw it with versions 0.17.0 and 0.17.1 of the BQ plugin.

I think AbstractBigQuerySink#115 will be executed even if the bucket name is not provided. The reason is that bucketName is not null at that point: when it is null, it is initialized to a UUID here: https://github.com/data-integrations/google-cloud/blob/develop/src/main/java/io/cdap/plugin/gcp/bigquery/sink/AbstractBigQuerySink.java#L243
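
To illustrate the point about the null check, here is a minimal sketch. It is not the plugin's code: `BucketGuardSketch`, `SinkConfig`, `bucket()`, `bucketWasProvided()` and the bucket name `my-bucket` are hypothetical names used only for illustration. It assumes the behaviour described above, where a missing bucket name is replaced by a UUID before the cleanup check runs.

```java
import java.util.UUID;

/** Hypothetical illustration only; none of these names come from the plugin. */
public class BucketGuardSketch {

    /** Assumed model of a config whose missing bucket name defaults to a UUID. */
    public static final class SinkConfig {
        private final String configuredBucket;
        private final String effectiveBucket;

        public SinkConfig(String configuredBucket) {
            this.configuredBucket = configuredBucket;
            // Assumption: defaulting happens up front, as the comment above describes.
            this.effectiveBucket = configuredBucket == null
                ? UUID.randomUUID().toString()
                : configuredBucket;
        }

        /** Never null once defaulting has run. */
        public String bucket() {
            return effectiveBucket;
        }

        /** Remembers whether the user actually supplied a bucket name. */
        public boolean bucketWasProvided() {
            return configuredBucket != null;
        }
    }

    public static void main(String[] args) {
        SinkConfig autoCreated = new SinkConfig(null);
        SinkConfig userSupplied = new SinkConfig("my-bucket");

        // A null check on the effective name cannot tell the two cases apart.
        System.out.println(autoCreated.bucket() != null);   // true
        System.out.println(userSupplied.bucket() != null);  // true

        // Only the separately tracked flag distinguishes them.
        System.out.println(autoCreated.bucketWasProvided());   // false
        System.out.println(userSupplied.bucketWasProvided());  // true
    }
}
```

If the real code behaves like this, a branch guarded by a null check on the bucket name would run in both the user-supplied and the auto-created case.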

Albert Shau
May 3, 2021, 4:31 PM

Do you know which version of the plugin you saw this with? This was supposed to be fixed with PLUGIN-635.

 

AbstractBigQuerySink#115 is code that runs when the bucket is specified, not when the bucket is auto-created by the plugin.

Sagar Kapare
April 30, 2021, 7:31 PM

It looks like when the bucket name is not provided, it is initialized to a UUID (for example, `79bed01d-489b-4677-b6ce-aaa2b97bc938` in one run). This initialization happens here https://github.com/data-integrations/google-cloud/blob/develop/src/main/java/io/cdap/plugin/gcp/bigquery/sink/AbstractBigQuerySink.java#L99 during prepareRun, so the bucket is set to `gs://79bed01d-489b-4677-b6ce-aaa2b97bc938` at this point.

BigQuerySinkUtils then adds the same UUID to this bucket again here - https://github.com/data-integrations/google-cloud/blob/develop/src/main/java/io/cdap/plugin/gcp/bigquery/sink/BigQuerySinkUtils.java#L95 resulting in the path `gs://79bed01d-489b-4677-b6ce-aaa2b97bc938/79bed01d-489b-4677-b6ce-aaa2b97bc938/…`

Once the run is finished, we delete the GCS path here https://github.com/data-integrations/google-cloud/blob/develop/src/main/java/io/cdap/plugin/gcp/bigquery/sink/AbstractBigQuerySink.java#L115 i.e. `gs://79bed01d-489b-4677-b6ce-aaa2b97bc938/79bed01d-489b-4677-b6ce-aaa2b97bc938/…` in this case, leaving the empty top-level bucket `gs://79bed01d-489b-4677-b6ce-aaa2b97bc938` behind in the project.
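
The sequence above can be sketched as follows. This is a simplified model of the path handling as described in this comment, not the plugin's code: `TempBucketSketch`, `bucketOrDefault`, `temporaryPath` and `bucketOf` are hypothetical helpers, and the exact way the plugin builds and deletes paths is assumed.

```java
/** Hypothetical model of the path handling described above; not the plugin's code. */
public class TempBucketSketch {

    /** Assumed prepareRun step: a missing bucket name defaults to the run's UUID. */
    public static String bucketOrDefault(String configuredBucket, String runUuid) {
        return configuredBucket == null ? runUuid : configuredBucket;
    }

    /** Assumed BigQuerySinkUtils step: the same UUID is appended under the bucket. */
    public static String temporaryPath(String bucket, String runUuid) {
        return "gs://" + bucket + "/" + runUuid;
    }

    /** Extracts the bucket portion of a gs:// path. */
    public static String bucketOf(String gcsPath) {
        String rest = gcsPath.substring("gs://".length());
        int slash = rest.indexOf('/');
        return slash < 0 ? rest : rest.substring(0, slash);
    }

    public static void main(String[] args) {
        String uuid = "79bed01d-489b-4677-b6ce-aaa2b97bc938";

        String bucket = bucketOrDefault(null, uuid);
        String path = temporaryPath(bucket, uuid);

        // The path that is deleted at the end of the run sits one level below the bucket.
        System.out.println("deleted path:   " + path);
        // The bucket itself is a different resource and is never targeted by that delete.
        System.out.println("bucket remains: gs://" + bucketOf(path));
    }
}
```

Under these assumptions, deleting `path` removes only the contents under the bucket, so the bucket itself also has to be deleted when it was auto-created.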


Assignee

Albert Shau

Reporter

Sagar Kapare

Labels