Google Cloud Storage Sink
Plugin version: 0.22.0
This plugin writes records to one or more files in a directory on Google Cloud Storage. Files can be written in various formats such as csv, avro, parquet, and json.
Cloud Storage allows world-wide storage and retrieval of any amount of data at any time. You can use Cloud Storage for a range of scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users via direct download.
Credentials
If the plugin is run on a Google Cloud Dataproc cluster, the service account key does not need to be provided and can be set to 'auto-detect'. Credentials will be automatically read from the cluster environment.
If the plugin is not run on a Dataproc cluster, the path to a service account key must be provided. The service account key can be found on the Dashboard in the Cloud Platform Console. Make sure the account key has permission to access Google Cloud Storage. The service account key file needs to be available on every node in your cluster and must be readable by all users running the job.
Configuration
Property | Macro Enabled? | Version Introduced | Description |
---|---|---|---|
Use Connection | No | 6.7.0/0.20.0 | Optional. Whether to use a connection. If a connection is used, you do not need to provide the credentials. |
Connection | Yes | 6.7.0/0.20.0 | Optional. Name of the connection to use. Project and service account information will be provided by the connection. |
Project ID | Yes | | Optional. Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. Default is auto-detect. |
Service Account Type | Yes | 6.3.0/0.16.0 | Optional. Select one of the following options: 'File Path' or 'JSON'. |
Service Account File Path | Yes | | Optional. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster. Default is auto-detect. |
Service Account JSON | Yes | 6.3.0/0.16.0 | Optional. Content of the service account key. |
Reference Name | No | | Optional. Name used to uniquely identify this sink for lineage, annotating metadata, etc. |
Path | Yes | | Required. Path to write to. For example, gs://&lt;bucket&gt;/path/to/. You can also use the logicalStartTime macro function to append a date to the output file name. |
Path Suffix | Yes | | Optional. Time format for the output directory that will be appended to the path. For example, the format 'yyyy-MM-dd-HH-mm' will result in a directory of the form '2015-01-01-20-42'. If not specified, nothing will be appended to the path. Default is yyyy-MM-dd-HH-mm. |
Format | No | | Required. Format to write the records in. The format must be one of 'json', 'avro', 'parquet', 'csv', 'tsv', or 'delimited'. If the Format is a macro, only the pre-packaged formats can be used. Default is json. |
Delimiter | Yes | | Optional. Delimiter to use if the format is 'delimited'. The delimiter will be ignored if the format is anything other than 'delimited'. |
Write Header | Yes | | Optional. Whether to write a header to each file if the format is 'delimited', 'csv', or 'tsv'. Default is false. |
Location | Yes | | Optional. The location where the GCS bucket will be created. This value is ignored if the bucket already exists. Default is auto-detect. |
Content Type | Yes | | Optional. The Content Type indicates the media type of the resource. Default is 'application/octet-stream'. See the table below for valid content types for each format. |
Custom Content Type | Yes | | Optional. Used when Content Type is set to 'other'. Lets you provide a specific Content-Type different from the options in the dropdown. More information about Content-Type can be found at https://cloud.google.com/storage/docs/metadata. |
Encryption Key Name | Yes | 6.5.1/0.18.1 | Optional. The GCP customer-managed encryption key (CMEK) used to encrypt data written to any bucket created by the plugin. If the bucket already exists, this is ignored. |
Output File Prefix | Yes | 6.1.x | Optional. Prefix for the output file name. |
File System Properties | Yes | 6.1.x | Optional. Additional properties to use with the OutputFormat when writing the data. You can use this property to set the prefix of the output file name. You might also use this property when you have multiple runs of a reusable pipeline writing to the same output directory. |
Output Schema | Yes | | Optional. Schema of the data to write. The 'avro' and 'parquet' formats require a schema, but other formats do not. |
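As a sketch of the File System Properties option above: output-file prefixes in Hadoop-based sinks are commonly controlled by the `avro.mo.config.namedOutput` property for the avro format and by `mapreduce.output.basename` for other formats. The property names below are assumptions based on the underlying Hadoop OutputFormats, and the prefix value `sales` is a hypothetical example; verify both against your plugin version.

```json
{ "avro.mo.config.namedOutput": "sales" }
```

For any format other than avro, the equivalent sketch would be:

```json
{ "mapreduce.output.basename": "sales" }
```

With a prefix of `sales`, output files would be named with that prefix (for example, `sales-r-00000`) instead of the default `part` prefix.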
Valid Content Types
Format type | Content type |
---|---|
avro | application/avro, application/octet-stream |
csv | text/csv, application/csv, text/plain, application/octet-stream |
delimited | text/csv, application/csv, text/tab-separated-values, text/plain, application/octet-stream |
json | application/json, text/plain, application/octet-stream |
orc | application/octet-stream |
parquet | application/octet-stream |
tsv | text/tab-separated-values, text/plain, application/octet-stream |
Created in 2020 by Google Inc.