Google Cloud Storage Multi File Sink

Plugin version: 0.22.0

This plugin is normally used in conjunction with the Multiple Database Table batch source to write records from multiple databases into multiple directories in various formats. The plugin expects that the directories it needs to write to will be set as pipeline arguments, where the key is 'multisink.[directory]' and the value is the schema of the data.

Normally, you rely on the Multiple Database Table source to set those pipeline arguments, but they can also be set manually or by an Action plugin in your pipeline. The sink expects each record to contain a special split field that determines which directory the record is written to. For example, suppose the split field is 'tablename'. A record whose 'tablename' field is set to 'activity' will be written to the 'activity' directory.
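For instance, an argument whose key is 'multisink.activity' and whose value is the schema of the activity records tells the sink to write matching records to an 'activity' directory. A minimal, hypothetical sketch of such an argument follows; the schema JSON is only an illustration of the expected value, and in practice it is whatever schema the upstream source sets:

multisink.activity = {"type":"record","name":"activity","fields":[{"name":"userid","type":"long"},{"name":"item","type":"string"},{"name":"action","type":"string"}]}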

This plugin writes records to one or more files in Avro, Parquet, ORC, JSON, or delimited (CSV/TSV) format in a directory on Google Cloud Storage.

Cloud Storage allows world-wide storage and retrieval of any amount of data at any time. You can use Cloud Storage for a range of scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users via direct download.

Credentials

If the plugin is run on a Google Cloud Dataproc cluster, the service account key does not need to be provided and can be set to 'auto-detect'. Credentials will be automatically read from the cluster environment.

If the plugin is not run on a Dataproc cluster, the path to a service account key must be provided. The service account key can be found on the Dashboard in the Cloud Platform Console. Make sure the account key has permission to access Google Cloud Storage. The service account key file needs to be available on every node in your cluster and must be readable by all users running the job.

Configuration

| Property | Macro Enabled? | Version Introduced | Description |
| -------- | -------------- | ------------------ | ----------- |
| Use Connection | No | 6.7.0/0.20.0 | Optional. Whether to use a connection. If a connection is used, you do not need to provide the credentials. |
| Connection | Yes | 6.7.0/0.20.0 | Optional. Name of the connection to use. Project and service account information will be provided by the connection. You can also use the macro function ${conn(connection_name)}. |
| Project ID | Yes |  | Optional. The Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. |
| Service Account Type | Yes | 6.3.0/0.16.0 | Optional. Select one of the following options: 'File Path' (file path where the service account is located) or 'JSON' (JSON content of the service account). |
| Service Account File Path | Yes |  | Optional. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster. Default is 'auto-detect'. |
| Service Account JSON | Yes | 6.3.0/0.16.0 | Optional. Content of the service account. |
| Reference Name | No |  | Required. This, along with the table name, will be used to uniquely identify this sink for lineage, annotating metadata, etc. |
| Path | Yes |  | Required. The path to write to. For example, gs://<bucket>/path/to/directory. You can also use the logicalStartTime macro function to append a date to the output filename (see the example after this table). |
| Path Suffix | Yes |  | Optional. The time format for the output directory that will be appended to the path. For example, the format 'yyyy-MM-dd-HH-mm' will result in a directory of the form '2015-01-01-20-42'. If not specified, nothing will be appended to the path. Default is 'yyyy-MM-dd-HH-mm'. |
| Format | Yes |  | Required. Format to write the records in. The format must be one of 'avro', 'csv', 'delimited', 'json', 'orc', 'parquet', or 'tsv'. Default is 'json'. |
| Delimiter | Yes |  | Optional. Delimiter to use if the format is 'delimited'. The delimiter is ignored if the format is anything other than 'delimited'. |
| Write Header | Yes |  | Optional. Whether to write a header to each file if the format is 'delimited', 'csv', or 'tsv'. |
| Codec | No |  | Optional. The codec to use when writing data. Must be 'none', 'snappy', 'gzip', or 'deflate'. The 'avro' format supports 'snappy' and 'deflate'. The 'parquet' format supports 'snappy' and 'gzip'. Other formats do not support compression. Default is 'none'. |
| Split Field | No |  | Required. The name of the field that will be used to determine which directory to write to. Default is 'tablename'. |
| Allow flexible schemas in Output | Yes |  | Optional. When enabled, this sink will write out records with arbitrary schemas. Records may not have a well-defined schema depending on the source. When enabled, the format must be one of 'avro', 'json', 'csv', 'tsv', or 'delimited'. |
| Location | Yes |  | Optional. The location where the GCS bucket will be created. This value is ignored if the bucket already exists. Default is 'US'. |
| Content Type | Yes |  | Optional. The Content Type entity is used to indicate the media type of the resource. Defaults to 'application/octet-stream'. See the table below for valid content types for each format. |
| Custom Content Type | Yes |  | Optional. Used when Content Type is set to 'other'. Lets you provide a specific Content-Type that is not among the options in the dropdown. More information about the Content-Type can be found at https://cloud.google.com/storage/docs/metadata. |
| Encryption Key Name | Yes | 6.5.1/0.18.1 | Optional. Used to encrypt data written to any bucket created by the plugin. If the bucket already exists, this is ignored. |
| Output Schema | Yes |  | Required. Schema of the data to write. The 'avro', 'parquet', and 'orc' formats require a schema; other formats do not. |
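As a hedged illustration of how Path, Format, and Split Field fit together (the bucket name 'my-bucket' is a placeholder, and ${logicalStartTime(...)} is the macro function referenced in the Path description):

Path: gs://my-bucket/multisink-output/${logicalStartTime(yyyy-MM-dd)}
Format: avro
Split Field: tablename

With this configuration, each run writes under a dated directory, and records are split into one subdirectory per distinct value of the 'tablename' field.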

Valid Content Types

| Format type | Content type |
| ----------- | ------------ |
| avro | application/avro, application/octet-stream |
| csv | text/csv, application/csv, text/plain, application/octet-stream |
| delimited | text/csv, application/csv, text/tab-separated-values, text/plain, application/octet-stream |
| json | application/json, text/plain, application/octet-stream |
| orc | application/octet-stream |
| parquet | application/octet-stream |
| tsv | text/tab-separated-values, text/plain, application/octet-stream |

Example

Suppose the input records are:

| id | name | email | tablename |
| -- | ---- | ----- | --------- |
| 0 | Samuel | sjax@example.net | accounts |
| 1 | Alice | a@example.net | accounts |

| userid | item | action | tablename |
| ------ | ---- | ------ | --------- |
| 0 | shirt123 | view | activity |
| 0 | carxyz | view | activity |
| 0 | shirt123 | buy | activity |
| 0 | coffee | view | activity |
| 1 | cola | buy | activity |

The plugin will expect two pipeline arguments to tell it to write the first two records to an 'accounts' directory and the remaining five records to an 'activity' directory.
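Continuing the example, and assuming the Split Field is left at its default of 'tablename', those two pipeline arguments could look like the following sketch. The schema JSON values are illustrative; in practice they are whatever schemas the Multiple Database Table source sets, and whether the split field itself appears in them depends on the source (it is omitted here for brevity):

multisink.accounts = {"type":"record","name":"accounts","fields":[{"name":"id","type":"long"},{"name":"name","type":"string"},{"name":"email","type":"string"}]}
multisink.activity = {"type":"record","name":"activity","fields":[{"name":"userid","type":"long"},{"name":"item","type":"string"},{"name":"action","type":"string"}]}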

Created in 2020 by Google Inc.