Google Cloud Storage File Reader Batch Source

Plugin version: 0.22.0

This plugin reads objects from a path in a Google Cloud Storage bucket.

Cloud Storage allows world-wide storage and retrieval of any amount of data at any time. You can use Cloud Storage for a range of scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users via direct download.

Credentials

If the plugin is run on a Google Cloud Dataproc cluster, the service account key does not need to be provided and can be set to 'auto-detect'. Credentials will be automatically read from the cluster environment.

If the plugin is not run on a Dataproc cluster, the path to a service account key must be provided. The service account key can be found on the Dashboard in the Cloud Platform Console. Make sure the account key has permission to access Google Cloud Storage. The service account key file needs to be available on every node in your cluster and must be readable by all users running the job.
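For reference, the key file is an ordinary JSON file that Google's auth library can load directly. The short, hypothetical check below (the path is made up) can help confirm that a key placed on a node is readable and parseable; it is an illustration, not part of the plugin:

import com.google.auth.oauth2.ServiceAccountCredentials;
import java.io.FileInputStream;

// Illustration only: verify that a service account key file can be read and parsed.
public class KeyFileCheck {
  public static void main(String[] args) throws Exception {
    String keyPath = "/etc/security/gcs-service-account.json"; // hypothetical location on a node
    try (FileInputStream in = new FileInputStream(keyPath)) {
      ServiceAccountCredentials credentials = ServiceAccountCredentials.fromStream(in);
      System.out.println("Key file parsed; service account: " + credentials.getClientEmail());
    }
  }
}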

Configuration

Each property below notes whether it can be set with a macro (Macro Enabled) and, where applicable, the version in which it was introduced (Version Introduced).

Use Connection (Macro Enabled: No; Version Introduced: 6.5.0/0.18.0)

Optional. Whether to use a connection. If a connection is used, you do not need to provide the credentials.

Connection (Macro Enabled: Yes; Version Introduced: 6.5.0/0.18.0)

Optional. Name of the connection to use. Project and service account information will be provided by the connection. You can also use the macro function ${conn(connection_name)}.

Project ID (Macro Enabled: Yes)

Optional. Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console.

Default is auto-detect.

Service Account Type (Macro Enabled: Yes; Version Introduced: 6.3.0/0.16.0)

Optional. Select one of the following options:

  • File Path. File path where the service account is located.

  • JSON. JSON content of the service account.

Service Account File Path (Macro Enabled: Yes)

Optional. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster.

Default is auto-detect.

Service Account JSON (Macro Enabled: Yes; Version Introduced: 6.3.0/0.16.0)

Optional. Content of the service account.

Reference Name (Macro Enabled: No)

Required. Name used to uniquely identify this source for lineage, annotating metadata, etc.

Path (Macro Enabled: Yes)

Required. Path to file(s) to be read. If a directory is specified, terminate the path name with a '/'. For example, gs://<bucket>/path/to/directory/. An asterisk ("*") can be used as a wildcard to match a filename pattern. If no files are found or matched, the pipeline will fail.

Format (Macro Enabled: No)

Required. Format of the data to read. The format must be one of 'avro', 'blob', 'csv', 'delimited', 'json', 'parquet', 'text', or 'tsv'. The 'blob' format also requires a schema that contains a field named 'body' of type 'bytes'. If the format is 'text', the schema must contain a field named 'body' of type 'string'.

If the format is a macro, only the pre-packaged formats can be used.
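As a rough illustration of those schema requirements, the sketch below builds the minimal 'text' and 'blob' schemas with the CDAP Schema API (the record name "output" is an arbitrary choice for illustration):

import io.cdap.cdap.api.data.schema.Schema;

// Illustration: minimal schemas satisfying the 'text' and 'blob' format requirements above.
public class FormatSchemas {
  // 'text' format: the schema must contain a field named 'body' of type 'string'.
  static final Schema TEXT_SCHEMA = Schema.recordOf(
      "output",
      Schema.Field.of("body", Schema.of(Schema.Type.STRING)));

  // 'blob' format: the schema must contain a field named 'body' of type 'bytes'.
  static final Schema BLOB_SCHEMA = Schema.recordOf(
      "output",
      Schema.Field.of("body", Schema.of(Schema.Type.BYTES)));
}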

Sample Size (Macro Enabled: Yes; Version Introduced: 6.4.0/0.17.0)

Optional. The maximum number of rows to sample for automatic data type detection.

Default is 1000.

Override (Macro Enabled: Yes; Version Introduced: 6.4.0/0.17.0)

Optional. A list of columns with their corresponding data types, for which automatic data type detection is skipped.

Delimiter (Macro Enabled: Yes)

Optional. Delimiter to use when the format is 'delimited'. This will be ignored for other formats.

Enable Quoted Values (Macro Enabled: Yes; Version Introduced: 6.7.0/0.20.0)

Optional. Whether to treat content between quotes as a value. This value will only be used if the format is 'csv', 'tsv', or 'delimited'. For example, if this is set to true, a line that looks like 1, "a, b, c" will output two fields. The first field will have 1 as its value and the second will have a, b, c as its value. The quote characters will be trimmed. The newline delimiter cannot be within quotes.

It also assumes the quotes are well enclosed, for example, "a, b, c". If there is an unenclosed quote, for example "a,b,c, an error will occur.

Default is False.
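To make the quoted-value behaviour concrete, here is a small, self-contained sketch of quote-aware splitting; it illustrates the behaviour described above and is not the plugin's actual parser:

import java.util.ArrayList;
import java.util.List;

// Illustration only: split a line on commas while keeping quoted content together.
public class QuotedSplitExample {
  static List<String> split(String line) {
    List<String> fields = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (char c : line.toCharArray()) {
      if (c == '"') {
        inQuotes = !inQuotes; // quote characters are trimmed from the value
      } else if (c == ',' && !inQuotes) {
        fields.add(current.toString().trim());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    fields.add(current.toString().trim());
    return fields;
  }

  public static void main(String[] args) {
    List<String> fields = split("1, \"a, b, c\"");
    System.out.println(fields.size() + " fields: " + fields); // 2 fields: [1, a, b, c]
  }
}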

Use First Row as Header (Macro Enabled: Yes; Version Introduced: 6.7.0/0.20.0)

Optional. Whether to use the first line of each file as the column header. Supported formats are 'text', 'csv', 'tsv', and 'delimited'.

Default is False.

Minimum Split Size (Macro Enabled: Yes; Version Introduced: 6.3.0/0.16.0)

Optional. Minimum size in bytes for each input partition. Smaller partitions will increase the level of parallelism, but will require more resources and overhead.

If the Format is blob, you cannot split the data.

Maximum Split Size (Macro Enabled: Yes)

Optional. Maximum size in bytes for each input partition. Smaller partitions will increase the level of parallelism, but will require more resources and overhead.

If the Format is blob, you cannot split the data.

Default is 128 MB.
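For example, with the default maximum split size of 128 MB, a single 1 GB object can be read as roughly eight parallel partitions.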

Regex Path Filter (Macro Enabled: Yes)

Optional. Regular expression that file paths must match in order to be included in the input. The full file path is compared, not just the filename. If no value is given, no file filtering will be done. For more information about regular expression syntax, see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html.
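Because the full path is compared, the expression usually has to account for the gs:// prefix and any directories, not just the file name. A quick check of a filter such as .*\.csv$ against hypothetical paths (the bucket and file names are made up):

import java.util.regex.Pattern;

// Illustration: the filter is matched against the full path, not just the filename.
public class RegexPathFilterExample {
  public static void main(String[] args) {
    Pattern filter = Pattern.compile(".*\\.csv$");
    System.out.println(filter.matcher("gs://my-bucket/path/to/directory/file1.csv").matches());  // true
    System.out.println(filter.matcher("gs://my-bucket/path/to/directory/file1.json").matches()); // false
  }
}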

Path Field (Macro Enabled: Yes)

Optional. Output field to place the path of the file that the record was read from. If not specified, the file path will not be included in output records. If specified, the field must exist in the output schema as a string.

Path Filename Only (Macro Enabled: Yes)

Optional. Whether to only use the filename instead of the URI of the file path when a path field is given.

Default is false.

Read Files Recursively (Macro Enabled: Yes)

Optional. Whether files are to be read recursively from the path.

Default is false.

Allow Empty Input (Macro Enabled: Yes; Version Introduced: 6.7.0/0.20.0)

Optional. Whether to allow an input path that contains no data. When set to false, the plugin will error when there is no data to read. When set to true, no error will be thrown and zero records will be read.

Default is False.

Data File Encrypted (Macro Enabled: Yes; Version Introduced: 6.1.x)

Optional. Whether files are encrypted. For more information, see "Data File Encrypted" below.

Default is false.

Encryption Metadata File Suffix (Macro Enabled: Yes; Version Introduced: 6.1.x)

Optional. The file name suffix for the encryption metadata file.

Default is .metadata.

File System Properties (Macro Enabled: Yes)

Optional. Additional properties to use with the InputFormat when reading the data.

File Encoding (Macro Enabled: Yes; Version Introduced: 6.3.0/0.16.0)

Optional. The character encoding for the file(s) to be read.

Default is UTF-8.

Output Schema (Macro Enabled: Yes)

If a Path Field is set, it must be present in the schema as a string.

Data File Encrypted

Whether files are encrypted. If it is set to true, files will be decrypted using the Streaming AEAD provided by the Google Tink library. Each data file needs to be accompanied by a metadata file that contains the cipher information. For example, an encrypted data file at gs://<bucket>/path/to/directory/file1.csv.enc must have a metadata file at gs://<bucket>/path/to/directory/file1.csv.enc.metadata.

The metadata file contains a JSON object with the following properties:

  • kms: The Cloud KMS URI that was used to encrypt the Data Encryption Key.

  • aad: The Base64 encoded Additional Authenticated Data used in the encryption.

  • keyset: A JSON object representing the serialized keyset information from the Tink library.

For example:

{
  "kms": "gcp-kms://projects/my-key-project/locations/us-west1/keyRings/my-key-ring/cryptoKeys/mykey",
  "aad": "73iT4SUJBM24umXecCCf3A==",
  "keyset": {
    "keysetInfo": {
      "primaryKeyId": 602257784,
      "keyInfo": [{
        "typeUrl": "type.googleapis.com/google.crypto.tink.AesGcmHkdfStreamingKey",
        "outputPrefixType": "RAW",
        "keyId": 602257784,
        "status": "ENABLED"
      }]
    },
    "encryptedKeyset": "CiQAz5HH+nUA0Zuqnz4LCnBEVTHS72s/zwjpcnAMIPGpW6kxLggSrAEAcJKHmXeg8kfJ3GD4GuFeWDZzgGn3tfolk6Yf5d7rxKxDEChIMWJWGhWlDHbBW5B9HqWfKx2nQWSC+zjM8FLefVtPYrdJ8n6Eg8ksAnSyXmhN5LoIj6az3XBugtXvCCotQHrBuyoDY+j5ZH9J4tm/bzrLEjCdWAc+oAlhsUAV77jZhowJr6EBiyVuRVfcwLwiscWkQ9J7jjHc7ih9HKfnqAZmQ6iWP36OMrEn"
  }
}
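As a rough sketch of how such a pair of files could be produced with Tink, the example below streams an encrypted copy of a local file and writes a KMS-encrypted keyset wrapped in the metadata layout shown above. The KMS URI, AAD value, key template, and file names are illustrative assumptions, and the plugin's exact expectations for the metadata contents should be verified against your environment:

import com.google.crypto.tink.Aead;
import com.google.crypto.tink.JsonKeysetWriter;
import com.google.crypto.tink.KeyTemplates;
import com.google.crypto.tink.KeysetHandle;
import com.google.crypto.tink.StreamingAead;
import com.google.crypto.tink.integration.gcpkms.GcpKmsClient;
import com.google.crypto.tink.streamingaead.StreamingAeadConfig;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

// Sketch only: produce file1.csv.enc and file1.csv.enc.metadata in the layout described above.
public class EncryptForGcsReader {
  public static void main(String[] args) throws Exception {
    StreamingAeadConfig.register();

    String kmsUri = "gcp-kms://projects/my-key-project/locations/us-west1/keyRings/my-key-ring/cryptoKeys/mykey";
    byte[] aad = "my additional authenticated data".getBytes(StandardCharsets.UTF_8);

    // Data Encryption Key used for streaming encryption of the data file.
    KeysetHandle dek = KeysetHandle.generateNew(KeyTemplates.get("AES256_GCM_HKDF_4KB"));
    StreamingAead streamingAead = dek.getPrimitive(StreamingAead.class);
    try (OutputStream out =
        streamingAead.newEncryptingStream(new FileOutputStream("file1.csv.enc"), aad)) {
      Files.copy(Paths.get("file1.csv"), out);
    }

    // Encrypt the DEK keyset with Cloud KMS; the resulting JSON becomes the "keyset" field.
    Aead kmsAead = new GcpKmsClient().withDefaultCredentials().getAead(kmsUri);
    ByteArrayOutputStream keysetJson = new ByteArrayOutputStream();
    dek.write(JsonKeysetWriter.withOutputStream(keysetJson), kmsAead);

    String metadata = "{\n"
        + "  \"kms\": \"" + kmsUri + "\",\n"
        + "  \"aad\": \"" + Base64.getEncoder().encodeToString(aad) + "\",\n"
        + "  \"keyset\": " + keysetJson.toString("UTF-8") + "\n"
        + "}";
    Files.write(Paths.get("file1.csv.enc.metadata"), metadata.getBytes(StandardCharsets.UTF_8));
  }
}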

