Amazon S3 Batch Source

Plugin version: 0.19.5

Use this source to read from Amazon S3. For example, you might read log files from S3 every hour and store them in a BigQuery table.

Configuration

Property

Macro Enabled?

Version Introduced

Description

Use Connection

No

6.5.0/0.15.0

Optional. Whether to use an existing connection. If you use a connection, connection-related properties do not appear in the plugin properties.

Browse Connections

Yes

6.5.0/0.15.0

Optional. Browse to find the connection.

Authentication Method



Yes

 

Optional. Authentication method to access S3. IAM can only be used if the plugin is run in an AWS environment, such as on EMR.

Default is Access Credentials.

Access ID

Yes

 

Optional. Amazon access ID required for authentication.

Access Key

Yes

 

Optional. Amazon access key required for authentication.

Session Token

Yes

6.7.0/1.7.0

Optional. Amazon session token required for authentication. Only required for temporary credentials. Temporary credentials are only supported for S3A paths.

Reference Name

No

 

Required. Name used to uniquely identify this source for lineage, annotating metadata, etc.

Path

Yes

 

Required. Path to read from. For example, s3a://<bucket>/path/to/input

Format

No

 

Required. Format of the data to read. The format must be one of ‘avro’, ‘blob’, ‘csv’, ‘delimited’, ‘json’, ‘parquet’, ‘text’, or ‘tsv’. The ‘blob’ format also requires a schema that contains a field named ‘body’ of type ‘bytes’. If the format is ‘text’, the schema must contain a field named ‘body’ of type ‘string’.

Default is text.
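
The 'body' field rule for the 'blob' and 'text' formats can be sketched as a small check. This is an illustration only, not the plugin's own validation code: the helper name check_body_field, the record names, and the Avro-style schema dictionaries are assumptions made for the example.

```python
# Hypothetical helper illustrating the documented 'body' field rule.
# Format name -> type that the 'body' field must have.
REQUIRED_BODY_TYPE = {"blob": "bytes", "text": "string"}


def check_body_field(fmt, schema):
    """Return True if `schema` satisfies the 'body' rule for format `fmt`."""
    required = REQUIRED_BODY_TYPE.get(fmt)
    if required is None:
        # Other formats have no 'body' rule in this sketch.
        return True
    return any(
        field["name"] == "body" and field["type"] == required
        for field in schema["fields"]
    )


# Assumed example schemas (record names are made up).
blob_schema = {
    "type": "record",
    "name": "blobRecord",
    "fields": [{"name": "body", "type": "bytes"}],
}
text_schema = {
    "type": "record",
    "name": "textRecord",
    "fields": [{"name": "body", "type": "string"}],
}
```

In this sketch, check_body_field("blob", blob_schema) passes, while check_body_field("text", blob_schema) does not, because 'text' needs a 'body' field of type 'string'.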

Delimiter

Yes

 

Optional. Delimiter to use when the format is ‘delimited’. This will be ignored for other formats.

Enable Quoted Values

Yes

6.7.0/0.17.0

Optional. Whether to treat content between quotes as a value. This value will only be used if the format is ‘csv’, ‘tsv’ or ‘delimited’. For example, if this is set to true, a line that looks like 1, "a, b, c" will output two fields. The first field will have 1 as its value and the second will have a, b, c as its value. The quote characters will be trimmed. The newline delimiter cannot be within quotes.

Quotes are assumed to be properly closed, for example "a, b, c". An unclosed quote, for example "a,b,c, causes an error.

Default value is False.
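
The quoting behavior described above can be illustrated with a minimal sketch. It is not the plugin's parser: the function name split_quoted is hypothetical, and trimming whitespace around each value is an assumption made so that the output matches the documented example.

```python
def split_quoted(line, delimiter=","):
    """Split `line` on `delimiter`, treating text inside double quotes as one value."""
    fields = []
    current = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            # Toggle quoted mode; the quote character itself is dropped.
            in_quotes = not in_quotes
        elif ch == delimiter and not in_quotes:
            # A delimiter outside quotes ends the current field.
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if in_quotes:
        # Mirrors the documented behavior: an unclosed quote is an error.
        raise ValueError("unclosed quote in line: " + line)
    fields.append("".join(current).strip())
    return fields


print(split_quoted('1, "a, b, c"'))  # ['1', 'a, b, c']
```

Calling split_quoted('"a,b,c') raises ValueError in this sketch, which corresponds to the unclosed-quote case.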

Use First Row as Header

Yes

6.7.0/0.17.0

Optional. Whether to use the first line of each file as the column headers. Supported formats are text, csv, tsv, and delimited.

Default is False.

Maximum Split Size

Yes

 

Optional. Maximum size in bytes for each input partition. Smaller partitions will increase the level of parallelism, but will require more resources and overhead.

Default is 128 MB.
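
As a rough worked example of how the split size relates to parallelism, the sketch below estimates the number of input partitions for a single file. It assumes the file is splittable and that 128 MB means 128 * 1024 * 1024 bytes; the actual number of splits depends on the file format and the underlying input format, so treat this only as an approximation. The function name estimate_splits is hypothetical.

```python
import math

# Assumption: the documented 128 MB default is read as 128 * 1024 * 1024 bytes.
DEFAULT_MAX_SPLIT_SIZE = 128 * 1024 * 1024


def estimate_splits(file_size_bytes, max_split_size=DEFAULT_MAX_SPLIT_SIZE):
    """Estimate how many partitions a splittable file of this size produces."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / max_split_size)


one_gib = 1024 ** 3
print(estimate_splits(one_gib))                    # 8
print(estimate_splits(one_gib, 64 * 1024 * 1024))  # 16
```

Halving the split size doubles the estimated number of partitions, which is the trade-off between parallelism and overhead noted above.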

Regex Path Filter

Yes

 

Optional. Regular expression that file paths must match in order to be included in the input. The full file path is compared, not just the filename. If no value is given, no file filtering is done. For more information about regular expression syntax, see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html.
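
Because the expression is compared against the full path, a pattern written for the filename alone will not select anything. The sketch below illustrates this with Python's re module, assuming whole-path (full match) semantics; the plugin itself uses Java regular expressions, whose syntax is the same for these simple patterns. The bucket name and the helper filter_paths are made up for the example.

```python
import re


def filter_paths(paths, pattern):
    """Keep only the paths that the pattern matches in full."""
    regex = re.compile(pattern)
    return [p for p in paths if regex.fullmatch(p)]


# Hypothetical input paths.
paths = [
    "s3a://my-bucket/logs/2020/app.csv",
    "s3a://my-bucket/logs/2020/app.json",
]

# Matches the whole path, so the CSV file is kept.
print(filter_paths(paths, r".*\.csv"))   # ['s3a://my-bucket/logs/2020/app.csv']

# Matches only the filename, so nothing is kept.
print(filter_paths(paths, r"app\.csv"))  # []
```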

Path Field

No

 

Optional. Output field to place the path of the file that the record was read from. If not specified, the file path will not be included in output records. If specified, the field must exist in the output schema as a string.

Path Filename Only

No

 

Optional. Whether to use only the filename, instead of the full URI of the file, as the value of the path field. Applies only when Path Field is specified.

Default is false.
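
The difference between the two settings can be sketched as follows. This is an illustration of the described behavior, not the plugin's code; the function name path_field_value and the example URI are assumptions.

```python
import posixpath
from urllib.parse import urlparse


def path_field_value(uri, filename_only=False):
    """Return the value written to the path field for a file at `uri`."""
    if filename_only:
        # Keep only the last component of the URI's path.
        return posixpath.basename(urlparse(uri).path)
    return uri


uri = "s3a://my-bucket/logs/2020/app.csv"  # hypothetical file
print(path_field_value(uri))                      # s3a://my-bucket/logs/2020/app.csv
print(path_field_value(uri, filename_only=True))  # app.csv
```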

Read Files Recursively

No

 

Optional. Whether files are to be read recursively from the path. Default is false.

Allow Empty Input

No

 

Optional. Whether to allow an input path that contains no data. When set to false, the plugin fails if there is no data to read. When set to true, no error is thrown and zero records are read.

Default is false.

Verify Credentials

No

6.9.0/1.19.5, 6.8.0/1.18.4

Optional. Whether to verify the access credentials. When false, validation succeeds even if the credentials are incorrect. When true, the credentials are checked and validation fails if they are incorrect. The default value is false.

File System Properties

Yes

 

Optional. Additional properties to use when reading from the filesystem. This is an advanced feature that requires knowledge of the properties supported by the underlying filesystem.
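
As a hedged illustration, the sketch below builds a set of filesystem properties as a JSON object of key-value pairs. That the property accepts JSON in this shape is an assumption, and the values are examples only; the fs.s3a.* keys are Hadoop S3A settings, so check the Hadoop S3A documentation for what your environment supports.

```python
import json

# Assumed example values; the keys are Hadoop S3A configuration properties.
file_system_properties = {
    "fs.s3a.connection.maximum": "50",
    "fs.s3a.endpoint": "s3.us-west-2.amazonaws.com",
}

# Serialize to the JSON text that would be supplied as the property value.
property_value = json.dumps(file_system_properties, indent=2)
print(property_value)
```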

File Encoding

Yes

6.3.0/1.13.0

Optional. The character encoding for the file(s) to be read. 

Default is UTF-8.

Output Schema

Yes

 

Required. The output schema for the data.

 

 

Created in 2020 by Google Inc.