Google Sheets Batch Source

The Google Sheets batch source is available in the Hub.

Plugin version: 1.4.2

Reads spreadsheets from specified Google Drive directory via Google Sheets API.

Configuration

Property

Macro Enabled?

Version Introduced

Description

Property

Macro Enabled?

Version Introduced

Description

Reference Name

No

 

Required. Name used to uniquely identify this source for lineage, annotating metadata, etc.

Directory Identifier

No

 

Required. Identifier of the source folder.

This comes after folders/ in the URL. For example, if the URL is

https://drive.google.com/drive/folders/1dyUEebJaFnWa3Z4n0BFMVAXQ7mfUH11g?resourcekey=0-XVijrJSp3E3gkdJp20MpCQ

Then the Directory Identifier would be 1dyUEebJaFnWa3Z4n0BFMVAXQ7mfUH11g.

File Identifier

Yes

 

Identifier of the spreadsheet file.

This comes after spreadsheets/d/ in the URL. For example, if the URL is

https://docs.google.com/spreadsheets/d/17W3vOhBwe0i24OdVNsbz8rAMClzUitKeAbumTqWFrkows

Then the File Identifier would be 17W3vOhBwe0i24OdVNsbz8rAMClzUitKeAbumTqWFrkows.
Note: Either Directory Identifier or File Identifier should have a value.

Sheets To Pull

Yes

 

Required. Filter that specifies set of sheets to process. For ‘numbers’ or ‘titles’ selections, user can populate specific values in the Sheets identifiers field.

Default is all.

Modification Date Range

No

 

Optional. Filter that narrows set of files by modified date range. User can select either among predefined or custom entered ranges. For Custom selection, the dates range can be specified via Start date and End date.

Default is lifetime.

Start Date

No

 

Start date for custom modification date range. Is shown only when Custom range is selected for Modification date range field. RFC3339 format, default timezone is UTC, e.g., 2012-06-04T12:00:00-08:00.

End Date

No

 

End date for custom modification date range. Is shown only when Custom range is selected for Modification date range field. RFC3339 format, default timezone is UTC, e.g., 2012-06-04T12:00:00-08:00.

Sheets Identifiers

Yes

 

Optional. Set of sheets' numbers/titles to process. Is shown only when ‘titles’ or ‘numbers’ are selected for the Sheets to pull field.

Authentication Type

No

 

Required. Type of authentication used to access Google API.

OAuth2 and Service Account types are available.

Make sure that:

  • Google Drive API and Google Sheets API is enabled in the GCP Project.

  • Google Drive Folder is shared to the service account email used with the required permission.

OAuth2 client credentials can be generated on Google Cloud Credentials Page.

For more information about OAuth2, see Google Drive API Documentation.

Default is OAuth2.

OAuth2: Client ID

Yes

 

OAuth2 Client ID used to identify the application.

OAuth2: Client Secret

Yes

 

OAuth2 Client Secret used to access the authorization server.

OAuth2: Refresh Token

Yes

 

OAuth2 refresh token for acquiring new access tokens.

For more information, see the Google Drive API.

OAuth2: Access Token

 

 

Short lived access token used for connecting.

Service Account Type

Yes

6.3.0/1.3.0

Optional. Select one of the following options:

  • File Path. File path where the service account is located.

  • JSON. JSON content of the service account.

Make sure that the Google Drive Folder is shared with the specified service account email. Viewer role must be granted to the specified service account to read files from the Google Drive Folder.

Service Account File Path

Yes

 

Optional. Path on the local file system of the service account key used for authorization.

Can be set to 'auto-detect' when running on a Dataproc cluster which needs to be created with the following scopes:

When running on other clusters, the file must be present on every node in the cluster.

Default is auto-detect.

Service Account JSON

Yes

1.4.0

Optional. Contents of the service account JSON file. Service Account JSON can be generated on Google Cloud Service Account page.

Extract Metadata

Yes

 

Required. Field to enable metadata extraction. Metadata extraction is useful when user wants to specify a header or a footer for a sheet. The rows in headers and footers are not available as data records. Instead, they are available in every record as a field called metadata, which is a record of the specified metadata.

Default is No.

Metadata Field Name

Yes

 

Required. Name of the record with metadata content. It is needed to distinguish metadata record from possible column with the same name.

Default is No.

First Header Row Index

Yes

 

Optional. Row number of the first row to be treated as header.

Last Header Row Index

Yes

 

Optional. Row number of the last row to be treated as header.

First Footer Row Index

Yes

 

Optional. Row number of the first row to be treated as footer.

Last Footer Row Index

Yes

 

Optional. Row number of the last row to be treated as footer.

Metadata Cells

Yes

 

Optional. Set of the cells for key-value pairs to extract as metadata from the specified metadata sections. Only shown if Extract metadata is set to true. The cell numbers should be within the header or footer.

Text Formatting

Yes

 

Required. Output format for numeric sheet cells. In ‘Formatted values’ case the value will contain appropriate format of source cell e.g. ‘1.23$’, ‘123%’.” For ‘Values only’ only number value will be returned.

Default is Values only.

Skip Empty Data

Yes

 

Required. Field that allows skipping of empty structure records.

Default is No.

Add spreadsheet/sheet name fields

Yes

 

Optional. Toggle that defines if the source extends output schema with spreadsheet and sheet names.

Default is No.

Spreadsheet field name

Yes

 

Optional. Schema field name for spreadsheet name.

Sheet field name

Yes

 

Optional. Schema field name for sheet name.

Column Names Selection

Yes

 

Required. Source for column names. User can specify where from the plugin should get schema filed names. Are available following values: No column names - default sheet column names will be used (‘A’, ‘B’ etc.), Treat first row as column names - the plugin uses first row for schema defining and field names, Custom row as column names - as previous, but for custom row index.

Default is Treat first row as column names.

Custom Row Index For Column Names

Yes

 

Optional. Number of the row to be treated as a header. Only shown when the ‘Column Names Selection’ field is set to ‘Custom row as column names’ header.

Default is 1.

Last Data Column Index

Yes

 

Optional. Last column plugin will read as data.

Default is 26.

Last Data Row Index

Yes

 

Optional. Last row plugin will read as data.

Default is 1000.

Read Buffer Size

Yes

 

Optional. Number of rows the source reads with a single API request.

Default is 100.

Steps to Generate OAuth2 Credentials

  1. Create credentials for the Client ID and Client Secret properties here.

  2. On the Create OAuth client ID page, under Authorized redirect URIs, specify a URI of http://localhost:8080. This is just to generate the refresh token.

  3. Click Create. The OAuth client is created. For more information, see this doc.

  4. Copy the Client ID and Client Secret to the plugin properties.

  5. To get the Refresh Token, follow these steps:

    1. Authenticate and authorize with the Google Auth server to get an authorization code.

    2. Use that authorization code with the Google Token server to get a refresh token that the plugin will use to get future access tokens.

    To get the authorization code, you can copy the URL below, change to use your client_id, and then open that URL in a browser window.

    https://accounts.google.com/o/oauth2/v2/auth? scope=https%3A//www.googleapis.com/auth/drive%20https://www.googleapis.com/auth/spreadsheets.readonly& access_type=offline& include_granted_scopes=true& response_type=code& state=state_parameter_passthrough_value& redirect_uri=http%3A//localhost:8080& client_id=199375159079-st8toco9pfu1qi5b45fkj59unc5th2v1.apps.googleusercontent.com

    This will prompt you to login, authorize this client for specified scopes, and then redirect you to http://localhost:8080. It will look like an error page, but notice that the URL of the error page redirected to include the code. In a normal web application, that is how the authorization code is returned to the requesting web application.

    For example, URL of the page will be something like

    Here, code=4/0AX4XfWi6PsiJiPO4MjltrcD6uoRgwci-HX16aL1-Ax-tgqYgC47NnjtCCKRoVzv46m8aJw.

    Note: If you see an error like this Authorization Error — Error 400: admin_policy_enforced, then the GCP User’s organization has a policy that restricts you from using Client IDs for third party products. In that case, they’ll need to get that restriction lifted, or use a different GCP user in a different org.

    With that authorization code, you can now call the Google Token server to get the access token and the refresh token in the response. Set the code, client_id, and client_secret in the curl command below and run it in a Cloud Shell terminal.

  6. Now, you will have your refresh_token, which is the last OAuth 2.0 property that the Google Sheets Batch Source needs to authorize with the Google Drive and Google Sheets API.

Created in 2020 by Google Inc.