Google Drive Batch Source
The Google Drive Batch Source plugin is available in the Hub.
Plugin version: 1.4.1
Reads a fileset from a specified Google Drive directory via the Google Drive API.
Configuration
Property | Macro Enabled? | Version Introduced | Description |
---|---|---|---|
Reference Name | No | | Required. Name used to uniquely identify this source for lineage, annotating metadata, etc. |
Directory Identifier | No | | Required. Identifier of the source folder. It is the path segment that follows https://drive.google.com/drive/folders/ in the folder URL. For example, in https://drive.google.com/drive/folders/1dyUEebJaFnWa3Z4n0BFMVAXQ7mfUH11g?resourcekey=0-XVijrJSp3E3gkdJp20MpCQ the Directory Identifier is the segment between folders/ and the ? character. |
File Metadata Properties | Yes | | Optional. Properties that represent file metadata. They are included in the output structured record. Property descriptions can be viewed in the Drive API file reference. |
Filter | No | | Optional. Filter applied to the files in the selected directory. Filters follow the Google Drive filters syntax. |
Modification Date Range | No | | Required. Filter that narrows the set of files by modification date range. You can select a predefined range or enter a custom one. For a custom range, specify the dates via Start Date and End Date. Default is lifetime. |
File Types To Pull | Yes | | Required. Types of files to pull from the specified directory. Supported values: binary (all non-Google Drive formats), Google Documents, Google Spreadsheets, Google Drawings, Google Presentations, and Google Apps Scripts. For Google Drive formats, specify the export format in the Exporting section. Default is Binary. |
Authentication Type | No | | Required. Type of authentication used to access the Google API. OAuth2 and Service Account types are available. OAuth2 client credentials can be generated on the Google Cloud Credentials page. For more details on OAuth2, see the Google Drive API documentation. Default is OAuth2. |
Client ID | No | | Optional. OAuth2 Client ID used to identify the application. |
Client Secret | No | | Optional. OAuth2 Client Secret used to access the authorization server. |
Refresh Token | No | | Optional. OAuth2 Refresh Token used to acquire new access tokens. |
Service Account Type | Yes | | Optional. Make sure that the Google Drive folder is shared with the specified service account email. |
Service Account File Path | Yes | | Optional. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster which needs to be created with the following scopes: When running on other clusters, the file must be present on every node in the cluster. Default is |
Service Account JSON | Yes | | Optional. Contents of the service account JSON file. The service account JSON can be generated on the Google Cloud Service Account page. |
Maximum Partition Size | Yes | | Required. Maximum body size of each structured record, in bytes. A value of 0 means unlimited. Not applicable to files in Google formats. Default is 0. |
Body Output Format | Yes | | Required. Output format for the file body. “Bytes” and “String” values are available. Default is Bytes. |
Google Documents Export Format | Yes | | Required. MIME type used for Google Documents when they are converted to structured records. Default is text/plain. |
Google Spreadsheets Export Format | Yes | | Required. MIME type used for Google Spreadsheets when they are converted to structured records. Default is text/csv. |
Google Drawings Export Format | Yes | | Required. MIME type used for Google Drawings when they are converted to structured records. Default is image/svg+xml. |
Google Presentations Export Format | Yes | | Required. MIME type used for Google Presentations when they are converted to structured records. Default is text/plain. |
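As an illustration of the Directory Identifier property, the following minimal Python sketch pulls the folder identifier out of a Drive folder URL. The helper name `directory_identifier` is hypothetical and not part of the plugin; the sketch only assumes that the identifier is the path segment directly after `folders/`, as described in the table.

```python
from urllib.parse import urlparse


def directory_identifier(folder_url: str) -> str:
    """Return the path segment that follows 'folders' in a Drive folder URL.

    Hypothetical helper for illustration; the plugin itself expects the
    identifier to be entered directly.
    """
    # Split the URL path into non-empty segments; the query string
    # (for example ?resourcekey=...) is not part of the path.
    segments = [s for s in urlparse(folder_url).path.split("/") if s]
    if "folders" not in segments:
        raise ValueError(f"not a Drive folder URL: {folder_url}")
    position = segments.index("folders") + 1
    if position >= len(segments):
        raise ValueError(f"no identifier after 'folders' in: {folder_url}")
    return segments[position]


url = (
    "https://drive.google.com/drive/folders/"
    "1dyUEebJaFnWa3Z4n0BFMVAXQ7mfUH11g?resourcekey=0-XVijrJSp3E3gkdJp20MpCQ"
)
print(directory_identifier(url))
```

The value printed for a folder URL is what would go into the Directory Identifier property.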
Steps to Generate OAuth2 Credentials
1. Create credentials for the Client ID and Client Secret properties on the Google Cloud Credentials page.
2. On the Create OAuth client ID page, under Authorized redirect URIs, specify a URI of http://localhost:8080. This URI is used only to generate the refresh token.
3. Click Create. The OAuth client is created. For more information, see this doc.
4. Copy the Client ID and Client Secret to the plugin properties.
To get the Refresh Token, follow these steps:
1. Authenticate and authorize with the Google Auth server to get an authorization code.
2. Use that authorization code with the Google Token server to get a refresh token that the plugin will use to get future access tokens.
To get the authorization code, copy the URL below, replace the client_id value with your own, and then open that URL in a browser window.
https://accounts.google.com/o/oauth2/v2/auth?scope=https%3A//www.googleapis.com/auth/drive.readonly&access_type=offline&include_granted_scopes=true&response_type=code&state=state_parameter_passthrough_value&redirect_uri=http%3A//localhost:8080&client_id=199375159079-st8toco9pfu1qi5b45fkj59unc5th2v1.apps.googleusercontent.com
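Rather than editing the URL by hand, the same authorization URL can be assembled programmatically. The sketch below is only an illustration: the function name `build_auth_url` and the placeholder client ID are assumptions, while the endpoint and query parameters are the ones shown in the URL above. Note that `urlencode` percent-encodes `:` and `/` as well, which is an equivalent encoding of the same values.

```python
from urllib.parse import urlencode

AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"


def build_auth_url(client_id: str,
                   redirect_uri: str = "http://localhost:8080") -> str:
    """Build the browser URL that starts the OAuth2 authorization flow."""
    params = {
        "scope": "https://www.googleapis.com/auth/drive.readonly",
        "access_type": "offline",  # ask for a refresh token
        "include_granted_scopes": "true",
        "response_type": "code",   # return an authorization code
        "state": "state_parameter_passthrough_value",
        "redirect_uri": redirect_uri,
        "client_id": client_id,
    }
    return f"{AUTH_ENDPOINT}?{urlencode(params)}"


# Placeholder client ID; substitute the one created for your project.
print(build_auth_url("YOUR_CLIENT_ID.apps.googleusercontent.com"))
```

The redirect URI passed here must match the one registered under Authorized redirect URIs.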
This prompts you to log in and authorize this client for the specified scopes, and then redirects you to http://localhost:8080. The result looks like an error page, but the URL you were redirected to includes the code. In a normal web application, that is how the authorization code is returned to the requesting web application. For example, the URL of the page will be something like:
http://localhost:8080/?state=state_parameter_passthrough_value&code=4/0AX4XfWi6PsiJiPO4MjltrcD6uoRgwci-HX16aL1-Ax-tgqYgC47NnjtCCKRoVzv46m8aJw&scope=https://www.googleapis.com/auth/drive
Here, code=4/0AX4XfWi6PsiJiPO4MjltrcD6uoRgwci-HX16aL1-Ax-tgqYgC47NnjtCCKRoVzv46m8aJw.
Note: If you see an error like "Authorization Error — Error 400: admin_policy_enforced", then the GCP user's organization has a policy that restricts the use of Client IDs for third-party products. In that case, either the restriction needs to be lifted or a different GCP user in a different organization must be used.
With that authorization code, you can now call the Google Token server to get the access token and the refresh token in the response. Set the code, client_id, and client_secret in the curl command below and run it in a Cloud Shell terminal.
curl -X POST -d "code=4/0AX4XfWjgRdrWXuNxqXOOtw_9THZlwomweFrzcoHMBbTFkrKLMvo8twSXdGT9JramIYq86w&client_id=199375159079-st8toco9pfu1qi5b45fkj59unc5th2v1.apps.googleusercontent.com&client_secret=q2zQ-vc3wG5iF5twSwBQkn68&redirect_uri=http%3A//localhost:8080&grant_type=authorization_code&access_type=offline" \
  https://oauth2.googleapis.com/token
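The two manual steps above, reading the code out of the redirect URL and filling in the curl form fields, can be sketched in Python as follows. This is a hedged illustration rather than part of the plugin: `extract_code` and `token_request_body` are hypothetical helper names, the credentials are placeholders, and the sketch only builds the request body without sending it.

```python
from urllib.parse import parse_qs, urlencode, urlparse

TOKEN_ENDPOINT = "https://oauth2.googleapis.com/token"


def extract_code(redirect_url: str) -> str:
    """Read the authorization code from the URL the browser was redirected to."""
    query = parse_qs(urlparse(redirect_url).query)
    if "error" in query:
        raise ValueError(f"authorization failed: {query['error'][0]}")
    if "code" not in query:
        raise ValueError("redirect URL contains no code parameter")
    return query["code"][0]


def token_request_body(code: str, client_id: str, client_secret: str,
                       redirect_uri: str = "http://localhost:8080") -> bytes:
    """Build the form-encoded body, mirroring the fields of the curl command."""
    fields = {
        "code": code,
        "client_id": client_id,
        "client_secret": client_secret,
        "redirect_uri": redirect_uri,
        "grant_type": "authorization_code",
        "access_type": "offline",
    }
    return urlencode(fields).encode("ascii")


redirect = (
    "http://localhost:8080/?state=state_parameter_passthrough_value"
    "&code=4/0AX4XfWi6PsiJiPO4MjltrcD6uoRgwci-HX16aL1-Ax-tgqYgC47NnjtCCKRoVzv46m8aJw"
    "&scope=https://www.googleapis.com/auth/drive"
)
code = extract_code(redirect)
# Placeholder credentials; use the values created for your OAuth client.
body = token_request_body(code, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(code)
print(body.decode("ascii"))
```

The resulting body could then be sent as a POST to `TOKEN_ENDPOINT`, for example with `urllib.request`, which is equivalent to the curl call. The authorization code is short-lived and single-use, so the exchange should follow the browser step promptly.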
You now have your refresh_token, which is the last OAuth 2.0 property that the Google Drive Batch Source needs to authorize with the Google Drive API.
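The token server replies with a JSON document, so the refresh token can be picked out with a few lines of Python. The sketch below is a minimal illustration: the function name `refresh_token_from_response` is hypothetical and the token values in the sample response are made-up placeholders, not real credentials. The field names follow the standard OAuth 2.0 token response.

```python
import json


def refresh_token_from_response(response_text: str) -> str:
    """Extract the refresh_token field from a token endpoint JSON response."""
    payload = json.loads(response_text)
    if "error" in payload:
        # Error responses carry an "error" field, e.g. "invalid_grant".
        raise ValueError(f"token request failed: {payload['error']}")
    token = payload.get("refresh_token")
    if not token:
        raise ValueError("response contains no refresh_token")
    return token


# Placeholder response for illustration only; real values will differ.
sample = json.dumps({
    "access_token": "EXAMPLE_ACCESS_TOKEN",
    "expires_in": 3599,
    "refresh_token": "EXAMPLE_REFRESH_TOKEN",
    "scope": "https://www.googleapis.com/auth/drive.readonly",
    "token_type": "Bearer",
})
print(refresh_token_from_response(sample))
```

The extracted value is what goes into the Refresh Token plugin property.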
Created in 2020 by Google Inc.