Google Drive plugins
Introduction
Google Drive plugins help users move entire files from a source to a destination. Along the way, users can also run transformations on unstructured data such as images, audio, and video.
User Stories
As a pipeline developer, I want to move all files from a Google Drive directory to a different destination.
As a pipeline developer, I want to move all files from a Google Drive directory that satisfy a filter to a different destination.
As a pipeline developer, I want to pull all images from a Google Drive directory, so that I can process them using image recognition APIs.
As a pipeline developer, I want to pull all audio and video files from a Google Drive directory, so that I can process them to extract metadata, generate transcripts, or apply other enrichments.
As a pipeline developer, I want to move all files from an FTP source into Google Drive.
Plugin Type
Configurables
This section defines properties that are configurable for this plugin.
Source
Option level | User Facing Name | Type | Description | Optional | Constraints | Default value |
|---|---|---|---|---|---|---|
Basic | Directory identifier | String | Identifier of the source folder. | No | | |
Basic | File metadata properties | Multi-select | Properties to retrieve for each file in the directory. Allowed names are listed in Google Drive API: Files. | Yes | | |
Filtering | Filter | String | A filter applied to the files in the selected directory. Filters follow the Google Drive Filter Syntax. | Yes | | |
Filtering | Modification date range | Select | In addition to the filter specified above, only pull files that were modified within the selected date range. | Yes | | select |
Filtering | Start date | Textbox | Only shown when "Modification date range" is set to "Custom". Start of the modification date range. RFC 3339 format, default timezone is UTC, e.g., 2012-06-04T12:00:00-08:00. | No | | |
Filtering | End date | Textbox | Only shown when "Modification date range" is set to "Custom". End of the modification date range. RFC 3339 format, default timezone is UTC, e.g., 2012-06-04T12:00:00-08:00. | No | | |
Filtering | File types to pull | Multi-select | Types of files to pull from the specified directory. | Yes | | binary |
Authentication | Authentication type | Radio-group | Defines the authentication type. OAuth2 and Service account types are available. | No | | OAuth2 |
Authentication | Client ID | String | OAuth2 client ID. Shown only when "OAuth2" is selected for the "Authentication type" property. | Yes | | |
Authentication | Client secret | String | OAuth2 client secret. Shown only when "OAuth2" is selected for the "Authentication type" property. | Yes | | |
Authentication | Refresh token | String | OAuth2 refresh token. Shown only when "OAuth2" is selected for the "Authentication type" property. | Yes | | |
Authentication | Account file path | String | Path on the local file system of the user/service account key used for authorization. Shown only when "Service account" is selected for the "Authentication type" property. | Yes | | auto-detect |
Advanced | Maximum partition size | Number | Maximum partition size in bytes. The default value 0 means unlimited. | Yes | | 0 |
Advanced | Body output format | Radio-group | Format of the file body. "Bytes" and "String" values are available. | Yes | | bytes |
Exporting | Google Documents export format | Select | MIME type for Google Documents. Allowed values are listed in Downloading Google Documents. | Yes | | text/plain |
Exporting | Google Spreadsheets export format | Select | MIME type for Google Spreadsheets. | Yes | | text/csv |
Exporting | Google Drawings export format | Select | MIME type for Google Drawings. | Yes | | image/svg+xml |
Exporting | Google Presentations export format | Select | MIME type for Google Presentations. | Yes | | text/plain |
Sink
Option level | User Facing Name | Type | Description | Optional | Constraints | Default value |
|---|---|---|---|---|---|---|
Basic | File name field | String | Name of the schema field (of STRING type) used as the file name. If it is not set, files get randomly generated 16-symbol names. | Yes | | |
Basic | File body field | String | Name of the schema field (of BYTES type) used as the file body. A minimal input schema contains only this field. | No | | |
Basic | File mime field | String | Name of the schema field (of STRING type) used as the MIME type of the file. | Yes | | |
Basic | Directory identifier | String | Identifier of the destination folder. | No | | |
Authentication | Authentication type | Radio-group | Defines the authentication type. OAuth2 and Service account types are available. | No | | OAuth2 |
Authentication | Client ID | String | OAuth2 client ID. Shown only when "OAuth2" is selected for the "Authentication type" property. | Yes | | |
Authentication | Client secret | String | OAuth2 client secret. Shown only when "OAuth2" is selected for the "Authentication type" property. | Yes | | |
Authentication | Refresh token | String | OAuth2 refresh token. Shown only when "OAuth2" is selected for the "Authentication type" property. | Yes | | |
Authentication | Account file path | String | Path on the local file system of the user/service account key used for authorization. Shown only when "Service account" is selected for the "Authentication type" property. | Yes | | auto-detect |
Design / Implementation Tips
Tip #1
Tip #2
Design
Approach(es)
Source
The Google Drive Source plugin reads files from the specified Google Drive folder via the Drive API.
The output schema is fully defined by the plugin settings provided by the user. Two fields are mandatory: body and offset. All other fields depend on the file properties selected in the File properties property.
The body format is defined by the Body output format property and can be either BYTES or STRING. The offset field gives the byte position in the original file at which the body starts.
The plugin can limit the maximum body size per partition through the Maximum partition size property. With the default value 0, the plugin processes entire files without partitioning, and the offset field is always 0.
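The relationship between Maximum partition size, body, and offset can be illustrated with a minimal Python sketch. The function name `partition_body` and its signature are hypothetical and not part of the plugin; the sketch only assumes that each partition carries a contiguous slice of the file together with the byte position where that slice starts.

```python
from typing import Iterator, Tuple


def partition_body(body: bytes, max_partition_size: int = 0) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, chunk) pairs for one file body.

    A max_partition_size of 0 means "unlimited": the whole body is
    emitted as a single record whose offset is 0.
    """
    if max_partition_size < 0:
        raise ValueError("max_partition_size must be >= 0")
    if max_partition_size == 0 or len(body) <= max_partition_size:
        yield 0, body
        return
    # Walk the body in fixed-size steps; the last chunk may be shorter.
    for offset in range(0, len(body), max_partition_size):
        yield offset, body[offset:offset + max_partition_size]


# A 10-byte file split with a 4-byte limit gives three partitions.
parts = list(partition_body(b"abcdefghij", max_partition_size=4))
whole = list(partition_body(b"abcdefghij"))
```

With the default limit, `whole` contains a single record at offset 0, which matches the unpartitioned behavior described above.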
Example of the output schema when the user selects the id, name, mimeType, description, createdTime, modifiedTime, size, imageMediaMetadata.width, imageMediaMetadata.height and imageMediaMetadata.rotation fields in the File properties property:
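The sketch below shows what such a schema could look like, written as an Avro-style record built in Python. The field types follow the File properties table in this document; the record names, the representation of imageMediaMetadata as a nested record, and the use of long for offset are assumptions made for illustration, and nullability is left out because it is not specified here.

```python
import json

# Timestamps are listed as "timestamp milliseconds" in the File properties table.
TIMESTAMP_MILLIS = {"type": "long", "logicalType": "timestamp-millis"}

# Assumption: dotted property names map to a nested record.
image_media_metadata = {
    "type": "record",
    "name": "ImageMediaMetadata",  # hypothetical record name
    "fields": [
        {"name": "width", "type": "int"},
        {"name": "height", "type": "int"},
        {"name": "rotation", "type": "int"},
    ],
}

output_schema = {
    "type": "record",
    "name": "GoogleDriveFile",  # hypothetical record name
    "fields": [
        # Mandatory fields.
        {"name": "body", "type": "bytes"},
        {"name": "offset", "type": "long"},  # assumption: long
        # Fields selected in File properties.
        {"name": "id", "type": "string"},
        {"name": "name", "type": "string"},
        {"name": "mimeType", "type": "string"},
        {"name": "description", "type": "string"},
        {"name": "createdTime", "type": TIMESTAMP_MILLIS},
        {"name": "modifiedTime", "type": TIMESTAMP_MILLIS},
        {"name": "size", "type": "long"},
        {"name": "imageMediaMetadata", "type": image_media_metadata},
    ],
}

field_names = [field["name"] for field in output_schema["fields"]]
print(json.dumps(output_schema, indent=2))
```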
Sink
The Google Drive Sink plugin writes received data as files to the specified Google Drive folder via the Drive API.
The sink writes a separate file for each received partition. The only required input field in the schema is the file body. The name of the schema field that holds the file body is specified with the File body field property. Only the BYTES format is supported for the body input field.
There are similar properties for the file name and MIME type. Both require the STRING type. If the name field is not specified, the sink generates random 16-symbol names. If the MIME type field is not set, Google Drive tries to detect the type automatically.
The sink plugin does not support writing partitioned files.
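The mapping from an input record to a file can be sketched as follows. This is an illustration only: `record_to_file`, `random_file_name`, and the alphabet used for generated names are hypothetical, and the actual upload call to the Drive API is deliberately left out.

```python
import secrets
import string
from typing import Any, Dict, Mapping, Optional

# Assumption: generated names use ASCII letters and digits.
NAME_ALPHABET = string.ascii_letters + string.digits


def random_file_name(length: int = 16) -> str:
    """Return a random name of the given length (16 symbols by default)."""
    return "".join(secrets.choice(NAME_ALPHABET) for _ in range(length))


def record_to_file(
    record: Mapping[str, Any],
    body_field: str,
    name_field: Optional[str] = None,
    mime_field: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a file description from one input record.

    Only the body field is required and it must hold bytes. A missing
    name field yields a random name; a missing MIME field yields None,
    leaving detection to Google Drive.
    """
    body = record[body_field]
    if not isinstance(body, (bytes, bytearray)):
        raise TypeError("the body field must be of BYTES type")
    name = record[name_field] if name_field else random_file_name()
    mime_type = record[mime_field] if mime_field else None
    return {"name": name, "mimeType": mime_type, "body": bytes(body)}


minimal = record_to_file({"content": b"hi"}, body_field="content")
full = record_to_file(
    {"content": b"hi", "title": "note.txt", "type": "text/plain"},
    body_field="content",
    name_field="title",
    mime_field="type",
)
```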
Properties
Modification date range
Filters files by last modified date. Available values:
None - files will not be filtered.
Last 7 days - from 7 days before the current date and time until the current date and time, for example: from "2019-09-12T19:52:13.456" to "2019-09-19T19:52:13.456".
Last 30 days - from 30 days before the current date and time until the current date and time, for example: from "2019-08-20T19:52:13.456" to "2019-09-19T19:52:13.456".
Previous quarter - from the start of the previous quarter until the end of the previous quarter, for example: from "2019-04-01T00:00:00.000" to "2019-06-30T23:59:59.999".
Current quarter - from the start of the current quarter until the current date and time, for example: from "2019-07-01T00:00:00.000" to "2019-09-19T19:52:13.456".
Last year - from the start of the previous year until the end of the previous year, for example: from "2018-01-01T00:00:00.000" to "2018-12-31T23:59:59.999".
Current year - from the start of the current year until the current date and time, for example: from "2019-01-01T00:00:00.000" to "2019-09-19T19:52:13.456".
Custom - the user enters the start and end dates manually.
This filter is independent of the Filter property, so the user can set Modification date range to a value that conflicts with Filter.
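The predefined ranges above can be computed as in the following Python sketch. The function names are hypothetical, and the sketch assumes naive timestamps with millisecond precision as in the examples; "Custom" is not handled because its dates come directly from the user.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

ONE_MS = timedelta(milliseconds=1)


def quarter_start(moment: datetime) -> datetime:
    """Return midnight on the first day of the quarter containing moment."""
    first_month = 3 * ((moment.month - 1) // 3) + 1
    return datetime(moment.year, first_month, 1)


def modification_range(option: str, now: datetime) -> Optional[Tuple[datetime, datetime]]:
    """Map a Modification date range option to (start, end), or None for no filtering."""
    if option == "None":
        return None
    if option == "Last 7 days":
        return now - timedelta(days=7), now
    if option == "Last 30 days":
        return now - timedelta(days=30), now
    if option == "Current quarter":
        return quarter_start(now), now
    if option == "Previous quarter":
        current_start = quarter_start(now)
        # Step back one day to land inside the previous quarter.
        previous_start = quarter_start(current_start - timedelta(days=1))
        return previous_start, current_start - ONE_MS
    if option == "Current year":
        return datetime(now.year, 1, 1), now
    if option == "Last year":
        return datetime(now.year - 1, 1, 1), datetime(now.year, 1, 1) - ONE_MS
    raise ValueError(f"Unsupported option: {option}")


def iso(date_range: Tuple[datetime, datetime]) -> Tuple[str, str]:
    """Format both ends with millisecond precision, as in the examples."""
    start, end = date_range
    return start.isoformat(timespec="milliseconds"), end.isoformat(timespec="milliseconds")


now = datetime(2019, 9, 19, 19, 52, 13, 456000)
previous_quarter = iso(modification_range("Previous quarter", now))
last_year = iso(modification_range("Last year", now))
```

Passing `now` explicitly keeps the calculation deterministic, which makes the ranges easy to check against the examples in the list.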
File properties
The user can select which file metadata provided by the Google Drive API to include in the output. Not all properties are currently available. Descriptions are taken from the Files overview.
Property name | Type | Description |
|---|---|---|
id | string | The ID of the file. |
name | string | The name of the file. This is not necessarily unique within a folder. |
mimeType | string | The MIME type of the file. Google Drive will attempt to automatically detect an appropriate value from uploaded content if no value is provided. |
description | string | A short description of the file. |
starred | boolean | Whether the user has starred the file. |
trashed | boolean | Whether the file has been trashed, either explicitly or from a trashed parent folder. Only the owner may trash a file, and other users cannot see files in the owner's trash. |
explicitlyTrashed | boolean | Whether the file has been explicitly trashed, as opposed to recursively trashed from a parent folder. |
trashedTime | timestamp milliseconds | The time that the item was trashed (RFC 3339 date-time). Only populated for items in shared drives. |
parents | array of strings | The IDs of the parent folders which contain the file. |
properties | record of key-value strings | A collection of arbitrary key-value pairs which are visible to all apps. |
spaces | string | The list of spaces which contain the file. The currently supported values are 'drive', 'appDataFolder' and 'photos'. |
createdTime | timestamp milliseconds | The time at which the file was created (RFC 3339 date-time). |
modifiedTime | timestamp milliseconds | The last time the file was modified by anyone (RFC 3339 date-time). |
driveId | | ID of the shared drive the file resides in. Only populated for items in shared drives. |
originalFilename | string | The original filename of the uploaded content if available, or else the original value of the name field. |
fullFileExtension | string | The full file extension extracted from the name field. |
md5Checksum | string | The MD5 checksum for the content of the file. This is only applicable to files with binary content in Google Drive. |
size | long | The size of the file's content in bytes. This is only applicable to files with binary content in Google Drive. |
imageMediaMetadata.width | int | The width of the image in pixels. |
imageMediaMetadata.height | int | The height of the image in pixels. |
imageMediaMetadata.rotation | int | The rotation in clockwise degrees from the image's original orientation. |