Google Drive plugins

Introduction

Google Drive plugins help users move entire files from a source to a destination. Along the way, users can also run transformations on unstructured data such as images, audio and video.

User Stories

  • As a pipeline developer, I want to move all files from a Google Drive directory to a different destination.

  • As a pipeline developer, I want to move all files from a Google Drive directory that satisfy a filter to a different destination.

  • As a pipeline developer, I want to pull all images from a Google Drive directory, so that I can process them using image recognition APIs.

  • As a pipeline developer, I want to pull all audio and video files from a Google Drive directory, so that I can process them to extract metadata, generate transcripts, or apply other enrichments.

  • As a pipeline developer, I want to move all files from an FTP source into Google Drive.

Plugin Type

Batch Source
Batch Sink 
Real-time Source
Real-time Sink
Action
Post-Run Action
Aggregate
Join
Spark Model
Spark Compute

Configurables

This section defines properties that are configurable for this plugin. 

Source

| Option level | User Facing Name | Type | Description | Optional | Constraints | Default value |
|---|---|---|---|---|---|---|
| Basic | Directory identifier | String | Identifier of the source folder. | No | | |
| Basic | File metadata properties | Multi-select | Properties to retrieve for each file in the directory. Allowed names are listed in the Google Drive API Files reference. | Yes | | |
| Filtering | Filter | String | A filter applied to the files in the selected directory. Filters follow the Google Drive filter syntax. | Yes | | |
| Filtering | Modification date range | Select | In addition to the filter specified above, pulls only files that were modified within the selected date range. | Yes | | select |
| Filtering | Start Date | textbox | Shown only when "Modification date range" is set to "Custom". Start date of the modification date range in RFC 3339 format; the default timezone is UTC. Example: 2012-06-04T12:00:00-08:00. | No | | |
| Filtering | End date | textbox | Shown only when "Modification date range" is set to "Custom". End date of the modification date range in RFC 3339 format; the default timezone is UTC. Example: 2012-06-04T12:00:00-08:00. | No | | |
| Filtering | File types to pull | Multi-select | Types of files to pull from the specified directory. | Yes | | binary |
| Authentication | Authentication type | Radio-group | Defines the authentication type. OAuth2 and Service account types are available. | No | | OAuth2 |
| Authentication | Client ID | String | OAuth2 client ID. Shown only when 'OAuth2' is selected for the 'Authentication type' property. | Yes | | |
| Authentication | Client secret | String | OAuth2 client secret. Shown only when 'OAuth2' is selected for the 'Authentication type' property. | Yes | | |
| Authentication | Refresh token | String | OAuth2 refresh token. Shown only when 'OAuth2' is selected for the 'Authentication type' property. | Yes | | |
| Authentication | Account file path | String | Path on the local file system of the user/service account key used for authorization. Shown only when 'Service account' is selected for the 'Authentication type' property. Can be set to 'auto-detect' when running on a Dataproc cluster; the plugin then uses the value of the environment variable "GOOGLE_APPLICATION_CREDENTIALS". When running on other clusters, the file must be present on every node in the cluster. The service account JSON can be generated on the Google Cloud Service Account page. | Yes | | auto-detect |
| Advanced | Maximum partition size | Number | Maximum partition size in bytes. The default value 0 means unlimited. | Yes | | 0 |
| Advanced | Body output format | Radio-group | Format of the file body. "Bytes" and "String" values are available. | Yes | | bytes |
| Exporting | Google Documents export format | Select | MIME type used to export Google Documents. Allowed values are listed in Downloading Google Documents. | Yes | | text/plain |
| Exporting | Google Spreadsheets export format | Select | MIME type used to export Google Spreadsheets. | Yes | | text/csv |
| Exporting | Google Drawings export format | Select | MIME type used to export Google Drawings. | Yes | | image/svg+xml |
| Exporting | Google Presentations export format | Select | MIME type used to export Google Presentations. | Yes | | text/plain |
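
The export defaults listed for the source can be read as a lookup from a Google-native MIME type to the format the file is exported in. The sketch below is only an illustration of that lookup, not the plugin's code: the function name resolve_export_mime and the overrides parameter are hypothetical, while the application/vnd.google-apps.* keys are the MIME types the Drive API uses for native documents.

```python
# Defaults taken from the "Exporting" properties of the source plugin.
# Keys are the Drive API MIME types of Google-native files.
DEFAULT_EXPORT_FORMATS = {
    "application/vnd.google-apps.document": "text/plain",
    "application/vnd.google-apps.spreadsheet": "text/csv",
    "application/vnd.google-apps.drawing": "image/svg+xml",
    "application/vnd.google-apps.presentation": "text/plain",
}


def resolve_export_mime(file_mime_type, overrides=None):
    """Return the export MIME type for a Google-native file.

    Returns None for any other MIME type, meaning the file is binary and
    can be downloaded as-is. `overrides` (hypothetical) stands in for the
    values a user picks in the export format properties.
    """
    formats = dict(DEFAULT_EXPORT_FORMATS)
    formats.update(overrides or {})
    return formats.get(file_mime_type)
```

For example, a spreadsheet resolves to text/csv unless the user picks another format, and an image/png file resolves to None because it needs no export step.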

Sink

| Option level | User Facing Name | Type | Description | Optional | Constraints | Default value |
|---|---|---|---|---|---|---|
| Basic | File name field | String | Name of the schema field (STRING type) used as the file name. If it is not set, files get randomly generated 16-symbol names. | Yes | | |
| Basic | File body field | String | Name of the schema field (BYTES type) used as the file body. A minimal input schema contains only this field. | No | | |
| Basic | File mime field | String | Name of the schema field (STRING type) used as the MIME type of the file. All MIME types are supported except Google Drive types. If it is not set, the Google API tries to detect the file's MIME type automatically. | Yes | | |
| Basic | Directory identifier | String | Identifier of the destination folder. | No | | |
| Authentication | Authentication type | Radio-group | Defines the authentication type. OAuth2 and Service account types are available. | No | | OAuth2 |
| Authentication | Client ID | String | OAuth2 client ID. Shown only when 'OAuth2' is selected for the 'Authentication type' property. | Yes | | |
| Authentication | Client secret | String | OAuth2 client secret. Shown only when 'OAuth2' is selected for the 'Authentication type' property. | Yes | | |
| Authentication | Refresh token | String | OAuth2 refresh token. Shown only when 'OAuth2' is selected for the 'Authentication type' property. | Yes | | |
| Authentication | Account file path | String | Path on the local file system of the user/service account key used for authorization. Shown only when 'Service account' is selected for the 'Authentication type' property. Can be set to 'auto-detect' when running on a Dataproc cluster; the plugin then uses the value of the environment variable "GOOGLE_APPLICATION_CREDENTIALS". When running on other clusters, the file must be present on every node in the cluster. The service account JSON can be generated on the Google Cloud Service Account page. | Yes | | auto-detect |
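
The 'auto-detect' behaviour of the Account file path property, shared by the source and the sink, could be resolved roughly as follows. This is a sketch under assumptions: the function name resolve_account_file_path and the environ parameter are hypothetical, and raising an error when the variable is missing is one possible choice that the page itself does not specify.

```python
import os

AUTO_DETECT = "auto-detect"


def resolve_account_file_path(configured_path, environ=None):
    """Return the key file path to use for service account authorization.

    `environ` is a hypothetical hook for testing; it defaults to the
    process environment.
    """
    environ = os.environ if environ is None else environ
    if configured_path != AUTO_DETECT:
        # An explicit path must exist on every node of the cluster.
        return configured_path
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        # Assumption: failing fast is preferable to a late auth error.
        raise ValueError(
            "Account file path is 'auto-detect' but "
            "GOOGLE_APPLICATION_CREDENTIALS is not set"
        )
    return path
```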

Design

Approach(es)

Source

The Google Drive Source plugin reads files from a specified Google Drive folder via the Drive API.

The output schema is fully defined by the plugin settings provided by the user. Two fields are mandatory: body and offset. All other fields depend on the file properties selected in the File properties property.

The body format is defined by the Body output format property and may be BYTES or STRING. The offset field holds the byte position in the original file at which the body starts.

The plugin can limit the body size per partition through the Maximum partition size property. With the default value 0, the plugin processes entire files without partitioning and the offset field is always 0.
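
To make the relationship between Maximum partition size and the offset field concrete, the sketch below plans the (offset, length) pairs for one file. It is an illustration only: the function name plan_partitions and the fixed-size splitting strategy are assumptions, since the page does not state how the plugin cuts a file into partitions.

```python
def plan_partitions(file_size, max_partition_size=0):
    """Return (offset, length) pairs that cover a file of `file_size` bytes.

    A `max_partition_size` of 0 means unlimited, so the whole file is one
    partition that starts at offset 0.
    """
    if file_size < 0 or max_partition_size < 0:
        raise ValueError("sizes must be non-negative")
    if max_partition_size == 0 or file_size <= max_partition_size:
        return [(0, file_size)]
    # Assumption: fixed-size chunks, with a shorter final chunk.
    return [
        (offset, min(max_partition_size, file_size - offset))
        for offset in range(0, file_size, max_partition_size)
    ]
```

With these assumptions, a 10-byte file and a maximum partition size of 4 yields three records whose offset values are 0, 4 and 8.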

Example of the output schema when the user selects the id, name, mimeType, description, createdTime, modifiedTime, size, imageMediaMetadata.width, imageMediaMetadata.height and imageMediaMetadata.rotation fields in the File properties property:
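
As a rough illustration of that selection, the sketch below assembles a schema description as a plain Python dictionary. Several details are assumptions and not taken from the plugin: the function name build_output_schema, the record name driveFile, the long type of offset, and the grouping of the dotted imageMediaMetadata.* properties into a nested record. The type labels are the ones used in the File properties table.

```python
# Type labels as listed in the File properties table.
PROPERTY_TYPES = {
    "id": "string",
    "name": "string",
    "mimeType": "string",
    "description": "string",
    "createdTime": "timestamp milliseconds",
    "modifiedTime": "timestamp milliseconds",
    "size": "long",
    "imageMediaMetadata.width": "int",
    "imageMediaMetadata.height": "int",
    "imageMediaMetadata.rotation": "int",
}


def build_output_schema(selected, body_format="bytes"):
    """Describe the output schema for the selected file properties."""
    # body and offset are always present; offset as long is an assumption.
    fields = [
        {"name": "body", "type": body_format},
        {"name": "offset", "type": "long"},
    ]
    nested = {}
    for prop in selected:
        field_type = PROPERTY_TYPES[prop]
        if "." in prop:
            # Assumption: dotted names become fields of a nested record.
            parent, child = prop.split(".", 1)
            if parent not in nested:
                nested[parent] = {"type": "record", "name": parent, "fields": []}
                fields.append({"name": parent, "type": nested[parent]})
            nested[parent]["fields"].append({"name": child, "type": field_type})
        else:
            fields.append({"name": prop, "type": field_type})
    return {"type": "record", "name": "driveFile", "fields": fields}
```

Calling build_output_schema(list(PROPERTY_TYPES)) produces a record with body, offset, the seven flat properties, and one imageMediaMetadata record holding width, height and rotation.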

Sink

The Google Drive Sink plugin writes received data as files to a specified Google Drive folder via the Drive API.

The sink writes one separate file per received partition. The only required input field in the schema is the file body. The name of the schema field that holds the file body is specified with the File body field property. Only the BYTES format is supported for the body input field.

There are similar properties for the file name and MIME type. Both require the STRING format. If the name field is not specified, the sink generates random 16-symbol names. If the MIME type field is not set, Google Drive tries to determine it automatically.

The sink plugin does not support writing partitioned files.
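
The fallback to a random 16-symbol name can be sketched as below. The function name resolve_file_name and the alphanumeric alphabet are assumptions; the page only fixes the length of the generated name, not the characters it is drawn from.

```python
import secrets
import string

# Assumption: names are drawn from ASCII letters and digits.
NAME_ALPHABET = string.ascii_letters + string.digits
NAME_LENGTH = 16


def resolve_file_name(record, name_field=None):
    """Return the file name for a record (a dict of field name to value).

    Uses the value of `name_field` when the property is set and the record
    carries a value for it; otherwise generates a random 16-symbol name.
    """
    if name_field and record.get(name_field):
        return record[name_field]
    return "".join(secrets.choice(NAME_ALPHABET) for _ in range(NAME_LENGTH))
```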

Properties

Modification date range

Filters files by last modified date. Available values:

  1. None - files are not filtered.

  2. Last 7 days - from 7 days before the current time until the current date and time, for example: from "2019-09-12T19:52:13.456" to "2019-09-19T19:52:13.456".

  3. Last 30 days - from 30 days before the current time until the current date and time, for example: from "2019-08-20T19:52:13.456" to "2019-09-19T19:52:13.456".

  4. Previous quarter - from the start of the previous quarter until the end of the previous quarter, for example: from "2019-04-01T00:00:00.000" to "2019-06-30T23:59:59.999".

  5. Current quarter - from the start of the current quarter until the current date and time, for example: from "2019-07-01T00:00:00.000" to "2019-09-19T19:52:13.456".

  6. Last year - from the start of the previous year until the end of the previous year, for example: from "2018-01-01T00:00:00.000" to "2018-12-31T23:59:59.999".

  7. Current year - from the start of the current year until the current date and time, for example: from "2019-01-01T00:00:00.000" to "2019-09-19T19:52:13.456".

  8. Custom - the user enters the start and end dates manually.
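
The predefined ranges in the list above can be computed as in the following sketch, which reproduces the example boundaries for a current time of 2019-09-19T19:52:13.456. The function names are hypothetical, and timezone handling is left out: the datetimes are naive and assumed to be in UTC.

```python
from datetime import datetime, timedelta

ONE_MS = timedelta(milliseconds=1)


def quarter_start(year, quarter):
    """First instant of the given quarter (1 to 4)."""
    return datetime(year, 3 * (quarter - 1) + 1, 1)


def modification_range(option, now):
    """Return (start, end) for a predefined option, or None for "None"."""
    quarter = (now.month - 1) // 3 + 1
    if option == "None":
        return None
    if option == "Last 7 days":
        return now - timedelta(days=7), now
    if option == "Last 30 days":
        return now - timedelta(days=30), now
    if option == "Previous quarter":
        if quarter == 1:
            start = quarter_start(now.year - 1, 4)
        else:
            start = quarter_start(now.year, quarter - 1)
        # The range ends one millisecond before the current quarter starts.
        return start, quarter_start(now.year, quarter) - ONE_MS
    if option == "Current quarter":
        return quarter_start(now.year, quarter), now
    if option == "Last year":
        return datetime(now.year - 1, 1, 1), datetime(now.year, 1, 1) - ONE_MS
    if option == "Current year":
        return datetime(now.year, 1, 1), now
    raise ValueError(f"unsupported option: {option}")
```

The "Custom" value is not handled here because its boundaries come directly from the Start Date and End date properties.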

This filter is independent of the Filter property, so the user can set Modification date range to a value that conflicts with Filter.
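
One way to picture how the two filters interact is to look at the Drive API search query they could produce. The sketch below is an assumption about how the clauses are combined (joined with "and", with no consistency check between them); the function name build_query is hypothetical, while the "in parents" and modifiedTime terms are part of the Drive search query syntax.

```python
def build_query(directory_id, user_filter=None, modified_range=None):
    """Build a Drive search query string from the plugin properties.

    `modified_range` is a (start, end) pair of RFC 3339 strings, or None.
    No escaping of `directory_id` is attempted in this sketch.
    """
    clauses = [f"'{directory_id}' in parents"]
    if user_filter:
        # Parentheses keep any "or" inside the user filter self-contained.
        clauses.append(f"({user_filter})")
    if modified_range:
        start, end = modified_range
        clauses.append(f"modifiedTime >= '{start}'")
        clauses.append(f"modifiedTime <= '{end}'")
    return " and ".join(clauses)
```

Under these assumptions, a Filter of modifiedTime > '2020-01-01T00:00:00' combined with a date range that lies entirely in 2019 yields a valid query that matches no files, which is the kind of conflict described above.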

 

File properties

The user can select which file metadata properties provided by the Google Drive API to include in the output. Only a subset of the API's properties is currently available. Descriptions are taken from the Files overview.

| Property name | Type | Description |
|---|---|---|
| id | string | The ID of the file. |
| name | string | The name of the file. This is not necessarily unique within a folder. |
| mimeType | string | The MIME type of the file. Google Drive will attempt to automatically detect an appropriate value from uploaded content if no value is provided. |
| description | string | A short description of the file. |
| starred | boolean | Whether the user has starred the file. |
| trashed | boolean | Whether the file has been trashed, either explicitly or from a trashed parent folder. Only the owner may trash a file, and other users cannot see files in the owner's trash. |
| explicitlyTrashed | boolean | Whether the file has been explicitly trashed, as opposed to recursively trashed from a parent folder. |
| trashedTime | timestamp milliseconds | The time that the item was trashed (RFC 3339 date-time). Only populated for items in shared drives. |
| parents | array of strings | The IDs of the parent folders which contain the file. |
| properties | record of key-value strings | A collection of arbitrary key-value pairs which are visible to all apps. |
| spaces | string | The list of spaces which contain the file. The currently supported values are 'drive', 'appDataFolder' and 'photos'. |
| createdTime | timestamp milliseconds | The time at which the file was created (RFC 3339 date-time). |
| modifiedTime | timestamp milliseconds | The last time the file was modified by anyone (RFC 3339 date-time). |
| driveId | | ID of the shared drive the file resides in. Only populated for items in shared drives. |
| originalFilename | string | The original filename of the uploaded content if available, or else the original value of the name field. This is only available for files with binary content in Google Drive. |
| fullFileExtension | string | The full file extension extracted from the name field. May contain multiple concatenated extensions, such as "tar.gz". This is only available for files with binary content in Google Drive. |
| md5Checksum | string | The MD5 checksum for the content of the file. This is only applicable to files with binary content in Google Drive. |
| size | long | The size of the file's content in bytes. This is only applicable to files with binary content in Google Drive. |
| imageMediaMetadata.width | int | The width of the image in pixels. |
| imageMediaMetadata.height | int | The height of the image in pixels. |
| imageMediaMetadata.rotation | int | The rotation in clockwise degrees from the image's original orientation. |

Created in 2020 by Google Inc.