Introduction
Google Drive plugins help users move entire files from a source to a destination. Along the way, users can also run transformations on unstructured data such as images, audio, and video.
User Stories
- As a pipeline developer, I want to move all files from a Google Drive directory to a different destination.
- As a pipeline developer, I want to move all files from a Google Drive directory that satisfy a filter to a different destination.
- As a pipeline developer, I want to pull all images from a Google Drive directory, so that I can process them using image recognition APIs.
- As a pipeline developer, I want to pull all audio and video files from a Google Drive directory, so that I can process them to extract metadata, generate transcripts, or apply other enrichments.
- As a pipeline developer, I want to move all files from an FTP source into Google Drive.
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Configurables
This section defines properties that are configurable for this plugin.
Source
Option level | User Facing Name | Type | Description | Optional | Constraints | Default value |
---|---|---|---|---|---|---|
Basic | Directory identifier | String | Identifier of the source folder. | No | | |
Basic | Filter | String | A filter applied to the files in the selected directory. Filters follow the Google Drive filter syntax. | Yes | | |
Basic | Modification date range | Select | In addition to the filter above, pull only files that were modified within the given date range. | Yes | select | |
Basic | Start date | Textbox | Shown only when "Modification date range" is set to "Custom". Start of the modification date range, in RFC 3339 format; the default timezone is UTC. Example: 2012-06-04T12:00:00-08:00. | No | | |
Basic | End date | Textbox | Shown only when "Modification date range" is set to "Custom". End of the modification date range, in RFC 3339 format; the default timezone is UTC. Example: 2012-06-04T12:00:00-08:00. | No | | |
Basic | File properties | Multi-select | Properties to retrieve for each file in the directory. Allowed names are listed in the Google Drive API Files reference. | Yes | | |
Basic | File types to pull | Multi-select | Types of files to pull from the specified directory. | Yes | binary | |
Authentication | Client ID | String | OAuth2 client ID. | No | | |
Authentication | Client secret | String | OAuth2 client secret. | No | | |
Authentication | Refresh token | String | OAuth2 refresh token. | No | | |
Authentication | Access token | String | OAuth2 access token. | No | | |
Advanced | Maximum partition size | Number | Maximum partition size in bytes. The default value 0 means unlimited. | Yes | 0 | |
Advanced | Body output format | Radio-group | Format of the file body. Available values are "Bytes" and "String". | Yes | bytes | |
Exporting | Google Documents export format | Select | MIME type to export Google Documents to. Allowed values are listed in Downloading Google Documents. | Yes | text/plain | |
Exporting | Google Spreadsheets export format | Select | MIME type to export Google Spreadsheets to. | Yes | text/csv | |
Exporting | Google Drawings export format | Select | MIME type to export Google Drawings to. | Yes | image/svg+xml | |
Exporting | Google Presentations export format | Select | MIME type to export Google Presentations to. | Yes | text/plain | |
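As an illustration of how the Directory identifier, Filter, and date range properties might work together, the sketch below combines them into a single Google Drive search query (the `q` parameter of a file listing request). The function name `build_drive_query` and its parameters are hypothetical and not part of the plugin; the sketch assumes the date bounds are already formatted as RFC 3339 strings.

```python
from typing import Optional


def build_drive_query(directory_id: str,
                      user_filter: Optional[str] = None,
                      modified_from: Optional[str] = None,
                      modified_to: Optional[str] = None) -> str:
    """Combine source properties into one Drive search string (hypothetical helper)."""
    # Restrict the search to direct children of the source folder.
    clauses = [f"'{directory_id}' in parents"]
    if user_filter:
        # Parenthesize so an `or` inside the user filter stays scoped to the folder.
        clauses.append(f"({user_filter})")
    if modified_from:
        clauses.append(f"modifiedTime >= '{modified_from}'")
    if modified_to:
        clauses.append(f"modifiedTime <= '{modified_to}'")
    return " and ".join(clauses)
```

Because every clause is joined with `and`, a Filter that contradicts the date range simply yields no files rather than an error.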
Sink
Option level | User Facing Name | Type | Description | Optional | Constraints |
---|---|---|---|---|---|
Basic | File name field | String | Name of the schema field (should be of STRING type) used as the file name. If not set, files get randomly generated 16-character names. | Yes | |
Basic | File body field | String | Name of the schema field (should be of BYTES type) used as the file body. A minimal input schema contains only this field. | No | |
Basic | Directory identifier | String | Identifier of the destination folder. | No | |
Authentication | Client ID | String | OAuth2 client ID. | No | |
Authentication | Client secret | String | OAuth2 client secret. | No | |
Authentication | Refresh token | String | OAuth2 refresh token. | No | |
Authentication | Access token | String | OAuth2 access token. | No | |
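The sink's handling of the File name field and File body field could look roughly like the sketch below. It is a minimal illustration, not the plugin's implementation: `resolve_file` and `random_file_name` are hypothetical names, a record is modeled as a plain mapping, and the alphabet used for random names (ASCII letters and digits) is an assumption.

```python
import secrets
import string
from typing import Mapping, Optional, Tuple

# Assumed alphabet for generated names; the design only fixes the length at 16.
_ALPHABET = string.ascii_letters + string.digits


def random_file_name(length: int = 16) -> str:
    """Generate a random name of the given length."""
    return "".join(secrets.choice(_ALPHABET) for _ in range(length))


def resolve_file(record: Mapping[str, object],
                 body_field: str,
                 name_field: Optional[str] = None) -> Tuple[str, bytes]:
    """Pick the file name and body out of an input record."""
    body = record[body_field]
    if not isinstance(body, (bytes, bytearray)):
        raise TypeError(f"field '{body_field}' must be of BYTES type")
    if name_field is None:
        # No name field configured: fall back to a random 16-character name.
        name = random_file_name()
    else:
        name = record[name_field]
        if not isinstance(name, str):
            raise TypeError(f"field '{name_field}' must be of STRING type")
    return name, bytes(body)
```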
Design / Implementation Tips
- Tip #1
- Tip #2
Design
Approach(s)
Properties
Modification date range
Filters files by last modified date. Available values:
- None - files are not filtered.
- Last 7 days - from 7 days before the current date and time until the current date and time, for example: from "2019-09-12T19:52:13.456" to "2019-09-19T19:52:13.456".
- Last 30 days - from 30 days before the current date and time until the current date and time, for example: from "2019-08-20T19:52:13.456" to "2019-09-19T19:52:13.456".
- Previous quarter - from the start of the previous quarter until the end of the previous quarter, for example: from "2019-04-01T00:00:00.000" to "2019-06-30T23:59:59.999".
- Current quarter - from the start of the current quarter until the current date and time, for example: from "2019-07-01T00:00:00.000" to "2019-09-19T19:52:13.456".
- Last year - from the start of the previous year until the end of the previous year, for example: from "2018-01-01T00:00:00.000" to "2018-12-31T23:59:59.999".
- Current year - from the start of the current year until the current date and time, for example: from "2019-01-01T00:00:00.000" to "2019-09-19T19:52:13.456".
- Custom - the user enters the start and end dates manually.
This filter is independent of the Filter property, so the user can set Modification date range to a value that conflicts with Filter.
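The boundaries listed above can be derived from a single "current date and time" value. The following sketch shows one way to do that; the function names `quarter_start` and `modification_range` are hypothetical, and the option labels are taken as plain strings. "Custom" is deliberately left out because its bounds come from the Start date and End date properties.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

_MS = timedelta(milliseconds=1)


def quarter_start(moment: datetime) -> datetime:
    """Return midnight on the first day of the quarter containing `moment`."""
    first_month = 3 * ((moment.month - 1) // 3) + 1
    return datetime(moment.year, first_month, 1)


def modification_range(option: str,
                       now: datetime) -> Optional[Tuple[datetime, datetime]]:
    """Map a predefined option to (start, end); None means no date filtering."""
    if option == "None":
        return None
    if option == "Last 7 days":
        return now - timedelta(days=7), now
    if option == "Last 30 days":
        return now - timedelta(days=30), now
    if option == "Current quarter":
        return quarter_start(now), now
    if option == "Previous quarter":
        # One millisecond before the current quarter starts is the last
        # instant of the previous quarter.
        end = quarter_start(now) - _MS
        return quarter_start(end), end
    if option == "Current year":
        return datetime(now.year, 1, 1), now
    if option == "Last year":
        return datetime(now.year - 1, 1, 1), datetime(now.year, 1, 1) - _MS
    # "Custom" and unknown labels are not handled here.
    raise ValueError(f"unsupported option: {option}")
```

Stepping back one millisecond from the next period's start reproduces the "23:59:59.999" end values used in the examples and also handles the year boundary when the current quarter is the first of the year.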
File properties
The user can select which file metadata properties, as provided by the Google Drive API, to retrieve. Not all properties are available yet. The descriptions below are taken from the Files overview.
Property name | Type | Description |
---|---|---|
id | string | The ID of the file. |
name | string | The name of the file. This is not necessarily unique within a folder. |
mimeType | string | The MIME type of the file. Google Drive will attempt to automatically detect an appropriate value from uploaded content if no value is provided. |
description | string | A short description of the file. |
starred | boolean | Whether the user has starred the file. |
trashed | boolean | Whether the file has been trashed, either explicitly or from a trashed parent folder. Only the owner may trash a file, and other users cannot see files in the owner's trash. |
explicitlyTrashed | boolean | Whether the file has been explicitly trashed, as opposed to recursively trashed from a parent folder. |
trashedTime | timestamp milliseconds | The time that the item was trashed (RFC 3339 date-time). Only populated for items in shared drives. |
parents | array of strings | The IDs of the parent folders which contain the file. |
properties | record of key-value strings | A collection of arbitrary key-value pairs which are visible to all apps. |
spaces | string | The list of spaces which contain the file. The currently supported values are 'drive', 'appDataFolder' and 'photos'. |
createdTime | timestamp milliseconds | The time at which the file was created (RFC 3339 date-time). |
modifiedTime | timestamp milliseconds | The last time the file was modified by anyone (RFC 3339 date-time). |
driveId | string | ID of the shared drive the file resides in. Only populated for items in shared drives. |
originalFilename | string | The original filename of the uploaded content if available, or else the original value of the name field. This is only available for files with binary content in Google Drive. |
fullFileExtension | string | The full file extension extracted from the name field. May contain multiple concatenated extensions, such as "tar.gz". This is only available for files with binary content in Google Drive. |
md5Checksum | string | The MD5 checksum for the content of the file. This is only applicable to files with binary content in Google Drive. |
size | long | The size of the file's content in bytes. This is only applicable to files with binary content in Google Drive. |
imageMediaMetadata.width | int | The width of the image in pixels. |
imageMediaMetadata.height | int | The height of the image in pixels. |
imageMediaMetadata.rotation | int | The rotation in clockwise degrees from the image's original orientation. |
imageMediaMetadata.location.latitude | double | The latitude stored in the image. |
imageMediaMetadata.location.longitude | double | The longitude stored in the image. |
imageMediaMetadata.location.altitude | double | The altitude stored in the image. |
imageMediaMetadata.time | string | The date and time the photo was taken (EXIF DateTime). |
imageMediaMetadata.cameraMake | string | The make of the camera used to create the photo. |
imageMediaMetadata.cameraModel | string | The model of the camera used to create the photo. |
imageMediaMetadata.exposureTime | float | The length of the exposure, in seconds. |
imageMediaMetadata.aperture | float | The aperture used to create the photo (f-number). |
imageMediaMetadata.flashUsed | boolean | Whether a flash was used to create the photo. |
imageMediaMetadata.focalLength | float | The focal length used to create the photo, in millimeters. |
imageMediaMetadata.isoSpeed | int | The ISO speed used to create the photo. |
imageMediaMetadata.meteringMode | string | The metering mode used to create the photo. |
imageMediaMetadata.sensor | string | The type of sensor used to create the photo. |
imageMediaMetadata.exposureMode | string | The exposure mode used to create the photo. |
imageMediaMetadata.colorSpace | string | The color space of the photo. |
imageMediaMetadata.whiteBalance | string | The white balance mode used to create the photo. |
imageMediaMetadata.exposureBias | float | The exposure bias of the photo (APEX value). |
imageMediaMetadata.maxApertureValue | float | The smallest f-number of the lens at the focal length used to create the photo (APEX value). |
imageMediaMetadata.subjectDistance | int | The distance to the subject of the photo, in meters. |
imageMediaMetadata.lens | string | The lens used to create the photo. |
videoMediaMetadata.width | int | The width of the video in pixels. |
videoMediaMetadata.height | int | The height of the video in pixels. |
videoMediaMetadata.durationMillis | long | The duration of the video in milliseconds. |
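The dotted property names in the table above map naturally onto the nested field selector that the Google Drive API accepts in the `fields` parameter of a file listing request. The sketch below shows one possible translation; `build_fields_param` is a hypothetical helper, and including `nextPageToken` so that pagination keeps working is an assumption about how the listing would be consumed.

```python
from typing import Dict, Iterable


def build_fields_param(properties: Iterable[str]) -> str:
    """Turn dotted property names into a nested partial-response selector."""
    # Build a tree such as {"imageMediaMetadata": {"width": {}, "location": {...}}}.
    tree: Dict[str, dict] = {}
    for name in properties:
        node = tree
        for part in name.split("."):
            node = node.setdefault(part, {})
    if not tree:
        raise ValueError("at least one property must be selected")

    def render(node: Dict[str, dict]) -> str:
        # Leaves render as plain names, inner nodes as name(sub,fields).
        return ",".join(
            f"{key}({render(child)})" if child else key
            for key, child in node.items()
        )

    return f"nextPageToken,files({render(tree)})"
```

For example, selecting `id`, `name`, and `imageMediaMetadata.width` produces `nextPageToken,files(id,name,imageMediaMetadata(width))`.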
File types to pull
By format, all files in Google Drive fall into two types: Google formats and all others (binary). Files in Google formats cannot be downloaded directly; they must first be exported to a binary format, and only when an export option is available for that format. Binary files can be downloaded directly. The user can specify which types to download or export:
- Binary - downloaded directly (text/plain, image/bmp, video/mp4, etc.).
- Google Documents - exported to the format specified in the Google Documents export format property.
- Google Spreadsheets - exported to the format specified in the Google Spreadsheets export format property.
- Google Drawings - exported to the format specified in the Google Drawings export format property.
- Google Presentations - exported to the format specified in the Google Presentations export format property.
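The download-versus-export decision described above can be sketched as a small dispatch on the file's MIME type. This is an illustration under stated assumptions, not the plugin's code: `plan_fetch` and the lowercase type keys (`binary`, `documents`, and so on) are hypothetical names, and Google formats outside the four listed types (for example folders) are simply skipped.

```python
from typing import Mapping, Optional, Set, Tuple

# MIME types Google Drive uses for its own formats, mapped to assumed type keys.
GOOGLE_MIME_TO_TYPE = {
    "application/vnd.google-apps.document": "documents",
    "application/vnd.google-apps.spreadsheet": "spreadsheets",
    "application/vnd.google-apps.drawing": "drawings",
    "application/vnd.google-apps.presentation": "presentations",
}
GOOGLE_PREFIX = "application/vnd.google-apps."


def plan_fetch(mime_type: str,
               selected_types: Set[str],
               export_formats: Mapping[str, str]
               ) -> Optional[Tuple[str, Optional[str]]]:
    """Return ("download", None), ("export", target MIME type), or None to skip."""
    if mime_type.startswith(GOOGLE_PREFIX):
        file_type = GOOGLE_MIME_TO_TYPE.get(mime_type)
        if file_type is None or file_type not in selected_types:
            # Unsupported Google format, or a type the user did not select.
            return None
        return "export", export_formats[file_type]
    if "binary" in selected_types:
        return "download", None
    return None
```

A call such as `plan_fetch("application/vnd.google-apps.spreadsheet", {"binary", "spreadsheets"}, {"spreadsheets": "text/csv"})` returns `("export", "text/csv")`, while the same call for `image/bmp` returns `("download", None)`.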
Security
Limitation(s)
Future Work
- Some future work – HYDRATOR-99999
- Another future work – HYDRATOR-99999
Test Case(s)
- Test case #1
- Test case #2
Sample Pipeline
Please attach one or more sample pipeline(s) and associated data.
Pipeline #1
Pipeline #2
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature