Versions Compared
Key
- This line was added.
- This line was removed.
- Formatting was changed.
Introduction
Google Sheets plugins will allow users to import data from Google Sheets into their pipeline, so that they can transform and enrich with other data sources.
User Storie(s)
- As a pipeline developer, I want to import data from Google Sheets, so that I can transform and enrich it using CDAP
- As a pipeline developer, I want to move all sheets from a given Google drive directory to a destination
- As a pipeline developer, I want to be able to pick certain sheets from a particular sheet to process using CDAP, so that I do not have to process all sheets all the time. I want to be able to specify the sheet using the sheet name or number.
- As a pipeline developer, I want to treat the first row of a sheet as a header, so that CDAP can automatically treat it as schema
- As a pipeline developer, I want to be able to specify a section at the top of my Sheet as a header, so that it is extracted as metadata, and not as actual data
- As a pipeline developer, I want to be able to specify a section at the bottom of my sheet as a footer, so that it is extracted as metadata, and not as actual data
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Configurables
Source
This section defines properties that are configurable for this plugin.
Section | User Facing Name | Type | Description | Optional | Constraints |
---|---|---|---|---|---|
Basic |
Directory Id | string | Directory ID is the last part of the URL, such as https://drive.google.com/drive/folders/0B2kqcwp2ycGZanhSR3JmREw5VTV | No | ||
Filtering | Filter | String |
Filter that can be applied to the files in the selected directory. Filters follow |
the Google Drive |
Yes | ||
Modification date range |
Select | Filter that narrows set of files by modified date range. User can select either among predefined or custom entered ranges. For Custom selection the dates range can be specified via Start date and End date. | Yes | |||
Start date | String | Start date for custom modification date range. Is shown only when 'Custom' range is selected for 'Modification date range' field. RFC3339 (https://tools.ietf.org/html/rfc3339) format, default timezone is UTC, e.g., 2012-06-04T12:00:00-08:00. | |||
End date | String | End date for custom modification date range. Is shown only when 'Custom' range is selected for 'Modification date range' field. RFC3339 (https://tools.ietf.org/html/rfc3339) format, default timezone is UTC, e.g., 2012-06-04T12:00:00-08:00. | |||
Sheets to pull | Select | Filter for specifying set of sheets to process. For numbers or titles selections user can populate specific values in Sheets identifiers field. Default is all value. | all | ||
Sheets identifiers | CSV | Set of sheets' numbers/titles to process. Is shown only when titles or numbers are selected for Sheets to pull field. | |||
Authentication | Authentication type | Radio-group | Defines the authentication type. OAuth2 and Service account types are available. | No | |
Client ID | String | OAuth2 client id. Is shown only when 'OAuth2' auth type is selected for 'Authentication type' property. | Yes | ||
Client secret | String | OAuth2 client secret. Is shown only when 'OAuth2' auth type is selected for 'Authentication type' property. | Yes | ||
Refresh token | String | OAuth2 refresh token. Is shown only when 'OAuth2' auth type is selected for 'Authentication type' property. | Yes | ||
Account file path | String | Path on the local file system of the user/service account key used for authorization. Is shown only when 'Service account' auth type is selected for 'Authentication type' property. | Yes | auto-detect | |
Retrying | Max Retry Count | Number | Maximum number of retry attempts. | Yes | 10 |
Max Retry Wait | Number | Maximum wait time for attempt in seconds. | Yes | 500 | |
Max Retry Jitter Wait | Number | Maximum additional wait time is milliseconds. | Yes | 100 | |
Metadata extraction | Extract metadata | Toggle | Field to enable metadata extraction. Metadata extraction is useful when user wants to specify a header or a footer for a sheet. The rows in headers and footers are not available as data records. Instead, they are available in every record as a field called |
'metadata', " |
which is a record of the specified metadata. | No | |||
Metadata record name | String | Name of the record with metadata content. It is needed to distinguish metadata record from possible column with the same name. | Yes | metadata |
First header row | Number |
Row number of the first row to be treated as header. |
Yes | ||
Last header row | Number |
Row number of the last row to be treated as header. | Yes | |
First footer row | Number |
Row number of the first row to be treated as footer. | Yes | |
Last footer row | Number |
Row number of the last row to be treated as footer. | Yes | |
Metadata |
cells | KeyValue |
Set of the cells |
for key-value pairs to extract as metadata from the specified metadata sections. Only shown if |
Extract metadata is set to true. The cell numbers should be within the header or footer. E.g. A1 → |
B5, |
A5 → |
C4 | Yes |
Specifies the cells and the corresponding key names of the values to extract as metadata from the specified metadata sections. Only shown if Capture Metadata is set to true. The cell numbers should be within the header or footer.
E.g. department → A2, category → C5
Advanced | Numeric formatting | Radio buttons | Output format for numeric sheet cells. In Formatted values case the value will contain appropriate format of source cell e.g. '1.23$', '123%'. For Values only only number value will be returned. | ||
Skip empty data | Toggle | Field to allow skipping of empty structure records. | No | false | |
Column Names Selection | Radio buttons | Source for column names. Choose between "No column names", "Treat first row as column names", "Custom row as column names" Defaults to "Treat first row as column names". | Yes | ||
Custom row for column names | Number | Row number of the row to be treated as a header. Only shown when the Column Names Selection field is set to Custom row as column names header. | Yes | 1 | |
Last data column | Number | Last column number of the maximal field of plugin work for data. | Yes | ||
Last data row | Number | Last row number of the maximal field of plugin work for data. |
Note: The data in the specified header and footer rows should not be available as records to the rest of the pipeline.
Sink
Section | User Facing Name | Type | Description | Optional | Constraints | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
App Id | string | No | Access token | string | No | Basic | Directory Id | string | Directory ID is the last part of the URL, such as https://drive.google.com/drive/folders/0B2kqcwp2ycGZanhSR3JmREw5VTV | No | |
Sheet name | string | Name of the sheet. Defaults to Sheet 1 | Yes | ||||||||
Write first row as headers | Toggle | If true, the schema is written as the first row of the sheet. Defaults to True. | Yes | Format for nested data | select | Choose amongst JSON, CSV. Format to serialize complex (nested) data as. Defaults to JSON. | Yes | . | Yes | ||
Merge data cells | Toggle | Field to choose of merging for non-array cells after array flattening. | Yes | ||||||||
Authentication | Authentication type | Radio-group | Defines the authentication type. OAuth2 and Service account types are available. | No | |||||||
Client ID | String | OAuth2 client id. Is shown only when 'OAuth2' auth type is selected for 'Authentication type' property. | Yes | ||||||||
Client secret | String | OAuth2 client secret. Is shown only when 'OAuth2' auth type is selected for 'Authentication type' property. | Yes | ||||||||
Refresh token | String | OAuth2 refresh token. Is shown only when 'OAuth2' auth type is selected for 'Authentication type' property. | Yes | ||||||||
Account file path | String | Path on the local file system of the user/service account key used for authorization. Is shown only when 'Service account' auth type is selected for 'Authentication type' property. | Yes | auto-detect | |||||||
Retrying | Max Retry Count | Number | Maximum number of retry attempts. | Yes | 10 | ||||||
Max Retry Wait | Number | Maximum wait time for attempt in seconds. | Yes | 500 | |||||||
Max Retry Jitter Wait | Number | Maximum additional wait time is milliseconds. | Yes | 100 |
Note: Incoming records should be written to columns in the sheet
Design / Implementation Tips
- Tip #1
- Tip #2
Design
Approach(s)Source
- One structure record per sheet's row.
- Single spreadsheet file per split.
- Schema is defined by data from sheets.
- Custom column names should not be empty, empty columns are skipped.
- Row numbers are counted from 1.
Sink
- Single file per input structure record.
Design
Approach(s)
Source
Google Sheets Source plugin reads spreadsheet files from specified Google Drive folder via Sheets API.
Each structure record represents single row of some sheet from spreadsheet, also it may contain metadata from metadata rows the same for all data rows of the sheet.
The plugin separates each input spreadsheet file into separate split.
The out scheme is auto-generated by plugin. It mandatory contains spreadsheet name, sheet title and set of column values, also there can be a metadata record which is optional. Data formats for cells are getting from the first data row of the first suitable sheet of the first spreadsheet which was found. In the case the first data row is empty the data formats will be set to String by default.
First data row is defined as first one after custom column names row or after last header row.
Last data row is bounded by one of the following values (the lowest is used):
- Last data row property value set by user;
- the row before First footer row (when is set);
- last row of the sheet (1000 is the default value, may be greater).
Data types transforming:
Google Sheets type | Batch source plugin output scheme type |
---|---|
String | string |
Boolean | bool |
Double | double/string in dependence of formatting settings |
Time and time intervals | long* |
Date | date |
Date time | timestamp |
* CDAP does not have any logical type for time intervals (Scheme.LogicalType.TIME_MILLIS/TIME_MICROS supports only intervals less than a day), so long type is used for number of milliseconds for time of a day/time interval.
Sink
Properties
Security
Limitation(s)
Future Work
- Some future work – HYDRATOR-99999
- Another future work – HYDRATOR-99999
Test Case(s)
- Test case #1
- Test case #2
Sample Pipeline
Please attach one or more sample pipeline(s) and associated data.
Pipeline #1
Pipeline #2
Table of Contents
Table of Contents style circle
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature