Bigtable Source
Batch source to use Google Cloud Platform's Bigtable as a source.
User Expectations
- User specifies how errors during ingestion should be handled; records that fail processing are handled according to the chosen option.
- User should be able to specify service account credentials in the configuration.
User Configurations
| Section | User Configuration Label | Label Description | Mandatory | Macro-enabled | Options | Default | Variable | User Widget |
|---|---|---|---|---|---|---|---|---|
| Standard | Reference Name | This will be used to uniquely identify this source for lineage, annotating metadata, etc. | + | + | | | referenceName | Text Box |
| | Table | Database table name. | + | + | | | table | Text Box |
| | Instance ID | Bigtable instance ID. | + | + | | | instance | Text Box |
| | Project ID | The ID of the project in Google Cloud. If not specified, it will be automatically read from the cluster environment. | | + | | | project | Text Box |
| | Service Account File Path | Path on the local file system of the service account key used for authorization. If the plugin is run on a Google Cloud Dataproc cluster, the service account key does not need to be provided and can be set to 'auto-detect'. When running on other clusters, the file must be present on every node in the cluster. See Google's documentation on Service account credentials for details. | | + | | | serviceFilePath | Text Box |
| | Key Alias | Name of the field for the row key. | | | | \_\_key\_\_ | keyAlias | Text Box |
| | Scan Row Start | Scan start row. | | + | | | scanRowStart | Text Box |
| | Scan Row Stop | Scan stop row. | | + | | | scanRowStop | Text Box |
| | Scan Time Range Start | The starting timestamp used to filter columns with a specific range of versions. | | + | | | scanTimeRangeStart | Text Box |
| | Scan Time Range Stop | The ending timestamp used to filter columns with a specific range of versions. | | + | | | scanTimeRangeStop | Text Box |
| | Schema | Specifies the schema to be output. If not specified, each item will be emitted as a JSON string by default. Only columns defined in the schema will be included in the output record. Field names should be in the form "<family>__<column>". | | + | | | schema | schema |
| Error Handling | On Record Error | How to handle errors in record processing. An error will be thrown if a value fails to parse according to the provided schema. | | + | Skip error | | on-error | Radio Button (layout: block) |
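The values in the Variable column are the property names a pipeline configuration sets for this plugin. As a rough illustration only (this is not the actual plugin source, and the field handling is simplified), the properties above could be modeled as a CDAP `PluginConfig` class:

```java
import io.cdap.cdap.api.annotation.Description;
import io.cdap.cdap.api.annotation.Macro;
import io.cdap.cdap.api.annotation.Name;
import io.cdap.cdap.api.plugin.PluginConfig;

import javax.annotation.Nullable;

// Hypothetical sketch of the source configuration; property names mirror the table above.
// The remaining scan and schema properties would follow the same pattern.
public class BigtableSourceConfig extends PluginConfig {

  @Name("referenceName")
  @Description("Uniquely identifies this source for lineage and metadata annotation.")
  private String referenceName;

  @Name("table")
  @Macro
  @Description("Database table name.")
  private String table;

  @Name("instance")
  @Macro
  @Description("Bigtable instance ID.")
  private String instance;

  @Name("project")
  @Macro
  @Nullable
  @Description("Google Cloud project ID; read from the cluster environment if not set.")
  private String project;

  @Name("serviceFilePath")
  @Macro
  @Nullable
  @Description("Path to the service account key, or 'auto-detect' on Dataproc.")
  private String serviceFilePath;

  @Name("keyAlias")
  @Nullable
  @Description("Name of the field for the row key; defaults to '__key__'.")
  private String keyAlias;

  @Name("on-error")
  @Macro
  @Description("How to handle errors in record processing.")
  private String onError;
}
```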
Bigtable Overview
Storage model
Cloud Bigtable stores data in massively scalable tables, each of which is a sorted key/value map. The table is composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row. Each row is indexed by a single row key, and columns that are related to one another are typically grouped together into a column family. Each column is identified by a combination of the column family and a column qualifier, which is a unique name within the column family.
Each row/column intersection can contain multiple cells, or versions, at different timestamps, providing a record of how the stored data has been altered over time. Cloud Bigtable tables are sparse; if a cell does not contain any data, it does not take up any space.
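Because the plugin reads through the HBase API (see Implementation Details below), this storage model maps directly onto HBase client calls. A minimal sketch, assuming a hypothetical table `my-table` with a column family `cf`:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StorageModelExample {
  // 'connection' is assumed to be an open HBase/Bigtable connection.
  static void example(Connection connection) throws Exception {
    try (Table table = connection.getTable(TableName.valueOf("my-table"))) {
      // A cell is addressed by row key + column family + column qualifier,
      // and each write at a distinct timestamp creates a new version of that cell.
      Put put = new Put(Bytes.toBytes("user#1234"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("email"),
                    System.currentTimeMillis(), Bytes.toBytes("a@example.com"));
      table.put(put);

      // Reads can ask for multiple versions of the same row/column intersection.
      Get get = new Get(Bytes.toBytes("user#1234"));
      get.setMaxVersions(3); // return up to three timestamped versions per cell
      Result result = table.get(get);
      byte[] latest = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("email"));
    }
  }
}
```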
General concepts
Designing a Cloud Bigtable schema is very different from designing a schema for a relational database. As you design your Cloud Bigtable schema, keep the following concepts in mind:
- Each table has only one index, the row key. There are no secondary indices.
- Rows are sorted lexicographically by row key, from the lowest to the highest byte string. Row keys are sorted in big-endian, or network, byte order, the binary equivalent of alphabetical order (see the key-encoding sketch after this list).
- Columns are grouped by column family and sorted in lexicographic order within the column family.
- All operations are atomic at the row level. For example, if you update two rows in a table, it's possible that one row will be updated successfully and the other update will fail. Avoid schema designs that require atomicity across rows.
- Ideally, both reads and writes should be distributed evenly across the row space of the table.
- In general, keep all information for an entity in a single row. An entity that doesn't need atomic updates and reads can be split across multiple rows. Splitting across multiple rows is recommended if the entity data is large (hundreds of MB).
- Related entities should be stored in adjacent rows, which makes reads more efficient.
- Cloud Bigtable tables are sparse. Empty columns don't take up any space. As a result, it often makes sense to create a very large number of columns, even if most columns are empty in most rows.
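To make the lexicographic ordering above concrete: numeric parts of a row key are typically encoded so that byte order matches the intended sort order, otherwise "10" would sort before "9". The key layouts below are purely illustrative:

```java
import java.nio.charset.StandardCharsets;

public class RowKeyOrdering {
  // Zero-padding numeric identifiers keeps lexicographic byte order in line
  // with numeric order: "event#0000000009" < "event#0000000010".
  static byte[] rowKey(long eventId) {
    return String.format("event#%010d", eventId).getBytes(StandardCharsets.UTF_8);
  }

  // A reversed timestamp (Long.MAX_VALUE - ts) is a common trick for making
  // the most recent rows sort first, since keys sort from lowest to highest.
  static byte[] latestFirstKey(String entity, long timestampMillis) {
    return String.format("%s#%019d", entity, Long.MAX_VALUE - timestampMillis)
        .getBytes(StandardCharsets.UTF_8);
  }
}
```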
Supported data types
Cloud Bigtable treats all data as raw byte strings for most purposes. The only time Cloud Bigtable tries to determine the type is for increment operations, where the target must be a 64-bit integer encoded as an 8-byte big-endian value.
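As an illustration of that encoding, the HBase client's `Bytes.toBytes(long)` produces exactly the 8-byte big-endian representation an increment target needs. The table and column names below are hypothetical:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IncrementExample {
  static void example(Connection connection) throws Exception {
    try (Table table = connection.getTable(TableName.valueOf("my-table"))) {
      // Bytes.toBytes(long) encodes the value as 8 bytes in big-endian order,
      // which is the only representation Bigtable understands for increments.
      Put seed = new Put(Bytes.toBytes("counter#page-views"));
      seed.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(0L));
      table.put(seed);

      // Atomically add 1 to the 64-bit big-endian counter cell.
      long updated = table.incrementColumnValue(
          Bytes.toBytes("counter#page-views"), Bytes.toBytes("cf"),
          Bytes.toBytes("count"), 1L);
    }
  }
}
```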
Implementation Details
- Tasks will be split using org.apache.hadoop.hbase.mapreduce.TableInputFormat.
- The plugin will fetch only the latest version of each cell.
- Values will be converted into StructuredRecord format using the provided output schema, as sketched below.
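A simplified sketch of that conversion, using the "<family>__<column>" field-naming convention from the schema description above. The helper below is illustrative only (it treats every value as a string, assumes all non-key fields follow the naming convention, and ignores the error-handling options); it is not the plugin's actual implementation:

```java
import io.cdap.cdap.api.data.format.StructuredRecord;
import io.cdap.cdap.api.data.schema.Schema;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RecordConversionSketch {
  // Illustrative only: maps each "<family>__<column>" schema field to the
  // latest cell value in the scanned HBase Result, plus the row key alias.
  static StructuredRecord toRecord(Result result, Schema schema, String keyAlias) {
    StructuredRecord.Builder builder = StructuredRecord.builder(schema);
    for (Schema.Field field : schema.getFields()) {
      String name = field.getName();
      if (name.equals(keyAlias)) {
        builder.set(name, Bytes.toString(result.getRow()));
        continue;
      }
      // Split "<family>__<column>" into its family and qualifier parts.
      int sep = name.indexOf("__");
      byte[] family = Bytes.toBytes(name.substring(0, sep));
      byte[] qualifier = Bytes.toBytes(name.substring(sep + 2));
      byte[] value = result.getValue(family, qualifier);
      if (value != null) {
        builder.set(name, Bytes.toString(value)); // real plugin would honor field types
      }
    }
    return builder.build();
  }
}
```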
Reference
Created in 2020 by Google Inc.