Google BigQuery Replication Target

This Target plugin writes to BigQuery. The plugin requires write access to both BigQuery and a GCS staging bucket. Change events are first written in batches to GCS. They are then loaded into staging tables in BigQuery. Finally, changes from the staging table are merged into the final target table using a BigQuery merge query.

The final target tables will include all the original columns from the source table plus one additional _sequence_num column. The sequence number is used to ensure that data is not duplicated or missed in replicator failure scenarios.

Credentials

If the plugin is run on a Google Cloud Dataproc cluster, the service account key does not need to be provided and can be set to 'auto-detect'. Credentials will be automatically read from the cluster environment.

If the plugin is not run on a Dataproc cluster, the path to a service account key must be provided. The service account key can be found on the Dashboard in the Cloud Platform Console. Make sure the account key has permission to access BigQuery. The service account key file needs to be available on every node in your cluster and must be readable by all users running the job.

Limitations

Tables must have a primary key in order to be replicated.

Table rename operations are not supported. If a rename event is encountered, it will be ignored.

Table alters are partially supported. An existing non-nullable column can be altered into a nullable column. New nullable columns can be added to an existing table. Any other type of alteration to the table schema will fail. Changes to the primary key will not fail, but existing data will not rewritten to obey uniqueness on the new primary key.

Properties

Property

Macro Enabled?

Description

Property

Macro Enabled?

Description

Project ID

No

Required. Project of the BigQuery dataset. When running on a Dataproc cluster, this can be left blank, which will use the project of the cluster.

Service Account Key

Yes

Required. The contents of the service account key (JSON) to use when interacting with GCS and BigQuery. When running on a Dataproc cluster, this can be left blank, which will use the service account of the cluster.

Dataset Name

No

Optional. Name of the dataset to be created in the BigQuery. It’s optional and by default the dataset name is same as source database name. A valid name should only contain letters, numbers, and underscores and maximum length can be 1024. Any invalid chars would be replaced with underscore in the final dataset name and any characters exceeds length limit will be truncated.

Encryption Key Name

No

Optional. GCP Customer-managed encryption key (CMEK) used to encrypt the resources created by this target. Encryption key name should be of the form “projects/<gcp-project-id>/locations/<key-location>/keyRings/<key-ring-name>/cryptoKeys/<key-name>“.

Location

No

Optional. The location where the BigQuery dataset and GCS staging bucket will get created. For example, ‘us-east1’ for regional bucket, ‘us’ for multi-regional bucket. A complete list of available locations can be found at https://cloud.google.com/bigquery/docs/locations. This value is ignored if an existing GCS bucket is specified, as the staging bucket and the BigQuery dataset will be created in the same location as that bucket.

Default is us.

Staging Bucket

No

Optional. GCS bucket to write change events to before loading them into staging tables. Changes are written to a directory that contains the replication job name and namespace. It is safe to use the same bucket across multiple replication jobs within the same instance. If it is shared by replication jobs across multiple instances, ensure that the namespace and name are unique, otherwise the behavior is undefined. The bucket must be in the same location as the BigQuery dataset. If not provided, new bucket will be created for each job named as 'df-rbq---'. Note that user will have to explicitly delete the bucket once the job is deleted.

Staging Bucket Location

No

Optional. The location where the staging gcs bucket will get created. For example, 'us-east1' for regional bucket, 'us' for multi-regional bucket. Complete list of available GCS locations can be found here https://cloud.google.com/storage/docs/locations. This value is ignored if the bucket already exists.

Load Interval (seconds)

No

Optional. Number of seconds to wait before loading a batch of data into BigQuery.

Staging Table Prefix

No

Optional. Changes are first written to a staging table before merged to the final table. Staging tables names are generated by prepending this prefix to the target table name.

Required Manual Drop Intervention

No

Optional. Whether to require manual administrative action to drop tables and datasets when a drop table or drop database event is encountered. When set to true, the replicator will not delete a table or dataset. Instead, it will fail and retry until the table or dataset does not exist. If the dataset or table does not already exist, no manual intervention is required. The event will be skipped as normal.