This sink writes to a BigQuery table. BigQuery is Google's serverless, highly scalable, enterprise data warehouse. Data is first written to a temporary location on Google Cloud Storage, then loaded into BigQuery from there.
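Conceptually, that two-phase flow looks like the following Python sketch using the Google Cloud client libraries. The bucket, dataset, and table names are placeholders, and the plugin performs these steps internally; this illustrates the mechanism, not the plugin's API.

```python
# A minimal sketch of the two-phase write the sink performs: stage data in
# GCS, then issue a BigQuery load job. Illustrative only -- all names here
# are hypothetical placeholders.
from google.cloud import bigquery, storage

storage_client = storage.Client()
bq_client = bigquery.Client()

# Phase 1: upload a staged file to a temporary GCS location. chunk_size
# mirrors the "GCS Upload Request Chunk Size" property (8 MB default).
bucket = storage_client.bucket("my-temp-bucket")
blob = bucket.blob("staging/part-00000.avro")
blob.chunk_size = 8 * 1024 * 1024  # 8388608 bytes
blob.upload_from_filename("part-00000.avro")

# Phase 2: load the staged file into BigQuery.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
load_job = bq_client.load_table_from_uri(
    "gs://my-temp-bucket/staging/part-00000.avro",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load to complete

# Phase 3: delete the temporary data, as the sink does after the load.
blob.delete()
```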
Credentials
If the plugin is run on a Google Cloud Dataproc cluster, the service account key does not need to be provided and can be set to 'auto-detect'. Credentials will be automatically read from the cluster environment.
If the plugin is not run on a Dataproc cluster, the path to a service account key must be provided. The service account key can be found on the Dashboard in the Cloud Platform Console. Make sure the account key has permission to access BigQuery and Google Cloud Storage. The service account key file needs to be available on every node in your cluster and must be readable by all users running the job.
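For reference, the two credential modes correspond to the standard Google Cloud authentication flows. A minimal Python sketch, with a placeholder key path:

```python
# A minimal sketch of the two credential modes, using the standard
# google-auth flow. The key file path is a placeholder.
from google.cloud import bigquery
from google.oauth2 import service_account

# 'auto-detect' on Dataproc: Application Default Credentials are picked up
# from the cluster environment, so no key file is needed.
client = bigquery.Client()

# Off-Dataproc: point at a service account key file that must be present
# and readable on every node in the cluster.
credentials = service_account.Credentials.from_service_account_file(
    "/etc/secrets/service-account.json"
)
client = bigquery.Client(
    credentials=credentials, project=credentials.project_id
)
```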
Configuration
Property | Macro Enabled? | Version Introduced | Description |
---|---|---|---|
Reference Name | No | | Required. Name used to uniquely identify this sink for lineage, annotating metadata, etc. |
Project ID | Yes | | Optional. Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. This is the project that the BigQuery job will run in. If a temporary bucket needs to be created, the service account must have permission in this project to create buckets. Default is auto-detect. |
Dataset Project Id | Yes | | Optional. Project the dataset belongs to. This is only required if the dataset is not in the same project that the BigQuery job will run in. If no value is given, it defaults to the configured Project ID. |
Dataset | Yes | | Required. Dataset the table belongs to. A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to tables and views. |
Table | Yes | | Required. Table to write to. A table contains individual records organized in rows. Each record is composed of columns (also called fields). Every table is defined by a schema that describes the column names, data types, and other information. |
Temporary Bucket Name | Yes | | Optional. Google Cloud Storage bucket to store temporary data in. It will be automatically created if it does not exist, but will not be automatically deleted. Temporary data will be deleted after it is loaded into BigQuery. If it is not provided, a unique bucket will be created and then deleted after the run finishes. |
GCS Upload Request Chunk Size | Yes | | Optional. GCS upload request chunk size in bytes. Default value is 8388608 bytes (8 MB). |
Service Account Type | Yes | 6.3.0 / 0.16.0 | Optional. Select one of the following options: File Path (path on the local file system of the service account key) or JSON (JSON content of the service account key). |
Service Account File Path | Yes | | Required. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster. Default is auto-detect. |
Service Account JSON | Yes | 6.3.0 / 0.16.0 | Optional. Content of the service account. |
Operation | Yes | | Optional. Type of write operation to perform. This can be set to Insert, Update, or Upsert. Default is Insert. A conceptual sketch of the Update and Upsert behavior follows this table. |
Table Key | Yes | | Optional. List of fields that determines the relation between tables during Update and Upsert operations. |
Dedupe By | Yes | | Optional. Column names and sort order used to choose which input record to update/upsert when there are multiple input records with the same key. For example, if this is set to 'updated_time desc' and there are multiple input records with the same key, the one with the largest value for 'updated_time' will be applied. |
Partition Filter | Yes | | Optional. Partition filter that can be used for partition elimination during Update or Upsert operations. Should only be used with Update or Upsert operations on tables that have Require Partition Filter enabled. For example, if the table is partitioned and the Partition Filter is '_PARTITIONTIME > "2020-01-01" and _PARTITIONTIME < "2020-03-01"', the update operation will be performed only on the partitions that meet the criteria. |
Truncate Table | Yes | | Optional. Whether or not to truncate the table before writing to it. Should only be used with the Insert operation. Note: If you set both Truncate Table and Update Table Schema to True, only Truncate Table will be applied when the pipeline runs; Update Table Schema will be ignored. Default is False. |
Update Table Schema | Yes | | Optional. Whether the BigQuery table schema should be modified when it does not match the schema expected by the pipeline. Compatible changes, such as adding new nullable fields or relaxing a required field to nullable, are applied automatically (see the schema-update sketch after this table). Incompatible schema changes will result in pipeline failure. Note: If you set both Truncate Table and Update Table Schema to True, only Truncate Table will be applied when the pipeline runs; Update Table Schema will be ignored. Default is False. |
Location | Yes | | Optional. The location where the BigQuery dataset will get created. This value is ignored if the dataset or temporary bucket already exists. Default is US. |
Create Partitioned Table | Yes | | [DEPRECATED] Optional. Whether to create the BigQuery table with time partitioning. This value is ignored if the table already exists. Default is False. |
Partitioning Type | Yes | 6.3.0 / 0.16.0 | Optional. Specifies the partitioning type. Can be either Integer, Time, or None. Defaults to Time. This value is ignored if the table already exists (see the table-creation sketch after this table). |
Range Start | Yes | 6.3.0 / 0.16.0 | Optional. For integer partitioning, specifies the start of the range. Only used when the table does not already exist and Partitioning Type is set to Integer. |
Range End | Yes | 6.3.0 / 0.16.0 | Optional. For integer partitioning, specifies the end of the range. Only used when the table does not already exist and Partitioning Type is set to Integer. |
Range Interval | Yes | 6.3.0 / 0.16.0 | Optional. For integer partitioning, specifies the partition interval. Only used when the table does not already exist and Partitioning Type is set to Integer. |
Partition Field | Yes | | Optional. Partitioning column for the BigQuery table. This should be left empty if the BigQuery table is an ingestion-time partitioned table. |
Require Partition Filter | Yes | | Optional. Whether to create a table that requires a partition filter. This value is ignored if the table already exists. Default is False. |
Clustering Order | Yes | | Optional. List of fields that determines the sort order of the data. Fields must be of type INT, LONG, STRING, DATE, TIMESTAMP, BOOLEAN, or DECIMAL. Tables cannot be clustered on more than 4 fields. This value is only used when the BigQuery table is automatically created, and is ignored if the table already exists. |
Output Schema | Yes | | Required. Schema of the data to write. If a schema is provided, it must be compatible with the table schema in BigQuery. |
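As referenced in the Operation row above, an Upsert with a Table Key of `id`, a Dedupe By of `updated_time desc`, and a Partition Filter behaves much like the following BigQuery MERGE, shown here through the Python client. This is a conceptual sketch, not the exact statement the plugin generates; the table and column names are hypothetical.

```python
# Conceptual sketch of Upsert with Table Key "id", Dedupe By
# "updated_time desc", and a Partition Filter. The plugin stages input in a
# temporary location and merges it; this approximates that behavior.
from google.cloud import bigquery

client = bigquery.Client()
merge_sql = """
MERGE `my_dataset.target` AS T
USING (
  -- Dedupe By: keep one row per key, preferring the largest updated_time.
  SELECT * EXCEPT(row_num) FROM (
    SELECT *, ROW_NUMBER() OVER (
      PARTITION BY id ORDER BY updated_time DESC
    ) AS row_num
    FROM `my_dataset.staging`
  ) WHERE row_num = 1
) AS S
-- Table Key joins source to target; the Partition Filter prunes partitions.
ON T.id = S.id
   AND T._PARTITIONTIME > "2020-01-01" AND T._PARTITIONTIME < "2020-03-01"
WHEN MATCHED THEN
  UPDATE SET name = S.name, updated_time = S.updated_time
WHEN NOT MATCHED THEN
  INSERT (id, name, updated_time) VALUES (S.id, S.name, S.updated_time)
"""
client.query(merge_sql).result()
```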
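Similarly, the table-creation properties (Partitioning Type, Range Start/End/Interval, Partition Field, Require Partition Filter, and Clustering Order) map onto standard BigQuery table settings, roughly as in this sketch with hypothetical field names:

```python
# Rough mapping of the table-creation properties onto the BigQuery client.
# The plugin only applies these settings when it creates the table itself.
from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("customer_id", "INT64"),
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
]

# Partitioning Type = Time, Partition Field = created_at,
# Require Partition Filter = True, Clustering Order = customer_id, name.
table = bigquery.Table("my-project.my_dataset.my_table", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="created_at"
)
table.require_partition_filter = True
table.clustering_fields = ["customer_id", "name"]

# Partitioning Type = Integer would instead use Range Start/End/Interval:
# table.range_partitioning = bigquery.RangePartitioning(
#     field="customer_id",
#     range_=bigquery.PartitionRange(start=0, end=100000, interval=1000),
# )

client.create_table(table)
```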
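Finally, the Update Table Schema behavior corresponds to BigQuery's load-time schema update options, under which field addition and field relaxation are the compatible changes. A rough sketch, again with placeholder names:

```python
# Rough equivalent of Update Table Schema = True on an append: the load job
# may add nullable fields and relax required fields. Any other schema change
# fails the job, mirroring the pipeline-failure behavior described above.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ],
)
job = client.load_table_from_uri(
    "gs://my-temp-bucket/staging/*.avro",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
job.result()
```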
Data Type Mappings from CDAP to BigQuery
The following table lists the CDAP data types and the corresponding BigQuery data type for each, for updates and upserts.
Note: Support for the datetime data type was introduced in CDAP 6.4.0.
CDAP Schema Data Type | BigQuery Data Type |
---|---|
array | repeated |
boolean | bool |
bytes | bytes |
date | date |
datetime | datetime, string |
decimal | numeric |
double / float | float64 |
enum | unsupported |
int / long | int64 |
map | unsupported |
record | struct |
string | string, datetime (strings must be in ISO 8601 format) |
time | time |
timestamp | timestamp |
union | unsupported |
For inserts, the type conversions are the same as those used when loading Avro data into BigQuery; see the Avro conversion table in the BigQuery documentation.
For more information on BigQuery data types, see Standard SQL Data Types.