Google BigQuery Table Batch Source

Plugin version: 0.22.0

Reads the entire contents of a BigQuery table. BigQuery is Google's serverless, highly scalable, enterprise data warehouse. Data from the BigQuery table is first exported to a temporary location on Google Cloud Storage, and then read into the pipeline from there.
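Under the hood, this is a two-step flow: a BigQuery export job writes the table to a temporary Cloud Storage path, and the pipeline then reads those exported files. The following is a minimal sketch of that flow using the google-cloud-bigquery Python client, with placeholder project, dataset, table, and bucket names; the plugin's actual export format and file layout may differ.

    # Sketch only: approximate the export-then-read flow the plugin performs.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project

    # Step 1: export the BigQuery table to a temporary Cloud Storage location.
    job_config = bigquery.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.AVRO
    extract_job = client.extract_table(
        "my-project.my_dataset.my_table",              # placeholder table
        "gs://my-temp-bucket/bq-export/part-*.avro",   # placeholder bucket and path
        job_config=job_config,
    )
    extract_job.result()  # wait for the export job to finish

    # Step 2: the pipeline reads the exported files from the temporary location;
    # the temporary data is deleted once it has been read.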

Credentials

If the plugin is run on a Google Cloud Dataproc cluster, the service account key does not need to be provided and can be set to 'auto-detect'. Credentials will be automatically read from the cluster environment.

If the plugin is not run on a Dataproc cluster, the path to a service account key must be provided. The service account key can be found on the Dashboard in the Cloud Platform Console. Make sure the account key has permission to access BigQuery and Google Cloud Storage. The service account key file needs to be available on every node in your cluster and must be readable by all users running the job.
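Outside of Dataproc, the key file referenced by the Service Account File Path property can also be used to build Google Cloud clients directly, which is a quick way to confirm the key works before running a pipeline. A minimal sketch, assuming a hypothetical key path and project ID:

    # Hypothetical key path and project ID; the key file must be readable on every node.
    from google.oauth2 import service_account
    from google.cloud import bigquery, storage

    credentials = service_account.Credentials.from_service_account_file(
        "/etc/secrets/bq-reader-key.json"
    )
    bq_client = bigquery.Client(project="my-project", credentials=credentials)
    gcs_client = storage.Client(project="my-project", credentials=credentials)

    # List datasets as a simple check that the key has BigQuery access.
    print([dataset.dataset_id for dataset in bq_client.list_datasets()])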

Configuration

Properties

Each property below notes whether it is macro-enabled and, where applicable, the CDAP/plugin version in which it was introduced.

Use Connection (Macro Enabled: No; introduced in 6.5.0/0.18.0)

Optional. Whether to use an existing connection. If you use a connection, connection-related properties do not appear in the plugin properties.

Connection (Macro Enabled: Yes; introduced in 6.5.0/0.18.0)

Optional. Name of the connection to use. Project and service account information will be provided by the connection. You can also use the macro function ${conn(connection_name)}.

Project ID (Macro Enabled: Yes)

Optional. Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. This is the project that the BigQuery job will run in. The BigQuery Job User role on this project must be granted to the specified service account to run the job. If a temporary bucket needs to be created, the bucket will also be created in this project, and the Storage Admin role on this project must be granted to the specified service account to create buckets.

Default is auto-detect.

Dataset Project ID (Macro Enabled: Yes)

Optional. Project the dataset belongs to. This is only required if the dataset is not in the same project that the BigQuery job will run in. If no value is given, it will default to the configured Project ID. BigQuery Data Viewer role on this project must be granted to the specified service account to read BigQuery data from this project.

Service Account Type (Macro Enabled: Yes; introduced in 6.3.0/0.16.0)

Optional. Select one of the following options:

  • File Path. File path where the service account is located.

  • JSON. JSON content of the service account.

Service Account File Path (Macro Enabled: Yes)

Optional. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster.

Default is auto-detect.

Service Account JSON (Macro Enabled: Yes; introduced in 6.3.0/0.16.0)

Optional. JSON content of the service account.

Reference Name (Macro Enabled: No)

Optional. Name used to uniquely identify this source for lineage, annotating metadata, etc.

Dataset (Macro Enabled: Yes)

Required. Dataset the table belongs to. A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to tables and views.

Table (Macro Enabled: Yes)

Required. Table to read from. A table contains individual records organized in rows. Each record is composed of columns (also called fields). Every table is defined by a schema that describes the column names, data types, and other information.

Partition Start Date (Macro Enabled: Yes)

Optional. Inclusive partition start date, specified as 'yyyy-MM-dd'. For example, '2019-01-01'. If no value is given, all partitions up to the partition end date will be read.

Partition End Date (Macro Enabled: Yes)

Optional. Exclusive partition end date, specified as 'yyyy-MM-dd'. For example, '2019-01-01'. If no value is given, all partitions from the partition start date onward will be read.

Filter (Macro Enabled: Yes)

Optional. Filters out rows that do not match the given condition. For example, if the filter is 'age > 50 and name is not null', all output rows will have an 'age' over 50 and a value for the 'name' field. This is the same as the WHERE clause in BigQuery (see the sketch after this property list for a way to preview a filter). More information can be found at https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#where_clause

Temporary Bucket Name (Macro Enabled: Yes)

Optional. Google Cloud Storage bucket to store temporary data in. It will be automatically created if it does not exist, but will not be automatically deleted. Temporary data will be deleted after it has been read. If it is not provided, a unique bucket will be created and then deleted after the run finishes.

Encryption Key Name (Macro Enabled: Yes; introduced in 6.5.1/0.18.1)

Optional. Used to encrypt data written to any bucket created by the plugin. If the bucket already exists, this value is ignored. More information can be found in the Cloud Storage documentation on customer-managed encryption keys (https://cloud.google.com/storage/docs/encryption/customer-managed-keys).

Enable Querying Views (Macro Enabled: Yes)

Optional. Whether to allow querying views. Since BigQuery views are not materialized by default, querying them may have performance overhead.

Default is No.

Temporary Table Creation Project (Macro Enabled: Yes)

Optional. The project name where the temporary table should be created. Defaults to the same project in which the table is located.

Temporary Table Creation Dataset (Macro Enabled: Yes)

Optional. The dataset in the specified project where the temporary table should be created. Defaults to the same dataset in which the table is located.

Output Schema (Macro Enabled: Yes)

Required. Schema of the table to read. This can be fetched by clicking the Get Schema button.
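The Filter property above is applied as a standard SQL WHERE clause when the table is exported. One way to preview what a filter will keep before running the pipeline is to run the same condition directly against BigQuery. A minimal sketch with the google-cloud-bigquery client, using the example condition from the Filter description and placeholder project, dataset, and table names:

    # Preview the rows a filter such as "age > 50 and name is not null" would keep.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project
    query = """
        SELECT *
        FROM `my-project.my_dataset.my_table`
        WHERE age > 50 AND name IS NOT NULL
    """
    for row in client.query(query).result(max_results=10):
        print(dict(row))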

Data Type Mapping

The following table lists the BigQuery data types and the corresponding CDAP data type for each.

Note: Support for the datetime data type was introduced in CDAP 6.4.0. For BigQuery batch source plugins, datetime fields can only be mapped to CDAP datetime or CDAP string. 

BigQuery type | CDAP type

bool | boolean
bytes | bytes
date | date
datetime | datetime, string
float64 | double
geo | unsupported
int64 | long
numeric | decimal
bignumeric | decimal
repeated | array
string | string, datetime (ISO 8601 format)
struct | record
time | time (microseconds)
timestamp | timestamp (microseconds)
json | unsupported
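As an illustration of these mappings, the Output Schema produced for a BigQuery table is a JSON record whose field types follow the table above. The sketch below prints a hypothetical schema for a table with int64, numeric, timestamp, and nullable string columns; the record and field names are placeholders rather than values the plugin mandates.

    import json

    # Hypothetical CDAP-style output schema showing how BigQuery types map to CDAP types.
    output_schema = {
        "type": "record",
        "name": "etlSchemaBody",  # commonly used CDAP record name; illustrative only
        "fields": [
            {"name": "id", "type": "long"},  # BigQuery int64 -> CDAP long
            {"name": "price",  # BigQuery numeric -> CDAP decimal (precision 38, scale 9)
             "type": {"type": "bytes", "logicalType": "decimal", "precision": 38, "scale": 9}},
            {"name": "created",  # BigQuery timestamp -> CDAP timestamp (microseconds)
             "type": {"type": "long", "logicalType": "timestamp-micros"}},
            {"name": "name", "type": ["string", "null"]},  # nullable BigQuery string -> CDAP string
        ],
    }
    print(json.dumps(output_schema, indent=2))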

Required roles and permissions

To get the permissions that you need to read data from BigQuery datasets and tables, ask your administrator to grant you the BigQuery Data Editor, BigQuery Job User, and Storage Admin IAM roles on the project where Dataproc clusters are launched. For more information about granting roles, see Manage access.

These predefined roles contain the permissions required to read data from BigQuery datasets and tables.

The exact permissions are as follows:

BigQuery permissions:

  • bigquery.datasets.get: Allows reading datasets from BigQuery.

  • bigquery.tables.export: Enables exporting tables from BigQuery.

  • bigquery.tables.get: Permits reading tables from BigQuery.

  • bigquery.jobs.create: Enables creating jobs in BigQuery.

Cloud Storage permissions:

  • storage.buckets.create: Allows creating buckets in Cloud Storage. Only needed if the staging bucket is not specified, or the specified bucket doesn’t exist.

  • storage.buckets.delete: Permits deleting buckets in Cloud Storage. Only needed if the staging bucket is not specified.

  • storage.buckets.get: Allows retrieving information about buckets.

  • storage.objects.create: Enables creating objects in Cloud Storage.

  • storage.objects.delete: Permits deleting objects in Cloud Storage.

  • storage.objects.get: Allows getting information about objects.

  • storage.objects.list: Permits listing objects in a bucket.

You might be able to get these permissions with custom roles or other predefined roles.
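One way to verify ahead of time that a given identity holds these permissions on the project is the Cloud Resource Manager testIamPermissions method. The sketch below uses the google-api-python-client library and a placeholder project ID; run it with the plugin's service account credentials, and note that permissions granted only on an individual bucket or dataset (rather than on the project) will not appear in this project-level check.

    # Report which of the required permissions the current credentials hold on the project.
    from googleapiclient import discovery

    project_id = "my-project"  # placeholder
    required = [
        "bigquery.datasets.get", "bigquery.tables.export", "bigquery.tables.get",
        "bigquery.jobs.create",
        "storage.buckets.create", "storage.buckets.delete", "storage.buckets.get",
        "storage.objects.create", "storage.objects.delete", "storage.objects.get",
        "storage.objects.list",
    ]

    crm = discovery.build("cloudresourcemanager", "v1")
    granted = set(
        crm.projects()
        .testIamPermissions(resource=project_id, body={"permissions": required})
        .execute()
        .get("permissions", [])
    )
    for permission in required:
        print(("OK      " if permission in granted else "MISSING ") + permission)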

Created in 2020 by Google Inc.