...
If the plugin is not run on a Dataproc cluster, the path to a service account key must be provided. The service account key can be found on the Dashboard in the Cloud Platform Console. Make sure the service account has permission to access BigQuery and Google Cloud Storage. The service account key file needs to be available on every node in your cluster and must be readable by all users running the job.
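Since a missing or unreadable key file on even one node will fail the job, it can help to sanity-check the file before distributing it. The following is a minimal sketch (not part of the plugin) that verifies a key file exists, is readable, and contains the standard fields of a Google service account key JSON; the function name is illustrative.

```python
import json
import os

def check_service_account_key(path):
    """Verify a service account key file is present, readable, and well-formed.

    Checks only the standard top-level fields of a Google service account
    key JSON ("type", "client_email", "private_key") and returns the
    service account's email address on success.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"service account key not found: {path}")
    if not os.access(path, os.R_OK):
        raise PermissionError(f"service account key not readable: {path}")
    with open(path) as f:
        key = json.load(f)
    missing = {"type", "client_email", "private_key"} - key.keys()
    if missing:
        raise ValueError(f"key file is missing fields: {sorted(missing)}")
    return key["client_email"]
```

Running a check like this on each node (for example, as part of cluster setup) surfaces permission problems before the BigQuery job starts.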
Configuration
Properties | Macro Enabled? | Version introduced | Description |
---|---|---|---|
Use Connection | No | 6.5.0/0.18.0 | Optional. Whether to use an existing connection. If you use a connection, connection-related properties do not appear in the plugin properties. |
Connection | Yes | 6.5.0/0.18.0 | Optional. Name of the connection to use. Project and service account information are provided by the connection. You can also provide the connection name through a macro. |
Project ID | Yes | | Optional. Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. This is the project that the BigQuery job runs in. Default is auto-detect. |
Dataset Project ID | Yes | | Optional. Project the dataset belongs to. This is only required if the dataset is not in the same project that the BigQuery job runs in. If no value is given, it defaults to the configured Project ID. |
Service Account Type | Yes | 6.3.0/0.16.0 | Optional. Select one of the following options: File Path or JSON. |
Service Account File Path | Yes | | Optional. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster. Default is auto-detect. |
Service Account JSON | Yes | 6.3.0/0.16.0 | Optional. Content of the service account key JSON. |
Reference Name | No | | Optional. Name used to uniquely identify this source for lineage, annotating metadata, etc. |
Dataset | Yes | | Required. Dataset the table belongs to. A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to tables and views. |
Table | Yes | | Required. Table to read from. A table contains individual records organized in rows. Each record is composed of columns (also called fields). Every table is defined by a schema that describes the column names, data types, and other information. |
Partition Start Date | Yes | | Optional. Inclusive partition start date, specified as 'yyyy-MM-dd'. For example, '2019-01-01'. If no value is given, all partitions up to the partition end date will be read. |
Partition End Date | Yes | | Optional. Exclusive partition end date, specified as 'yyyy-MM-dd'. For example, '2019-01-01'. If no value is given, all partitions from the partition start date onward will be read. |
Filter | Yes | | Optional. Filters out rows that do not match the given condition. For example, if the filter is 'age > 50 and name is not null', all output rows will have an 'age' over 50 and a value for the 'name' field. This is equivalent to the WHERE clause in BigQuery. More information can be found at https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#where_clause |
Temporary Bucket Name | Yes | | Optional. Google Cloud Storage bucket to store temporary data in. It will be created automatically if it does not exist, but it will not be deleted automatically. Temporary data is deleted after it has been read. If no bucket is provided, a unique bucket will be created and then deleted after the run finishes. |
Encryption Key Name | Yes | 6.5.1/0.18.1 | Optional. Used to encrypt data written to any bucket created by the plugin. If the bucket already exists, this value is ignored. More information can be found in the Google Cloud documentation on customer-managed encryption keys (CMEK). |
Enable Querying Views | Yes | | Optional. Whether to allow querying views. Because BigQuery views are not materialized by default, querying them may have a performance overhead. Default is No. |
Temporary Table Creation Project | Yes | | Optional. The project in which the temporary table should be created. Defaults to the project in which the source table is located. |
Temporary Table Creation Dataset | Yes | | Optional. The dataset in the specified project in which the temporary table should be created. Defaults to the dataset in which the source table is located. |
Output Schema | Yes | | Required. Schema of the table to read. This can be fetched by clicking the Get Schema button. |
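To make the rules in the table concrete, the following sketch shows a hypothetical set of property values for a source stage and a small validation routine. The JSON-style key names (e.g. `partitionFrom`, `partitionTo`) are illustrative and may not match the plugin's internal property names; only the required/optional split and the 'yyyy-MM-dd' date format come from the table above.

```python
from datetime import datetime

# Illustrative (hypothetical) property values for a BigQuery source stage,
# mirroring the configuration table above.
properties = {
    "referenceName": "bq_source",
    "project": "auto-detect",
    "dataset": "my_dataset",          # required
    "table": "my_table",              # required
    "partitionFrom": "2019-01-01",    # inclusive, 'yyyy-MM-dd'
    "partitionTo": "2019-02-01",      # exclusive, 'yyyy-MM-dd'
    "filter": "age > 50 and name is not null",
}

def validate(props):
    """Check required properties are set and partition dates are well-formed."""
    for key in ("dataset", "table"):
        if not props.get(key):
            raise ValueError(f"missing required property: {key}")
    # Partition dates, if given, must be in 'yyyy-MM-dd' format.
    for key in ("partitionFrom", "partitionTo"):
        if key in props:
            datetime.strptime(props[key], "%Y-%m-%d")

validate(properties)
```

With the values above, all partitions from 2019-01-01 (inclusive) up to 2019-02-01 (exclusive) are read, and rows failing the filter are dropped before they reach the pipeline.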
Data Type Mapping
The following table lists the BigQuery data types and the corresponding CDAP data type for each.
Note: Support for the datetime data type was introduced in CDAP 6.4.0. For BigQuery batch source plugins, datetime fields can only be mapped to CDAP datetime or CDAP string.
BigQuery type | CDAP type |
---|---|
bool | boolean |
bytes | bytes |
date | date |
datetime | datetime, string |
float64 | double |
geography | unsupported |
int64 | long |
numeric | decimal |
bignumeric | decimal |
repeated | array |
string | string, datetime (ISO 8601 format) |
struct | record |
time | time (microseconds) |
timestamp | timestamp (microseconds) |
json | unsupported |
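The mapping table above can be expressed as a simple lookup. This sketch (not part of the plugin) returns the default CDAP type for a BigQuery type; where the table allows more than one CDAP type, the first listed entry is used, and unsupported types return `None`.

```python
# BigQuery type -> possible CDAP types, per the mapping table above.
BQ_TO_CDAP = {
    "bool": ["boolean"],
    "bytes": ["bytes"],
    "date": ["date"],
    "datetime": ["datetime", "string"],   # requires CDAP 6.4.0+
    "float64": ["double"],
    "int64": ["long"],
    "numeric": ["decimal"],
    "bignumeric": ["decimal"],
    "repeated": ["array"],
    "string": ["string", "datetime"],     # datetime only for ISO 8601 strings
    "struct": ["record"],
    "time": ["time"],                     # microseconds
    "timestamp": ["timestamp"],           # microseconds
}

def cdap_type(bq_type):
    """Return the default CDAP type for a BigQuery type, or None if unsupported."""
    types = BQ_TO_CDAP.get(bq_type.lower())
    return types[0] if types else None
```

For example, a `DATETIME` column maps to CDAP `datetime` by default, while `geography` and `json` columns have no supported mapping.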
Required roles and permissions
To get the permissions that you need to read data from BigQuery datasets and tables, ask your administrator to grant you the BigQuery Data Editor, BigQuery Job User, and Storage Admin IAM roles on the project where Dataproc clusters are launched. For more information about granting roles, see Manage access.
These predefined roles contain the permissions required to read data from BigQuery datasets and tables.
To see the exact permissions, see the following:
Permission | Description | Comments |
---|---|---|
BigQuery | | |
bigquery.datasets.get | Allows reading datasets from BigQuery. | |
bigquery.tables.export | Enables exporting tables from BigQuery. | |
bigquery.tables.get | Permits reading tables from BigQuery. | |
bigquery.jobs.create | Enables creating jobs in BigQuery. | |
Cloud Storage | | |
storage.buckets.create | Allows creating buckets in Cloud Storage. | Only needed if the staging bucket is not specified, or the specified bucket doesn't exist. |
storage.buckets.delete | Permits deleting buckets in Cloud Storage. | Only needed if the staging bucket is not specified. |
storage.buckets.get | Allows retrieving information about buckets. | |
storage.objects.create | Enables creating objects in Cloud Storage. | |
storage.objects.delete | Permits deleting objects in Cloud Storage. | |
storage.objects.get | Allows getting information about objects. | |
storage.objects.list | Permits listing objects in a bucket. | |
You might be able to get these permissions with custom roles or other predefined roles.
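An audit script can check whether a principal's granted roles cover the three predefined roles named above. The following is a minimal sketch under the assumption that you already have the principal's role bindings as a list of role IDs; the role IDs used are the standard IAM identifiers for those roles.

```python
# Standard IAM role IDs for the three predefined roles named above.
REQUIRED_ROLES = {
    "roles/bigquery.dataEditor",   # BigQuery Data Editor
    "roles/bigquery.jobUser",      # BigQuery Job User
    "roles/storage.admin",         # Storage Admin
}

def missing_roles(granted_roles):
    """Return the required roles not present in the granted set, sorted."""
    return sorted(REQUIRED_ROLES - set(granted_roles))
```

Note that this only checks the predefined roles; as the paragraph above says, a custom role carrying the equivalent permissions would also work, and a check like this would not detect that.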