...
If the plugin is not run on a Dataproc cluster, the path to a service account key must be provided. The service account key can be found on the Dashboard in the Cloud Platform Console. Make sure the service account has permission to access BigQuery and Google Cloud Storage. The service account key file needs to be available on every node in your cluster and must be readable by all users running the job.
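Since a missing or unreadable key file on even one node will fail the job, it can help to sanity-check the file before distributing it. The following is a minimal sketch (not part of the plugin) that verifies a key file exists, is readable, and contains the standard fields of a Google service account key JSON; the function name is illustrative.

```python
import json
import os

def check_service_account_key(path):
    """Verify a service account key file is present, readable, and well-formed.

    Checks only the standard top-level fields of a Google service account
    key JSON ("type", "client_email", "private_key") and returns the
    service account's email address on success.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"service account key not found: {path}")
    if not os.access(path, os.R_OK):
        raise PermissionError(f"service account key not readable: {path}")
    with open(path) as f:
        key = json.load(f)
    missing = {"type", "client_email", "private_key"} - key.keys()
    if missing:
        raise ValueError(f"key file is missing fields: {sorted(missing)}")
    return key["client_email"]
```

Running a check like this on each node (for example, as part of cluster setup) surfaces permission problems before the BigQuery job starts.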
Configuration
Properties | Macro Enabled? | Version introduced | Description |
---|---|---|---|
Use Connection | No | 6.5.0/0.18.0 | Optional. Whether to use an existing connection. If you use a connection, connection-related properties do not appear in the plugin properties. |
Connection | Yes | 6.5.0/0.18.0 | Optional. Name of the connection to use. Project and service account information are provided by the connection. You can also provide the connection name through a macro. |
Project ID | Yes | | Optional. Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. This is the project that the BigQuery job runs in. Default is auto-detect. |
Dataset Project ID | Yes | | Optional. Project the dataset belongs to. This is only required if the dataset is not in the same project that the BigQuery job runs in. If no value is given, it defaults to the configured Project ID. |
Service Account Type | Yes | 6.3.0/0.16.0 | Optional. Select one of the following options: File Path or JSON. |
Service Account File Path | Yes | | Optional. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster. Default is auto-detect. |
Service Account JSON | Yes | 6.3.0/0.16.0 | Optional. Content of the service account key JSON. |
Reference Name | No | | Optional. Name used to uniquely identify this source for lineage, annotating metadata, etc. |
Dataset | Yes | | Required. Dataset the table belongs to. A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to tables and views. |
Table | Yes | | Required. Table to read from. A table contains individual records organized in rows. Each record is composed of columns (also called fields). Every table is defined by a schema that describes the column names, data types, and other information. |
Partition Start Date | Yes | | Optional. Inclusive partition start date, specified as 'yyyy-MM-dd'. For example, '2019-01-01'. If no value is given, all partitions up to the partition end date will be read. |
Partition End Date | Yes | | Optional. Exclusive partition end date, specified as 'yyyy-MM-dd'. For example, '2019-01-01'. If no value is given, all partitions from the partition start date onward will be read. |
Filter | Yes | | Optional. Filters out rows that do not match the given condition. For example, if the filter is 'age > 50 and name is not null', all output rows will have an 'age' over 50 and a value for the 'name' field. This is equivalent to the WHERE clause in BigQuery. More information can be found at https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#where_clause |
Temporary Bucket Name | Yes | | Optional. Google Cloud Storage bucket to store temporary data in. It will be created automatically if it does not exist, but it will not be deleted automatically. Temporary data is deleted after it has been read. If no bucket is provided, a unique bucket will be created and then deleted after the run finishes. |
Encryption Key Name | Yes | 6.5.1/0.18.1 | Optional. Used to encrypt data written to any bucket created by the plugin. If the bucket already exists, this value is ignored. More information can be found in the Google Cloud documentation on customer-managed encryption keys (CMEK). |
Enable Querying Views | Yes | | Optional. Whether to allow querying views. Because BigQuery views are not materialized by default, querying them may have a performance overhead. Default is No. |
Temporary Table Creation Project | Yes | | Optional. The project in which the temporary table should be created. Defaults to the project in which the source table is located. |
Temporary Table Creation Dataset | Yes | | Optional. The dataset in the specified project in which the temporary table should be created. Defaults to the dataset in which the source table is located. |
Output Schema | Yes | | Required. Schema of the table to read. This can be fetched by clicking the Get Schema button. |
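To make the rules in the table concrete, the following sketch shows a hypothetical set of property values for a source stage and a small validation routine. The JSON-style key names (e.g. `partitionFrom`, `partitionTo`) are illustrative and may not match the plugin's internal property names; only the required/optional split and the 'yyyy-MM-dd' date format come from the table above.

```python
from datetime import datetime

# Illustrative (hypothetical) property values for a BigQuery source stage,
# mirroring the configuration table above.
properties = {
    "referenceName": "bq_source",
    "project": "auto-detect",
    "dataset": "my_dataset",          # required
    "table": "my_table",              # required
    "partitionFrom": "2019-01-01",    # inclusive, 'yyyy-MM-dd'
    "partitionTo": "2019-02-01",      # exclusive, 'yyyy-MM-dd'
    "filter": "age > 50 and name is not null",
}

def validate(props):
    """Check required properties are set and partition dates are well-formed."""
    for key in ("dataset", "table"):
        if not props.get(key):
            raise ValueError(f"missing required property: {key}")
    # Partition dates, if given, must be in 'yyyy-MM-dd' format.
    for key in ("partitionFrom", "partitionTo"):
        if key in props:
            datetime.strptime(props[key], "%Y-%m-%d")

validate(properties)
```

With the values above, all partitions from 2019-01-01 (inclusive) up to 2019-02-01 (exclusive) are read, and rows failing the filter are dropped before they reach the pipeline.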
Data Type Mapping
The following table lists the BigQuery data types and the corresponding CDAP data type for each.
Note: Support for the datetime data type was introduced in CDAP 6.4.0. For BigQuery batch source plugins, datetime fields can only be mapped to CDAP datetime or CDAP string.
BigQuery type | CDAP type |
---|---|
bool | boolean |
bytes | bytes |
date | date |
datetime | datetime, string |
float64 | double |
geography | unsupported |
int64 | long |
numeric | decimal |
bignumeric | decimal |
repeated | array |
string | string, datetime (ISO 8601 format) |
struct | record |
time | time (microseconds) |
timestamp | timestamp (microseconds) |
json | unsupported |
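The mapping table above can be expressed as a simple lookup. This sketch (not part of the plugin) returns the default CDAP type for a BigQuery type; where the table allows more than one CDAP type, the first listed entry is used, and unsupported types return `None`.

```python
# BigQuery type -> possible CDAP types, per the mapping table above.
BQ_TO_CDAP = {
    "bool": ["boolean"],
    "bytes": ["bytes"],
    "date": ["date"],
    "datetime": ["datetime", "string"],   # requires CDAP 6.4.0+
    "float64": ["double"],
    "int64": ["long"],
    "numeric": ["decimal"],
    "bignumeric": ["decimal"],
    "repeated": ["array"],
    "string": ["string", "datetime"],     # datetime only for ISO 8601 strings
    "struct": ["record"],
    "time": ["time"],                     # microseconds
    "timestamp": ["timestamp"],           # microseconds
}

def cdap_type(bq_type):
    """Return the default CDAP type for a BigQuery type, or None if unsupported."""
    types = BQ_TO_CDAP.get(bq_type.lower())
    return types[0] if types else None
```

For example, a `DATETIME` column maps to CDAP `datetime` by default, while `geography` and `json` columns have no supported mapping.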
Required roles and permissions
To get the permissions that you need to read data from BigQuery datasets and tables, ask your administrator to grant you the BigQuery Data Editor, BigQuery Job User, and Storage Admin IAM roles on the project where Dataproc clusters are launched. For more information about granting roles, see Manage access.
These predefined roles contain the permissions required to read data from BigQuery datasets and tables.
To see the exact permissions, see the following:
Permission | Description | Comments |
---|---|---|
BigQuery | | |
bigquery.datasets.get | Allows reading datasets from BigQuery. | |
bigquery.tables.export | Enables exporting tables from BigQuery. | |
bigquery.tables.get | Permits reading tables from BigQuery. | |
bigquery.jobs.create | Enables creating jobs in BigQuery. | |
Cloud Storage | | |
storage.buckets.create | Allows creating buckets in Cloud Storage. | Only needed if the staging bucket is not specified, or the specified bucket doesn't exist. |
storage.buckets.delete | Permits deleting buckets in Cloud Storage. | Only needed if the staging bucket is not specified. |
storage.buckets.get | Allows retrieving information about buckets. | |
storage.objects.create | Enables creating objects in Cloud Storage. | |
storage.objects.delete | Permits deleting objects in Cloud Storage. | |
storage.objects.get | Allows getting information about objects. | |
storage.objects.list | Permits listing objects in a bucket. | |
You might be able to get these permissions with custom roles or other predefined roles.
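An audit script can check whether a principal's granted roles cover the three predefined roles named above. The following is a minimal sketch under the assumption that you already have the principal's role bindings as a list of role IDs; the role IDs used are the standard IAM identifiers for those roles.

```python
# Standard IAM role IDs for the three predefined roles named above.
REQUIRED_ROLES = {
    "roles/bigquery.dataEditor",   # BigQuery Data Editor
    "roles/bigquery.jobUser",      # BigQuery Job User
    "roles/storage.admin",         # Storage Admin
}

def missing_roles(granted_roles):
    """Return the required roles not present in the granted set, sorted."""
    return sorted(REQUIRED_ROLES - set(granted_roles))
```

Note that this only checks the predefined roles; as the paragraph above says, a custom role carrying the equivalent permissions would also work, and a check like this would not detect that.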