Plugin version: 0.22.0

Writes to a BigQuery table. BigQuery is Google's serverless, highly scalable, enterprise data warehouse. Data is first written to a temporary location on Google Cloud Storage, and then loaded into BigQuery from there.
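
The two-step flow can be sketched with the Google Cloud client libraries. This is only an illustration of the pattern, not the plugin's implementation; the project, bucket, dataset, table, and file names are placeholders.

Code Block
# Sketch of the staged-load pattern the sink follows: write to GCS, then load into BigQuery.
# Bucket, dataset, table, and file names below are placeholders, not plugin defaults.
from google.cloud import bigquery, storage

storage_client = storage.Client(project="my-project")
bq_client = bigquery.Client(project="my-project")

# Step 1: stage the records as a file in a temporary GCS location.
bucket = storage_client.bucket("my-temp-bucket")
blob = bucket.blob("staging/records.avro")
blob.upload_from_filename("records.avro")

# Step 2: load the staged file into the destination BigQuery table.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
load_job = bq_client.load_table_from_uri(
    "gs://my-temp-bucket/staging/records.avro",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish

# Step 3: clean up the temporary object (the plugin deletes its staging data after the load).
blob.delete()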

...

If the plugin is not run on a Dataproc cluster, the path to a service account key must be provided. The service account key can be found on the Dashboard in the Cloud Platform Console. Make sure the account key has permission to access BigQuery and Google Cloud Storage. The service account key file needs to be available on every node in your cluster and must be readable by all users running the job.
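
For illustration, this is roughly what a client does with such a key file using the Google auth library; the key path and project are placeholders, and on Dataproc the 'auto-detect' setting relies on the VM's default credentials instead.

Code Block
# Illustration only: building BigQuery and GCS clients from a service account key file.
# The key path and project are placeholders.
from google.oauth2 import service_account
from google.cloud import bigquery, storage

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json"
)
bq_client = bigquery.Client(credentials=credentials, project="my-project")
gcs_client = storage.Client(credentials=credentials, project="my-project")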

Configuration

Property

Macro Enabled?

Version Introduced

Description

Reference Name

No

Required. Name used to uniquely identify this sink for lineage, annotating metadata, etc.

Use Connection

No

6.7.0/0.20.0

Optional. Whether to use a connection. If a connection is used, you do not need to provide the credentials.

Connection

Yes

6.7.0/0.20.0

Optional. Name of the connection to use. Project and service account information will be provided by the connection. You can also use the macro function ${conn(connection_name)}.

Project ID

Yes

Optional. Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. This is the project that the BigQuery job will run in. The BigQuery Job User role on this project must be granted to the specified service account to run the job. If a temporary bucket needs to be created, the bucket will also be created in this project, and the GCE Storage Bucket Admin role on this project must be granted to the specified service account to create buckets.

Default is auto-detect.

Dataset Project ID

Yes

Optional. Project the dataset belongs to. This is only required if the dataset is not in the same project that the BigQuery job will run in. If no value is given, it will default to the configured Project ID. The BigQuery Data Editor role on this project must be granted to the specified service account to write BigQuery data to this project.

Service Account Type

Yes

6.3.0 / 0.16.0

Optional. Select one of the following options:

  • File Path. File path where the service account is located.

  • JSON. JSON content of the service account.

Service Account File Path

Yes

Required. Path on the local file system of the service account key used for authorization. Can be set to 'auto-detect' when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster.

Default is auto-detect.

Service Account JSON

Yes

6.3.0 / 0.16.0

Optional. Content of the service account.

Dataset

Yes

Required. Dataset the table belongs to. A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to tables and views.

Table

Yes

Required. Table to write to. A table contains individual records organized in rows. Each record is composed of columns (also called fields). Every table is defined by a schema that describes the column names, data types, and other information.

Temporary Bucket Name

Yes

Optional. Google Cloud Storage bucket to store temporary data in. It will be automatically created if it does not exist, but will not be automatically deleted. Temporary data will be deleted after it is loaded into BigQuery. If it is not provided, a unique bucket will be created and then deleted after the run finishes.

Syntax: gs://bucketname

GCS Upload Request Chunk Size

Yes

Optional. GCS upload request chunk size in bytes.

Default value is 8388608 bytes.
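
For reference, 8388608 bytes is 8 MiB. In the google-cloud-storage client library, the comparable setting is the blob's chunk_size, shown in this hedged sketch with placeholder names:

Code Block
# Sketch: controlling the resumable-upload chunk size with the GCS client library.
# 8388608 bytes = 8 MiB, the plugin's default. Bucket and object names are placeholders.
from google.cloud import storage

client = storage.Client(project="my-project")
blob = client.bucket("my-temp-bucket").blob("staging/records.avro")
blob.chunk_size = 8 * 1024 * 1024  # must be a multiple of 256 KiB
blob.upload_from_filename("records.avro")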

    Operation

    Yes

    Optional. Type of write operation to perform. This can be set to Insert, Update or Upsert.

    • Insert. All records will be inserted into the destination table.

    • Update. Records that match on Table Key will be updated in the table. Records that do not match will be dropped.

    • Upsert. Records that match on Table Key will be updated. Records that do not match will be inserted.

    Default is Insert.

    Table Key

    Yes

    Optional. List of fields that determines relation between tables during Update and Upsert operations.

    Dedupe By

    Yes

    Optional. Column names and sort order used to choose which input record to update/upsert when there are multiple input records with the same key. For example, if this is set to 'updated_time desc', then if there are multiple input records with the same key, the one with the largest value for 'updated_time' will be applied.

    Partition Filter

    Yes

    Optional. Partition filter that can be used for partition elimination during Update or Upsert operations. Should only be used with Update or Upsert operations on tables where Require Partition Filter is enabled. For example, if the table is partitioned and the Partition Filter is '_PARTITIONTIME > "2020-01-01" and _PARTITIONTIME < "2020-03-01"', the update operation will be performed only on the partitions meeting the criteria.
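
    Conceptually, an Upsert with a Table Key of 'id', a Dedupe By of 'updated_time desc', and the partition filter above behaves like the following BigQuery MERGE. This is only an illustration; the plugin generates its own statement, and the table and column names here are placeholders.

    Code Block
    # Illustration of what an Upsert with Table Key "id", Dedupe By "updated_time desc",
    # and a partition filter is conceptually equivalent to. Table and column names are
    # placeholders; the plugin generates and runs its own statement internally.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    merge_sql = """
    MERGE `my-project.my_dataset.target_table` AS T
    USING (
      SELECT * EXCEPT (rn)
      FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_time DESC) AS rn
        FROM `my-project.my_dataset.staging_table`
      )
      WHERE rn = 1
    ) AS S
    ON T.id = S.id
       AND T._PARTITIONTIME > "2020-01-01" AND T._PARTITIONTIME < "2020-03-01"
    WHEN MATCHED THEN
      UPDATE SET name = S.name, updated_time = S.updated_time
    WHEN NOT MATCHED THEN
      INSERT (id, name, updated_time) VALUES (S.id, S.name, S.updated_time)
    """
    client.query(merge_sql).result()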

    Truncate Table

    Yes

    Optional. Whether or not to truncate the table before writing to it. Should only be used with the Insert operation.

    Note: If you set both Truncate Table and Update Table Schema to True, when you run the pipeline, only Truncate Table will be applied. Update Table Schema will be ignored.

    Default is False.
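
    In BigQuery load-job terms, truncating before an Insert corresponds to the WRITE_TRUNCATE write disposition. A minimal sketch with the client library, using placeholder names:

    Code Block
    # Sketch: a load job that truncates the destination table before writing,
    # roughly what Truncate Table = True means for an Insert operation.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace existing rows
    )
    client.load_table_from_uri(
        "gs://my-temp-bucket/staging/records.avro",
        "my-project.my_dataset.my_table",
        job_config=job_config,
    ).result()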

    Update Table Schema

    Yes

    Optional. Whether the BigQuery table schema should be modified when it does not match the schema expected by the pipeline.

    • When this is set to false, any mismatches between the schema expected by the pipeline and the schema in BigQuery will result in pipeline failure.

    • When this is set to true, the schema in BigQuery will be updated to match the schema expected by the pipeline, assuming the schemas are compatible.

    Compatible changes fall under the following categories:

    • The pipeline schema contains nullable fields that do not exist in the BigQuery schema. In this case, the new fields will be added to the BigQuery schema.

    • The pipeline schema contains nullable fields that are non-nullable in the BigQuery schema. In this case, the fields will be modified to become nullable in the BigQuery schema.

    • The pipeline schema does not contain fields that exist in the BigQuery schema. In this case, those fields in the BigQuery schema will be modified to become nullable.

    Incompatible schema changes will result in pipeline failure.

    Note: If you set both Truncate Table and Update Table Schema to True, when you run the pipeline, only Truncate Table will be applied. Update Table Schema will be ignored.

    Default is False.
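
    The compatible changes listed above are similar in spirit to BigQuery's schema update options for load jobs. A hedged sketch with the client library (not the plugin's code), using placeholder names:

    Code Block
    # Sketch: allowing the destination schema to gain new nullable fields and to have
    # required fields relaxed to nullable, similar in spirit to Update Table Schema = True.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        schema_update_options=[
            bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,    # add missing nullable fields
            bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,  # REQUIRED -> NULLABLE
        ],
    )
    client.load_table_from_uri(
        "gs://my-temp-bucket/staging/records.avro",
        "my-project.my_dataset.my_table",
        job_config=job_config,
    ).result()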

    Location

    Yes

    Optional. The location where the BigQuery dataset will get created. This value is ignored if the dataset or temporary bucket already exist.

    Default is US.

    Encryption Key Name

    Yes

    6.5.1/0.18.1

    Optional. The GCP customer managed encryption key (CMEK) used to encrypt data written to any bucket, dataset, or table created by the plugin. More information can be found here.
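
    As a hedged illustration, this is how a customer-managed key can be attached to a new dataset and to a load job's destination table with the client library; the key resource name and other names are placeholders.

    Code Block
    # Sketch: using a customer-managed encryption key (CMEK) for a new dataset and for
    # a load job's destination table. The key resource name is a placeholder.
    from google.cloud import bigquery

    kms_key = "projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key"
    client = bigquery.Client(project="my-project")

    # New dataset with a default CMEK.
    dataset = bigquery.Dataset("my-project.my_dataset")
    dataset.default_encryption_configuration = bigquery.EncryptionConfiguration(
        kms_key_name=kms_key
    )
    client.create_dataset(dataset, exists_ok=True)

    # Load job whose destination table is encrypted with the same key.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        destination_encryption_configuration=bigquery.EncryptionConfiguration(
            kms_key_name=kms_key
        ),
    )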

    Create Partitioned Table

    Yes

    [DEPRECATED] Optional. Whether to create the BigQuery table with time partitioning. This value is ignored if the table already exists.

    • When this is set to true, table will be created with time partitioning.

    • When this is set to false, table will be created without time partitioning.

    • [DEPRECATED] Use the Partitioning Type property instead.

    Default is False.

    Partitioning Type

    Yes

    6.2.3 / 0.15.3
    6.3.0 / 0.16.0

    Optional. Specifies the partitioning type. Can either be Time, Integer, or None. Defaults to Time. This value is ignored if the table already exists.

    • When this is set to Time, table will be created with time partitioning.

    • When this is set to Integer, table will be created with integer-range partitioning.

    • When this is set to None, table will be created without partitioning.

    Range Start

    Yes

    6.2.3 / 0.15.3
    6.3.0 / 0.16.0

    Optional. For integer partitioning, specifies the start of the range. Only used when table doesn’t exist already, and partitioning type is set to Integer.

    • The start value is inclusive.

    Range End

    Yes

    6.2.3 / 0.15.3
    6.3.0 / 0.16.0

    Optional. For integer partitioning, specifies the end of the range. Only used when table doesn’t exist already, and partitioning type is set to Integer.

    • The end value is exclusive.

    Range Interval

    Yes

    6.2.3 / 0.15.3
    6.3.0 / 0.16.0

    Optional. For integer partitioning, specifies the partition interval. Only used when table doesn’t exist already, and partitioning type is set to Integer.

    • The interval value must be a positive integer.
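
    A hedged sketch of what Partitioning Type = Integer together with Range Start, Range End, and Range Interval means at table-creation time; the field name and range values are placeholders.

    Code Block
    # Sketch: creating a table with integer-range partitioning, the effect of
    # Partitioning Type = Integer plus Range Start/End/Interval. Values are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    table = bigquery.Table(
        "my-project.my_dataset.my_table",
        schema=[
            bigquery.SchemaField("customer_id", "INT64"),
            bigquery.SchemaField("name", "STRING"),
        ],
    )
    table.range_partitioning = bigquery.RangePartitioning(
        field="customer_id",
        range_=bigquery.PartitionRange(start=0, end=100000, interval=1000),  # start inclusive, end exclusive
    )
    client.create_table(table)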

    Partition Field

    Yes

    Optional. Partitioning column for the BigQuery table. Leave blank if the BigQuery table is an ingestion-time partitioned table.

    Require Partition Filter

    Yes

    Optional. Whether to create a table that requires a partition filter. This value is ignored if the table already exists.

    • When this is set to true, table will be created with required partition filter.

    • When this is set to false, table will be created without required partition filter.

    Default is False.

    Clustering Order

    Yes

    Optional. List of fields that determines the sort order of the data. Fields must be of type INT, LONG, STRING, DATE, TIMESTAMP, BOOLEAN or DECIMAL. Tables cannot be clustered on more than 4 fields. This value is only used when the BigQuery table is automatically created and ignored if the table already exists.
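
    Taken together, Partition Field, Require Partition Filter, and Clustering Order shape the table that is auto-created. A hedged sketch of an equivalent table definition with the client library; field names are placeholders.

    Code Block
    # Sketch: a time-partitioned table with a required partition filter and clustering,
    # roughly what the auto-created table looks like. Field names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    table = bigquery.Table(
        "my-project.my_dataset.my_table",
        schema=[
            bigquery.SchemaField("event_date", "DATE"),
            bigquery.SchemaField("country", "STRING"),
            bigquery.SchemaField("user_id", "INT64"),
        ],
    )
    table.time_partitioning = bigquery.TimePartitioning(field="event_date")
    table.require_partition_filter = True             # queries must filter on the partition column
    table.clustering_fields = ["country", "user_id"]  # up to 4 clustering fields
    client.create_table(table)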

    Output Schema

    Yes

    Required. Schema of the data to write. If a schema is provided, it must be compatible with the table schema in BigQuery.

    Data Type Mappings from CDAP to BigQuery

    The following table lists the different CDAP data types and the corresponding BigQuery data type for each CDAP type, for updates and upserts.

    For inserts, the type conversions are the same as those used in loading Avro data to BigQuery. For more information, see Avro conversions.

    Note: Support for the datetime data type was introduced in CDAP 6.4.0.

    CDAP Schema Data Type | BigQuery Data Type
    array | repeated
    boolean | bool
    bytes | bytes
    date | date
    datetime | datetime, string
    decimal | numeric, bignumeric (Note: Support for bignumeric was added in CDAP 6.7.0.)
    double / float | float64
    enum | unsupported
    int / long | int64
    map | unsupported
    record | struct
    string | string, datetime (ISO 8601 format)
    time | time
    timestamp | timestamp
    union | unsupported

    For more information about BigQuery data types, see Standard SQL Data Types.
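
    As a small illustration of the mappings above, this is how a few of them look when expressed as BigQuery schema fields with the client library; the field names are placeholders.

    Code Block
    # Sketch: BigQuery schema fields corresponding to a few CDAP types from the table above.
    # Field names are placeholders.
    from google.cloud import bigquery

    schema = [
        bigquery.SchemaField("id", "INT64"),                      # CDAP int / long
        bigquery.SchemaField("price", "NUMERIC"),                 # CDAP decimal
        bigquery.SchemaField("active", "BOOL"),                   # CDAP boolean
        bigquery.SchemaField("tags", "STRING", mode="REPEATED"),  # CDAP array of strings
        bigquery.SchemaField(                                     # CDAP record
            "address",
            "RECORD",
            fields=[
                bigquery.SchemaField("city", "STRING"),
                bigquery.SchemaField("zip", "STRING"),
            ],
        ),
        bigquery.SchemaField("created_at", "TIMESTAMP"),          # CDAP timestamp
    ]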

    Required roles and permissions

    To get the permissions that you need to write data to BigQuery datasets and tables, ask your administrator to grant you the BigQuery Data Editor, BigQuery Job User, and Storage Admin IAM roles on the project where Dataproc clusters are launched. For more information about granting roles, see Manage access.

    These predefined roles contain the permissions required to write data to BigQuery datasets and tables. To see the exact permissions, see the following:

    Permission | Description | Comments

    BigQuery

    bigquery.datasets.get | Allows reading datasets from BigQuery and creating new ones. |
    bigquery.datasets.create | Permits creating datasets in BigQuery. | Only needed if the target dataset doesn’t exist.
    bigquery.tables.export | Enables exporting tables from BigQuery. |
    bigquery.tables.get | Allows reading tables from BigQuery. |
    bigquery.tables.create | Permits creating tables in BigQuery. | Only needed if the target table doesn’t exist.
    bigquery.tables.createIndex | Permits creating indexes in BigQuery. | Only needed if the target table doesn’t exist.
    bigquery.jobs.create | Enables creating jobs in BigQuery. |

    Cloud Storage

    storage.buckets.create | Allows creating buckets in Cloud Storage. | Only needed if the staging bucket isn’t specified, or the specified bucket doesn’t exist.
    storage.buckets.delete | Permits deleting buckets in Cloud Storage. | Only needed if the staging bucket isn’t specified.
    storage.buckets.get | Allows retrieving information about buckets. |
    storage.objects.create | Enables creating objects in Cloud Storage. |
    storage.objects.delete | Permits deleting objects in Cloud Storage. |
    storage.objects.get | Allows getting information about objects. |
    storage.objects.list | Permits listing objects in a bucket. |
    storage.objects.update | Permits updating objects in a bucket. |

    You might be able to get these permissions with custom roles or other predefined roles.

    Troubleshooting

    Missing permission to create a temporary bucket 

    If your pipeline failed with the following error in the log:

    Code Block
    com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
    POST https://storage.googleapis.com/storage/v1/b?project=projectId&projection=full
    {
    "code" : 403,
    "errors" : [ {
    "domain" : "global",
    "message" : "xxxxxxxxxxxx-compute@developer.gserviceaccount.com does not have storage.buckets.create access to the Google Cloud project.",
    "reason" : "forbidden"
    } ],
    "message" : "xxxxxxxxxxxx-compute@developer.gserviceaccount.com does not have storage.buckets.create access to the Google Cloud project."
    }

    xxxxxxxxxxxx-compute@developer.gserviceaccount.com is the service account you specified in this plugin. This means the temporary bucket you specified in this plugin doesn’t exist. CDF/CDAP is trying to create the temporary bucket, but the specified service account doesn’t have the permission. You must grant the “GCE Storage Bucket Admin” role on the project identified by the Project ID you specified in this plugin to the service account. If you think you already granted the role, check if you granted the role to the wrong project (for example, the one identified by the Dataset Project ID).

    Missing permission to run BigQuery jobs

    If your pipeline failed with the following error in the log:

    Code Block
    POST https://bigquery.googleapis.com/bigquery/v2/projects/xxxx/jobs
    {
    "code" : 403,
    "errors" : [ {
    "domain" : "global",
    "message" : "Access Denied: Project xxxx: User does not have bigquery.jobs.create permission in project xxxx",
    "reason" : "accessDenied"
    } ],
    "message" : "Access Denied: Project xxxx: User does not have bigquery.jobs.create permission in project xxxx.",
    "status" : "PERMISSION_DENIED"
    }

    xxxx is the Project ID you specified in this plugin. This means the specified service account doesn’t have the permission to run BigQuery jobs. You must grant “BigQuery Job User” role on the project identified by the Project ID you specified in this plugin to the service account. If you think you already granted the role, check if you granted the role on the wrong project (for example the one identified by the Dataset Project ID).

    Missing permission to create the BigQuery dataset

    If your pipeline failed with the following error in the log:

    Code Block
    POST https://www.googleapis.com/bigquery/v2/projects/xxxx/datasets?prettyPrint=false
    {
      "code" : 403,
      "errors" : [ {
        "domain" : "global",
        "message" : "Access Denied: Project xxxx: User does not have bigquery.datasets.create permission in project xxxx.",
        "reason" : "accessDenied"
      } ],
      "message" : "Access Denied: Project xxxx: User does not have bigquery.datasets.create permission in project xxxx.",
      "status" : "PERMISSION_DENIED"
    }

    xxxx is the Dataset Project ID you specified in this plugin. This means the dataset specified in this plugin doesn’t exist. CDF/CDAP is trying to create the dataset but the service account you specified in this plugin doesn’t have the permission. You must grant “BigQuery Data Editor” role on the project identified by the Dataset Project ID you specified in this plugin to the service account. If you think you already granted the role, check if you granted the role on the wrong project (for example the one identified by the Project ID).

    Missing permission to create the BigQuery table 

    If your pipeline failed with the following error in the log:

    Code Block
    POST https://bigquery.googleapis.com/bigquery/v2/projects/xxxx/jobs
    {
    "code" : 403,
    "errors" : [ {
    "domain" : "global",
    "message" : "Access Denied: Dataset xxxx:mysql_bq_perm: Permission bigquery.tables.create denied on dataset xxxx:mysql_bq_perm (or it may not exist).",
    "reason" : "accessDenied"
    } ],
    "message" : "Access Denied: Dataset xxxx:mysql_bq_perm: Permission bigquery.tables.create denied on dataset xxxx:mysql_bq_perm (or it may not exist).",
    "status" : "PERMISSION_DENIED"
    }

    xxxx is the Dataset Project ID you specified in this plugin. This means the table specified in this plugin doesn’t exist. CDF/CDAP is trying to create the table but the service account you specified in this plugin doesn’t have the permission. You must grant “BigQuery Data Editor” role on the project identified by the Dataset Project ID you specified in this plugin to the service account. If you think you already granted the role, check if you granted the role on the wrong project (for example the one identified by the Project ID).

    Missing permission to read the BigQuery dataset 

    If your pipeline failed with the following error in the log:

    Code Block
    com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
    GET https://www.googleapis.com/bigquery/v2/projects/xxxx/datasets/mysql_bq_perm?prettyPrint=false
    {
    "code" : 403,
    "errors" : [ {
    "domain" : "global",
    "message" : "Access Denied: Dataset xxxx:mysql_bq_perm: Permission bigquery.datasets.get denied on dataset xxxx:mysql_bq_perm (or it may not exist).",
    "reason" : "accessDenied"
    } ],
    "message" : "Access Denied: Dataset xxxx:mysql_bq_perm: Permission bigquery.datasets.get denied on dataset xxxx:mysql_bq_perm (or it may not exist).",
    "status" : "PERMISSION_DENIED"
    }

    xxxx is the Dataset Project ID you specified in this plugin. The service account you specified in this plugin doesn’t have the permission to read the dataset you specified in this plugin. You must grant “BigQuery Data Editor” role on the project identified by the Dataset Project ID you specified in this plugin to the service account. If you think you already granted the role, check if you granted the role on the wrong project (for example the one identified by the Project ID).
