Replicating data from SQL Server to BigQuery

This tutorial shows you how to create and deploy a job that continuously replicates changed data from a Microsoft SQL Server database to a BigQuery table.

Objectives

In this tutorial, you:

Enable Change Data Capture (CDC) in your SQL Server database.
Create and run a CDAP Replication job.
View the results in BigQuery.

Costs

This tutorial uses the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator.

When Replication runs, you're charged for the Dataproc cluster and you incur processing costs for BigQuery. To optimize these costs, we strongly recommend that you use BigQuery flat rate pricing.

Before you begin

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
Enable the BigQuery and Cloud Storage APIs.
Create a SQL Server instance.
Download the AdventureWorks2017 database and load the data into your SQL Server instance.
If your SQL Server is always on, see the Troubleshooting section below for additional steps.

Enable CDC in your SQL Server database

Enable Change Data Capture (CDC) in your SQL Server instance.

Note: Contact your database administrator to grant permission to run the commands.

If you defined custom data types in the table columns being replicated, the table's owner must grant EXECUTE permissions for the custom data types to the database user who is specified in the Replication job.

Create and run a Cloud CDAP Replication job

Upload the JDBC driver

Download the SQL Server JDBC driver to your local machine.
In the CDAP UI, upload the JDBC driver. Use these values to configure the JDBC driver:
- In the Name field, enter sqlserver.
- In the Version field, keep the default.
- In the Class Name field, enter com.microsoft.sqlserver.jdbc.SQLServerDriver.

Create the job

In the CDAP UI, click menu Menu > Replication.
On the Create new replication job page, specify a Replication job Name.
Click Next.
Configure the source:
1. Select Microsoft SQL Server as the source.
2. For Host, enter the hostname of the SQL Server to read from.
3. For Port, enter the port to use to connect to the SQL Server: 1433.
4. For JDBC Plugin Name, select sqlserver (or the name you specified when you configured the JDBC driver).
5. For Database name, enter AdventureWorks2017.
6. In the Credentials section, enter your username and password to access the SQL Server.
Click Next.
If the connection is successful, a list of AdventureWorks2017 tables is displayed. For this tutorial, select all of the tables and events, such as Insert, Update, and Delete events.
Click Next.
Configure the target:
1. Select the BigQuery target.
2. The Project ID and Service Account Key are automatically detected. Keep the default values.
3. Optional: In the Advanced section, you can configure the Cloud Storage bucket's name and location, the load interval, the staging table prefix, and the behavior when tables or databases are dropped.
Click Next.
Optional: Configure the advanced properties. For this tutorial, you can accept the default settings.
Click Next.
On the Review assessment page, click View mappings by any of the tables to get an assessment of schema issues, missing features, or connectivity issues that might occur during replication. If any issues occur, they must be resolved before you can proceed. For this tutorial, if any of the tables have issues, go back to the step where you selected tables and select a table or event without issues.
Click Back.
Click Next.
Review the summary Replication job details, and then click Deploy Replication job.

Start the job

From the Replication job details page:

Click Start.

The Replication job transitions from Provisioning to Starting to Running state. In the running state, the Replication job loads an initial snapshot of the table data that you selected (for example, the People table) into BigQuery. In this state, the state of the People table is listed as Snapshotting. After the initial snapshot is loaded into BigQuery, any changes made to the People table are replicated to BigQuery, and the state of the table is listed as Replicating.

Monitor the job

You can start and stop the Replication job, review its configuration and logs, and monitor your Replication job.

You can monitor Replication job activities from the Replication job details page.

From the Replication page, click the desired Replication job Name.
Click Monitoring.

View the results in BigQuery

The Replication job creates a replicated dataset and table in BigQuery, with names inherited from the corresponding SQL Server database and table names.

If your source database name does not comply with the BigQuery dataset naming conventions, or if you have naming conflicts when you replicate multiple datasets into a single BigQuery dataset, specify aliases in the BigQuery target properties.

Open BigQuery in the Google Cloud console.
In the left panel, click the project name to expand a list of datasets.
Select the adventureworks2017 dataset, and then select a table to view.

For more information, see the BigQuery documentation.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

After you've finished the tutorial, delete the resources you created on Google Cloud.

Troubleshooting

Replication for SQL Server does not replicate all columns for changed tables

The following issue occurs in Replication jobs that are replicating data from a table in SQL Server. If your replication source table has a newly added column, it is not automatically added to the CDC table. You must manually add it to the underlying CDC table.

Recommendation

To resolve this issue:

Disable the CDC instance:
EXEC sp_cdc_disable_table @source_schema = N'dbo', @source_name = N'myTable', @capture_instance = 'dbo_myTable' GO
Enable CDC instance again:
EXEC sp_cdc_enable_table @source_schema = N'dbo', @source_name = N'myTable', @role_name = NULL, @capture_instance = 'dbo_myTable' GO
Create a new Replication job.

For more information, see Handling changes to source tables.

Replication and SQL Server Always On databases

A Microsoft SQL Server source can capture changes from an Always On read-only replica. For this setup, the runtime argument source.connector.database.applicationIntent=ReadOnly must be passed to the Replication job. Without this runtime argument, the job fails with following error:

Producer failure java.lang.RuntimeException: com.microsoft.sqlserver.jdbc.SQLServerException: Failed to update database "DATABASE_NAME" because the database is read-only.

Recommendation

To resolve the error, set source.connector.database.applicationIntent=ReadOnly as a runtime argument. This internally sets the snapshot.isolation.mode to snapshot.

CDAP Documentation

Replicating data from SQL Server to BigQuery

Analytics

Objectives

Costs

Before you begin

Enable CDC in your SQL Server database

Create and run a Cloud CDAP Replication job

Upload the JDBC driver

Create the job

Start the job

Monitor the job

View the results in BigQuery

Clean up

Troubleshooting

Replication for SQL Server does not replicate all columns for changed tables

Recommendation

Replication and SQL Server Always On databases

Recommendation

Related content