Replicating Data from MySQL to BigQuery

This tutorial shows you how to create and deploy a replication job that continuously replicates changed data from a MySQL database to a BigQuery table.

Objectives

In this tutorial, you:

Install your MySQL database on Compute Engine.
Set up your MySQL database to enable replication.
Create and run a CDAP replication job.

Costs

This tutorial uses billable components of Google Cloud, including:

When Replication runs, you are charged for the Dataproc cluster and you incur processing costs for BigQuery. To optimize these costs, we strongly recommend that you use BigQuery flat rate pricing.

Use the pricing calculator to generate a cost estimate based on your projected usage.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created.

Before You Begin

In the Google Cloud Console, on the project selector page, select or create a Google Cloud project. Make note of the Project ID, which you’ll need during the replication process.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.
Go to the project selector page
Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
Enable the BigQuery and Cloud Storage APIs.
Go to the Service Accounts page and create a Service Account Key with Key Type of JSON. A JSON file for the Service Account Key will download to your local drive. You’ll need this during the replication process. For more information, see Create a Service Account Key.

Install MySQL in Compute Engine

Download a MySQL Server Docker image.
Upload your Docker image to the Container Registry.
Deploy your Docker image on a new VM instance.
On the Compute Engine Disks page, change the disk size to 500 GB, and then restart the VM.
Go to the Disks page.
Create a firewall for the VM instance.
Install the Sakila sample database.

Enable Replication in your MySQL Database

To enable replication, set up Change Data Capture (CDC) in MySQL.

You may need to contact your database administrator to obtain permissions to run the commands.

Create and Run a CDAP Replication Job

Upload the JDBC driver

Note: Use JDBC driver version 8 or above.

Download the MySQL JDBC driver to your local machine.
In the CDAP UI, go to the Control Center.
Click the green + button and upload the JDBC driver for MySQL.
Use these values to configure the JDBC driver:
- In the Name field, enter mysql.
- In the Class Name field, enter com.mysql.jdbc.Driver.
- In the Version field, keep the default.

For more information about using JDBC drivers in CDAP, see JDBC Drivers.

Create the Replication Job

In the CDAP UI, navigate to the Replication page.
Click the green + (plus) sign, and then click Create a new replication job.
On the Create new replication job page, specify a replication job Name.
Click Next.
Configure the source:
1. Select MySQL as the source.
2. For Host, enter the hostname of the MySQL server to read from.
3. For Port, enter the port to use to connect to the MySQL server: 3306.
4. For JDBC Plugin Name , select mysql (or the name you specified when you configured the JDBC driver).
5. For Database Name, enter sakila.
6. In the Credentials section, enter your username and password to access the MySQL server.
Click Next.
If the connection is successful, a list of sakila sample database tables is displayed. For this tutorial, select a few tables and events to replicate (including Insert, Update, and Delete events). You can replicate all the columns in a table or select specific columns to replicate.
Click Next.
Configure the target:
1. Select the BigQuery target.
2. Enter the Project ID.
3. In the Service Account Key field, paste the entire content of the Service Account Key JSON file.
4. (Optional) In the Advanced section, you can configure other options, such as the dataset name, the staging bucket's name and location, the load interval, the staging table prefix, and the behavior when tables or databases are dropped.
5. While creating BigQuery table, only select fields from primary keys for clustering, whose data type is supported for clustering by BigQuery.
Click Next.
Select the tables you want to replication.
(Optional) To rename a column or mask string fields, click the column name. On the Mappings, assessments, and transformations page, click Transform for the column you want to rename or mask. Click Refresh to validate the transformations.
(Optional) Configure the advanced properties. For this tutorial, you can accept the default settings.
Click Next.
The Review Assessment page lists any tables with schema issues, missing features, or connectivity issues that might occur during replication. To see if any tables have issues, click View mappings next to the table name. If any issues occur, they must be resolved before you can proceed. For this tutorial, if any of the tables have issues, go back to the step where you selected tables and instead select a table or event (Inserts, Updates, or Deletes) without issues.
Click Back.
Click Next.
Review the summary Replication job details, and then click Deploy Replication Job.

Start the Replication Job

On the Replication job details page, click Start.

The Replication job transitions from Provisioning to Starting to Running state. In the running state, the replication job loads an initial snapshot of the table data that you selected into BigQuery. In this state, the state of the table is listed as Snapshotting. After the initial snapshot is loaded into BigQuery, any changes made to the table are replicated to BigQuery, and the state of the table is listed as Replicating.

Monitor the Replication Job

You can start and stop the Replication job, review its configuration and logs, and monitor the replication job.

To monitor replication job activities from the Replication job details page, click the Monitoring tab.

View the Results in BigQuery

The replication job creates a replicated dataset and table in BigQuery, with names inherited from the corresponding MySQL database and table names.

Open BigQuery in the Cloud Console.
In the left panel, click on your project name to expand a list of datasets.
Select the sakila dataset, and then select a table.

For more information about using BigQuery, see the BigQuery documentation.