Replicating data from Oracle Database to BigQuery

This tutorial shows you how to deploy a job that continuously replicates changed data from an Oracle database to a BigQuery dataset, using CDAP Replication. This feature is powered by Datastream, Google Cloud's serverless change data capture (CDC) and replication service.

Objectives

In this tutorial, you:

  1. Configure your Oracle database to enable supplemental logging.

  2. Create and run a CDAP Replication job.

  3. View the results in BigQuery.

Costs

This tutorial uses the following billable components of Google Cloud: Dataproc, Datastream, BigQuery, and Cloud Storage.

To generate a cost estimate based on your projected usage, use the pricing calculator.

When Replication runs, you are charged for the Dataproc cluster and Cloud Storage, and you incur processing costs for Datastream and BigQuery. To optimize these costs, we strongly recommend that you use BigQuery flat-rate pricing.

Before you begin

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

  2. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.

  3. Enable the Dataproc, Datastream, BigQuery, and Cloud Storage APIs.
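If you prefer the command line, the same APIs can be enabled with the gcloud CLI. The project ID below is a placeholder; substitute your own.

```shell
# Set the project to work in (hypothetical project ID).
gcloud config set project my-replication-project

# Enable the services this tutorial depends on.
gcloud services enable \
    dataproc.googleapis.com \
    datastream.googleapis.com \
    bigquery.googleapis.com \
    storage.googleapis.com
```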

If you use an existing CDAP instance, or if you don't see Replication in the CDAP menu, upgrade your instance to enable Replication.

Optional: Install Oracle in Compute Engine

This section shows how to set up an example database. If you already have an Oracle database installed, you can use that and skip this section.

Note: Datastream supports only specific versions of Oracle database.

  1. Download an Oracle Server Docker image. (This image is an express edition of Oracle 11g. It has capability limitations. See Oracle Database Editions for details.)

  2. Upload your Docker image to the Container Registry.

  3. Deploy your Docker image on a new VM instance. When you create the VM, change the disk size to 500 GB.

  4. Install the HR sample schema.
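The upload and deployment steps above can be sketched with the gcloud CLI. The image name, instance name, and zone are illustrative assumptions; adjust them for your environment.

```shell
PROJECT_ID=$(gcloud config get-value project)

# Push the locally built Oracle XE image to Container Registry
# (image name "oracle-xe-11g" is a placeholder).
docker tag oracle-xe-11g gcr.io/${PROJECT_ID}/oracle-xe-11g
docker push gcr.io/${PROJECT_ID}/oracle-xe-11g

# Deploy the image on a new VM instance with a 500 GB boot disk.
gcloud compute instances create-with-container oracle-server \
    --zone=us-central1-a \
    --container-image=gcr.io/${PROJECT_ID}/oracle-xe-11g \
    --boot-disk-size=500GB
```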

Configure your Oracle Server to enable supplemental logging

Follow the steps to configure your source Oracle database.
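As an illustration, supplemental logging is typically enabled with statements like the following, run as a user with SYSDBA privileges. The connection string is a placeholder, and the exact statements required (including archive log settings) are in the linked configuration steps.

```shell
# Connect to the example XE database as SYSDBA (placeholder password).
sqlplus sys/password@//localhost:1521/xe as sysdba <<'EOF'
-- Enable database-level supplemental logging.
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
-- Log all columns of changed rows, not just keys.
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
EOF
```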

Grant permissions to service accounts

Grant the Dataproc service account permission to call the Datastream API and access Cloud Storage.

  1. In the Google Cloud console, go to the IAM page.

  2. In the permissions table, in the Principals column, find the Dataproc VM service account, which uses the format <project-number>-compute@developer.gserviceaccount.com.

  3. Click Edit, to the right of the service account.

  4. Click Add Another Role.

  5. Click Select a role.

  6. Use the search bar to search for and select Datastream Admin.

  7. Click Save.

  8. Check that Datastream Admin appears in the Role column of the Permissions table.

For more information about service accounts and service account keys, see Creating and managing service account keys.
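The console steps above can also be performed in one gcloud command. This sketch assumes the default Dataproc VM service account; `roles/datastream.admin` is the role granted in the steps above.

```shell
PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe "${PROJECT_ID}" \
    --format='value(projectNumber)')

# Grant the Datastream Admin role to the Dataproc VM service account.
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
    --role="roles/datastream.admin"
```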

Create VPC network peering or firewall rule for your Oracle server

If your Oracle Database doesn't allow ingress traffic from public IPs, you need to set up VPC peering between the Datastream VPC and the VPC where your Oracle database can be accessed. See Create a private connectivity configuration for details.

If your Oracle database allows ingress traffic from public IPs, create a firewall rule for the VM instance to allow ingress traffic from Datastream public IPs.
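A firewall rule like the following is one way to allow that traffic. The source range shown is a placeholder (a reserved documentation range); use the Datastream public IP ranges for your region, and the target tag assumes the Oracle VM is tagged `oracle-server`.

```shell
# Allow ingress to the Oracle listener port (1521) from Datastream.
gcloud compute firewall-rules create allow-datastream-to-oracle \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:1521 \
    --source-ranges=203.0.113.0/29 \
    --target-tags=oracle-server
```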

Create and run a CDAP Replication job

Create the job

  1. In the CDAP UI, click the menu and navigate to the Replication page. Click the green circle button on the top right.

  2. On the Create new Replication job page, specify a Replication job Name. Optionally, specify a description.

  3. Click Next.

  4. Configure the source:

    1. Select Oracle (by Datastream) as the source.

    2. For Connectivity Method, if your Oracle server allows ingress traffic from Datastream public IPs, choose IP allowlisting. Otherwise, choose Private connectivity (VPC peering) and, for Private Connection Name, enter the VPC peering name you created in the Create VPC network peering or firewall rule for your Oracle server section.

    3. For Host, enter the hostname of the Oracle Server to read from.

    4. For Port, enter the port to use to connect to the Oracle Server: 1521.

    5. For System Identifier (SID), enter xe (the SID of the sample Oracle Express Edition database).

    6. In the credentials section, enter your username and password for accessing the Oracle Server.

    7. Leave all other properties as is.

  5. Click Next.

  6. If the connection is successful, a list of tables under the HR sample schema appears. For this tutorial, select all of the tables and all event types to replicate (Insert, Update, and Delete events).

  7. Click Next.

  8. Configure the target:

    1. Select the BigQuery target.

    2. The Project ID and Service Account Key are automatically detected. Keep the default values as is.

    3. (Optional) In the Advanced section, you can configure the staging bucket's name and location, the load interval, the staging table prefix, and the behavior when tables or databases are dropped.

  9. Click Next.

  10. (Optional) Configure the advanced properties. For this tutorial, you can use the default settings.

  11. Click Next.

  12. On the Review assessment page, click View mappings next to any of the tables to get an assessment of schema issues, missing features, or connectivity issues that might occur during Replication. Any issues must be resolved before you can proceed. For this tutorial, if any of the tables have issues, go back to the step where you selected tables and instead select tables or events (Inserts, Updates, or Deletes) without issues.

  13. Click Back to return to the Review assessment page.

  14. Click Next.

  15. Review the summary Replication job details, and then click Deploy Replication job.

Start the job

On the Replication job details page, click Start.

The Replication job transitions from Provisioning to Starting to Running. While running, the job first loads an initial snapshot of the table data that you selected into BigQuery; during this phase, each table's state is listed as Snapshotting. After the initial snapshot is loaded into BigQuery, any changes made to a table are replicated continuously, and the table's state is listed as Replicating.

Monitor the job

You can start and stop the Replication job, review its configuration and logs, and monitor your Replication job.

You can monitor Replication job activities from the Replication job details page.

  1. From the Replication page, click the desired Replication job Name.

  2. Click Monitoring.

View the results in BigQuery

The Replication job creates a replicated dataset and table in BigQuery, with names inherited from the corresponding Oracle database and table names.

If your source database name does not comply with the BigQuery dataset naming conventions, or if you have naming conflicts when you replicate multiple datasets into a single BigQuery dataset, specify aliases in the BigQuery target properties.

  1. Open BigQuery in the Google Cloud console.

  2. In the left panel, click on your project name to expand a list of datasets.

  3. Select the xe dataset, and then select a table to view.

For more information, see the BigQuery documentation.
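You can also inspect the replicated data with the bq command-line tool. The dataset and table names below assume the HR sample schema replicated into a dataset named xe, as in this tutorial.

```shell
# Preview ten rows from a replicated table.
bq query --use_legacy_sql=false \
    'SELECT * FROM `xe.EMPLOYEES` LIMIT 10'
```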

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

After you've finished the tutorial, clean up the resources you created on Google Cloud so they won't take up quota and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.
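If you scripted the setup, you can clean up the same way. The project ID, instance name, and zone below are the placeholders used earlier in this tutorial; substitute your own.

```shell
# Option 1: delete the whole project, removing every resource in it.
gcloud projects delete my-replication-project

# Option 2: delete individual resources instead.
gcloud compute instances delete oracle-server --zone=us-central1-a
bq rm -r -f -d xe
```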

Created in 2020 by Google Inc.