Replication Concepts

Replication Entities

This section defines all the entities in a Replication.

Concept

Description

Replication

Replication is a capability of CDAP that makes it possible to replicate data continuously and with low latency from operational data stores into analytical data warehouses. Create a Replication job by configuring a source and a target with optional transformations.

Source

Reads database, table, or column change events and makes them available for further processing in a Replication job. A Replication job contains one source, which relies upon a change capture solution to provide the changes. There can be multiple sources for a database, each with a different change capture solution. A source is a pluggable module built using CDAP's plugin architecture. If a source is not available to meet your needs, you can build your own by implementing the source interface, and then upload it to CDAP.

Target

Writes changes received from a source into a target database. A Replication job contains one target. A target is a pluggable module built using CDAP's plugin architecture. If a target is not available to meet your needs, you can build your own by implementing the target interface, and then upload it to CDAP.

Source properties

Configures the source, including connection details, source database and table names, credentials, and other properties.

Target properties

Configures the target, including connection details, target database and table names, credentials, and other properties.

Replication Job properties

Configures the Replication job, including failure thresholds, staging areas, notifications, and validation settings.

Transformations

Changes applied to data as it is replicated. These include implicit transformations, such as column filtering, and in-flight transformations, such as masking fields and renaming columns.

Draft

A saved, partially completed Replication job. When the Replication job definition is complete, it can be started.

Events

Change events in the source to be replicated to the target. Events include inserts, updates, deletes, and DDL (Data Definition Language) changes.

Insert

Addition of new records in the source.

Update

Update to existing records in the source.

Delete

Removal of existing records in the source.

Logs

The operational logs of a Replication job.

Replication Job detail

A detail page with Replication job information, such as its current state, operational metrics, historical view over time, validation results, and its configuration.

Dashboard

A page that lists the state of all change data capture activities, including throughput, latency, failure rates, and validation results.
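To make the Events, Insert, Update, and Delete entities above more concrete, the following is a minimal sketch of how change events might be modeled and applied to a target. It is illustrative only: ChangeEvent, EventType, apply_event, and replicate are hypothetical names, not CDAP classes or APIs, and an in-memory dictionary stands in for the target database. The merge behavior for updates is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Optional


class EventType(Enum):
    INSERT = "insert"
    UPDATE = "update"
    DELETE = "delete"


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change event; not a CDAP class."""
    event_type: EventType
    table: str
    key: int
    row: Optional[dict] = None  # column values; None for deletes


# The target is modeled as {table name: {primary key: row}}.
Target = Dict[str, Dict[int, dict]]


def apply_event(target: Target, event: ChangeEvent) -> None:
    """Apply one change event to the in-memory stand-in for the target."""
    rows = target.setdefault(event.table, {})
    if event.event_type is EventType.INSERT:
        rows[event.key] = dict(event.row or {})
    elif event.event_type is EventType.UPDATE:
        # Assumed semantics: merge changed columns into the existing row.
        rows.setdefault(event.key, {}).update(event.row or {})
    elif event.event_type is EventType.DELETE:
        rows.pop(event.key, None)


def replicate(events: Iterable[ChangeEvent], target: Optional[Target] = None) -> Target:
    """Apply events in order and return the resulting target contents."""
    target = {} if target is None else target
    for event in events:
        apply_event(target, event)
    return target
```

Applying an insert, an update to the same key, and a delete of a second key in order leaves only the updated first row in the target, which mirrors how a source's change stream is replayed against the target.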

Actions

This section defines all the actions that you can perform in a Replication.

Concept

Description

Deploy

Creating a new Replication job by following a UI flow to specify a source, target, and their configuration.

Refresh

Validating a replication transformation. If a transformation is invalid, delete it and add a new one.

Save

Saving a partially created Replication job to resume creation at a later time.

Delete

Deleting an existing Replication job. Only stopped Replication jobs can be deleted.

Start

Starting a Replication job. The Replication job enters the active state if there are changes to be processed; otherwise, it enters the waiting state.

Stop

Stopping a Replication job. The Replication job stops processing changes from the source.

View logs

Viewing logs of a Replication job for debugging or other analysis.

Search

Searching for a Replication job by its name, description, or other Replication job metadata.

Assess

Assessing the impact of replication prior to starting replication. Assessing a Replication job generates an assessment report that flags schema incompatibilities and missing features.

Monitoring

Replication states

Concept

Description

Deployed

The Replication job is deployed, but not started. In this state, a Replication job does not replicate events.

Starting

The Replication job is initializing, and is not ready to replicate changes.

Running

The Replication job is started, and is replicating changes.

Stopping

The Replication job is stopping.

Stopped

The Replication job is stopped.

Failed

The Replication job failed due to fatal errors.
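The Replication states above can be read as a small state machine. The following sketch shows one way to encode them. The state names come from the table, but the allowed transitions are an assumption, since this page lists the states without the edges between them. JobState, ALLOWED, transition, replicates_events, and can_delete are hypothetical names, not part of CDAP.

```python
from enum import Enum


class JobState(Enum):
    DEPLOYED = "Deployed"
    STARTING = "Starting"
    RUNNING = "Running"
    STOPPING = "Stopping"
    STOPPED = "Stopped"
    FAILED = "Failed"


# Assumed transitions; the page lists the states but not the edges.
ALLOWED = {
    JobState.DEPLOYED: {JobState.STARTING},
    JobState.STARTING: {JobState.RUNNING, JobState.FAILED},
    JobState.RUNNING: {JobState.STOPPING, JobState.FAILED},
    JobState.STOPPING: {JobState.STOPPED, JobState.FAILED},
    JobState.STOPPED: {JobState.STARTING},
    JobState.FAILED: {JobState.STARTING},
}


def transition(current: JobState, new: JobState) -> JobState:
    """Return the new state, or raise if the move is not in ALLOWED."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {new.value}")
    return new


def replicates_events(state: JobState) -> bool:
    """Only a Running job replicates changes, per the state descriptions."""
    return state is JobState.RUNNING


def can_delete(state: JobState) -> bool:
    """Follows the Delete action's rule literally: only stopped jobs."""
    return state is JobState.STOPPED
```

For example, a job moved through Deployed, Starting, and Running reports replicates_events as true, and can_delete becomes true only after it passes through Stopping to Stopped.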

Table states

Concept

Description

Snapshotting

The Replication job is taking a snapshot of the current state of the table prior to replicating changes.

Replicating

The Replication job is replicating changes from the source table into the destination table.

Failing

The Replication job is failing to replicate changes from the source table due to an error.

Metrics

Concept

Description

Inserts

The number of inserts applied to the target in the selected time period.

Updates

The number of updates applied to the target in the selected time period.

Deletes

The number of deletes applied to the target in the selected time period.

Throughput

The number of events and the number of bytes replicated to the target in the selected time period.

Latency

The latency at which data is replicated to the target in the selected time period.
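As a rough illustration of how the metrics above relate to each other, the following sketch computes them from a list of applied events for a selected time period. It is not how CDAP computes its metrics. AppliedEvent and summarize are hypothetical names, and latency is assumed here to be the average delay between a change occurring in the source and being applied to the target.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class AppliedEvent:
    """Hypothetical record of one event applied to the target."""
    kind: str            # "insert", "update", or "delete"
    size_bytes: int
    source_time: float   # seconds; when the change occurred in the source
    applied_time: float  # seconds; when it was applied to the target


def summarize(events: Iterable[AppliedEvent], start: float, end: float) -> dict:
    """Summarize events applied in the half-open period [start, end)."""
    window = [e for e in events if start <= e.applied_time < end]
    counts = Counter(e.kind for e in window)
    latencies = [e.applied_time - e.source_time for e in window]
    avg_latency: Optional[float] = (
        sum(latencies) / len(latencies) if latencies else None
    )
    return {
        "inserts": counts["insert"],
        "updates": counts["update"],
        "deletes": counts["delete"],
        # Throughput is reported as both an event count and a byte count.
        "events": len(window),
        "bytes": sum(e.size_bytes for e in window),
        "avg_latency_seconds": avg_latency,  # assumed definition of latency
    }
```

Events applied outside the selected period are ignored, so widening or narrowing the period changes every metric at once.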

Components

Component

Description

Service

Oversees the end-to-end orchestration of Replication jobs, and provides capabilities for designing, deploying, managing, and monitoring Replication jobs. It runs inside the CDAP tenant project (the tenant project is hidden from the user). Its status is displayed in the SYSTEM ADMIN page of the CDAP UI.

State Management

The service manages the state of each Replication job in a Cloud Storage bucket in the customer project. The bucket can be configured when the Replication job is created. It stores the current offsets and replication state of each Replication job.

Execution

Dataproc clusters provide the execution environment of Replication jobs, which run in your project. Replication jobs execute using CDAP workers. The size and characteristics of the execution environment are configured with Compute Engine profiles.

Source Database

Your production operational database, which is replicated into your target database. This database can be located on-premises or on Google Cloud. CDAP Replication supports MySQL, Microsoft SQL Server, and Oracle source databases.

Change Tracking Mechanism

Instead of running an agent on the source database, Cloud Data Fusion relies on a change tracking solution to read changes in the source database. The solution can be a component of the source database or a separately licensed, third-party solution. In the latter case, the change tracking solution runs on-premises, colocated with the source database, or on Google Cloud. Each source must be associated with a change tracking solution.

  1. SQL Server

    • Supported solution: SQL Server CDC (change tracking tables)

    • Additional software: No

    • License/cost: N/A

    • Comments: Available in SQL Server 2016 and later

  2. MySQL

    • Supported solution: MySQL binary log

    • Additional software: No

    • License/cost: N/A

    • Comments: N/A

  3. Oracle

Target Database

The destination location for replication and analysis. CDAP supports BigQuery as a target database.
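The State Management component described above stores the current offsets and replication state of each Replication job. The following is a minimal sketch of that idea, assuming a local directory as a stand-in for the Cloud Storage bucket and a JSON file per job. The file layout, the function names save_state and load_state, and the offset fields are all hypothetical and do not reflect the service's actual storage format.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_state(state_dir: Path, job_name: str, offset: dict, state: str) -> Path:
    """Persist a job's offset and state; a local directory stands in for the bucket."""
    state_dir.mkdir(parents=True, exist_ok=True)
    target = state_dir / f"{job_name}.json"
    tmp = state_dir / f"{job_name}.json.tmp"
    tmp.write_text(json.dumps({"offset": offset, "state": state}))
    # Replace the old checkpoint in one step so readers never see a partial file.
    os.replace(tmp, target)
    return target


def load_state(state_dir: Path, job_name: str) -> Optional[dict]:
    """Return the last saved checkpoint, or None if the job has none yet."""
    path = state_dir / f"{job_name}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        bucket = Path(d) / "bucket"
        # Illustrative offset for a MySQL binary log source.
        save_state(bucket, "job1", {"binlog_file": "mysql-bin.000003", "position": 154}, "Running")
        print(load_state(bucket, "job1"))
```

Keeping the offset next to the replication state means a restarted job can, in principle, resume reading changes from where it left off instead of starting over.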

Authentication

Authentication mechanisms vary according to the source database or change tracking software. When using the built-in capabilities of source databases, such as SQL Server and MySQL, database logins are used for authentication. When using change tracking software, the software's authentication mechanism is used.

Connectivity

The following list describes the network connections required for Replication, and the security mechanisms they use. Each entry gives the endpoints (from and to), whether the connection is optional, the protocol, the network requirements, the authentication and security mechanism, and the purpose.

  1. Replication Service (Tenant Project) to Source DB

    • Optional: Yes

    • Protocol: Depends on Source. JDBC for direct DB connection

    • Network: Peering + Firewall rules + VPN/Interconnect + Router

    • Auth Security: DB Login

    • Purpose: Needed at design time only, not execution time, for table listing and assessment. These are optional steps, and replication can continue without them.

  2. Delta Service (Tenant Project) to GCS

    • Optional: No

    • Protocol: Cloud API

    • Network: VPC-SC

    • Auth Security: IAM

    • Purpose: State Management (offsets and replication states)

  3. Dataproc (Customer project) to Source DB

    • Optional: No

    • Protocol: Depends on Source. JDBC for direct DB connection

    • Network: Peering + Firewall rules + VPN/Interconnect + Router

    • Auth Security: DB Login

    • Purpose: Needed at execution time, for reading changes from source DB to replicate them to target

  4. Dataproc (Customer project) to GCS

    • Optional: No

    • Protocol: Cloud API

    • Network: VPC-SC

    • Auth Security: IAM

    • Purpose: State Management (offsets and replication states)

  5. Dataproc (Customer project) to BigQuery

    • Optional: No

    • Protocol: Cloud API

    • Network: VPC-SC

    • Auth Security: IAM

    • Purpose: Needed at execution time, for applying changes from the source DB to the target

Created in 2020 by Google Inc.