Replication overview

Enterprises traditionally maintain independent data stores for different workloads. They typically have Operational Data Stores (ODS) for their production systems, such as Point of Sale (POS), payroll, and ERP systems. In addition, they maintain separate data warehouses for analytics, so that analytics workloads do not adversely affect the performance of the ODS. As a result, they need to replicate data between these different data stores.

Market trends such as the ever-increasing volume of data, growth in real-time use cases, and cloud migration have created a requirement to consistently deliver data to different target subsystems and partners in real time, as soon as data changes in the source. One of the keys to satisfying this need quickly and efficiently is to replicate only the delta (the data that has changed, which is often only a fraction of the total), instead of repeatedly transferring entire batches in bulk.

An efficient data replication system is key to unlocking many business use cases such as migration and consolidation, as well as IT initiatives such as Business Intelligence and Data Warehousing, and Data Science and Machine Learning, without adversely impacting critical production workloads.
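The delta-replication idea above can be illustrated with a minimal sketch: instead of re-copying the whole table, only change events are shipped and applied to the target. All names here (`apply_change`, the event shape) are illustrative, not part of any real replication product.

```python
# Minimal sketch of delta (change-based) replication versus full-batch copy.
# The event format and function names are illustrative assumptions.

def apply_change(target: dict, event: dict) -> None:
    """Apply a single change event to an in-memory target table keyed by id."""
    op, row = event["op"], event["row"]
    if op in ("insert", "update"):
        target[row["id"]] = row          # upsert the changed row
    elif op == "delete":
        target.pop(row["id"], None)      # remove the deleted row

# Instead of re-copying every row, only the changed rows are shipped:
target = {1: {"id": 1, "amount": 10}, 2: {"id": 2, "amount": 20}}
changes = [
    {"op": "update", "row": {"id": 2, "amount": 25}},
    {"op": "insert", "row": {"id": 3, "amount": 5}},
    {"op": "delete", "row": {"id": 1}},
]
for event in changes:
    apply_change(target, event)

print(sorted(target))  # prints [2, 3]: id 1 deleted, id 3 added, id 2 updated
```

Three small events bring the target up to date, regardless of how large the table is; that is what makes change-based delivery efficient compared with bulk transfers.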

Enterprises have long used Change Data Capture (CDC) to fulfill their real-time data replication needs. CDAP enables traditional enterprise ETL developers, data scientists, and business analysts to fulfill their data integration needs through a graphical user interface. When coupled with the existing data integration and ETL capabilities of CDAP, this feature unlocks the following applications and use cases:

  1. Up-to-date data for real-time analytics;

  2. Efficient data movement across modern, hybrid architectures;

  3. Faster data sharing and consistency;

  4. Support for both traditional and modern processing paradigms.


At a high level, this capability allows users to:

  1. Configure their source and target databases;

  2. Assess the impact of replication to detect incompatibilities and missing features before starting replication;

  3. Replicate data from source to target incrementally;

  4. Validate that data was successfully replicated;

  5. Monitor the data replication process.
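The five steps above can be sketched as a simple lifecycle. The `ReplicationJob` class and its methods below are purely illustrative assumptions; the actual product exposes these steps through a wizard UI, not through this API.

```python
# Hypothetical sketch of the replication lifecycle: configure, assess, start.
# Class and method names are assumptions for illustration only.

class ReplicationJob:
    def __init__(self, source: dict, target: dict):
        self.source, self.target = source, target   # step 1: configure source and target
        self.issues: list = []

    def assess(self) -> list:
        # step 2: detect incompatibilities before any data moves
        if self.source.get("type") not in ("sqlserver", "mysql"):
            self.issues.append("unsupported source: %s" % self.source.get("type"))
        return self.issues

    def start(self) -> str:
        # step 3: snapshot load, then incremental change replication
        return "running" if not self.issues else "blocked"

job = ReplicationJob(source={"type": "mysql", "host": "db.example.com"},
                     target={"type": "bigquery", "dataset": "analytics"})
print(job.assess())   # prints [] (no incompatibilities for a supported source)
print(job.start())    # prints running
```

The point of the sketch is the ordering: assessment happens before replication starts, so incompatibilities block the job rather than surfacing mid-transfer. Validation and monitoring (steps 4 and 5) then run against the live job.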

You can use Replication to replicate changes with low latency, in real time, from transactional and operational databases into BigQuery. With an easy-to-use wizard interface, users can set up replication jobs in minutes; no coding is required. Replication currently supports replicating from Microsoft SQL Server and MySQL.

For most companies, their most valuable data (transactional and operational data) is stored on-premises in traditional relational databases. While traditional migrations or batch ETL uploads can move the data to BigQuery, these high-latency approaches cannot support the continuous data pipelines and real-time operational decision-making that BigQuery is built for. Users struggle to reliably and continuously replicate data from transactional and operational databases into BigQuery with minimal impact, and not having the most up-to-date data for analytics in real time limits their ability to make accurate decisions.

Replication provides low-latency, real-time replication of data from transactional and operational databases into BigQuery. For the initial load, Replication enables zero-downtime, zero-data-loss snapshot replication from databases to BigQuery. As an enterprise-grade solution, it has built-in, real-time monitoring to validate that database transactions have loaded successfully into BigQuery, minimizing risk by ensuring data consistency. Replication also features an inspection capability that helps identify schema, connectivity, configuration, and feature incompatibilities, and provides corrective actions for easier setup.

Replication not only loads snapshots and continuously replicates changes into BigQuery, but also provides in-flight processing such as filtering (tables, columns, operations, records), transformation into the desired schema, and data masking. In-memory processing minimizes ETL workloads, improves performance, reduces complexity, and facilitates compliance.

Replication offers log-based replication from Oracle using Oracle XStream and from SQL Server through SQL Server CDC. All Replication CDC jobs can be accessed and configured (filtering, transformation, data masking) via Replication's easy-to-use wizards and drag-and-drop UI, speeding delivery of CDC data to BigQuery. Replication also records end-to-end lineage, including all filters and transformations, integrated into CDAP lineage for compliance.
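To make the in-flight processing idea concrete, here is a minimal sketch of column filtering and data masking applied to a single change record. The function names and the masking rule are assumptions for illustration; the product configures these operations through its UI, not through code.

```python
# Illustrative sketch of in-flight processing during replication:
# column filtering plus data masking applied per record before loading.
# Names and the masking rule are assumptions, not product APIs.

def mask(value: str, keep_last: int = 4) -> str:
    """Redact all but the last few characters of a sensitive field."""
    return "*" * max(len(value) - keep_last, 0) + value[-keep_last:]

def process_in_flight(row: dict, keep_columns: set, masked: set) -> dict:
    out = {k: v for k, v in row.items() if k in keep_columns}   # column filter
    for col in masked & out.keys():
        out[col] = mask(out[col])                               # data masking
    return out

row = {"id": 7, "card_number": "4111111111111111", "internal_note": "drop me"}
print(process_in_flight(row, keep_columns={"id", "card_number"},
                        masked={"card_number"}))
# prints {'id': 7, 'card_number': '************1111'}
```

Because the filtering and masking happen in memory while the change is in flight, the sensitive or unwanted data never reaches the target, which is what makes this style of processing useful for compliance.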



Created in 2020 by Google Inc.