Introduction

CDAP offers change data capture (CDC) via three different approaches:

  • GoldenGate for Oracle
  • LogMiner for Oracle
  • Change Tracking for SQL Server

All of these CDC mechanisms are supported via real-time data pipelines, and the plugins are available from the Hub. The CDC solution currently runs on Spark 1.x and has experimental support for BigTable.

Use case(s)

The scope of work involves making the CDC solution work with Spark 2.x and enabling it to write to BigTable. Performance numbers (throughput and latency) should be published for these destinations with all three CDC approaches.

  • ETL developers should be able to set up real-time pipelines that write data to BigTable/BigQuery.
  • Users should get field-level lineage for the source and sink being used.
  • Reference documentation for the CDC plugins should be updated to account for the changes.
  • The solution should run with all versions of Spark 2.x.
  • Integration tests for the specific plugins should be added in the test repos.

Deliverables 

  • Source code in cask-solution/cdc repo
  • Performance tests for the three approaches with BigTable 
  • Integration test code 
  • Relevant documentation in the source repo and in the plugin's reference documentation section

Relevant links 

Plugin Type

  •  Batch Source
  •  Batch Sink 
  •  Real-time Source
  •  Real-time Sink
  •  Action
  •  Post-Run Action
  •  Aggregate
  •  Join
  •  Spark Model
  •  Spark Compute

Configurables

CDCBigTable Sink Properties

  • Reference Name (String, Required): Name used to track this external source.
  • Instance Id (String, Required, macro-enabled): BigTable instance ID; uniquely identifies the BigTable instance within your Google Cloud Platform project.
  • Project Id (String, Optional, macro-enabled): Google Cloud Project ID, which uniquely identifies a project. It can be found on the Dashboard in the Google Cloud Platform Console. If not specified, the Project ID will be read automatically from the cluster environment.
  • Service Account File Path (String, Optional, default: null, macro-enabled): Path on the local file system of the service account key used for authorization. If the plugin is run on a Google Cloud Dataproc cluster, the service account key does not need to be provided and can be set to 'auto-detect'; credentials will then be read automatically from the cluster environment. When running on other clusters, the file must be present on every node in the cluster. See Google's documentation on service account credentials for details.
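
A minimal sketch of how these properties could map onto a CDAP plugin config class, assuming the plugin follows the usual cdap-etl PluginConfig conventions (the class and field names below are illustrative, not taken from the actual plugin source):

Code Block
// Illustrative only: class and field names are assumptions, not the actual plugin source.
import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.annotation.Macro;
import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.plugin.PluginConfig;

import javax.annotation.Nullable;

public class CDCBigTableConfig extends PluginConfig {

  @Name("referenceName")
  @Description("Name used to track this external source.")
  private String referenceName;

  @Name("instance")
  @Description("BigTable instance id within the GCP project.")
  @Macro
  private String instance;

  @Name("project")
  @Description("GCP project id; read from the cluster environment when not set.")
  @Macro
  @Nullable
  private String project;

  @Name("serviceFilePath")
  @Description("Path to the service account key, or 'auto-detect' on Dataproc clusters.")
  @Macro
  @Nullable
  private String serviceFilePath;
}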

CDCOracleLogMiner Source Properties

  • Reference Name (String, Required): Uniquely identifying name used for lineage.
  • Host (String, Optional, default: localhost): Oracle DB host.
  • Port (Number, Optional, default: 1521): Port on which Oracle is running.
  • Database (String, Required): Name of the database to connect to.
  • Username (String, Required): Database username.
  • Password (String, Required): User password.
  • Connection Arguments (KeyValue, Optional): A list of arbitrary string key/value pairs passed as connection arguments; see https://docs.oracle.com/cd/B28359_01/win.111/b28375/featConnecting.htm for the available properties.
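
For reference, a minimal sketch of how these connection properties would typically be combined into an Oracle JDBC connection; the URL form and property handling are assumptions about the eventual implementation, not settled design:

Code Block
// Illustrative sketch: combines Host, Port, Database, and Connection Arguments into a JDBC connection.
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class OracleConnectionSketch {
  public static Connection connect(String host, int port, String database,
                                    String user, String password,
                                    Properties connectionArguments) throws Exception {
    // Thin-driver service-name URL; a SID-style URL (jdbc:oracle:thin:@host:port:SID) is also possible.
    String url = String.format("jdbc:oracle:thin:@//%s:%d/%s", host, port, database);
    Properties props = new Properties();
    props.putAll(connectionArguments);   // arbitrary key/value connection arguments
    props.setProperty("user", user);
    props.setProperty("password", password);
    return DriverManager.getConnection(url, props);
  }
}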

Design

Approach(s)

Iteration 1: Support Spark 2, update documentation, add lineage.

  • Add lineage support for real-time plugins. This requires modifying CDAP core modules (co.cask.cdap.etl.spark.streaming.DefaultStreamingContext).
  • Make credentials optional for the BigTable Sink.
  • Add config, schema, and third-party application health validation to the "configurePipeline" step for all CDC plugins. To validate the schema properly, source and sink plugins should use a single schema for messages (see the sketch after this list).
  • Register datasets of source CDC plugins for the lineage view.
  • Register fields of all CDC plugins for the lineage view.
  • Create a new core library for Spark 2 plugins, an analogue of "cdap-etl-api-spark".
  • Migrate CDC plugins to CDAP core v6.x.x.
  • Update documentation to be consistent with the other GCP plugins.
  • Write documentation for the CDCKudu Sink, CDCDatabase Source, and CTSQLServer Source.
  • Update UI widgets for all CDC plugins to match the properties and schema defined in the documentation.
  • Try to support both Spark 1 and Spark 2, but drop Spark 1 support if necessary.
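
A minimal sketch of the "configurePipeline" validation idea, assuming the standard cdap-etl-api PipelineConfigurer/StageConfigurer calls; the schema shown is the shared CDC message schema from the "Message Schema" section, trimmed for brevity:

Code Block
// Illustrative sketch: validate config at configure time and declare the shared CDC message schema
// so that source and sink plugins agree on a single schema.
import co.cask.cdap.api.data.schema.Schema;
import co.cask.cdap.etl.api.PipelineConfigurer;
import co.cask.cdap.etl.api.StageConfigurer;

public class ConfigurePipelineSketch {

  // Shared CDC message schema (trimmed to two fields here; see the Message Schema section).
  static final Schema CHANGE_SCHEMA = Schema.recordOf(
      "etlSchemaBody",
      Schema.Field.of("op_type", Schema.nullableOf(Schema.of(Schema.Type.STRING))),
      Schema.Field.of("table", Schema.of(Schema.Type.STRING)));

  public void configurePipeline(PipelineConfigurer pipelineConfigurer, String instanceId) {
    // Config validation: fail fast on missing required properties.
    if (instanceId == null || instanceId.isEmpty()) {
      throw new IllegalArgumentException("BigTable instance id must be provided.");
    }
    StageConfigurer stageConfigurer = pipelineConfigurer.getStageConfigurer();
    // A source would declare the shared schema; a sink would validate its input schema against it.
    stageConfigurer.setOutputSchema(CHANGE_SCHEMA);
  }
}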


Iteration 2: Implement LogMiner Source Plugin.

  • Implement the LogMiner Source Plugin for Oracle (CDCOracleLogMiner Source). The plugin should use JDBC to retrieve updates from LogMiner; a rough sketch follows below.
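
A rough sketch of the JDBC interaction this implies, following the conventional LogMiner flow (start a LogMiner session, poll V$LOGMNR_CONTENTS, end the session). The session options, SCN handling, and filtering below are assumptions, not settled design:

Code Block
// Illustrative only: conventional LogMiner polling over JDBC; error handling and SCN bookkeeping omitted.
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class LogMinerPollSketch {
  public void pollOnce(Connection connection, long startScn, long endScn) throws Exception {
    // Start a LogMiner session over the SCN range using the online dictionary.
    try (CallableStatement start = connection.prepareCall(
        "BEGIN DBMS_LOGMNR.START_LOGMNR("
            + "STARTSCN => ?, ENDSCN => ?, "
            + "OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG + DBMS_LOGMNR.CONTINUOUS_MINE); END;")) {
      start.setLong(1, startScn);
      start.setLong(2, endScn);
      start.execute();
    }
    // Read the redo records captured for the tracked tables.
    String query = "SELECT SCN, OPERATION, SEG_OWNER, TABLE_NAME, SQL_REDO "
        + "FROM V$LOGMNR_CONTENTS WHERE OPERATION IN ('INSERT', 'UPDATE', 'DELETE')";
    try (Statement stmt = connection.createStatement();
         ResultSet rs = stmt.executeQuery(query)) {
      while (rs.next()) {
        String operation = rs.getString("OPERATION");
        String sqlRedo = rs.getString("SQL_REDO");
        // Each row would be translated into a CDC change record using the message schema below.
      }
    }
    // End the session before the next poll.
    try (CallableStatement end = connection.prepareCall("BEGIN DBMS_LOGMNR.END_LOGMNR; END;")) {
      end.execute();
    }
  }
}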

Iteration 3: Integration tests.

  • Implement integration tests for all CDC plugins. Tests should pass on Spark 2 and should use a local environment where possible; tests against paid services should run only under a dedicated Maven profile (see the sketch below). All tests should run on both local and distributed environments. Provide a docker-compose file for local environment setup.
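
One possible way to keep the paid-service tests behind a dedicated Maven profile is to gate them on a system property that the profile sets; the property name and test class below are illustrative only:

Code Block
// Illustrative gate for tests that hit paid services (e.g. BigTable):
// a Maven profile would set -Dcdc.paid.tests=true; without it, the test is skipped.
import org.junit.Assume;
import org.junit.Before;
import org.junit.Test;

public class CDCBigTableSinkIT {

  @Before
  public void onlyWithPaidServicesEnabled() {
    Assume.assumeTrue("Paid-service tests disabled", Boolean.getBoolean("cdc.paid.tests"));
  }

  @Test
  public void writesChangesToBigTable() {
    // Deploy a real-time pipeline (CTSQLServer Source -> CDCBigTable Sink), apply DDL/DML to the
    // source database, and assert the corresponding mutations appear in the BigTable instance.
  }
}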

Iteration 4: Performance tests.

  • Implement performance tests for the three CDC approaches with BigTable; a measurement sketch follows below.
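
A minimal sketch of how throughput and latency numbers could be collected during those runs; the measurement hooks below are an assumption about the test harness, not part of the plugins:

Code Block
// Illustrative measurement helper: records per-event latency and overall throughput.
import java.util.concurrent.atomic.AtomicLong;

public class CdcPerfStats {
  private final AtomicLong events = new AtomicLong();
  private final AtomicLong totalLatencyMillis = new AtomicLong();
  private final long startMillis = System.currentTimeMillis();

  // Called by the harness when a change written to the source at sourceTimestampMillis shows up in the sink.
  public void record(long sourceTimestampMillis) {
    events.incrementAndGet();
    totalLatencyMillis.addAndGet(System.currentTimeMillis() - sourceTimestampMillis);
  }

  public double throughputPerSecond() {
    double elapsedSeconds = (System.currentTimeMillis() - startMillis) / 1000.0;
    return elapsedSeconds > 0 ? events.get() / elapsedSeconds : 0;
  }

  public double averageLatencyMillis() {
    long n = events.get();
    return n > 0 ? (double) totalLatencyMillis.get() / n : 0;
  }
}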


Message Schema

Code Block
{
  "type": "record",
  "name": "etlSchemaBody",
  "fields": [
    {
      "name": "op_type",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "table",
      "type": "string"
    },
    {
      "name": "primary_keys",
      "type": [
        {
          "type": "array",
          "items": "string"
        },
        "null"
      ]
    },
    {
      "name": "schema",
      "type": [
        "string",
        "null"
      ]
    },
    {
      "name": "change",
      "type": [
        {
          "type": "map",
          "keys": "string",
          "values": [
            "null",
            "boolean",
            "int",
            "long",
            "float",
            "double",
            "bytes",
            "string"
          ]
        },
        "null"
      ]
    }
  ]
}
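
For illustration, a change event conforming to this schema might be built as a StructuredRecord roughly as follows; the concrete values and the op_type convention are made up for the example:

Code Block
// Illustrative construction of a single CDC message using the schema above.
import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.api.data.schema.Schema;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableMap;

public class ChangeMessageSketch {
  public static StructuredRecord sampleInsert(Schema changeSchema) {
    return StructuredRecord.builder(changeSchema)
        .set("op_type", "I")                     // insert/update/delete marker (example value)
        .set("table", "EMPLOYEES")
        .set("primary_keys", ImmutableList.of("ID"))
        .set("schema", "{\"type\":\"record\",\"name\":\"EMPLOYEES\",\"fields\":[]}")
        .set("change", ImmutableMap.of("ID", 1, "NAME", "Alice"))
        .build();
  }
}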

Limitation(s)

  • CDC plugins may not be compatible with Spark 2.

Future Work

  • CDAP-13558 (Cask Community Issue Tracker)
  • Oracle LogMiner Source Plugin
  • Integration tests for CDC plugins
  • Performance tests for CDC plugins

Test Case(s)

  • Track changes from GoldenGate for Oracle (create table, insert row, update row, delete row).
  • Track changes from LogMiner for Oracle (create table, insert row, update row, delete row).
  • Track changes from SQL Server (create table, insert row, update row, delete row).
  • Write changes to Apache Kudu (create table, insert row, update row, delete row).
  • Write changes to HBase (create table, insert row, update row, delete row).
  • Write changes to BigTable (create table, insert row, update row, delete row).

Sample Pipeline

Pipeline #1 (SQL Server → BigTable)

TODO




Checklist

  •  User stories documented 
  •  User stories reviewed 
  •  Design documented 
  •  Design reviewed 
  •  Feature merged 
  •  Examples and guides 
  •  Integration tests 
  •  Documentation for feature 
  •  Short video demonstrating the feature