Introduction:

CDAP users sometimes want to mirror a database into an HBase dataset. When they update the original database, they should see the corresponding updates in the HBase dataset.

What we have now and the problem:

CDAP users can create an adapter with a table sink to store data from a source. The table sink transforms the data from the source and updates correctly whenever rows are inserted into the source database. However, when rows are deleted from the source, the table sink does not delete the corresponding entries, so they remain in the table.
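The behavior described above can be illustrated with a minimal sketch. It is not CDAP code: the class UpsertOnlySinkSketch and its methods are hypothetical, and it assumes the sink simply writes every row it receives on each run without removing anything.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a sink that only writes (upserts) rows; not CDAP code.
class UpsertOnlySinkSketch {
    private final Map<String, String> table = new LinkedHashMap<>();

    // One adapter run: every row currently in the source is inserted or overwritten.
    // Nothing is ever removed from the sink table.
    void run(Map<String, String> sourceRows) {
        table.putAll(sourceRows);
    }

    Map<String, String> contents() {
        return table;
    }

    public static void main(String[] args) {
        UpsertOnlySinkSketch sink = new UpsertOnlySinkSketch();
        Map<String, String> source = new LinkedHashMap<>();
        source.put("1", "alice");
        source.put("2", "bob");
        sink.run(source);

        source.remove("2");        // a deletion in the source
        source.put("3", "carol");  // an insertion in the source
        sink.run(source);

        // Row "2" is still present even though the source no longer has it.
        System.out.println(sink.contents()); // prints {1=alice, 2=bob, 3=carol}
    }
}
```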

Solution:

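A SNAPSHOT dataset addresses this problem. Because each adapter run writes a complete snapshot of the source, entries deleted from the source no longer appear in the latest snapshot.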


User stories:

Creating adapters with SNAPSHOT source/sink:

Users should be able to create an adapter with a SNAPSHOT dataset as a source or sink. Users should be able to specify the TTL or minimum version of snapshots for the SNAPSHOT dataset. On each run of the adapter, a SNAPSHOT sink should receive all the data from the source. Whenever insertions, deletions, or updates happen in the source, users should see the corresponding changes in the sink.


Reading from the SNAPSHOT datasets:

Users can read from SNAPSHOT datasets using other applications. A reader sees the data of the snapshot that is the latest at the time it gets the dataset, and does not see any later updates unless it gets the dataset again. SNAPSHOT datasets are therefore best suited to batch programs. A long-running program such as a flow sees only one snapshot and never sees updates.
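The read semantics above can be sketched as follows. This is only an illustration under assumptions: PinnedReaderSketch, publish, getDataset, and Reader are hypothetical names, and the sketch assumes the snapshot version is resolved once, at the moment the dataset is obtained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of version pinning; not the CDAP dataset API.
class PinnedReaderSketch {
    private final Map<Long, Map<String, String>> snapshots = new HashMap<>();
    private long currentVersion = 0L; // 0 means no snapshot has been published yet

    // A writer (for example, one adapter run) publishes a complete new snapshot.
    void publish(Map<String, String> rows) {
        snapshots.put(currentVersion + 1, new HashMap<>(rows));
        currentVersion++;
    }

    // Models "getting the dataset": the version is resolved here, once.
    Reader getDataset() {
        return new Reader(currentVersion);
    }

    final class Reader {
        private final long pinnedVersion;

        Reader(long pinnedVersion) {
            this.pinnedVersion = pinnedVersion;
        }

        // Always reads from the pinned snapshot, never from a later one.
        String get(String key) {
            Map<String, String> rows = snapshots.get(pinnedVersion);
            return rows == null ? null : rows.get(key);
        }
    }

    public static void main(String[] args) {
        PinnedReaderSketch store = new PinnedReaderSketch();
        Map<String, String> rows = new HashMap<>();
        rows.put("1", "alice");
        store.publish(rows);

        Reader longRunning = store.getDataset(); // pinned to the first snapshot

        rows.put("1", "alicia");
        store.publish(rows);

        System.out.println(longRunning.get("1"));        // prints alice
        System.out.println(store.getDataset().get("1")); // prints alicia
    }
}
```

A batch program gets the dataset at the start of each run, so it picks up the newest snapshot every time. A long-running program that keeps the same reader stays on the snapshot it started with.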


Design:

My test process and results for the existing table sink:

1. I created an adapter with a DB source and a table sink. Initially, the MySQL table had 4 entries.

...

   the table sink correctly updated the inserted entry but the deleted entry was still there.


SNAPSHOT dataset design (TBD):

The SNAPSHOT dataset will be a subclass of the Table dataset, because its functionality is similar: it will support everything a Table dataset supports.

...

  1. SnapshotDataset.java, which provides the implementation of the data operations that can be performed on a dataset instance.

  2. SnapshotDatasetAdmin.java, which defines the administrative operations the SNAPSHOT dataset will support.

  3. SnapshotDatasetDefinition.java, which defines how to configure the SNAPSHOT dataset and how to perform administrative or data manipulation operations on a dataset instance, by providing the implementations of SnapshotDatasetAdmin and SnapshotDataset.

  4. SnapshotModule.java, which defines the dataset type.


Approach:

  • Description:

The SNAPSHOT dataset uses two tables: a metadata table that stores the current version, and a data table that stores the data. A query first reads the current version from the metadata table and then filters the data table for the rows of that version. TTL or minVersion controls how long outdated data is retained.
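The two-table approach can be sketched in memory as below. This is a sketch under stated assumptions, not the planned implementation: SnapshotStoreSketch and all of its members are hypothetical, the metadata table is reduced to a single field, and minVersion is interpreted here as "the number of most recent snapshots to keep", which may differ from the final design. TTL-based expiry is not modeled.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the metadata table plus data table; not CDAP code.
class SnapshotStoreSketch {
    // Stand-in for the metadata table: holds only the current version (0 = none yet).
    private long currentVersion = 0L;
    // Stand-in for the data table: version -> (row key -> value).
    private final TreeMap<Long, Map<String, String>> data = new TreeMap<>();
    // Assumption: how many of the newest snapshots are retained.
    private final int versionsToKeep;

    SnapshotStoreSketch(int versionsToKeep) {
        if (versionsToKeep < 1) {
            throw new IllegalArgumentException("versionsToKeep must be at least 1");
        }
        this.versionsToKeep = versionsToKeep;
    }

    // Writes a full copy of the source under a new version, then publishes that version.
    long writeSnapshot(Map<String, String> sourceRows) {
        long next = currentVersion + 1;
        data.put(next, new HashMap<>(sourceRows));
        currentVersion = next; // publish only after the data is written
        while (data.size() > versionsToKeep) {
            data.pollFirstEntry(); // drop the oldest snapshot
        }
        return next;
    }

    long currentVersion() {
        return currentVersion;
    }

    // Filters the data table for one version; empty if that version was never written or has expired.
    Map<String, String> read(long version) {
        Map<String, String> rows = data.get(version);
        return rows == null
            ? Collections.<String, String>emptyMap()
            : Collections.unmodifiableMap(rows);
    }

    // A query: look up the current version first, then read only that version's rows.
    Map<String, String> readCurrent() {
        return read(currentVersion);
    }

    public static void main(String[] args) {
        SnapshotStoreSketch store = new SnapshotStoreSketch(2);
        Map<String, String> source = new HashMap<>();
        source.put("1", "alice");
        source.put("2", "bob");
        store.writeSnapshot(source);

        source.remove("2"); // a deletion in the source
        store.writeSnapshot(source);

        // The deleted row is absent from the current snapshot.
        System.out.println(store.readCurrent()); // prints {1=alice}
    }
}
```

Publishing the version only after the data is written means a reader never resolves a version whose rows are incomplete, at least in this single-threaded sketch.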

...