Add a way to retrieve the properties with which a dataset was created

Description

In CDAP-3051, we want to add an API to allow updating the properties of a dataset. That goes hand in hand with an API to retrieve the current properties. For example, if an app wants to add an index column to an index table, it first needs to know the existing set of index columns.
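As a rough illustration of why an update needs the current properties first, the sketch below merges one more index column into an existing property map. The property key `columnsToIndex`, the comma-separated format, and the class and method names are assumptions made for this example; none of them are stated in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of the CDAP API. */
final class IndexColumnUpdate {

  // Assumed property key; the value is assumed to be a comma-separated list.
  static final String INDEX_COLUMNS_KEY = "columnsToIndex";

  /** Returns a copy of the current properties with one more index column. */
  static Map<String, String> addIndexColumn(Map<String, String> current, String column) {
    List<String> columns = new ArrayList<>();
    String existing = current.get(INDEX_COLUMNS_KEY);
    if (existing != null && !existing.isEmpty()) {
      columns.addAll(Arrays.asList(existing.split(",")));
    }
    // Adding a column that is already indexed is a no-op.
    if (!columns.contains(column)) {
      columns.add(column);
    }
    Map<String, String> updated = new HashMap<>(current);
    updated.put(INDEX_COLUMNS_KEY, String.join(",", columns));
    return updated;
  }
}
```

Without the existing value of the property, the caller could only overwrite the list of index columns, not extend it.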

However, it is not trivial to retrieve the current properties of a dataset. The dataset service does not store the properties that were used to configure the dataset in its metadata. What it actually does is call the dataset definition's configure() method and then store the dataset spec returned from that. That spec has a properties field, but it does not necessarily reflect the properties that were passed in.

In order to reconfigure or to clone a dataset, the client needs to be able to retrieve the original properties with which the dataset was created. This Jira adds an API to do so.

Release Notes

Adds an API to retrieve the properties that were used to configure (or reconfigure) a dataset.

Activity

Andreas Neumann, March 14, 2016 at 11:25 PM

Missed one case: when retrieving the list of datasets in a namespace (GET /v3/namespace/default/data/datasets), we also need to call fixProperties(); currently it simply uses spec.getProperties() for the result.

Andreas Neumann, March 9, 2016 at 12:25 AM

Andreas Neumann, March 7, 2016 at 9:27 PM

Because the existing dataset framework does not store the original dataset properties, this consists of two parts:

  • store the original dataset properties as part of the spec

  • for existing datasets (after an upgrade), implement a method to derive the original properties from the dataset spec

The second part is possible for all built-in datasets, even though some of them manipulate the properties before creating the spec. For user-defined datasets, we can only make a best effort, because we do not know the code that configured them.
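The two parts could fit together roughly as follows. This is a minimal sketch under stated assumptions: the class `SpecWithOriginalProperties`, its fields, and the pluggable `deriver` function are hypothetical names for illustration and do not describe the actual dataset spec class.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for a dataset spec that may carry the original properties. */
final class SpecWithOriginalProperties {

  // Properties as returned by configure(); may differ from what was passed in.
  private final Map<String, String> properties;
  // Properties as passed in by the creator; null for specs written before the upgrade.
  private final Map<String, String> originalProperties;

  SpecWithOriginalProperties(Map<String, String> properties,
                             Map<String, String> originalProperties) {
    this.properties = Collections.unmodifiableMap(new HashMap<>(properties));
    this.originalProperties = originalProperties == null
        ? null
        : Collections.unmodifiableMap(new HashMap<>(originalProperties));
  }

  /**
   * Part one: return the stored original properties when present.
   * Part two: otherwise derive them from the spec properties (best effort).
   */
  Map<String, String> getOriginalProperties(UnaryOperator<Map<String, String>> deriver) {
    if (originalProperties != null) {
      return originalProperties;
    }
    return deriver.apply(properties);
  }
}
```

Passing the derivation in as a function keeps the type-specific knowledge out of the spec itself; for a user-defined dataset the deriver would have to fall back to returning the spec properties unchanged.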

Here are the built-in datasets that do not preserve the original properties 1:1:

  • FileSet: adds a FILESET_VERSION property

  • TimePartitionedFileSet: adds the PARTITIONING property

  • ObjectMappedTable: adds TABLE_SCHEMA and TABLE_SCHEMA_ROW_FIELD

  • LineageDataset: adds CONFLICT_LEVEL=NONE

  • UsageDataset: adds CONFLICT_LEVEL=NONE

Only the first three are public datasets used by developers; the last two are used only by the system.
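For datasets created before the upgrade, the derivation for the types listed above might look like the sketch below. It assumes that dropping the added keys is enough to recover the original properties, and it uses the constant names from the list as literal string keys; the real key strings and any extra handling are not given in this issue, so treat both as placeholders. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: strips properties that built-in types add during configure(). */
final class OriginalProperties {

  // Placeholder keys: the constant names from the list above, used as literal strings.
  private static final Map<String, Set<String>> ADDED_KEYS = Map.of(
      "FileSet", Set.of("FILESET_VERSION"),
      "TimePartitionedFileSet", Set.of("PARTITIONING"),
      "ObjectMappedTable", Set.of("TABLE_SCHEMA", "TABLE_SCHEMA_ROW_FIELD"),
      "LineageDataset", Set.of("CONFLICT_LEVEL"),
      "UsageDataset", Set.of("CONFLICT_LEVEL"));

  /** Returns a copy of the spec properties without the keys the given type adds. */
  static Map<String, String> derive(String typeName, Map<String, String> specProperties) {
    Map<String, String> result = new HashMap<>(specProperties);
    for (String key : ADDED_KEYS.getOrDefault(typeName, Set.of())) {
      result.remove(key);
    }
    // Unknown (for example user-defined) types: best effort, return everything.
    return result;
  }
}
```

One limitation of this approach: if a caller had explicitly passed one of these keys at creation time, stripping it would lose that value, which is part of why the result for older datasets is only an approximation.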

Andreas Neumann, March 7, 2016 at 9:23 PM

Fixed

Details

Created March 7, 2016 at 9:22 PM
Updated March 16, 2016 at 6:24 PM
Resolved March 16, 2016 at 6:24 PM