...
```java
interface RuntimeContext {
  ...
  Admin getAdmin();
}

interface Admin {
  boolean datasetExists(String name) throws DatasetManagementException;
  @Nullable String getDatasetType(String name) throws DatasetManagementException;
  @Nullable DatasetProperties getDatasetProperties(String name) throws DatasetManagementException;
  void createDataset(String name, String type, DatasetProperties properties) throws DatasetManagementException;
  void updateDataset(String name, DatasetProperties properties) throws DatasetManagementException;
  void dropDataset(String name) throws DatasetManagementException;
  void truncateDataset(String name) throws DatasetManagementException;
}
```
All methods throw:
- InstanceNotFoundException if a dataset does not exist (except for datasetExists()).
- InstanceConflictException if creation of a dataset conflicts with an existing dataset.
- DatasetManagementException for errors encountered inside the dataset framework, or inside the dataset type's DatasetAdmin code.
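As a rough illustration of how a program might use these operations, the sketch below creates a dataset only if it is missing. It is not CDAP code: the exception classes and the reduced Admin interface are simplified stand-ins so that the example compiles on its own, properties are modelled as a plain map, InMemoryAdmin and ensureDataset() are hypothetical helpers, and it assumes that InstanceNotFoundException and InstanceConflictException are subclasses of DatasetManagementException.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the CDAP exception types so the sketch compiles on its own.
// Assumption: the two specific exceptions extend DatasetManagementException.
class DatasetManagementException extends Exception {
  DatasetManagementException(String message) { super(message); }
}

class InstanceNotFoundException extends DatasetManagementException {
  InstanceNotFoundException(String name) { super("Dataset not found: " + name); }
}

class InstanceConflictException extends DatasetManagementException {
  InstanceConflictException(String name) { super("Dataset already exists: " + name); }
}

// Reduced version of the Admin interface; properties are modelled as a plain map.
interface Admin {
  boolean datasetExists(String name) throws DatasetManagementException;
  void createDataset(String name, String type, Map<String, String> properties)
    throws DatasetManagementException;
  void truncateDataset(String name) throws DatasetManagementException;
}

// Hypothetical in-memory implementation, for illustration only.
class InMemoryAdmin implements Admin {
  private final Map<String, String> typesByName = new HashMap<>();

  @Override
  public boolean datasetExists(String name) {
    return typesByName.containsKey(name);
  }

  @Override
  public void createDataset(String name, String type, Map<String, String> properties)
    throws InstanceConflictException {
    if (typesByName.containsKey(name)) {
      throw new InstanceConflictException(name);
    }
    typesByName.put(name, type);
  }

  @Override
  public void truncateDataset(String name) throws InstanceNotFoundException {
    if (!typesByName.containsKey(name)) {
      throw new InstanceNotFoundException(name);
    }
    // Nothing to clear: this stand-in stores no data.
  }
}

class AdminUsageSketch {
  // Creates the dataset if it is missing. Returns true only if this call created it.
  static boolean ensureDataset(Admin admin, String name, String type,
                               Map<String, String> properties)
    throws DatasetManagementException {
    if (admin.datasetExists(name)) {
      return false;
    }
    try {
      admin.createDataset(name, type, properties);
      return true;
    } catch (InstanceConflictException e) {
      // Admin operations are not transactional, so the dataset may have been
      // created by someone else between the check and the create call.
      return false;
    }
  }
}
```

Catching InstanceConflictException after the existence check is deliberate: because admin operations are not transactional, the check alone cannot rule out a concurrent creation.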
Why are we adding this to RuntimeContext, and not to DatasetContext? The idea is that DatasetContext represents a way to obtain an instance of a dataset in a transactional context. Admin operations are not transactional, and therefore it seems cleaner to add them separately from DatasetContext. Also, in the future this Admin interface will provide operations that are not related to datasets.
One complication lies hidden in the implementation of getDatasetProperties(): when creating a dataset with a certain set of properties, the dataset framework of CDAP does not store that set of properties. Instead, it calls the configure() method of the dataset definition with these properties. This method returns a dataset spec that contains properties, but it is up to the implementation of every single dataset definition to construct that spec, and it may not reflect the original properties that were passed in. For example, it may contain some properties that are derived from the original properties, or it may use the original properties to set properties on its embedded datasets, but not its own properties.
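The two cases just described can be pictured with a small sketch. Everything in it is hypothetical: SpecSketch is a heavily reduced stand-in for a dataset spec, the two configure variants are invented for illustration and do not correspond to any real dataset definition, and the key "derived.version" and the embedded name "index" are made up.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, heavily reduced stand-in for a dataset spec:
// its own properties plus the specs of its embedded datasets.
class SpecSketch {
  final Map<String, String> properties;
  final Map<String, SpecSketch> embedded;

  SpecSketch(Map<String, String> properties, Map<String, SpecSketch> embedded) {
    this.properties = Collections.unmodifiableMap(new HashMap<>(properties));
    this.embedded = Collections.unmodifiableMap(new HashMap<>(embedded));
  }
}

class ConfigureSketch {
  // Variant 1: keeps the original properties but adds a derived one,
  // so the spec properties are a superset of what the caller passed in.
  static SpecSketch configureAddingDerived(Map<String, String> original) {
    Map<String, String> props = new HashMap<>(original);
    props.put("derived.version", "2"); // hypothetical key and value
    return new SpecSketch(props, Collections.<String, SpecSketch>emptyMap());
  }

  // Variant 2: hands the properties to an embedded dataset and keeps none
  // of its own, so the top-level spec no longer shows the original properties.
  static SpecSketch configureDelegating(Map<String, String> original) {
    SpecSketch inner = new SpecSketch(original, Collections.<String, SpecSketch>emptyMap());
    Map<String, SpecSketch> embedded = new HashMap<>();
    embedded.put("index", inner); // hypothetical embedded dataset name
    return new SpecSketch(Collections.<String, String>emptyMap(), embedded);
  }
}
```

In both variants, reading the properties back from the top-level spec would not return exactly what the caller passed to createDataset().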
We checked all CDAP implementations of DatasetDefinition.configure() (there are about 40 in our code base) for whether they preserve the original dataset properties, and found the following that manipulate the properties before storing them in the spec:
- FileSet: adds a FILESET_VERSION property
- TimePartitionedFileSet: adds the PARTITIONING property
- ObjectMappedTable: adds TABLE_SCHEMA and TABLE_SCHEMA_ROW_FIELD
- LineageDataset: adds CONFLICT_LEVEL=NONE
- UsageDataset: adds CONFLICT_LEVEL=NONE
This is a problem. To address this, we need to change DatasetFramework to store the original properties along with the dataset spec returned by configure(). That is the only way we can reliably reproduce the properties with which a dataset was created. For existing datasets (created pre-3.4), we can only make a best effort: read the spec and get the properties from the spec. If the dataset is of one of the known types above, remove the additional properties before returning; otherwise, return the properties unmodified. Because most users would have used AbstractDataset to define their custom datasets, and AbstractDatasetDefinition leaves the properties unmodified, it is highly likely that this will return the correct set of properties. If a user has implemented their own configure() method that modifies the properties, the hope is that it only adds new properties, and that it can accept creation properties that already contain those. As a last resort, the user can manually update the dataset properties through the REST API to the original properties used to create the dataset.
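The best-effort path for pre-3.4 datasets could look roughly like the sketch below. It is only an illustration under stated assumptions: properties are modelled as a plain map, the dataset type is identified by its simple name, and the key strings are placeholders that reuse the constant names from the list above (the actual property keys behind those constants are not given here). LegacyPropertiesSketch and originalProperties() are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class LegacyPropertiesSketch {
  // Properties that the known dataset types add in configure().
  // Assumption: type names and key strings are placeholders taken from the list above.
  private static final Map<String, Set<String>> ADDED_KEYS = new HashMap<>();

  static {
    ADDED_KEYS.put("FileSet", keys("FILESET_VERSION"));
    ADDED_KEYS.put("TimePartitionedFileSet", keys("PARTITIONING"));
    ADDED_KEYS.put("ObjectMappedTable", keys("TABLE_SCHEMA", "TABLE_SCHEMA_ROW_FIELD"));
    ADDED_KEYS.put("LineageDataset", keys("CONFLICT_LEVEL"));
    ADDED_KEYS.put("UsageDataset", keys("CONFLICT_LEVEL"));
  }

  private static Set<String> keys(String... names) {
    return new HashSet<>(Arrays.asList(names));
  }

  // Best-effort reconstruction of the creation properties from the spec properties.
  // Known types: strip the properties that configure() is known to add.
  // Any other type: return the spec properties unmodified.
  static Map<String, String> originalProperties(String type, Map<String, String> specProperties) {
    Map<String, String> result = new HashMap<>(specProperties);
    Set<String> added = ADDED_KEYS.get(type);
    if (added != null) {
      result.keySet().removeAll(added);
    }
    return result;
  }
}
```

Note that this sketch removes a known key even if the user had set it explicitly at creation time; the spec alone does not allow the two cases to be told apart, which is one more reason to store the original properties going forward.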