...

Therefore caution is required from developers who will use these APIs in programs. Good documentation is key. 

Use Cases

For CDAP 3.4, we are targeting only administrative operations on datasets. Future versions of CDAP will add capabilities to control app and program lifecycle, manipulate metadata, access logs and metrics, etc.

  • As a developer, I am implementing a service that allows users to configure aggregations in a Cube dataset. When a user adds a new aggregation, the service needs to update the dataset's properties. 
  • As a developer, I am implementing a workflow that creates a temporary dataset in a first custom action, and deletes that dataset in the last custom action of the workflow. 
  • As a developer, I am implementing a MapReduce program that writes to a Table. In its beforeSubmit() method, the MapReduce needs to make sure the Table is empty, that is, truncate the table.

Proposed Design

We will add a method to all program contexts, more specifically, to RuntimeContext (which is inherited by all program contexts). 

Code Block
interface RuntimeContext {
  ...
  Admin getAdmin();
}
 
interface Admin {
  boolean datasetExists(String name) throws DatasetManagementException;
  String getDatasetType(String name) throws DatasetManagementException;
  DatasetProperties getDatasetProperties(String name) throws DatasetManagementException;

  void createDataset(String name, String type, DatasetProperties properties) throws DatasetManagementException;
  void updateDataset(String name, DatasetProperties properties) throws DatasetManagementException;
  void dropDataset(String name) throws DatasetManagementException;
  void truncateDataset(String name) throws DatasetManagementException;
}

These methods throw:

  • InstanceNotFoundException if the dataset does not exist (except for datasetExists()).
  • InstanceConflictException if creation of a dataset conflicts with an existing dataset.
  • DatasetManagementException for errors encountered inside the dataset framework, or inside the dataset type's DatasetAdmin code.
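
As an illustration of the temporary-dataset use case, the following sketch exercises the proposed interface against an in-memory stand-in. The Admin interface is reduced to the methods used here. InMemoryAdmin, the simplified exception classes (including the assumption that both specific exceptions extend DatasetManagementException), and the use of Map<String, String> in place of DatasetProperties are assumptions made to keep the example self-contained; none of this is CDAP code.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the exceptions named in the proposal (assumed hierarchy).
class DatasetManagementException extends Exception {
  DatasetManagementException(String message) { super(message); }
}

class InstanceNotFoundException extends DatasetManagementException {
  InstanceNotFoundException(String name) { super("Dataset not found: " + name); }
}

class InstanceConflictException extends DatasetManagementException {
  InstanceConflictException(String name) { super("Dataset already exists: " + name); }
}

// Subset of the proposed Admin interface; Map<String, String> replaces DatasetProperties.
interface Admin {
  boolean datasetExists(String name) throws DatasetManagementException;
  void createDataset(String name, String type, Map<String, String> properties)
      throws DatasetManagementException;
  void dropDataset(String name) throws DatasetManagementException;
}

// Hypothetical in-memory implementation, for illustration only.
class InMemoryAdmin implements Admin {
  private final Map<String, Map<String, String>> datasets = new HashMap<>();

  @Override
  public boolean datasetExists(String name) {
    return datasets.containsKey(name);
  }

  @Override
  public void createDataset(String name, String type, Map<String, String> properties)
      throws DatasetManagementException {
    if (datasets.containsKey(name)) {
      throw new InstanceConflictException(name);
    }
    datasets.put(name, new HashMap<>(properties));
  }

  @Override
  public void dropDataset(String name) throws DatasetManagementException {
    if (datasets.remove(name) == null) {
      throw new InstanceNotFoundException(name);
    }
  }
}

class AdminUsageSketch {
  // Mirrors the workflow use case: create a temporary dataset, then drop it.
  // Returns true if the dataset existed after creation and is gone after the drop.
  static boolean createAndDropTemp(Admin admin, String name) throws DatasetManagementException {
    admin.createDataset(name, "table", new HashMap<>());
    boolean existed = admin.datasetExists(name);
    admin.dropDataset(name);
    return existed && !admin.datasetExists(name);
  }

  public static void main(String[] args) throws DatasetManagementException {
    Admin admin = new InMemoryAdmin();
    System.out.println(createAndDropTemp(admin, "tmp"));
    try {
      admin.dropDataset("tmp"); // already dropped, so this is expected to fail
    } catch (InstanceNotFoundException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

In a real program, the Admin instance would come from the program context via getAdmin() rather than being constructed directly.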

Why are we adding this to RuntimeContext, and not to DatasetContext? DatasetContext represents a way to obtain an instance of a dataset in a transactional context. Admin operations are not transactional, so it seems cleaner to keep them separate from DatasetContext. Also, in the future this Admin interface will provide operations that are not related to datasets.

One complication lies hidden in the implementation of getDatasetProperties(): when a dataset is created with a certain set of properties, the dataset framework of CDAP does not store that set of properties. Instead, it calls the configure() method of the dataset definition with these properties. This method returns a dataset spec that contains properties, but it is up to the implementation of each dataset definition to construct that spec, and the spec may not reflect the original properties that were passed in. For example, it may contain some properties that are derived from the original properties, or it may use the original properties to set properties on its embedded datasets, but not its own properties.
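
A minimal sketch of the problem, assuming a hypothetical configure() that works on plain maps rather than on DatasetProperties and DatasetSpecification. The method, the property keys, and the values are illustrative placeholders, not taken from any CDAP dataset definition.

```java
import java.util.HashMap;
import java.util.Map;

class ConfigureSketch {
  // Hypothetical configure(): builds the spec properties from the creation
  // properties and adds one derived property (placeholder key and value).
  static Map<String, String> configure(Map<String, String> originalProperties) {
    Map<String, String> specProperties = new HashMap<>(originalProperties);
    specProperties.put("derived.version", "2");
    return specProperties;
  }

  public static void main(String[] args) {
    Map<String, String> original = new HashMap<>();
    original.put("base.path", "/data/events");

    Map<String, String> spec = configure(original);

    // Only the spec is persisted, so the creation properties cannot be
    // read back exactly: the two maps are no longer equal.
    System.out.println(spec.equals(original)); // prints "false"
  }
}
```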

We checked all CDAP implementations of DatasetDefinition.configure() (there are about 40 in our code base) for whether they preserve the original dataset properties, and found the following that manipulate the properties before storing them in the spec:

  • FileSet: adds a FILESET_VERSION property
  • TimePartitionedFileSet: adds the PARTITIONING property
  • ObjectMappedTable: adds TABLE_SCHEMA and TABLE_SCHEMA_ROW_FIELD
  • LineageDataset: adds CONFLICT_LEVEL=NONE
  • UsageDataset: adds CONFLICT_LEVEL=NONE

This is a problem. To address it, we need to change DatasetFramework to store the original properties along with the dataset spec returned by configure(). That is the only way we can reliably reproduce the properties with which a dataset was created. For existing datasets (created pre-3.4), we can only make a best effort: read the spec and get the properties from the spec. If the dataset is of one of the known types above, remove the additional properties before returning. Otherwise, return the properties unmodified. Because most users would have used AbstractDataset to define their custom datasets, and AbstractDatasetDefinition leaves the properties unmodified, it is highly likely that this will return the correct set of properties. If a user has implemented their own configure() method that modifies the properties, the hope is that it only adds new properties, and that it can accept creation properties that already contain those. As a last resort, the user can manually update the dataset properties through the REST API to the original properties used to create the dataset.
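
The lookup described above could be sketched roughly as follows. This is only an illustration under several assumptions: properties are modeled as Map<String, String>, a null storedOriginal stands for a dataset created before 3.4, and the type names and property keys are placeholders copied from the list above (the actual key strings behind those constants may differ). LegacyPropertiesSketch and resolveProperties are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class LegacyPropertiesSketch {
  // Properties that known configure() implementations add, keyed by dataset type.
  // Keys are placeholder names from the list above, not the real constant values.
  private static final Map<String, Set<String>> ADDED_BY_TYPE = new HashMap<>();

  static {
    ADDED_BY_TYPE.put("FileSet", setOf("FILESET_VERSION"));
    ADDED_BY_TYPE.put("TimePartitionedFileSet", setOf("PARTITIONING"));
    ADDED_BY_TYPE.put("ObjectMappedTable", setOf("TABLE_SCHEMA", "TABLE_SCHEMA_ROW_FIELD"));
    ADDED_BY_TYPE.put("LineageDataset", setOf("CONFLICT_LEVEL"));
    ADDED_BY_TYPE.put("UsageDataset", setOf("CONFLICT_LEVEL"));
  }

  private static Set<String> setOf(String... keys) {
    return new HashSet<>(Arrays.asList(keys));
  }

  // Returns the stored creation properties if available; otherwise falls back
  // to the spec properties, minus the keys known to be added for this type.
  static Map<String, String> resolveProperties(Map<String, String> storedOriginal,
                                               String type,
                                               Map<String, String> specProperties) {
    if (storedOriginal != null) {
      return new HashMap<>(storedOriginal);
    }
    Map<String, String> result = new HashMap<>(specProperties);
    Set<String> added = ADDED_BY_TYPE.get(type);
    if (added != null) {
      result.keySet().removeAll(added);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> spec = new HashMap<>();
    spec.put("base.path", "/data/events");
    spec.put("FILESET_VERSION", "2");

    // Pre-3.4 dataset of a known type: the added property is stripped.
    System.out.println(resolveProperties(null, "FileSet", spec));
    // Unknown custom type: the spec properties are returned unmodified.
    System.out.println(resolveProperties(null, "MyCustomDataset", spec));
  }
}
```

Note that this fallback removes a known key unconditionally, so it cannot distinguish a property added by configure() from one the user passed explicitly with the same key.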