Motivation
CDAP has, roughly speaking, two sets of APIs: the programming APIs that are available for developing applications (and plugins, datasets, tests, etc.), and administrative or operational APIs that are available to external clients through REST calls. There is a little (but not much) overlap between the two sets of APIs, because they are targeted at distinctly different audiences: one is for developers, the other is for operations.
However, as we evolve the platform and the use cases that it addresses, the line between the two sets of APIs becomes fuzzy. For example, data operations on datasets are only available within application code. But what if an operator needs to inspect some data in order to investigate a problem? There is no easy way to perform a Get on a row of a table through the REST APIs. If the operator is lucky, the Table is explorable and a SQL query can be issued against it. Alternatively, the operator can develop a service that exposes the Table through its REST API. The answer to this particular problem is "Data as a Service", and it is beyond the scope of this document. Instead we will identify more use cases where it is desirable to call operational APIs from inside an application.
What does the operational API include? Here is an (incomplete) list:
- Dataset administration: Create, drop, alter, truncate datasets
- Program lifecycle: Start, stop, pause programs
- Application lifecycle: Create app (from artifact), drop app
- Metrics+Logs: Retrieve/query operational metrics and logs from an application/program/dataset
- Lineage and Metadata: define and query metadata for apps, programs, datasets etc; query the (system-generated) lineage metadata
- etc.
Typically, these kinds of operations are not done from inside an application. But here are some reasons why we may want to allow that:
- Uber-Apps
An Uber-app is an application that controls other applications. A perfect example is our own Hydrator. Right now, Hydrator is a UI that knows how to configure and create ETL pipelines. Creating an ETL pipeline is really the same as deploying a new application, as an instance of an existing artifact with a new configuration. Today, the Hydrator UI understands every single detail about how that configuration has to be created. But this is not ideal, because it moves a lot of logic into the UI (which in theory should be a pure presentation layer). For example, to retrieve the logs for a batch pipeline, the UI checks whether the pipeline is implemented as a MapReduce or a Spark job, and retrieves the logs of that particular program.
It would be a lot cleaner if Hydrator were an actual CDAP app with well-defined endpoints that the UI could rely on. Changes in the underlying implementation can then be hidden from the UI. That requires that the Hydrator app becomes an Uber-app: It needs to be able to create apps, start and stop programs, define schedules, retrieve logs and metrics, etc. In other words, it needs to be able to perform, through a program API, the operations that the Hydrator UI makes via REST today.
- Job Preparation
Sometimes programs need to perform operations on datasets to ensure proper functioning. Examples are:
- A workflow contains a MapReduce program that writes its output to a Table. Before the MapReduce starts, the workflow needs to ensure that the Table is empty. The easiest way to do that would be to call truncate() from a custom action in preparation for the MapReduce. That, however, is an administrative action that is currently not available to workflows.
- A workflow contains two MapReduce jobs A and B. A writes its output to a file, B reads that file. Ideally, because that file is intermediate output, the file should be removed after the workflow is complete. Even better, if the workflow can create a "temporary" file set when it starts, use that for the intermediate output, and delete that file set at the end of the workflow, then it would be completely self-contained. That is only possible if create and drop dataset are available as operations in a workflow action.
- Advanced Scheduling
CDAP offers certain ways to schedule workflows out of the box. But that is not always sufficient for every application. If an application needs to implement specific scheduling logic, it should be able to do that, say, using a worker. This worker would run as a daemon, observe the resources that influence its scheduling decisions, and then trigger a workflow when appropriate. For that, this worker needs a way to retrieve metrics, program run status and history, etc., and also a way to start and stop programs.
What types of programs will need access to these operational APIs? For sure, it makes sense in Services, Workers, and Workflow actions. It may also make sense for batch jobs like MapReduce and Spark, at least during lifecycle methods such as beforeSubmit() and onFinish(). Does it also make sense for flowlets? Perhaps it is not needed there, but it would seem odd to exclude a single program type from this capability when all other program types have it. We will therefore implement an operational interface that is available to all programs through their program context.
Caveats
Allowing programs to perform admin operations can, of course, introduce complications:
- A program can stop itself (maybe that is good?) but it can also start another run of itself (flooding the system with active runs of the same program)
- A program can delete a dataset that it is currently using for data operations
- Admin operations are not transactional. Allowing them to be executed in a transactional context can lead to unexpected behavior. For example, suppose you read an event from a queue and react to it by starting a workflow, and then the commit of the transaction fails. The event becomes available again to the next transaction, which will now start the workflow a second time.
- Admin operations can take a long time, longer than the current transaction timeout.
Therefore caution is required from developers who will use these APIs in programs. Good documentation is key.
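The replay problem described in the transactional caveat can be illustrated with a toy simulation. This is only a sketch and not CDAP code: ReplaySketch, processOne() and simulate() are hypothetical names, the queue is a plain in-memory deque, and a counter stands in for the non-transactional "start workflow" operation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: none of these names are CDAP APIs.
class ReplaySketch {
  // Counts workflow starts; stands in for a non-transactional admin operation.
  static int workflowStarts = 0;

  // Handles the head event in a simulated transaction. The event is removed from
  // the queue only if the commit succeeds; the workflow start is never undone.
  static boolean processOne(Deque<String> queue, boolean commitSucceeds) {
    String event = queue.peekFirst();
    if (event == null) {
      return false;
    }
    workflowStarts++;          // admin operation: takes effect immediately
    if (commitSucceeds) {
      queue.pollFirst();       // commit: the event is consumed
      return true;
    }
    return false;              // rollback: the event stays in the queue
  }

  // One event, one failed commit, one successful retry.
  static int simulate() {
    workflowStarts = 0;
    Deque<String> queue = new ArrayDeque<>();
    queue.add("event-1");
    processOne(queue, false);  // first attempt: the commit fails
    processOne(queue, true);   // retry: the commit succeeds
    return workflowStarts;     // 2: the workflow was started twice for one event
  }
}
```

In this model the rollback restores the queue but cannot restore the side effect, which is why a single event ends up starting the workflow twice.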
Use Cases
For CDAP 3.4, we are targeting only administrative operations on datasets. Future versions of CDAP will add capabilities to control app and program lifecycle, manipulate metadata, access logs and metrics, etc.
- As a developer, I am implementing a service that allows users to configure aggregations in a Cube dataset. When a user adds a new aggregation, the service needs to update the dataset's properties.
- As a developer, I am implementing a workflow that creates a temporary dataset in a first custom action, and deletes that dataset in the last custom action of the workflow.
- As a developer, I am implementing a MapReduce program that writes to a Table. In its beforeSubmit() method, the MapReduce needs to make sure the Table is empty, that is, truncate the table.
Proposed Design
We will add a method to all program contexts, more specifically, to RuntimeContext (which is inherited by all program contexts).
interface RuntimeContext {
  ...
  Admin getAdmin();
}

interface Admin {
  boolean datasetExists(String name);
  @Nullable String getDatasetType(String name);
  @Nullable DatasetProperties getDatasetProperties(String name);
  void createDataset(String name, String type, DatasetProperties properties) throws DatasetManagementException;
  void updateDataset(String name, DatasetProperties properties) throws DatasetManagementException;
  void dropDataset(String name) throws DatasetManagementException;
  void truncateDataset(String name) throws DatasetManagementException;
}
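As a rough illustration of how two of the use cases above might look against the proposed interface, here is a minimal, self-contained sketch. Several things in it are assumptions made for the example: DatasetProperties and DatasetManagementException are reduced to stubs, Admin is cut down to the four methods the sketch calls, InMemoryAdmin is a hypothetical stand-in for what getAdmin() would return, the helper names in JobPreparation are invented, and the type strings "table" and "fileSet" are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

// Stubs for the types named in the proposal; the real classes carry more detail.
class DatasetManagementException extends Exception {
  DatasetManagementException(String message) { super(message); }
}

class DatasetProperties {
  static final DatasetProperties EMPTY = new DatasetProperties();
}

// Subset of the proposed Admin interface that this sketch needs.
interface Admin {
  boolean datasetExists(String name);
  void createDataset(String name, String type, DatasetProperties properties) throws DatasetManagementException;
  void dropDataset(String name) throws DatasetManagementException;
  void truncateDataset(String name) throws DatasetManagementException;
}

// Hypothetical in-memory Admin, for illustration only: dataset name -> row count.
class InMemoryAdmin implements Admin {
  final Map<String, Integer> rowCounts = new HashMap<>();

  public boolean datasetExists(String name) { return rowCounts.containsKey(name); }

  public void createDataset(String name, String type, DatasetProperties properties)
      throws DatasetManagementException {
    if (rowCounts.containsKey(name)) {
      throw new DatasetManagementException("Dataset already exists: " + name);
    }
    rowCounts.put(name, 0);
  }

  public void dropDataset(String name) throws DatasetManagementException {
    if (rowCounts.remove(name) == null) {
      throw new DatasetManagementException("Dataset does not exist: " + name);
    }
  }

  public void truncateDataset(String name) throws DatasetManagementException {
    if (!rowCounts.containsKey(name)) {
      throw new DatasetManagementException("Dataset does not exist: " + name);
    }
    rowCounts.put(name, 0);
  }
}

class JobPreparation {
  // What a beforeSubmit() could do: make sure the output table exists and is empty.
  static void ensureEmptyTable(Admin admin, String name) throws DatasetManagementException {
    if (admin.datasetExists(name)) {
      admin.truncateDataset(name);
    } else {
      admin.createDataset(name, "table", DatasetProperties.EMPTY);
    }
  }

  // What the first and last custom actions of a workflow could do together:
  // create a temporary dataset, run the body, and drop the dataset afterwards.
  static void runWithTemporaryDataset(Admin admin, String name, Runnable body)
      throws DatasetManagementException {
    admin.createDataset(name, "fileSet", DatasetProperties.EMPTY);
    try {
      body.run();
    } finally {
      admin.dropDataset(name);
    }
  }
}
```

In a real program the Admin instance would presumably come from the program context via getAdmin() rather than being constructed directly. Note that the exists-then-truncate sequence is not atomic, so another program could drop the dataset between the two calls.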