Motivation
CDAP has - roughly speaking - two sets of APIs: the programming APIs that are available for developing applications (and plugins, datasets, tests, etc.), and administrative or operational APIs that are available to external clients through REST calls. There is a little (but not much) overlap between the two sets of APIs, and that has its foundation in the fact that the two APIs are targeted for distinctly different audiences: one is for developers, the other is for operations.
...
Typically, these kinds of operations are not done from inside an application. But here are some reasons why we may want to allow that:
- Uber-Apps
An Uber-app is an application that controls other applications. A perfect example is our own Hydrator. Right now, Hydrator is a UI that knows how to configure and create ETL pipelines. Creating an ETL pipeline is really the same as deploying a new application, as an instance of existing artifact with a new configuration. Today, the Hydrator UI understands every single detail about how that configuration has to be created. But this is not perfect, because it moves a lot of logic into the UI (which in theory should be a pure presentation layer). For example, to retrieve the logs for a batch pipeline, the UI checks whether the pipeline is implemented as MapReduce or a Spark job, and retrieves the logs of that particular program.
...
- It would be a lot cleaner if Hydrator were an actual CDAP app with well-defined endpoints that the UI could rely on. Changes in the underlying implementation can then be hidden from the UI. That requires that the Hydrator app becomes an Uber-app: It needs to be able to create apps, start and stop programs, define schedules, retrieve logs and metrics etc. In other words, it needs to be able to perform the operations that the Hydrator UI makes via REST today through a program API.
- Job Preparation
Sometimes programs need to perform operations on datasets to ensure proper functioning. Examples are:- A workflow contains a MapReduce program that writes its output to a Table. Before the MapReduce starts, the workflow needs to ensure that the Table is empty. The easiest way to that would be to call truncate() from a custom action in preparation for the MapReduce. That, however, is an administrative action is currently not available to workflows.
- A workflow contains two MapReduce jobs A and B. A writes its output to a file, B reads that file. Ideally, because that file is intermediate output, the file should be removed after the workflow is complete. Even better, if the workflow can create a "temporary" file set when it starts, use that for the intermediate output, and delete that file set at the end of the workflow, then it would be completely self-contained. That is only possible if create and drop dataset are available as operations in a workflow action.
- Advanced Scheduling
CDAP offers certain ways to schedule workflows out of the box. But that is not always sufficient for every application. If an application needs to implement specific scheduling logic, it should be able to that, say, using a worker. This worker would run as a daemon, observe the resources that influence its scheduling decisions, and then trigger a workflow when appropriate. For that, this worker needs a way to retrieve metrics, program run status and history, etc, and also a way to start and stop programs.
What types of programs will need access to these operational APIs? For sure, it makes sense in Services, Workers, and Workflow actions. It may also make sense for batch jobs like MapReduce and Spark, at least during lifecycle methods such as beforeSubmit() and onFinish(). Does it also make sense for flowlets? Perhaps it is not needed there, but it would seem odd excluded a single program type from this capability when all other programs can. We will therefore implement an operational interface that is available to all programs through their program context.
Caveats
Allowing programs to perform admin operations can, of course, introduce complications:
- A program can stop itself (maybe that is good?) but it can also start another run of itself (flooding the system with active runs of the same program)
- A program can delete a dataset that it is currently using for data operations
- Admin operations are not transactional. Allowing to execute them in a transactional context can lead to unexpected behavior. For example, if you read an event from a queue and react to it by starting a workflow, then the commit of the transaction fails, making the event available again to the next transaction, which will now start the workflow again.
- Admin operations can take a long time, longer than the current transaction timeout.
Therefore caution is required from developers who will use these APIs in programs. Good documentation is key.