...
- JIRA: CDAP-3969: CDAP should offer a temporary location to store results between jobs of a workflow.
Case A) Consider the sample workflow above, from the CDAP Workflow guide. The goal is to process the raw purchase events from the purchaseEvents stream and find the purchases made by each customer and the purchases made of each product. When the Workflow runs, PurchaseEventParser reads the raw events from the purchaseEvents stream and writes Purchase objects to the purchaseRecords dataset. This dataset is then used as input by the PurchaseCounterByCustomer and PurchaseCounterByProduct MapReduce programs to create the customerPurchases and productPurchases datasets respectively. Note that when the Workflow completes, the user is only interested in the final datasets created by the Workflow run: customerPurchases and productPurchases. The dataset purchaseRecords, created by the MapReduce program PurchaseEventParser, is temporary and no longer required once the Workflow run completes.
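A minimal sketch of the Case A pipeline, following the AbstractWorkflow configuration style used in the CDAP Workflow guide; the commented-out markTransient(...) call is hypothetical and only illustrates the kind of per-dataset declaration this JIRA asks for.

```java
import co.cask.cdap.api.workflow.AbstractWorkflow;

// Sketch of the workflow from Case A. Program names follow the example above;
// the MapReduce implementations themselves are omitted.
public class PurchaseWorkflow extends AbstractWorkflow {

  @Override
  public void configure() {
    setName("PurchaseWorkflow");
    setDescription("Parses raw purchase events and aggregates them per customer and per product");

    // First node: parse raw events from 'purchaseEvents' into 'purchaseRecords'.
    addMapReduce("PurchaseEventParser");

    // Both aggregations consume the intermediate 'purchaseRecords' dataset.
    fork()
      .addMapReduce("PurchaseCounterByCustomer")
    .also()
      .addMapReduce("PurchaseCounterByProduct")
    .join();

    // Hypothetical API (not part of CDAP): declare 'purchaseRecords' as a
    // temporary dataset scoped to a single run, deleted when the run completes.
    // markTransient("purchaseRecords");
  }
}
```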
Case B) A MapReduce program in CDAP can write output to multiple datasets. Consider that the Workflow above is modified so that PurchaseEventParser also writes to an errorRecords dataset along with the purchaseRecords dataset. The errorRecords dataset contains the raw events from the purchaseEvents stream for which parsing failed. In this case, errorRecords may not be temporary, since the user may want to analyze it to find the sources that frequently emit bad data.
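As a rough sketch of Case B, the following assumes the AbstractMapReduce API and a MapReduceContext call for registering more than one output dataset; the exact method for adding outputs may differ between CDAP versions, and the mapper and reducer that split good records from bad ones are omitted.

```java
import co.cask.cdap.api.mapreduce.AbstractMapReduce;
import co.cask.cdap.api.mapreduce.MapReduceContext;

// Sketch of PurchaseEventParser writing to two datasets: parsed purchases go to
// 'purchaseRecords', unparseable raw events go to 'errorRecords'.
public class PurchaseEventParser extends AbstractMapReduce {

  @Override
  public void beforeSubmit(MapReduceContext context) throws Exception {
    // Input setup from the 'purchaseEvents' stream is omitted here.
    context.addOutput("purchaseRecords");
    context.addOutput("errorRecords");
  }
}
```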
Case C) If for some reason the MapReduce program PurchaseEventParser is not generating the expected amount of data, the user may want to debug it by marking the dataset purchaseRecords as non-transient, so that purchaseRecords is not deleted when the next Workflow run completes.
Case D) Assume that in the Workflow above, the errorRecords dataset written by PurchaseEventParser also stores a running count of error records, and that the current count is 200. Now the user wants to run the Workflow for debugging purposes without updating the counts inside the errorRecords dataset. For this debug run, errorRecords can be marked transient. Once debugging is done, errorRecords can be marked non-transient again, after which Workflow runs should resume updating the error counts from 200. For such use cases, where a dataset is toggled frequently between transient and non-transient mode, it is better to persist the list of transient datasets associated with a particular run of the Workflow in its run record, so that the user can reason about the state of the dataset. For example, in this case the user can verify that the counts in errorRecords are as expected.
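One way to picture the per-run toggle from Cases C and D is as runtime arguments supplied when the Workflow is started; the dataset.<name>.transient keys below are hypothetical and do not correspond to an existing CDAP runtime argument.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical runtime arguments for a single debug run of the Workflow.
// The 'dataset.<name>.transient' keys are illustrative only.
public class DebugRunArguments {
  public static Map<String, String> forDebugRun() {
    Map<String, String> args = new HashMap<>();
    args.put("dataset.purchaseRecords.transient", "false"); // keep intermediate data for inspection (Case C)
    args.put("dataset.errorRecords.transient", "true");     // do not update the persisted error counts (Case D)
    return args;
  }
}
```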
- JIRA: CDAP-4075: Error handling for Workflows.
Case A) When the Workflow fails for some reason, the user may want to notify the appropriate parties via email, possibly with the cause of the failure and the node at which the Workflow failed.
Case B) When the Workflow fails for some reason at a particular node, the user may want to clean up the datasets and files created by the previous nodes in the Workflow.
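A purely hypothetical sketch of the kind of failure hook Cases A and B ask for; neither the onWorkflowFailure callback nor the FailureInfo type exists in CDAP, and the email and cleanup helpers are placeholders. It only illustrates the desired behavior: notify the right people on failure and clean up the data produced by earlier nodes.

```java
// Everything below is hypothetical; it illustrates the behavior requested by
// Cases A and B above rather than any existing CDAP API.
public class PurchaseWorkflowFailureHandler {

  /** Hypothetical summary of a failed Workflow run. */
  public interface FailureInfo {
    String getWorkflowName();
    String getFailedNodeName();
    Throwable getCause();
  }

  /** Hypothetical hook invoked when a Workflow run fails. */
  public void onWorkflowFailure(FailureInfo info) {
    // Case A: notify the appropriate parties with the cause of the failure
    // and the node at which the Workflow failed.
    String subject = "Workflow " + info.getWorkflowName()
        + " failed at node " + info.getFailedNodeName();
    sendEmail("ops@example.com", subject, String.valueOf(info.getCause()));

    // Case B: clean up data created by the nodes that ran before the failure.
    deleteDatasetContents("purchaseRecords");
  }

  private void sendEmail(String to, String subject, String body) {
    // Placeholder: wire this to whatever notification mechanism is available.
  }

  private void deleteDatasetContents(String datasetName) {
    // Placeholder: drop or truncate the intermediate dataset.
  }
}
```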
...