...

    1. JIRA: CDAP-3969: CDAP should offer a temporary location to store results between jobs of a workflow.

      Case A)

      Consider the above sample Workflow from the CDAP Workflow guide. The goal is to process the raw purchase events from the purchaseEvents stream and find the purchases made by each customer and the purchases made for each product. When the Workflow runs, PurchaseEventParser reads the raw events from the purchaseEvents stream and writes the purchase objects to the purchaseRecords dataset. This dataset is later used as input by the PurchaseCounterByCustomer and PurchaseCounterByProduct MapReduce programs, which create the datasets customerPurchases and productPurchases respectively. Note that when the Workflow completes, the user is only interested in the final datasets created by the Workflow run: customerPurchases and productPurchases. The dataset purchaseRecords created by the MapReduce program PurchaseEventParser is local to the Workflow and is no longer required once the Workflow run has completed.

      Case B)
      A MapReduce program in CDAP can write to multiple datasets. Suppose the above Workflow is modified so that PurchaseEventParser also writes to an errorRecords dataset in addition to purchaseRecords. The errorRecords dataset contains the raw events from the purchaseEvents stream for which parsing failed. In this case, errorRecords may not be local, since the user may want to analyze it with another CDAP application to find the sources that frequently emit bad data.

      Case C)
      If, for some reason, the MapReduce program PurchaseEventParser does not generate the expected amount of data, the user may want to keep the purchaseRecords dataset even after the Workflow run completes, so that they can debug the problem further.

      Case D)
      The Workflow DAG on the UI shows which nodes will be executed by the Workflow and in what order. The user can click on any link between the nodes and mark it as local so that it can be kept after the Workflow run. This will cause the output of the source node for that link to be marked as local. The user can click on the link again to mark the output as non-local. Solving this use case poses a few challenges, though. A MapReduce program can write to multiple output datasets, and which ones it writes to is decided dynamically during the Workflow run. How would we know beforehand which dataset to mark as local? The user may also want only one of the output datasets of a MapReduce program to be local. And how would this work when the link is between a custom action and a MapReduce program, or between a custom action and a predicate?
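The dataset lifecycle in Cases A to C can be illustrated with a small, self-contained Python sketch. It is a toy model, not the CDAP API: `WorkflowRun`, `run_purchase_workflow`, the `keep_local` flag, and the `customer,product` event format are all hypothetical names and assumptions introduced here for illustration. It shows purchaseRecords being dropped when the run completes (Case A), errorRecords surviving because it is not local (Case B), and local datasets being retained on request for debugging (Case C).

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowRun:
    """Toy model of one Workflow run (hypothetical, not the CDAP API)."""
    local_datasets: set                        # datasets declared local to the Workflow
    keep_local: bool = False                   # Case C: retain local datasets for debugging
    store: dict = field(default_factory=dict)  # dataset name -> list of records

    def write(self, dataset, records):
        # Create the dataset on first write, then append the records.
        self.store.setdefault(dataset, []).extend(records)

    def complete(self):
        """Drop local datasets at the end of the run unless asked to keep them."""
        if not self.keep_local:
            for name in self.local_datasets:
                self.store.pop(name, None)
        return sorted(self.store)


def run_purchase_workflow(raw_events, keep_local=False):
    # Only purchaseRecords is local; errorRecords stays available (Case B).
    run = WorkflowRun(local_datasets={"purchaseRecords"}, keep_local=keep_local)

    # Stand-in for PurchaseEventParser: assumes events look like "customer,product".
    parsed, errors = [], []
    for event in raw_events:
        parts = event.split(",")
        if len(parts) == 2:
            parsed.append(tuple(parts))
        else:
            errors.append(event)
    run.write("purchaseRecords", parsed)
    run.write("errorRecords", errors)

    # Stand-ins for PurchaseCounterByCustomer and PurchaseCounterByProduct.
    run.write("customerPurchases", [customer for customer, _ in parsed])
    run.write("productPurchases", [product for _, product in parsed])

    # Returns the names of the datasets that remain after the run.
    return run.complete()
```

Calling `run_purchase_workflow(events)` returns only the non-local dataset names, while `run_purchase_workflow(events, keep_local=True)` also returns purchaseRecords. How such a flag would actually be exposed to the user is exactly the open question raised in Case D.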

                   

    2. JIRA: CDAP-4075: Error handling for Workflows.
      Case A) When the Workflow fails for some reason, the user may want to notify the appropriate parties via email, possibly including the cause of the failure and the node at which the Workflow failed.
      Case B) When the Workflow fails at a particular node, the user may want to clean up the datasets and files created by the previous nodes in the Workflow.
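
The two error-handling cases above can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not a proposal for the CDAP API: `run_workflow`, and the `notify` and `cleanup` callbacks, are hypothetical names. The sketch runs nodes in order and, on the first failure, cleans up after the nodes that already completed (Case B) and reports the failed node together with the cause (Case A).

```python
def run_workflow(nodes, notify, cleanup):
    """Run nodes in order; return True if all succeed, False on the first failure.

    nodes   -- list of (name, callable) pairs, executed sequentially
    notify  -- hypothetical callback taking (failed_node_name, exception)
    cleanup -- hypothetical callback taking the name of a completed node
    """
    completed = []
    for name, action in nodes:
        try:
            action()
        except Exception as exc:
            # Case B: undo the work of previously completed nodes, newest first.
            for done in reversed(completed):
                cleanup(done)
            # Case A: report which node failed and why (e.g. by sending an email).
            notify(name, exc)
            return False
        completed.append(name)
    return True
```

In a real system `notify` might send an email and `cleanup` might delete datasets and files; here both are left as plain callbacks so that the policy stays in the caller's hands.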

...