Operations Dashboard

API Requirements

Graph

Information Provided:

  1. List of namespaces
  2. Start Time
  3. End Time
  4. Time Resolution

Information Needed:

  1. Memory usage over time, per namespace and per cluster, along with the maximum available
  2. Core usage over time, per namespace and per cluster, along with the maximum available
  3. Aggregates bucketed by the time resolution, distinguishing pipelines from custom apps:
    1. Manual start
    2. Scheduled start
    3. Status (RUNNING, COMPLETED, FAILED)
    4. Delay between STARTING and RUNNING
  4. If the start time and end time are in the future, show the scheduled apps
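
The bucketed aggregate in item 3 could be computed roughly as follows. This is a minimal sketch, not the CDAP implementation: the `Run` record, its field names, the `bucket_runs` function, and the use of epoch seconds are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass
class Run:
    """Hypothetical run record; field names are illustrative only."""
    namespace: str
    kind: str                  # "pipeline" or "custom_app"
    start_method: str          # "manual" or "scheduled"
    status: str                # "RUNNING", "COMPLETED", or "FAILED"
    starting_ts: int           # epoch seconds when the run entered STARTING
    running_ts: Optional[int]  # epoch seconds when it reached RUNNING, if ever


def bucket_runs(runs: Iterable[Run], namespaces: Set[str],
                start: int, end: int, resolution: int) -> Dict[int, dict]:
    """Group runs that started in [start, end) into buckets of `resolution` seconds."""
    buckets = defaultdict(
        lambda: {"start_methods": Counter(), "statuses": Counter(), "delays": []}
    )
    for run in runs:
        if run.namespace not in namespaces:
            continue
        if not (start <= run.starting_ts < end):
            continue
        # Align the run to the beginning of its bucket.
        bucket_start = start + ((run.starting_ts - start) // resolution) * resolution
        bucket = buckets[bucket_start]
        # Keying by (kind, value) keeps pipelines and custom apps distinguishable.
        bucket["start_methods"][(run.kind, run.start_method)] += 1
        bucket["statuses"][(run.kind, run.status)] += 1
        if run.running_ts is not None:
            # Delay between STARTING and RUNNING, in seconds.
            bucket["delays"].append(run.running_ts - run.starting_ts)
    return dict(buckets)
```

Each key of the result is a bucket start time, so a caller can plot one point per bucket and derive a minimum, maximum, or average delay from the `delays` list.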

 

Details when Graph Time Range is Clicked

Information Provided:

  1. List of namespaces
  2. Start Time
  3. End Time

Information Needed:

  1. Entity Details:
    1. Namespace
    2. App Name
    3. Program Type
    4. Program Name
    5. Parent Artifact
    6. Duration
    7. User
    8. Start Method (time schedule, trigger, manual)
    9. Status
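
As an illustration of the entity details listed above, the sketch below selects the runs that started inside the clicked time range and produces one row per run. The `ProgramRun` type, its field names, and the rule that a run belongs to the range when its start time falls in [start, end) are assumptions, not a defined API. Missing start method or parent artifact values are shown as "unknown", as described under Answered Questions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set

UNKNOWN = "unknown"  # placeholder for fields that an upgraded instance lacks


@dataclass
class ProgramRun:
    """Hypothetical run record; field names are illustrative only."""
    namespace: str
    app_name: str
    program_type: str
    program_name: str
    user: str
    status: str
    start_ts: int                        # epoch seconds
    end_ts: Optional[int] = None         # None while the run is still active
    parent_artifact: Optional[str] = None
    start_method: Optional[str] = None   # "time schedule", "trigger", or "manual"


def entity_details(runs: Iterable[ProgramRun], namespaces: Set[str],
                   start: int, end: int) -> List[Dict[str, object]]:
    """Return one detail row per run that started in [start, end)."""
    rows = []
    for run in runs:
        if run.namespace not in namespaces:
            continue
        if not (start <= run.start_ts < end):
            continue
        rows.append({
            "namespace": run.namespace,
            "app_name": run.app_name,
            "program_type": run.program_type,
            "program_name": run.program_name,
            "parent_artifact": run.parent_artifact or UNKNOWN,
            # Duration is left empty for runs that have not ended yet.
            "duration": None if run.end_ts is None else run.end_ts - run.start_ts,
            "user": run.user,
            "start_method": run.start_method or UNKNOWN,
            "status": run.status,
        })
    return rows
```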

 

 

Reports View

Information Provided:

  1. List of namespaces
  2. List of statuses
  3. Start Time
  4. End Time

 

Information Needed:

  1. Entity Details:
    1. Namespace
    2. App Name
    3. Program Type
    4. Program Name
    5. Parent Artifact
    6. Duration
    7. User
    8. Start Method
    9. Status
    10. Runtime Arguments
    11. Memory Usage
    12. Number of CPU
    13. Number of Containers
    14. Number of Log Warnings
    15. Number of Log Errors
    16. Number of records out
  2. Summary Counts:
    1. Runs per namespace
    2. Time range
    3. Pipelines (Realtime vs Batch), custom apps
    4. Durations: min, max, average
    5. Last Started: Oldest and Newest
    6. List of users & count per user
    7. List of start methods & count per method
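
The summary counts above could be derived from the entity detail rows along these lines. This is a sketch under assumptions: the rows are plain dictionaries, and the keys (`namespace`, `kind`, `duration`, `start_ts`, `user`, `start_method`) and the `summarize` function are hypothetical names chosen for the example.

```python
from collections import Counter
from typing import Dict, List


def summarize(rows: List[Dict[str, object]], start: int, end: int) -> Dict[str, object]:
    """Compute report summary counts from a list of run rows."""
    # Runs that are still active have no duration and are skipped here.
    durations = [r["duration"] for r in rows if r.get("duration") is not None]
    starts = [r["start_ts"] for r in rows]
    return {
        "runs_per_namespace": dict(Counter(r["namespace"] for r in rows)),
        "time_range": {"start": start, "end": end},
        # kind is assumed to be "realtime", "batch", or "custom_app".
        "runs_per_kind": dict(Counter(r["kind"] for r in rows)),
        "durations": {
            "min": min(durations),
            "max": max(durations),
            "average": sum(durations) / len(durations),
        } if durations else None,
        "last_started": {
            "oldest": min(starts),
            "newest": max(starts),
        } if starts else None,
        "runs_per_user": dict(Counter(r["user"] for r in rows)),
        "runs_per_start_method": dict(Counter(r["start_method"] for r in rows)),
    }
```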

 

 

Answered Questions:

1. For older versions of CDAP that are upgraded to 5.0.0 and lack some information (e.g. program start method, program parent artifact), that information cannot be shown and will be displayed as unknown.

2. Future timeline: the design should be updated to grey out the statuses and the manually started runs in the graph. Only time-triggered schedules will be displayed.

4. How should the runs list be displayed for Batch vs Realtime vs Custom Apps (collapsed by workflow? What if the programs are started outside a workflow?): in the frontend, users can choose to expand a custom app to show details of the different programs in the app.

5. In the Dashboard view, we need to limit the time window to a fixed range, such as 24 hours, in order to display the data in real time.

6. After the user selects the options and clicks Generate Report, a (Spark?) job will be launched. If the job takes less than a specific time (10 sec?) to finish, the UI will display the report directly. Otherwise, the UI will ask the user to wait for the report. When the job finishes, a permalink will be produced that is accessible only to the user who generated the report. If the user chooses to share the report with others, a different link will be generated that other users can view.

7. The report will only contain programs that are readable by the user who generates the report.
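
The report generation flow in item 6 can be sketched as below. It is only an illustration under assumptions: a thread pool stands in for the (Spark?) job, the 10 second threshold is the tentative value from item 6, and `ReportService` with its methods is a hypothetical name, not an existing CDAP API.

```python
import concurrent.futures
import uuid
from typing import Callable, Dict, Optional


class ReportService:
    """Hypothetical service: waits briefly for a report, else returns PENDING."""

    def __init__(self, wait_seconds: float = 10.0):
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
        self._wait_seconds = wait_seconds
        self._reports: Dict[str, dict] = {}  # report id -> owner and future
        self._shares: Dict[str, str] = {}    # share id -> report id

    def generate(self, user: str, job: Callable[[], object]) -> dict:
        report_id = uuid.uuid4().hex
        future = self._pool.submit(job)
        self._reports[report_id] = {"owner": user, "future": future}
        try:
            # Fast path: the job finished within the threshold, show it directly.
            report = future.result(timeout=self._wait_seconds)
            return {"id": report_id, "state": "READY", "report": report}
        except concurrent.futures.TimeoutError:
            # Slow path: the UI asks the user to wait and polls get() later.
            return {"id": report_id, "state": "PENDING"}

    def get(self, user: str, report_id: str) -> Optional[object]:
        """Permalink access: only the user who generated the report may read it."""
        entry = self._reports[report_id]
        if entry["owner"] != user:
            raise PermissionError("report is private to its owner")
        future = entry["future"]
        return future.result() if future.done() else None

    def share(self, user: str, report_id: str) -> str:
        """Create a separate link that other users can view."""
        if self._reports[report_id]["owner"] != user:
            raise PermissionError("only the owner can share a report")
        share_id = uuid.uuid4().hex
        self._shares[share_id] = report_id
        return share_id

    def get_shared(self, share_id: str) -> Optional[object]:
        future = self._reports[self._shares[share_id]]["future"]
        return future.result() if future.done() else None
```

Keeping the share identifier separate from the report identifier means that the owner's permalink never has to be handed to other users.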

Action Items:

1. Feasibility of features (core & memory usage, start methods for programs): Need to modify the Twill ApplicationMaster to get container information. For MapReduce and Spark, how to get container info is TBD.

2. Need to clarify, in the Memory Usage chart, the difference between Namespace(s) Usage and App Usage.

3. When zooming in to a resolution of one hour, can multiple hours be selected? In each row, what are Detail and Summary?

4. Is it feasible to get resource usage for each namespace?

Created in 2020 by Google Inc.