Operations Dashboard
API Requirements
Graph
Information Provided:
List of namespaces
Start Time
End Time
Time Resolution
Information Needed:
Memory Usage over time per namespace, cluster, and max available
Core Usage over time per namespace, cluster, and max available
Run counts bucketed by the time resolution (within each aggregate, we should be able to identify pipeline vs. custom app):
Manual start
Scheduled start
Status (RUNNING, COMPLETED, FAILED)
Delay between STARTING and RUNNING
If the start time and end time are in the future, show the scheduled apps
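The bucketing requirement above can be sketched as follows. This is a minimal illustration, not the actual CDAP implementation; the run-record fields (`start_ts`, `is_pipeline`) and the function name are assumptions. Given a start time, end time, and time resolution (all in seconds), it counts runs per bucket, separating pipelines from custom apps as the aggregate requires:

```python
from collections import Counter

def bucket_runs(runs, start, end, resolution):
    """Bucket run records into fixed-width time buckets.

    Each bucket counts pipeline runs and custom-app runs separately,
    mirroring the requirement that the aggregate distinguish the two.
    Field names on the run records are illustrative assumptions.
    """
    buckets = {t: Counter() for t in range(start, end, resolution)}
    for run in runs:
        ts = run["start_ts"]
        if start <= ts < end:
            bucket = start + ((ts - start) // resolution) * resolution
            kind = "pipeline" if run.get("is_pipeline") else "custom_app"
            buckets[bucket][kind] += 1
    return buckets
```

The same shape extends naturally to the other bucketed aggregates (manual vs. scheduled start, status, STARTING-to-RUNNING delay) by counting on a different key per run record.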
Details when Graph Time Range is Clicked
Information Provided:
List of namespaces
Start Time
End Time
Information Needed:
Entity Details:
Namespace
App Name
Program Type
Program Name
Parent Artifact
Duration
User
Start Method (time schedule, trigger, manual)
Status
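One way to model the entity details returned by this call is a flat record per program run, as in the sketch below. The field names and types are assumptions for illustration, not the actual CDAP API schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record shape for one row in the details view.
@dataclass
class ProgramRunDetail:
    namespace: str
    app_name: str
    program_type: str               # e.g. "workflow", "spark", "mapreduce"
    program_name: str
    parent_artifact: Optional[str]  # may be unknown for pre-5.0.0 runs
    duration_sec: int
    user: str
    start_method: str               # "time schedule" | "trigger" | "manual"
    status: str                     # RUNNING, COMPLETED, FAILED
```

Making `parent_artifact` optional reflects the upgrade case noted under Answered Questions, where runs from older CDAP versions lack that information.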
Reports View
Information Provided:
List of namespaces
List of statuses
Start Time
End Time
Information Needed:
Entity Details:
Namespace
App Name
Program Type
Program Name
Parent Artifact
Duration
User
Start Method
Status
Runtime Arguments
Memory Usage
Number of CPUs
Number of Containers
Number of Log Warnings
Number of Log Errors
Number of records out
Summary Counts:
Runs per namespace
Time range
Pipelines (Realtime vs Batch), custom apps
Durations: min, max, average
Last Started: Oldest and Newest
List of users & count per user
List of start methods & count per method
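The summary counts above can be computed in one pass over the run records. The sketch below assumes each record carries `duration`, `user`, and `start_method` fields; these names are illustrative, not the real API schema:

```python
from collections import Counter

def summarize(runs):
    """Compute the Reports View summary counts over a list of run records.

    Returns total run count, min/max/average duration, and per-user and
    per-start-method counts. Record field names are assumptions.
    """
    durations = [r["duration"] for r in runs]
    return {
        "runs": len(runs),
        "duration_min": min(durations),
        "duration_max": max(durations),
        "duration_avg": sum(durations) / len(durations),
        "per_user": Counter(r["user"] for r in runs),
        "per_start_method": Counter(r["start_method"] for r in runs),
    }
```

Last-started oldest/newest and the pipeline vs. custom-app split would follow the same pattern, keyed on a start timestamp or an app-type field.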
Answered Questions:
1. For an older version of CDAP upgraded to 5.0.0 that lacks some information (e.g. program start method, program parent artifact), that information will be displayed as unknown.
2. For a future timeline, the design should be updated to grey out the statuses and manually started runs in the graph. Only time-trigger schedules will be displayed.
4. How should the runs list be displayed (Batch vs. Realtime vs. Custom Apps; collapsed by workflow? What about programs started outside a workflow?): at the frontend, users can choose to expand a custom app to show details of the different programs in the app.
5. In the Dashboard view, we need to limit the time window to a fixed range, such as 24 hours, in order to display in real time.
6. After the user selects the options and clicks Generate Report, a (Spark?) job will be launched. If the job takes less than a specific time (10 sec?) to finish, the UI will directly display the report; otherwise, the UI will ask the user to wait for the report. When the job finishes, a permalink will be produced, accessible only to the user who generated it. If the user chooses to share the report with others, a different link will be generated that is viewable by other users.
7. The report will only contain programs that are readable by the user who generates the report.
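The report-generation flow in item 6 can be sketched as a simple client-side polling loop. Everything here is an assumption for illustration: the `poll_status`/`fetch_report` callables stand in for whatever real endpoints the job exposes, and the 10-second quick-display threshold is the tentative value from the notes:

```python
import time

QUICK_THRESHOLD = 10.0  # seconds; the tentative "10 sec?" cutoff from item 6

def wait_for_report(poll_status, fetch_report, poll_interval=0.01):
    """Poll a (hypothetical) report job until it completes.

    Returns (report, direct), where direct is True when the job finished
    within QUICK_THRESHOLD, meaning the UI can display the report
    immediately instead of asking the user to wait.
    """
    start = time.monotonic()
    while poll_status() != "COMPLETED":
        time.sleep(poll_interval)
    elapsed = time.monotonic() - start
    return fetch_report(), elapsed < QUICK_THRESHOLD
```

Permalink handling (owner-only link vs. a separately generated shareable link) would sit on top of this, keyed off the finished job's identifier.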
Action Items:
1. Feasibility of features (core & memory usage, start methods for programs): need to modify the Twill ApplicationMaster to get container information. For MapReduce and Spark, how to get container information is TBD.
2. Need to clarify, in the Memory Usage chart, the difference between Namespace(s) Usage and App Usage.
3. When zooming in to a resolution of one hour, can multiple hours be selected? In each row, what are Detail and Summary?
4. Is it feasible to get resource usage for each namespace?
Created in 2020 by Google Inc.