Currently the UI retrieves only the first 100 runs of a program, no matter how many runs exist. The current runs API cannot return the total run count without a full table scan. It is also hard to use the current API to get the previous and next runs for a given run id.
Introduce a new count API that returns the run count of a program, modify the existing runs API to accept a cursor to support pagination, and work out the upgrade step needed for program runs created before 5.1.
We currently have the following API to retrieve the run record information:
    @GET
    @Path("/apps/{app-name}/{program-type}/{program-name}/runs")
    public void programHistory(HttpRequest request, HttpResponder responder,
                               @PathParam("namespace-id") String namespaceId,
                               @PathParam("app-name") String appName,
                               @PathParam("program-type") String type,
                               @PathParam("program-name") String programName,
                               @QueryParam("status") String status,
                               @QueryParam("start") String startTs,
                               @QueryParam("end") String endTs,
                               @QueryParam("limit") @DefaultValue("100") final int resultLimit) {
      ...
    }
Currently the API only supports queries by start timestamp and end timestamp with a limit. The only way to get the run count is to set the limit to its maximum value (Integer.MAX_VALUE, since the parameter is an int), which does a full table scan on the AppMetadataStore and might cause a tx timeout if there are too many run records. Pagination is also hard to build on a timestamp range alone, since the caller cannot know how many runs fall between two timestamps.
Also, specifying a start and end timestamp with a limit currently returns the run records closest to the end timestamp, because the row keys of our run records look like the following:
    runRecordStarting|namespace|app|version|programtype|program|runid
    runRecordStarted|namespace|app|version|programtype|program|runid
    runRecordSuspended|namespace|app|version|programtype|program|runid
    runRecordCompleted|namespace|app|version|programtype|program|inverted start time|runid
Currently running programs do not have a timestamp in their row keys, so their order in the response is random (and they are currently always included in the response no matter what timestamps are specified). For completed programs, we store the inverted start time, which is Long.MAX_VALUE - startTime, so a table scan returns the most recent runs first. With the current API there is no way to get the runs closest to the start timestamp first. We should also consider making the row keys consistent, because the current formats add unnecessary complexity to the store.
Run records should be ordered by 1) state (active runs before completed runs) and 2) start time (latest run first). The current API guarantees the state order and the time order for completed runs, but not the time order for active runs: all active run records are returned in random order. Without that ordering it is also difficult to find the previous/next run record, since we would have to scan all three active row key formats to decide.
To refactor, we can add the inverted start time to the row key of active runs, as we already do for completed runs. On investigation, having three different row key formats for active program runs seems unnecessary, so we can refactor the row keys to the following:
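The effect of the inverted start time on scan order can be illustrated with a minimal sketch. It is not the AppMetadataStore code: the class name, the zero-padded string encoding, and the in-memory `TreeMap` standing in for the table are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration; not the actual AppMetadataStore implementation.
final class InvertedStartTimeDemo {

  // Inverts the start time so that newer runs sort before older ones.
  static long invert(long startTime) {
    return Long.MAX_VALUE - startTime;
  }

  // Builds a sortable key suffix: zero-padded inverted start time, then run id.
  // Padding to 19 digits makes string order match numeric order.
  static String rowKey(long startTime, String runId) {
    return String.format("%019d|%s", invert(startTime), runId);
  }

  // Returns run ids in the order an ascending table scan would visit them.
  static List<String> scanOrder(long[] startTimes, String[] runIds) {
    TreeMap<String, String> table = new TreeMap<>();
    for (int i = 0; i < startTimes.length; i++) {
      table.put(rowKey(startTimes[i], runIds[i]), runIds[i]);
    }
    return new ArrayList<>(table.values());
  }
}
```

With start times 100, 300, and 200, an ascending scan visits the run started at 300 first, which is why a limited scan returns the most recent runs.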
    runRecordActive|namespace|app|version|programtype|program|inverted start time|runid
    runRecordCompleted|namespace|app|version|programtype|program|inverted start time|runid
Strictly speaking, we need an upgrade step after this change. But since we require users to stop all programs before they upgrade, no active row keys should exist at upgrade time, so we might be able to skip the upgrade step.
We can also handle this in a straightforward way: scan for any records in the old format and rewrite them in the new format. Because records in the old format are effectively invisible to the new code, we do not need to worry about concurrent updates. After they are written in the new format, the run record corrector will change any that are not actually running to failed.
Alternatively, as we did for the old versioned row keys, we can read both the old and the new formats. All old-format row keys will be gone once the programs transition to their next state.
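A rough sketch of the scan-and-rewrite idea follows. The table is modeled as an in-memory `NavigableMap`, the `startTimeOf` function is a hypothetical stand-in for deriving the start time from a run id, and deleting the old row after rewriting is an assumption rather than something decided here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.function.ToLongFunction;

// Hypothetical illustration of rewriting old active row keys in the new format.
final class ActiveKeyMigrationDemo {

  static final String[] OLD_PREFIXES = {
    "runRecordStarting|", "runRecordStarted|", "runRecordSuspended|"
  };

  // Rewrites every old-format active record and returns how many were migrated.
  // Assumes keys are well formed: prefix|program path|runid.
  static int migrate(NavigableMap<String, String> table, ToLongFunction<String> startTimeOf) {
    // Collect first so the map is not modified while it is being iterated.
    List<String> oldKeys = new ArrayList<>();
    for (String key : table.keySet()) {
      for (String prefix : OLD_PREFIXES) {
        if (key.startsWith(prefix)) {
          oldKeys.add(key);
          break;
        }
      }
    }
    for (String oldKey : oldKeys) {
      int first = oldKey.indexOf('|');
      int last = oldKey.lastIndexOf('|');
      String program = oldKey.substring(first + 1, last);
      String runId = oldKey.substring(last + 1);
      long inverted = Long.MAX_VALUE - startTimeOf.applyAsLong(runId);
      String newKey = String.format("runRecordActive|%s|%019d|%s", program, inverted, runId);
      // Assumption: the old row is removed once the new row is written.
      table.put(newKey, table.remove(oldKey));
    }
    return oldKeys.size();
  }
}
```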
To navigate between run records, we will introduce a new query param, cursor, where each cursor is a run id. We can scan the table from the cursor to get the results we want. Since we store the start time in inverted order, scanning from the latest run toward older runs is easy (so getting older runs is cheap). However, it is difficult to get newer runs starting from an old run, since we currently do not support reverse scans: we would have to start from the first row and keep scanning until we reach the run we want.
To fix this, we would have to add a second row key with the non-inverted start time for each run record. This doubles the number of row keys and requires a complex upgrade step, so we need to decide whether it is worth the effort:
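The forward (older-runs) navigation described above could look roughly like the following sketch. The class, the `page` method, and the in-memory maps are hypothetical; in particular, the `startTimes` lookup stands in for whatever mechanism maps a run id back to its start time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration of cursor-based paging over inverted-time row keys.
final class RunCursorDemo {
  private final NavigableMap<String, String> table = new TreeMap<>();
  // Assumption: the start time can be recovered from a run id.
  private final Map<String, Long> startTimes = new HashMap<>();

  static String rowKey(long startTime, String runId) {
    return String.format("%019d|%s", Long.MAX_VALUE - startTime, runId);
  }

  void addRun(String runId, long startTime) {
    startTimes.put(runId, startTime);
    table.put(rowKey(startTime, runId), runId);
  }

  // Returns up to `limit` run ids strictly older than the cursor run,
  // or the newest runs when the cursor is null.
  List<String> page(String cursor, int limit) {
    NavigableMap<String, String> view = table;
    if (cursor != null) {
      Long start = startTimes.get(cursor);
      if (start == null) {
        throw new IllegalArgumentException("Unknown cursor: " + cursor);
      }
      // Start the scan just after the cursor's row key.
      view = table.tailMap(rowKey(start, cursor), false);
    }
    List<String> result = new ArrayList<>();
    for (String runId : view.values()) {
      if (result.size() >= limit) {
        break;
      }
      result.add(runId);
    }
    return result;
  }
}
```

A caller would pass the last run id of one page as the cursor for the next page. Paging toward newer runs has no equally cheap equivalent here without a reverse scan or a second, non-inverted key.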
    runRecordActive|namespace|app|version|programtype|program|inverted start time|runid
    runRecordCompleted|namespace|app|version|programtype|program|inverted start time|runid
    runRecordActive|namespace|app|version|programtype|program|start time|runid
    runRecordCompleted|namespace|app|version|programtype|program|start time|runid
In 5.1, we will only support navigating forward (to older runs).
To get the total count without scanning the whole table, we will introduce a new row that records the count, like the following:
    runCount|namespace|app|version|programtype|program
Every time we add a new run record, which happens when the run starts provisioning, we will increment this value. This change should be straightforward, but the upgrade step will be messy.
If we want run counts per status, we need to be careful about state transitions. We should decrement the active count and increment the completed count if and only if a run transitions from an active status to a completed status. Supporting this will make the upgrade step even messier.
For the upgrade step, we need to scan all the run records in the app meta table and update the count. The upgrade has two steps: 1. scan the old run records to get the old count; 2. merge this count into the newly introduced count row. However, there are several things to take care of:
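The bookkeeping for per-status counts can be sketched as follows. This is a simplified model under stated assumptions: only two states are modeled, counts live in an in-memory map keyed like the `runCount` row, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of keeping per-status run counts consistent.
final class RunCountDemo {
  enum State { ACTIVE, COMPLETED }

  private final Map<String, Long> counts = new HashMap<>();
  private final Map<String, State> runStates = new HashMap<>();

  static String countKey(String program, State state) {
    return "runCount|" + program + "|" + state;
  }

  // Applies a state change, touching the counts only on a real transition.
  void recordState(String program, String runId, State newState) {
    State old = runStates.get(runId);
    if (old == newState || old == State.COMPLETED) {
      return; // duplicate notification, or a completed run cannot change again
    }
    if (old != null) {
      counts.merge(countKey(program, old), -1L, Long::sum);
    }
    counts.merge(countKey(program, newState), 1L, Long::sum);
    runStates.put(runId, newState);
  }

  long count(String program, State state) {
    return counts.getOrDefault(countKey(program, state), 0L);
  }

  long total(String program) {
    return count(program, State.ACTIVE) + count(program, State.COMPLETED);
  }
}
```

Ignoring repeated notifications for the same state is what keeps the total stable: a run is counted once when it first appears and only moves between buckets afterwards.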
Pros:
Cons:
Approach 2 is similar to approach 1 but does not require an upgrade step. We scan the program run records the first time the count API is called, which means the first query will take some time.
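A minimal sketch of this lazy approach is shown below. It assumes a per-program marker that records whether the legacy scan has already been merged, so that increments made before the first count call are not lost; the marker, the `legacyScan` function, and the class itself are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.ToLongFunction;

// Hypothetical illustration of computing the legacy count on first access.
final class LazyRunCountDemo {
  private final Map<String, Long> counts = new HashMap<>();
  private final Set<String> merged = new HashSet<>();
  // Stand-in for scanning run records written before the count row existed.
  private final ToLongFunction<String> legacyScan;
  int scans = 0;

  LazyRunCountDemo(ToLongFunction<String> legacyScan) {
    this.legacyScan = legacyScan;
  }

  // Called whenever a new run record is written.
  void increment(String program) {
    counts.merge(program, 1L, Long::sum);
  }

  // The first call per program pays for the scan; later calls read the count row.
  long getCount(String program) {
    if (merged.add(program)) {
      scans++;
      counts.merge(program, legacyScan.applyAsLong(program), Long::sum);
    }
    return counts.getOrDefault(program, 0L);
  }
}
```

The sketch only works if the legacy scan covers exactly the records that were never counted by `increment`; how to draw that boundary is left open here.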
Pros:
Cons:
Path | Method | Description | Query param | Request | Response |
---|---|---|---|---|---|
/v3/namespaces/{namespace-id}/runcount | POST | Returns the run counts for a batch of programs | might need status (depending on UI decision) | | |
/v3/namespaces/{namespace-id}/apps/{app-name}/{program-type}/{program-name}/runcount | GET | Returns the count of program with specified version | might need status (depending on UI decision) | | the program count |
/v3/namespaces/{namespace-id}/apps/{app-name}/versions/{app-version}/{program-type}/{program-name}/runcount | GET | Returns the count of program with specified version | might need status (depending on UI decision) | | the program count |
Not Applicable
A user needs at least one privilege (READ, WRITE, EXECUTE, or ADMIN) on the program to get its total run count.