Sorting and Pagination for Search
Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
IntroductionÂ
The home page of the CDAP UI shows all the available entities in CDAP, along with the ability to narrow them using searching and filtering. The backend search API should support sorting and pagination for this view.
Goals
- Allow users to specify a field to sort by as well as a sorting order
- Support only name and creation-time as the sort fields for now
- Support sort order (ascending and descending)
- Support only one combination of sort parameters per request
- Support pagination by way of offset and limit parameters.
User StoriesÂ
- As a CDAP user, I would like to sort entities on the home page by specifying a sort field and a sort order
- As a CDAP user, I would like to paginate search results on the home page for easy navigation. I'd like to specify the page size, and be able to jump to a page from the specified options.
Design
The design for this feature can be broken down into the following high-level (ideal) steps, for a given search query, sort order, sort field, offset and limit:
- Fetch all the search results
- Sort them by the specified sort field in the specified sort order
- In the ordered search results, pick items between indexes offset and offset + limit.
The major problems with this approach are that step 1 could return a large amount of results, thus making it less performant to do steps 2 and 3 in-memory. Another major problem is having to do a full table scan for every search request. In some more detail, the problems with this approach:
Sorting: We would like to specify a sort field and sort order. However, given that scans in HBase are only lexicographically ordered by the rowkey, we would need to have rowkeys containing each of the supported sort fields, leading to duplicate storage.
Pagination: Ideally, we would like to only fetch results between offset and offset + limit from the storage (HBase). However, given that in HBase you cannot specify a scan to start at the nth row, this is not possible.
Another major problem is for complex queries. CDAP currently splits queries by special characters (such as whitespace), searches for each of the components and then weighs results based on how many components were contained in them, then returns the search results in the descending order of weights. To assign weights, the algorithm needs a view into the entire search results, hence pagination cannot be done before assigning weights.
Approach
Approach #1
This approach is similar to how pagination is done for logs.
- Maintain a startRow. If offset is 0, start row is the first row, else start row is part of the previous search response as the parameterÂ
cursor
. - FetchÂ
limit
number of results from the offset usingÂPageFilter
. - In the response, return the rowkey of the last result as theÂ
cursor
.
Pros
- Much less in-memory work.
Cons
- Depends on natural ordering from HBase for sorting. Hence, the storage footprint would increase, since we'd have to store duplicate rows for each sort field.
- The algorithm to assign weights will not be correct, since we only do a partial fetch from HBase.
- This algorithm works well for pagination when the usecase is to always return the next n results, like logs. Since we want to show the number of pages, and want users to be able to jump to pages, or go back a page, this algorithm will not work
Approach #2
To counter the problems specified above, the following algorithm is proposed:
- Define an in-memory constant:Â
DEFAULT_SEARCH_FETCH_SIZE
. Say this is 1000. - If offset+limit >Â
DEFAULT_SEARCH_FETCH_SIZE
, fetch size will beÂDEFAULT_FETCH_SIZE + offset + limit
- To assign weights correctly, returnÂ
fetch size
number of search results for each component of the search query.- So, for the query purchase dev history, this step will fetch 3 X the fetch size
- Sort the response of step 3 by the specified sort field in memory with the specified order
- Return the results in between offset and limit from 4.
Pros
- No extra storage footprint, since it does not depend on HBase's natural sorting order.
- For most common usecases in entity search, this algorithm will work (navigating through the first few pages, with reasonable offset and limit).
- Solves the pagination usecase of jumping to a given page
- The weight-assignment algorithm for complex queries will have approximations (will not be 100% accurate, because we still rely only upon a subset of results). However, it will be much more accurate than approach#1
Cons
- A lot more in-memory work than Approach #1
- In the worst case (whenÂ
offset + limit
is far greater thanÂDEFAULT_FETCH_SIZE
), this algorithm could potentially result in large, less performant scans.
Looking at the pros and cons, Approach #2 seems like a more feasible and correct approach of the two.
Approach #3
The complexity of this problem is mostly because of:
- We're using search as a means to also list entities.
- IndexedTable is backed by HBase and hence is not suitable for:
- multiple sorting (ordering) schemes
- offsetting - HBase is suited towards starting a scan with the specified rowkey, but not for starting from the 'nth' element in the scan
Search vs List
Let's think of the both these problems separately, even though we want to serve them using the same API. Their individual requirements are:
List:
Â
- Used in the home page
- Search query is *
- Search usually has a filter (apps, dataset, etc)
- Ordering depends on the specified sort parameters. For fully correct ordering, you need to get all results, then run a sort.
- Pagination is required
Search:
Â
- Used in global search and tracker
- Search query is an actual query
- Search usually does not have a filter
- Ordering depends on relevance. For fully correct ordering, you need to get all results, then weigh them, and return in descending order of weights
- Pagination is required
These requirements translate to:
List:
- offset and limit are required for pagination
- sortBy and sortOrder are required for ordering
- In the best case scenario, we want HBase to only return sorted results between offset and offset+limit using server side Filters
Search:
- offset and limit are required for pagination
- weight is required for relevance/ordering
- In the best case scenario, we want HBase to only return results sorted in the descending order of weight between offset and offset+limit using server side Filters.
To achieve this
List:
- Sorting can be solved by storing the following additional indexes for every entity:
- Name
- Reverse Name
- Creation time
- Reverse Creation time
- Based on the selected sortBy and sortOrder, the API will select the right index and run a scan by the selected index
- Offsetting can be solved using batching, which will have a performance hit for results towards the end.Â
Search:
- Search will always only scan by the default index - it will not use the 4 indexes mentioned above (for the time-being)
- Accurate weighting will require a full scan, since the last element in a scan can have the maximum weight for a given search query.
- However, since search is not as frequent an operation as list (list is on the homepage), can we afford to do a full scan for every search?
Optimizations for listing
When used for listing, the following optimizations can be made to the search API
Batching
HBase scans can be batched, so we don't always do a full scan. Only when offset + limit is greater than the batch size, there may be multiple batches, leading to a performance hit.
Cursors
For offsetting, one possible optimization is: for every page of search results (with the given sortBy, sortOrder, offset and limit combination), return a list of n previous and n next cursors. A cursor is a marker that can be used to derive the rowkey for the first element of a given page. The search request can then also include this marker, and the scan will start from the marker.
Upgrade
There is an upgrade step already that rebuilds indexes for all entities. Implementation should ensure that this step works with the new indexes.
API changes
New Programmatic APIs
N/A
Deprecated Programmatic APIs
N/A
Updated REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
/v3/namespaces/<namespace-id>/search | GET | Additional parameters will be added for this API:
 | 200 - On success 404 - When application is not available 500 - Any internal errors |  |
 |  |  |  |  |
Deprecated REST API
Path | Method | Description |
---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application |
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- The UI was already updated in the 4.0 preview release to send the new parameters. There may be minor updates during implementation
Security ImpactÂ
The search API already returns only authorized results, there will be no changes to that.
Impact on Infrastructure OutagesÂ
System behavior (if applicable - document impact on downstream [ YARN, HBase etc ]Â component failures) and how does the design take care of these aspect
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
 |  |  |
 |  |  |
 |  |  |
 |  |  |
Releases
Release X.Y.Z
Release X.Y.Z
Related Work
- Work #1
- Work #2
- Work #3
Â
Future work
Created in 2020 by Google Inc.