Sorting and Pagination for Search

Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction 

The home page of the CDAP UI shows all the available entities in CDAP, along with the ability to narrow them using searching and filtering. The backend search API should support sorting and pagination for this view.

Goals

  1. Allow users to specify a field to sort by as well as a sorting order
  2. Support only name and creation-time as the sort fields for now
  3. Support sort order (ascending and descending)
  4. Support only one combination of sort parameters per request
  5. Support pagination by way of offset and limit  parameters.

User Stories 

  • As a CDAP user, I would like to sort entities on the home page by specifying a sort field and a sort order
  • As a CDAP user, I would like to paginate search results on the home page for easy navigation. I'd like to specify the page size, and be able to jump to a page from the specified options.

Design

The design for this feature can be broken down into the following high-level (ideal) steps, for a given search query, sort order, sort field, offset and limit:

  1. Fetch all the search results
  2. Sort them by the specified sort field in the specified sort order
  3. In the ordered search results, pick items between indexes offset and offset + limit.

The major problems with this approach are that step 1 could return a large amount of results, thus making it less performant to do steps 2 and 3 in-memory. Another major problem is having to do a full table scan for every search request. In some more detail, the problems with this approach:

Sorting: We would like to specify a sort field and sort order. However, given that scans in HBase are only lexicographically ordered by the rowkey, we would need to have rowkeys containing each of the supported sort fields, leading to duplicate storage.

Pagination: Ideally, we would like to only fetch results between offset and offset + limit from the storage (HBase). However, given that in HBase you cannot specify a scan to start at the nth row, this is not possible.

Another major problem is for complex queries. CDAP currently splits queries by special characters (such as whitespace), searches for each of the components and then weighs results based on how many components were contained in them, then returns the search results in the descending order of weights. To assign weights, the algorithm needs a view into the entire search results, hence pagination cannot be done before assigning weights.

Approach

Approach #1

This approach is similar to how pagination is done for logs.

  1. Maintain a startRow. If offset is 0, start row is the first row, else start row is part of the previous search response as the parameter cursor.
  2. Fetch limit number of results from the offset using PageFilter.
  3. In the response, return the rowkey of the last result as the cursor.

Pros

  1. Much less in-memory work.

Cons

  1. Depends on natural ordering from HBase for sorting. Hence, the storage footprint would increase, since we'd have to store duplicate rows for each sort field.
  2. The algorithm to assign weights will not be correct, since we only do a partial fetch from HBase.
  3. This algorithm works well for pagination when the usecase is to always return the next n results, like logs. Since we want to show the number of pages, and want users to be able to jump to pages, or go back a page, this algorithm will not work

Approach #2

To counter the problems specified above, the following algorithm is proposed:

  1. Define an in-memory constant: DEFAULT_SEARCH_FETCH_SIZE. Say this is 1000.
  2. If offset+limit > DEFAULT_SEARCH_FETCH_SIZE, fetch size will be DEFAULT_FETCH_SIZE + offset + limit
  3. To assign weights correctly, return fetch size number of search results for each component of the search query.
    1. So, for the query purchase dev history, this step will fetch 3 X the fetch size
  4. Sort the response of step 3 by the specified sort field in memory with the specified order
  5. Return the results in between offset and limit from 4.

Pros

  1. No extra storage footprint, since it does not depend on HBase's natural sorting order.
  2. For most common usecases in entity search, this algorithm will work (navigating through the first few pages, with reasonable offset and limit).
  3. Solves the pagination usecase of jumping to a given page
  4. The weight-assignment algorithm for complex queries will have approximations (will not be 100% accurate, because we still rely only upon a subset of results). However, it will be much more accurate than approach#1

Cons

  1. A lot more in-memory work than Approach #1
  2. In the worst case (when offset + limit is far greater than DEFAULT_FETCH_SIZE), this algorithm could potentially result in large, less performant scans.

Looking at the pros and cons, Approach #2 seems like a more feasible and correct approach of the two.

Approach #3

The complexity of this problem is mostly because of:

  1. We're using search as a means to also list entities.
  2. IndexedTable is backed by HBase and hence is not suitable for:
    1. multiple sorting (ordering) schemes
    2. offsetting - HBase is suited towards starting a scan with the specified rowkey, but not for starting from the 'nth' element in the scan

Search vs List

Let's think of the both these problems separately, even though we want to serve them using the same API. Their individual requirements are:

List:

 

  1. Used in the home page
  2. Search query is *
  3. Search usually has a filter (apps, dataset, etc)
  4. Ordering depends on the specified sort parameters. For fully correct ordering, you need to get all results, then run a sort.
  5. Pagination is required

Search:

 

  1. Used in global search and tracker
  2. Search query is an actual query
  3. Search usually does not have a filter
  4. Ordering depends on relevance. For fully correct ordering, you need to get all results, then weigh them, and return in descending order of weights
  5. Pagination is required

These requirements translate to:

List:

  1. offset and limit are required for pagination
  2. sortBy and sortOrder are required for ordering
  3. In the best case scenario, we want HBase to only return sorted results between offset and offset+limit using server side Filters

Search:

  1. offset and limit are required for pagination
  2. weight is required for relevance/ordering
  3. In the best case scenario, we want HBase to only return results sorted in the descending order of weight between offset and offset+limit using server side Filters.

To achieve this

List:

  1. Sorting can be solved by storing the following additional indexes for every entity:
    1. Name
    2. Reverse Name
    3. Creation time
    4. Reverse Creation time
  2. Based on the selected sortBy and sortOrder, the API will select the right index and run a scan by the selected index
  3. Offsetting can be solved using batching, which will have a performance hit for results towards the end. 

Search:

  1. Search will always only scan by the default index - it will not use the 4 indexes mentioned above (for the time-being)
  2. Accurate weighting will require a full scan, since the last element in a scan can have the maximum weight for a given search query.
  3. However, since search is not as frequent an operation as list (list is on the homepage), can we afford to do a full scan for every search?

Optimizations for listing

When used for listing, the following optimizations can be made to the search API

Batching

HBase scans can be batched, so we don't always do a full scan. Only when offset + limit is greater than the batch size, there may be multiple batches, leading to a performance hit.

Cursors

For offsetting, one possible optimization is: for every page of search results (with the given sortBy, sortOrder, offset and limit combination), return a list of n previous and n next cursors. A cursor is a marker that can be used to derive the rowkey for the first element of a given page. The search request can then also include this marker, and the scan will start from the marker.

Upgrade

There is an upgrade step already that rebuilds indexes for all entities. Implementation should ensure that this step works with the new indexes.

API changes

New Programmatic APIs

N/A

Deprecated Programmatic APIs

N/A

Updated REST APIs

PathMethodDescriptionResponse CodeResponse
/v3/namespaces/<namespace-id>/searchGET

Additional parameters will be added for this API:

  1. sortField
  2. sortOrder
  3. offset
  4. limit

 

200 - On success

404 - When application is not available

500 - Any internal errors

 

     

Deprecated REST API

PathMethodDescription
/v3/apps/<app-id>GETReturns the application spec for a given application

CLI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

UI Impact or Changes

  • The UI was already updated in the 4.0 preview release to send the new parameters. There may be minor updates during implementation

Security Impact 

The search API already returns only authorized results, there will be no changes to that.

Impact on Infrastructure Outages 

System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect

Test Scenarios

Test IDTest DescriptionExpected Results
   
   
   
   

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3

 

Future work

Created in 2020 by Google Inc.