Checklist

User Stories Documented
User Stories Reviewed
Design Reviewed
APIs reviewed
Release priorities assigned
Test cases reviewed
Blog post

Introduction

The home page of the CDAP UI shows all the available entities in CDAP, along with the ability to narrow them using searching and filtering. The backend search API should support sorting and pagination for this view.

Goals

Allow users to specify a field to sort by as well as a sorting order
Support only name and creation-time as the sort fields for now
Support sort order (ascending and descending)
Support only one combination of sort parameters per request
Support pagination by way of offset and limit parameters.

User Stories

As a CDAP user, I would like to sort entities on the home page by specifying a sort field and a sort order
As a CDAP user, I would like to paginate search results on the home page for easy navigation. I'd like to specify the page size, and be able to jump to a page from the specified options.

Design

The design for this feature can be broken down into the following high-level (ideal) steps, for a given search query, sort order, sort field, offset and limit:

Fetch all the search results
Sort them by the specified sort field in the specified sort order
In the ordered search results, pick items between indexes offset and offset + limit.

The major problems with this approach are that step 1 could return a large amount of results, thus making it less performant to do steps 2 and 3 in-memory. Another major problem is having to do a full table scan for every search request. In some more detail, the problems with this approach:

Sorting: We would like to specify a sort field and sort order. However, given that scans in HBase are only lexicographically ordered by the rowkey, we would need to have rowkeys containing each of the supported sort fields, leading to duplicate storage.

Pagination: Ideally, we would like to only fetch results between offset and offset + limit from the storage (HBase). However, given that in HBase you cannot specify a scan to start at the nth row, this is not possible.

Another major problem is for complex queries. CDAP currently splits queries by special characters (such as whitespace), searches for each of the components and then weighs results based on how many components were contained in them, then returns the search results in the descending order of weights. To assign weights, the algorithm needs a view into the entire search results, hence pagination cannot be done before assigning weights.

Approach

Approach #1

This approach is similar to how pagination is done for logs.

Maintain a startRow. If offset is 0, start row is the first row, else start row is part of the previous search response as the parameter cursor.
Fetch limit number of results from the offset using PageFilter.
In the response, return the rowkey of the last result as the cursor.

Pros

Much less in-memory work.

Cons

Depends on natural ordering from HBase for sorting. Hence, the storage footprint would increase, since we'd have to store duplicate rows for each sort field.
The algorithm to assign weights will not be correct, since we only do a partial fetch from HBase.
This algorithm works well for pagination when the usecase is to always return the next n results, like logs. Since we want to show the number of pages, and want users to be able to jump to pages, or go back a page, this algorithm will not work

Approach #2

To counter the problems specified above, the following algorithm is proposed:

Define an in-memory constant: DEFAULT_SEARCH_FETCH_SIZE. Say this is 1000.
If offset+limit > DEFAULT_SEARCH_FETCH_SIZE, fetch size will be DEFAULT_FETCH_SIZE + offset + limit
To assign weights correctly, return fetch size number of search results for each component of the search query.
1. So, for the query purchase dev history, this step will fetch 3 X the fetch size
Sort the response of step 3 by the specified sort field in memory with the specified order
Return the results in between offset and limit from 4.

Pros

No extra storage footprint, since it does not depend on HBase's natural sorting order.
For most common usecases in entity search, this algorithm will work (navigating through the first few pages, with reasonable offset and limit).
Solves the pagination usecase of jumping to a given page
The weight-assignment algorithm for complex queries will have approximations (will not be 100% accurate, because we still rely only upon a subset of results). However, it will be much more accurate than approach#1

Cons

A lot more in-memory work than Approach #1
In the worst case (when offset + limit is far greater than DEFAULT_FETCH_SIZE), this algorithm could potentially result in large, less performant scans.

Looking at the pros and cons, Approach #2 seems like a more feasible and correct approach of the two.

Approach #3

The complexity of this problem is mostly because of:

We're using search as a means to also list entities.
IndexedTable is backed by HBase and hence is not suitable for:
1. multiple sorting (ordering) schemes
2. offsetting - HBase is suited towards starting a scan with the specified rowkey, but not for starting from the 'nth' element in the scan

Search vs List

Let's think of the both these problems separately, even though we want to serve them using the same API. Their individual requirements are:

List:

Used in the home page
Search query is *
Search usually has a filter (apps, dataset, etc)
Ordering depends on the specified sort parameters. For fully correct ordering, you need to get all results, then run a sort.
Pagination is required

Search:

Used in global search and tracker
Search query is an actual query
Search usually does not have a filter
Ordering depends on relevance. For fully correct ordering, you need to get all results, then weigh them, and return in descending order of weights
Pagination is required

These requirements translate to:

List:

offset and limit are required for pagination
sortBy and sortOrder are required for ordering
In the best case scenario, we want HBase to only return sorted results between offset and offset+limit using server side Filters

Search:

offset and limit are required for pagination
weight is required for relevance/ordering
In the best case scenario, we want HBase to only return results sorted in the descending order of weight between offset and offset+limit using server side Filters.

To achieve this

List:

Sorting can be solved by storing the following additional indexes for every entity:
1. Name
2. Reverse Name
3. Creation time
4. Reverse Creation time
Based on the selected sortBy and sortOrder, the API will select the right index and run a scan by the selected index
Offsetting can be solved using batching, which will have a performance hit for results towards the end.

Search:

Search will always only scan by the default index - it will not use the 4 indexes mentioned above (for the time-being)
Accurate weighting will require a full scan, since the last element in a scan can have the maximum weight for a given search query.
However, since search is not as frequent an operation as list (list is on the homepage), can we afford to do a full scan for every search?

Optimizations for listing

When used for listing, the following optimizations can be made to the search API

Batching

HBase scans can be batched, so we don't always do a full scan. Only when offset + limit is greater than the batch size, there may be multiple batches, leading to a performance hit.

Cursors

For offsetting, one possible optimization is: for every page of search results (with the given sortBy, sortOrder, offset and limit combination), return a list of n previous and n next cursors. A cursor is a marker that can be used to derive the rowkey for the first element of a given page. The search request can then also include this marker, and the scan will start from the marker.

Upgrade

There is an upgrade step already that rebuilds indexes for all entities. Implementation should ensure that this step works with the new indexes.

API changes

New Programmatic APIs

N/A

Deprecated Programmatic APIs

N/A

Updated REST APIs

Path Method Description Response Code Response

/v3/namespaces/<namespace-id>/search

GET

Additional parameters will be added for this API:

sortField
sortOrder
offset
limit

200 - On success

404 - When application is not available

500 - Any internal errors

Deprecated REST API

Path	Method	Description
/v3/apps/<app-id>	GET	Returns the application spec for a given application

CLI Impact or Changes

Impact #1
Impact #2
Impact #3

UI Impact or Changes

The UI was already updated in the 4.0 preview release to send the new parameters. There may be minor updates during implementation

Security Impact

The search API already returns only authorized results, there will be no changes to that.

Impact on Infrastructure Outages

System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect

Test Scenarios

Test ID	Test Description	Expected Results

CDAP

Sorting and Pagination for Search

Introduction

Goals

User Stories

Design

Approach

Approach #1

Approach #2

Approach #3

Search vs List

Optimizations for listing

Cursors

Upgrade

API changes

New Programmatic APIs

Deprecated Programmatic APIs

Updated REST APIs

Deprecated REST API

CLI Impact or Changes

UI Impact or Changes

Security Impact

Impact on Infrastructure Outages

Test Scenarios

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

Future work