Checklist

User Stories Documented
User Stories Reviewed
Design Reviewed
APIs reviewed
Release priorities assigned
Test cases reviewed
Blog post

Introduction

In CDAP 6.0.0, all user-annotated metadata property values are stored as String objects, which makes them ineligible for numeric search even when users think of them as numbers. To improve the metadata search experience, we can use Elasticsearch to store more specific representations of metadata values and let users search numerically.

User Stories

A pipeline developer attaches a “priority” property to their datasets, assigning to that property an integer value from 1 to 10. They would like to search for all of the datasets with a priority higher than 7.
A pipeline developer structures their datasets hierarchically and attaches a numeric “depth” property to them that specifies their distance from the first generation of datasets. They would like to search for all datasets before the 3rd generation.
A pipeline developer interacting with CDAP through the CLI programmatically assigns a numeric value to a “writes” property of their datasets based on the number of writes each dataset receives. They would like to search for all datasets with at least 5 writes.

UI Impact or Changes

This feature introduces five new comparison operators to the UI search syntax. Each operator is placed immediately after the key:value separator, as a prefix to the value:

A "==" prefix indicates a search for values exactly matching the number given.
A ">" prefix indicates a search for values greater than the number given.
A ">=" prefix indicates a search for values greater than or equal to the number given.
A "<" prefix indicates a search for values less than the number given.
A "<=" prefix indicates a search for values less than or equal to the number given.

Some examples of this syntax in use are:

priority:>7
depth:<3
writes:>=5

Without a preceding comparison operator, a search term containing a numeric value is interpreted as a String-based search. A search term that has a preceding comparison operator but whose value contains alphabetic characters is also interpreted as a String-based search.

Regardless of the presence or absence of requirement syntax (a "+" prefix), a specified numeric search will be considered a required term.
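
The rules above can be sketched as a small classifier. This is only an illustration under stated assumptions, not the CDAP implementation: the class NumericTermSketch, its nested NumericTerm type, and the regular expression are hypothetical names introduced here, and the sketch follows the "==" spelling of the equality operator used in this section. A term qualifies as numeric only if it has a key, a comparison operator, and a value that parses as a finite double; anything else falls back to a String-based search.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch, not CDAP code: decides whether one search term is a numeric search. */
final class NumericTermSketch {
  // Optional "+", then a key, the ":" separator, a comparison operator, and the value.
  // Two-character operators are listed first so ">=5" is not read as ">" followed by "=5".
  private static final Pattern NUMERIC_TERM =
      Pattern.compile("^\\+?([^:]+):(>=|<=|==|>|<)(.+)$");

  /** A term that qualified for numeric search. Numeric terms are always required. */
  static final class NumericTerm {
    final String key;
    final String operator;
    final double value;
    final boolean required = true;

    NumericTerm(String key, String operator, double value) {
      this.key = key;
      this.operator = operator;
      this.value = value;
    }
  }

  /** Returns the numeric term, or empty if the term should fall back to a String search. */
  static Optional<NumericTerm> parse(String term) {
    Matcher matcher = NUMERIC_TERM.matcher(term);
    if (!matcher.matches()) {
      return Optional.empty();   // no key, or no comparison operator after the separator
    }
    try {
      double value = Double.parseDouble(matcher.group(3));
      if (!Double.isFinite(value)) {
        return Optional.empty(); // NaN, or overflow past Double.MAX_VALUE
      }
      return Optional.of(new NumericTerm(matcher.group(1), matcher.group(2), value));
    } catch (NumberFormatException e) {
      return Optional.empty();   // e.g. "priority:>high"
    }
  }
}
```

With this sketch, "priority:>7" and "+writes:>=5" are classified as numeric, while "priority:7", "priority:>high", and the keyless ">30" are not.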

Discussions

Unspecified keys

CDAP metadata search supports both "key:value" syntax and simple "value" syntax. The expected behavior and use case for a key:value search (e.g. "key:>30") is well-defined, but what should one expect when a property or key is not specified (e.g. ">30")? Should this be considered a valid numeric search? There exist two options:

It may be considered a valid numeric search. Currently, when no key is specified, CDAP searches over all String-based representations of metadata values. If this were to be a valid numeric search, CDAP would have to search over all numeric representations of metadata values as well.
It may not be considered a valid numeric search. This would require users to specify a key when attempting to conduct a numeric search. Given that all numeric search terms are also required search terms, this specificity requirement seems the most useful.

Conclusion: Solution #2 is desirable, and a search without a specified property will be ineligible for numeric search. All numeric search terms are also assumed to be required search terms, so this specificity requirement will maximize proper use of that assumption.

Number storage limitations

If we are to store numeric values as such, we must handle the limitations of number storage in Java. If we store numbers as integers, how do we respond when a user inputs a number larger than Integer.MAX_VALUE? The automatically-assigned Creation-Time property, for instance, has a value above this, and is stored as a Long. There exist at least three possible solutions to this problem:

Store numbers as BigIntegers, which may solve the problem of a maximum value—CDAP enforces a 50-character limit on property values, which is within the scope of what a BigInteger can hold—but may cause memory/performance issues.
Store numbers as Longs or Doubles, and throw an exception when met with numeric values exceeding some cap (e.g. Long.MAX_VALUE).
Store numbers as Longs or Doubles, and interpret any numbers exceeding the cap as Strings.

Conclusion: Solution #3 is desirable. Numbers will be stored as Doubles, and if they exceed Double.MAX_VALUE, they will be interpreted as Strings. Storing them as Doubles allows both decimal and integer formats to be accepted (e.g. "2.0" and "2"). Interpreting excessively large numbers as Strings is the simplest sufficient solution for now, because String interpretation is already the default; in the future, this behavior may be changed to throw a user-facing exception instead.
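
A minimal sketch of this fallback follows, assuming a hypothetical helper named toNumeric (the class and method names are not part of CDAP). One detail worth making explicit: Double.parseDouble does not throw when the input exceeds Double.MAX_VALUE; it returns an infinite value, so the overflow case needs its own check. Note also that a double represents integers exactly only up to 2^53, so very large integer values lose precision even though they are accepted.

```java
import java.util.OptionalDouble;

/** Hypothetical sketch, not CDAP code: decides whether a stored value is numeric. */
final class NumericValueSketch {
  /** Returns the numeric form of the value, or empty if it should be treated as a String. */
  static OptionalDouble toNumeric(String raw) {
    if (raw == null) {
      return OptionalDouble.empty();
    }
    try {
      double parsed = Double.parseDouble(raw);
      // Overflow yields +/-Infinity rather than an exception, and "NaN" parses to NaN.
      // Neither is useful for range search, so both fall back to String.
      return Double.isFinite(parsed) ? OptionalDouble.of(parsed) : OptionalDouble.empty();
    } catch (NumberFormatException e) {
      return OptionalDouble.empty(); // not a number at all, e.g. "high"
    }
  }
}
```

Under this sketch, "2" and "2.0" both map to the same numeric value, while "1e400" and "high" are kept as Strings only.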

Design

New syntax will be introduced to allow searching metadata for numeric values. Greater than, greater than or equal to, less than, less than or equal to, and equality searches will be available.

The metadata indexing process will be changed to store valid numbers as numeric values in addition to being stored as Strings.

The Elasticsearch implementation of metadata storage will make use of the Elasticsearch Java API’s built-in classes to search metadata for numeric properties.

Implementation

Parsing search queries for numeric syntax

For a given search term string, we have to accurately tell whether it constitutes a numeric search.

Approach #1 - in QueryParser API

We can communicate additional information about a search term to ElasticsearchMetadataStorage through the QueryTerm. Much like the existing Qualifier enum, new SearchType and Comparison enums will hold this information, and the Elasticsearch metadata storage implementation can use them to construct the relevant QueryBuilder objects.

Approach #2 - in ElasticsearchMetadataStorage

We can instead extract type information by parsing the search term within ElasticsearchMetadataStorage, requiring the createTermQuery method to conduct the checks listed above. Items 1 and 2 of the above checks are already conducted by the method. A disadvantage of this approach is that it moves parsing logic out of QueryParser, where it would otherwise be available to possible future implementations of metadata storage.

Conclusion

Approach #1 is desirable. A natural benefit of this approach is that it is consistent with QueryParser's purpose. While Elasticsearch is currently the only CDAP metadata storage implementation to use numeric search, an added benefit of parsing in QueryParser is that it abstracts much of the conceptual work away from the ElasticsearchMetadataStorage class, enhancing its readability.

Indexing numeric values

For a given value entry, we have to accurately tell whether it constitutes a numeric value, much in the way we must do so for a numeric search. We must then store the value such that Elasticsearch can access it.

Approach #1 - Extending the Property class

We can create a NumericProperty class that extends MetadataDocument’s Property class, allowing its objects to store both the String and numeric representation of a numeric value.

Approach #2 - Augmenting the Property class

We can instead change the Property class to hold an extra numeric field (e.g. numericValue) that may or may not be null.

Conclusion

Approach #2 is desirable: it is the simpler approach, remains effective, and requires only a few straightforward changes to the codebase. The Property class will hold an extra Double field named numericValue, which is assigned only if the string representation can be parsed as a Double (through the Double.parseDouble method) and is left null otherwise. Accompanying this change to the Property class, index.mapping.json will include a new nested property of type double, "numericValue". ElasticsearchMetadataStorage will then include a nested numeric value field that corresponds to this change.
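
As a rough illustration of the augmented class, the sketch below keeps the String value and adds a nullable numericValue. It is a simplified stand-in: PropertySketch and its two-argument constructor are hypothetical, the real Property class in MetadataDocument may carry other fields, and treating non-finite parse results as non-numeric is an assumption taken from the number storage discussion above.

```java
/** Hypothetical sketch, not CDAP code: a property that keeps both representations. */
final class PropertySketch {
  private final String name;
  private final String value;
  private final Double numericValue; // null when the value is not a finite number

  PropertySketch(String name, String value) {
    this.name = name;
    this.value = value;
    this.numericValue = parseOrNull(value);
  }

  private static Double parseOrNull(String value) {
    try {
      double parsed = Double.parseDouble(value);
      return Double.isFinite(parsed) ? Double.valueOf(parsed) : null;
    } catch (NumberFormatException | NullPointerException e) {
      return null; // non-numeric or missing value: only the String form is indexed
    }
  }

  String getName() {
    return name;
  }

  String getValue() {
    return value;
  }

  Double getNumericValue() {
    return numericValue;
  }
}
```

The String value is always retained, so existing String-based searches keep working for numeric properties.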

Searching for numbers with Elasticsearch

Elasticsearch’s RangeQueryBuilder class provides a straightforward way to conduct greater than, greater than or equal to, less than, less than or equal to, and equality searches for numeric values. After parsing a numeric search term, we can detect the presence of a comparison operator—and if there is one, detect which one it is—and map that operator to a RangeQueryBuilder method. This can be executed within ElasticsearchMetadataStorage’s createTermQuery method.
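
The operator mapping can be sketched without the Elasticsearch client on the classpath by building the bounds of the equivalent range query. The keys "gt", "gte", "lt", and "lte" are the range query's bound names, and RangeQueryBuilder exposes methods with the same names. The class RangeClauseSketch and its bounds method are hypothetical, and expressing equality as a closed range with equal lower and upper bounds is an assumption about how EQUALS could be handled, not a confirmed design decision.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not CDAP code: maps a comparison to range query bounds. */
final class RangeClauseSketch {
  enum Comparison {
    EQUALS, GREATER, GREATER_OR_EQUAL, LESS, LESS_OR_EQUAL
  }

  /** Returns the bounds of a range query, keyed by the range query bound name. */
  static Map<String, Double> bounds(Comparison comparison, double value) {
    Map<String, Double> bounds = new LinkedHashMap<>();
    switch (comparison) {
      case GREATER:
        bounds.put("gt", value);
        break;
      case GREATER_OR_EQUAL:
        bounds.put("gte", value);
        break;
      case LESS:
        bounds.put("lt", value);
        break;
      case LESS_OR_EQUAL:
        bounds.put("lte", value);
        break;
      case EQUALS:
        // Assumption: equality as a closed range whose lower and upper bounds coincide.
        bounds.put("gte", value);
        bounds.put("lte", value);
        break;
      default:
        throw new IllegalArgumentException("Unknown comparison: " + comparison);
    }
    return bounds;
  }
}
```

For the search "priority:>7", the bounds would be a single "gt" entry with the value 7.0, applied to the numeric field of the "priority" property.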

API changes

Updated QueryParser

QueryParser.java

/**
 * A thread-safe class that provides helper methods for metadata search string interpretation,
 * and defines search syntax for various search term properties, i.e. the data stored in {@link QueryTerm} objects.
 */
public final class QueryParser {
  private static final Pattern SPACE_SEPARATOR_PATTERN = Pattern.compile("\\s+");
  private static final String KEYVALUE_SEPARATOR = ":";
  private static final String REQUIRED_OPERATOR = "+";

  // private constructor to prevent instantiation
  private QueryParser() {}

  /**
   * Organizes and separates a raw, space-separated search string
   * into multiple {@link QueryTerm} objects. Spaces are defined by the {@link QueryParser#SPACE_SEPARATOR_PATTERN}
   * field, the semantics of which are documented in Java's {@link Pattern} class.
   * Certain typical separations of terms, such as hyphens and commas, are not considered spaces.
   * This method preserves the original case of the query.
   *
   * QueryTerms are assigned a search type {@link QueryTerm.SearchType} based on their format. For instance,
   * if a string can be parsed as a numeric double, it will be assigned the NUMERIC type, which allows it to be used
   * in a numeric search. Search terms containing alphabetical characters and those exceeding {@link Double#MAX_VALUE}
   * will be assigned the STRING type.
   *
   * This method supports the use of certain search operators that, when placed before a search term,
   * denote qualifying information about that search term. When translated into a QueryTerm object, search terms
   * containing a qualifying operator have the operator removed from the string representation.
   * The {@link QueryParser#REQUIRED_OPERATOR} character signifies a search term that must receive a match.
   * By default, this method considers search items of {@link SearchType#STRING}
   * without a qualifying operator to be optional.
   * Search items of {@link SearchType#NUMERIC} are automatically required.
   *
   * For numeric searches, one of several comparison operators can be used.
   * >, >=, <, <=, or = can be placed before a numeric search value to denote a greater-than,
   * greater-than-or-equal-to, less-than, less-than-or-equal-to, or equality search, respectively.
   * Search items without a comparison operator are considered string-based searches.
   *
   * @param query the raw search string
   * @return a list of QueryTerms
   */
  public static List<QueryTerm> parse(String query) {
    // ...
  }

  /**
   * Extracts the raw value of the input term, given that terms can follow a key:[comparison-operator]value syntax.
   * This method removes any syntactic characters from the input string, including comparison and wildcard operators,
   * as well as the property qualifier, e.g. "key".
   * As an example, extractTermValue("key:>=30") returns "30".
   *
   * Note that this method removes comparison operators from alphabetic strings as well, even though they do not qualify
   * for numeric search.
   * As an example, extractTermValue("+>=thirty") returns "thirty".
   *
   * If the value consists entirely of a single operator (e.g. ">=" or "+"), that operator will be returned.
   * As an example, extractTermValue("key:>=") returns ">=", despite it typically being a comparison operator. In this
   * example, ">=" does not precede anything, and is thus considered its own search term.
   *
   * @param term the search term, with all syntactic operators included
   * @return the raw value of the search term, with all syntactic operators excluded
   */
  public static String extractTermValue(String term) {
    // ...
  }
}
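
The behavior documented for extractTermValue can be sketched as follows. The method body is elided in this document, so this is one possible reading of the Javadoc rather than the actual implementation: the class name ExtractTermValueSketch is hypothetical, the sketch uses the "==" spelling of the equality operator from the UI section, and it does not attempt to handle the wildcard operators the Javadoc mentions.

```java
/** Hypothetical sketch, not CDAP code: one reading of the extractTermValue Javadoc. */
final class ExtractTermValueSketch {
  private static final String KEYVALUE_SEPARATOR = ":";
  private static final String REQUIRED_OPERATOR = "+";
  // Two-character operators come first so ">=" is never read as ">" followed by "=".
  private static final String[] COMPARISON_OPERATORS = {">=", "<=", "==", ">", "<"};

  static String extractTermValue(String term) {
    String value = term;
    // Drop a leading "+" unless the "+" is the entire term.
    if (value.startsWith(REQUIRED_OPERATOR) && value.length() > REQUIRED_OPERATOR.length()) {
      value = value.substring(REQUIRED_OPERATOR.length());
    }
    // Drop "key:" when something follows the separator.
    int separator = value.indexOf(KEYVALUE_SEPARATOR);
    if (separator >= 0 && separator < value.length() - 1) {
      value = value.substring(separator + 1);
    }
    // Drop a leading comparison operator unless the operator is the entire value.
    for (String operator : COMPARISON_OPERATORS) {
      if (value.startsWith(operator)) {
        return value.length() > operator.length() ? value.substring(operator.length()) : value;
      }
    }
    return value;
  }
}
```

This reproduces the three examples given in the Javadoc: "key:>=30" yields "30", "+>=thirty" yields "thirty", and "key:>=" yields ">=".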

Updated QueryTerm

QueryTerm.java

/**
 * Represents a single item in a search query in terms of its content (i.e. the value being searched for)
 * and any useful properties of the search term, e.g. its qualifier and search type.
 * Typically constructed as part of a list by {@link QueryParser#parse(String)}.
 */
public class QueryTerm {
  private final String term;
  private final Qualifier qualifier;
  private final SearchType searchType;
  private final Comparison comparison;

  /**
   * Defines the different types of search operators that can be used.
   * A qualifier determines how the search implementation should prioritize the given term, e.g.
   * prioritizing required terms over optional ones.
   */
  public enum Qualifier {
    OPTIONAL, REQUIRED
  }

  /**
   * Defines the different types of search terms that can be used.
   * A search type describes the intuitive object type of the term;
   * for instance, the term may be intuited as a number and parsed as one, though internally represented as a String.
   * Its search type would be considered NUMERIC.
   */
  public enum SearchType {
    STRING, NUMERIC
  }

  /**
   * Defines the different relationships a search term can have to potential matches.
   * For a String or keyword search, only EQUALS is valid.
   */
  public enum Comparison {
    EQUALS, GREATER, GREATER_OR_EQUAL, LESS, LESS_OR_EQUAL
  }

  /**
   * Older constructor that assumes a simple String search. Ineligible for numeric search fields.
   *
   * @param term the search term
   * @param qualifier the qualifying information {@link Qualifier}
   */
  public QueryTerm(String term, Qualifier qualifier) {
    this(term, qualifier, SearchType.STRING, Comparison.EQUALS);
  }

  /**
   * Constructs a QueryTerm using the search term, qualifying information, search type, and comparison type.
   *
   * @param term the search term
   * @param qualifier the qualifying information {@link Qualifier}
   * @param searchType the intuitive object type {@link SearchType}
   * @param comparison the desired relative value of potential matches {@link Comparison}
   */
  public QueryTerm(String term, Qualifier qualifier, SearchType searchType, Comparison comparison) {
    this.term = term;
    this.qualifier = qualifier;
    this.searchType = searchType;
    this.comparison = comparison;
  }

  public String getTerm() {
    return term;
  }

  public Qualifier getQualifier() {
    return qualifier;
  }

  public SearchType getSearchType() {
    return searchType;
  }

  public Comparison getComparison() {
    return comparison;
  }

  @Override
  public boolean equals(Object o) {
    if (o == this) {
      return true;
    }
    if (o == null || getClass() != o.getClass()) {
      return false;
    }

    QueryTerm that = (QueryTerm) o;

    return Objects.equals(term, that.getTerm())
        && Objects.equals(qualifier, that.getQualifier())
        && Objects.equals(searchType, that.getSearchType())
        && Objects.equals(comparison, that.getComparison());
  }

  @Override
  public int hashCode() {
    return Objects.hash(term, qualifier, searchType, comparison);
  }

  @Override
  public String toString() {
    return "term:" + term
        + ", qualifier: " + qualifier
        + ", searchType: " + searchType
        + ", comparison: " + comparison;
  }
}

Related Jira


Related Work

Future Work

With the introduction of several new and implementation-specific metadata search features, a user-friendly way of navigating what features are available should be implemented.
