How Record Search Works

Note

Keep in mind that the data must be reindexed in a timely manner for search to work correctly. Reindexing is necessary when the data structure (data model) changes significantly and when many records have been loaded.

Search Mechanism

When you search through the main search bar, the following happens:

Step 1. The client (frontend) sends a request to the server (backend) with the fields to search (searchFields), the text to search (text), the name of the entity/reference set to search (entity), and the current time (asOf):

  • The client selects the following search fields: all attributes marked as "Searchable"; the etalon ID; the creation and update dates; and the validity period boundaries.
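As a sketch, the request described in Step 1 might look like the following. The top-level keys (searchFields, text, entity, asOf) come from the description above; the concrete field names and JSON shape are assumptions for illustration only.

```python
import json
from datetime import datetime, timezone

# Hypothetical sketch of the search request the frontend sends.
# The keys searchFields, text, entity, and asOf are named in the docs;
# the values and internal field names ($etalon_id, $from, $to) are assumed.
request = {
    "entity": "customer",                # entity/reference set to search
    "text": "john smith",                # text entered in the search bar
    "searchFields": [
        "name",                          # a "Searchable" attribute
        "$etalon_id",                    # etalon ID
        "$from", "$to",                  # validity period boundaries
    ],
    "asOf": datetime.now(timezone.utc).isoformat(),  # current time
}

print(json.dumps(request, indent=2))
```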

Step 2. The server adds a condition to the query that asOf must fall within the validity period, as well as a condition that the record and its validity period are not deleted (i.e., are active).
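The conditions from Step 2 correspond to filter clauses of an Elasticsearch/OpenSearch bool query. A minimal sketch follows; the concrete field names ($from, $to, $deleted, $period_deleted) are hypothetical, chosen only to illustrate the structure.

```python
# Sketch of the server-side filter clauses added in Step 2.
# Field names ($from, $to, $deleted, $period_deleted) are hypothetical.
as_of = "2024-01-15T00:00:00Z"

filters = [
    # asOf must fall within the record's validity period
    {"range": {"$from": {"lte": as_of}}},
    {"range": {"$to": {"gte": as_of}}},
    # the record and its validity period must be active (not deleted)
    {"term": {"$deleted": False}},
    {"term": {"$period_deleted": False}},
]

query = {"bool": {"filter": filters}}
print(query)
```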

Step 3. The search fields are split into two parts (non-string and string), and several search subqueries are created; a record is found if it satisfies the conditions of at least one subquery:

  • Non-string - a match query (for one field) or a multi-match query (for several fields) is created. The entered text must match the field value exactly:

    • For logical fields - true/false;

    • For numeric fields - with a dot as the separator between the integer and fractional parts;

  • String - a match query (for one field) or a multi-match query (for several fields) is created. The entered text is analyzed by the standard analyzer (the same for OpenSearch) and split into tokens; each token must be found in at least one of the searched fields, i.e., match one of that field's tokens (for an attribute named attrName, the search runs against the field named attrName). Refer to the "System Parameters" section for the parameters below.

    • The fuzzy search parameter org.unidata.mdm.search.fuzziness (non-negative integer, default 1) is taken into account: the allowable difference between the search tokens and the field tokens. The difference (Levenshtein distance) is measured as the number of character insertions, deletions, and replacements required to transform one token into the other;

    • The fuzzy search parameter org.unidata.mdm.search.fuzziness.prefix.length (non-negative integer, default 4) is taken into account: the number of characters at the beginning of a token that must match exactly;

    • The number of possible tokens matched by fuzzy search is limited to 50; this limit is hardwired into the server code (a performance limitation).
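Putting the parameters above together, the string-field subquery roughly corresponds to an Elasticsearch/OpenSearch match or multi_match query with fuzziness and prefix_length set. The sketch below is an assumption about how the query is assembled, not the actual server code; the defaults mirror org.unidata.mdm.search.fuzziness = 1, org.unidata.mdm.search.fuzziness.prefix.length = 4, and the hardwired limit of 50.

```python
def build_string_subquery(text, fields, fuzziness=1, prefix_length=4):
    """Sketch of the string-field subquery described above.

    fuzziness      -> org.unidata.mdm.search.fuzziness (default 1)
    prefix_length  -> org.unidata.mdm.search.fuzziness.prefix.length (default 4)
    max_expansions -> the hardwired limit of 50 fuzzy-matched tokens
    """
    if len(fields) == 1:
        # one field: a match query
        return {"match": {fields[0]: {
            "query": text,
            "fuzziness": fuzziness,
            "prefix_length": prefix_length,
            "max_expansions": 50,
        }}}
    # several fields: a multi_match query
    return {"multi_match": {
        "query": text,
        "fields": fields,
        "fuzziness": fuzziness,
        "prefix_length": prefix_length,
        "max_expansions": 50,
    }}

print(build_string_subquery("ivan petrov", ["name", "address"]))
```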

Indexing Mechanism

For more information about what an analyzer and its components (character filter, tokenizer, and token filter) are, see the official Elasticsearch documentation. The documentation is also relevant for OpenSearch.

Custom token filters:

  • autocomplete_filter - edge_ngram with parameters min_gram = 1 and max_gram = 55

  • hunspell_ru_RU, hunspell_en_US - hunspell with parameters dedup = true, longest_only = true
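The effect of autocomplete_filter (edge_ngram with min_gram = 1 and max_gram = 55) can be illustrated with a small simulation, which emits every prefix of a token up to the maximum length:

```python
def edge_ngrams(token, min_gram=1, max_gram=55):
    """Simulate the edge_ngram token filter with the autocomplete_filter
    settings: emit prefixes of the token from min_gram to max_gram chars."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

# Indexing "petrov" stores every prefix, so typing "pet" finds the record.
print(edge_ngrams("petrov"))  # ['p', 'pe', 'pet', 'petr', 'petro', 'petrov']
```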

Analyzers for indexing:

  • unidata_default_analyzer

    • tokenizer - standard (or whitespace, if a custom property named "tokenize_on_chars" with the value "whitespace" is specified on the entity/reference set)

    • token filters - lowercase, autocomplete_filter

  • unidata_search_analyzer

    • tokenizer - standard (or whitespace, if a custom property named "tokenize_on_chars" with the value "whitespace" is specified on the entity/reference set)

    • token filters - lowercase

  • unidata_morph_analyzer

    • tokenizer - standard (or whitespace, if a custom property named "tokenize_on_chars" with the value "whitespace" is specified on the entity/reference set)

    • token filters - lowercase, hunspell_ru_RU, hunspell_en_US
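Assembled from the lists above, the index analysis settings would look roughly like the following. This is a sketch that mirrors the shape of the Elasticsearch/OpenSearch "analysis" settings block; the actual shipped configuration may differ (in particular, the hunspell locale keys are assumed).

```python
# Analyzer and filter settings assembled from the lists above; a sketch,
# not the exact shipped configuration.
analysis = {
    "filter": {
        "autocomplete_filter": {
            "type": "edge_ngram", "min_gram": 1, "max_gram": 55},
        "hunspell_ru_RU": {
            "type": "hunspell", "locale": "ru_RU",
            "dedup": True, "longest_only": True},
        "hunspell_en_US": {
            "type": "hunspell", "locale": "en_US",
            "dedup": True, "longest_only": True},
    },
    "analyzer": {
        "unidata_default_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "autocomplete_filter"]},
        "unidata_search_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase"]},
        "unidata_morph_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "hunspell_ru_RU", "hunspell_en_US"]},
    },
}
print(list(analysis["analyzer"]))
```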

Usage Guideline

These analyzers are used when indexing string attributes (except code attributes).

Attributes marked as "Searchable", "Displayable", "Main Displayable", or "Unique" are indexed, as are code attributes.

A string attribute named attrName is indexed in several ways, into fields with different names:

  • attrName - text, used by unidata_default_analyzer

  • attrName.$default - text, used by unidata_search_analyzer

  • attrName.$nan - keyword; if the attribute has the "Registry-independent search" option enabled, the lowercase normalizer is additionally applied.

  • attrName.$morph - text, using unidata_morph_analyzer; created only if the attribute has the "Morphological search" option enabled.
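The field layout above can be sketched as a mapping fragment for one attribute. This is assembled from the list, assuming both optional features ("Registry-independent search" and "Morphological search") are enabled; it is an illustration, not the exact shipped mapping.

```python
# Sketch of the multi-field mapping for a string attribute "attrName",
# assembled from the list above. Assumes "Registry-independent search"
# and "Morphological search" are both enabled.
mapping = {
    "attrName": {
        "type": "text",
        "analyzer": "unidata_default_analyzer",
        "fields": {
            "$default": {"type": "text",
                         "analyzer": "unidata_search_analyzer"},
            "$nan": {"type": "keyword",
                     "normalizer": "lowercase"},
            "$morph": {"type": "text",
                       "analyzer": "unidata_morph_analyzer"},
        },
    },
}
print(sorted(mapping["attrName"]["fields"]))
```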

Other string fields (etalon ID, code string attributes, etc.) are indexed as keyword (without analysis, i.e., as is).