How Search of Records Works¶
Note
Keep in mind that the data must be reindexed in a timely manner for the search to work correctly. Reindexing is necessary after significant changes to the data structure (data model) and after a large number of records have been loaded.
Search Mechanism¶
When you search through the main search bar, the following happens:
Step 1. The client (frontend) sends a request to the server (backend) with the fields to search (searchFields), the text to search for (text), the name of the entity/reference set to search in (entity), and the current time (asOf).
The client selects the following search fields: all attributes labeled as "Searchable"; etalon ID; creation and update dates; validity period boundaries.
Step 2. The server adds a condition to the query that asOf must be within the validity period, as well as a condition that the record and validity period are not deleted (active).
Step 3. The search fields are split into two groups (non-string and string), and several search subqueries are created. A record is found if the conditions of at least one subquery are satisfied:
Non-string fields - a match query (for a single field) or a multi-match query (for several fields) is created. The entered text must match the value in the field exactly:
For logical fields - true/false;
For numeric fields - with a dot as a separator of integer and fractional parts;
String fields - a match query (for a single field) or a multi-match query (for several fields) is created. The entered text is processed by the standard analyzer (the same applies to OpenSearch) and split into tokens, and each token must be found in at least one of the searched fields, i.e. match one of the tokens in it (for an attribute named attrName, the search uses the field named attrName). All of the parameters below are described in the "System Parameters" section.
The fuzzy search parameter org.unidata.mdm.search.fuzziness (non-negative integer, default 1), which is the allowable difference between the search tokens and the field tokens, is taken into account. The difference (Levenshtein distance) is measured as the number of add, delete, and replace character operations required to get one token from the other;
The fuzzy search parameter org.unidata.mdm.search.fuzziness.prefix.length (non-negative integer, default 4), which is the number of characters at the beginning of a token that must match exactly, is taken into account;
The number of possible tokens matched by fuzzy search is limited to 50; this number is hardwired into the server code (performance limitation).
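The interplay of the two fuzzy parameters can be illustrated with a minimal Python sketch. It is not the server's or the search engine's implementation: the function names are hypothetical, the distance is plain Levenshtein as described above, and the handling of tokens shorter than the prefix length is a simplifying assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of add, delete, and replace operations that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # add ch_b
                previous[j - 1] + cost,  # replace ch_a with ch_b (or keep it)
            ))
        previous = current
    return previous[-1]


def fuzzy_token_match(search_token: str, field_token: str,
                      fuzziness: int = 1, prefix_length: int = 4) -> bool:
    """Hypothetical helper mirroring the two parameters described above.

    Assumption: if the search token is shorter than prefix_length,
    the whole search token is treated as the mandatory prefix.
    """
    prefix = search_token[:prefix_length]
    if not field_token.startswith(prefix):
        return False
    return levenshtein(search_token, field_token) <= fuzziness
```

With the default values, `fuzzy_token_match("normal", "norma")` is true (one deleted character, shared prefix "norm"), while `fuzzy_token_match("normal", "formal")` is false because the first four characters differ, even though the distance is only 1.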
Fuzzy Search¶
When org.unidata.mdm.search.fuzziness.with.wildcard is enabled (it is enabled by default), a wildcard query is added for each field (the attrName.$nan field is used for analyzed attributes):
If the search text contains a single word, the field text must satisfy the "text*" condition, where text is the search text and * is any text (in versions 5.x and before version 6.5, the search was by "text");
If the search text contains several words, the field text must satisfy all of the "word" conditions, where word is each word and * is any text.
With standard settings, all three conditions must be met to find a record without additional search criteria:
The record must be active (i.e., not deleted);
The validity period of the record must be active (i.e. not deleted) and contain the current time;
Fulfillment of at least one of the conditions:
Full text match on at least one of the non-string search fields;
Fuzzy text match on at least one of the string search fields;
Wildcard text match on at least one of the string search fields.
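The way these conditions combine can be written as a small predicate. This is only an illustrative sketch: the `Candidate` class and its field names are hypothetical, the three text checks are passed in as precomputed flags, and treating both validity period boundaries as inclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Candidate:
    """Hypothetical summary of one record's validity period and match results."""
    record_active: bool
    period_active: bool
    valid_from: datetime
    valid_to: datetime
    exact_non_string_match: bool  # full match on a non-string search field
    fuzzy_string_match: bool      # fuzzy match on a string search field
    wildcard_string_match: bool   # wildcard match on a string search field


def is_found(candidate: Candidate, as_of: datetime) -> bool:
    """All three top-level conditions must hold; the text checks are alternatives."""
    period_ok = (candidate.period_active
                 and candidate.valid_from <= as_of <= candidate.valid_to)
    text_ok = (candidate.exact_non_string_match
               or candidate.fuzzy_string_match
               or candidate.wildcard_string_match)
    return candidate.record_active and period_ok and text_ok
```

A record with a matching text but a deleted validity period, or a validity period that does not contain asOf, is therefore not returned.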
Notes:
The "Morphological search" option of an attribute lets you specify an additional criterion: a search by attribute with the "Morphological" search type. This option does not affect the search by the main search string.
The "Registry-independent search" option of the attribute affects only the wildcard part of the search query and the search by attribute (additional criterion).
Example¶
Suppose a record has one simple string attribute strAttr with the value "normal joy".
The strAttr.$default field will contain the tokens "normal" and "joy";
The strAttr field will contain the tokens "n", "no", "nor", "norm", "norma", "normal", "j", "jo", and "joy";
The strAttr.$nan field will have the value "normal joy".
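The field contents listed above can be reproduced with a rough simulation. The sketch below only approximates the real analyzers: a regular expression stands in for the standard tokenizer, the edge n-gram limits (1 and 55) follow the autocomplete_filter settings given in the "Indexing Mechanism" section, and all function names are hypothetical.

```python
import re


def standard_tokens(text: str) -> list[str]:
    """Rough stand-in for the standard tokenizer followed by the lowercase filter."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 55) -> list[str]:
    """Prefixes of the token, from min_gram up to max_gram characters long."""
    longest = min(len(token), max_gram)
    return [token[:size] for size in range(min_gram, longest + 1)]


def index_string_attribute(name: str, value: str) -> dict:
    """Simulated contents of the fields created for one string attribute."""
    tokens = standard_tokens(value)
    return {
        name: [gram for token in tokens for gram in edge_ngrams(token)],
        f"{name}.$default": tokens,
        f"{name}.$nan": value,
    }


fields = index_string_attribute("strAttr", "normal joy")
for field_name, content in fields.items():
    print(field_name, content)
```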
A search for:
"joy" will find the record by matching the text against the strAttr field, which contains the token "joy";
"normal" will find the record by a fuzzy text match against the strAttr field, which contains the tokens "norm" and "norma".
Morphological Search¶
Note
The "Morphological search" option of an attribute lets you specify an additional criterion: a search by attribute with the "Morphological" search type. This option does not affect the search by the main search string.
Morphological search finds records by attribute value by matching words that share the same base (the part of the word without the ending).
When several words are entered, all of them must be found in the attribute.
In addition, the org.unidata.mdm.search.fuzziness.with.wildcard option affects the morphological search. If it is enabled, the following is accepted as an alternative to finding all search words by their word base:
If the search text contains a single word, the attribute value satisfies the "text" condition, where text is the search text and * is any text;
If the search text contains several words, the attribute value satisfies all of the "word" conditions, where word is each word and * is any text.
Morphological search depends on the installed Hunspell dictionaries (they are pre-installed in Elasticsearch; in OpenSearch they must be installed manually).
To check which word or words a given word or sentence is reduced to, do the following:
Obtain an index with the appropriate analyzer in one of the following ways:
Use an existing entity/reference set index with an attribute that has morphological search enabled (by default, for the funnyRegister entity, the index name will be default_default_funnyregister);
Create such an index with the request PUT localhost:9200/test_index_name. Example request body:
{ "settings": { "analysis" : { "analyzer" : { "unidata_morph_analyzer" : { "tokenizer" : "standard", "filter" : [ "lowercase", "hunspell_en_EN" ] } }, "filter" : { "hunspell_en_EN" : { "type" : "hunspell", "locale" : "en_EN", "dedup" : false, "longest_only": true } } } } }
Run the query POST localhost:9200/test_index_name/_analyze:
{ "text": "word or sentence to check", "analyzer": "unidata_morph_analyzer" }
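The analyze request above can also be sent from a script. The following is a minimal sketch using only the Python standard library; the host and index name are the ones used in the steps above, the function names are hypothetical, and the response is assumed to have the usual `_analyze` shape with a "tokens" list.

```python
import json
import urllib.request


def build_analyze_request(text: str,
                          index: str = "test_index_name",
                          host: str = "http://localhost:9200",
                          analyzer: str = "unidata_morph_analyzer"):
    """Prepare POST <host>/<index>/_analyze without sending it."""
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text: str, **kwargs) -> list[str]:
    """Send the request and return the tokens the text is reduced to."""
    with urllib.request.urlopen(build_analyze_request(text, **kwargs)) as response:
        payload = json.load(response)
    return [item["token"] for item in payload["tokens"]]
```

Calling `analyze("word or sentence to check")` requires a running search engine with the index created as described; `build_analyze_request` alone can be used to inspect the request offline.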
Indexing Mechanism¶
For more information about what an analyzer and its components (character filter, tokenizer, and token filter) are, see the official Elasticsearch documentation. The documentation also applies to OpenSearch.
Custom token filters:
autocomplete_filter - edge_ngram with parameters min_gram = 1 and max_gram = 55
hunspell_ru_RU, hunspell_en_US - hunspell with parameters dedup = true, longest_only = true
Analyzers for indexing:
unidata_default_analyzer
tokenizer - standard (or whitespace, if custom property named "tokenize_on_chars" and value "whitespace" is specified on the entity/reference set)
token filters - lowercase, autocomplete_filter
unidata_search_analyzer
tokenizer - standard (or whitespace, if custom property with the name "tokenize_on_chars" and value "whitespace" is specified on the entity/reference set)
token filters - lowercase
unidata_morph_analyzer
tokenizer - standard (or whitespace, if custom property with the name "tokenize_on_chars" and the value "whitespace" is specified on the entity/reference set)
token filters - lowercase, hunspell_ru_RU, hunspell_en_US
Usage Guideline¶
The analyzers above are used when indexing string attributes (except code attributes).
Attributes marked as "Searchable", "Displayable", "Main Displayable", or "Unique" are indexed, as are code attributes.
A string attribute named attrName is indexed in several ways, into fields with different names:
attrName - text, analyzed with unidata_default_analyzer
attrName.$default - text, analyzed with unidata_search_analyzer
attrName.$nan - keyword; if the attribute has the "Registry-independent search" option enabled, the lowercase normalizer is additionally applied.
If the attribute has the "Morphological search" option enabled: attrName.$morph - text, analyzed with unidata_morph_analyzer.
Other string fields (etalon ID, code string attributes, etc.) are indexed as keyword (without analysis, i.e., as is).
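As an illustration of the field layout described in this section, the sketch below builds a mapping fragment for one string attribute. It assumes the extra fields are declared as multi-fields (sub-fields under "fields"), which the text above does not confirm; the function name and its parameters are hypothetical.

```python
def string_attribute_mapping(name: str,
                             registry_independent: bool = False,
                             morphological: bool = False) -> dict:
    """Mapping fragment for one string attribute (assumed multi-field layout)."""
    sub_fields = {
        # name.$default: tokens without edge n-grams
        "$default": {"type": "text", "analyzer": "unidata_search_analyzer"},
        # name.$nan: the value as is, not analyzed
        "$nan": {"type": "keyword"},
    }
    if registry_independent:
        # "Registry-independent search" option: lowercase normalizer on the keyword
        sub_fields["$nan"]["normalizer"] = "lowercase"
    if morphological:
        # "Morphological search" option: name.$morph with the Hunspell-based analyzer
        sub_fields["$morph"] = {"type": "text", "analyzer": "unidata_morph_analyzer"}
    return {
        name: {
            "type": "text",
            "analyzer": "unidata_default_analyzer",
            "fields": sub_fields,
        }
    }


mapping = string_attribute_mapping("attrName", registry_independent=True, morphological=True)
```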