Data Matching Operation (matchingJob)¶
The operation searches for new duplicate records and updates existing ones according to the selected matching rule sets. The operation updates the matching tables, thereby forming clusters of duplicates.
Operation Parameters¶
User account name (text field). The login of the account used for the operation.
Rule sets block size (integer). The number of rule sets processed simultaneously while the operation runs. Default: 10.
Rule sets (drop-down list). The rule sets that the operation should process.
Table update block size (integer). The number of matching table records processed simultaneously while the operation runs. Default: 1024.
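The parameters above can be pictured with a minimal sketch. The class name `MatchingJobParams`, its field names, and the helper `in_blocks` are hypothetical and are not part of the product API; only the default values 10 and 1024 come from the description above. The sketch assumes that a block size simply splits the work into consecutive chunks, which may not match how the operation schedules blocks internally.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class MatchingJobParams:
    """Hypothetical container mirroring the operation parameters."""
    user_account_name: str
    rule_sets: List[str] = field(default_factory=list)
    rule_sets_block_size: int = 10        # default stated in the documentation
    table_update_block_size: int = 1024   # default stated in the documentation


def in_blocks(items: Sequence[T], block_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most block_size items."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, len(items), block_size):
        yield items[start:start + block_size]


# Illustrative values only: 25 made-up rule set names and a made-up login.
params = MatchingJobParams(
    user_account_name="admin",
    rule_sets=[f"rule_set_{i}" for i in range(25)],
)
blocks = list(in_blocks(params.rule_sets, params.rule_sets_block_size))
print([len(b) for b in blocks])  # [10, 10, 5]
```

With the default rule sets block size, 25 rule sets would be handled in three blocks under this assumption.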
Notes:
The operation does not consolidate clusters, regardless of whether auto-consolidation is enabled.
The notification shown when the operation completes displays the number of clusters found. You can download a CSV file with their description.
When to Run the Operation¶
When the data matching model is updated (for example, if a new column is added). In this case, both the matching tables and the record clusters should be recalculated.
When the search algorithm is changed (for example, from case-insensitive to case-sensitive, or from exact to fuzzy). In this case, the record clusters should be recalculated.
When records are batch loaded with real-time matching disabled (XLSX/REST/Custom). The load forms the matching tables, after which matching should be recalculated. It is then recommended to enable real-time matching so that duplicate search works for subsequent single inserts.
Comparison with reindexDataJob¶
The reindex operation (reindexDataJob) should be started when the data model has changed and the search indexes need to be updated for it, or when the indexes are damaged and need to be repaired.
The data matching operation (matchingJob) should be started when the data matching model has changed or when mass matching needs to be performed.
reindexDataJob with the "Update matching tables data" flag set does the following with respect to matching:
Updates the matching tables (matching algorithms work with them).
Calculates duplicate clusters, but only if real-time matching is enabled (parameter org.unidata.mdm.matching.data.real.time.matching.enabled in backend.properties).
Features of reindexDataJob: you can choose which entities/reference sets have their records affected.
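As a rough illustration of the real-time matching check, the sketch below reads the property from the text of a properties file. The property key is taken from this page; the helper names, the sample file content, and the assumption that the value is written as `true` or `false` are illustrative and not confirmed by the product documentation.

```python
REAL_TIME_KEY = "org.unidata.mdm.matching.data.real.time.matching.enabled"


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def real_time_matching_enabled(text: str) -> bool:
    """Return True only when the key is present and set to 'true' (assumed format)."""
    return parse_properties(text).get(REAL_TIME_KEY, "").lower() == "true"


# Illustrative fragment, not a real configuration file.
sample = """
# backend.properties (fragment)
org.unidata.mdm.matching.data.real.time.matching.enabled=true
"""
print(real_time_matching_enabled(sample))  # True
```

The sketch treats a missing key as disabled. That is a choice made for this example, not a statement about the product's actual default.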
matchingJob does:
Updates the matching tables (matching algorithms work with them).
Calculates duplicate clusters regardless of whether real-time is enabled or not.
Features of matchingJob:
You can choose which matching rule sets are applied to the records.
The notification on the result of the operation is more detailed.
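The difference between the two operations can be summarised in a small sketch. The function `job_effects` and its dictionary keys are hypothetical names used only to restate the comparison above. For reindexDataJob, the sketch assumes that the "Update matching tables data" flag is set.

```python
def job_effects(job: str, real_time_enabled: bool) -> dict:
    """Summarise what each job does to matching data, per the comparison above."""
    if job == "matchingJob":
        # Clusters are calculated regardless of the real-time setting.
        return {"updates_matching_tables": True, "calculates_clusters": True}
    if job == "reindexDataJob":
        # Assumes the "Update matching tables data" flag is set.
        return {
            "updates_matching_tables": True,
            "calculates_clusters": real_time_enabled,
        }
    raise ValueError(f"unknown job: {job}")


for job in ("reindexDataJob", "matchingJob"):
    for real_time in (False, True):
        print(job, real_time, job_effects(job, real_time))
```

The only case in which clusters are not calculated is reindexDataJob with real-time matching disabled, which is when a follow-up run of matchingJob is needed.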