Data Reindexing Operation (reindexDataJob)¶
The data reindexing operation rebuilds the search system indexes when there are significant changes in the record structure. It speeds up the search for records and attributes and can also search for duplicate records within the specified entities/reference sets.
Depending on the licensing terms, multithreading may be disabled for reindexDataJob. In this case the operation runs on only one node of the cluster (Standard Edition only).
Operation Parameters¶
User account name (text field). Account login; determines under which account the operation is run. If the field is empty, then when the operation is started by a Cron expression it has full rights to any entity/reference set, and when started through the UI it runs with the rights of the current account. For a data steward account, rights to entities/reference sets may need to be configured.
Clean indexes (checkbox). Deletes old indexes and creates new ones.
Skip default report (checkbox). Disables the recording of events in audit logs.
Update mappings (checkbox). Updates the index mappings.
Entities to reindex (drop-down list). Entities/reference sets for which reindexing will be performed (multiple selection is available). The default is All: the operation is performed for all entities/reference sets.
Block size (integer). Number of records loaded in one block. Default: 1024.
Reindex workflow definitions (drop-down list). Selects the workflows whose data will be reindexed. If the option is disabled, the processes are not indexed. The parameter is present only if the workflow module is available.
Reindex workflow data (checkbox). Enables or disables reindexing of workflow data. The parameter is present only if the workflow module is available.
Reindex records (checkbox). Starts reindexing records.
Reindex relations (checkbox). Starts reindexing relations.
Reindex drafts (checkbox). Starts reindexing drafts.
Reindex classification (checkbox). Starts reindexing classification.
Update matching tables data (checkbox). Starts the process of searching for duplicate records according to the configured rules. See Compare with matchingJob.
Write IDs log (writeIdLog, checkbox). Saves information about unsuccessful commit intervals to the database.
Process IDs log (processIdLog, checkbox). Enables additional processing of the accumulated unsuccessful requests.
Error Log¶
Note
Only one parameter should be enabled at a time: Write error log (writeIdLog) OR Process the error log (processIdLog)
The Write/Process error log parameters save information about records that did not make it into the index to the database, so that on the next run only the failed part has to be processed. This significantly saves time compared to a complete reindexing of a large number of records.
The parameters are used when the OpenSearch indexing queue is interrupted and indexing requests fail with the EsRejectedExecutionException error. In all other cases, the parameters must be disabled.
When a large amount of data needs to be reindexed:
first, run the operation with Write error log enabled;
then, if unindexed data remains, run the operation again with Write error log disabled and Process error log enabled.
Also, for large data volumes, you can disable the Clean indexes and Update mappings parameters.
"Block Size" (blockSize) Parameter Description¶
The total number of records to be processed is divided into blocks of blockSize records.
Each block is then processed by a single thread in chunks of com.unidata.mdm.job.reindex.data.commit.interval records (information about this number of records is kept in memory; when the thread moves on to the next chunk, the memory is cleared) until the records run out.
The com.unidata.mdm.job.reindex.data.commit.interval parameter usually does not need to be changed; the recommended value of 1024 is sufficient for most tasks. The larger this parameter, the more memory can be used at one time. If the parameter is greater than blockSize, it is effectively limited to blockSize.
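For example (illustrative values): with blockSize = 10,240 and commit.interval left at the default 1024, each block is processed in 10 chunks of 1024 records, and memory is released after each chunk; with the default blockSize = 1024, setting commit.interval to 4096 has no extra effect, since the interval is capped at blockSize.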
org.unidata.mdm.job.reindex.data.threads is the number of simultaneously running threads.
Both com.unidata.mdm.job.reindex.data.commit.interval and org.unidata.mdm.job.reindex.data.threads are set in backend.properties.
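A minimal backend.properties sketch with both parameters (the values shown are illustrative, not recommendations):

```
# Illustrative values; tune to the available cores and data volume
com.unidata.mdm.job.reindex.data.commit.interval=1024
org.unidata.mdm.job.reindex.data.threads=8
```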
Choose org.unidata.mdm.job.reindex.data.threads based on the number of logical processor cores: an equal or smaller number, depending on whether there is other load on the processor.
When a small blockSize is specified, it is easier to track the progress of the operation through the UI (startup manager > select startup > number of steps performed). From a performance point of view, it is better to use a sufficiently large blockSize, so that the total number of records being reindexed is approximately equal to N * blockSize * org.unidata.mdm.job.reindex.data.threads, where N is a small natural number, for example 1.
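For example (illustrative numbers): with org.unidata.mdm.job.reindex.data.threads = 8 and roughly 800,000 records to reindex, a blockSize of 100,000 corresponds to N = 1 (800,000 = 1 × 100,000 × 8), while a blockSize of 20,000 corresponds to N = 5.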
If blockSize is too large (e.g., 500,000), part of the data may not be written to the index even though the operation completes successfully.
The blockSize setting is needed to balance the amount of data being processed against the number of threads. Creating too many threads is bad, and it is just as bad when a single thread processes too much data at once. Therefore, choose moderate values based on the available server resources.
blockSize should also be selected according to the total amount of data so that the number of partitions is not too large. For very large data sets, such as an address directory, 500-2000 partitions is the best option.
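For example (illustrative numbers): with about 10,000,000 records, a blockSize of 10,000 gives 1,000 partitions, which falls within the recommended 500-2000 range, whereas the default blockSize of 1024 would produce almost 10,000 partitions.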
Data is processed sequentially: records > relations > classifiers > business processes > matching. Processing of one data type is completed before moving on to the next; only the data types that are available are processed.