Data Reapplying Operation (reapplyDataJob)¶
The operation reapplies data quality rules to records: it runs the rules, then indexes and saves the results to the database. It is used when the data model is updated or new attributes are added, as well as when new quality rules are created or existing ones are modified.
The operation parameters let you select which sets of quality rules to run for which model objects (entities/reference sets). For example, if rule set X is configured for entities 1 and 2, and the operation parameters specify set X and only entity 1, then only the rules of set X will be reapplied to the data of entity 1.
Operation Parameters¶
User account name (input field). Account login. Determines on whose behalf the changes are committed, i.e. whose audit entries and stamps are written to the index and the database. The execution report is sent to the user who started the operation. No permission check is performed.
Rule sets to run (drop-down list). List of quality rule sets created under "Data quality".
Entities to reapply (drop-down list). The list of entities/reference sets to which the operation will be applied.
Block size (integer). The size of the block of data to be loaded. Default is 1024.
Notes:
The operation does not perform model remapping;
The operation cannot be re-run if errors occur;
The operation does not initiate record matching;
The execution report does not contain information about the number of records processed.
Connection with Pipelines¶
The reapply operation has its own pipelines: [BATCH_RECORD_UPSERT_START] reapply-records-bulk-pipeline (the general bulk pipeline) and [RECORD_UPSERT_START] reapply-records-worker-pipeline (the pipeline that defines the actions performed on each individual record).
To work correctly with rule sets configured for the ETALON phase, you must add a segment of the Point type, RECORD_UPSERT_QUALITY_ETALON, to the reapply-records-worker-pipeline pipeline.
The operation does not work with rule sets configured for the ORIGIN phase.
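A minimal sketch of what the composition of the worker pipeline might look like after this change. Only RECORD_UPSERT_START and RECORD_UPSERT_QUALITY_ETALON are taken from this section; the other entries and the layout are illustrative assumptions, not the exact Unidata pipeline schema, and in practice the pipeline is edited through the pipeline configuration of your installation.

    reapply-records-worker-pipeline
        start point : RECORD_UPSERT_START            (existing start segment)
        ...                                          (existing segments stay unchanged)
        point       : RECORD_UPSERT_QUALITY_ETALON   (Point segment added for ETALON-phase rule sets)
        finish      : <existing finish segment>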
"Block Size" (blockSize) Parameter Description¶
The total number of records to be processed is divided into blocks of blockSize records. Each block is then processed by a single thread in batches of com.unidata.mdm.job.reapply.data.commit.interval records (the data for the current batch is held in memory and released when the thread moves on to the next batch) until the records run out.
The com.unidata.mdm.job.reapply.data.commit.interval parameter normally does not need to be changed; the recommended value of 1024 is sufficient for most tasks. The larger the value, the more memory may be used at one time. If the value is greater than blockSize, it is effectively capped at blockSize.
org.unidata.mdm.job.reapply.data.threads is the number of threads processing data simultaneously.
The com.unidata.mdm.job.reapply.data.commit.interval and org.unidata.mdm.job.reapply.data.threads parameters are set in backend.properties. Choose org.unidata.mdm.job.reapply.data.threads according to the number of logical processor cores: use an equal or smaller number, depending on whether the processor carries other load.
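A possible backend.properties fragment with the two parameters described above. The values are assumptions: 1024 is the recommended commit interval from this section, and 8 threads is an example for a server with 8 logical cores.

    # Batch size a thread holds in memory before moving on to the next batch (recommended: 1024)
    com.unidata.mdm.job.reapply.data.commit.interval=1024
    # Number of threads processing blocks in parallel (example value for 8 logical cores)
    org.unidata.mdm.job.reapply.data.threads=8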
When a small blockSize is specified, it is easier to track the progress of the operation through the UI (startup manager > select a startup > number of steps completed). From a performance point of view, it is better to use a blockSize large enough that the total number of processed records is approximately equal to N * blockSize * com.unidata.mdm.job.reapply.data.threads, where N is a small natural number, for example 1.
If blockSize is too large (for example, 500000), part of the data may not be written even though the operation completes successfully.
The blockSize setting is needed to balance the amount of data being processed against the number of threads. Creating too many threads is as bad as having one thread process too much data at once, so choose moderate values based on the available server resources.
blockSize must also be chosen according to the total amount of data, so that the number of partitions is not too large. For very large data sets, such as an address directory, 500-2000 partitions is the best option.
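A hedged illustration with hypothetical volumes: for a reference set of about 50,000,000 records, choosing blockSize = 50,000 gives

    50,000,000 / 50,000 = 1,000 partitions

which falls within the recommended 500-2000 range.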
Data is processed sequentially: records > relations > classifiers > business processes > matching. Processing of one data type is completed first, and only then does the operation move on to the next one. Only the data types that are present are processed.