
ID #1131

How do the Muse deduplication algorithms function?

The main deduplication algorithms used in the Muse application are:

1. Title
 - The DeDuplication algorithm performs a string comparison on the whole value of the TITLE field.
2. URL
 - The DeDuplication algorithm performs a string comparison on the whole value of the URL field.
3. Host
 - The DeDuplication algorithm first extracts the host name from the URL field and then performs a string comparison on the whole host name.
4. Raw
 - The DeDuplication algorithm performs a string comparison on the whole value of the RAWDATA field.
5. Title 3111
 - Stop words (such as a, about, above, according, accordingly, across, after, afterward, the, etc.) are ignored. From the remaining words of the title, a string is formed by taking the first three letters of the first word and the first letter of each of the second, third, and fourth words (3, 1, 1, and 1 letters). The resulting string is used for deduplication.
6. Field
 - The DeDuplication algorithm performs a string comparison on the whole value of a specified field.
7. Compare
 - The DeDuplication algorithm works like the Field algorithm, with some additional options:
  - for each Source Package, one can specify the record fields used to build the string that is compared;
  - for each Source Package, one can specify accuracy and precision degrees for the comparison; depending on these parameters, two records are considered duplicates or not.
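
To illustrate the Title 3111 description above, here is a minimal Python sketch of how such a key could be built. It is not Muse's actual implementation: the function name title_3111_key, the word-splitting rule, and the lowercasing are assumptions, and the stop-word set contains only the examples named in this FAQ (the real list is longer).

```python
import re

# Partial stop-word list: only the examples named in the FAQ.
# The full list used by Muse is not given here.
STOP_WORDS = {
    "a", "about", "above", "according", "accordingly",
    "across", "after", "afterward", "the",
}

def title_3111_key(title: str) -> str:
    """Build a 3-1-1-1 key from the first four non-stop words of a title."""
    # Assumption: a "word" is a run of letters or digits; punctuation separates words.
    # Lowercasing mirrors the case-insensitive default mentioned in the FAQ.
    words = [w for w in re.findall(r"[^\W_]+", title.lower()) if w not in STOP_WORDS]
    # Take 3 letters from the first word, then 1 letter from each of the next three.
    # zip() simply stops early when the title has fewer than four usable words.
    return "".join(word[:n] for word, n in zip(words, (3, 1, 1, 1)))
```

With this sketch, "A Study about the Deep Blue Sea" and "Study: Deep Blue Sea" both produce the key "studbs", so the two records would be treated as duplicates.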

Note: By default, all comparisons are case insensitive.
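
The whole-value algorithms (Title, URL, Host, Raw) combined with the case-insensitive default can be sketched in Python as follows. This is an illustration under stated assumptions, not Muse code: records are modeled as plain dictionaries, the names dedupe_key and deduplicate are hypothetical, and keeping the first record of each duplicate group is an assumption, since this FAQ does not say how duplicates are merged.

```python
from urllib.parse import urlsplit

# The field names TITLE, URL and RAWDATA come from the FAQ; the lowercase
# algorithm names used as dictionary keys are illustrative only.
FIELD_FOR_ALGORITHM = {"title": "TITLE", "url": "URL", "raw": "RAWDATA"}

def dedupe_key(record: dict, algorithm: str) -> str:
    """Return the string that is compared for the given algorithm."""
    if algorithm == "host":
        # Extract the host name from the URL field first.
        # URLs without a scheme yield no host name in this sketch.
        value = urlsplit(record.get("URL", "")).hostname or ""
    else:
        value = record.get(FIELD_FOR_ALGORITHM[algorithm], "")
    # Case-insensitive comparison on the whole value.
    return value.casefold()

def deduplicate(records: list, algorithm: str) -> list:
    """Keep the first record for each distinct key (an assumption)."""
    seen = set()
    unique = []
    for record in records:
        key = dedupe_key(record, algorithm)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

For example, two records whose URL fields are "http://example.com/a" and "http://EXAMPLE.com/b" are distinct under the "url" algorithm but duplicates under "host".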

Tags: algorithm, dedupe, deduplication, duplicate, same, title3111, title 3111

Related entries: -

Last update: 2013-09-02 15:52
Author: Administrator
Revision: 1.0
