Fuzzy Duplicate Checking

Download and get help for different MediaMonkey Addons.

Moderators: Peke, Gurus

uwuerfel
Posts: 76
Joined: Tue Jan 08, 2008 3:29 pm
Location: Germany

Fuzzy Duplicate Checking

Post by uwuerfel » Tue Apr 01, 2008 4:10 pm

Hi all,

i have experimented a little bit with fuzzy technologies.
So I created a duplicate checker, that uses that technique.
Since there exist plenty (and very good) duplicate checkers,
which do exact checks, I have implemented only fuzzy checks.
That's also the reason why it is quite time consuming.
Checking about 2000 songs needed several hours...

Here is what it does:

The script has two modes:
- check new songs against a base of "clean" songs (reference list)
- check a list of songs against themselves
(this is to create the base list of "clean" songs)

The script uses play lists as input and output.
The list of clean songs is a play list (typically an auto play list).
All the possible duplicates are stored in play lists.
The songs to check are the current selection.


First I do some pre-processing steps to normalize the text:
- (optional) remove text, that is between brackets
Brackets (and their content) are only removed from the "new" song titles.
From the "clean" reference list, the content of the brackets is NOT removed.
This is to be able to find duplicates like "Flashdance" and "What a feeling (Flashdance)".
- (optional) remove special characters like ",;.:'´`!?(){}..."
- (optional) convert to upper case
- convert multiple spaces to one space
- remove leading or trailing spaces


Then the main check routine which itself is divided in three phases:
1) check all the songs
(either against themselves or against the clean list)
The check is a substring search which is made word by word.
The substring search looks for the words of the (new) song title in
the song title of the compared name. The words need to be in the same
order, but there can be other words in between.
(e.g. "whole in head" matches "you have a whole in your head")
Optional: words with less than x characters can be ignored.
Ignoring words with just 1 character is sometimes useful.

Each word is checked with fuzzy methods (Damerau-Levenshtein).
This method defines an editing distance; i.e. how many editing steps
do you need, to convert one word into an other. An editing step is:
Insertion, deletion or substitution of a character and swapping 2 adjacent
characters. An edit distance of 1 per word covers most typing errors.
I typically use an edit distance of 2.
The algorithm automatically sets the edit distance to 1 if the word has
at most 3 characters and it sets the edit distance to 0 if the word has
at most 1 character (in case you didn't ignore words with 1 character).

The result is stored in play lists. The default name for the result node
is "FuzzyDupCheck_Results".
For each duplicate a play list is created below that node.
The name of the play list is the normalized name of the song.
Most of the time you will just have 2 or 3 songs in such a play list.

2) compare the names of the result play lists
Very often you will get result play lists with similar names which have
then the same songs as possible duplicates. This is due to the fuzzy
check.
In phase 2 such play lists are identified and merged into one single
play list.

3) delete "real" duplicates from the result play lists
Especially phase 2 can lead to the situation that the same file is
more than once in a resulting play list. These "real" duplicates
are removed from the result play lists now.


You can configure the behaviour of the script in some constants from line 65 to 76.
The constants and their meaning are:

strResultPlNode: Name of the root node below which all result list are created
strRefPlList: Name of the (auto) play list with the "clean" reference list
intMaxDistCount: Edit distance for Damerau-Levenshtein
intMinWordLength: Words of at most this length are ignored
strSpecChars: Regular expression with all the special characters, that are ignored
boIgnoreBrackets: Flag if brackets (and their content) are removed from the "new" song title
boIgnoreSpecialChars: Flag is special characters are removed
boUpperCase: Flag if check is case insensitive
boPhase1: Flag if phase 1 shall be executed
boPhase2: Flag if phase 2 shall be executed
boPhase3: Flag if phase 3 shall be executed


http://www.mediafire.com/?m0zlcmdfgpe


To test the script please copy it into the scripts folder and put s.th. like this in your scripts.ini:

[FuzzyDupCheck]
FileName=CheckForDuplicates_V0-3-0.vbs
ProcName=CheckForDuplicates
Order=1
DisplayName=FuzzyDupCheck
Description=FuzzyDupCheck
Language=VBScript
ScriptType=0

[FuzzyDupCheckNew]
FileName=CheckForDuplicates_V0-3-0.vbs
ProcName=CheckForNewDuplicates
Order=2
DisplayName=FuzzyDupCheckNewFiles
Description=FuzzyDupCheckNewFiles
Language=VBScript
ScriptType=0


The script works for me, but it is still experimental.
Right now I just use it for myself. Could this be useful for somebody else?
If so, I might add some GUI elements to simplify its usage...

Please tell me what you think about it.



CIAo, uwe..
Autor of Radio-DJ

trixmoto
Posts: 10024
Joined: Fri Aug 26, 2005 3:28 am
Location: Barton, UK
Contact:

Post by trixmoto » Wed Apr 02, 2008 3:27 am

Sounds pretty good, I'm looking forward to playing with it. Also interested to see if you implemented the Damerau-Levenshtein algorithm differently to me! :)
Download my scripts at my own MediaMonkey fansite.
All the code for my website and scripts is safely backed up immediately and for free using Dropbox.
Send me BTC: 34VQPVsf9mCeR4nfhFvvBYZqQ7LkqNZ8Mn
Send me LTC: 3P1mzrfbyscdhbxRpXLgKz7tufGAU3SrEG
Send me DOGE: 9xPpYSqgF7P5yQiqvE1VqWb4UjxVCCLFJ6
Check out these great cryptocurrency faucets... BTC / LTC / DOGE

GuHu
Posts: 63
Joined: Mon Feb 12, 2007 6:25 am

Post by GuHu » Thu Apr 03, 2008 4:11 am

Hi Uwuerfel,

thank you very much for spreading your script. Iam very interested in optimizing my database and played with the Levenshtein-Distance, too.

I used it to find possible typos in the artists and alum artists, see e.g.
http://www.mediamonkey.com/forum/viewtopic.php?t=25115

It works semiautomizid because it cannot decide wether ATC <-> ATB is a typo or two different artists.

A drawback is as you already mentioned the performance. I improved the script in the meantime, it remembers the already processed artists by former runs of the script.

I planned to expand the script to titles, bu now i will test your script.

GuHu

uwuerfel
Posts: 76
Joined: Tue Jan 08, 2008 3:29 pm
Location: Germany

Post by uwuerfel » Thu Apr 03, 2008 2:13 pm

Hi Trixmoto,
trixmoto wrote:Also interested to see if you implemented the Damerau-Levenshtein algorithm differently to me! :)
Well I just used what I got from wikibooks and what I left in your thread.
If you modified it more than just renaming some variables, we have different implementations.


CIAo, uwe..
Autor of Radio-DJ

trixmoto
Posts: 10024
Joined: Fri Aug 26, 2005 3:28 am
Location: Barton, UK
Contact:

Post by trixmoto » Thu Apr 03, 2008 5:12 pm

No, I did a bit of renaming and tidying up of the code (I'm a bit anal when it comes to code!) and changed the caller method but apart from that i didn't. I did a pre-check to make sure that the strings weren't too different because I found it was taking ages to process. Maybe I was just allowing the limit to be too high though.
Download my scripts at my own MediaMonkey fansite.
All the code for my website and scripts is safely backed up immediately and for free using Dropbox.
Send me BTC: 34VQPVsf9mCeR4nfhFvvBYZqQ7LkqNZ8Mn
Send me LTC: 3P1mzrfbyscdhbxRpXLgKz7tufGAU3SrEG
Send me DOGE: 9xPpYSqgF7P5yQiqvE1VqWb4UjxVCCLFJ6
Check out these great cryptocurrency faucets... BTC / LTC / DOGE

Post Reply