Posted: Sat Dec 22, 2007 7:26 pm
by Bex
Kevinowpb,
The script hasn't yet been modified to work with MM3, hence the [MM2] in the thread title.

Posted: Wed Jan 02, 2008 4:50 am
by trixmoto
Updating this script to MM3 is on my todo list.

Posted: Mon Jan 14, 2008 10:03 pm
by Kevinowpb
done yet? LOL
:)

Posted: Tue Jan 15, 2008 4:08 am
by trixmoto
No, but it is still on my list! :)

Posted: Sun Jan 20, 2008 7:56 am
by sommo
Is there an MM3 version of this plugin? I would really like it :D

Thanks

Posted: Sun Jan 20, 2008 8:05 am
by trixmoto
It's on my (incredibly long) list! :)

Posted: Sat Feb 02, 2008 10:57 am
by uwuerfel
Hi Trixmoto,

I really like your script.
The idea to create a report with the possibility to do some actions on the duplicates is brilliant.

The only thing I was missing is some sort of fuzzy or approximate string search.

More often than not, I don't have exact duplicate names.
There are some differences like:
- spelling errors ("abracadabra" -> "abracababra")
or
- One title has more info than the other ("I Want You" -> "I Want You (Chica Cherry Cola)")

For this type of comparison there are metric functions that tell you the "distance" between two strings. The distance is a non-negative integer: the minimum number n of operations needed to transform string1 into string2. An operation is something like inserting, deleting, or exchanging a character.
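
To make the idea concrete, here is a minimal sketch of the plain Levenshtein distance (insert, delete, and substitute only, no swaps). It is written in Python rather than VBScript purely for illustration, and the function name `levenshtein` is a made-up name, not part of any script in this thread.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and
    substitutions needed to turn string a into string b."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    # It starts as the row for the empty prefix of a: j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


print(levenshtein("abracadabra", "abracababra"))                    # 1
print(levenshtein("I Want You", "I Want You (Chica Cherry Cola)"))  # 20
```

The second call shows why the "more info" case is harder than the spelling-error case: every extra character counts as one insertion, so the distance grows with the length of the added text.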

I searched the internet a little and found the following information:

http://en.wikipedia.org/wiki/Fuzzy_string_searching
http://en.wikipedia.org/wiki/Levenshtein_distance
http://en.wikipedia.org/wiki/Damerau-Le ... n_distance
http://en.wikipedia.org/wiki/Bitap_algorithm
http://en.wikibooks.org/wiki/Algorithm_ ... plications


I took the above VBA implementation and converted it into VBS.
It's a simple VBS script that runs in WSH.

Code: Select all

' VB Script Document
option explicit

dim text1 : text1 = "abracadabra"
dim text2 : text2 = "abarcadabra"
dim text3 : text3 = "abarcababra"
dim text4 : text4 = "MozartSinatra"

wscript.echo  damerau_levenshtein( text1, text2, 3 )
wscript.echo  damerau_levenshtein( text2, text3, 3 )
wscript.echo  damerau_levenshtein( text1, text3, 3 )
wscript.echo  damerau_levenshtein( text1, text4, 20 )

Function damerau_levenshtein( s1, s2, limit )
    ReDim result(Len(s1), Len(s2))
    damerau_levenshtein = damerau_levenshtein_recurse( s1, s2, limit, result )
end function


Function damerau_levenshtein_recurse( s1, s2, limit, result )
'This function returns the Damerau-Levenshtein distance capped by the limit parameter.
'VBScript has no optional arguments, so limit must always be passed.
'Usage : e.g. damerau_levenshtein("Thibault","Gorisse",100) with a limit above the longer string's length for an uncapped distance
' or damerau_levenshtein("correctly written words","corectly writen words",4) to identify similar spellings
                    
    Dim diagonal 
    Dim horizontal 
    Dim vertical 
    Dim swap 
    Dim final 
    
    
    'Start of the strings analysis
    If result(Len(s1), Len(s2)) < 1 Then
        If Abs(Len(s1) - Len(s2)) >= limit Then
            final = limit
        Else
            If Len(s1) = 0 Or Len(s2) = 0 Then
                'End of recursivity
                final = Len(s1) + Len(s2)
            Else
            
                'Core of levenshtein algorithm
                If Mid(s1, 1, 1) = Mid(s2, 1, 1) Then
                    final = damerau_levenshtein_recurse(Mid(s1, 2), Mid(s2, 2), limit, result)
                Else
                    
                    If Mid(s1, 1, 1) = Mid(s2, 2, 1) And Mid(s1, 2, 1) = Mid(s2, 1, 1) Then
                        'Damerau extension counting swapped letters
                        swap = damerau_levenshtein_recurse(Mid(s1, 3), Mid(s2, 3), limit - 1, result)
                        final = 1 + swap
                    Else
                        'The function minimum is implemented via the limit parameter.
                        'The diagonal search usually reaches the limit the quickest.
                        diagonal = damerau_levenshtein_recurse(Mid(s1, 2), Mid(s2, 2), limit - 1, result)
                        horizontal = damerau_levenshtein_recurse(Mid(s1, 2), s2, diagonal, result)
                        vertical = damerau_levenshtein_recurse(s1, Mid(s2, 2), horizontal, result)
                        final = 1 + vertical
                    End If
                End If
                
            End If
        End If
    Else
        'retrieve intermediate result
        final = result(Len(s1), Len(s2)) - 1
    End If
        
    'returns the distance capped by the limit
    If final < limit Then
        damerau_levenshtein_recurse = final
        'store intermediate result
        result(Len(s1), Len(s2)) = final + 1
    Else
        damerau_levenshtein_recurse = limit
    End If
    
End Function

At first I hoped the comparison in your script was a single function that I could easily exchange for the function above, but then I discovered that you use dictionary objects...
So integrating the above function would mean a major change to your code structure, and I wasn't sure if you would like it if I completely changed your script...

So, if you like my idea, maybe you can integrate the above function into your great script for the next release?

I would really love it :-)


CIAo, uwe..

Posted: Sat Feb 02, 2008 4:07 pm
by trixmoto
Yeah, fuzzy matching is certainly something that I plan to invest some time in and probably add to a number of my scripts once I've got something working well. Thanks for doing this work for me, I'll certainly try to make good use of it. :)

Bug

Posted: Mon Feb 11, 2008 4:43 pm
by chester
Hi Trixmoto

Really love this script! However, I get this bug when I try to flag a lot of files (500+).

Here are the errors:


Image

Image

By the way, I'm using MM3 and your version 2.1. Oh, and I've only just noticed that it isn't even supposed to work in MM3. Sorry!

Posted: Sun Feb 17, 2008 1:18 pm
by trixmoto
Yeah, it's not ready for MM3 yet, but it is on my list to update. :)

Posted: Sun Mar 16, 2008 6:31 pm
by trixmoto
New version (2.2) is now available to download from my website. Changes include...

- Made compatible with MM3
- Fixed commit errors when deleting dupes
- Added option to fuzzy match titles (thanks to Uwuerfel)

The fuzzy matching is based on the Damerau-Levenshtein distance algorithm. Please increase the "distance" value slowly, as higher values add a significant delay to the processing. There is a timeout though, so it shouldn't ever jam the script.

Posted: Sun Mar 16, 2008 6:49 pm
by nynaevelan
Are you trying to fake me out?? :-? I get the following error when trying to download the script:

Code: Select all

File does not exist. Make sure you specified correct file name.

Am I moving too fast??

Nyn

Posted: Sun Mar 16, 2008 6:52 pm
by Bex
I'm planning to steal the Damerau-Levenshtein distance algorithm from your script. :lol:
But it can't be downloaded from your site:
File does not exist. Make sure you specified correct file name.

Posted: Mon Mar 17, 2008 3:51 am
by trixmoto
My apologies, I forgot to actually upload the files - a minor flaw! :oops:

@Bex - the algorithm works well if the two strings are actually similar (e.g. Monkey and Mokney), but if they are completely different and you use a distance above about 4, it can take a very long time (literally hours in some cases). That's why I put in the "CheckFuzzy" function and the timeout. If you find a way to improve it, please let me know! :)
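
One possible direction, offered as a sketch rather than a fix for the actual script: the recursive function posted earlier in this thread only caches results that fall below the limit, so for very different strings the same capped subproblems are likely recomputed many times. An iterative table-based version avoids that and runs in time proportional to the product of the two string lengths. The example below is in Python for illustration only (it is not the script's VBScript), `osa_distance` is a hypothetical name, and it implements the "optimal string alignment" variant of Damerau-Levenshtein, which may differ slightly from the recursive version in edge cases.

```python
def osa_distance(a, b, limit):
    """Optimal string alignment distance (Levenshtein plus swaps of
    adjacent characters), capped at limit."""
    # The length difference is a lower bound on the distance.
    if abs(len(a) - len(b)) >= limit:
        return limit
    prev2 = None                    # row i-2, needed for swaps
    prev = list(range(len(b) + 1))  # row i-1 (empty prefix of a)
    for i in range(1, len(a) + 1):
        curr = [i]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(prev[j] + 1,         # deletion
                       curr[j - 1] + 1,     # insertion
                       prev[j - 1] + cost)  # match or substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                best = min(best, prev2[j - 2] + 1)  # adjacent swap
            curr.append(best)
        # If every entry in this row has reached the limit, the final
        # distance cannot come out lower, so stop early.
        if min(curr) >= limit:
            return limit
        prev2, prev = prev, curr
    return min(prev[-1], limit)


print(osa_distance("Monkey", "Mokney", 4))                # 1
print(osa_distance("abracadabra", "abarcababra", 3))      # 2
print(osa_distance("abracadabra", "MozartSinatra", 3))    # 3 (capped)
```

With this shape the cost of a comparison depends on the string lengths rather than on the chosen distance, so raising the limit should not cause the runaway delays described above.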

SQL errors

Posted: Wed Mar 19, 2008 10:17 am
by tbok
Hi trixmoto,

thanks for a great script - I love the concept, and finding the dupes is working well with my collection, but I get SQL errors when trying to do the processing (removal of dupes).

This is what I get:

Image

Any idea what the problem might be? :(

I'm using MM version 3.0.3.1140

Thanks.