MonkeyMatch 0.5.56 - Find & Fix Similar Spelling (7/7/13)

Download and get help for different MediaMonkey for Windows 4 Addons.

Moderators: Peke, Gurus

Scottes
Posts: 150
Joined: Sat Mar 21, 2009 6:51 am

Re: MonkeyMatch 0.5.34 - Find & Fix Similar Spelling (6/15/1

Post by Scottes »

dtsig wrote:Scottes, I am just now trying it out with limited testing. Although the majority of the changes are simple capitalization .. there currently have been no problems to report on. So far this is working as advertised. Thank you very much

Edited: Info on setup, Win7 Ultimate, 8gig, 144+tracks.
Well it was good to hear that it worked for a while. Until you posted again. :D

It is kinda frustrating because it works perfectly on my system. I use it weekly because I am re-ripping all my CDs to FLAC. As I add albums, I sync up the old MP3s with the new FLACs.

Granted, I only have 27,000 tracks and my database is extremely clean. Running this on larger, dirtier databases is punching some big holes in it.
dtsig
Posts: 3588
Joined: Mon Jan 24, 2011 6:34 pm

Re: MonkeyMatch 0.5.49 - Find & Fix Similar Spelling (6/30/1

Post by dtsig »

Well, it crashed MMW again (note newest beta) after 20-30 updates.

Of my 144+K files there were 8000 matches of which many I am finding are not misspellings. Just very similar names from the 50's.

It is odd how long it takes to process though. The time is lengthly even on names that might only have 2-3 tracks.

Let me know what you might need to help.

Thanks for the quick reply
Where's the db and ini stored
Reporting Bugs
Where tags are stored

Not affiliated with MediaMonkey ... just a RABID user/lover
DTSig
Scottes
Posts: 150
Joined: Sat Mar 21, 2009 6:51 am

Re: MonkeyMatch 0.5.49 - Find & Fix Similar Spelling (6/30/1

Post by Scottes »

144,000 tracks? Gawd, how long does the search take before it shows the first matches? It should take a VERY LONG time, since it's [basically] comparing every name against every other name. So 144,000 x 144,000 searches... (Well, it's about half that right off the bat, and it will ignoring pairs when the length is too different, and then there are more optimizations.)

Anyway, 144,000 tracks should take a while.


And yeah, the program can't tell if they are misspelled, just that they are very similar. So "Part 1" and "Part 3" are just as similar as "Part 1" and "Mart 1". I can't conceive of doing a typical spell-check against all the crazy band names and album names...


It is odd to me that it's crashing on you after making numerous changes. Just to make sure, it is finding all the matches, then starts displaying similar names, you correct about 20-30, and then it crashes?

Also, crashes - or stops responding?
If the former, any error message?
If the latter, have you tried walking away for 5 or 10 minutes to see if it comes back?


Lastly, would you mind sharing your database so I can see what you are seeing?
dtsig
Posts: 3588
Joined: Mon Jan 24, 2011 6:34 pm

Re: MonkeyMatch 0.5.49 - Find & Fix Similar Spelling (6/30/1

Post by dtsig »

Yeah .. it is actually MMW that throws the error but (poor old MM asks "why did I do this?") but it is different than just taking a long time.

The initial scan really isnt that bad. I will share my db. I will PM you
Where's the db and ini stored
Reporting Bugs
Where tags are stored

Not affiliated with MediaMonkey ... just a RABID user/lover
DTSig
hintergrundrauschen
Posts: 211
Joined: Sat Mar 29, 2008 6:20 pm

Re: MonkeyMatch 0.5.34 - Find & Fix Similar Spelling (6/15/1

Post by hintergrundrauschen »

Scottes wrote:You have - to put it mildly - A LOT of similar Song names. I ran MonkeyMatch, set at Moderately Accurate, against your songs and I ran out of memory at 5,400 matches - and it was only 1/4 the way through the songs. And that took an hour and 18 minutes...
Maybe for a temporary workaround you could try to set some filter, say for a certain folder or for songs starting with 'a'. But then, again, I assume that might not even help since filtering for a% is probably just as computationally consuming as searching everything (since it basically is searching everything), but I can't judge the memory consumption.
Claude
Scottes
Posts: 150
Joined: Sat Mar 21, 2009 6:51 am

Re: MonkeyMatch 0.5.49 - Find & Fix Similar Spelling (6/30/1

Post by Scottes »

I did some research over the last couple days, and have some ideas about what to look at. I just started re-running your database again about 30 minutes ago, trying to get the error to occur. Hopefully it will happen so that I can do some more pin-pointed research. Hmm, it's already found 31,000 match pairs... The other night it crashed around 5,500 every time. Maybe one of the small fixes I put in for dtsig and zuilserip also did some good for whatever was causing issues while processing your database. Wish me luck...


As for limiting memory usage, one could set both filters to "Artists starts with A" and run that, then set both to "Artist starts with B" and so on and on. It will be efficient and fast, but would need to be run 26 times. What a pain.

I have been thinking about the pains of working with such large databases. Ideas are going through my head. Multi-threading will be a huge help, since most PCs have multiple cores/threads. But running the entire database all at once may still be too much for some folks or systems, so I've been thinking about ways to process the database in sections.
zuilserip
Posts: 34
Joined: Wed Feb 22, 2012 8:00 pm

Re: MonkeyMatch 0.5.49 - Find & Fix Similar Spelling (6/30/1

Post by zuilserip »

Thank you a ton for your work on this Scottes - this already is a hugely helpful tool in our MediaMonkey arsenal and will become even more powerful as you swat away the few remaining bugs!
Scottes wrote:I have been thinking about the pains of working with such large databases. Ideas are going through my head. Multi-threading will be a huge help, since most PCs have multiple cores/threads. But running the entire database all at once may still be too much for some folks or systems, so I've been thinking about ways to process the database in sections.
You may already be doing this, or considered it and dismissed for some reason - in that case please ignore. Otherwise, instead of comparing every item against all others, you could limit yourself to comparing just among unique items. So, if a datanase has 50K artist names, but only 25K unique names - your search space goes from 50K^2 down to 25K^2 which is a four-fold decrease in time/effort. You should be able to find the list of 25K unique names in linear time by using a hash table (make the hash table large enough to keep collisions close to zero).
Scottes
Posts: 150
Joined: Sat Mar 21, 2009 6:51 am

Re: MonkeyMatch 0.5.49 - Find & Fix Similar Spelling (6/30/1

Post by Scottes »

I grab the unique names right off the bat. For instance, my database has 28,000 song files, but only 17,000 unique song titles.

Then I start stepping through them, but only if they have been not yet been compared. For instance, if A has been compared to B then there is no reason to compare B to A.

Then I check the name lengths. As an example, if A is 15 characters long then B must be between 13 and 17 characters long, or else I skip it.

Then there's the Blacklist. If you have run it before and specified that A is definitely not a match to B, then the A-B pair will be skipped.

Then there's a numeric check, so that names like "Part 1" and "Part 2" are skipped, and "CD 1" and "CD 2" and so on.

If it passes all of those checks, THEN it will do the expensive compare process.

But if the names are wildly different, the compare process exits as soon as the number of differences cross a threshold.


Even with all this, it is going to run tens of thousands - or hundreds of thousands - of compares. My instinct right now says to break the database into chunks based on name lengths, and just process each chunk independently. So compared all names under 10 characters long. Then compare all names between 11 and 20 characters. And so on. Basically this means you'd run MonkeyMatch 5 times, each time just looking at 1/5 of the database. It would *seem* faster - and would definitely take less memory - but it wouldn't actually be faster in the long run.


(If anyone else has any optimization ideas PLEASE let me know. I definitely appreciate it, because I doubt that I have thought of everything.)
zuilserip
Posts: 34
Joined: Wed Feb 22, 2012 8:00 pm

Re: MonkeyMatch 0.5.49 - Find & Fix Similar Spelling (6/30/1

Post by zuilserip »

Scottes wrote: Even with all this, it is going to run tens of thousands - or hundreds of thousands - of compares. My instinct right now says to break the database into chunks based on name lengths, and just process each chunk independently. So compared all names under 10 characters long. Then compare all names between 11 and 20 characters. And so on. Basically this means you'd run MonkeyMatch 5 times, each time just looking at 1/5 of the database. It would *seem* faster - and would definitely take less memory - but it wouldn't actually be faster in the long run.
(If anyone else has any optimization ideas PLEASE let me know. I definitely appreciate it, because I doubt that I have thought of everything.)
How about comparing all unique strings of size N to all other unique strings of size N±1? Or even size N±b where b is a small user defined integer. This would miss situations where the lengths are very different, but this shouldn't be a problem as the strings would be significantly different anyway.
Scottes
Posts: 150
Joined: Sat Mar 21, 2009 6:51 am

Re: MonkeyMatch 0.5.49 - Find & Fix Similar Spelling (6/30/1

Post by Scottes »

zuilserip wrote:How about comparing all unique strings of size N to all other unique strings of size N±1? Or even size N±b where b is a small user defined integer. This would miss situations where the lengths are very different, but this shouldn't be a problem as the strings would be significantly different anyway.
Maybe I'm confused by your suggestion, or maybe I was not clear in my explanation, because I think I'm already doing what you suggest.
Then I check the name lengths. As an example, if A is 15 characters long then B must be between 13 and 17 characters long, or else I skip it.
That is, every name in the database is checked against every other name. But if the other name is much longer or much shorter (+/- 2) then I stop working on that name. If the lengths of both names are within 2 then I will continue to work on that name.

As an example, Let's say that the current name is "Led Zeppelin" which is 12 characters long. I am comparing that against every other name in the database. The first name up is "Los Lobos" which is 9 characters long. 12 versus 9 is too different, so I ignore "Los Lobos" and immediately get the next name. "Manfredd Mann's Earth Band" is much longer than Led Zeppelin, so I move to the next name, "Mercy Train". "Led Zeppelin" is 12 characters and "Mercy Train" is 11, so that is close enough so I continue to compare those two names.
zuilserip
Posts: 34
Joined: Wed Feb 22, 2012 8:00 pm

Re: MonkeyMatch 0.5.49 - Find & Fix Similar Spelling (6/30/1

Post by zuilserip »

Sorry Scottes - I missed that part of your explanation. It looks like you are already doing something like what I proposed.

My suggestion was to order all the (unique) names by size. That way you would know where in the list you would start and end the comparisons. I.e., if the current name is of size 10, you would go down the list starting with strings of size 8 and go until you get to last of the strings of size 12. It sounds like you are already doing something equivalent.

Another way that might (further) speed up the process would be to create a hash for each name to immediately flag non-matches.

Here is an example of how this might work - as you create the list of unique names, you can attach a "quick out" flag to each name that would have, say the first 2 characters (all uppercase, any non alpha characters mapped to a wild-card, say '#' ). So:
Led Zeppelin -> LE
Mercy Train -> ME
2Pac -> #P
¡Cubanismo! -> #C
.38 Special -> ##
etc.

So there would be 729 possible flags (i.e., 27^2 where 27 = 26 alphas +1 wildcard). Now when you are comparing unique strings you can first quickly check whether it is possible for the two names to be similar enough based on their respective flags or - if not - you can skip comparing them. To check if a match is possible, you can have a bit matrix that maps every flag to every other possible flag (so 729x729). If it is possible for the strings to be similar enough, the matrix would have a 1, otherwise a 0. So if one flag is 'LE' and the other is 'L?' or '?E' or '?L' (where ? can be any other character) then it is conceivable that there is a match and the full check must take place. Otherwise, you can skip the comparison and go to the next one.

This could replace most of the very expensive string comparisons (I would guess more than 90%) with a single table lookup. A three character flag should be a lot more effective even, but then the lookup table might get too big.
Scottes
Posts: 150
Joined: Sat Mar 21, 2009 6:51 am

Re: MonkeyMatch 0.5.49 - Find & Fix Similar Spelling (6/30/1

Post by Scottes »

That is definitely an interesting concept. I am going to let my brain ponder on that.

There are numerous small string comparisons, like length comparisons, but the expensive full comparison is probably run less than 20% of the time. A wild guess, but now that I have several large databases I plan on doing some tests to check such things.

I am afraid that many optimization methods would cause a loss of accuracy though. Making it case-insensitive would probably speed things up tremendously, but is a huge concern if some other process considers "This" to not consider "this" to be a duplicate.
zuilserip
Posts: 34
Joined: Wed Feb 22, 2012 8:00 pm

Re: MonkeyMatch 0.5.49 - Find & Fix Similar Spelling (6/30/1

Post by zuilserip »

Scottes wrote:That is definitely an interesting concept. I am going to let my brain ponder on that.

There are numerous small string comparisons, like length comparisons, but the expensive full comparison is probably run less than 20% of the time. A wild guess, but now that I have several large databases I plan on doing some tests to check such things.

I am afraid that many optimization methods would cause a loss of accuracy though. Making it case-insensitive would probably speed things up tremendously, but is a huge concern if some other process considers "This" to not consider "this" to be a duplicate.
That makes sense Scottes - if you can profile the code you should be able to get a much better idea of where to focus any optimization efforts.

Regarding capitalization, the 'quick out' flag suggestion I made above should handle that correctly. So "Led Zeppelin", "LED ZEPPELIN", and "lEd zEpPeLiN" would all map to the same flag ("LE") and would be sent to the 'full comparison' check that would identify them as being different.
Scottes
Posts: 150
Joined: Sat Mar 21, 2009 6:51 am

Re: MonkeyMatch 0.5.53 - Find & Fix Similar Spelling (7/4/13

Post by Scottes »

Beta 0.5.53 - July 4, 2013
Download Here

Fixed a bug caused by extremely long names. The original limit was set to 250 characters so it choked on a name with 440 characters. The limit is now 1,000.

Added a counter to show how many songs have been changed. Just to let you know it's doing something and hasn't hung.

Changed the way that changed songs are saved and updated by MediaMonkey. Both dtsig and I saw an issue where both MonkeyMatch and MediaMonkey stopped responding and had to be manually killed. I believe this was because MonkeyMatch would change a bunch of songs and then tell MediaMonkey to update them all at once. MonkeyMatch now has MediaMonkey update a song as soon as it gets changed. It's a little bit slower, but seems much more stable.

Added an option to stop matching once a certain number of Match Pairs have been found. The default is 500, but this can be changed in the Configure menu, anywhere between 1 and 100,000. Without this stop MonkeyMatch would keep going until all matches were found or - even worse - when it ran out of memory and crashed. Even if it didn't run out of memory, who really wants to process 100,000 matches???

Clicking Help in the Configure menu now works.

When all Match Pairs have been processed, the Match Pair list disappears.

Made some minor tweaks and stability improvements, and added some more debugging capabilities.


At this point I think I have taken care of all known bugs, though I'm not 100% positive of the issue where both programs stop responding. What I did to address that issue is certainly a stability improvement, but I could not find a definitive root cause.

I want to thank everyone who sent me their database. These bugs would not have been found without them, and having them allowed me to do a LOT of testing. Thanks!!


I'm going back to work on the multithreading work to improve the matching speed - unless someone hits another bug.
dtsig
Posts: 3588
Joined: Mon Jan 24, 2011 6:34 pm

Re: MonkeyMatch 0.5.53 - Find & Fix Similar Spelling (7/4/13

Post by dtsig »

I am giving this a try right now and really like the inclusion of visual clues that something is happening.

Would it be possible to get more information. For example, when looking at matches for artists there are many many artist with very similar names ... possibly just gender changes. If you included maybe the album name with the artist name it might go a way to be sure we are changing the right things.

Just a thought.

Thanks again
Where's the db and ini stored
Reporting Bugs
Where tags are stored

Not affiliated with MediaMonkey ... just a RABID user/lover
DTSig
Post Reply