AutoTag Web from Amazon - Encoding Error (Unicode)

Post by **tonydl** » Tue Jan 24, 2012 7:43 am

Hello,

I tried to AutoTag an audiobook from Amazon Germany (de), and there seems to be a Unicode error affecting the German umlauts (ä, ö, ü). I guess more characters are affected, such as ß.

Example: Search for the following album: "Die drei ??? - Folge 137/Pfad der Angst"

Screenshot: http://img717.imageshack.us/img717/1444 ... 133832.png

On the website all information is displayed correctly: http://www.amazon.de/gp/product/B003273MKS/

I'm using the latest build (4.0.2) on Windows 7 x64.

Thanks, best regards,
tony.

Post by **Lowlander** » Tue Jan 24, 2012 12:08 pm

It is said to be a bug on Amazon's side: http://www.ventismedia.com/mantis/view.php?id=5974

Post by **tonydl** » Tue Jan 24, 2012 6:52 pm

I'm not sure my issue is related to that one, as there are no "unknown characters" (like the box or the question mark in the issue you posted).
The bug you posted describes an "invalid unicode detection", but what I'm posting is just a decoding issue.

It's actually a pretty common Unicode (UTF-8) decoding error.
I'll try to expand on it further:

These umlauts are encoded with 16 bits (two bytes) in UTF-8, while "normal" (ASCII) characters are encoded with 8 bits.
If an ISO-8859-1 interpreter parses UTF-8, it produces two characters, because ISO-8859-1 only knows 8-bit characters and therefore reads each byte of the umlaut as a separate character.

Example:
UTF-8 character "ü": 11000011 10111100
is interpreted as two 8-bit blocks:
11000011 --> "Ã"
10111100 --> "¼"
which will result in the two characters "Ã¼" instead of "ü".
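The effect described in this example can be reproduced in a few lines. This is only a minimal sketch in Python 3 (the language is an assumption for illustration; it is not what MediaMonkey uses internally). It encodes "ü" as UTF-8, shows the bit pattern of each byte, and then decodes those bytes as ISO-8859-1.

```python
# Encode "ü" as UTF-8: this yields two bytes.
utf8_bytes = "ü".encode("utf-8")

# Show each byte as an 8-bit pattern.
bits = " ".join(f"{byte:08b}" for byte in utf8_bytes)
print(bits)      # 11000011 10111100

# Misread the same two bytes as ISO-8859-1: one character per byte.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)   # Ã¼
```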

There are plenty of frameworks that help with the conversion.
As a workaround, a simple replacer could swap "Ã¼" for "ü" and handle the other most common characters - there are lists for that on the internet, too.

Post by **tonydl** » Thu Jan 26, 2012 6:26 pm

*push*

Post by **MiPi** » Fri Jan 27, 2012 2:48 am

tonydl: we do handle UTF-8 strings, but sometimes the problem is on Amazon's side. The string MM receives as the XML response from their server is not plain UTF-8: some parts of the response are encoded to UTF-8 twice. So after MM decodes it once, the text is still UTF-8-encoded, which produces the characters you described. This affects only some records; the same album with the same umlauts can sometimes be received correctly from another Amazon server, or from another related record on the same Amazon server. They display it correctly on the web, but sometimes send it incorrectly in the XML response.
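The double encoding described here can be simulated and, in the simple case, reversed. The following is a rough sketch in Python 3, not MM's actual code: `double_encode` and `undo_double_encoding` are hypothetical helper names, and ISO-8859-1 is assumed as the intermediate charset, as in the earlier explanation.

```python
def double_encode(text: str) -> bytes:
    """Simulate the faulty response: UTF-8 bytes are misread as
    ISO-8859-1 and then encoded to UTF-8 a second time."""
    return text.encode("utf-8").decode("iso-8859-1").encode("utf-8")


def undo_double_encoding(text: str) -> str:
    """Reverse one extra round of encoding; return the input unchanged
    if it does not look double-encoded."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# What a client sees after decoding the response once as UTF-8.
received = double_encode("für Anfänger").decode("utf-8")
print(received)                        # fÃ¼r AnfÃ¤nger
print(undo_double_encoding(received))  # für Anfänger
```

Reversing the whole string like this only works if every non-ASCII character in it was double-encoded. A string that mixes correct and double-encoded parts fails the check and is returned unchanged, so it would need a per-sequence replacement instead.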

Post by **tonydl** » Sun Jan 29, 2012 3:28 pm

Thanks for the answer.

Could you maybe implement a replace workaround?
For the German letters I'm using the following replace() calls myself:

Code:

Ã„ --> Ä
Ã¤ --> ä 
Ãœ --> Ü 
Ã¼ --> ü 
Ã– --> Ö 
Ã¶ --> ö 
ÃŸ --> ß

As it's highly unlikely that the character combinations on the left occur in real text, it should be pretty failsafe. And the characters on the right are used quite a lot.

What do you think?
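For illustration, the seven replacements listed above could be applied like this. It is a minimal sketch in Python 3; `GERMAN_FIXES` and `fix_german` are hypothetical names, and nothing here reflects how MM would actually implement it.

```python
# Mis-decoded sequence -> intended German character (from the list above).
GERMAN_FIXES = {
    "Ã„": "Ä",
    "Ã¤": "ä",
    "Ãœ": "Ü",
    "Ã¼": "ü",
    "Ã–": "Ö",
    "Ã¶": "ö",
    "ÃŸ": "ß",
}


def fix_german(text: str) -> str:
    """Replace each known two-character sequence with its umlaut or ß."""
    for broken, correct in GERMAN_FIXES.items():
        text = text.replace(broken, correct)
    return text


print(fix_german("GrÃ¶ÃŸe"))  # Größe
```

Text that contains none of these sequences passes through unchanged.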

Edit: I found a larger table on the web, from which people from Spain, Portugal, Greece, etc. would benefit, too.
The replace should again be pretty failsafe (no "wrong replaces") because the char-combinations on the right won't make any sense.
I think it would be a pretty good solution.

(the table is reversed compared to the one above - correct char on the left, String to replace on the right)

Code:

    "¡" = "Â¡"
    "¢" = "Â¢"
    "£" = "Â£"
    "¤" = "Â¤"
    "¥" = "Â¥"
    "¦" = "Â¦"
    "§" = "Â§"
    "¨" = "Â¨"
    "©" = "Â©"
    "ª" = "Âª"
    "«" = "Â«"
    "¬" = "Â¬"
    "®" = "Â®"
    "¯" = "Â¯"
    "°" = "Â°"
    "±" = "Â±"
    "²" = "Â²"
    "³" = "Â³"
    "´" = "Â´"
    "µ" = "Âµ"
    "¶" = "Â¶"
    "·" = "Â·"
    "¸" = "Â¸"
    "¹" = "Â¹"
    "º" = "Âº"
    "»" = "Â»"
    "¼" = "Â¼"
    "½" = "Â½"
    "¾" = "Â¾"
    "¿" = "Â¿"
    "À" = "Ã€"
    "Â" = "Ã‚"
    "Ã" = "Ãƒ"
    "Ä" = "Ã„"
    "Å" = "Ã…"
    "Æ" = "Ã†"
    "Ç" = "Ã‡"
    "È" = "Ãˆ"
    "É" = "Ã‰"
    "Ê" = "ÃŠ"
    "Ë" = "Ã‹"
    "Ì" = "ÃŒ"
    "Î" = "ÃŽ"
    "Ñ" = "Ã‘"
    "Ò" = "Ã’"
    "Ó" = "Ã“"
    "Ô" = "Ã”"
    "Õ" = "Ã•"
    "Ö" = "Ã–"
    "×" = "Ã—"
    "Ø" = "Ã˜"
    "Ù" = "Ã™"
    "Ú" = "Ãš"
    "Û" = "Ã›"
    "Ü" = "Ãœ"
    "Þ" = "Ãž"
    "ß" = "ÃŸ"
    "à" = "Ã "
    "á" = "Ã¡"
    "â" = "Ã¢"
    "ã" = "Ã£"
    "ä" = "Ã¤"
    "å" = "Ã¥"
    "æ" = "Ã¦"
    "ç" = "Ã§"
    "è" = "Ã¨"
    "é" = "Ã©"
    "ê" = "Ãª"
    "ë" = "Ã«"
    "ì" = "Ã¬"
    "í" = "Ã­"  (Ã followed by the invisible soft hyphen, U+00AD)
    "î" = "Ã®"
    "ï" = "Ã¯"
    "ð" = "Ã°"
    "ñ" = "Ã±"
    "ò" = "Ã²"
    "ó" = "Ã³"
    "ô" = "Ã´"
    "õ" = "Ãµ"
    "ö" = "Ã¶"
    "÷" = "Ã·"
    "ø" = "Ã¸"
    "ù" = "Ã¹"
    "ú" = "Ãº"
    "û" = "Ã»"
    "ü" = "Ã¼"
    "ý" = "Ã½"
    "þ" = "Ã¾"
    "ÿ" = "Ã¿"
    "†" = "â€ "
    "Š" = "Å "
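Most of a table like this can be generated rather than typed by hand. The sketch below is in Python 3, with the hypothetical helper names `build_table` and `repair`. It assumes the mis-decoding charset is Windows-1252 rather than strict ISO-8859-1, because sequences such as "Ã„" and "Ãœ" contain characters that do not exist in ISO-8859-1. It covers only U+00A1 to U+00FF, so entries like "†" and "Š" would still have to be added separately.

```python
def build_table(first: int = 0xA1, last: int = 0xFF) -> dict:
    """Map each mis-decoded sequence to the character it stands for."""
    table = {}
    for code_point in range(first, last + 1):
        correct = chr(code_point)
        try:
            broken = correct.encode("utf-8").decode("cp1252")
        except UnicodeDecodeError:
            # The second byte has no Windows-1252 mapping; skip it.
            continue
        table[broken] = correct
    return table


def repair(text: str, table: dict) -> str:
    """Apply every replacement in the table to the text."""
    for broken, correct in table.items():
        text = text.replace(broken, correct)
    return text


table = build_table()
print(table["Ã¼"])                # ü
print(repair("GrÃ¶ÃŸe", table))  # Größe
```

The skipped code points are Á, Í, Ï, Ð and Ý, whose second UTF-8 byte is undefined in Python's `cp1252` codec. That may be why those letters are missing from the table above, though this is only a guess.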

Post by **MiPi** » Mon Jan 30, 2012 5:23 am

I agree, it could be a good improvement. I reopened the issue for it: http://www.ventismedia.com/mantis/view.php?id=5974
