Skip to content Skip to sidebar Skip to footer

Double-decoding Unicode In Python

I am working against an application that seems keen on returning, what I believe to be, double UTF-8 encoded strings. I send the string u'XüYß' encoded using UTF-8, thus becoming

Solution 1:

What you want is the encoding where Unicode code point X is encoded to the same byte value X. For code points inside 0-255 you have this in the latin-1 encoding:

def double_decode(bstr):
    return bstr.decode("utf-8").encode("latin-1").decode("utf-8")

Solution 2:

Don't use this! Use @hop's solution.

My nasty hack: (cringe! but quietly. It's not my fault, it's the server developers' fault)

defdouble_decode_unicode(s, encoding='utf-8'):
    return''.join(chr(ord(c)) for c in s.decode(encoding)).decode(encoding)

Then,

>>> double_decode_unicode('X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f')
u'X\xfcY\xdf'>>> print _
XüYß

Solution 3:

Here's a little script that might help you, doubledecode.py -- https://gist.github.com/1282752

Post a Comment for "Double-decoding Unicode In Python"