Double-decoding Unicode In Python
I am working against an application that seems keen on returning, what I believe to be, double UTF-8 encoded strings. I send the string u'XüYß' encoded using UTF-8, thus becoming
Solution 1:
What you want is the encoding where Unicode code point X is encoded to the same byte value X. For code points inside 0-255 you have this in the latin-1 encoding:
def double_decode(bstr):
return bstr.decode("utf-8").encode("latin-1").decode("utf-8")
Solution 2:
Don't use this! Use @hop's solution.
My nasty hack: (cringe! but quietly. It's not my fault, it's the server developers' fault)
defdouble_decode_unicode(s, encoding='utf-8'):
return''.join(chr(ord(c)) for c in s.decode(encoding)).decode(encoding)
Then,
>>> double_decode_unicode('X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f')
u'X\xfcY\xdf'>>> print _
XüYß
Solution 3:
Here's a little script that might help you, doubledecode.py -- https://gist.github.com/1282752
Post a Comment for "Double-decoding Unicode In Python"