Utf-8 Coding In Python

March 31, 2024

I have a UTF-8 character whose bytes are written as hex with '_' in between, e.g., '_ea_b4_80'. I'm trying to convert it into the actual character using the replace method, but I can't get the correct encoding.

Solution 1:

\x is only meaningful in string literals; you can't use replace to add it.
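To see why, here is a small sketch (the variable names escaped and literal are only illustrative). Replacing '_' with a backslash and an x produces ordinary characters, whereas the same escape typed in a bytes literal is interpreted by the parser when the source is read:

```python
r = '_ea_b4_80'

# replace() inserts a literal backslash followed by the letter x;
# it does not turn the text into bytes.
escaped = r.replace('_', '\\x')
print(escaped)        # \xea\xb4\x80
print(len(escaped))   # 12 characters, not 3 bytes

# Written directly in a bytes literal, each \xNN escape becomes one byte.
literal = b'\xea\xb4\x80'
print(len(literal))   # 3
```

The escape is handled once, when the literal is parsed, so no amount of string manipulation at runtime will trigger it.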

To get your desired result, convert to bytes, then decode:

import binascii

r = '_ea_b4_80'

rhexonly = r.replace('_', '')          # Returns 'eab480'
rbytes = binascii.unhexlify(rhexonly)  # Returns b'\xea\xb4\x80'
rtext = rbytes.decode('utf-8')         # Returns '관' (unicode on Py2, str on Py3)
print(rtext)

which should get you 관 as you desire.

If you're using modern Py3, you can avoid the import by using the bytes.fromhex class method in place of binascii.unhexlify. This assumes r is in fact a str: bytes.fromhex, unlike binascii.unhexlify, only takes str inputs, not bytes inputs.

rbytes = bytes.fromhex(rhexonly)  # Returns b'\xea\xb4\x80'
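If you need this in more than one place, the steps can be wrapped in a small helper. This is only a sketch for Python 3: the name decode_underscore_hex is made up for illustration, and it assumes the input is a str containing nothing but '_' separators and pairs of hex digits.

```python
def decode_underscore_hex(encoded):
    """Decode text such as '_ea_b4_80' into the string its UTF-8 bytes spell."""
    hex_only = encoded.replace('_', '')   # '_ea_b4_80' -> 'eab480'
    raw = bytes.fromhex(hex_only)         # 'eab480'    -> b'\xea\xb4\x80'
    return raw.decode('utf-8')            # bytes       -> str


print(decode_underscore_hex('_ea_b4_80'))           # 관
print(decode_underscore_hex('_ed_95_9c_ea_b5_ad'))  # 한국
```

Input that is not valid hex, or bytes that are not valid UTF-8, raise ValueError (UnicodeDecodeError is a subclass of it), so a caller can catch that one exception type for both failure modes.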
