Why Does Python Write One Wrongly Encoded Line Every Two Lines?
Solution 1:
Your corrupted data is UTF-16, using big-endian byte order:
>>> line = '\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'
>>> line.decode('utf-16-be')
u'\t as SoldDebitor,\r\n'
but whatever is reading your file back in is interpreting the data as UTF-16 in little-endian byte order instead:
>>> print line.decode('utf-16-le')
ऀ 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀ
That's most likely because you didn't include a BOM at the start of the file, or you mangled the input data.
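If you are not sure whether the file starts with a BOM, you can peek at its first two bytes and compare them against the BOM constants from the codecs module. A minimal sketch, assuming the same 'input' filename used further below:

import codecs

# Read the first two raw bytes of the file (filename is just an example)
with open('input', 'rb') as f:
    start = f.read(2)

if start == codecs.BOM_UTF16_LE:
    print('little-endian BOM present')
elif start == codecs.BOM_UTF16_BE:
    print('big-endian BOM present')
else:
    print('no BOM; the byte order has to be known some other way')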
You really should not be reading UTF-16 data in text mode without decoding it: newlines encoded as two bytes are almost guaranteed to be mangled, leading to off-by-one byte-order errors, which in turn can mangle every other line (or nearly every other line).
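As a minimal sketch of how that happens (using made-up sample text): splitting UTF-16 bytes on the single '\n' byte, which is effectively what reading the file line by line without decoding does, leaves a stray '\x00' at the front of every following chunk, so its byte pairs no longer line up:

data = u'first\r\nsecond\r\n'.encode('utf-16-le')  # sample text, no BOM

# Each '\n' byte is followed by a '\x00' that ends up at the start of the
# next chunk, shifting that chunk's byte pairs by one
for chunk in data.split(b'\n'):
    print(repr(chunk))

The second chunk now starts with '\x00' and has an odd number of bytes, so any attempt to decode it as little-endian UTF-16 either fails or produces garbage like the output shown above.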
Use io.open() to read Unicode data instead:
import io

with io.open('input', 'r', encoding='utf16') as infh:
    string = infh.read()

# Do stuff

with io.open('output', 'w+', encoding='utf16') as outfh:
    outfh.write(string)
The generic utf16 codec is used here because it appears your input file already has a UTF-16 BOM; that codec detects the byte order from the BOM when reading, and writes a BOM of its own when writing.
This does mean the rest of your code needs to be adjusted to handle Unicode strings instead of byte strings as well.
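In Python 2, for example, the io layer enforces this by raising TypeError when a byte string is written to a file opened in text mode, which makes leftover str usage easy to spot (the filename below is only a placeholder):

import io

with io.open('output', 'w', encoding='utf16') as outfh:
    outfh.write(u'a unicode string is written just fine\n')
    # outfh.write('a plain byte string here would raise TypeError')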