
Why Does Python Write One Wrongly Encoded Line Every Two Lines?

I am trying to dump the contents of a table column from SQL Server 2K into text files, which I then want to process with Python and write out as new text files. My problem is that I can't get the encoding right: every second line in the output comes out wrongly encoded.

Solution 1:

Your corrupted data is UTF-16, using big-endian byte order:

>>> line = '\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'
>>> line.decode('utf-16-be')
u'\t                 as SoldDebitor,\r\n'

but whatever is reading your file again is interpreting the data as UTF-16 in little-endian byte order instead:

>>> print line.decode('utf-16-le')
ऀ                 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀ਀

That's most likely because you didn't include a BOM at the start of the file, or you mangled the input data.

You really should not be reading UTF-16 data in text mode without decoding it: newlines encoded as two bytes are almost guaranteed to be mangled, leading to off-by-one byte-order errors, which in turn can leave every other line (or almost every other line) mangled.
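To see why, here is a minimal sketch (not from the original answer) of what happens when UTF-16-LE bytes are split on raw newline bytes, the way text-mode reading does: each `\r` and `\n` carries a trailing `\x00`, so the split leaves a stray `\x00` at the start of the next chunk and shifts the byte order for everything after it.

data = u'a\r\nb\r\n'.encode('utf-16-le')
# b'a\x00\r\x00\n\x00b\x00\r\x00\n\x00'
for chunk in data.splitlines(True):
    # The stray '\x00' at the start of every other chunk is the off-by-one
    # byte-order shift that produces the garbled lines.
    print(repr(chunk))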

Use io.open() to read unicode data instead:

import io

# Decode the UTF-16 input; the BOM tells the codec which byte order to use
with io.open('input', 'r', encoding='utf16') as infh:
    string = infh.read()

# Do stuff

# Write the result back out as UTF-16; the codec writes a BOM for you
with io.open('output', 'w+', encoding='utf16') as outfh:
    outfh.write(string)

because it appears your input file already has a UTF-16 BOM.
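If you want to confirm which BOM the file starts with, you can peek at the first two bytes; a small sketch, assuming the same 'input' file name as above:

import codecs

with open('input', 'rb') as fh:
    start = fh.read(2)
# b'\xff\xfe' means UTF-16-LE, b'\xfe\xff' means UTF-16-BE
print(start == codecs.BOM_UTF16_LE, start == codecs.BOM_UTF16_BE)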

This does mean the rest of your code needs to be adjusted to handle Unicode strings instead of byte strings as well.
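For instance, any matching or splitting that previously worked on byte strings would now use Unicode literals; a hedged illustration, reusing the string variable from the io.open() snippet above and the SoldDebitor column from the sample data:

# Hypothetical illustration: compare against unicode literals (u'...')
# once the file has been decoded.
for line in string.splitlines():
    if line.strip().endswith(u'SoldDebitor,'):
        print(repr(line))  # e.g. u'\t                 as SoldDebitor,'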
