
Why Does Python Write One Wrongly Encoded Line Every Two Lines?

I am trying to dump the contents of a table column from SQL Server 2K into text files, which I then want to process with Python and write out as new text files. My problem is that I can't get the encoding right: every second line in the output comes out wrongly encoded.

Solution 1:

Your corrupted data is UTF-16, using big-endian byte order:

>>> line = '\x00\t\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00a\x00s\x00 \x00S\x00o\x00l\x00d\x00D\x00e\x00b\x00i\x00t\x00o\x00r\x00,\x00\r\x00\n'
>>> line.decode('utf-16-be')
u'\t                 as SoldDebitor,\r\n'

but whatever is reading your file again is interpreting the data as UTF-16 in little-endian byte order instead:

>>> print line.decode('utf-16-le')
ऀ                 愀猀 匀漀氀搀䐀攀戀椀琀漀爀Ⰰഀ਀

That's most likely because you didn't include a BOM at the start of the file, or you mangled the input data.

You really should not be reading UTF-16 data in text mode without decoding it: newlines encoded as two bytes are almost guaranteed to be mangled, leading to off-by-one byte-order errors, which in turn can leave every other line (or almost every other line) mangled.
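To see why, here is a minimal sketch (not from the original answer) of what happens when UTF-16-LE bytes are split on raw newline bytes, the way text-mode reading does: each `\r` and `\n` carries a trailing `\x00`, so the split leaves a stray `\x00` at the start of the next chunk and shifts the byte order for everything after it.

data = u'a\r\nb\r\n'.encode('utf-16-le')
# b'a\x00\r\x00\n\x00b\x00\r\x00\n\x00'
for chunk in data.splitlines(True):
    # The stray '\x00' at the start of every other chunk is the off-by-one
    # byte-order shift that produces the garbled lines.
    print(repr(chunk))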

Use io.open() to read unicode data instead:

import io

# Decode the UTF-16 input; the BOM tells the codec which byte order to use
with io.open('input', 'r', encoding='utf16') as infh:
    string = infh.read()

# Do stuff

# Write the result back out as UTF-16; the codec writes a BOM for you
with io.open('output', 'w+', encoding='utf16') as outfh:
    outfh.write(string)

because it appears your input file already has a UTF-16 BOM.
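If you want to confirm which BOM the file starts with, you can peek at the first two bytes; a small sketch, assuming the same 'input' file name as above:

import codecs

with open('input', 'rb') as fh:
    start = fh.read(2)
# b'\xff\xfe' means UTF-16-LE, b'\xfe\xff' means UTF-16-BE
print(start == codecs.BOM_UTF16_LE, start == codecs.BOM_UTF16_BE)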

This does mean the rest of your code needs to be adjusted to handle Unicode strings instead of byte strings as well.
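For instance, any matching or splitting that previously worked on byte strings would now use Unicode literals; a hedged illustration, reusing the string variable from the io.open() snippet above and the SoldDebitor column from the sample data:

# Hypothetical illustration: compare against unicode literals (u'...')
# once the file has been decoded.
for line in string.splitlines():
    if line.strip().endswith(u'SoldDebitor,'):
        print(repr(line))  # e.g. u'\t                 as SoldDebitor,'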
