Problems Extracting the XML from a Word Document in French with Python: Illegal Characters Generated
Solution 1:
The problem is that you are accidentally changing the encoding of word/document.xml in template2.docx. word/document.xml (from template.docx) is initially encoded as UTF-8 (the default encoding for XML documents), which is why you decode it as UTF-8 when reading:
xmlString = zip.read("word/document.xml").decode("utf-8")
However, when you copy it for template2.docx you are changing the encoding to CP-1252. According to the documentation for open(file, "w"):
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
You indicated that calling locale.getpreferredencoding(False) gives you cp1252, so that is the encoding in which word/document.xml is being written.
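To check which encoding a bare open(file, "w") would fall back to on a given machine, something like the following sketch can help. The variable names are illustrative only, and the result depends on the platform, locale settings and Python version, so no particular output is implied.

```python
import codecs
import locale

# Encoding that open(file, "w") falls back to when no encoding is given
# (per the documentation quoted above).
default_encoding = locale.getpreferredencoding(False)

# Normalise spellings such as "UTF-8" and "utf8" to one canonical codec name.
canonical = codecs.lookup(default_encoding).name
is_utf8_default = canonical == "utf-8"

print(default_encoding, is_utf8_default)
```

If is_utf8_default is False, any text file written without an explicit encoding argument may not be readable as UTF-8.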
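Since you did not explicitly add <?xml version="1.0" encoding="cp1252"?> to the beginning of word/document.xml, Word (or any other XML reader) will read it as UTF-8 instead of CP-1252, which is what gives you the illegal XML character error. An accented character such as é is the single byte 0xE9 in CP-1252, and that byte on its own is not a valid UTF-8 sequence.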
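The mismatch can be reproduced without Word at all. The snippet below is a minimal sketch: the sample string and the helper decodes_as_utf8 are made up for illustration and are not part of the original code.

```python
# A fragment with French accented characters, as might appear in document.xml.
text = "<w:t>Résumé</w:t>"

cp1252_bytes = text.encode("cp1252")  # what a cp1252 default would write to disk
utf8_bytes = text.encode("utf-8")     # what a UTF-8 XML reader expects


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(utf8_bytes))    # True
print(decodes_as_utf8(cp1252_bytes))  # False
```

Plain ASCII text encodes identically in both encodings, which is why the problem only shows up once the document contains accented characters.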
So you want to specify the encoding as UTF-8 when writing by using the encoding argument to open():
with open(os.path.join(tmpDir, "word/document.xml"), "w", encoding="UTF-8") as f:
    f.write(xmlString)
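Putting the read and the write together, a round trip might look like the sketch below. The function name extract_document_xml and the stand-in archive are assumptions for demonstration; the archive contains only word/document.xml, whereas a real .docx has more parts.

```python
import os
import tempfile
import zipfile


def extract_document_xml(docx_path: str, tmp_dir: str) -> str:
    """Copy word/document.xml out of a .docx archive, keeping it UTF-8."""
    with zipfile.ZipFile(docx_path) as archive:
        xml_string = archive.read("word/document.xml").decode("utf-8")

    out_path = os.path.join(tmp_dir, "word", "document.xml")
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    # Explicit encoding: do not rely on the platform default.
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(xml_string)
    return out_path


# Stand-in archive for demonstration only, not a valid Word document.
with tempfile.TemporaryDirectory() as work_dir:
    fake_docx = os.path.join(work_dir, "template.docx")
    sample_xml = '<?xml version="1.0" encoding="UTF-8"?><doc>Été à Paris</doc>'
    with zipfile.ZipFile(fake_docx, "w") as archive:
        archive.writestr("word/document.xml", sample_xml.encode("utf-8"))

    written = extract_document_xml(fake_docx, os.path.join(work_dir, "tmp"))
    with open(written, "rb") as f:
        round_tripped = f.read()
```

Reading the result back in binary mode makes it possible to confirm that the bytes on disk are the same UTF-8 bytes that went into the archive.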