How to load UTF8 data with python minidom ?

Wednesday, 22 August 2012
|
Écrit par
Grégory Soutadé

For the dynastie project, I need to load data encoded in UTF-8 with Python minidom XML parser. But when I wrote node.toxml('utf-8') to display the XML tree, I get this error :

UnicodeDecodeError at /generate/1

'ascii' codec can't decode byte 0xc2 in position 187: ordinal not in range(128)

In facts Python thinks that all data in XML tree are in ASCII and try to encode it into UTF-8 (or anything else you supplied). The solution is to use your own writer that will convert all non utf-8 strings in unicode string which can be then re-encoded in every format (like utf-8). This doesn't appears in Python 3 because, in Python 3, all strings are already in unicode. Add the following class to your code :

class UnicodeWriter(codecs.StreamWriter): encode = codecs.utf_8_encode def __init__(self): self.value = u'' def write(self, object): if not type(object) == unicode: self.value = self.value + unicode(object, 'utf-8') else: self.value = self.value + object return self.value def reset(self): self.value = u'' def getvalue(self): return self.value

And our node.toxml('utf-8') becomes :

writer = UnicodeWriter() node.writexml(writer) writer.getvalue().encode('utf-8')
Auteur :


e-mail* :


Le commentaire :




* Seulement pour être notifié d'une réponse à cet article
* Only for email notification