python parsestring / silently skips entities
The Python xml.dom.minidom parseString silently skips over unknown entities.
The only entities it does know are &lt;, &gt;, &amp;, &apos; and &quot;, and of course the numeric entities &#nn; and &#xhh;.
That’s obvious, because those are the only ones defined in the XML 1.0 spec.
However, if you’re parsing XHTML documents, it’s not nice that the entity references to special characters silently get dropped.
Other people have stumbled on the same issue, like in "parsing xml containing &entities; with minidom" and "Problem with minidom and special chars in HTML".
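To make the problem concrete, here is a minimal sketch of how the dropping can be reproduced. It assumes Python 3 and the XHTML 1.0 Strict DOCTYPE; the helper name body_text is made up for this illustration. As far as I can tell, the silent skipping only happens when the document carries a DOCTYPE that points at an external DTD; without any DOCTYPE the parser complains about the undefined entity instead.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Assumption: an XHTML 1.0 Strict DOCTYPE. minidom does not fetch this DTD,
# so the parser never learns what &agrave; stands for.
XHTML_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)
MARKUP = "<html><body>Voil&agrave;!</body></html>"


def body_text(source):
    """Hypothetical helper: return the text directly inside <body>."""
    doc = parseString(source)
    body = doc.getElementsByTagName("body")[0]
    return "".join(
        node.data for node in body.childNodes if node.nodeType == node.TEXT_NODE
    )


# With the DOCTYPE present, the unknown entity is skipped without an error.
dropped = body_text(XHTML_DOCTYPE + MARKUP)
print(repr(dropped))

# Without a DOCTYPE, the same markup is rejected as not well-formed.
try:
    body_text(MARKUP)
    raised = False
except ExpatError as exc:
    raised = True
    print("no DOCTYPE:", exc)
```

So the choice is between a parse error and text with the special characters missing, and neither is what you want for XHTML.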
The Python minidom documentation for the parse function states that “[the] function will change the document handler of the parser and activate namespace support; other parser configuration (like setting an entity resolver) must have been done in advance.”
Ah! Something about entities, but no example or further explanation.
So, how do I tell the parseString function what the defined entities are?
That’s where minidom_xhtml comes in. The parseStringXHTML function as defined therein handles adding all the XHTML entities you need into the DOCTYPE declaration.
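To illustrate the general idea of putting entity definitions into the DOCTYPE, here is a rough sketch. It is not the code from minidom_xhtml (which ships the xhtml*.ent files); it assumes Python 3, takes the entity table from the standard html.entities module (htmlentitydefs in Python 2), and the names build_doctype and parse_with_html_entities are invented for this example. It also assumes the input has no XML declaration or DOCTYPE of its own.

```python
from html.entities import name2codepoint
from xml.dom.minidom import parseString

# The five entities predefined by XML 1.0 must not be naively redeclared.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}


def build_doctype(root="html"):
    """Hypothetical helper: DOCTYPE whose internal subset declares the
    HTML 4 named entities as numeric character references."""
    decls = "".join(
        '<!ENTITY %s "&#%d;">' % (name, codepoint)
        for name, codepoint in sorted(name2codepoint.items())
        if name not in XML_PREDEFINED
    )
    return "<!DOCTYPE %s [%s]>" % (root, decls)


def parse_with_html_entities(markup):
    """Hypothetical helper: parse markup that has no DOCTYPE or XML
    declaration of its own, with the entity declarations prepended."""
    return parseString(build_doctype() + markup)


doc = parse_with_html_entities("<html><body>Voil&agrave;!</body></html>")
body = doc.getElementsByTagName("body")[0]
text = "".join(node.data for node in body.childNodes)
print(text)
```

Because the declarations live in the internal subset, the parser can expand the references itself and nothing has to be downloaded.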
Download as a package (includes the xhtml*.ent files): minidom_xhtml-1.tar.gz (or view the code)
Example usage:
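from minidom_xhtml import parseStringXHTML
# The named entity &agrave; is resolved instead of being dropped
doc = parseStringXHTML('<html><body>Voil&agrave;!</body></html>')
body = doc.getElementsByTagName('body')[0]
# Python 2 print statement; encode the unicode text before writing it out
print body.firstChild.wholeText.encode('utf-8')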