python
BeautifulSoup - convert/decode HTML entity codes into regular python string
Dear BeautifulSoup users,
Use convertEntities=BeautifulSoup.HTML_ENTITIES to decode or convert HTML entity codes into regular python strings.
Example:
‘&gt;’ converts to ‘>’
‘&amp;’ decodes to ‘&’.
Background
I am working with an XML feed that has HTML embedded in it. By default, BeautifulSoup encodes characters into SGML (or XML or HTML) entities.
This is the XML message I receive.
<message>&lt;strong&gt;Hello&lt;/strong&gt; world</message>
Since BeautifulSoup automatically encodes the contents of the message to be safe for XML, the string I got back was different from what I expected.
What I wanted
<strong>Hello</strong> world
Instead I got
&lt;strong&gt;Hello&lt;/strong&gt; world
Using convertEntities resolves the problem.
soup = BeautifulSoup(content, convertEntities=BeautifulSoup.HTML_ENTITIES)
Answer from StackOverflow
This is also covered in the BeautifulSoup documentation; I just didn’t understand their examples.
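If you just want to see the decoding itself without BeautifulSoup, here is a minimal sketch that uses only the Python 3 standard library (html.unescape and xml.etree.ElementTree) as a stand-in. It is not the convertEntities option described above, only an illustration of the same entity decoding. The payload string and the decode_entities helper are made up for this example.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical feed payload: HTML markup escaped inside an XML element.
raw_xml = "<message>&lt;strong&gt;Hello&lt;/strong&gt; world</message>"


def decode_entities(text):
    """Turn HTML entity codes such as &gt; and &amp; into plain characters."""
    return html.unescape(text)


# An XML parser decodes the predefined entities while reading element text,
# so the element's text already contains the embedded HTML markup.
message_html = ET.fromstring(raw_xml).text

# Decoding an already-extracted, still-escaped string by hand.
escaped = "&lt;strong&gt;Hello&lt;/strong&gt; world"
decoded = decode_entities(escaped)

print(message_html)
print(decoded)
```

Both routes should end up with the markup I wanted in the first place. Which one fits depends on whether you still have the whole XML document or only the escaped string.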
Django-tagging
Useful examples on how to effectively use the Django-Tagging application (plugin?)
Installation is simple. I believe easy_install django-tagging works.
Django-tagging code and documentation
Introduction to using django-tagging
Use the django-tagging tag cloud
Use django-tagging built-in views
Sanitize, tidy, or cleanup Word HTML using word unmunger
The Word Un-munger Python script converts and cleans up crappy Word HTML into decent HTML.
The Automator script makes it easier to convert and clean the Word HTML of several pages in a batch job.
