python

BeautifulSoup - convert/decode HTML entity codes into regular python string

Dear BeautifulSoup users,

Use convertEntities=BeautifulSoup.HTML_ENTITIES to decode or convert HTML entity codes into regular python strings.

Example:

‘&gt;’ converts to ‘>’
‘&amp;’ decodes to ‘&’.

Background

I am working with an XML feed that has HTML embedded in it. By default, BeautifulSoup encodes characters into SGML (or XML or HTML) entities.

This is the XML message I receive.

<message><strong>Hello</strong> world</message>

Since BeautifulSoup automatically encodes the contents of the message to be safe for XML, the string I got back was different from the raw markup I expected.

What I wanted

<strong>Hello</strong> world

Instead I got

&lt;strong&gt;Hello&lt;/strong&gt; world

Using convertEntities resolves the problem.

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import
soup = BeautifulSoup(content, convertEntities=BeautifulSoup.HTML_ENTITIES)

Answer from StackOverflow

This is also covered in the BeautifulSoup documentation; I just didn’t understand their examples.
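For anyone who only needs the decoding step and not a full parse, here is a minimal sketch using only the Python 3 standard library. It is not the BeautifulSoup approach described above, just an alternative under stated assumptions: the feed payload below is a made-up example in which the embedded HTML arrives entity-escaped, and the variable names are hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical feed payload: HTML that arrives entity-escaped inside an XML element.
raw = "<message>&lt;strong&gt;Hello&lt;/strong&gt; world</message>"

# An XML parser decodes the entities when it hands back the element text.
decoded_from_xml = ET.fromstring(raw).text

# html.unescape decodes entity codes in a string that was already extracted.
escaped = "&lt;strong&gt;Hello&lt;/strong&gt; world"
decoded = html.unescape(escaped)

print(decoded_from_xml)
print(decoded)
```

Both variables end up holding the markup with real angle brackets. The same html.unescape call also handles the single-entity cases from the example at the top, such as ‘&gt;’ and ‘&amp;’.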

Django-tagging

Useful examples on how to effectively use the Django-Tagging application (plugin?)

Installation is simple. I believe easy_install django-tagging works.

Django-tagging code and documentation

Introduction to using django-tagging

Use the django-tagging tag cloud

Use django-tagging built-in views

Sanitize, tidy, or cleanup Word HTML using word unmunger

The Word Un-munger Python script converts crappy Word HTML into clean, decent HTML.

The Automator script makes it easier to convert and clean the Word HTML of several pages in a batch job.