BeautifulSoup - convert/decode HTML entity codes into regular python string

Submitted by jeff on Tue, 07/27/2010 - 18:43.

Dear BeautifulSoup users,

Use convertEntities=BeautifulSoup.HTML_ENTITIES to decode (convert) HTML entity codes into regular Python strings.

Example:

‘&gt;’ converts to ‘>’
‘&amp;’ decodes to ‘&’
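To check the two mappings above without BeautifulSoup, here is a minimal sketch using html.unescape from the Python 3 standard library. This is a stand-in, not the convertEntities option itself, and decode_entities is a hypothetical helper name used only for illustration.

```python
import html


def decode_entities(text):
    """Hypothetical helper: decode HTML entity codes with the standard library."""
    return html.unescape(text)


# The two entity codes from the example above.
print(decode_entities("&gt;"))
print(decode_entities("&amp;"))
```

The helper only wraps one standard-library call, so it can be swapped for the BeautifulSoup option wherever the full parser is already in use.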

Background

I am working with an XML feed that has HTML embedded in it. By default, BeautifulSoup will encode characters into SGML (or XML or HTML) entities.

This is the XML message I receive.

<message><strong>Hello</strong> world</message>

Since BeautifulSoup automatically encodes the contents of the message to be safe for XML, the string I got was different from the raw markup I expected.

What I wanted

<strong>Hello</strong> world

Instead I got

&lt;strong&gt;Hello&lt;/strong&gt; world

Using convertEntities resolves the problem.

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import style
soup = BeautifulSoup(content, convertEntities=BeautifulSoup.HTML_ENTITIES)
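For comparison, the same before-and-after can be reproduced on the strings from this post with only the standard library. This is a rough sketch that assumes the text has already been pulled out of the message. It uses html.unescape as a substitute and does not call BeautifulSoup. The variable names received and wanted are mine.

```python
import html

# The string I got back (entity-encoded), copied from "Instead I got" above.
received = "&lt;strong&gt;Hello&lt;/strong&gt; world"

# Decode the entity codes back into regular characters.
wanted = html.unescape(received)

print(wanted)
```

This only decodes a string that is already in hand; the convertEntities option does the decoding while the document is parsed.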

I found the answer on StackOverflow.

It is also available in the BeautifulSoup documentation, but I just didn’t understand their examples.