BeautifulSoup - convert/decode HTML entity codes into regular python string
Submitted by jeff on Tue, 07/27/2010 - 18:43.
Answer from StackOverflow
Dear BeautifulSoup users,
Use convertEntities=BeautifulSoup.HTML_ENTITIES to decode or convert HTML entity codes into regular python strings.
Example:
‘&gt;’ converts to ‘>’
‘&amp;’ decodes to ‘&’.
Background
I am working with an XML feed that has HTML embedded in it. By default, BeautifulSoup keeps characters as SGML (or XML or HTML) entities rather than decoding them.
This is the XML message I receive.
<message>&lt;strong&gt;Hello&lt;/strong&gt; world</message>
Since BeautifulSoup keeps the contents of the message entity-encoded so that they stay safe for XML, the string I got was different from the decoded HTML I expected.
What I wanted
<strong>Hello</strong> world
Instead I got
&lt;strong&gt;Hello&lt;/strong&gt; world
Using convertEntities resolves the problem.
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path
soup = BeautifulSoup(content, convertEntities=BeautifulSoup.HTML_ENTITIES)
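To see what the decoding step does to the message text on its own, here is a minimal sketch using html.unescape from the Python 3 standard library. This is a stand-in for illustration, not the BeautifulSoup API; the helper name decode_entities and the variable escaped are made up for this example.

```python
import html

# Text content of <message> as it arrives in the feed, entities still escaped.
# (Sample value taken from the example message above.)
escaped = "&lt;strong&gt;Hello&lt;/strong&gt; world"


def decode_entities(text):
    """Decode HTML entity codes such as &lt; and &amp; into plain characters."""
    # Hypothetical helper: wraps the standard library call, not BeautifulSoup.
    return html.unescape(text)


decoded = decode_entities(escaped)
print(decoded)  # <strong>Hello</strong> world
```

The result is the same kind of string the convertEntities option is meant to produce: plain characters instead of entity codes.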
This option is also covered in the BeautifulSoup documentation; I just didn’t understand their examples.