Official ObjectGraph Blog
Saturday, November 18, 2006
Beautiful Soup
Python's HTML parsers, in both the HTMLParser and htmllib modules, are flexible: you can provide your own implementations for handling start tags, end tags and data elements. But they have limitations. For example, if I wanted to preserve the input formatting of the HTML but change just a few tags, it would be hard to do.
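To make the handler style concrete, here is a minimal sketch of a parser subclass that collects links. It assumes Python 3, where the parser lives in html.parser (the Python 2 modules mentioned above were HTMLParser and htmllib). LinkCollector and collect_links are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser  # Python 3 module name (assumption, see above)


class LinkCollector(HTMLParser):
    """Hypothetical helper: records (href, text) for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []     # collected (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a link
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html):
    """Feed a complete HTML string to the parser and return the collected links."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<p>See <a href="/world">World</a> and <a href="http://example.com/">Example</a>.</p>'
for href, text in collect_links(sample):
    print(href, text)
```

The parser only hands you a stream of events. To write the page back out with a few tags changed, you would have to rebuild every tag yourself from those events, which is the limitation described above.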
I found a better solution.
Going by my first impressions, Beautiful Soup is a good HTML parser. I tried it on many public sites, such as cnn.com and news.com, and was able to parse each one into a tree and access the elements very easily.
It provides easy-to-use functions that search the entire tree and return references to the matching elements.
Let's say you want to get all hyperlink (a) tags. The code would be as simple as this:
from BeautifulSoup import BeautifulSoup
import urllib2

# Fetch the page and parse it into a tree
data = urllib2.urlopen("http://www.cnn.com")
soup = BeautifulSoup(data.read())

# Collect every <a> tag and print each one
resultset = soup.findAll("a")
for i in range(len(resultset)):
    print resultset[i]
Now say you want to make all the links absolute instead of relative. A simple function that takes the result set does the trick:
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
import urllib2

def relativetoabsolute(resultset, tag, url):
    # Rewrite the given attribute on each element in place
    for i in range(len(resultset)):
        try:
            link = str(resultset[i][tag])
            if not link.lower().startswith("http"):
                resultset[i][tag] = urljoin(url, link)
        except KeyError:
            # This element does not have the attribute; skip it
            pass

data = urllib2.urlopen("http://www.cnn.com")
soup = BeautifulSoup(data.read())
resultset = soup.findAll("a")
relativetoabsolute(resultset, 'href', 'http://www.cnn.com')
print soup
The output HTML would have all absolute URLs while preserving the input formatting. The cool thing is that you can prettify the output with just:
print soup.prettify()
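The link rewriting above depends on how urljoin resolves a link against a base URL, so here is a small standalone sketch of that step. It assumes Python 3, where urljoin is imported from urllib.parse (in Python 2 it came from urlparse). make_absolute and the example links are made up for illustration.

```python
from urllib.parse import urljoin  # Python 3 location (assumption, see above)


def make_absolute(base_url, links):
    """Hypothetical helper: resolve every link against base_url."""
    return [urljoin(base_url, link) for link in links]


base = "http://www.cnn.com/world/index.html"
examples = [
    "/tech",                       # path relative to the site root
    "europe.html",                 # path relative to the current directory
    "//cdn.example.com/app.js",    # scheme-relative link
    "http://example.com/page",     # already absolute
    "mailto:someone@example.com",  # different scheme, not a web link
]

for original, resolved in zip(examples, make_absolute(base, examples)):
    print(original, "->", resolved)
```

Because urljoin returns links that are already absolute (or that use another scheme such as mailto) unchanged, the startswith("http") check in the function above is a shortcut rather than a requirement.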
posted by gavi at 8:52 PM
2 Comments:
You can make your code more readable and abstract like so:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
site = urlopen("http://www.host.com")
soup = BeautifulSoup(site)
for i in soup('element'):
    print i
As you can see, this code is a lot more abstract.
By the way, Python's "for" loop is a bit different from the one in most other languages (though you can use this method in most other languages too). If you can iterate over an object, you don't have to use the range(len(obj)) method; loop over the object itself instead. The loop variable is set to the next element of the object on each iteration.
By Anonymous, at 6:38 AM
Thank you for the suggestion. I was exploring Python at that time, and looking back now, the code looks awful :-)
By gavi, at 7:24 PM