Official ObjectGraph Blog

Saturday, November 18, 2006

Beautiful Soup
The Python HTML parsers in both HTMLParser and htmllib are flexible enough to let you provide your own handlers for start tags, end tags and data, but they have limitations. For example, if I wanted to preserve the input formatting of the HTML but change just a few tags, it would be hard to do.
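
To make that limitation concrete, here is a minimal sketch of the callback style. The LinkCollector class and the sample markup are made up for illustration, and the import falls back so the snippet runs on either Python 2 or Python 3. Pulling data out this way is easy, but the parser hands you a stream of events rather than a tree, so writing the page back out with only a few tags changed means re-emitting everything yourself.

```python
# Sketch only: LinkCollector and the sample markup are hypothetical.
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2, as used in this post


class LinkCollector(HTMLParser):
    """Collects href values by overriding the start-tag handler."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


parser = LinkCollector()
parser.feed('<p>See <a href="/world">World</a> and '
            '<a href="http://example.com/">this</a>.</p>')
parser.close()
print(parser.links)
```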

I found a better solution.

From my initial impressions, Beautiful Soup is a good HTML parser. I tried it on many public sites, such as cnn.com and news.com, and was able to parse each one into a tree and access the elements very easily.

It provides easy-to-use functions that search the entire tree and return references to the matching elements, so any change made through them is reflected in the tree.

Let's say you want to get all hyperlink (a) tags. The code is as simple as this:

from BeautifulSoup import BeautifulSoup
import urllib2

# Fetch the page and parse it into a tree
data = urllib2.urlopen("http://www.cnn.com")
soup = BeautifulSoup(data.read())

# Find every <a> tag and print each one
resultset = soup.findAll("a")
for i in range(len(resultset)):
    print resultset[i]

Now say you want to make all the links absolute instead of relative. A simple function that takes the result set does the trick:


from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
import urllib2

def relativetoabsolute(resultset, tag, url):
    # Rewrite the given attribute of every tag to an absolute URL
    for i in range(len(resultset)):
        try:
            link = str(resultset[i][tag])
            if not link.lower().startswith("http"):
                resultset[i][tag] = urljoin(url, link)
        except Exception:
            # Skip tags that cannot be rewritten, e.g. no such attribute
            pass

data = urllib2.urlopen("http://www.cnn.com")
soup = BeautifulSoup(data.read())

resultset = soup.findAll("a")
relativetoabsolute(resultset, 'href', 'http://www.cnn.com')
print soup


The output HTML will have all absolute URLs while preserving the input formatting. The cool thing is that you can prettify the output with just:
print soup.prettify()
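
The heavy lifting in relativetoabsolute is done by urljoin, which lives in the urlparse module on Python 2 (urllib.parse on Python 3). Here is a small sketch of how it resolves different kinds of links. The base URL and the sample links are made up for illustration, and the import falls back so the snippet runs on either version.

```python
# Sketch only: the base URL and the sample links are hypothetical.
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2, as used in this post

base = "http://www.cnn.com/world/index.html"

samples = ["/tech/", "story.html", "../sports/", "http://example.com/a"]
resolved = [urljoin(base, link) for link in samples]

for link, absolute in zip(samples, resolved):
    # A root-relative link replaces the whole path, a bare name replaces
    # the last path segment, and an absolute URL is returned unchanged.
    print("%-22s -> %s" % (link, absolute))
```

Since urljoin already leaves absolute URLs untouched, the startswith("http") check in the function is only a shortcut. Note also that relative links resolve against the URL you pass in, so passing the address of the page itself rather than the site root matters for links like story.html.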

posted by gavi at 8:52 PM

2 Comments:

  • You can make your code more readable and abstract like so:

    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup

    site = urlopen("http://www.host.com")
    soup = BeautifulSoup(site)

    for i in soup('element'):
        print i

    As you can see, this code is a lot more abstract.

    By the way, Python's "for" loop is a bit different from the loops in most other languages (though many other languages offer something similar). If you can iterate over an object, you don't have to use the range(len(obj)) method; loop over the object itself instead. On each pass, the loop variable is bound to the next element of the object.

    By Anonymous, at 6:38 AM

  • Thank you for the suggestion. I was exploring Python at that time, and looking back now, the code looks awful :-)

    By gavi, at 7:24 PM
