Python - XML parsing using Beautiful Soup

Posted by Colonel Patch on December 14, 2010

An example of using the Beautiful Soup HTML/XML parser.

Beautiful Soup (BS) is an HTML/XML parser which, despite its trivial name, is very useful for searching web content and extracting information. Written in Python, BS intelligently makes sense of even badly written web pages. Why not use Python's existing standard libraries, I hear you ask? Well, you can, but all too quickly you will get a "malformed tag" error, and then you are faced with handling the multitude of authoring errors that the web has to offer (and there are an awful lot of poorly written pages out there!).
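To make the "malformed tag" problem concrete, here is a minimal sketch. It is written for Python 3 and uses the standard library's strict XML parser (xml.etree.ElementTree) as a stand-in for any parser that refuses sloppy markup; that choice, the sample markup and the try_strict_parse helper are all my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: typical sloppy markup with a <b> tag that is never
# closed and an unquoted attribute value.
SLOPPY_HTML = "<html><body><p>Hello <b>world</p><img src=cover.jpg></body></html>"


def try_strict_parse(markup):
    """Return None if the strict parser accepts the markup, else the error text."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        return str(err)
    return None


error = try_strict_parse(SLOPPY_HTML)
print("strict parser said:", error)
```

A strict parser gives up at the first authoring mistake, which is exactly the situation a forgiving parser such as BS is meant to avoid.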

Using BS couldn't be easier: simply import the BS library and point BS at a web page. From there BS builds the DOM of the page and lets you pick out or edit the bits you need.
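As a rough sketch of "pick out or edit", the snippet below parses a small inline page, reads one attribute and changes another. Note the assumptions: it targets Python 3 and the newer bs4 package (the successor to the BeautifulSoup module used in this post), and the markup is made up so that no network access is needed.

```python
# Assumption: the bs4 package on Python 3, not the BS3 import used in this post.
from bs4 import BeautifulSoup

# Hypothetical, deliberately sloppy page: the <b> tag is never closed.
HTML = (
    '<html><body><p id="intro">Hello <b>world</p>'
    '<img src="cover.jpg" width="64"></body></html>'
)

# Build the tree using the parser backend that ships with Python.
soup = BeautifulSoup(HTML, "html.parser")

# Pick out a bit of the page: the first <img> tag and its src attribute.
img = soup.find("img")
src = img["src"]

# Edit a bit of the page in place: attributes behave like dictionary entries.
img["width"] = "128"

print(src)
print(img)
```

The unclosed tag does not stop the parse; BS still hands back a tree that can be searched and modified.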

Probably the most useful function of BS is to find stuff within a page, as the following example will demonstrate.

 
import urllib2   # use urllib2 to grab a page

from BeautifulSoup import BeautifulSoup  # import the Beautiful Soup parser

page = urllib2.urlopen("http://www.last.fm/music/Mike+Oldfield/Discovery")  # a random web page!
soup = BeautifulSoup(page)   # point BS at the web page

# find all img tags which are 64*64 (this is a pretty standard size for a web album cover)
result = soup.findAll('img', attrs={"width": "64", "height": "64"})

# BS returns a list of hits that match the criteria
if len(result) == 1:
    print "unique solution found! :)"  # in this example we ideally want only one result

# do stuff with the results
for tag in result:
    # in Beautiful Soup 3, tag.attrs is a list of (name, value) pairs
    for name, value in tag.attrs:
        if name in ("src", "width", "height"):
            print name + "=" + value
 
 

And that is it!

Copy and paste the above code into your Python script and run it.

I hope you can see that BS saves an awful lot of exception-handling code and headaches.
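For a sense of the coding BS saves, here is a rough standard-library-only version of the same 64*64 image search. It is a sketch under my own assumptions: Python 3's html.parser module, a hypothetical CoverFinder class, and made-up sample markup in place of a downloaded page.

```python
from html.parser import HTMLParser


class CoverFinder(HTMLParser):
    """Collect the attributes of every 64x64 <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attr_map = dict(attrs)
        if (tag == "img"
                and attr_map.get("width") == "64"
                and attr_map.get("height") == "64"):
            self.hits.append(attr_map)


# Made-up sample markup standing in for a downloaded page.
SAMPLE = (
    '<div><img src="logo.png" width="200" height="50">'
    '<img src="cover.jpg" width="64" height="64"></div>'
)

finder = CoverFinder()
finder.feed(SAMPLE)
for hit in finder.hits:
    print("src=" + hit["src"])
```

This handles only the one query: every new kind of search means another hand-written handler, whereas with BS it is a single findAll call with different arguments.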
