Saturday, March 23, 2013

Web Scraping using Python

Python is a powerful and versatile programming language, and the wide range of modules developed for it extends its reach even further. It even has NumPy, a module that provides MATLAB-like numerical computing.

Many times I have needed to download multiple files from a webpage, say wallpapers, and found that the usual methods don't work. Here I list the "usual" methods:
  1. wget : download webpage content from the Linux terminal
  2. The DownThemAll add-on in Firefox and its counterparts in Google Chrome
Many times I have tried to extract the jpg links in a webpage using Python regular expressions (the re module). Recently I needed to download a large database provided on a webpage that would not yield to any of the above methods. The webpage used a JavaScript form to provide the download link, and the form's validation mechanism made it useless to construct the download links with simple regular expressions.

Here is a powerful alternative that avoids the drudgery of manually downloading files: the Python module Mechanize. Below is the code I developed to download a webpage.

First, this is the webpage that I wanted to download. Note that the download link is provided by a JavaScript form; also note the onSubmit attribute in the form tag, which is the validation mechanism. So if you simply construct the download link from the data in the HTML page, the downloaded file will be corrupted.


The following code first sets the relevant variables to their appropriate values and then uses mechanize to download the file.
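Since the original code image is no longer available, here is a minimal sketch of the approach. The URL, form index, and output filename are hypothetical placeholders; mechanize must be installed separately (pip install mechanize). The key point is that mechanize submits the form directly to the server, sidestepping the client-side onSubmit validation.

```python
def download_via_form(url, out_path, form_index=0):
    """Open `url`, submit its download form, and save the response body."""
    import mechanize  # third-party module: pip install mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # some sites refuse robots.txt-aware clients
    br.addheaders = [("User-agent", "Mozilla/5.0")]

    br.open(url)
    br.select_form(nr=form_index)  # select the download form by position
    response = br.submit()         # posts the form server-side, so the
                                   # client-side onSubmit check never runs
    with open(out_path, "wb") as f:
        f.write(response.read())

# Example usage (hypothetical URL):
# download_via_form("http://example.com/download.html", "data.bin")
```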


Now, to create the list of pages containing the download links, I used the BeautifulSoup and re (regular expression) modules to parse the HTML.
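The combination looks roughly like the sketch below, using BeautifulSoup's bs4 package and an inline HTML snippet as a stand-in for the real page. Passing a compiled regular expression to find_all matches href attributes by pattern rather than by exact string.

```python
import re

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Stand-in for a downloaded page; a real script would fetch this HTML.
HTML = """
<html><body>
  <a href="page1.html">page 1</a>
  <a href="wall1.jpg">wallpaper 1</a>
  <img src="thumb.png">
  <a href="wall2.jpg">wallpaper 2</a>
</body></html>
"""

def jpg_links(html):
    """Return every <a href> in `html` that points at a .jpg file."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a compiled regexp as the attribute filter
    return [a["href"] for a in soup.find_all("a", href=re.compile(r"\.jpg$"))]

print(jpg_links(HTML))  # ['wall1.jpg', 'wall2.jpg']
```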

To save the data structures (lists) used to store the download links, pickle is a useful library. The saved data can also be made searchable, as shown here.
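A minimal round trip looks like this; the filename and link list are illustrative. Once the list is restored, a plain list comprehension serves as a simple search:

```python
import pickle

links = ["wall1.jpg", "wall2.jpg", "nature/forest.jpg"]

# Save the list of download links to disk...
with open("links.pkl", "wb") as f:
    pickle.dump(links, f)

# ...and load it back later, e.g. in a separate download script.
with open("links.pkl", "rb") as f:
    restored = pickle.load(f)

# A simple substring search over the restored list:
matches = [url for url in restored if "forest" in url]
print(matches)  # ['nature/forest.jpg']
```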

Mechanize has some major disadvantages, as I quickly realized. The biggest is that it cannot handle webpages that are loaded dynamically, i.e. in multiple stages using JavaScript. When you first request such a page you do not get the complete content, only the HTML plus JavaScript code; the browser then executes that JavaScript to download the rest of the page. Mechanize does not execute the JavaScript, so you may not get the content you want.

Another easy-to-use option is windmill. Windmill is a powerful library, but it is not well documented and is therefore difficult to use. The limited documentation can be found here.

Another Python module for web scraping is scrapy, which I have yet to use.

A few good beginner tutorials for python :
  1. MIT OCW : Introduction to programming
  2. Python for beginners
Usually, it's faster to just dive in: choose a project and complete it using online documentation and user forums. In this regard, stackoverflow.com is a particularly good place to post your queries.
