
Friday, December 13, 2013

Hacking Dropbox

Having spent a lot of time looking for file-syncing tools that don't need a cloud service and that work behind proxies, I decided to write my own Dropbox app to synchronize data across computers.

But Dropbox already does that. The point is to do it using only the free 2GB of space that Dropbox provides. The motivation is that this way the synchronization works behind any network router that blocks torrent traffic but does not block Dropbox (of course, you can also get around that using SSH tunneling).

So I registered a Dropbox developer app and used the Dropbox API to synchronize files using only the free 2GB of storage. The only remaining constraint is that an individual file cannot be larger than 2GB, but that is easily solved by splitting the file with rar. Note that the same could be done with Google Drive, which provides 15GB of free space.
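
For instance, if you would rather not depend on rar, the same splitting can be done in a few lines of Python. This is only a minimal sketch; the chunk size and file names below are placeholders, not values from my setup:

## split a large file into chunks small enough for the free quota
## (placeholder names; adjust the chunk size and paths as needed)
CHUNK = 1900 * 1024 * 1024   # ~1.9 GB per piece

with open('bigfile.iso', 'rb') as src:
    part = 0
    while True:
        data = src.read(CHUNK)
        if not data:
            break
        with open('bigfile.iso.%03d' % part, 'wb') as dst:
            dst.write(data)
        part += 1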

The Dropbox API documentation is very neat. I used the core Dropbox API with full access to user data for this app. The code below authenticates the app so it can access the user's account. Once granted, the access token is saved to a file; when that file is detected on later runs, the saved token is reused.

import dropbox
import email, imaplib, os, time
import urllib2

## sign in to DROPBOX
app_key = '<key>'
app_secret = '<secret>'

flow = dropbox.client.DropboxOAuth2FlowNoRedirect(app_key, app_secret)
app_auth = False  # assume not yet authorized
if os.path.exists('accesstoken'):
    app_auth = True

if not app_auth:
    ## first run: walk the user through the OAuth2 flow
    authorize_url = flow.start()
    print '1. Go to: ' + authorize_url
    print '2. Click "Allow" (you might have to log in first)'
    print '3. Copy the authorization code.'
    code = raw_input("Enter the authorization code here: ").strip()
    access_token, user_id = flow.finish(code)
    ## save the access token so this only has to be done once
    f = open('accesstoken', 'w')
    f.write(access_token)
    f.close()
else:
    ## reuse the token saved on a previous run
    f = open('accesstoken', 'r')
    access_token = f.read()
    f.close()

client = dropbox.client.DropboxClient(access_token)
print 'linked account: ', client.account_info()


The server and client communicate using Gmail. When the server uploads a file, it emails the client the share link for the data. The client continuously monitors its mailbox; as soon as it receives an email from the server, it downloads the data and sends back an acknowledgement mail. The server receives the acknowledgement, deletes the previous file, uploads the next one, and sends its share link to the client again.


## sign in to GMAIL
user = "<server>"
pwd = "<password>"

m = imaplib.IMAP4_SSL("imap.gmail.com")
m.login(user, pwd)
m.list()
m.select("inbox")


Here's how I implemented the rest of it (thanks to these webpages: 1, 2, 3). The script for the client side would be similar; a rough sketch is included at the end of this post.

## import the helper defined below, under a different name so it does not
## clash with the parsed email message
from send_email import mail as send_mail

## main loop
while True:
    ## look for mail addressed to the server's alias
    resp, items = m.search(None, 'TO', '"<server>+python@gmail.com"')
    items = items[0].split()
    print items

    emailid = items[-1]
    resp, data = m.fetch(emailid, "(RFC822)")
    email_body = data[0][1]
    msg = email.message_from_string(email_body)

    if msg['Subject'] == "done":
        ## the client has finished downloading: delete the previous file
        client.file_delete('/' + os.path.basename(uploadfilepath))

        ## ask for another file to upload (or just read from a file list);
        ## the upload directory in Dropbox is the root folder
        uploadfilepath = raw_input("enter (absolute) path to file: ").strip()
        uploadfile = open(uploadfilepath)
        response = client.put_file('/' + os.path.basename(uploadfilepath), uploadfile)
        print "uploaded:", response

        ## create the share link
        sharelink = client.share("/" + os.path.basename(uploadfilepath), False)
        print sharelink['url'] + "   " + sharelink['expires']

        ## mail the share link to the client
        send_mail("client+python@gmail.com", sharelink['url'], "")
        print "email sent"

    time.sleep(25000)


The send_email script imported above is the following:


import smtplib
from email.MIMEMultipart import MIMEMultipart
from email.MIMEBase import MIMEBase
from email.MIMEText import MIMEText
from email import Encoders
import os

gmail_user = "<id>@gmail.com"
gmail_pwd = "<passwd>"

def mail(to, subject, text):
    msg = MIMEMultipart()

    msg['From'] = gmail_user
    msg['To'] = to
    msg['Subject'] = subject

    msg.attach(MIMEText(text))

    ## uncomment to attach a file as well
    #part = MIMEBase('application', 'octet-stream')
    #part.set_payload(open(attach, 'rb').read())
    #Encoders.encode_base64(part)
    #part.add_header('Content-Disposition', 'attachment; filename="%s"' % os.path.basename(attach))
    #msg.attach(part)

    mailServer = smtplib.SMTP("smtp.gmail.com", 587)
    mailServer.ehlo()
    mailServer.starttls()
    mailServer.ehlo()
    mailServer.login(gmail_user, gmail_pwd)
    mailServer.sendmail(gmail_user, to, msg.as_string())
    # Should be mailServer.quit(), but that crashes...
    mailServer.close()
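
The client-side script is not shown here, but following the description above, a rough sketch might look like the one below. The account names, the polling interval, and the trick of appending ?dl=1 to a Dropbox share link to force a direct download are my assumptions, not part of the original code.

## client.py -- rough sketch of the client loop (assumed names; adapt to your accounts)
import email, imaplib, os, time, urllib2
from send_email import mail as send_mail

user = "<client>"
pwd = "<password>"

m = imaplib.IMAP4_SSL("imap.gmail.com")
m.login(user, pwd)
m.select("inbox")

last_seen = None
while True:
    ## look for mail addressed to the client's alias
    resp, items = m.search(None, 'TO', '"client+python@gmail.com"')
    items = items[0].split()
    if items and items[-1] != last_seen:
        last_seen = items[-1]
        resp, data = m.fetch(last_seen, "(RFC822)")
        msg = email.message_from_string(data[0][1])

        ## the server puts the share link in the subject line
        sharelink = msg['Subject'].strip()
        ## appending ?dl=1 usually turns the share page into a direct download (assumption)
        response = urllib2.urlopen(sharelink + "?dl=1")
        ## save under a name derived from the link (placeholder logic; adjust as needed)
        filename = os.path.basename(sharelink.split('?')[0]) or 'downloaded.dat'
        f = open(filename, 'wb')
        f.write(response.read())
        f.close()
        print "downloaded:", filename

        ## acknowledge, so the server deletes this file and uploads the next one
        send_mail("<server>+python@gmail.com", "done", "")
        print "acknowledgement sent"

    time.sleep(600)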

Saturday, March 23, 2013

Web Scraping using Python

Python is a powerful and versatile programming language. Its versatility is increased by the many modules developed for a wide range of applications. It even has a MATLAB-like numerical library, NumPy.

Many times I have needed to download multiple files from a webpage, say wallpapers, and the usual methods don't always work. Here are the "usual" methods:
  1. wget : download webpage content from the Linux terminal
  2. "Download them all" plugins in Firefox and their counterparts in Google Chrome
I have also often extracted the jpg links in a webpage using Python regular expressions (regexp). Recently I needed to download a large database provided on a webpage that would not yield to any of the above methods. The webpage contained a JavaScript form to provide the download link, and it used validation methods that made constructing the download links with a simple regexp useless.

Here is a powerful alternative that avoids the drudgery of manually downloading files, using the Python module Mechanize. Below is the code I developed to download a webpage.

First, this is the webpage that I wanted to download. Note that the download link is provided via a JavaScript form. Also note the onSubmit attribute in the form tag, which is the validation mechanism: if you simply construct the download link from the data in the HTML page, the downloaded file will be corrupted.


The following code first sets the relevant variables to their appropriate values and then uses Mechanize to submit the form and download the file.
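
The original snippet isn't reproduced here; a minimal sketch of that approach with Mechanize might look like this (the URL, form index, and field name are placeholders, not the actual values from the page I was scraping):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)                      # ignore robots.txt
br.addheaders = [('User-agent', 'Mozilla/5.0')]  # look like a normal browser

br.open("http://example.com/download.html")      # placeholder URL
br.select_form(nr=0)                             # select the download form (here, the first form on the page)
br.form['fileid'] = '1234'                       # placeholder field name and value
response = br.submit()                           # mechanize performs the form submission for us

f = open('downloaded_file.dat', 'wb')
f.write(response.read())
f.close()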


Next, to create the list of pages containing the download links, I used the BeautifulSoup and re (regexp) modules to parse the HTML.
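
A small sketch of how that can be done (the index URL and the link pattern below are placeholders):

import re
import urllib2
from bs4 import BeautifulSoup

## fetch the index page and keep every link whose target matches a pattern
page = urllib2.urlopen("http://example.com/index.html").read()   # placeholder URL
soup = BeautifulSoup(page)
pattern = re.compile(r'download.*\.html')                        # placeholder pattern

download_pages = []
for a in soup.find_all('a', href=True):
    if pattern.search(a['href']):
        download_pages.append(a['href'])
print download_pages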

To save the data structures (lists) used to store the download links, pickle is a useful library. The saved data can also be made searchable, as shown here.
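
For example, reusing the download_pages list from the sketch above:

import pickle

## save the list of links for later runs...
f = open('links.pkl', 'wb')
pickle.dump(download_pages, f)
f.close()

## ...and load it back
f = open('links.pkl', 'rb')
download_pages = pickle.load(f)
f.close()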

Mechanize has some major disadvantages that I quickly realized. The biggest is that you cannot download webpages that are loaded dynamically, i.e. in multiple steps using JavaScript: when you first request such a page you do not get the complete page, only HTML plus JavaScript code, and the browser then executes that JavaScript to fetch the rest of the content. Mechanize does not execute the JavaScript, so you may not get the content you want.

Another easy-to-use option is Windmill. Windmill is a powerful library, but it is not well documented and hence can be difficult to use. The limited documentation can be found here.

Another Python module for web scraping is Scrapy. This one I have yet to use.

A few good beginner tutorials for Python:
  1. MIT OCW : Introduction to programming
  2. Python for beginners
Usually, it's faster to just dive in: choose a project and complete it using online documentation and user forums. In this regard, stackoverflow.com is a particularly good place to post your queries.