Whenever I use some service over the web, I look for several things. Ease of use and customisability are important factors.
However, the most important thing I consider is vendor lock-in (or rather the lack of it). Let's say I am using a particular mail service (e.g., Gmail). If, someday, I find a better email service, would it be easy for me to switch to it? How easy is it to transfer my data from the old service to the new one?
For services like mail, there are standard protocols for data access, so this is not an issue. However, for the more recent services, like blogging and micro-blogging, the most widely used way of getting at the data is over HTTP as RSS or Atom feeds.
However, not all services provide data as RSS (or XML, or any other parseable form). For example, suppose I keep a list of movies I have watched in some Facebook application, or a list of restaurants I have visited: how do I download that list? If I cannot download it, does that mean I am tied to this application provider forever? What if I have added 200 movies in my original service, then come across another service with a better interface and more features, and I want to switch without losing the data I invested time entering?
In fact, when I recently tried to download all my Twitters, I realized that this feature has been disabled: you can no longer get your old Twitters in XML format.
So what do we do when a service does not provide data as XML and we need to somehow scrape that data and store it?
This is kind of related to my last blog entry.
So I started thinking of ways in which I could download my Twitters. The solution I initially thought of was using Rhino and John Resig's project (mentioned in my previous blog entry). However, I ran into the same parse issues as before, so I had to think of an alternative.
This time I took advantage of the fact that Twitters are short (no more than 140 characters).
The solution I came up with uses a combination of Greasemonkey and PHP on the server side:
Here is the GM script:
If you intend to use this, remember to change the URL that the data is posted to.
// ==UserScript==
// @name Twitter Downloader
// @namespace http://buzypi.in/
// @author Gautham Pai
// @include http://www.twitter.com/*
// @description Post Twitters to a remote site
// ==/UserScript==

function twitterLoader(){
	var timeLine = document.getElementById('timeline');
	var spans = timeLine.getElementsByTagName('span');
	var url = 'http://buzypi.in/twitter.php';
	var twitters = new Array();
	//Collect the text of every Twitter on the page
	for(var i=0;i<spans.length;i++){
		if(spans[i].className != 'entry-title entry-content'){
			continue;
		}
		//encodeURIComponent (rather than escape) keeps characters like '+' intact
		twitters.push(encodeURIComponent(spans[i].innerHTML));
	}
	//Post each Twitter to the server via a <script> include;
	//the last one is flagged so the server can send back the page-change code
	for(var i=0;i<twitters.length;i++){
		var last = 'false';
		if(i == twitters.length - 1)
			last = 'true';
		var scriptElement = document.createElement('script');
		scriptElement.setAttribute('src',url+'?last='+last+'&data='+twitters[i]);
		scriptElement.setAttribute('type','text/javascript');
		document.getElementsByTagName('head')[0].appendChild(scriptElement);
	}
}

window.addEventListener('load',twitterLoader,true);
The server side PHP code is:
<?php
$data = $_REQUEST['data'];
//Store data in the DB, CouchDB (or some other location)
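//As one simple illustration only, append each entry to a local text file;
//a real setup might use MySQL, CouchDB or some other store instead
file_put_contents('twitters.txt', $data . "\n", FILE_APPEND);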
$last = $_REQUEST['last'];
if($last == 'true'){
	echo "
	var divs = document.getElementsByTagName('div');
	var j = 0;
	for(j = 0; j < divs.length; j++){
		if(divs[j].className == 'pagination')
			break;
	}
	var sectionLinks = divs[j].getElementsByTagName('a');
	var href = '';
	if(sectionLinks.length == 2)
		href = sectionLinks[1].href;
	else
		href = sectionLinks[0].href;
	//parseInt stops at the first non-digit, so multi-digit page numbers work too
	var presentPage = parseInt(document.location.href.substring(document.location.href.indexOf('page=') + 'page='.length));
	var nextPage = parseInt(href.substring(href.indexOf('page=') + 'page='.length));
	if(nextPage < presentPage)
		alert('No more pages to parse');
	else {
		alert('Changing document location');
		document.location.href = href;
	}
	";
} else {
	echo "
	var recorder = 'true';
	";
}
?>
The GM script scrapes the Twitters from a page and posts them to the server using <script> includes. The server stores the Twitters in some data store. It also checks whether the Twitter just posted was the last one on the page; if so, it sends back code that moves the browser to the next page.
Thus, once installed, the script will post Twitters from the most recent to the oldest.
Ok, now how would this work with other services?
The pattern seems to be (a rough sketch follows the list):
* Get the data elements from the present page – data elements could be movie details, restaurant details etc.
* Post data elements to the server.
** The posting might require splitting the content if the length is more than the maximum length of the GET request URL.
* Identify how to move to the next page and when to move. Use this to hint the server when to send back the page-change code.
* Write the server side logic to store data elements.
* Use the hint from the client to change to the next page when required.
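To make the pattern concrete, here is a rough Greasemonkey-style sketch of how the same approach might be applied to some other service. It is only an outline, not working code for any particular site: the item selector, the collector URL and the chunk size are placeholders you would have to adapt, and the server side would mirror the PHP above.

// A generic sketch of the pattern, not tied to any real service.
// extractItems(), the 'item' class and COLLECTOR are placeholders.
(function(){
	var COLLECTOR = 'http://example.com/collect.php'; //your server-side endpoint
	var MAX_LEN = 1500; //stay well under typical GET URL limits

	//1. Get the data elements from the present page (service-specific)
	function extractItems(){
		var items = [];
		var nodes = document.getElementsByClassName('item'); //placeholder selector
		for(var i=0;i<nodes.length;i++){
			items.push(nodes[i].textContent);
		}
		return items;
	}

	//2. Split long items so each request stays under MAX_LEN
	//(URL-encoding can expand a chunk a little, so keep a margin)
	function chunk(text){
		var parts = [];
		for(var i=0;i<text.length;i+=MAX_LEN){
			parts.push(text.substring(i,i+MAX_LEN));
		}
		return parts;
	}

	//3. Post one chunk via a <script> include, so the browser's existing
	//session with the service is reused for authentication
	function post(data,isLast){
		var s = document.createElement('script');
		s.src = COLLECTOR+'?last='+isLast+'&data='+encodeURIComponent(data);
		document.getElementsByTagName('head')[0].appendChild(s);
	}

	//4. Flag the very last chunk so the server knows when to reply with
	//the code that moves the browser to the next page
	var items = extractItems();
	for(var i=0;i<items.length;i++){
		var parts = chunk(items[i]);
		for(var j=0;j<parts.length;j++){
			var isLast = (i == items.length-1) && (j == parts.length-1);
			post(parts[j], isLast ? 'true' : 'false');
		}
	}
})();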
The biggest advantage of this method is that we make use of the browser both to authenticate with the remote service and to parse the HTML (which, as I mentioned in my previous post, browsers are best at).