Downloading your data using Greasemonkey

Whenever I use some service over the web, I look for several things. Ease of use and customisability are important factors.

However, the most important thing I consider is vendor lock-in (or rather the lack of it). Let's say I am using a particular mail service (e.g., Gmail). If someday I find a better email service, would it be easy for me to switch? How easy is it to transfer my data from the old service to the new one?

For services like mail, there are standard protocols for data access, so this is not an issue. However, for more recent services such as blogging and micro-blogging, the most widely used way to get at your data is an RSS or Atom feed served over HTTP.

However, not all services provide data as RSS (or XML, or any other parseable form). For example, suppose I make a list of movies I have watched in some Facebook application, or a list of restaurants I have visited. How do I download this list? If I cannot download it, does it mean I am tied to this application provider forever? What if I have added 200 movies to my original service and then come across another service with a better interface and more features? I would want to switch without losing the data that I invested time in entering.

In fact, when I recently tried to download all my Twitters, I realized that this feature had been disabled. You can no longer get your old Twitters in XML format.

So what do we do when a service does not provide data as XML and we need to somehow scrape that data and store it?

This is kind of related to my last blog entry.

So I started thinking of ways in which I could download my Twitters. The solution I thought of initially was using Rhino and John Resig's project (mentioned in my previous blog entry). However, I ran into parse issues like before. So I had to think of alternative ways.

Then I took advantage of the fact that Twitters are short (no more than 140 characters), so each one fits comfortably in the query string of a GET request.

The solution I came up with uses a combination of Greasemonkey on the client and PHP on the server side.

Here is the GM script:
If you intend to use this, do remember to change the URL to post data to.

// ==UserScript==
// @name           Twitter Downloader
// @namespace      http://buzypi.in/
// @author         Gautham Pai
// @include        http://www.twitter.com/*
// @description    Post Twitters to a remote site
// ==/UserScript==

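function twitterLoader (){
	var timeLine = document.getElementById('timeline');
	// Nothing to do on pages without a timeline
	if(!timeLine){
		return;
	}
	var spans = timeLine.getElementsByTagName('span');
	var url = 'http://buzypi.in/twitter.php';
	var twitters = [];
	// Collect the content of every twitter on the page
	for(var i=0;i<spans.length;i++){
		if(spans[i].className != 'entry-title entry-content'){
			continue;
		}
		// encodeURIComponent (unlike escape) produces UTF-8 percent-encoding that PHP decodes correctly
		twitters.push(encodeURIComponent(spans[i].innerHTML));
	}

	// Send each twitter to the server in the query string of a <script> include
	for(var i=0;i<twitters.length;i++){
		// Flag the final twitter so that the server replies with the code that moves to the next page
		var last = 'false';
		if(i == twitters.length - 1)
			last = 'true';
		var scriptElement = document.createElement('script');
		scriptElement.setAttribute('src',url+'?last='+last+'&data='+twitters[i]);
		scriptElement.setAttribute('type','text/javascript');
		document.getElementsByTagName('head')[0].appendChild(scriptElement);
	}
}

window.addEventListener('load',twitterLoader,true);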

The server side PHP code is:

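<?php

// Reply with JavaScript, since the response is loaded through a <script> include.
header('Content-Type: text/javascript');

// $_REQUEST is a superglobal, so it needs no 'global' declaration.
$data = isset($_REQUEST['data']) ? $_REQUEST['data'] : '';
//Store data in the DB, CouchDB (or some other location)
$last = isset($_REQUEST['last']) ? $_REQUEST['last'] : 'false';
if($last == 'true'){
	echo "
	var divs = document.getElementsByTagName('div');
	var pagination = null;
	for(var j=0;j<divs.length;j++){
		if(divs[j].className == 'pagination'){
			pagination = divs[j];
			break;
		}
	}
	var sectionLinks = pagination ? pagination.getElementsByTagName('a') : [];
	// Read the whole page number (not just one digit); a URL without one is page 1.
	var pageOf = function(u){
		var m = /page=([0-9]+)/.exec(u);
		return m ? parseInt(m[1], 10) : 1;
	};
	var href = '';
	// With two links (previous, next), the second one is next.
	if(sectionLinks.length == 2)
		href = sectionLinks[1].href;
	else if(sectionLinks.length == 1)
		href = sectionLinks[0].href;
	if(href == '' || pageOf(href) < pageOf(document.location.href))
		alert('No more pages to parse');
	else {
		alert('Changing document location');
		document.location.href = href;
	}
	";
} else {
	echo "
	var recorder = 'true';
	";
}

?>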

The GM script scrapes the twitters from a page and posts them to the server using <script> includes. The server stores the twitters in some data store. The server also checks whether the twitter posted was the last one on the page. If so, it sends back code that moves the browser to the next page.

Thus the script, when installed, will post twitters from the most recent to the oldest.
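
The decision about when to move on and when to stop can be pulled out into two small functions that are easy to test outside the browser. This is only a sketch: the names pageNumber and nextPageUrl are my own, and it assumes that the page number appears in the URL as a page=N query parameter and that a URL without one is the first page.

```javascript
// Extract the page number from a URL such as http://example.com/home?page=12.
// Assumption: a URL without a page parameter is page 1.
function pageNumber(url) {
	var match = /[?&]page=(\d+)/.exec(url);
	return match ? parseInt(match[1], 10) : 1;
}

// Given the current URL and the hrefs found in the pagination block,
// return the URL to visit next, or null when there is nothing older to fetch.
function nextPageUrl(currentUrl, paginationHrefs) {
	if (paginationHrefs.length === 0) {
		return null;
	}
	// With two links (previous, next) the second one is 'next'. With a single
	// link it is either 'next' (first page) or 'previous' (last page).
	var candidate = paginationHrefs.length === 2 ? paginationHrefs[1] : paginationHrefs[0];
	return pageNumber(candidate) > pageNumber(currentUrl) ? candidate : null;
}
```

On the last page the only link points backwards, so nextPageUrl returns null and the crawl stops.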

Ok, now how would this work with other services?

The pattern seems to be:
* Get the data elements from the present page – data elements could be movie details, restaurant details etc.
* Post data elements to the server.
** The posting might require splitting the content if the length is more than the maximum length of the GET request URL.
* Identify how you can move to the next page and when to move to the next page. Use this to hint the server to change to the next page.
* Write the server side logic to store data elements.
* Use the hint from the client to change to the next page when required.
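
For the splitting step, here is a rough sketch of how longer content could be cut into pieces that each fit into one GET request. The function names, the part and total parameters, and the idea of passing in a maximum URL length are all my own assumptions; real limits vary between browsers and servers, and the server would have to put the parts back together before storing them.

```javascript
// Split an already percent-encoded string into pieces of at most `size`
// characters, without cutting a %XX escape sequence in half.
function splitEncoded(encoded, size) {
	if (size < 3) {
		throw new Error('size must be at least 3 to hold one %XX escape');
	}
	var chunks = [];
	var start = 0;
	while (start < encoded.length) {
		var end = Math.min(start + size, encoded.length);
		if (end < encoded.length) {
			// If the cut would land inside a %XX escape, move it back before the '%'.
			if (encoded.charAt(end - 1) === '%') {
				end -= 1;
			} else if (encoded.charAt(end - 2) === '%') {
				end -= 2;
			}
		}
		chunks.push(encoded.substring(start, end));
		start = end;
	}
	return chunks;
}

// Build one URL per chunk. 'part' and 'total' are hypothetical parameters
// that would let the server reassemble the chunks in order.
function buildChunkUrls(baseUrl, text, isLast, maxUrlLength) {
	var encoded = encodeURIComponent(text);
	// Reserve room for the fixed part of the query string (generous estimate).
	var overhead = (baseUrl + '?last=false&part=9999&total=9999&data=').length;
	var chunks = splitEncoded(encoded, maxUrlLength - overhead);
	if (chunks.length === 0) {
		chunks.push(''); // still send one request for empty content
	}
	var urls = [];
	for (var i = 0; i < chunks.length; i++) {
		// Only the final chunk of the final element carries last=true.
		var last = (isLast && i === chunks.length - 1) ? 'true' : 'false';
		urls.push(baseUrl + '?last=' + last + '&part=' + i +
			'&total=' + chunks.length + '&data=' + chunks[i]);
	}
	return urls;
}
```

Each URL returned by buildChunkUrls would then be used as the src of a <script> element, exactly as in the Twitter script above.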

The biggest advantage of this method is that we make use of the browser both to authenticate with the remote service and to parse the HTML (which, as I mentioned in my previous post, browsers are best at).

Site specific browsers or desktop enabling web applications

The concept of site specific browsers has been around for quite some time now. While on one side, people are moving desktop application functionality to the web so as to take advantage of the features that the WWW provides, there is another group trying to bring closer integration of web applications with the desktop.

This is expected to continue until we reach a point where the Internet and the WWW are seamlessly integrated with the gadgets (laptops, cell phones, etc.) that we use every day, and it becomes difficult to define what a client is and what a server is!

So what does site specific browser mean?
Well, at the simplest, it means running your web application in its own separate process. However as this concept evolves, this will be more interesting. Imagine having a separate process run your favorite web application with look and feel (using UserStyles) and functionality (using Greasemonkey) tweaked according to your needs. I am not sure how this is going to look, but I am imagining some support for tweaking a web application being built into the webapp – showing both properties that the web application developer has provided (ex: offlining with Google Gears) and properties that are customizable using tools like Userstyles and Greasemonkey. Also expect better support for mashupability and better integration with other desktop applications and processes.

So what applications/runtimes do we have as of today?
Mozilla is working on WebRunner. Meanwhile, Adobe has been betting big on AIR. At this stage, it looks like WebRunner is doing a better job, primarily because it comes from Mozilla and is open source, so you can expect extensibility and a foundation of open formats, while AIR is based on a proprietary runtime. Can you expect extensibility and customization of AIR-based applications? Well, it is a bit too early to be comparing technologies when the idea is still new.

I feel the primary value add in site specific browsers is in the customization of the application and the increase in usability when compared to running it in a browser.

On a separate note, this reminds me of the blog entry that I made about the death of browsers or about desktop enabling web applications back in 2005!

User adoption and its effect on technological evolution

The IPv6 specification has been around for about 9 years now. People understand the problems with IPv4. Yet the world is struggling to move to IPv6.

Ruby has been around for about 14 years now. It had enough time to evolve before it was widely adopted.

Java has been around for about 12 years. The adoption was quite fast. Now, although newer versions of Java are being released, it is slowly losing its charm. There are some known problems which ideally would not have been there in the first place, but which are now too late or too difficult to correct.

Browsers were meant to be used to browse documents containing hypertext. HTTP is a request/response application-level protocol. People are now using both for things that they were not designed to do: browsers as application platforms (mashups from multiple sites) and HTTP for pushing data from the server, aka Comet.

I can go on. But, do you see a pattern here? Or are these just random bits from history?

The point I am trying to make is that it is very difficult for technology to evolve once it has been widely adopted. In such cases, it is considered okay to bypass the working of the technology without breaking it by identifying loopholes!

So do we blame adoption before a technology matures? Or should there be a way for technology to mature and evolve without carrying its sins forward? Is it possible to mandate a 'big-bang' in which a specification moves from one version to another, or in which technology changes radically, without backward compatibility? How can this be achieved with minimal side effects? Does the answer lie in The Tipping Point?!

Seems like an interesting problem to solve.

Flock – a Web 2.0 browser?

I have been trying Flock for about a month now, and I am hooked.

What I liked:

  1. Blogging support – I am making this blog entry from within Flock. Also there is Technorati publishing support and all that.
  2. Flickr support – you get to know if someone adds new photos and you get to see them in a neat view.
  3. Delicious support – one of my favorite features here. There is a neat sync between your local bookmarks and your delicious bookmarks. You just click on the 'star' next to the address bar and you get a popup where you indicate whether the bookmark should be posted to del.icio.us.
  4. An improved search bar – there is live Yahoo search, local search history and the usual search engine support.
  5. Performance – somehow seems better than Firefox. Dunno why? 😐 (However, see hate point 2)
  6. Web snippets – I don't use this much, but there is a snippets bar, where you can copy snippets of your interest.
  7. News – This is where you get to manage your RSS feeds. But I don't use this either, not better than Blogbridge. 🙂

What I hated:

  1. I sometimes feel they should have built an extension for Firefox rather than a separate browser. Some extensions might not work in Flock. Developers have to support Flock separately. This is not good.
  2. Sometimes, there is some backend process which runs for a long time and results in an 'Unresponsive script' warning. This freezes the browser for a while.

Overall, I strongly recommend Flock for people who use the utilities mentioned and have been craving integration between them.


Blogged with Flock