Categories
World Wide Web

Downloading your data using Greasemonkey

Whenever I use some service over the web, I look for several things. Ease of use and customisability are important factors.

However, the most important thing I consider is vendor lock-in (or rather the lack of it). Let's say I am using a particular mail service (e.g., Gmail). If, someday, I find a better email service, would it be easy for me to switch to it? How easy is it to transfer my data from my old service to my new one?

For services like mail, there are standard protocols for data access, so this is not an issue. However, for more recent services like blogging and micro-blogging, the most widely used way to get at your data is RSS or Atom over HTTP.

However, not all services provide data as RSS (or XML, or any other parseable form). For example, suppose I maintain a list of movies I have watched in some Facebook application, or a list of restaurants I have visited. How do I download this list? If I cannot download it, does that mean I am tied to this application provider forever? What if I have added 200 movies in my original service and then come across another service with a better interface and more features? I would want to switch to the new service without losing the data I invested time entering into the original one.

In fact, recently when I tried to download all my Twitters, I realized that this feature has been disabled. You can no longer get your old Twitters in XML format.

So what do we do when a service does not provide data as XML and we need to somehow scrape that data and store it?

This is kind of related to my last blog entry.

So I started thinking of ways in which I could download my Twitters. The solution I initially thought of was to use Rhino and John Resig's project (mentioned in my previous blog entry). However, I ran into parse issues as before, so I had to think of alternative ways.

I took advantage of the fact that Twitters are short (no more than 140 characters).

The solution I came up with uses a combination of Greasemonkey and PHP on the server side:

Here is the GM script. If you intend to use it, remember to change the URL that the data is posted to.

// ==UserScript==
// @name           Twitter Downloader
// @namespace      http://buzypi.in/
// @author         Gautham Pai
// @include        http://www.twitter.com/*
// @description    Post Twitters to a remote site
// ==/UserScript==

function twitterLoader(){
	// Collect all the Twitters on the current page. On the Twitter pages of the
	// time, each entry sits in a span with class 'entry-title entry-content'
	// inside the #timeline element.
	var timeLine = document.getElementById('timeline');
	var spans = timeLine.getElementsByTagName('span');
	var url = 'http://buzypi.in/twitter.php';
	var twitters = new Array();
	for(var i=0;i<spans.length;i++){
		if(spans[i].className != 'entry-title entry-content'){
			continue;
		}
		// encodeURIComponent is safer than escape() here, since characters
		// like '+' would otherwise be mangled in the query string.
		twitters.push(encodeURIComponent(spans[i].innerHTML));
	}

	// Post each entry to the server via a <script> include. The 'last' flag
	// tells the server when to send back the page-changing code.
	for(var i=0;i<twitters.length;i++){
		var last = 'false';
		if(i == twitters.length - 1)
			last = 'true';
		var scriptElement = document.createElement('script');
		scriptElement.setAttribute('src',url+'?last='+last+'&data='+twitters[i]);
		scriptElement.setAttribute('type','text/javascript');
		document.getElementsByTagName('head')[0].appendChild(scriptElement);
	}
}

window.addEventListener('load',twitterLoader,true);

The server side PHP code is:

<?php

$data = $_REQUEST['data'];
//Store data in the DB, CouchDB (or some other location)
$last = $_REQUEST['last'];
if($last == 'true'){
	echo "
	var divs = document.getElementsByTagName('div');
	var j= 0;
	for(j=0;j<divs.length;j++){
		if(divs[j].className == 'pagination')
		break;
	}
	var sectionLinks = divs[j].getElementsByTagName('a');
	var href = '';
	if(sectionLinks.length == 2)
		href = sectionLinks[1].href;
	else
		href = sectionLinks[0].href;
	// Parse the full page numbers (parseInt stops at the first non-digit).
	var presentPage = parseInt(document.location.href.substring(document.location.href.indexOf('page=') + 'page='.length));
	var nextPage = parseInt(href.substring(href.indexOf('page=') + 'page='.length));
	if(nextPage < presentPage)
		alert('No more pages to parse');
	else {
		alert('Changing document location');
		document.location.href = href;
	}
	";
} else {
	echo "
	var recorder = 'true';
	";
}

?>

The GM script scrapes the twitters from a page and posts them to the server using <script> includes. The server stores the twitters in some data store. It also checks whether the twitter posted was the last one on the page; if so, it sends back code that moves the browser to the next page.

Thus the script, once installed, will post twitters from the most recent to the oldest.

Ok, now how would this work with other services?

The pattern seems to be:
* Get the data elements from the present page – data elements could be movie details, restaurant details etc.
* Post data elements to the server.
** The posting might require splitting the content if the length is more than the maximum length of the GET request URL.
* Identify how you can move to the next page and when to move to the next page. Use this to hint the server to change to the next page.
* Write the server side logic to store data elements.
* Use the hint from the client to change to the next page when required.
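
To make the pattern concrete, here is a rough Greasemonkey skeleton for such a downloader. Everything in it (the collector URL, the class name, the site) is a placeholder to be adapted per service; only the overall shape matters.

// ==UserScript==
// @name        Generic data downloader (sketch)
// @namespace   http://example.org/
// @include     http://www.example.com/*
// ==/UserScript==

var COLLECTOR_URL = 'http://example.org/collect.php'; // your server-side script

// Step 1: pull the data elements out of the current page.
function extractItems(){
	var nodes = document.getElementsByClassName('data-item'); // placeholder class
	var items = [];
	for(var i = 0; i < nodes.length; i++){
		items.push(encodeURIComponent(nodes[i].innerHTML));
	}
	return items;
}

// Step 2: post each element to the server using the same <script>-include trick.
function postItem(item, isLast){
	var s = document.createElement('script');
	s.src = COLLECTOR_URL + '?last=' + isLast + '&data=' + item;
	s.type = 'text/javascript';
	document.getElementsByTagName('head')[0].appendChild(s);
}

function run(){
	var items = extractItems();
	for(var i = 0; i < items.length; i++){
		// The server uses the 'last' flag to decide when to send back
		// the code that navigates to the next page.
		postItem(items[i], (i == items.length - 1) ? 'true' : 'false');
	}
}

window.addEventListener('load', run, true);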

The biggest advantage of this method is that we make use of the browser to authenticate with the remote service and also to parse the HTML (which, as I mentioned in my previous post, browsers are best at).

Categories
World Wide Web

HTML parsing and Rhino

About a year back I was working on a personal project in IBM. This was a clone of YubNub for the IBM intranet.

For those of you who don’t know YubNub, it is a simple but powerful tool, which allows you to define keywords to reach pages. One of the popular examples is gim which will take you to the Google Image Search results page for the keywords that you entered.

When I built this YubNub clone, I had plans to introduce a feature for defining commands that get data from specific portions of a page. For example, you would be able to fetch the telephone number of a person using a command like: telephone . The way this would work is by scraping the telephone number from a specific section of the person's profile page.

But wouldn’t it be cool to provide the flexibility to the user to define what to fetch from a page on the Intranet? You can ask the user to define what content to fetch from a page when he creates the command.

Look at the YubNub create command interface. The basic information asked in the page is:

  • Name of the command
  • URL
  • Description

Now imagine having an extra text-field which asks you to enter the XPath to the content that you want to scrape from the resultant page.

In simple words, this means you are saying: fetch this page, then get this specific portion of the page and give me only that content. You could perhaps pipe that content to some other command or play with it in umpteen ways. I haven't followed YubNub of late, but I am sure there are many commands in YubNub with similar functionality.
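
For illustration, in a browser this kind of extraction is a one-liner with document.evaluate (the XPath and the element id here are made up):

var result = document.evaluate("//div[@id='telephone']", document, null,
		XPathResult.STRING_TYPE, null);
alert(result.stringValue); // the scraped telephone number

The catch, as described next, is doing the same thing outside the browser.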

Now, although this is possible in principle, there was one major issue I faced. The server had to fetch the page and then scrape it. And although there are very good XML parsers out there, there is no good 'XML' parser for HTML, and XPath does not work unless the page is XML.

Most pages on the Internet are HTML (or XHTML), and although it sounds straightforward to transform them to XML, anyone who has tried it will see that it is not a simple solution. When you try to parse an XHTML page (even popular pages out there) you will run into issues like 'entity not defined' or 'matching element not found'. And although there are tools like Tidy or TagSoup, you are not guaranteed that their output is well-formed XML.

Browsers, on the other hand, are extremely flexible in the way they handle HTML. Traversing the HTML DOM is really simple, and many a time you don't even realize that your browser has silently corrected tens of errors in the page. You can get to any specific portion of the page using HTML DOM functions or libraries like jQuery.

So what I was looking for, was some tool which had the flexibility of the browser’s HTML handling, but at the same time was able to function on the server.

As if by coincidence, I ran into this post from John Resig (the person behind jQuery). John describes one of his projects on bringing the browser environment to Rhino. He also gives an example of how to scrape content from a web page and send the result to a file.

Wow! This is exactly what I had been looking for. Since Rhino can be embedded in Java, all you would need to do is to make a call to the JS function to scrape content and then pass the content back to Java and continue with your processing.
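
Roughly, a Rhino script along these lines should do the job. This is only a sketch: it assumes env.js and jquery.js are available locally, the page URL and selector are made up, and the exact loading details may differ in John's project.

// Run with the Rhino shell, for example: java -jar js.jar scrape.js
load('env.js');                        // John Resig's browser environment for Rhino
window.location = 'http://example.com/profile/12345';  // hypothetical page to scrape

window.onload = function(){
	load('jquery.js');                 // jQuery now runs against the simulated DOM
	// Placeholder selector: pick out whichever portion of the page you need.
	var telephone = jQuery('#telephone').text();
	print(telephone);                  // or write it to a file / hand it back to Java
};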

Although I don’t work on the project anymore, I see a requirement for this functionality in many other places. For example, just some time back, I was looking for a simple tool to fetch tiddlers from TiddlyWiki and convert them into a simple HTML page. This would help in supporting browsers which don’t have Javascript enabled. I tried some of the tools out there, but most of them failed. So I planned to write my own. And lo, I came across this same issue: TiddlyWiki content is in HTML, and this content is not easy to parse using XML parsers (which is perhaps why many of those tools failed). So how about using Rhino and John’s project to scrape content from the wiki and send it to a file in a different format?

The project looks very promising. I should follow it closely.

Categories
World Wide Web

Bulls and cows and the Javascript challenge

About 2 years back, I had conducted an experiment with the Bulls and Cows game [1] [2]. I now wanted to see what the 'human average' for the game is, so I wanted to build a small Facebook application to add a social aspect to the game and conduct my experiments.

But before I continued, I had to solve a major problem.

If I continue to make it a Javascript game, as is hosted here, I need to ensure that the random number generated by the browser is secure and cannot be manipulated or discovered by the player through unfair means.

Anyone who knows a bit of Javascript and is used to looking at code using Firebug will soon be able to 'guess' the number in one step:

Yeah, that's right. I store the generated random number in a variable called randomNo, and you can find out its value using Firebug. Now this is fine as long as it is not a competition and you play the game because you actually like it, not because you could win a million dollars. But what if this game were being played for money?

So my next attempt was to store an MD5 hash of the number and then match it with the MD5 of the number entered by the player. This works well as long as the random number is generated on the server side and only the MD5 is sent to the client.
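
For illustration, the client-side check would then look something like this (a sketch only; MD5() stands for any Javascript MD5 library, and the hash would be embedded in the page by the server):

var secretHash = '948f847055c6bf156997ce9fb59919be'; // sent by the server, not the number itself

function checkGuess(guess){
	// The page only ever holds the hash, so Firebug reveals nothing useful.
	return MD5(guess + '') == secretHash;
}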

Can the random number and its MD5 be generated on the client side without the user being able to 'debug' and get the random number?

My first attempt towards this was the following piece of code:

// MD5() comes from a client-side MD5 library included in the page.
function getRandomNo(){
	var md5OfRandomNo = MD5(Math.floor(Math.random()*10001) + '');
	return md5OfRandomNo;
}

But unfortunately, you can simply set a breakpoint in Firebug, step into the function, and see the random number before it is hashed. 🙁

Right now, I am still not able to find a fool-proof way to generate the random number on the client side. Is there a solution?

Ok, let's say the number is securely generated in some way (client or server) and we only store the MD5 value on the client. Now, there is a second problem:

What if the player just changes the random number altogether?

>>> randomNo
"948f847055c6bf156997ce9fb59919be"
>>> randomNo = MD5('7839')
"ca91c5464e73d3066825362c3093a45f"

We need to maintain a session and include some verification code to ensure that the MD5 was not manipulated.

Is there a solution for this if we want to write the entire game using only Javascript? Are there any issues other than the two described?

Categories
My Updates

Bye bye IBM, hello Ugenie

If 2007 has been a very long year, December has been a very long month!

I quit IBM this week and took up a new position in Ugenie today [1][2].

This news came as a surprise to many who considered me quite loyal to IBM. IBM has been a splendid place; there is no dearth of opportunities there. The more you are ready to take up responsibilities, the more you are given.

So what on earth made me switch?
The primary reason for the switch is that I wanted to work in a startup, on something that is directly used by non-technical end users.

How do we serve a large user base? How do we keep up with the ever increasing and conflicting demands of users? How are things prioritized? How is it that a small group of 15-20 individuals can do something in a matter of days that large organizations take weeks to implement?

The equation in a startup is quite different from that in a large organization. I have read this before, but have never had first-hand experience. So I decided to take the plunge and experience it myself.

And then there was the question of the 'right time'.
Is this a good time? Should I wait? What will I gain, what will I miss? The more I thought about it, the more it confused me. So finally I just chose to go with Ugenie.

The work seems to be interesting. I am looking forward to it!

With some people predicting a dot-com crash in 2008, was this a good idea?
Time will tell. But whatever the case, I am not quite concerned.

Categories
World Wide Web

Google and innovation – take 2

About a couple of years back, I wrote about how Google had a tight integration between its various services, and how Yahoo lacked it.

However, when I made that entry, Google had very few services and Yahoo had lots of them. In fact, Google was primarily a search company, and Gmail and Calendar were just new arrivals on the scene.

However, now that Google has effectively become a Yahoo 2.0, it's time to look at Google's offerings again and see how they have fared.

The first impression is that Google has done tremendously well. Although they have acquired several companies in the last couple of years, they have been very quick in integrating these applications into their portfolio. Orkut, Gmail/GTalk integration, Gmail/Google Docs integration, and the Google Mashup Editor are some examples.

However on second thoughts, it looks like there is a lot that is still to be done.

What kind of integration can we expect?

  • You bookmark resources in various services
    • Starring entries in Google Reader
    • Starring posts in Google Groups
    • Starring Google search results
    • Noting down items or clipping entries in Google Notebook
    • Indicating your favorite books in Google Books.
Why is there no single 'Google bookmarks' service?
  • Social network everywhere
Mail and IM are inherently social applications. However, with the Facebook revolution, a social network now revolves around everything we do over the web. Google already has its own social network. How well is it integrated with its various services? More on Google Reader social network integration.
  • Presence awareness everywhere
A related expectation is presence awareness in various Google services. GMail has a tight integration with GTalk. Why is a similar presence awareness not available in Google Reader, Google Docs etc?
  • Uniform look and feel
Google has been doing very well here. However, some work is still required on sites like Orkut and YouTube.
  • Attention profiling
Google has a log of your search history. However, I guess it would be interesting to integrate this data with your mail interactions, your Google Reader trends, your group activities, your interactions on Orkut, the sites you browse, etc.

It is a complex problem to solve. You don't know all the interaction points between the various services, or all the dimensions along which these applications could connect. However, we learn some of these over time. For example, it looks like social networks and attention profiling are here to stay. So if you are building an application, ensure that it is well integrated with some social network and also takes into account the attention information of a user.

Categories
General

Random thought

“There are some points in your life when you have to just put your brain on auto-pilot and work on intuition and coincidences rather than on logic and reasoning.”

Categories
World Wide Web

2007 – a recapitulation

Warning: Some bit of self praise ahead.

We are nearing the end of yet another successful year. 2007 has been one of the ‘longest’ years of my life. With milestones achieved and several noteworthy events in both my personal and professional life, I should say this perhaps could not have been better!

Here are some important milestones/events in the last year (in no particular order):

  • Learnt to drive a car and got my four-wheeler license. This had been pending for a very long time.
  • Was part of the organizing committee of the Web 2.0 conference organized by CSI Bangalore. I also delivered a talk on enterprise applications of Web 2.0 there.
  • Attended swimming classes, another thing that had been pending for a long time.
  • Tried my hand at cooking (could have done better).
  • Got myself an iPod Nano.
  • Delivered several talks and conducted hands-on workshops in colleges, mostly on Eclipse.
  • A few solo bike rides, including the ring-road trip (this got more traction than I had expected, with casual readers noticing it more than my other entries!).
  • Went to a few places and clicked a lot of photos.
  • Guided a successful intern project.
  • Eclifox release in alphaWorks.
  • A few awards, papers and recognitions.
  • Several other professional achievements – which unfortunately I cannot mention here.
  • Last but not the least: experimented with my hair style and got a whole bunch of comments from family and friends!

Categories
General World Wide Web

Prove that you are a human being

Continuing with my observations made here, I wonder if the following way of preventing someone from spamming a commenting system really helps:

____________ Spam protection: Sum of II plus III plus IV ?

If I really want to spam this site, I can write a simple routine to scrape the string matching 'Sum of ___ ?' and then feed that into Google like this:

Google search: II plus III plus IV

And there you go! You are spammed!
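
In fact, you don't even need Google. A few lines of Javascript can solve the challenge directly. Here is a sketch; the question format is the one shown above, and everything else in it is made up for illustration.

var ROMAN = {I: 1, V: 5, X: 10, L: 50, C: 100};

function romanToInt(numeral){
	var total = 0;
	for(var i = 0; i < numeral.length; i++){
		var value = ROMAN[numeral.charAt(i)];
		// Handle subtractive notation: IV = 4, IX = 9, and so on.
		if(i + 1 < numeral.length && value < ROMAN[numeral.charAt(i + 1)])
			total -= value;
		else
			total += value;
	}
	return total;
}

function solveChallenge(pageText){
	// Scrape the question out of the page and add up the Roman numerals.
	var match = pageText.match(/Sum of (.+?)\s*\?/);
	if(!match) return null;
	var numerals = match[1].split(/\s+plus\s+/);
	var sum = 0;
	for(var i = 0; i < numerals.length; i++)
		sum += romanToInt(numerals[i]);
	return sum;
}

// solveChallenge('Spam protection: Sum of II plus III plus IV ?') returns 9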

Categories
World Wide Web

Speed reading by hacking the column count in Firefox

Recently, I came across a Greasemonkey script for Wikipedia. The script helps us to view Wikipedia articles in multiple columns.

I found it useful and in fact saw that it improved my reading speed. In the last week, I have referred to a lot of Wikipedia articles, and I am really addicted to this multi-column hack.

So now, when I am reading some article, if the article spans the entire width of the page, I open Firebug, 'Inspect' the element displaying the content under consideration and add:

-moz-column-count: 3;
-moz-column-gap: 50px;
font-family: Calibri;
font-size: 11px;

to the element.

And if I end up visiting this site frequently, then I can add a Greasemonkey script or a Userstyle for the page or set of pages.

The above screenshot shows a Wikipedia page as displayed in my browser.
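
As a sketch of the Greasemonkey route mentioned above: the element id below is a placeholder for whatever holds the article text on the site you are styling ('bodyContent' is what I believe Wikipedia uses).

// ==UserScript==
// @name        Multi-column reader (sketch)
// @include     http://en.wikipedia.org/wiki/*
// ==/UserScript==

var content = document.getElementById('bodyContent'); // placeholder id
if(content){
	content.style.MozColumnCount = '3';
	content.style.MozColumnGap = '50px';
	content.style.fontFamily = 'Calibri';
	content.style.fontSize = '11px';
}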

So why is this so useful?
Some time back, when I was reading an article on usability, I learnt that reading speed depends on the width of the column. This is one of the reasons why you are able to read news articles faster in newspapers than online: you end up scanning the page vertically rather than making both horizontal and vertical eye movements. Rather than point to a single article, I would like to point you to a Google search for the studies around this topic.

Some of the popular pages where I have added this multi-column functionality are Wikipedia, developerWorks and Javadocs.

Categories
My Updates

Internship 2008

I had been to SJCE last weekend to conduct internship interviews for the current final-year engineering students. The turnout was much lower than expected. We went quite late this year, and many students had already been offered projects by other companies. Also, we had a cut-off of 70%. About 40 students took the written test. We short-listed 13 and ended up selecting 3 students, all from Information Science. We had expected 120 students to turn up and were planning to select 9-12 students.

Congratulations to those selected. And a suggestion to those who were not: it is not the number of companies you are placed in that matters, nor is it the pay. It is your fundamentals that will take you a long way. Don't blame the company if you are not offered a quality job. Ensure that you have what it takes to get one.