YQL – Yahoo’s query language for the web

This post is a part of the AfterThoughts series of posts.

Post: A query language for searching websites
Originally posted on: 2005-01-27

I blogged about the idea of a query language for websites back in 2005. Today, when I was doing my feed sweep, I came across YQL, a query language with SQL-like syntax from Yahoo that allows you to query for structured data from various Yahoo services.

One thing I found interesting is the ability to query ‘any’ HTML page for data at a specific XPath. There are some details in the official Yahoo Developer blog.

The intent of YQL is not the same as what I had blogged about. While YQL allows you to get data from a specific page, what I had intended was something more generic – an ability for you to query a set of pages or the whole of the web for specific data, which is a tougher problem to solve.

In order to fetch specific data from an HTML page using YQL, all you have to do is:
1. Go to the page that you want to extract data from.
2. Open up Firebug and point to the data that you want to extract (using Inspect).
3. Right click the node in Firebug and click on ‘Copy XPath’.
4. Now create a query in YQL like this:
select * from html where url="" and xpath=""
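
To avoid quoting mistakes when pasting a copied XPath into the statement from step 4, the query string can be assembled by a small helper. This is only a sketch: `buildYqlQuery` is a hypothetical name, the example page and XPath are made up, and because the way YQL expects quotes inside a value to be escaped is not covered here, the helper simply refuses values that contain a double quote.

```javascript
// Hypothetical helper: builds the YQL statement shown in step 4.
function buildYqlQuery(pageUrl, xpath) {
  [pageUrl, xpath].forEach(function (value) {
    if (typeof value !== 'string' || value.length === 0) {
      throw new Error('Both the page URL and the XPath are required');
    }
    // Escaping rules are not handled in this sketch, so reject quotes.
    if (value.indexOf('"') !== -1) {
      throw new Error('Double quotes are not handled by this sketch');
    }
  });
  return 'select * from html where url="' + pageUrl +
         '" and xpath="' + xpath + '"';
}

// Example with a made-up page and XPath:
var query = buildYqlQuery('http://example.com/', '/html/body/div[2]/p');
console.log(query);
```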

Although the idea seems promising, I wasn’t able to get it to work for most XPaths.

I guess the reason is the difference between the way the browser interprets the HTML and the way a server would interpret it. For example, if there is no ‘tbody’ tag in your table, the Firefox browser inserts a ‘tbody’ tag and that would be present in your XPath, while a server that interprets the HTML after Tidying it wouldn’t see one. One way we can solve this is to have the same engine interpret the XPath on the server side as well or be as lenient as possible when matching the XPaths. I had similar discussions with the research team in IRL when I was working on my idea of MySearch, which had similar issues, and there were some interesting solutions that we discussed.
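
One possible client-side workaround for the ‘tbody’ case is to remove those steps from the copied XPath before using it in a query. The sketch below is not one of the solutions discussed with the research team; `stripImplicitTbody` is a hypothetical name, and it assumes the page source has no ‘tbody’ tags of its own (if it does, removing the step would break the path).

```javascript
// Hypothetical helper: drops the tbody steps that Firefox adds to its DOM
// but that are absent from the page source a server would parse.
function stripImplicitTbody(xpath) {
  // Matches "/tbody" or "/tbody[1]" only when it is a whole path step,
  // i.e. when it is followed by another "/" or by the end of the path.
  return xpath.replace(/\/tbody(\[\d+\])?(?=\/|$)/g, '');
}

// Example with a made-up XPath as copied from Firebug:
console.log(stripImplicitTbody('/html/body/table/tbody/tr[2]/td'));
```

This only covers one known difference between the browser DOM and the tidied source; other differences would need their own handling.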

I would say it is only a matter of time before someone cracks the problem of extracting structured data from the semi-structured data on the web and makes it available to other services. Tools like Dapper, Yahoo Pipes, YubNub and YQL are just the beginning.

I have made several attempts at this, from using one of these tools to building my own using Rhino, Jaxer etc., but so far the solution I am most content with is a combination of curl, grep, awk and sed.

How to ensure that your extensions work on Firefox 3.0

Here are the steps that I found useful to port my extensions from Firefox 2.0 to 3.0:

  • Step 1: Just start Firefox and allow it to update the extensions. You could go to: Tools -> Add-ons -> Extensions -> Find updates.
    This should update many of the extensions. Restart Firefox.
  • Step 2: For those extensions where the auto-update has not functioned properly, you might want to manually see if an update is available. This is because for some extensions, the auto-update may not recognize that a new version is available.
    • Uninstall the older version and restart Firefox.
    • Search for the addons here and add them.
  • Step 3: Install the MR Tech Toolkit extension.
  • Step 4: For those extensions that have still not been updated and that you desperately need, try the 'Make compatible' option from MR Tech Toolkit. This option is available when you right-click an extension in the Extensions tab. If the compatibility range is up to some older version of 3.0 (for example 3.0b5) then this might work.
  • Step 5: Look for updates at a higher frequency over the next few days. Developers will be forced to ensure that their extensions work in the new version of Firefox, so you can expect an update soon.

Downloading your data using Greasemonkey

Whenever I use some service over the web, I look for several things. Ease of use and customisability are important factors.

However, the most important thing I consider is vendor lock-in (or rather the lack of it). Let's say I am using a particular mail service (e.g., Gmail). If someday I find a better email service, would it be easy for me to switch to that service? How easy is it for me to transfer my data from my old service to my new service?

For services like mail, there are standard protocols for data access, so this is not an issue. However, for the more recent services like blogging and micro-blogging, the most widely used data access method is RSS or Atom feeds served over HTTP.

However, not all services provide data as RSS (or XML or any other parseable form). For example, suppose I make a list of movies I have watched in some Facebook application, or a list of restaurants I have visited. How do I download this list? If I cannot download it, does it mean I am tied to this application provider forever? What if I have added 200 movies in my original service and then come across another service that has a better interface and more features? I would want to switch to the new service without losing the data that I have invested time to enter in my original service.

In fact, recently when I tried to download all my Twitters, I realized that this feature has been disabled. You are not able to get your old Twitters in XML format.

So what do we do when a service does not provide data as XML and we need to somehow scrape that data and store it?

This is kind of related to my last blog entry.

So I started thinking of ways in which I could download my Twitters. The solution I thought of initially was using Rhino and John Resig's project (mentioned in my previous blog entry). However, I ran into parse issues like before. So I had to think of alternative ways.

Now I took advantage of the fact that Twitters are short (and not more than 140 characters).

The solution I came up with uses a combination of Greasemonkey and PHP on the server side:

Here is the GM script:
If you intend to use this, do remember to change the URL to post data to.

// ==UserScript==
// @name           Twitter Downloader
// @namespace      http://buzypi.in/
// @author         Gautham Pai
// @include        http://www.twitter.com/*
// @description    Post Twitters to a remote site
// ==/UserScript==

function twitterLoader (){
	var timeLine = document.getElementById('timeline');
	var spans = timeLine.getElementsByTagName('span');
	var url = 'http://buzypi.in/twitter.php';
	var twitters = new Array();
	for(var i=0;i<spans.length;i++){
		if(spans[i].className != 'entry-title entry-content'){
			continue;
		}
		// encodeURIComponent (unlike escape) produces UTF-8 escapes and encodes '+'
		twitters.push(encodeURIComponent(spans[i].innerHTML));
	}

	for(var i=0;i<twitters.length;i++){
		var last = 'false';
		if(i == twitters.length - 1)
			last = 'true';
		var scriptElement = document.createElement('script');
		scriptElement.setAttribute('src',url+'?last='+last+'&data='+twitters[i]);
		scriptElement.setAttribute('type','text/javascript');
		document.getElementsByTagName('head')[0].appendChild(scriptElement);
	}
}

window.addEventListener('load',twitterLoader,true);

The server side PHP code is:

<?php

// $_REQUEST is a superglobal, so no 'global' declaration is needed
$data = $_REQUEST['data'];
//Store data in the DB, CouchDB (or some other location)
$last = $_REQUEST['last'];
if($last == 'true'){
	echo "
	var divs = document.getElementsByTagName('div');
	var j= 0;
	for(j=0;j<divs.length;j++){
		if(divs[j].className == 'pagination')
		break;
	}
	var sectionLinks = divs[j].getElementsByTagName('a');
	var href = '';
	if(sectionLinks.length == 2)
		href = sectionLinks[1].href;
	else
		href = sectionLinks[0].href;
	// Read the whole page number (not just its first digit); no page parameter means page 1
	var pageOf = function(u){ var m = /page=([0-9]+)/.exec(u); return m ? parseInt(m[1], 10) : 1; };
	var presentPage = pageOf(document.location.href);
	var nextPage = pageOf(href);
	if(nextPage < presentPage)
		alert('No more pages to parse');
	else {
		alert('Changing document location');
		document.location.href = href;
	}
	";
} else {
	echo "
	var recorder = 'true';
	";
}

?>

The GM script scrapes the twitters from a page and posts them to the server using <script> includes. The server stores the twitters in some data store. The server also checks whether the twitter posted was the last twitter on the page. If so, it sends back code to move to the next page.

Thus the script, when installed, will post twitters from the most recent to the oldest.

Ok, now how would this work with other services?

The pattern seems to be:
* Get the data elements from the present page – data elements could be movie details, restaurant details etc.
* Post data elements to the server.
** The posting might require splitting the content if the length is more than the maximum length of the GET request URL.
* Identify how you can move to the next page and when to move to the next page. Use this to hint the server to change to the next page.
* Write the server side logic to store data elements.
* Use the hint from the client to change to the next page when required.
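
The splitting step in the pattern above could look something like the sketch below. `splitForGet` is a hypothetical name and the maximum chunk length is a parameter you would pick yourself, since the real limit depends on the browser and the server. The helper encodes the text first and then cuts it so that no %XX escape is broken in half; the chunks should be joined again, in order, before they are decoded.

```javascript
// Hypothetical helper: encodes text and splits it into pieces that each fit
// within maxChunkLength characters of a GET URL.
function splitForGet(text, maxChunkLength) {
  if (maxChunkLength < 3) {
    throw new Error('maxChunkLength must be at least 3 to hold one %XX escape');
  }
  var encoded = encodeURIComponent(text);
  var chunks = [];
  var start = 0;
  while (start < encoded.length) {
    var end = Math.min(start + maxChunkLength, encoded.length);
    if (end < encoded.length) {
      // Back up so that a %XX escape is never cut in half.
      if (encoded.charAt(end - 1) === '%') {
        end -= 1;
      } else if (encoded.charAt(end - 2) === '%') {
        end -= 2;
      }
    }
    chunks.push(encoded.slice(start, end));
    start = end;
  }
  return chunks;
}

// Example: each chunk would be sent in its own <script> include.
console.log(splitForGet('a b&c', 4));
```

Each chunk would also need a sequence number in the request so that the server can put the pieces back together; that part is left out here.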

The biggest advantage of this method is we make use of the browser to do authentication with the remote service and also to do the parsing of the HTML (which, as I mentioned in my previous post, browsers are best at).

Speed reading by hacking the column count in Firefox

Recently, I came across a Greasemonkey script for Wikipedia. The script helps us to view Wikipedia articles in multiple columns.

I found this to be useful and in fact saw that it improved my reading speed. In the last week, I have referred to a lot of Wikipedia articles, and I am really addicted to this multi-column hack.

So now, when I am reading some article, if the article spans the entire width of the page, I open Firebug, 'Inspect' the element displaying the content under consideration and add:

-moz-column-count: 3;
-moz-column-gap: 50px;
font-family: Calibri;
font-size: 11px;

to the element.

And if I end up visiting this site frequently, then I can add a Greasemonkey script or a Userstyle for the page or set of pages.
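
A Greasemonkey script for this could be as small as the sketch below. It is not the Wikipedia script mentioned earlier; the function names are hypothetical, and both the @include pattern and the element id 'bodyContent' are assumptions about the target site that would need to be adjusted for each page.

```javascript
// ==UserScript==
// @name           Multi-column reader (sketch)
// @include        http://en.wikipedia.org/*
// ==/UserScript==

// Builds the column declarations; the count and gap are parameters.
function multiColumnCss(columnCount, columnGapPx) {
  return '-moz-column-count: ' + columnCount + '; ' +
         '-moz-column-gap: ' + columnGapPx + 'px;';
}

// Appends the declarations to an element's existing inline style.
function applyMultiColumn(element, columnCount, columnGapPx) {
  element.style.cssText += '; ' + multiColumnCss(columnCount, columnGapPx);
}

// Only runs inside a browser page. 'bodyContent' is an assumed element id.
if (typeof document !== 'undefined') {
  var content = document.getElementById('bodyContent');
  if (content) {
    applyMultiColumn(content, 3, 50);
  }
}
```

The font settings are left out of the sketch since they are a matter of taste; they could be appended to the same style string.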

The above screenshot shows a Wikipedia page as displayed in my browser.

So why is this so useful?
Some time back, when I was reading an article on usability, I learnt that reading speed depends on the width of the column. This is one of the reasons why you are able to read news articles faster in newspapers than online. With narrow columns you end up scanning the page vertically rather than making both horizontal and vertical eye movements. Rather than point to a single article, I would like to point you to the Google search for the study around this topic.

Some of the popular pages where I have added this multi-column functionality are: Wikipedia, Developerworks and Javadocs.

Eclifox – bringing Eclipse to the browser

We finally made it! Eclifox is now an alphaWorks technology.

So what is Eclifox?
In order to understand what Eclifox is, look at the screenshot below:


What do you see?

If you think that this is the screenshot of the Eclipse IDE, you are only partially right. Look again. It is Eclipse running in the Firefox browser!

Here is a flash demo of Eclifox. (Run it in full-screen). The demo shows the usage of Python and Ruby plugins from Eclifox.

A bit of history:
About a year and a half back I came up with a thought: how would it be if we were able to provide web-based access to Eclipse functionality? Initially it was not clear how we could achieve this, but the idea seemed promising. So we thought we would give it a try by handing it off to a bunch of interns.

In came a group of 6 students who not only had the passion to complete this, but also the zeal to learn the technology required to make it work.

Hats off to the following interns who made it a reality:

  • Adarsh Ramamurthy
  • Karthik Ananth
  • Mohd Amjed Chand
  • Prasanna V. Pandit
  • Srirang G. Doddihal
  • Vikas Patil

The above interns from SJCE put their heart and soul into this effort and developed the whole thing in less than 4 months. Personally, I enjoyed the 6 months I spent guiding these students. We have stunned a lot of people within IBM with this idea. No one expected an internship project to get so much praise (or even criticism!).

Thanks to Kiran, who provided guidance throughout the course of this project and spent umpteen hours getting this on alphaWorks! Thanks also to the several people who provided support when it was required.

And now about the technologies used:
The basic idea is to include a plug-in in Eclipse that lets us interact with Eclipse to fetch UI definitions and to simulate events on Eclipse. The technologies used are primarily JavaScript (with XMLHttpRequest) on the client side and Jetty as the server embedded in Eclipse. For more information read the alphaWorks page.

So try it out and let us know what you feel!

Screen real estate optimization in Firefox

So here is a collection of hacks that I have made in my Firefox browser to optimize the usage of screen real estate and still have control over my browser:

  • Install the TinyMenu extension to replace your menu with a single menu item.
  • Now that you have space in your main menu-bar, move all your navigation toolbar items to the main menu-bar and then hide the navigation bar. Also hide the toolbar. You can do this by right-clicking on one of the toolbars and unchecking all of them!
  • Right click the Main menu bar, choose Customize. Remove the Home and other icons that you hardly use.
  • Remove the 'Go' button next to the address bar, remove the magnifying glass, remove Back/Forward/Reload/Stop buttons when disabled as shown here.
  • Install the Searchbar Autosizer extension to make your search bar extremely small and then to expand when you type characters in the search box.
  • Auto-hide the tab bar.
  • Use Fuller screen to remove the menu-bar and status bar when not required!
  • Preview your tabs and search for one using Firefox Showcase extension.
  • Clean up your menus using Menu editor extension.
  • Keyboard shortcuts are extremely important to increase your productivity. Reconfigure your shortcuts using Keyconfig extension.
  • Also you can add these new key combinations in Keyconfig to cycle through your tabs:

Move one tab left (Ctrl+Left)

if(gBrowser.mTabContainer.selectedIndex == 0)
	gBrowser.mTabContainer.selectedIndex = gBrowser.mTabContainer.childNodes.length - 1;
else
	gBrowser.mTabContainer.advanceSelectedTab(-1);

Move one tab right (Ctrl+Right)

if(gBrowser.mTabContainer.selectedIndex == gBrowser.mTabContainer.childNodes.length - 1)
	gBrowser.mTabContainer.selectedIndex = 0;
else
	gBrowser.mTabContainer.advanceSelectedTab(1);
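
Both snippets do the same wrap-around arithmetic, which can be written once with a modulo. The helper below is a hypothetical sketch (`wrappedTabIndex` is not part of Firefox or Keyconfig) that only computes the index; it assumes the step is never larger than the number of tabs.

```javascript
// Hypothetical helper: index of the tab to select after moving by `step`
// (-1 for left, +1 for right), wrapping around at either end.
function wrappedTabIndex(currentIndex, tabCount, step) {
  return (currentIndex + step + tabCount) % tabCount;
}

// In a Keyconfig entry the result would be assigned the same way as above:
// gBrowser.mTabContainer.selectedIndex = wrappedTabIndex(
//     gBrowser.mTabContainer.selectedIndex,
//     gBrowser.mTabContainer.childNodes.length, -1);
console.log(wrappedTabIndex(0, 5, -1));
```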

And once you do this, you will definitely want to take a backup of your profile. Use FEBE and you are done!

Have I stopped blogging?

“Have you stopped blogging?”, people ask me. I don't have a definite answer. True, it has been a long time since I blogged. Although I have been quite active online, as is apparent here, I somehow couldn't blog about anything for the last 3 months!

In the 3 months since my last blog entry, I came across quite a lot of things that I found interesting and would normally have blogged about. However, I got into a vicious circle: I felt that no single topic was worth blogging about after such a long gap, which made the gap longer and made it more and more difficult to blog about anything.

Ok, so what's keeping me interested?

  • Lotus Connections, especially the idea of Activities. This has been an eye-opener regarding the way I organize information in my system/s.
  • The emergence of a new web pattern of “server pushing information” to the browser, commonly referred to as Comet.
  • ProjectZero, which is IBM's answer to rapid and 'Zero' obstacle development of Web oriented applications.
  • TiddlyWiki – I wonder why I did not come across this before! It is absolutely fabulous and the idea of a single page self-contained wiki is just too good to believe and sometimes scary. 🙂

Also, of late, I started getting interested in analysis of web activities.

A day of effort, and some hacking of the Firefox history and a tool called RapidMiner helped me get some insights into my browsing habits, which I had never thought about before. I noticed a pattern in the way I come across new topics. Also I learnt about the way I get to certain frequently accessed sites and what I can do to get to certain information quicker than ever before. Finally, I realized that the new del.icio.us Firefox extension has helped me improve my browsing habits and made my bookmarks more valuable.
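
The simplest kind of question one can ask of browsing history is which sites are visited most often. The sketch below is not the RapidMiner analysis described above; it is a hypothetical helper that assumes the history has already been exported as a plain array of URL strings, and the example URLs are made up.

```javascript
// Hypothetical helper: counts visits per host and returns
// [host, count] pairs, busiest host first.
function countVisitsByHost(urls) {
  var counts = Object.create(null); // no inherited keys to collide with
  urls.forEach(function (u) {
    var host;
    try {
      host = new URL(u).hostname;
    } catch (e) {
      return; // skip entries that are not valid absolute URLs
    }
    counts[host] = (counts[host] || 0) + 1;
  });
  return Object.keys(counts)
    .map(function (host) { return [host, counts[host]]; })
    .sort(function (a, b) { return b[1] - a[1]; });
}

// Example with made-up history entries:
console.log(countVisitsByHost([
  'http://en.wikipedia.org/wiki/A',
  'http://en.wikipedia.org/wiki/B',
  'http://del.icio.us/x'
]));
```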

It is really interesting to see what other 'inferences' are possible with the data that is already available! Since data today is openly available in a wide variety of open formats, it is possible to fetch all this data, feed it to some analyzer, get some interesting insights and use them to make your web journey more fruitful. The Flickr Cluster experiments are just the tip of the iceberg!

Some projects/tools related to this are APML, ManyEyes.

Ok, I have written about a wide variety of topics that I am currently finding interesting.

So, finally, back to the question I started off with. Have I stopped blogging? The answer definitely has got to be a 'No'!

Orkut microsummaries

I wrote some simple micro-summary generators for Orkut. Here is what each one does:

Last visitor: If profile visit is enabled, this micro-summary will tell you who visited your profile last.
Karma: This indicates your 3 karma values.
Next birthday: This indicates whose birthday is next.
Fans: This indicates the number of fans you have.

Here are the micro-summary generators.

In order to learn more about microsummaries click here. I will describe the steps below for using my orkut microsummary generators.

1. Install all the generators from the link given above.
2. Go to Orkut home.
3. Click on Bookmarks -> Bookmark this page.
4. In the Name option, click on the drop down menu. You should see the microsummaries in action. Select one and click OK.
5. Repeat steps 3 and 4 for each of the microsummaries.
6. For the karma microsummary, go to your profile page and repeat steps 3 and 4.

(Please note: Some values have been deliberately removed.)

The fun part is that it is live, which means any changes are almost instantaneously reflected.

Firefox 2 experiments

Previous related entries:
Firefox extensions – my picks
Firefox extensions – my picks II (a web developer's heaven)
Microsummaries – a new feature in Firefox 2

Firefox 2 was released recently. This week I got a chance to dabble with it. Integrated spell-check is a new feature that I liked.

I cannot stop talking about Microsummaries. So I continued my experiments and I found some useful extensions.

Here they are:
Microsummary Generator Builder
XPath Checker

Also for Web developers, here are some more useful extensions (other than the ones that I have mentioned in my previous blogs)

UrlParams
Execute JS

Also I found these extensions very useful:
All-in-One Sidebar

So here's how it looks now:

Flock – a Web 2.0 browser?

I have been trying Flock for about a month now. And I am sticking with it.

What I liked:

  1. Blogging support – I am making this blog entry from within Flock. Also there is Technorati publishing support and all that.
  2. Flickr support – you get to know if someone adds new photos and you get to see them in a neat view.
  3. Delicious support – one of my favorite features here. There is a neat sync between your local bookmarks and your delicious bookmarks. You just click on the 'star' next to the address bar and you get a popup where you indicate whether the bookmark should be posted to del.icio.us.
  4. An improved search bar – there is live Yahoo search, local search history and the usual search engine support.
  5. Performance – somehow seems better than Firefox. Dunno why? 😐 (However, see hate point 2)
  6. Web snippets – I don't use this much, but there is a snippets bar, where you can copy snippets of your interest.
  7. News – This is where you get to manage your RSS feeds. But I don't use this either, not better than Blogbridge. 🙂

What I hated:

  1. I sometimes feel they should have built an extension on top of Firefox rather than a separate browser. Some extensions might not work in Flock, so developers have to support Flock separately. This is not good.
  2. Sometimes a backend process runs for a long time and results in an 'Unresponsive script' warning. This freezes the browser for a while.

Overall, I strongly recommend Flock for people who use the utilities mentioned and have been craving integration of these.

Blogged with Flock