YQL – Yahoo’s query language for the web

This post is a part of the AfterThoughts series of posts.

Post: A query language for searching websites
Originally posted on: 2005-01-27

I blogged about the idea of a query language for websites back in 2005. Today, when I was doing my feed sweep, I came across YQL, a query language with SQL-like syntax from Yahoo that allows you to query for structured data from various Yahoo services.

One thing I found interesting is the ability to query ‘any’ HTML page for data at a specific XPath. There are some details on the official Yahoo Developer blog.

The intent of YQL is not the same as what I had blogged about. While YQL allows you to get data from a specific page, what I had intended was something more generic – an ability for you to query a set of pages or the whole of the web for specific data, which is a tougher problem to solve.

To fetch specific data from an HTML page using YQL, all you have to do is:
1. Go to the page that you want to extract data from.
2. Open up Firebug and point to the data that you want to extract (using Inspect).
3. Right click the node in Firebug and click on ‘Copy XPath’.
4. Now create a query in YQL like this:
select * from html where url="" and xpath=""
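
As a rough sketch of step 4, this is how the query string could be put together in Python from a page URL and the XPath copied out of Firebug. The function names are my own, and the quoting rule (use double quotes unless the value already contains one, then fall back to single quotes) is an assumption about what YQL accepts, not something taken from the YQL documentation. It only builds the statement; it does not call any Yahoo service.

```python
def quote_literal(value: str) -> str:
    """Wrap a value in whichever quote character it does not already contain."""
    if '"' not in value:
        return f'"{value}"'
    if "'" not in value:
        return f"'{value}'"
    # Assumption: no escaping rules are attempted here, so give up loudly.
    raise ValueError("value contains both single and double quotes")


def build_yql_query(page_url: str, xpath: str) -> str:
    """Return a YQL statement selecting the nodes at `xpath` on `page_url`."""
    return (
        "select * from html "
        f"where url={quote_literal(page_url)} and xpath={quote_literal(xpath)}"
    )


if __name__ == "__main__":
    # Hypothetical page and XPath, for illustration only.
    print(build_yql_query("http://example.com/", "//table/tr/td"))
```

XPaths copied from a browser often contain double quotes (for example around an id value), which is why the helper switches to single quotes in that case.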

Although the idea seems promising, I wasn’t able to get it to work for most XPaths.

I guess the reason is the difference between the way the browser interprets the HTML and the way a server would interpret it. For example, if there is no ‘tbody’ tag in your table, Firefox inserts one, and it shows up in the XPath you copy, while a server that interprets the HTML after tidying it wouldn’t see one. One way to solve this is to have the same engine interpret the XPath on the server side as well, or to be as lenient as possible when matching XPaths. I had similar discussions with the research team in IRL when I was working on my idea of MySearch, which had similar issues, and there were some interesting solutions that we discussed.
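
To make the "be lenient" idea concrete, here is a minimal sketch in Python. It takes an XPath copied from the browser and produces a second candidate with the ‘tbody’ steps removed, so a caller could try both. This is my own illustration of one possible workaround, not how YQL or any of the solutions we discussed actually work, and `candidate_xpaths` is a hypothetical helper name.

```python
import re

# Matches a "tbody" location step, with an optional positional index like [1],
# but only when it is a whole step (followed by "/" or the end of the path).
_TBODY_STEP = re.compile(r"/tbody(?:\[\d+\])?(?=/|$)")


def candidate_xpaths(browser_xpath: str) -> list[str]:
    """Return the original XPath plus a variant without tbody steps."""
    candidates = [browser_xpath]
    stripped = _TBODY_STEP.sub("", browser_xpath)
    # Only add the variant if stripping changed something and left a path behind.
    if stripped and stripped != browser_xpath:
        candidates.append(stripped)
    return candidates


if __name__ == "__main__":
    for path in candidate_xpaths("/html/body/table/tbody/tr[2]/td[1]"):
        print(path)
```

Trying the original path first keeps things working for pages that really do contain a ‘tbody’ tag in their source.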

I would say it is only a matter of time before someone cracks the problem of extracting structured data from the semi-structured data on the web and making it available to other services. Tools like Dapper, Yahoo Pipes, YubNub and YQL are just the beginning.

I have made several attempts at this, from using one of these tools to building my own using Rhino, Jaxer, etc., but so far the solution I am most content with is a combination of curl, grep, awk and sed.

Speed reading by hacking the column count in Firefox

Recently, I came across a Greasemonkey script for Wikipedia. The script helps us to view Wikipedia articles in multiple columns.

I found this useful and, in fact, saw that it improved my reading speed. In the past week I have referred to a lot of Wikipedia articles, and I am really addicted to this multi-column hack.

So now, when I am reading an article that spans the entire width of the page, I open Firebug, 'Inspect' the element that holds the content and add:

-moz-column-count: 3;
-moz-column-gap: 50px;
font-family: Calibri;
font-size: 11px;

to the element.

And if I end up visiting this site frequently, then I can add a Greasemonkey script or a Userstyle for the page or set of pages.
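
For the userstyle route, here is a small Python sketch that generates the CSS for a given site. It wraps the column properties from above in an `@-moz-document domain(...)` rule, which is the form Firefox userstyles use to target a site. The function name and the `#bodyContent` selector in the usage example are assumptions for illustration; you would use whatever selector Firebug shows for the content element on your page. The font lines are left out to keep it short.

```python
def build_userstyle(domain: str, selector: str,
                    columns: int = 3, gap_px: int = 50) -> str:
    """Return userstyle CSS that lays out `selector` in columns on `domain`."""
    declarations = [
        f"-moz-column-count: {columns};",
        f"-moz-column-gap: {gap_px}px;",
    ]
    body = "\n".join(f"    {line}" for line in declarations)
    return (
        f'@-moz-document domain("{domain}") {{\n'
        f"  {selector} {{\n"
        f"{body}\n"
        f"  }}\n"
        f"}}\n"
    )


if __name__ == "__main__":
    # Hypothetical selector; check the real one with Firebug's Inspect.
    print(build_userstyle("en.wikipedia.org", "#bodyContent"))
```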

The above screenshot shows a Wikipedia page as displayed in my browser.

So why is this so useful?
Some time back, when I was reading an article on usability, I learnt that reading speed depends on the width of the column. This is one of the reasons why you are able to read news articles faster in newspapers than online. With narrow columns, your eyes end up scanning the page vertically instead of making both horizontal and vertical movements. Rather than point to a single article, I would like to point you to the Google search for the study around this topic.

Some of the popular sites where I have added this multi-column functionality are: Wikipedia, developerWorks and Javadocs.