YQL – Yahoo’s query language for the web

This post is a part of the AfterThoughts series of posts.

Post: A query language for searching websites
Originally posted on: 2005-01-27

I blogged about the idea of a query language for websites back in 2005. Today, while doing my feed sweep, I came across YQL, a SQL-like query language from Yahoo that lets you query various Yahoo services for structured data.

One thing I found particularly interesting is the ability to query ‘any’ HTML page for data at a specific XPath. There are some details on the official Yahoo Developer blog.

The intent of YQL is not the same as what I had blogged about. While YQL allows you to get data from a specific page, what I had intended was something more generic – an ability for you to query a set of pages or the whole of the web for specific data, which is a tougher problem to solve.

In order to fetch specific data from an HTML page using YQL, all you have to do is:
1. Go to the page that you want to extract data from.
2. Open up Firebug and point to the data that you want to extract (using Inspect).
3. Right-click the node in Firebug and click on ‘Copy XPath’.
4. Now create a query in YQL like this:
select * from html where url="<page URL>" and xpath="<copied XPath>"
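YQL queries like the one above were sent to Yahoo's public REST endpoint as a URL parameter. The sketch below shows how such a request URL could be assembled; the endpoint and parameter names are recalled from YQL's documentation of that era, so treat them as illustrative rather than authoritative.

```javascript
// Build the request URL for YQL's public REST endpoint.
// Endpoint and parameter names are from memory of the YQL docs;
// a sketch for illustration, not a verified client.
function buildYqlUrl(pageUrl, xpath) {
  var query = 'select * from html where url="' + pageUrl + '"' +
              ' and xpath="' + xpath + '"';
  return 'http://query.yahooapis.com/v1/public/yql' +
         '?q=' + encodeURIComponent(query) +
         '&format=json';
}

console.log(buildYqlUrl('http://example.com/', '//h1'));
```

The whole query is percent-encoded into the `q` parameter, and `format=json` asks for a JSON response instead of XML.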

Although the idea seems promising, I wasn’t able to get it to work for most XPaths.

I guess the reason is the difference between how the browser interprets the HTML and how a server interprets it. For example, if there is no ‘tbody’ tag in your table, Firefox inserts one, and it will be present in the XPath you copy, while a server that interprets the HTML after tidying it won’t see one. One way to solve this is to have the same engine interpret the XPath on the server side as well, or to be as lenient as possible when matching XPaths. I had similar discussions with the research team at IRL when I was working on my idea of MySearch, which had similar issues, and we discussed some interesting solutions.
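One lenient-matching idea is to treat the browser-inserted ‘tbody’ as optional by stripping it from the XPath before matching on the server side. This is a minimal sketch of that idea, not how YQL actually normalizes paths:

```javascript
// Strip browser-inserted "tbody" steps (with or without a position
// predicate) from an XPath, so a path copied from Firebug can match
// a server-side parse of the same table.
function relaxXPath(xpath) {
  return xpath.replace(/\/tbody(\[\d+\])?(?=\/|$)/g, '');
}

console.log(relaxXPath('/html/body/table/tbody/tr[1]/td[2]'));
// -> /html/body/table/tr[1]/td[2]
```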

I would say it is only a matter of time before someone cracks the problem of extracting structured data from the semi-structured data on the web and makes it available to other services. Tools like Dapper, Yahoo Pipes, YubNub and YQL are just the beginning.

I have made several attempts at this, from using one of these tools to building my own using Rhino, Jaxer etc., but so far the most satisfactory solution has been a combination of curl, grep, awk and sed.

HTML parsing and Rhino

About a year back I was working on a personal project at IBM: a clone of YubNub for the IBM intranet.

For those of you who don’t know YubNub, it is a simple but powerful tool that lets you define keywords to reach pages. One popular example is gim, which takes you to the Google Image Search results page for the keywords you enter.

When I built this YubNub clone, I had plans to introduce the ability to define commands that get data from specific portions of a page. For example, you would be able to fetch a person’s telephone number using a command like telephone <name>. This works by scraping the person’s profile page, which contains the telephone number in a specific section.

But wouldn’t it be cool to give the user the flexibility to define what to fetch from a page on the intranet? You could ask the user to specify what content to fetch from a page when creating the command.

Look at the YubNub create-command interface. The basic information asked for on the page is:

  • Name of the command
  • URL
  • Description

Now imagine an extra text field that asks you for the XPath to the content you want to scrape from the resulting page.

In simple words, you are saying: fetch this page, get this specific portion of it, and give me only that content. You could then pipe that content to some other command or play with it in umpteen ways. I haven’t followed YubNub of late, but I am sure there are many commands there with similar functionality.
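A command definition with that extra XPath field could be expanded like this. The field names, the `%s` placeholder convention, and the intranet URL are all assumptions for illustration, not YubNub’s actual schema:

```javascript
// Hypothetical command registry: each command maps to a URL template
// plus the XPath of the content to scrape from the fetched page.
var commands = {
  telephone: {
    url: 'http://intranet.example.com/profiles/%s', // hypothetical URL
    xpath: '//div[@id="contact"]/span[@class="phone"]', // hypothetical XPath
    description: "Fetch a person's telephone number"
  }
};

// Expand e.g. "telephone jsmith" into the URL + XPath pair the
// server would fetch and scrape.
function expand(commandLine) {
  var parts = commandLine.split(/\s+/);
  var cmd = commands[parts[0]];
  if (!cmd) throw new Error('unknown command: ' + parts[0]);
  return {
    url: cmd.url.replace('%s', encodeURIComponent(parts[1] || '')),
    xpath: cmd.xpath
  };
}

console.log(expand('telephone jsmith').url);
```

The server would then fetch the URL, evaluate the XPath against the page, and return only that fragment to the user (or to the next command in a pipe).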

Now, although this is possible in principle, there was one major issue I faced. The server has to fetch the page and then scrape it. And although there are very good XML parsers out there, there is no good ‘XML’ parser for HTML, and XPath does not work unless the page is well-formed XML.

Most pages on the Internet are HTML (or XHTML), and although transforming them to XML sounds straightforward, anyone who has tried it knows it is not. When you try to parse an XHTML page (even popular pages out there) you will run into issues like ‘entity not defined’ or ‘matching element not found’. And although there are tools like Tidy or TagSoup, you are not guaranteed that their output is well-formed XML.

On the other hand, browsers are extremely flexible in how they handle HTML. Traversing the HTML DOM is really simple, and many a time you don’t even realize that your browser has silently corrected dozens of errors in the page. You can get to any specific portion of the page using HTML DOM functions or libraries like jQuery.

So what I was looking for was a tool with the flexibility of the browser’s HTML handling that could nevertheless run on the server.

As if by coincidence, I ran into a post from John Resig (of jQuery fame). John describes one of his projects, which brings the browser environment to Rhino, and gives an example of scraping content from a web page and writing the result to a file.

Wow! This is exactly what I had been looking for. Since Rhino can be embedded in Java, all you need to do is call the JS function that scrapes the content, pass the content back to Java, and continue with your processing.

Although I no longer work on that project, I see a need for this functionality in many other places. For example, just some time back I was looking for a simple tool to fetch tiddlers from TiddlyWiki and convert them into a plain HTML page, to support browsers that don’t have Javascript enabled. I tried some of the tools out there, but most of them failed. So I planned to write my own, and lo, I ran into the same issue: TiddlyWiki content is HTML that is not easy to parse with XML parsers (which is perhaps why many of those tools failed). So how about using Rhino and John’s project to scrape content from the wiki and write it to a file in a different format?

The project looks very promising. I should follow it closely.

Service integration using YubNub

I have not been so excited by a ‘simple’ idea since I first saw RSS back in August 2004.

When I first saw YubNub, I knew the idea was here to stay. But it was in its early days then and not really usable beyond something like the Yahoo Open Shortcuts that I blogged about here.

Today I happened to revisit the site. Whew! What a wonder! It now has pipes, multiple parameters, string utilities, conditional constructs etc. With a mere combination of commands you can work wonders!

To give you an example: if you have a server, you can host a set of JSPs (or any other dynamic pages) that take parameterized input, process it, and pipe it between sites. This is how you can create personal agents that extract info from one place, automatically blog about it, add sites to bookmarks… the possibilities are unlimited.

Another advantage is that you only have to remember commands to get things done. No need for URLs, or even short URLs for that matter.

An example is:

garfield -year {rand -min 1979 -max 2005} -month {rand -min 1 -max 12}

will show you the Garfield comic for a random month and year. (This is a command I created to try the site out.)

Try running it here.
(To be convinced of the power of the command, run it 3-4 times and watch the results change.)
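The nested `{rand -min A -max B}` pieces are substituted with a random integer in [A, B] before the outer command runs. This sketch mimics that observed behavior; it is not YubNub’s actual implementation:

```javascript
// Replace each {rand -min A -max B} in a command line with a random
// integer between A and B inclusive, as YubNub appears to do before
// executing the outer command.
function substituteRand(commandLine) {
  return commandLine.replace(/\{rand -min (\d+) -max (\d+)\}/g,
    function (_, min, max) {
      var lo = parseInt(min, 10), hi = parseInt(max, 10);
      return String(lo + Math.floor(Math.random() * (hi - lo + 1)));
    });
}

console.log(substituteRand(
  'garfield -year {rand -min 1979 -max 2005} -month {rand -min 1 -max 12}'));
```

Each run produces a different concrete command line, which is why running the garfield command a few times shows different comics.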

And if you have a Googlenym, then you can use YubNub to publish your site/page.

Ex: This is mine:

gfl threepointsomething

Try running it here.

An interesting observation is the movement from the GUI back to the command-line way of working. The sheer expressive power of the command line is unmatched by any GUI, and that is what makes this click.

And if you are interested, there is a host of utilities (Konfabulator widgets, Firefox integration, a Firefox extension etc.) that you can use… and enjoy!

And if you are not impressed, it has nothing to do with YubNub; perhaps my explanation was not good, and you should go check it out yourself. 🙂

And these are the YubNub commands that I wrote:

garfield
diggspy

I want to experiment more with this.