YQL – Yahoo’s query language for the web

This post is a part of the AfterThoughts series of posts.

Post: A query language for searching websites
Originally posted on: 2005-01-27

I blogged about the idea of a query language for websites back in 2005. Today, when I was doing my feed sweep, I came across YQL, a query language with SQL-like syntax from Yahoo that allows you to query for structured data from various Yahoo services.

One thing that I found interesting is the ability to query ‘any’ HTML page for data at a specific XPath. There are some details in the official Yahoo Developer blog.

The intent of YQL is not the same as what I had blogged about. While YQL allows you to get data from a specific page, what I had intended was something more generic – an ability for you to query a set of pages or the whole of the web for specific data, which is a tougher problem to solve.

In order to fetch specific data from an HTML page using YQL, all you have to do is:
1. Go to the page that you want to extract data from.
2. Open up Firebug and point to the data that you want to extract (using Inspect).
3. Right click the node in Firebug and click on ‘Copy XPath’.
4. Now create a query in YQL like this:
select * from html where url=”” and xpath=”
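
The query in step 4 can also be assembled programmatically. This is only a minimal sketch in Python: the function name, the example page and the example XPath are hypothetical, and it assumes the statement takes the form shown above, with both values wrapped in double quotes.

```python
def build_yql_html_query(page_url: str, xpath: str) -> str:
    """Build a YQL statement that selects the nodes at `xpath` on `page_url`."""
    for value in (page_url, xpath):
        # Values containing double quotes would need escaping; this sketch
        # simply refuses them instead of guessing at escaping rules.
        if '"' in value:
            raise ValueError("double quotes are not handled in this sketch")
    return f'select * from html where url="{page_url}" and xpath="{xpath}"'


# Hypothetical page and XPath, for illustration only.
query = build_yql_html_query("http://example.com/", "//table//tr/td")
print(query)
```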

Although the idea seems promising, I wasn’t able to get it to work for most XPaths.

I guess the reason is the difference between the way the browser interprets the HTML and the way a server would interpret it. For example, if there is no ‘tbody’ tag in your table, the Firefox browser inserts a ‘tbody’ tag and that would be present in your XPath, while a server that interprets the HTML after Tidying it wouldn’t see one. One way we can solve this is to have the same engine interpret the XPath on the server side as well or be as lenient as possible when matching the XPaths. I had similar discussions with the research team in IRL when I was working on my idea of MySearch, which had similar issues, and there were some interesting solutions that we discussed.
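
One way to be more lenient is to strip the browser-inserted ‘tbody’ steps from the copied XPath before sending it to the server. The following is a rough Python sketch of that idea, not what YQL or Firebug actually do; the function name and the sample XPath are made up for illustration.

```python
import re


def relax_xpath(xpath: str) -> str:
    """Drop 'tbody' steps that a browser may have inserted into an XPath."""
    # Match a '/tbody' step, with an optional positional predicate such as
    # [1], but only when it is a whole step (followed by '/' or the end).
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


# Hypothetical XPath as a browser tool might copy it.
browser_xpath = "/html/body/table/tbody/tr[2]/td[1]"
print(relax_xpath(browser_xpath))
```

If the page source really does contain a ‘tbody’ tag, the relaxed path will miss it, so a more careful version would try both the original and the relaxed XPath.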

I would say it is only a matter of time before someone cracks the problem of extracting structured data from the semi-structured data on the web and makes it available to other services. Tools like Dapper, Yahoo Pipes, YubNub and YQL are just the beginning.

I have made several attempts at this, from using one of these tools to building my own using Rhino, Jaxer etc., but so far the solution I am most content with is a combination of curl, grep, awk and sed.

Google and innovation – take 2

About a couple of years back, I wrote about how Google had a tight integration between its various services, and how Yahoo lacked it.

However, when I made that entry, Google had very few services and Yahoo had lots of them. In fact, Google was primarily a search company, and Gmail and Calendar were just new arrivals on the scene.

However, now that Google has already become Yahoo 2.0, it's time to look at Google's offerings again and see how they have fared.

The first impression is that Google has done tremendously well. Although they have acquired several companies in the last couple of years, they have been very quick in integrating these applications with their portfolio. Orkut, Gmail/Gtalk integration, Gmail/Google docs integration, Google Mashup Editor are some examples.

However, on second thoughts, it looks like there is a lot still to be done.

What kind of integration can we expect?

  • You bookmark resources in various services
    • Starring entries in Google Reader
    • Starring posts in Google Groups
    • Starring Google search results
    • Noting down items or clipping entries in Google Notebook
    • Indicating your favorite books in Google Books.
Why is there no single 'Google bookmarks' service?
  • Social network everywhere
Mail and IM are inherently social applications. However with the new Facebook revolution, a social network revolves around everything we do over the web. Google already has its own social network. How well is this integrated with its various services? More on Google Reader social network integration.
  • Presence awareness everywhere
A related expectation is presence awareness in various Google services. Gmail has a tight integration with Gtalk. Why is a similar presence awareness not available in Google Reader, Google Docs etc.?
  • Uniform look and feel
Google has been doing very well here. However, some work is still required on sites like Orkut and YouTube.
  • Attention profiling
Google has a log of your search history. However, I guess it would be interesting to integrate this data with your mail interactions, your Google Reader trends, your group activities, your interactions on Orkut, the sites you browse etc.

It is a complex problem to solve. You don't know what the various interaction points between the various services are or what the various dimensions are for these applications. However, we learn some of these over time. For example, it looks like social networks and attention profiling are here to stay. So if you are building an application, ensure that it is integrated well with some social network and also takes into account the attention information of a user.

Search results and relevancy

Search engines suggest alternative keywords when you mistype keywords.

I was looking for a Wikipedia article on Liskov substitution principle. I came across this when I was reading about Design By Contract elsewhere and the article had 'mistyped' the phrase as Lyskov substitution principle.

I first entered it in my Firefox Wikipedia search engine plugin and got no results. My next target was Google and this is what I got:

Not knowing that I had mistyped the phrase, I did not click on the suggestion. I was in fact surprised that Wikipedia does not have an article on this!

Then I searched in Yahoo and this is what I got:

Wow! I had indeed mistyped the phrase and Yahoo turned out to be intelligent in guessing what I was interested in.

Google's approach is like: 'I guess you have made a mistake, but I am not sure, here is the result for what you typed. However, I think you are looking for this.' Yahoo's approach is: 'I guess you have made a mistake and this is what I think you are looking for, if you are interested in search results for only what you typed, click here.'
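
The two approaches can be contrasted in a small Python sketch. This is purely illustrative and says nothing about how either search engine is actually implemented: the toy index, the function names and the use of difflib for the spelling suggestion are all my own assumptions.

```python
from difflib import get_close_matches

# Hypothetical toy index: known phrase -> search results.
INDEX = {
    "liskov substitution principle": ["Liskov substitution principle (Wikipedia)"],
    "design by contract": ["Design by contract (Wikipedia)"],
}


def suggest(query):
    """Return the closest known phrase, or None if no correction applies."""
    if query in INDEX:
        return None
    matches = get_close_matches(query, list(INDEX), n=1)
    return matches[0] if matches else None


def search_show_typed(query):
    """First approach: results for what was typed, correction as a link."""
    return {"results": INDEX.get(query, []), "did_you_mean": suggest(query)}


def search_show_corrected(query):
    """Second approach: results for the correction, original as a link."""
    corrected = suggest(query)
    if corrected is None:
        return {"results": INDEX.get(query, []), "search_instead_for": None}
    return {"results": INDEX[corrected], "search_instead_for": query}


print(search_show_typed("lyskov substitution principle"))
print(search_show_corrected("lyskov substitution principle"))
```

With the first function the user has to follow the suggestion to see anything useful; with the second the useful results arrive on the first page, which is the page load and click that I mention saving.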

I am not sure which approach is better, but I definitely like Yahoo's approach because it saves me a page load and a click.

Google, Yahoo! and innovation

Google recently released “Google Calendar”. Time and again, Google reminds me of Jeremy Zawodny's blog post, Google is building Yahoo 2.0: Google trying to rebuild what Yahoo and others have built, but with one killer feature that makes it irresistible.

Ok, if you search for comparisons of the Yahoo and Google services, you are bound to get thousands of entries. I don't want to do the same here. But there are some things that I would like to highlight from my own personal experience.

I have tried out a lot of the Yahoo services. The same is the case with Google. Although Yahoo has a lot of features, the innovation seems to have stopped. The mail, address book, calendar and notes services are still in the pre-Web 2.0 phase. (Yeah, they have been promising a new look and feel, but where is it??? I am waiting.) Google, on the other hand, started off in the early Web 2.0 phase and has kept adding one product or another to its portfolio, not to mention adding small features to existing products.

Another striking difference has been the kind of integration that exists between the services. Yahoo started with lots of services. Each service was offered individually, with little regard for what the other services offered and how the two could be related. For example, Yahoo's calendar service seems disconnected from Mail. Then there is a briefcase service to store files and attachments. The chat service is different; there are different kinds of searches. There are different kinds of bookmarking services. The list goes on and on.

Contrast this with Google. Google started off providing services one after the other, carefully keeping them tightly integrated. (Is this slow poison? 🙂 Get users to use one service and lure them into the rest?) Google seems to be building a 'single page interface'. “For all your requirements on the web use Google.”, that's what they seem to say. You can use the calendar from the mail interface, your chat logs are in your mail. You have ample space to store all your mail (you don't need a briefcase), the search is always there no matter where you are, search something, if you find it interesting save it, label it and search for it later.

This does not mean Google has done it all right. There is a lot still left to be done. The ultimate aim seems to be – get me all my information on demand – get me the information, wherever I want it, whenever I want it, get me only the information I want, and all the information I want, instantly.

All this translates to: A great expectation from Yahoo's new service. Do they have this kind of service integration? Or is it just old things in new clothing? I am waiting.

Service integration using YubNub

I have never been so excited by 'simple' ideas since the time I saw RSS way back in Aug, 2004.

When I first saw YubNub, I knew that this idea was here to stay. But it was still in its early days then, and so was not usable for much beyond what the Yahoo open shortcuts that I blogged about here can do.

Today, I happened to revisit that site again. Whew! What a wonder! It has pipes, multiple parameters, string utilities, conditional constructs etc. So a mere combination of commands and you can be working wonders!

To give you an example, suppose you have a server: you can host a set of JSPs (or any other dynamic pages) that take parameterized input, process it and pipe it between sites. This is how you can create personal agents that extract info from one place, automatically blog about it, add sites to bookmarks… the possibilities are unlimited.

Another advantage of this is that you just have to remember commands to get things done. No need for URLs, or even short URLs for that matter.

An example is:

garfield -year {rand -min 1979 -max 2005} -month {rand -min 1 -max 12}

will show you the Garfield comic for a randomly chosen month. (This is a command I created to check the site out.)

Try running it here.
(In order to be convinced by the power of the command, try running this command 3-4 times and see the results).
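
To make the nesting concrete, here is a rough Python sketch of how the inner {rand …} parts could be expanded before the outer command runs. It is not YubNub's own code; the function name is hypothetical, and I am assuming that both -min and -max are inclusive.

```python
import random
import re

# Matches an inner command of the form {rand -min A -max B}.
RAND_PATTERN = re.compile(r"\{rand -min (\d+) -max (\d+)\}")


def expand_rand(command, rng=random):
    """Replace each {rand -min A -max B} with a random integer in [A, B]."""
    def pick(match):
        low, high = int(match.group(1)), int(match.group(2))
        # Assumption: both bounds are inclusive.
        return str(rng.randint(low, high))

    return RAND_PATTERN.sub(pick, command)


command = "garfield -year {rand -min 1979 -max 2005} -month {rand -min 1 -max 12}"
print(expand_rand(command))
```

Each run prints something of the form garfield -year YYYY -month M with different numbers, which is why running the command several times gives a different comic each time.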

And if you have a Googlenym, then you can use YubNub to publish your site/page.

Ex: This is mine:

gfl threepointsomething

Try running it here.

An interesting observation is the movement from the GUI back to the command line way of working. The sheer expressive power of the command line is unmatched by the GUI, and that is what is making this click.

And if you are interested there are a host of utilities like Konfabulator widgets, FF integration, FF extension etc that you can use… and enjoy!

And if you are not impressed, it is nothing to do with YubNub; perhaps my explanation was not good and you should go and check it out yourself. 🙂

And these are the YubNub commands that I wrote:

garfield
diggspy

Want to experiment more with this.

Yahoo open shortcuts

This is a superb gift from Yahoo for the new year. Yahoo Open Shortcuts allows you to use the Yahoo search box like a command line. Although this is quite similar to the YubNub tool, I found it very interesting.

In fact, with a small tweak, you can use this from the Firefox address bar.

Here's how you do it:

1. Create a bookmark (Bookmarks->Manage Bookmarks… and then New Bookmark…) with the following entries:

Name: Yahoo
Location: http://search.yahoo.com/search?p=%s&ei=UTF-8&fr=sfp&fl=0&x=wrt
Keyword: yahoo

2. Now enter the following in your address bar:

yahoo !set ljupdate http://www.livejournal.com/update.bml

and press enter.

3. You should get the following message:

Open Shortcuts allows you to use custom keywords to directly search or jumpstart a task on any site from the convenience of a Yahoo! Search box. Please confirm that you would like to add the following Open Shortcut:

* !ljupdate http://www.livejournal.com/update.bml

Press “OK”.

4. Next time you want to update your lj, you just enter:

yahoo !ljupdate

and voila!
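
What the keyword bookmark does with the typed text can be sketched in Python as follows. This only approximates the browser's behaviour: the function name is hypothetical, and the exact percent-encoding that Firefox applies to the text substituted for %s is an assumption here.

```python
from urllib.parse import quote

# Keyword -> bookmark location, as entered in step 1.
BOOKMARKS = {
    "yahoo": "http://search.yahoo.com/search?p=%s&ei=UTF-8&fr=sfp&fl=0&x=wrt",
}


def expand_keyword(typed):
    """Turn 'keyword rest of text' into the URL that the bookmark opens."""
    keyword, _, rest = typed.partition(" ")
    template = BOOKMARKS.get(keyword)
    if template is None:
        return None  # Not a known keyword; the browser would handle it otherwise.
    # Assumption: the text after the keyword is percent-encoded before
    # it replaces %s in the bookmark location.
    return template.replace("%s", quote(rest, safe=""))


print(expand_keyword("yahoo !ljupdate"))
```

So everything after the keyword ends up as the p= search parameter, and Yahoo's search box then treats !ljupdate as the shortcut.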

This is a simple example I have given. You can try out many things similar to this.

For example, I have created shortcuts to open my bookmark archive and to search it. In fact, Yahoo has provided shortcuts to open a particular Wikipedia article, compose a mail using Yahoo Mail etc.

Some problems though:

* It does not work with AJAX interfaces or even with POST URLs.
* You cannot define more than one parameter.

Anyways, save time in the new year. 🙂