YQL – Yahoo’s query language for the web

This post is a part of the AfterThoughts series of posts.

Post: A query language for searching websites
Originally posted on: 2005-01-27

I blogged about the idea of a query language for websites back in 2005. Today, when I was doing my feed sweep, I came across YQL, a query language with SQL-like syntax from Yahoo that allows you to query for structured data from various Yahoo services.

One thing I found interesting is the ability to query ‘any’ HTML page for data at a specific XPath. There are some details in the official Yahoo Developer blog.

The intent of YQL is not the same as what I had blogged about. While YQL allows you to get data from a specific page, what I had intended was something more generic – an ability for you to query a set of pages or the whole of the web for specific data, which is a tougher problem to solve.

In order to fetch specific data from an HTML page using YQL, all you have to do is:
1. Go to the page that you want to extract data from.
2. Open up Firebug and point to the data that you want to extract (using Inspect).
3. Right click the node in Firebug and click on ‘Copy XPath’.
4. Now create a query in YQL like this:
select * from html where url="" and xpath=""
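The statement in step 4 can also be put together programmatically. The following is a minimal sketch that only builds the query string in the shape shown above; it does not call any Yahoo endpoint. The function name `build_yql_html_query`, the example URL and the example XPath are my own placeholders, and rejecting values that contain double quotes is a simplification, not a documented YQL rule.

```python
def build_yql_html_query(url: str, xpath: str) -> str:
    """Build a 'select * from html' statement for one page and one XPath.

    Hypothetical helper: it only formats the string and sends nothing.
    """
    for name, value in (("url", url), ("xpath", xpath)):
        # Simplification (assumption): refuse values that would need escaping
        # instead of guessing at YQL's quoting rules.
        if '"' in value:
            raise ValueError(f"{name} must not contain double quotes")
    return f'select * from html where url="{url}" and xpath="{xpath}"'


# Placeholder inputs: the page from step 1 and the XPath copied in step 3.
query = build_yql_html_query("http://example.com/", "/html/body/table/tr/td")
print(query)
```

The XPaths that Firebug copies are positional (for example `/html/body/div[2]/table/tr/td`), so they normally contain no quotes and pass the check.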

Although the idea seems promising, I wasn’t able to get it to work for most XPaths.

I guess the reason is the difference between how the browser interprets the HTML and how a server interprets it. For example, if a table has no ‘tbody’ tag, Firefox inserts one, so it shows up in the XPath you copy, while a server that parses the HTML after tidying it doesn’t see one. One way to solve this is to have the same engine interpret the XPath on the server side as well, or to be as lenient as possible when matching XPaths. I had similar discussions with the research team in IRL when I was working on my idea of MySearch, which had similar issues, and there were some interesting solutions that we discussed.
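The ‘tbody’ mismatch and the "be lenient" idea can be illustrated with a small sketch. It assumes well-formed markup and uses Python's standard `xml.etree.ElementTree` as a stand-in for the server-side parser; this is not how YQL works internally, and `strip_tbody` and `find_text` are hypothetical helpers of mine.

```python
import re
import xml.etree.ElementTree as ET

# Markup as a server-side parser might see it: no <tbody> in the table.
SERVER_MARKUP = "<html><body><table><tr><td>42</td></tr></table></body></html>"

# XPath as copied from the browser, whose DOM contains an inserted <tbody>.
BROWSER_XPATH = "/html/body/table/tbody/tr/td"


def strip_tbody(xpath):
    """Lenient matching: drop every tbody step (with an optional [n] index)."""
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


def find_text(markup, absolute_xpath):
    """Return the text of the first node at an absolute path, or None."""
    root = ET.fromstring(markup)
    steps = absolute_xpath.strip("/").split("/")
    if steps[0] != root.tag:
        return None
    if len(steps) == 1:
        return root.text
    # ElementTree paths are relative to the root element, so drop the first step.
    node = root.find("./" + "/".join(steps[1:]))
    return None if node is None else node.text


print(find_text(SERVER_MARKUP, BROWSER_XPATH))               # None
print(find_text(SERVER_MARKUP, strip_tbody(BROWSER_XPATH)))  # 42
```

Stripping ‘tbody’ only covers this one difference. Other browser-side repairs to the markup would need their own rules, which is why sharing the same engine on both sides is the more robust option.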

I would say it is only a matter of time before someone cracks the problem of extracting structured data from the semi-structured data on the web and making it available to other services. Tools like Dapper, Yahoo Pipes, YubNub and YQL are just the beginning.

I have made several attempts at this, from using one of these tools to building my own using Rhino, Jaxer etc., but so far the solution I am most content with is a combination of curl, grep, awk and sed.

The Afterthoughts – If Google came up with an RSS Reader

So here is another post in The Afterthoughts series.

Post: If Google came up with an RSS Reader
Originally posted on: 2005-01-30

This post was made long before Google came up with Google Reader. I was experimenting with RSS readers and started wondering what it would be like if Google came up with an RSS reader.

Now that we have one from Google, it is time to look back and see how my expectations matched with the actual product.

> * It would first buy the domain “greader” or something similar.
This didn't happen. However, Google Reader is popularly called GReader. I guess I made this comment because of Gmail.
On a side note, Google does own greader.net.

> * It would have an index of more than 8 million different feeds.
This is not how an RSS reader has evolved. Google Reader does have recommendations based on the feeds you already have. It would be good to see an integration of Google Blogsearch or even Google News with Google Reader. The only integration I see is the subscription of search results from both of these in Google Reader (a 'new' feature).

> * It would offer 1 GB space for storing posts.
The storage in most online readers is unlimited.

> * It would have an excellent search feature for searching posts.
This was a surprise! The feature came in so late. Totally unexpected.

> * The interface would be simple, but at the same time powerful.
You bet this has been true. The keyboard shortcuts are just superb. The speed with which you can navigate and read feeds is extremely good. (You will need my script to make it even faster. :))

> * We would be able to mail any post just at the click of a button.
I guess this feature has been around for quite some time now.

> * It would allow us to filter posts and also label them for future reference.
With tagging and folders, this has been better than expected.

> * It would also allow us to make blog entries (of course the service would be integrated with Blogger.)
Again, this is a surprise. Google has not provided any integration with Blogger. However, recently Google added a feature to share an item with notes. With the microblogging revolution, and Google having acquired Jaiku, I guess that integration will happen first.

> * It would integrate greader with other offerings like mail, groups etc.
The integration is not that great as of now. It would be cool to see posts related to a mail, or a message in a group etc.

> * It would be Beta forever. 🙂
Surprise! This isn't true!

Final thoughts:
So, more than 3 years after I made the original post (which is a lot of time in technological evolution), I should say that Google did match most of the expectations I had back then, and some features turned out much better than I had expected. However, integration with other services is one area where it could have done better.

The Afterthoughts – Gmail forwarding and service interoperability – an interesting observation

“The Afterthoughts” is a series where I revisit some of my older blog entries and see how things have changed between the time I made the blog post and now.

The posts that I will choose initially will be from 2004 to 2006.

So here is the first one in the series:

Post: Gmail forwarding and service interoperability – an interesting observation
Originally posted on: 2005-11-21

The entry explains how, when you connect various services together, you can end up with the same information multiple times.

This is increasingly becoming a problem these days. Services like Twitter and Friendfeed are not solving the problem elegantly, so you see more and more duplicates and links to the original post.

Here is a typical scenario today:
I make a blog entry. In order to ensure that my readers see my post immediately, I have a service that automatically posts a message in Twitter. This is like instantly messaging my friends (actually Twitter followers) telling them, “Look, I made a blog entry”.

Now, I use a lot of Web 2.0 services. So, in order to ensure that all my friends have a single feed to follow my activities, I use some aggregator like FriendFeed or Tumblr.

A friend of mine (let's call him Bob) likes my blog entry and bookmarks it on del.icio.us. Another friend, Andrews, bookmarks it on Magnolia.

Let us now say there is another person, Dave, who is a friend of mine, Bob and Andrews. He is following all 3 of us on Friendfeed.

How many copies of the original entry is Dave going to see?
6 in total! 4 from me (1 from my blog post directly, 1 from Twitter and 2 from Tumblr: 1 via the blog post and 1 via Twitter), 1 from Bob via del.icio.us and 1 from Andrews via Magnolia.
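To double-check the count, here is a tiny tally of the entries listed in this scenario. The tuple layout and the labels are just my own way of writing the scenario down.

```python
from collections import Counter

# (who Dave follows, where the entry came from, what it is)
entries_dave_sees = [
    ("me", "blog", "the original post"),
    ("me", "Twitter", "the automatic tweet about the post"),
    ("me", "Tumblr", "the blog post, re-aggregated"),
    ("me", "Tumblr", "the tweet, re-aggregated"),
    ("Bob", "del.icio.us", "bookmark of the post"),
    ("Andrews", "Magnolia", "bookmark of the post"),
]

# Count how many entries each followed person contributes.
per_person = Counter(person for person, _, _ in entries_dave_sees)
total = len(entries_dave_sees)

print(total)             # 6
print(dict(per_person))  # {'me': 4, 'Bob': 1, 'Andrews': 1}
```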

The screenshot shows duplicate entries from mashable's blog feed and from Twitter:

Now this is real noise, and even more so if Dave is not interested in the blog post to begin with.

So the solution?
Friendfeed allows you to hide specific feeds from specific people. For example, Dave can hide all bookmarks from Bob or all Tumblr entries from me.

Now that is not a good solution because not all bookmarks from Bob are duplicates.

Tools like Feedblendr and Blogbridge have solved this problem for simple RSS aggregation. However, things are different when it comes to social networks and aggregation.
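For the simple aggregation case, duplicate elimination can be as basic as grouping items by a normalized link. The sketch below is my own illustration of that idea and not a description of how Feedblendr or Blogbridge work; the item fields (`source`, `title`, `link`) and the normalization rules are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_link(link):
    """Normalize a URL so that trivially different links compare equal.

    Assumed rules: lowercase scheme and host, drop the fragment and any
    trailing slash, keep the query string untouched.
    """
    parts = urlsplit(link.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def collapse_duplicates(items):
    """Keep one entry per canonical link and remember every source that carried it."""
    merged = {}
    for item in items:
        key = canonical_link(item["link"])
        if key not in merged:
            merged[key] = {"link": key, "title": item["title"], "sources": []}
        merged[key]["sources"].append(item["source"])
    return list(merged.values())


feed = [
    {"source": "blog", "title": "My post", "link": "http://example.com/my-post/"},
    {"source": "twitter", "title": "Look, I made a blog entry", "link": "http://EXAMPLE.com/my-post"},
    {"source": "del.icio.us", "title": "My post", "link": "http://example.com/my-post#comments"},
    {"source": "blog", "title": "Another post", "link": "http://example.com/another-post"},
]

for entry in collapse_duplicates(feed):
    print(entry["link"], entry["sources"])
```

This only catches copies that point at the same URL. A tweet that carries a shortened link would slip through, and nothing here merges comments or takes into account who is a friend of whom, which is where the social case gets hard.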

So right now there is no simple way of detecting duplicates, and more and more people in the blogosphere are complaining about this, explaining how Friendfeed is more noise than information and why good old Google Reader is still relevant.

Here is one such discussion. As the discussion suggests, it is not just about eliminating duplicates; it also requires you to merge discussions/comments in each of these posts keeping in mind that not everyone is a friend of everyone else.

So what has changed over the last 2 years?
If anything, the problem has become a tougher one. I am sure the startup that does duplicate elimination and gives you a filtered feed taking your social networks into consideration is going to be the next hyped startup in the Web 2.0 world.