A semantic grabber

So what's a semantic grabber? If you do a Google search, you get, umm, '0' results (as of 8 March 2006).

So this is definitely not a term used in the wild. What is it, then?

Well, the story began like this. I started off experimenting with the evolving pub-sub model, wherein you give a list of keywords and you get the latest feed entries matching them. I was trying to come up with an optimal filter that would give me really crisp information. This is a tough job, especially on the as-yet semantically immature WWW.

My first requirement was a good list of keywords. For example, I would like to know all keywords related to the semantic web. I know words like RDF, OWL and RDQL are related to it, but I want a bigger list. (Does this remind you of Google Sets?)

Where can I get a list of keywords? I turned to Delicious. If you are a Web 2.0 geek, you will definitely be aware of the rdf:Bag element in its RSS feeds, which lists all the tags for a particular link.

For example, the RSS page for the tag 'rss' contains a link with the following tags:

<taxo:topics>
  <rdf:Bag>
    <rdf:li resource="http://del.icio.us/tag/rss"/>
    <rdf:li resource="http://del.icio.us/tag/atom"/>
    <rdf:li resource="http://del.icio.us/tag/validator"/>
  </rdf:Bag>
</taxo:topics>

So you know that rss, atom and validator are some 'related' keywords. Of course, there is no context here, so people may well tag http://www.google.com/ as 'irc'. (This is true; I have seen people tag Google as IRC.) But if you assign weights to tag relationships, you can soon come up with a model that reveals tag clusters.
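The weighting idea above can be sketched in a few lines of Python. This is only an illustration, not the code I actually ran: it pulls tag names out of rdf:Bag fragments with a regular expression (the `TAG_RE` pattern and function names are my own inventions here) and counts how often each pair of tags co-occurs.

```python
import re
from collections import Counter
from itertools import combinations

# Illustrative sketch: pull tag names out of the rdf:li resource URLs
# inside an <rdf:Bag> and weight tag pairs by how often they co-occur.
TAG_RE = re.compile(r'resource="http://del\.icio\.us/tag/([^"]+)"')

def tags_in_bag(bag_xml):
    """Return the tag names referenced inside one rdf:Bag fragment."""
    return TAG_RE.findall(bag_xml)

def cooccurrence_weights(bags):
    """Count how often each unordered pair of tags appears together."""
    weights = Counter()
    for bag_xml in bags:
        # Each pair of tags attached to the same link strengthens the
        # relationship between those two tags by 1.
        for pair in combinations(sorted(set(tags_in_bag(bag_xml))), 2):
            weights[pair] += 1
    return weights

bag = '''<rdf:Bag>
    <rdf:li resource="http://del.icio.us/tag/rss"/>
    <rdf:li resource="http://del.icio.us/tag/atom"/>
    <rdf:li resource="http://del.icio.us/tag/validator"/>
</rdf:Bag>'''

print(cooccurrence_weights([bag]))
# each pair from this one bag gets weight 1:
# (atom, rss), (atom, validator), (rss, validator)
```

Feed it enough bags and the heavy pairs start to outline the tag clusters; the one-off noise (Google tagged as 'irc') stays down in the low weights.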

OK, now back to the topic of semantic grabbers. The idea came to me when I thought of writing a crawler that crawls Delicious RSS feeds and tries to find tag clusters. This crawler is not interested in links, but in the data that resides behind the links. That clearly distinguishes it from a normal HTTP grabber, which blindly follows links and grabs pages.
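The core of such a grabber is just a breadth-first walk over tag space instead of link space. Here is a minimal sketch, under the assumption that `fetch_cotags` is a stand-in for fetching a tag's Delicious RSS feed and extracting the co-occurring tags from its rdf:Bag entries (the names and the toy graph are mine, for illustration):

```python
from collections import deque

def grab(seed_tag, fetch_cotags, limit=10):
    """Breadth-first walk over tags: follow tag relationships, not
    hyperlinks, and return the set of tags reached from the seed."""
    seen, queue = {seed_tag}, deque([seed_tag])
    while queue and len(seen) < limit:
        tag = queue.popleft()
        for co in fetch_cotags(tag):
            if co not in seen:
                seen.add(co)
                queue.append(co)
    return seen

# Toy tag graph standing in for live Delicious feeds:
graph = {"rss": ["atom", "validator"], "atom": ["feeds"],
         "validator": [], "feeds": []}

print(grab("rss", lambda t: graph.get(t, [])))
# {'rss', 'atom', 'validator', 'feeds'} (in some order)
```

Swap the lambda for a real fetch-and-parse step and you have a crawler whose frontier is made of tags, not URLs.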

Soon, with the evolution of RDF, I guess there will be more such crawlers on the web (what are agents, after all?), and people are already talking about how we can crawl such a web. This is my first attempt at it.

So ditch Google Sets (if at all you have tried it) and use a 'semantic grabber'. 😉

The evolution of the pub-sub model on the web

Recently, I have seen a new trend emerging on the web. Until now, we had people publishing their information as RSS feeds and others subscribing to them. This was the first step towards the pub-sub (publish-subscribe) model.

Then came tagging, and people started publishing 'relevant' tags along with their feed entries. This has helped a new trend emerge, wherein I can track not just websites, but information pertinent to certain keywords (or tags).

A major advantage of this is that I don't have to subscribe to individual RSS feeds; rather, I subscribe to a set of keywords (optionally combined using a regular expression) and get information based on them. I have been trying this for quite some time now and have been getting wonderful results.
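A keyword subscription of this kind boils down to a regex filter over incoming entries. Here is a minimal sketch, assuming entries arrive as simple dicts of title and tags (the entry structure and function name are illustrative, not any particular reader's API):

```python
import re

def subscribe(entries, pattern):
    """Yield only the feed entries whose title or tags match the
    subscribed keyword pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    for entry in entries:
        # Match against both the title and the attached tags.
        haystack = entry["title"] + " " + " ".join(entry.get("tags", []))
        if rx.search(haystack):
            yield entry

entries = [
    {"title": "OWL primer", "tags": ["semantic-web", "rdf"]},
    {"title": "Cooking pasta", "tags": ["food"]},
]

matched = list(subscribe(entries, r"rdf|owl|semantic"))
print([e["title"] for e in matched])  # ['OWL primer']
```

The pattern plays the role of the 'set of keywords combined using a regular expression': one subscription, however many sources happen to publish matching entries.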

In fact, this is how founders of websites track the popularity of their tools: they simply subscribe to the keyword that relates to their website. The moment someone tags a blog entry with it, the entry arrives in the founders' feed readers, and they are quick to comment and 'show interest'. Here's more information and an example of how the founder of a website tracked my blog entry within a single day, and here's another.

Hoping that tagging is not misused (remember what happened to <meta>?), we have a new way of tracking relevant information.