crawling – Travel Photography and Technology «buzypi.in»

So what's a semantic grabber? If you do a Google search, you get, umm, '0' results (as of 08-March-2006).

So this is definitely not a term used in the wild. So what is it then?

Well, the story began like this. I started off experimenting with the evolving pub-sub model, wherein you give a list of keywords and get the latest feeds that match them. I was trying to come up with an optimum filter that would give me really crisp information. This is a tough job, especially on the as yet semantically immature WWW.

My first requirement was to get a good list of keywords. For example, I would like to know all keywords related to semantic-web. I know words like RDF, OWL, RDQL etc. are related to semantic-web. But I want a bigger list. (Does this remind you of Google Sets?)

Where can I get a list of keywords? I turned to Delicious. If you are a Web 2.0 geek, you would definitely be aware of the rdf:Bag element in its RSS feeds, which gives you the list of all tags for a particular link.

For example, the RSS feed for the tag 'rss' contains a link that carries the following tags:

<taxo:topics>
<rdf:Bag>
    <rdf:li resource="http://del.icio.us/tag/rss"/>
    <rdf:li resource="http://del.icio.us/tag/atom"/>
    <rdf:li resource="http://del.icio.us/tag/validator"/>
</rdf:Bag>
</taxo:topics>

So you know that rss, atom and validator are some 'related' keywords. Of course, there is no context here, so nothing stops people from tagging http://www.google.com/ as 'irc'. (This is true. I have seen people tag Google as IRC.) But if you assign a weight to each tag relationship, for instance by counting how often two tags appear together on the same link, you can soon come up with a model that reveals tag clusters.
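To make the weighting idea concrete, here is a minimal Python sketch. It assumes the usual namespace URIs for the rdf and taxo prefixes, wraps the snippet above in a hypothetical item element, and uses simple co-occurrence counting as the weight. The function names, the extra tag lists and the min_weight threshold are all made up for illustration; this is one way it could be done, not a finished design.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

# Assumed namespace URIs for the rdf: and taxo: prefixes.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TAXO_NS = "http://purl.org/rss/1.0/modules/taxonomy/"

# The snippet from the post, wrapped in a hypothetical <item> element.
SAMPLE_ITEM = f"""
<item xmlns:rdf="{RDF_NS}" xmlns:taxo="{TAXO_NS}">
  <taxo:topics>
    <rdf:Bag>
      <rdf:li resource="http://del.icio.us/tag/rss"/>
      <rdf:li resource="http://del.icio.us/tag/atom"/>
      <rdf:li resource="http://del.icio.us/tag/validator"/>
    </rdf:Bag>
  </taxo:topics>
</item>
"""


def tags_in_item(item_xml):
    """Return the tag names listed in an item's rdf:Bag, in document order."""
    root = ET.fromstring(item_xml)
    tags = []
    for li in root.iter(f"{{{RDF_NS}}}li"):
        # The attribute may be plain or rdf-qualified, so check both.
        url = li.get("resource") or li.get(f"{{{RDF_NS}}}resource")
        if url:
            # The tag name is the last path segment of the tag URL.
            tags.append(url.rstrip("/").rsplit("/", 1)[-1])
    return tags


def cooccurrence_weights(tag_lists):
    """Count how often each unordered pair of tags appears on the same link."""
    weights = Counter()
    for tags in tag_lists:
        for pair in combinations(sorted(set(tags)), 2):
            weights[pair] += 1
    return weights


def related(tag, weights, min_weight=2):
    """Tags seen with `tag` at least min_weight times, strongest first."""
    hits = []
    for (a, b), weight in weights.items():
        if weight >= min_weight and tag in (a, b):
            hits.append((b if a == tag else a, weight))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# One real item from above plus three invented tag lists.
tag_lists = [
    tags_in_item(SAMPLE_ITEM),
    ["rss", "atom"],
    ["google", "irc"],
    ["rss", "atom", "feed"],
]
weights = cooccurrence_weights(tag_lists)
print(related("rss", weights))
```

With a threshold like this, a one-off pairing such as google/irc never makes it into a cluster, while pairs that keep showing up together do.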

OK, now back to the topic of semantic grabbers. The idea came to me when I thought of writing a crawler that crawls Delicious RSS feeds and tries to find tag clusters. So this crawler is not interested in the links themselves, but in the data that resides in them. That clearly distinguishes it from a normal HTTP grabber, which blindly follows links and grabs pages.
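A rough Python sketch of such a crawler might look like the following. It walks from tag to tag rather than from page to page, and it never fetches the bookmarked URLs at all. Everything here is an assumption for illustration: crawl_tags, fetch_links, the max_tags limit and the FAKE_FEEDS data are hypothetical, and the actual downloading and parsing of a feed is left to whatever fetch_links you plug in.

```python
from collections import Counter, deque
from itertools import combinations


def crawl_tags(seed, fetch_links, max_tags=50):
    """Breadth-first walk over tags, counting tag pairs seen on the same link.

    fetch_links(tag) is a caller-supplied (hypothetical) function that
    returns (url, tags) pairs for the links in that tag's feed.
    """
    weights = Counter()
    seen_tags = {seed}
    counted_urls = set()
    queue = deque([seed])
    visited = 0
    while queue and visited < max_tags:
        tag = queue.popleft()
        visited += 1
        for url, tags in fetch_links(tag):
            if url in counted_urls:
                continue  # the same link shows up in several tag feeds
            counted_urls.add(url)
            unique = sorted(set(tags))
            for pair in combinations(unique, 2):
                weights[pair] += 1
            # Follow the tags found in the data, not the bookmarked URL.
            for other in unique:
                if other not in seen_tags:
                    seen_tags.add(other)
                    queue.append(other)
    return weights


# Invented stand-in for real feeds, so the sketch runs without a network.
FAKE_FEEDS = {
    "rss": [
        ("http://example.org/a", ["rss", "atom", "validator"]),
        ("http://example.org/b", ["rss", "atom"]),
    ],
    "atom": [
        ("http://example.org/b", ["rss", "atom"]),
        ("http://example.org/c", ["atom", "xml"]),
    ],
}


def fake_fetch(tag):
    return FAKE_FEEDS.get(tag, [])


weights = crawl_tags("rss", fake_fetch)
for pair, weight in weights.most_common():
    print(pair, weight)
```

The max_tags limit is only there to keep the walk bounded, since a popular tag can lead to a very large number of neighbours.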

Soon, with the evolution of RDF, I guess there will be more such crawlers on the web (what are agents?), and people are already talking about how to crawl such a web. This is my first attempt at it.

So ditch Google Sets (if you have tried it at all) and use a 'semantic grabber'. 😉