Semantic Crawler – an update

This is in continuation of my blog entry on Semantic Grabbers. I did some experiments after consultation with . Thanks for the inputs.

My intention was to get a set of related words given a single word as input. I wanted to make use of the <rdf:Bag> tag that Delicious provides.

The idea that I had in mind was to start off by seeing the number of occurrences of each tag in the <rdf:Bag> of all links and then to use this to decide which tag to analyze next. The more frequent the occurrence of a tag, the more likely it is to be chosen next.

For example, suppose I see that RDF occurs most frequently in the links, then I select that as my next tag for analysis. I keep updating this list with more tags and their frequency as I crawl through the tags.

Here's the problem I faced: There are chances of the use of very generic words like tech, development, tutorial etc that are likely to be used in more links than others. So the crawler was mislead. The selected tag becomes more and more irrelevant as the crawling proceeds.

There are some solutions that I have in mind.
1. Provide weight-age in comparison with the root-word (i.e. the given word).
2. Do a study of 'all' the tags for the entire list possibly including the description as well and then see the relationships. (This emerged after my discussion with .
3. Provide more than one word as input and use these words to determine the set of related words.

Determining relationships between words is not quite easy in folksonomies because of the lack of contextual information. However it surely is a rich set of information that needs to be exploited.

The result will be available here for a few days.