Wikipedia in RDF

This blog entry is not so much about what Wikipedia in RDF is as about the kinds of problems I faced in using it.

When I initially read about the Wikipedia in RDF initiative, I was excited. Imagine being able to download the meta information of ALL the articles of Wikipedia and then being able to query it, analyze it and do anything that you would want to do with it.

I dutifully downloaded the gzip archive for the RDF/XML format. The archive is 397 MB and the unzipped size is supposed to be 3.7 GB (supposed to be, because I did not have enough space in a single partition to unzip the entire archive. I initially doubted whether XP supports files of this size, but I found a page saying that on NTFS partitions the maximum file size is limited only by the size of the volume).

OK, here comes a host of problems. I conducted my experiments on a system with 256 MB of RAM. I guess the processor is not bad; it is a 1.7 GHz Celeron.

In order to analyze this file, I first had to extract it. I extracted part of the archive (about 800 MB) and then tried to open it in my text editor, SciTE. I was disappointed. The file did not open. I then tried WordPad (I did not dare to try Notepad!), Vim (for Windows), Edit (from cmd.exe) and Mozilla Firefox.

The best response I got was from Edit (I am not surprised. I have done some tests before and I saw that Edit is the best text editor in Windows!), which clearly said that it cannot handle files of that size and that it would show the first 65,000-odd lines. Decent. I at least get to view 65,000 lines!

The second best response was from Mozilla Firefox. I had some problems here. Firefox tried to parse the file, since it was in RDF. I changed the extension to txt so as to avoid parsing and tried again. Firefox immediately started loading the file. It occupied about 150MB of memory, just before it stopped working.

Vim was bad too. 🙁 The file just did not open and Vim made an abnormal exit.

So I am left with a host of problems before I can start playing with this file.

Is there any text editor that I can use to open this file? I guess there should be SOME editor that does caching and is written specially to load huge files.
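Short of finding such an editor, a few lines of Java can at least peek at the beginning of the file straight from the gzip, without extracting anything to disk. This is only a sketch: the class name GzipHead is made up for illustration, the path is whatever you pass on the command line, and I am assuming the dump is UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: reads only the first few lines of a gzip-compressed text file.
class GzipHead {

    // Decompresses on the fly and stops as soon as maxLines lines have been read,
    // so memory use does not depend on the size of the archive.
    static List<String> head(Path gzipFile, int maxLines) throws IOException {
        List<String> lines = new ArrayList<>();
        try (InputStream raw = Files.newInputStream(gzipFile);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(new GZIPInputStream(raw), StandardCharsets.UTF_8))) {
            String line;
            while (lines.size() < maxLines && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java GzipHead <file.gz> [lines]");
            return;
        }
        int n = args.length > 1 ? Integer.parseInt(args[1]) : 20;
        for (String line : head(Paths.get(args[0]), n)) {
            System.out.println(line);
        }
    }
}
```

It is no substitute for an editor, but it is enough to see what the first few records look like before deciding how to process the rest.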

OK, now on to the second problem. I am thinking of doing some analysis using this RDF document. A tree-based XML parser would have to 'load' the entire file in memory before I could use it, and that is clearly not going to happen on this machine. I guess I should instead use a FileChannel to create a memory map of the file and a pull parser to read it piece by piece.
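The pull-parser idea can be sketched with StAX, the streaming XML API that ships with the JDK. The sketch below only counts how often each element name occurs, because I do not know the exact vocabulary of the Wikipedia dump; the class name RdfElementCounter and the tiny sample document are made up for illustration. It reads from a plain InputStream, so I suspect the FileChannel mapping may not even be needed: the parser holds only the current event, not the whole document.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.GZIPInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: counts element names in an XML stream with a pull parser.
class RdfElementCounter {

    // Walks the stream event by event; only the running counts are kept in memory.
    static Map<String, Integer> countElements(InputStream in) throws XMLStreamException {
        Map<String, Integer> counts = new TreeMap<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    counts.merge(reader.getLocalName(), 1, Integer::sum);
                }
            }
        } finally {
            reader.close();
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // Assumption: the argument is a path to a gzip-compressed RDF/XML file.
            try (InputStream in = new GZIPInputStream(Files.newInputStream(Paths.get(args[0])))) {
                System.out.println(countElements(in));
            }
            return;
        }
        // Made-up sample data, used when no file is given.
        String sample =
                "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
              + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
              + "<rdf:Description rdf:about=\"http://example.org/a\"><dc:title>A</dc:title></rdf:Description>"
              + "<rdf:Description rdf:about=\"http://example.org/b\"><dc:title>B</dc:title></rdf:Description>"
              + "</rdf:RDF>";
        System.out.println(countElements(
                new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Whether this is fast enough on a 3.7 GB file is something I would have to measure, but memory should stay flat however large the input gets.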

I have not tried this, but I am cent per cent sure that I will face problems. Size does matter!

Wish me luck. 🙂

Programmable wikis, Application wikis, Situational applications

Heard of Jot? It has been in the news for some time now, calling itself the first true application wiki.

So what's this thing all about and how is it different from normal wikis?

Before we delve into this, we need to know where normal wikis fall short and how this new concept of application wikis helps address those shortcomings.

Wikis in their present form contain highly unstructured data. Take the example of Wikipedia. Wikipedia allows users to create pages containing information about just about anything in the world.

The information in wikis would be more useful if it could be used somewhere else. For example, I might want to double-click on a word in my browser and view just its definition (and not the entire page). Or I might want to relate the content of one page to that of another, semantically. I might also want to view content based on my current expertise level (contextual views).

In order for this to happen, we require that wikis be more intelligent. Enter application wikis.

Application wikis bring in the dynamic content aggregation feature that is lacking in current wikis. This means that the data may not even reside in one place. It might be aggregated at runtime. But that is not all. The content might be pushed out as an RSS feed, so that people can subscribe to changes made to specific sections of the wiki, or the changes might be mailed to them. The basic idea is to be able to 'program' the wiki to display 'information' dynamically.

This opens up some interesting uses for application wikis. An application wiki can be used, for example, to create a page for a conference. This page would contain a Google Maps plugin showing the venue on the map, the latest news in the form of an RSS feed, the weather information in another portlet, a list of all participants (which might come directly from a database) and a list of all talks (which might be maintained in a separate database for some reason).

You might note that the data does not actually exist in the wiki at all. The wiki just acts like an aggregator of content.

Now comes the concept of views. As a participant in the conference, I might be given more information about the talks, while a non-participant may get less information. A speaker might get a totally different set of information, and so might the organizers.
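To make the idea of views concrete, here is a minimal sketch in Java. It is not how Jot or any other wiki actually does it; the roles, the field names and the class name TalkViews are all invented for this example. The same talk record is filtered down to whatever fields a given role is allowed to see.

```java
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical example: one record, several role-dependent views of it.
class TalkViews {

    enum Role { VISITOR, PARTICIPANT, SPEAKER, ORGANIZER }

    // Which fields each role may see. All field names are made up.
    static final Map<Role, Set<String>> VISIBLE = new EnumMap<>(Role.class);
    static {
        VISIBLE.put(Role.VISITOR, Set.of("title"));
        VISIBLE.put(Role.PARTICIPANT, Set.of("title", "room", "abstract"));
        VISIBLE.put(Role.SPEAKER, Set.of("title", "room", "abstract", "equipment"));
        VISIBLE.put(Role.ORGANIZER, Set.of("title", "room", "abstract", "equipment", "speakerPhone"));
    }

    // Returns a copy of the talk containing only the fields the role may see,
    // preserving the order of the original record.
    static Map<String, String> viewFor(Role role, Map<String, String> talk) {
        Set<String> allowed = VISIBLE.get(role);
        Map<String, String> view = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : talk.entrySet()) {
            if (allowed.contains(field.getKey())) {
                view.put(field.getKey(), field.getValue());
            }
        }
        return view;
    }

    public static void main(String[] args) {
        Map<String, String> talk = new LinkedHashMap<>();
        talk.put("title", "Application wikis");
        talk.put("room", "Hall B");
        talk.put("abstract", "What application wikis add to normal wikis.");
        talk.put("equipment", "Projector");
        talk.put("speakerPhone", "555-0100");
        for (Role role : Role.values()) {
            System.out.println(role + " sees " + viewFor(role, talk).keySet());
        }
    }
}
```

The data is stored once; only the filter changes from role to role.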

You probably don't even need a Graphical UI to view the data. You might as well have a kind of Query Interface that allows you to view data based on your role and your preferences.

It might be obvious, but let me clarify that the content for a particular page may come from a specific section of some other page. So I could have, say, a page on Bangalore, a page on Mysore and a page on Cities in Karnataka, which fetches its content from the first two. The moment the content changes in the original pages, the content viewed from the aggregated page also changes.
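Here is a minimal sketch of that behaviour in Java, under the assumption that a page is just a set of named sections. The class TinyWiki and its methods are hypothetical and not taken from any real wiki engine, and the section texts are placeholders. The aggregated page stores only references to sections and resolves them every time it is rendered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example: an aggregated page that holds references, not copies.
class TinyWiki {

    // page name -> (section name -> text)
    private final Map<String, Map<String, String>> pages = new HashMap<>();

    // aggregated page name -> list of {source page, section} references
    private final Map<String, List<String[]>> aggregates = new HashMap<>();

    // Creates or overwrites one section of a normal page.
    void putSection(String page, String section, String text) {
        pages.computeIfAbsent(page, p -> new HashMap<>()).put(section, text);
    }

    // Makes aggregatePage include the given section of sourcePage.
    void include(String aggregatePage, String sourcePage, String section) {
        aggregates.computeIfAbsent(aggregatePage, p -> new ArrayList<>())
                  .add(new String[] {sourcePage, section});
    }

    // Resolves every reference at render time, so edits to the sources show up immediately.
    String render(String aggregatePage) {
        StringBuilder out = new StringBuilder();
        for (String[] ref : aggregates.getOrDefault(aggregatePage, List.of())) {
            String text = pages.getOrDefault(ref[0], Map.of()).getOrDefault(ref[1], "");
            out.append(ref[0]).append(": ").append(text).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TinyWiki wiki = new TinyWiki();
        wiki.putSection("Bangalore", "Summary", "Capital of Karnataka.");
        wiki.putSection("Mysore", "Summary", "City of palaces.");
        wiki.include("Cities in Karnataka", "Bangalore", "Summary");
        wiki.include("Cities in Karnataka", "Mysore", "Summary");
        System.out.print(wiki.render("Cities in Karnataka"));

        // Edit the source page; the aggregated page picks up the change on the next render.
        wiki.putSection("Mysore", "Summary", "Home of the Mysore Palace.");
        System.out.print(wiki.render("Cities in Karnataka"));
    }
}
```

Because nothing is copied into the aggregated page, there is nothing to keep in sync.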

Yeah, there's nothing special here. It's the traditional MVC pattern applied to wikis. It had to come some day.

Two initiatives that I know of in this field are Jot and Semantic MediaWiki.

If you are a Web 2.0 or Semantic Web geek, you might have come across these by now, but if you have not, you have got to check them out.