This blog entry is less about what Wikipedia in RDF is and more about the kind of problems that I faced in using it.
When I initially read about the Wikipedia in RDF initiative, I was excited. Imagine being able to download the meta information of ALL the articles of Wikipedia and then being able to query it, analyze it and do anything that you would want to do with it.
I dutifully downloaded the gzip archive of the RDF/XML format. The archive is 397 MB and the uncompressed size is supposed to be 3.7 GB ("supposed to be" because I did not have enough space in a single partition to extract the whole archive). I initially doubted whether XP supports files of this size, but I found a page which said that on NTFS partitions the maximum file size is the size of the volume.
Ok, here comes a host of problems. I conducted my experiments on a system with 256 MB of RAM. I guess the processor is not bad; it is a 1.7 GHz Celeron.
In order to analyze this file, I first had to extract it. I extracted part of the archive (about 800 MB) and then tried to open the result in my text editor, SciTE. I was disappointed. The file did not open. I then tried Wordpad (I did not dare to try Notepad!), Vim (for Windows), Edit (from cmd.exe) and Mozilla Firefox.
The best response I got was from Edit (I am not surprised. I have done some tests before and I saw that Edit is the best text editor in Windows!), which clearly said that it cannot handle files of that size and that it would show only the first 65000-odd lines. Decent. At least I get to view 65000 lines!
The second best response was from Mozilla Firefox. I had some problems here. Firefox tried to parse the file as RDF. I changed the extension to txt so as to avoid parsing and tried again. Firefox immediately started loading the file. It was occupying about 150 MB of memory just before it stopped working.
Vim was bad too. 🙁 The file just did not open and Vim exited abnormally.
So I am left with a host of problems before I can start playing with this file.
Is there any text editor that I can use to open this file? I guess there should be SOME editor that does caching and is written specially to load huge files.
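Short of finding such an editor, one way to at least look at the beginning of the data is to read the first few lines straight out of the gzip archive, without extracting it to disk at all. The following is only a minimal sketch in Java: the class name `GzipHead` and the command-line arguments are hypothetical, and it assumes the dump is UTF-8 encoded, which has not been verified here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

public class GzipHead {

    /** Returns at most maxLines lines from the start of a gzip-compressed stream. */
    static List<String> head(InputStream gzipped, int maxLines) throws IOException {
        List<String> lines = new ArrayList<>();
        // Assumption: the compressed text is UTF-8.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            // Stop as soon as enough lines are collected; the rest is never decompressed.
            while (lines.size() < maxLines && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the .gz file (hypothetical), args[1]: number of lines to show
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String line : head(in, Integer.parseInt(args[1]))) {
                System.out.println(line);
            }
        }
    }
}
```

Because the stream is decompressed on the fly and abandoned after the requested number of lines, memory use stays small regardless of how large the uncompressed file would be.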
Ok, now on to the second problem. I am thinking of doing some analysis on this RDF document. In order to do that I should be able to 'load' the entire file in memory (because RDF/XML has to go through an XML parser), or else I cannot use it. I guess I should use FileChannel to create a map of the file and a pull parser to parse the file.
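For what it is worth, a pull parser does not need the whole document in memory; it hands over one event at a time as it reads. The sketch below uses StAX (`javax.xml.stream`, part of the JDK) over a `GZIPInputStream` instead of a `FileChannel` mapping, which is a swapped-in approach rather than the one described above. One reason for the swap is that a single `FileChannel.map` call cannot map more than `Integer.MAX_VALUE` bytes (about 2 GB), so a 3.7 GB file would need several mapped windows. The class name `RdfElementCounter` and the file path argument are hypothetical, and counting element names is just a stand-in for whatever analysis is actually wanted.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.GZIPInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class RdfElementCounter {

    /** Streams through an XML document and counts start tags by local name. */
    static Map<String, Long> countElements(InputStream xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        Map<String, Long> counts = new TreeMap<>();
        try {
            // Only the current event is held in memory, never the whole tree.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    counts.merge(reader.getLocalName(), 1L, Long::sum);
                }
            }
        } finally {
            reader.close();
        }
        return counts;
    }

    public static void main(String[] args) throws IOException, XMLStreamException {
        // args[0]: path to the gzip-compressed RDF/XML dump (hypothetical)
        try (InputStream in = new GZIPInputStream(Files.newInputStream(Paths.get(args[0])))) {
            countElements(in).forEach((name, n) -> System.out.println(name + "\t" + n));
        }
    }
}
```

Reading directly from the gzip stream also sidesteps the disk-space problem, since the 3.7 GB file never has to exist on the partition. How long a full pass takes on a 256 MB machine is untested here.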
I have not tried this, but I am a hundred percent sure that I will face problems. Size does matter!
Wish me luck. 🙂