Why Google AppEngine still sucks

Last June, when I built the Twitter Trending Topics app using Google AppEngine, I mentioned quite a few issues with building applications on it. After giving it about 9 months to mature, I thought I would take another look, with a fresh perspective, at where it stands.

The first thing I wanted to try was to revive my old application. It has been inactive because it surpassed the total stored data quota, and I never found the time to revive it.

One of the biggest issues that I mentioned last time was the inability to delete data from the application easily. There is an upper limit of 1GB on the total stored data. Considering that the data is schema-less (which means that you need more space to store the same data when compared to relational databases), this upper limit is severely restrictive when compared to the other quota limits that are imposed. There were about 800,000 entries of a single kind (the equivalent of a table) that I had to delete!

So I started looking for ways to delete all the data and came across this post. I decided to go with the approach mentioned there. The approach still seems to be to delete data in chunks, and there is no simpler way out. The maximum number of entries allowed in a fetch call is 500, which means I need at least 1600 calls to delete all the data.

Anyway, so I wrote a simple script as mentioned in the post above and executed it. I experimented with various chunk values and saw that 300 was the size that worked optimally; anything more either seemed to take a lot of time or actually timed out.

Here is the code that I executed:


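from google.appengine.ext import db
from <store> import <kind>


def delete_all():
   # Delete entities in chunks of 300 until none are left.
   i = 0
   while True:
      entities = <kind>.all().fetch(300)
      if not entities:
         break
      db.delete(entities)
      # Count and report the chunks deleted so far.
      i = i + 1
      print i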

I saved this file as purger.py and executed it as:

$ python appengine_console.py twitter-trending-topics
App Engine interactive console for twitter-trending-topics
>>> import purger
>>> purger.delete_all()

A seemingly simple script, but after a couple of hours of execution (having deleted roughly 200,000 entries), I started seeing a 503 Service Unavailable exception. I thought this had to do with network issues, but soon realized that was not the case. I had run out of my CPU time quota!

To delete 200,000 entries the engine had used up 6.5 CPU hours, and it managed to do this in less than 2 hours of elapsed time! According to the graphs, it had assigned 4 CPU cores to the task. At this rate, it will take me 4 days just to delete the data from my application. The Datastore CPU time quota is 62.11 hours, but there is a cap of 6.5 hours on the Total CPU time quota, and the Datastore CPU time is not counted separately from it. I am not sure how this works!
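The four-day figure follows from simple arithmetic on the numbers above. Here is a rough sketch of that estimate. It assumes that the Total CPU quota resets once a day and that the CPU cost per deleted entry stays constant; both are my assumptions, and the function and variable names are made up for illustration.

```python
import math

TOTAL_ENTRIES = 800000        # entries of the single kind to delete
FETCH_LIMIT = 500             # maximum entries per fetch call
DELETED_PER_QUOTA = 200000    # entries deleted before the quota ran out
CPU_HOURS_PER_QUOTA = 6.5     # Total CPU time quota, in hours


def estimate_purge(total, fetch_limit, deleted_per_quota, cpu_hours_per_quota):
    """Return (minimum fetch calls, total CPU hours, quota periods needed)."""
    # Each call removes at most fetch_limit entries.
    min_calls = math.ceil(total / fetch_limit)
    # Assumption: CPU cost scales linearly with the number of entries.
    cpu_hours = total / deleted_per_quota * cpu_hours_per_quota
    # Assumption: one quota period per day, so this is also the day count.
    days = math.ceil(total / deleted_per_quota)
    return min_calls, cpu_hours, days


min_calls, cpu_hours, days = estimate_purge(
    TOTAL_ENTRIES, FETCH_LIMIT, DELETED_PER_QUOTA, CPU_HOURS_PER_QUOTA)
print(min_calls, cpu_hours, days)
```

With a chunk size of 300 instead of 500 the number of calls goes up, but the CPU estimate stays the same under these assumptions.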


As seen in the screenshot above, the script executed for about 2 hours before running out of CPU. There was no other appreciable CPU usage in the last 24 hours. Considering that there was no other task taking up CPU, the 6.42 hours of Datastore CPU time seems to be included in the 6.5 hours of Total CPU time. So how am I supposed to utilize the rest of the 55 hours of Datastore CPU time?

I am not sure if I am doing something wrong, but considering that there seem to be no better ways of doing things, here are my observations:

  • It is easy to get data into the system
  • It is not easy to query the data (there is an upper limit of 500 entries per fetch and, considering that joins are done in code, this is severely restrictive).
  • There is a total storage limit of 1GB for the free account
  • It is not easy to purge entities – the simplest way to delete data is to delete them in chunks
  • Deleting data is highly CPU intensive – and you can run out of CPU quota fairly quickly.

So what kind of applications can we build that are neither IO intensive nor CPU intensive? What is Google’s strategy here? Am I missing something? Is anything wrong with my analysis?

Why it is the way it is – an analysis of the proposal by TimBL of the WWW

Ever wonder why hyperlinks in the World Wide Web (WWW) are unidirectional? Why are links not typed? Why are links many to one and not many to many? Why do browsers have the restrictions that they have today? Why is the web the way it is?

A lot of the answers to these questions are hidden somewhere deep in the web itself. Having come across several technical issues with the web, I began to wonder what the initial creators of the web perceived the web to be. What was running in the minds of the users when they came across the idea of the web?

I started tracing back into history to the very beginning of the WWW. That’s how I came across the ‘original proposal of the WWW’.

So here are some of my notes on the paper:
(Content in italics is from the paper.)

Use cases for the WWW

The initial use-cases for the WWW were related to project management – communicating project ideas, storing technical details for retrieval later, finding out who wrote a piece of code, fetching all related documents for the current task. Most of the proposal revolves around the system to allow for multiuser hypertext access which is non-centralized and non-hierarchical.

Relationship to relational databases

Linked information systems have entities and relationships. There are, however, many differences between such a system and an “Entity Relationship” database system. For one thing, the information stored in a linked system is largely comment for human readers. For another, nodes do not have strict types which define exactly what relationships they may have. Nodes of similar type do not all have to be stored in the same place.

What does this mean?
We do have entities and relationships, but there are no fixed rules. Entities don’t need to have types and any two entities can be related to each other. There is also no restriction on where the entities are stored.
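To make this concrete, here is a minimal sketch of such a linked system. It is only my illustration, not something from the paper: the node names, fields and locations are hypothetical. Nodes carry no type, links are plain one-way (source, target) pairs, and each node may live in a different place.

```python
# Hypothetical nodes: no shared schema, no type, stored wherever convenient.
nodes = {
    "project-notes": {"stored_at": "machine-a", "text": "Ideas for the project"},
    "module-docs": {"stored_at": "machine-b", "author": "someone"},
    "author-page": {"stored_at": "machine-c"},
}

# Untyped, unidirectional links: any node may point at any other node.
links = [
    ("project-notes", "module-docs"),
    ("module-docs", "author-page"),
    ("project-notes", "author-page"),
]


def outgoing(node, links):
    """Nodes that the given node links to; cheap for the node's owner to list."""
    return [target for source, target in links if source == node]


def incoming(node, links):
    """Nodes that link to the given node; requires scanning every link."""
    return [source for source, target in links if target == node]


print(outgoing("project-notes", links))
print(incoming("author-page", links))
```

Note that finding the incoming links means looking at all the links in the system, which hints at why one-way links are the easier thing to build when nothing is centralized.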

Hypertext

The key ideas around Hypertext were put down by Vannevar Bush in 1945 in the form of Memex. There were several attempts by people to implement Hypertext and also Hypermedia (linking images, video etc). Ted Nelson coined the word Hypertext in 1965 and subsequently also coined the term Hypermedia. The first implementation of Hypertext in some form seems to be from Doug Engelbart in 1968. The buzz around Hypertext picked up during the late 1980s – there was a dedicated Usenet newsgroup, a bunch of conferences starting with Hypertext’87, several ACM papers, workshops etc. All this happened even before the WWW was born. There were several commercial products too, like HyperCard from Apple.

TimBL had also tried his hand at building a hypertext system, which he called Enquire. TimBL claims to have built it as early as 1980, although the first mention of Enquire seems to be in this proposal made in 1989.

When I started researching HyperCard's features, I realized one thing. These products are easily 20 years old. Technology has changed a lot in this time. It is really hard to imagine what many of these products looked like. Either the source is not available in its entirety or it is tough to compile. This reminds me of what Grady Booch said about having an archive of source code similar to the archives of books, videos, music and web pages.

Anyway, the most important difference I see between Enquire and HyperCard is that Enquire was more of a ‘programmers playtool’, while HyperCard was targeted towards end-users.

So while HyperCard had ‘fancy graphics’, Enquire had typed links and was available for multi-user access.

WWW requirements

About the requirements that TimBL put down for the WWW:
* Remote access across networks, Heterogeneity, Non-Centralisation – These are what are now taken for granted. The WWW is ubiquitous, it never breaks as a system, it can be accessed from just about any device that is Internet aware.
* Access to existing data – This was one of the reasons why the WWW became popular. It was easy to get existing data onto the web with minimal effort.
* Private links –
One must be able to add one’s own private links to and from public information. One must also be able to annotate links, as well as nodes, privately.
Frankly, I am not sure what TimBL means by private links ‘from’ public information.
* Bells and Whistles – Graphical access to the web was considered optional.
* Data analysis – This is one thing that has not taken off.
It is possible to search, for example, for anomalies such as undocumented software or divisions which contain no people. It is possible to generate lists of people or devices for other purposes, such as mailing lists of people to be informed of changes.
It is also possible to look at the topology of an organisation or a project, and draw conclusions about how it should be managed, and how it could evolve. This is particularly useful when the database becomes very large, and groups of projects, for example, so interwoven as to make it difficult to see the wood for the trees.

The Semantic Web is showing this promise.
* Live links – These are what are now called ‘Dynamic pages’ and most popular pages on the web are ‘live’ in that sense.

The implementation

Much of the academic research is into the human interface side of browsing through a complex information space. Problems addressed are those of making navigation easy, and avoiding a feeling of being “lost in hyperspace”. Whilst the results of the research are interesting, many users at CERN will be accessing the system using primitive terminals, and so advanced window styles are not so important for us now.

As I read this, I get the feeling that TimBL was not thinking of making the WWW a ‘public’ web that would be used by just about everyone, one where even a non-techie could build a page of content and hook it onto the web. Usability seemed to be of least importance.

The only way in which sufficient flexibility can be incorporated is to separate the information storage software from the information display software, with a well defined interface between them.

This division also is important in order to allow the heterogeneity which is required at CERN (and would be a boon for the world in general).

A client/server split at this level also makes multi-access more easy, in that a single server process can service many clients, avoiding the problems of simultaneous access to one database by many different users.
‘information display software’ – Now that’s what the browser is! Also this is what created the need for HTTP, HTTP server and HTML.
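A small sketch of this separation, purely as my own illustration: the storage side hands out a document through one well-defined interface, and two different display programs render the same document, one for a primitive terminal and one as HTML. The document contents, paths and function names are all hypothetical.

```python
import html

# Storage side: hypothetical documents, kept independently of any display.
DOCUMENTS = {
    "/notes": {"title": "Notes", "body": "Some text", "links": ["/other"]},
}


def serve(path):
    """The well-defined interface: return the stored document for a path."""
    return DOCUMENTS[path]


def render_text(doc):
    """Display software for a primitive terminal: plain lines, numbered links."""
    lines = [doc["title"], doc["body"]]
    lines += ["[%d] %s" % (n, link) for n, link in enumerate(doc["links"], 1)]
    return "\n".join(lines)


def render_html(doc):
    """Display software for a graphical client: the same data as HTML."""
    anchors = "".join(
        '<a href="%s">%s</a>' % (html.escape(link), html.escape(link))
        for link in doc["links"])
    return "<h1>%s</h1><p>%s</p>%s" % (
        html.escape(doc["title"]), html.escape(doc["body"]), anchors)


print(render_text(serve("/notes")))
print(render_html(serve("/notes")))
```

The storage side never needs to change when a new kind of display comes along, which is the flexibility the proposal is after.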

Conclusion

Do we still visualize the web as just content linked via Hypertext? How can we accommodate social networking and the whole realm of developments around Web 2.0 and social network applications?

The web has surely come a long way!

(Note: Draft content – subject to change)

Have I stopped blogging?

“Have you stopped blogging?”, people ask me. I don't have a definite answer. True, it has been a long time since I blogged. Although I have been quite active online, as is apparent here, I somehow couldn't blog about anything for the last 3 months!

During the 3 months since my last blog entry, I came across quite a lot of things which I found interesting and would normally have blogged about. However, I got into a vicious circle: I thought it was not worth blogging about them after such a long gap, which only made the gap longer, and so it became more and more difficult to blog about anything.

Ok, so what's keeping me interested?

  • Lotus Connections, especially the idea of Activities. This has been an eye-opener regarding the way I organize information in my system/s.
  • The emergence of a new web pattern of “server pushing information” to the browser, commonly referred to as Comet.
  • ProjectZero, which is IBM's answer to rapid and 'Zero' obstacle development of Web oriented applications.
  • TiddlyWiki – I wonder why I did not come across this before! It is absolutely fabulous and the idea of a single page self-contained wiki is just too good to believe and sometimes scary. 🙂

Also, of late, I started getting interested in analysis of web activities.

A day of effort, and some hacking of the Firefox history and a tool called RapidMiner helped me get some insights into my browsing habits, which I had never thought about before. I noticed a pattern in the way I come across new topics. Also I learnt about the way I get to certain frequently accessed sites and what I can do to get to certain information quicker than ever before. Finally, I realized that the new del.icio.us Firefox extension has helped me improve my browsing habits and made my bookmarks more valuable.

It is really interesting to see what other 'inferences' are possible with the data that is already available! Considering that data today is openly available in a wide variety of open formats, it is possible to fetch all this data, feed it to some analyzer, get some interesting insights and use them to make your web journey more fruitful. The Flickr Cluster experiments are just the tip of the iceberg!

Some projects/tools related to this are APML, ManyEyes.

Ok, I have written about a wide variety of topics that I am currently finding interesting.

So, finally, back to the question I started off with. Have I stopped blogging? The answer definitely has got to be a 'No'!