Moving your WordPress blog from Apache to Cherokee in 30 minutes

In my post on VPS Hosting experiences, I mentioned that in spite of tweaking the server configuration, load times were gradually increasing, and that I was experimenting with an alternative server named Cherokee.

The whole migration took less than half a day – including learning Cherokee, trying out locally and then using it in my blog. So what are the steps I followed to move to Cherokee?

I use Ubuntu 10.04 LTS as my dev system as well as on the production server – so one of the things that I am confident about is that, if something works in my dev environment, it is bound to work in the production setup, with minimal pains during deployment. So I wanted to first try out the entire setup – make sure everything is fine, and then replicate the setup on the production server.

I started by installing Cherokee from the PPA and also php5-cgi:

add-apt-repository ppa:cherokee-webserver/ppa
apt-get update
apt-get install cherokee
apt-get install php5-cgi


VPS Hosting Experiences

So after the frustrating experiences with my shared hosting provider, I decided to move to VPS hosting once and for all. I knew that this would mean spending more money, and more time and energy tweaking configurations and monitoring the site, but I thought it would be worth the effort and price for the flexibility I would get.

So sometime in late December, I made the move. After looking around and asking a few people, I finally decided to go with VPS.net. The move from shared hosting to VPS was a breeze and I was up and running in under 2 hours. The experience with VPS.net so far has been pleasant.

Meanwhile, I am closely monitoring Google Webmaster Central and there are some very interesting observations and that is what I wanted to share here.

Google Webmaster Central data for buzypi.in
  • Gzipped Content
    The first observation is how, when I moved from shared hosting to VPS, the data download size reduced drastically with no significant change in the number of pages crawled per day. This is because I use GZIP encoding, while my shared host did not (when you pay for bandwidth there is no incentive to reduce the size, now is there?!)
  • Improvement in load times
    The second observation is how the time to download also reduced drastically when I moved to VPS hosting. This was expected: my server now runs only my services, whereas I have no idea how many other websites were being served from my shared host.
  • Server configuration tweaking
    Towards mid Jan, the load times started increasing. This is because I had a few other services hosted on the same machine and the server started thrashing. The biggest issue with most VPS providers is that they are very lenient on bandwidth and storage, but very stingy when it comes to memory. So I had 2 choices – either upgrade my configuration and pay nearly twice the price, or start playing with the Apache and PHP configurations and see if I could squeeze more performance out of the system. I decided to go for the latter. I cut down on the services hosted, disabled unnecessary modules, played with threads and child processes, and tweaked the PHP configuration. But no matter what I did, the load times stayed up there or, worse, continued to increase, and there was nothing more I could think of.
    Recently a friend of mine asked me to give Cherokee a try. Cherokee is considered to be blazingly fast and very lightweight compared to Apache. So I have moved my blog to Cherokee now and hope to monitor the performance closely over the next few days.
  • Google on steroids
    Another observation is how Google suddenly decided to give my site a real test – and decided to download virtually all the pages possible in a single day – this happened a couple of days back and I am yet to discover why this happened. What I am happy about is that the load times were decent when this happened.
  • Load times and Google Ranking
    I can confirm that there is some correlation between page load times and rankings in Google. In December, when my site was taking as many as 3 seconds to load (Google said my site was slower than 94% of the sites in the world!) – some of the keywords for which my posts used to appear on the first page moved to the second or third pages. It was only in January that I saw them come back to their original positions.
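
To get a feel for the Gzipped Content observation above, here is a minimal Python sketch that compresses a repetitive chunk of HTML and compares sizes. The markup is a made-up stand-in, not a page from this blog, and the actual savings depend entirely on the content being served.

```python
import gzip

# Hypothetical page body: repetitive HTML, loosely similar to a blog page.
html = ("<div class='post'><p>Here is a test line</p></div>\n" * 200).encode("utf-8")

# Compress the body the way a server would before sending it with
# Content-Encoding: gzip.
compressed = gzip.compress(html)

original_size = len(html)
compressed_size = len(compressed)
ratio = compressed_size / original_size

print("original bytes:  ", original_size)
print("compressed bytes:", compressed_size)
print("compressed/original: %.3f" % ratio)
```

Markup is highly repetitive, so it tends to compress well. Images and other already-compressed files will not shrink much.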

Overall, it has been a good experience – you learn a lot when you move to a VPS!

Google Docs, ODF and Data Portability

Consider the code below to display a line of text in HTML:


<style>
.paragraph-text {font-family: Arial; font-size: 11pt; font-weight: normal; text-decoration: none}
</style>
...
<p><span class="paragraph-text">Here is a test line</span></p>

Now let’s say, we see some developer write it this way:


<style>
.T1_1 {font-family: Arial; font-size: 11pt; font-weight: normal; text-decoration: none}
.T1_2 {font-family: Arial; font-size: 11pt; font-weight: normal; text-decoration: none}
.T1_3 {font-family: Arial; font-size: 11pt; font-weight: normal; text-decoration: none}
.T1_4 {font-family: Arial; font-size: 11pt; font-weight: normal; text-decoration: none}
.T1_5 {font-family: Arial; font-size: 11pt; font-weight: normal; text-decoration: none}
.T1_6 {font-family: Arial; font-size: 11pt; font-weight: normal; text-decoration: none}
.T1_7 {font-family: Arial; font-size: 11pt; font-weight: normal; text-decoration: none}
.T1_8 {font-family: Arial; font-size: 11pt; font-weight: normal; text-decoration: none}
.T1_9 {font-family: Arial; font-size: 11pt; font-weight: normal; text-decoration: none}
</style>
...
<p class="P1">
<span class="T1_1">Here</span>
<span class="T1_2"> </span>
<span class="T1_3">is</span>
<span class="T1_4"> </span>
<span class="T1_5">a</span>
<span class="T1_6"> </span>
<span class="T1_7">test</span>
<span class="T1_8"> </span>
<span class="T1_9">line</span>
</p>

What would you say of the quality of the markup above?


Why Google AppEngine still sucks

Last June, when I built the Twitter Trending Topics app using Google AppEngine, I mentioned quite a few issues with building applications on Google AppEngine. After giving it about 9 months to mature, I thought I would take a look at it again with a fresh perspective on where it stands.

The first thing that I wanted to try was to revive my old application. The application has been inactive because it has surpassed the total stored data quota and I never managed to find time to revive it.

One of the biggest issues that I mentioned last time was the inability to delete data from the application easily. There is an upper limit of 1GB on the total stored data. Considering that the data is schema-less (which means that you need more space to store the same data when compared to relational databases), this upper limit is severely restrictive when compared to the other quota limits that are imposed. There were about 800,000 entries of a single kind (the equivalent of a table) that I had to delete!

So I started looking for ways to delete all the data available and came across this post. I decided to go with the approach mentioned here. The approach still seems to be to delete data in chunks and there is no simple way out. The maximum number of entries allowed in a fetch call is 500, which means I require 1600 calls to delete all the data.

Anyway, so I wrote a simple script as mentioned in the post above and executed it. I experimented with various chunk values and saw that 300 was the size that worked optimally; anything more either seemed to take a lot of time or actually timed out.

Here is the code that I executed:


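from google.appengine.ext import db
from <store> import <kind>


def delete_all():
   # Delete entities in chunks of 300 until none are left.
   i = 0
   while True:
      entries = <kind>.all().fetch(300)
      if not entries:
         break
      db.delete(entries)
      # Count and print the chunks deleted so far.
      i = i + 1
      print i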

I saved this file as purger.py and executed it as:

$ python appengine_console.py twitter-trending-topics
App Engine interactive console for twitter-trending-topics
>>> import purger
>>> purger.delete_all()

A seemingly simple script, but after about a couple of hours of execution (after having deleted roughly 200,000 entries), I started seeing a 503 Service Unavailable exception. I thought this was to do with some network issues, but realized soon that this was not the case. I had run out of my CPU time quota!

To delete 200,000 entries the engine had taken up 6.5 CPU hours, and it managed to do this in less than 2 hours! According to the graphs, it had assigned 4 CPU cores to the task. At this rate, it will take me 4 days just to delete the data from my application. The Datastore CPU time quota is 62.11 hours, but there is an upper cap of 6.5 hours on the Total CPU time quota – the Datastore CPU time quota is not counted separately. I am not sure how this works!


As seen in the screenshot above, the script executed for about 2 hours before running out of CPU. There was no other appreciable CPU usage in the last 24 hours. Considering that there was no other task taking up CPU, the 6.42 hours of Datastore CPU time seems to be included in the 6.5 hours of Total CPU time. So how am I supposed to utilize the rest of the 55 hours of Datastore CPU time?
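
The numbers above can be checked with a few lines of Python. This is just the arithmetic from the post, assuming the deletion rate stays at roughly 200,000 entries per day of CPU quota. The variable names are mine.

```python
import math

total_entries = 800000      # entries of the kind to delete
deleted_per_day = 200000    # roughly what one day's CPU quota covered
cpu_hours_used = 6.5        # Total CPU time quota consumed by that run
max_fetch = 500             # upper limit on entries per fetch call
chunk_size = 300            # chunk size that worked in practice

# Days needed if each day's quota deletes the same number of entries.
days_needed = math.ceil(total_entries / deleted_per_day)

# Number of fetch-and-delete calls at the maximum and at the practical chunk size.
calls_at_limit = math.ceil(total_entries / max_fetch)
calls_at_chunk = math.ceil(total_entries / chunk_size)

# Average CPU time charged per deleted entry, in seconds.
cpu_seconds_per_entry = cpu_hours_used * 3600 / deleted_per_day

print("days needed:", days_needed)
print("calls at 500 per fetch:", calls_at_limit)
print("calls at 300 per fetch:", calls_at_chunk)
print("CPU seconds per entry: %.3f" % cpu_seconds_per_entry)
```

Roughly a tenth of a CPU second per deleted entry is what makes the daily quota run out so quickly.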

I am not sure if I am doing something wrong, but assuming there are no better ways of doing things, here are my observations:

  • It is easy to get data into the system
  • It is not easy to query the data (there is an upper limit of 500 and considering that joins are done in code, this is severely restrictive).
  • There is a total storage limit of 1GB for the free account
  • It is not easy to purge entities – the simplest way to delete data is to delete them in chunks
  • Deleting data is highly CPU intensive – and you can run out of CPU quota fairly quickly.

So what kind of applications can we build that are neither IO intensive nor CPU intensive? What is Google’s strategy here? Am I missing something? Is anything wrong with my analysis?

Google Reader – Mark Until Current As Read

I am an ardent feed consumer. I easily have over 300 feeds in my Google Reader and read them whenever I get a chance. The feeds include technology blogs, photography blogs, local news, startup blogs, blogs by famous people, blogs that help me in my projects etc.

It’s just not possible for me to visit every feed category every day, so I frequently see some of these categories overflow with posts.

Now I know there are extensive blog posts which describe how to better manage feeds and to cut down on information overload. But as we all know there is no simple solution.

So here I was using Google Reader and just skimming through the posts when I came across this need.

Suppose a feed has about 100 unread posts. I have skimmed through half of them and read one in between that I thought was interesting. I am now left with quite a few posts above the one I read that I am not interested in reading, but that I want to mark as read so I don’t see them again. Would it be possible to mark these as read, leaving the rest untouched?

The recent changes to Google Reader provide one option – Mark all entries older than a day, week or month as read. But this does not exactly serve the purpose.

I ended up hacking a Greasemonkey script to do exactly what I wanted.

Here is how the script behaves:

Just press Ctrl+Alt+Y and the script will mark all entries above the current read entry as ‘read’. Ctrl+Alt+I will mark all entries below the current entry as read – for people who read backwards. 🙂

Added benefits:

  • This also works with search results in Google Reader.
  • The script works with entire folders, so you can skim through all posts in a folder marking the ones you have skimmed as read.

How it works:
The script uses the CSS class names to determine which posts are unread above (or below) the current post. Once it obtains this list, it simulates a click on each of these posts and thereby marks them as read. Simple as that!
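
The actual script is a Greasemonkey (JavaScript) script that works on the page's DOM. The selection logic alone can be sketched in Python as below. The entries are modelled as plain dictionaries, and the class names "read" and "current" are placeholders I made up for illustration, not the names Google Reader uses.

```python
def entries_to_mark(entries, current_index, direction="above"):
    """Return indexes of unread entries above (or below) the current one."""
    if direction == "above":
        candidates = range(0, current_index)
    else:
        candidates = range(current_index + 1, len(entries))
    # An entry counts as unread when it lacks the (hypothetical) "read" class.
    return [i for i in candidates if "read" not in entries[i]["classes"]]


# Hypothetical feed listing; the fourth entry is the one currently open.
entries = [
    {"title": "Post A", "classes": {"entry"}},
    {"title": "Post B", "classes": {"entry", "read"}},
    {"title": "Post C", "classes": {"entry"}},
    {"title": "Post D", "classes": {"entry", "read", "current"}},
    {"title": "Post E", "classes": {"entry"}},
]

# Mark everything above the current entry as read. Adding the class here
# stands in for the simulated click in the real script.
for i in entries_to_mark(entries, current_index=3):
    entries[i]["classes"].add("read")

print([e["title"] for e in entries if "read" not in e["classes"]])
```

Passing direction="below" gives the Ctrl+Alt+I behaviour described above.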

This script is part of the Better GReader extension and has been featured on Lifehacker.

In order to install the Google Reader – Mark Until Current As Read script, visit this site.

What I dislike about Google AppEngine

When I built the ‘Twitter Trending Topics’ application, one of the things I had in mind was to see how quickly an application can be built in the most economical way.

While the application is working like a charm, a day into the launch, I already see a few issues with the hosting solution that I chose, the Google AppEngine.


My weekend hack – Twitter Trending Topics

I got this idea of building an application which pulls all the pages mentioned in the trending topics on Twitter. Why would that be useful? Well, it’s the simplest replacement for Google News, but more real time and no tweet noise.



Here are the steps I followed to build this application:

  1. The first step was to use IPython and use the Twitter Search API to get the latest tweets.
  2. I then wrote the code to parse these tweets looking for URLs in them.
  3. The next step was to get the content from these URLs, and get the title of the pages.
  4. Next, I had to persist it in the store.
  5. Slap a front-end and allow navigation. At this point, the obvious choice for me was Google AppEngine, since it is the cheapest hosting alternative available. I had to make some changes to the application to accommodate it to Google AppEngine’s requirements, but they were mostly trivial.
  6. Build the styles, the icons, the pretty URLs and you are done!
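
Steps 2 and 3 can be sketched roughly as follows. This is not the application's code, just a minimal illustration using the Python standard library. The URL pattern is deliberately simple, the sample tweet and page are made up, and fetching the page over the network is left out so the example stays self-contained.

```python
import re
from html.parser import HTMLParser

# Simple pattern for http/https URLs; real tweets may need something stricter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(tweet_text):
    """Return the URLs found in a tweet, with trailing punctuation trimmed."""
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(tweet_text)]


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def page_title(html_text):
    """Return the stripped title of an HTML document, or '' if none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip()


# Made-up inputs standing in for search results and a fetched page.
tweets = ["Reading http://example.com/post, quite good", "no links here"]
urls = [u for t in tweets for u in extract_urls(t)]
title = page_title("<html><head><title> Example Post </title></head></html>")

print(urls)
print(title)
```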

The initial setup of the application was done in less than 2 hours!

You can access the application here: Twitter Trending Topics.

There are a few known bugs, but the overall results are impressive.

Experiment with Delicious and Python

Once in a while, I look at my Delicious bookmarks to get an idea of what I have been up to recently. The ‘Current Interests’ tool was written with exactly that in mind.

I began to wonder if my bookmarks can give me an idea of trends in technology and my interest in them. So I quickly wrote a Python script to give me the top tags in each year and here are the results.
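
The idea of counting top tags per year can be sketched like this. It is not the script from the post. The bookmark records below are invented, and getting the real data out of Delicious is not shown.

```python
from collections import Counter, defaultdict

# Hypothetical bookmark records: (year saved, list of tags).
bookmarks = [
    (2007, ["python", "ajax"]),
    (2007, ["ajax", "javascript"]),
    (2008, ["python", "semanticweb"]),
    (2008, ["python", "rdf"]),
]


def top_tags_by_year(records, n=3):
    """Return {year: [(tag, count), ...]} with the n most used tags per year."""
    counts = defaultdict(Counter)
    for year, tags in records:
        counts[year].update(tags)
    return {year: counter.most_common(n) for year, counter in sorted(counts.items())}


result = top_tags_by_year(bookmarks, n=1)
for year, tags in result.items():
    print(year, tags)
```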


Optimizing website bandwidth consumption

Ever since I started hosting this blog on WordPress, I have been looking for patterns and tweaks to optimize my bandwidth consumption. Most hosting providers charge you for your bandwidth, so it is always a good idea to see where you can cut down on bandwidth consumption.

When I say optimize, I don’t mean reducing size unnecessarily. A good theme with good content in it definitely adds to the user experience of visitors to your site, and bandwidth optimization shouldn’t come at its expense. A lot of these points are closely related to the SEO of your site.

So here are some things you can do to optimize the bandwidth.


Why it is the way it is – Hypertext Design Issues

This post is an analysis of an early document on Hypertext Design Issues.

The key ideas discussed in this document concern hypertext: whether links should be monodirectional or bidirectional, whether links should be typed, and so on.

These discussions were conducted in the early days of the web. It is interesting to know how things have evolved since the time this design was made.

Let’s first get some facts right:
Hypertext links today:

  • Are Two-ended
  • Are Monodirectional
  • Have one link per anchor
  • Are Untyped
  • Contain no ancillary information
  • Don’t have preview information

What are the implications of this design?

  • Hyperlinks are not multiended. A single link cannot link to multiple destinations. There are however cases when one to many, many to one and many to many ‘links’ might make sense. These types of connections among information nodes are what RDF/OWL help achieve.
  • Bidirectional links: an advantage is that often, when a link is made between two nodes, it is made in one direction in the mind of its author, but another reader may be more interested in the reverse link.
    Bloggers want to track those pages that have linked to their posts. Google indexes allow us to track links to a particular page. Linkback mechanisms have evolved in the Blogger world to serve precisely this purpose. In general, however, we never know who has linked to our page.
  • It may be useful to have bidirectional links from the point of view of managing data. For example: if a document is destroyed or moved, one is aware of what dangling links will be created, and can possibly fix them.
    This problem has not yet been solved. Since links are monodirectional, dangling links cannot be detected: when the information linked to changes or moves, there is no way to clean up the links pointing to it.
  • About anchors having one or more links: This is still debatable. There are some utilities that allow you to make every word a hyperlink and allow executing a host of ‘commands’ on the word. Ex: Perform a Google search for the word, lookup the word in dictionary.com, map the word (if it is a city) or lookup in Wikipedia. However I am not a big fan of these utilities since I feel it clutters the screen and the context detection is not yet great.
  • Typed links: I feel this is the single most important thing missing from hyperlinks on the WWW. While making types mandatory would have complicated the issue, a standard way to provide ‘types’ to links should have been provided. Anyway, it’s the way it is. So how are people solving this issue? Microformats and RDFa are 2 things I know of. The data is mostly silently read by the browser and tools, and users are usually unaware of this data in the pages. In other words, the User Interface for typed links is still not great.
  • Meta information associated with links. Interesting! I am aware of Wikipedia articles containing the date when the page was last visited but this is pretty much manually updated as far as I know.
  • Preview information: Snap solves this very issue.
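
To make the points on monodirectional links and dangling links concrete, here is a small Python sketch. It assumes you already hold the complete set of forward links for a group of pages, which on the open web only a crawler does. The page names are invented.

```python
from collections import defaultdict

# Hypothetical forward links: page -> pages it links to.
forward_links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "gone.html"],
    "c.html": [],
}


def backlinks(links):
    """Invert forward links so each target lists the pages linking to it."""
    reverse = defaultdict(set)
    for source, targets in links.items():
        for target in targets:
            reverse[target].add(source)
    return reverse


def dangling(links):
    """Return (source, target) pairs whose target is not a known page."""
    return sorted((s, t) for s, ts in links.items() for t in ts if t not in links)


reverse = backlinks(forward_links)
broken = dangling(forward_links)

print(sorted(reverse["c.html"]))
print(broken)
```

With only the forward links stored on each page, the author of c.html has no way to compute either result. That is the gap that search indexes and linkback mechanisms try to fill.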

The conclusion?
Well, it’s tough to say how optimal the design of hypertext on the WWW was. Introducing multi-directional links and typed links would definitely help the technical people out there, but would introduce complexity which would perhaps have made it so tough for the web to flourish that it wouldn’t be what it is today.