Monday, December 22, 2008

Website statistics: Roll your own real-time Google Analytics in 5 minutes!

Hello my fellow hackers! I want to talk about something we all freak out about at least once in our lives: website statistics.

Website statistics have been a headache for webmasters since the Internet became the Inter-net. It is fundamental to know how many people see your site every day, what they do there, where they come from, what OS and browser they are using, etc. There are a lot of reasons for this: optimizing your site to get traffic, optimizing bandwidth and (in my case) just watching endlessly, frenetically and insanely how the visit counter crawls up ;).

Though in the early days of the Internet these solutions weren't so easily available, in this ever-changing, over-connected and wonderful world there is a huge number of options to choose from. The most popular and widely used (IMO) is Google Analytics.

So why bother making our own then? I won't say I don't like Google Analytics, because I really do: I think it's a wonderful solution and a must-have tool for any webmaster. However, there is something I personally don't like: the fact that it is not real-time. Although this is perfectly justified (they must index and analyze millions of sites, and real-timeness is difficult to achieve at such scales), I still wanna see my lil' counter crawl up!!!

The question is, do YOU want to see your lil' counter crawl? From a practical point of view, I'd say probably not, since having the information available by the next day is usually enough to do all your work. But actually I think you probably do. Why? Well, I guess we techies are just like that :) I therefore decided to make my own lil' analytics and set it up on my blog today. So let's get to it.

First of all I'd like to start with how statistics work and how we get them. As most of you already know, in ancient times only log statistics were available. That is, you'd just walk through your HTTP server's log and build statistics from it. That was great since it was simple, totally passive (no modification needed to the site), and you could see it in real time (a tail -f access.log would show it all). The cons: some things you may find interesting about your users, such as their screen resolution, were not logged by the server and hence stayed out of your stats.
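
As a rough illustration of log-based statistics, here is a minimal sketch that counts hits per requested path. It assumes Apache's common/combined log format, where the request appears in quotes as "METHOD path protocol"; the function name countHits and the sample lines are made up for the example, and it is meant to be run with Node.js rather than in the browser.

```javascript
// Count hits per requested path from access-log lines.
// Assumes each line contains the request in quotes as "METHOD /path PROTOCOL",
// as in Apache's common and combined log formats.
function countHits(lines) {
  const hits = {};
  for (const line of lines) {
    const match = /"[A-Z]+ (\S+) [^"]*"/.exec(line);
    if (!match) continue; // skip lines that do not look like requests
    const path = match[1];
    hits[path] = (hits[path] || 0) + 1;
  }
  return hits;
}

// Made-up sample lines, for illustration only.
const sample = [
  '127.0.0.1 - - [23/Dec/2008:01:21:29 +0100] "GET /index.html HTTP/1.1" 200 512',
  '127.0.0.1 - - [23/Dec/2008:01:21:35 +0100] "GET /about.html HTTP/1.1" 200 734',
  '127.0.0.1 - - [23/Dec/2008:01:22:02 +0100] "GET /index.html HTTP/1.1" 200 512'
];
console.log(countHits(sample));
```

Everything here comes from what the server already logs, which is exactly why details like screen resolution never show up.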

Now, how does Google Analytics work? You basically embed a little piece of JavaScript code in every page of your site that you want to analyze and you are all set. Then all you need to do is log into your Google account and your stats are there. The con: as I said, it's usually updated after 24 or 48 hours. And how do they do it? Well, basically, this little script makes your browser contact one of their servers while the page is loading, sending all the necessary information to the stats collector.

So let's do the same!

First we will start with the JavaScript code. You can save this as stats.js:

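// Collect the client details we want to record.
var data = [
    document.referrer,
    navigator.userAgent,
    screen.width + "x" + screen.height,
    screen.colorDepth
];
// Join the fields with "&", keeping a trailing "&" after the last one.
// Note: the values are not escaped, so a "&" inside a value would shift the fields.
var query = data.join("&") + "&";
// Setting src is enough to trigger the request; the image is never added to the page.
var img = document.createElement("img");
img.setAttribute("src", "http://yoursite.com/analyz0r.gif|" + query);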

OK, this little monster does all the magic. To embed it in your site you'd simply add:

<script type="text/javascript" src="stats.js"></script>

What will this do? It will simply make the browser try to load an image called analyz0r.gif from your server, sending all the information we want about the client along with the request. The image can just be missing (hence the 404 in the log below); we don't really care. We are combining log analysis with JavaScript here. We will get something like this in the server's log:

exe@melange:~/workz/stats$ cat /var/log/apache2/access.log
127.0.0.1 - - [23/Dec/2008:01:21:29 +0100] "GET /analyz0r.gif|http://localhost/&Mozilla/5.0%20(X11;%20U;%20Linux%20i686;%20en-US;%20rv:1.9.0.5)%20Gecko/2008121622%20Ubuntu/8.10%20(intrepid)%20Firefox/3.0.5&1680x1050&24& HTTP/1.1" 404 334 "http://localhost/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.5) Gecko/2008121622 Ubuntu/8.10 (intrepid) Firefox/3.0.5"

Although it looks ugly, this is easily parseable into meaningful data. For example, we can use this simple pipeline to export it to a CSV-like format (with & as the separator) that you can then load into your preferred spreadsheet.

exe@melange:~/workz/stats$ grep analyz0r /var/log/apache2/access.log | sed 's/%20/ /g' | cut -d'|' -f2 | cut -d'&' -f1-4
http://localhost/&Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.5) Gecko/2008121622 Ubuntu/8.10 (intrepid) Firefox/3.0.5&1680x1050&24

The format is simple: referrer & user agent & resolution & depth. You can modify the script to add any other fields you like.
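
One caveat when adding fields: a value that itself contains & or | (a referrer with a query string, for instance) would shift the columns. A minimal sketch of one way around that is to percent-encode each value with encodeURIComponent before joining. The helper name buildBeaconUrl and the extra navigator.language field are assumptions for illustration, not part of the script above, and the encoded output would need decodeURIComponent on the parsing side instead of the sed step.

```javascript
// Build the beacon URL from a list of field values.
// Each value is percent-encoded so that "&" or "|" inside a value
// cannot be mistaken for a separator when the log is parsed later.
function buildBeaconUrl(base, fields) {
  const encoded = fields.map((value) => encodeURIComponent(String(value)));
  return base + "|" + encoded.join("&") + "&";
}

// Browser-only part: collect the fields and request the image.
if (typeof document !== "undefined") {
  const fields = [
    document.referrer,
    navigator.userAgent,
    screen.width + "x" + screen.height,
    screen.colorDepth,
    navigator.language // extra field, added purely as an example
  ];
  const img = document.createElement("img");
  img.src = buildBeaconUrl("http://yoursite.com/analyz0r.gif", fields);
}
```

Keeping the URL building in its own function means it can be tried outside the browser too, since only the last part touches document, navigator and screen.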

And it works with other browsers too! Look:

http://localhost/&Opera/9.61 (X11; Linux i686; U; en) Presto/2.1.1&1680x1050&24
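
To turn lines like these into actual statistics, something along these lines could tally them. This is only a sketch for Node.js: it assumes the four-field referrer&user agent&resolution&depth layout produced by the pipeline above, and that no field contains & itself. The function names parseRecord and tally are invented for the example.

```javascript
// Split one exported line into named fields.
// Assumes the "referrer&user agent&resolution&depth" layout and
// that no field contains "&" itself.
function parseRecord(line) {
  const [referrer, userAgent, resolution, depth] = line.split("&");
  return { referrer, userAgent, resolution, depth: Number(depth) };
}

// Count how many records share each value of the given field.
function tally(records, field) {
  const counts = {};
  for (const record of records) {
    const key = record[field];
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// The two exported lines shown in this post.
const lines = [
  "http://localhost/&Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.5) Gecko/2008121622 Ubuntu/8.10 (intrepid) Firefox/3.0.5&1680x1050&24",
  "http://localhost/&Opera/9.61 (X11; Linux i686; U; en) Presto/2.1.1&1680x1050&24"
];
const records = lines.map(parseRecord);
console.log(tally(records, "resolution"));
```

Both sample lines end up counted under the same 1680x1050 resolution; passing "userAgent" or "referrer" instead gives the breakdown for those fields.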

Happy analyzing!