Danny Yee >> Web Design

Logs and Statistics

There's a lot of confusion about web server statistics and exactly what information they log. This is an attempt to explain the basics.

Introduction | Logs | What does it mean?


Introduction

When someone using a web browser requests a page from a web site, the web server typically records information about the request. The resulting web server logs (which are accessible to the person running the site) are the basis for most web site statistics.

Note: the following section (about the structure of logs) is somewhat technical. You might want to skip to "What does it mean?".


Logs

Here is a sample line from the logs for my book reviews. (It's from an Apache server, but others should be similar.)

192.168.1.1 - - [24/Oct/2000:18:11:12 -0400] "GET /h/The_Great_Human_Diasporas.html HTTP/1.0" 200 5326 "http://www.google.com/search?q=cavalli-sforza&btnG=Google+Search" "Mozilla/4.5 [en] (WinNT; U)"
192.168.1.1
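this is the network address of the machine making the request
- -
two fields for the remote identity and the authenticated user name; both are empty (shown as "-") in this example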
[24/Oct/2000:18:11:12 -0400]
the date and time of the request
"GET /h/The_Great_Human_Diasporas.html HTTP/1.0"
this is the request (which file was asked for). In this case it was for an html document (a page), for my review of The Great Human Diasporas.
200 5326
the status code (200 means the request succeeded) and the number of bytes transferred
"http://www.google.com/search?q=cavalli-sforza&btnG=Google+Search"
the page from which a link was followed (the referer). This example is someone searching for "cavalli-sforza" using Google.
"Mozilla/4.5 [en] (WinNT; U)"
the user's operating system and web browser (the agent). In this case, it was Netscape running on Windows NT.
This is the basic information that is recorded (though browsers don't always provide the referer and agent information). Cookies may provide the server with information handed to the user/client on an earlier visit (for more information about cookies, see Roger Clarke's cookie page). And of course any additional information you provide in forms may be recorded.
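For readers who want to pull these fields out programmatically, here is a minimal sketch in Python. It assumes the log uses exactly the layout of the sample line above; the names LOG_PATTERN and parse_line are my own, and the regular expression is an illustration rather than a complete parser (it will not cope with quotation marks inside the quoted fields, for instance).

```python
import re
from datetime import datetime

# Pattern for the layout of the sample line: host, two fields that are "-"
# in the sample, timestamp, request, status, bytes, referer, agent.
# Assumption: quoted fields never contain a double quote themselves.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Some servers write "-" instead of a byte count; treat that as zero.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields


sample = (
    '192.168.1.1 - - [24/Oct/2000:18:11:12 -0400] '
    '"GET /h/The_Great_Human_Diasporas.html HTTP/1.0" '
    '200 5326 "http://www.google.com/search?q=cavalli-sforza&btnG=Google+Search" '
    '"Mozilla/4.5 [en] (WinNT; U)"'
)
entry = parse_line(sample)
print(entry["host"], entry["status"], entry["size"])  # 192.168.1.1 200 5326
```

Lines that do not fit the pattern come back as None, so a script can count them and skip them rather than stopping.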

What does it mean?

There are many analysis packages that will produce statistics from a web server log.

Each line in the log file is a hit or request. Because every image (and stylesheet) is fetched separately, "hits" is totally useless as a marketing/impact measure (it is however useful as a technical indicator of how much stress the web server might be under).

A request for an html document is a page access. Counting these gives a rough approximation to the number of times pages on the site have been viewed (page views), with some provisos. A site with frames will produce several (three or more) "page" accesses for each actual page view, since the frameset and each frame are separate requests. A proxy server may fetch the page once and then serve it to multiple clients (creating undercounting). And search engines and other automated spiders will often fetch every page on a site - without any of them being viewed by actual people. This can drastically inflate the (effective) page access counts, especially for low-traffic sites with a large number of pages.
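The difference between hits and page accesses, and the effect of spiders, can be sketched in a few lines of Python. Everything here is made up for illustration: the request list, the agent name "ExampleSpider/1.0", and the short SPIDER_MARKERS list (real spiders do not always identify themselves, so matching on the agent string is only a rough filter).

```python
# Each tuple is (host, requested path, agent). All values are invented.
requests = [
    ("192.168.1.1", "/h/page.html", "Mozilla/4.5 [en] (WinNT; U)"),
    ("192.168.1.1", "/style.css", "Mozilla/4.5 [en] (WinNT; U)"),
    ("192.168.1.1", "/images/logo.gif", "Mozilla/4.5 [en] (WinNT; U)"),
    ("192.168.1.2", "/h/page.html", "ExampleSpider/1.0"),
    ("192.168.1.2", "/h/other.html", "ExampleSpider/1.0"),
]

# Assumed, incomplete list of substrings that suggest an automated agent.
SPIDER_MARKERS = ("spider", "crawler", "bot")


def is_page(path):
    """Treat html documents and directory requests as pages."""
    return path.endswith((".html", ".htm", "/"))


def is_spider(agent):
    """Guess from the agent string whether the request was automated."""
    agent = agent.lower()
    return any(marker in agent for marker in SPIDER_MARKERS)


hits = len(requests)
page_accesses = sum(1 for _, path, _ in requests if is_page(path))
human_page_accesses = sum(
    1 for _, path, agent in requests if is_page(path) and not is_spider(agent)
)

print(hits, page_accesses, human_page_accesses)  # 5 3 1
```

In this toy log, one real page view produces three hits, and the spider accounts for two of the three page accesses.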

The number of unique hosts accessing a site is the number of unique network addresses making requests. This provides a rough approximation to the number of people viewing the site. Again, proxy servers cause undercounting, while someone with a dynamic address connecting at different times will be counted multiple times.
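Counting unique hosts is simple once the addresses have been extracted. The sketch below uses an invented list of addresses; in practice they would come from the first field of each log line.

```python
from collections import Counter

# The first field of each log line, in order of arrival (invented values).
hosts = [
    "192.168.1.1",
    "192.168.1.1",
    "192.168.1.2",
    "192.168.1.3",
    "192.168.1.1",
]

# Distinct addresses seen in the log.
unique_hosts = len(set(hosts))

# How many requests each address made.
requests_per_host = Counter(hosts)

print(unique_hosts)                      # 3
print(requests_per_host.most_common(1))  # [('192.168.1.1', 3)]
```

Note that this counts addresses, not people: everyone behind one proxy shows up as a single host, and one person with a dynamic address can show up as several.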

Some analysis software looks at the intervals between successive requests from the same host to estimate the number of visits. I don't know much about this, but except for analysis over really long periods, I suspect it won't vary that much from the unique hosts figure.
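One plausible way to estimate visits, sketched here in Python, is to start a new visit whenever a host has been silent for longer than some cutoff. The 30-minute cutoff, the function name count_visits, and the sample data are all assumptions of mine; I am not describing how any particular analysis package works.

```python
from datetime import datetime, timedelta

# Assumed cutoff: a gap longer than this starts a new visit.
VISIT_TIMEOUT = timedelta(minutes=30)


def count_visits(requests, timeout=VISIT_TIMEOUT):
    """Count visits in a list of (host, time) pairs.

    A request starts a new visit if the host has not been seen before,
    or if more than `timeout` has passed since its previous request.
    """
    last_seen = {}
    visits = 0
    for host, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(host)
        if previous is None or when - previous > timeout:
            visits += 1
        last_seen[host] = when
    return visits


# Invented sample: two hosts, one of which returns later the same evening.
requests = [
    ("192.168.1.1", datetime(2000, 10, 24, 18, 11, 12)),
    ("192.168.1.1", datetime(2000, 10, 24, 18, 15, 0)),
    ("192.168.1.2", datetime(2000, 10, 24, 18, 20, 0)),
    ("192.168.1.1", datetime(2000, 10, 24, 21, 0, 0)),
]

print(count_visits(requests))  # 3
```

Here two unique hosts produce three visits, because the first host comes back after a long gap. Over a short period with few returning hosts, the two figures would be the same.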

Figures produced using the same analysis package, on the one site, can be used to track changes over time. Trying to use absolute numbers, however, or comparing statistics from different sites is another matter - it's largely "smoke, mirrors, and spiders" and should be treated accordingly.

Seasonal variation

Changes in the number of page accesses from month to month may not indicate anything unusual. Traffic on all the sites I run drops off drastically mid-year, during the northern hemisphere holidays, and has done so consistently over the last five years.


Last modified: November 2000
