Parsing and Summarizing a Logfile
Randal L. Schwartz
I recently put www.stonehenge.com
behind a caching reverse-proxy, and rather than switch technologies, I'm
using another instance of a stripped-down Apache server to do the job. But what
kind of job is it doing? How many of my hits are being cached and delivered
by the lightweight front servers, instead of going all the way through to the
heavy mod_perl_and_everything_else backend servers?
Luckily, I have included the caching information in the access log file, thanks
to the CustomLog and LogFormat directives:
LogFormat "[virt=%v] %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \
\"%{User-Agent}i\" \"%{X-Cache}o\"" combined
CustomLog var/log/access_log combined
I have added a virtual host entry (for tracking) to the front of the line,
and the X-Cache header of the response to the end of the line. Of course,
doing so means my access log is not in a standard format any more, so I can't
use off-the-shelf tools for log analysis. That's okay, because I'm
pretty good at writing my own data-reduction tools. A typical output line looks
like this:
[virt=www.stonehenge.com] 192.168.42.69 - - [10/May/2002:01:51:50 \
-0700] "GET /merlyn/UnixReview/ HTTP/1.0" 200 101324 "-" \
"Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)" "MISS \
from www.stonehenge.com"
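To make the field layout concrete, here is a minimal sketch in Python of how one such line could be picked apart with a single regular expression. The field order follows the LogFormat directive above; the group names (virt, host, xcache, and so on) and the parse_line helper are made up for illustration, and the sample is the line shown above with its continuation backslashes joined back into one line.

```python
import re

# One named group per field, in the order given by the LogFormat directive.
# The group names are illustrative assumptions, not Apache terminology.
QUOTED = r'(?:[^"\\]|\\.)*'   # body of a quoted field, allowing \" escapes
LINE_RE = re.compile(
    r'\[virt=(?P<virt>[^\]]*)\] '
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    rf'"(?P<request>{QUOTED})" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    rf'"(?P<referer>{QUOTED})" '
    rf'"(?P<agent>{QUOTED})" '
    rf'"(?P<xcache>{QUOTED})"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the format."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None

# The sample line from the text, rejoined into a single line.
sample = (
    '[virt=www.stonehenge.com] 192.168.42.69 - - [10/May/2002:01:51:50 '
    '-0700] "GET /merlyn/UnixReview/ HTTP/1.0" 200 101324 "-" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)" "MISS '
    'from www.stonehenge.com"'
)
fields = parse_line(sample)
print(fields["time"], fields["status"], fields["xcache"])
```

Lines that do not fit the pattern come back as None, so a caller can count or skip them instead of failing partway through a large log.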
For my analysis, I wanted to see how many of those X-Cache fields began
with HIT and how many with MISS, indicating that the mod_proxy module
had gone all the way to the backend server and had either gotten a good
cacheable hit or had to regenerate the response. I also wanted the data
summarized on an hour-by-hour basis, in a CSV-style file so I could pull it
into my favorite spreadsheet to do graphs and formulas.
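The shape of that summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the X-Cache value is the last quoted field on the line, the hour is taken from the logged timestamp as written (the zone offset is ignored), anything that is neither HIT nor MISS is lumped into an OTHER column, and month names are parsed in an English locale. The helper names and the three demo lines are made up for the example, not taken from a real log.

```python
import csv
import io
import re
from collections import Counter
from datetime import datetime

# Capture the timestamp down to the hour, plus the last quoted field (X-Cache).
STAMP_AND_CACHE = re.compile(
    r'\[(\d{2}/\w{3}/\d{4}:\d{2}):[^\]]*\].*"([^"]*)"\s*$'
)
KINDS = ("HIT", "MISS", "OTHER")

def summarize(lines):
    """Count HIT, MISS, and OTHER responses per clock hour."""
    counts = Counter()
    for line in lines:
        m = STAMP_AND_CACHE.search(line)
        if not m:
            continue  # skip lines that do not look like our log format
        hour = datetime.strptime(m.group(1), "%d/%b/%Y:%H")
        word = m.group(2).split(" ", 1)[0]
        counts[hour, word if word in ("HIT", "MISS") else "OTHER"] += 1
    return counts

def write_csv(counts, out):
    """Write one CSV row per hour, in chronological order."""
    writer = csv.writer(out)
    writer.writerow(["hour", *KINDS])
    for hour in sorted({h for h, _ in counts}):
        writer.writerow(
            [hour.strftime("%Y-%m-%d %H:00")] + [counts[hour, k] for k in KINDS]
        )

# Made-up demo lines in the custom format, for illustration only.
prefix = '[virt=www.stonehenge.com] 192.168.42.69 - - '
middle = ' "GET / HTTP/1.0" 200 1024 "-" "Mozilla/4.0" '
demo = [
    prefix + '[10/May/2002:01:51:50 -0700]' + middle + '"MISS from www.stonehenge.com"',
    prefix + '[10/May/2002:01:59:02 -0700]' + middle + '"HIT from www.stonehenge.com"',
    prefix + '[10/May/2002:02:03:10 -0700]' + middle + '"-"',
]
buf = io.StringIO()
write_csv(summarize(demo), buf)
print(buf.getvalue())
```

Keying the counter on an (hour, kind) pair keeps the accumulation to one line, and sorting the distinct hours at output time means the log does not have to be read in strict time order.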