Log Analysis and PHP

As the Extension Categorization section of the PHP manual and the Wikipedia entry for PHP will attest, one of PHP’s greatest strengths is its level of integration with third-party libraries. If the functionality you need isn’t included in the officially supported extensions, chances are good you can find it in a package on PEAR or PECL. However, there are no guarantees, as I was reminded this past week when curiosity led me to check for a particular extension.

Log analysis is a fairly common task in the field of web development, most often analysis of web server traffic logs, or what Wikipedia refers to as web analytics. PHP has no officially supported extensions designed specifically for log analysis. There are no related extensions in PECL, and the only remotely related package in PEAR is PEAR_Log, which is for generating logs rather than parsing or analyzing them. In short, there is no common solution here.

At this point, most people choose one of two approaches. The first is to roll their own solution in PHP. While this has the advantage that it can be tailored to the specific data the programmer is trying to extract, it requires time to develop and carries the overhead of being written in an interpreted language.
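
As a rough illustration of the first approach, here is a minimal roll-your-own parser, assuming the Apache combined log format and counting hits per path; anything beyond a simple tally like this quickly grows into real development effort.

    <?php
    // A bare-bones "roll your own" parser for the Apache combined log format.
    // Purely illustrative: a real version needs more error handling, and it
    // still pays the cost of parsing every line in an interpreted language.
    $pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+) [^"]*" (\d{3}) \S+/';

    $hits = array();
    $fh = fopen('/var/log/httpd/access_log', 'r') or die("Cannot open log\n");
    while (($line = fgets($fh)) !== false) {
        if (preg_match($pattern, $line, $m)) {
            list(, $ip, $time, $path, $status) = $m;
            // Tally requests per path, the kind of aggregate most people want.
            $hits[$path] = isset($hits[$path]) ? $hits[$path] + 1 : 1;
        }
    }
    fclose($fh);

    arsort($hits);
    print_r(array_slice($hits, 0, 10, true));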

The other common approach is to use a third-party software package to do the log parsing and analysis, and to have PHP capture its output and extract whatever is needed from there. Common open source examples include analog, awstats, webalizer, and visitors. This generally performs somewhat better than the first approach (as most such software is C-based), but it is limited to the data and output formats the software provides, which can make it inflexible and tedious to integrate with.
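
As a sketch of the second approach, the PHP side usually amounts to shelling out to the analyzer and scraping its report; the command-line options and the report format below are placeholders that depend entirely on which tool you choose.

    <?php
    // Wrapping an external analyzer: run the tool, capture its report, and
    // scrape out the one figure we need. The command-line options and the
    // report layout vary by tool, so the command and regex below are only
    // placeholders for whatever your chosen analyzer actually produces.
    $logfile = '/var/log/httpd/access_log';
    $report  = shell_exec('visitors ' . escapeshellarg($logfile) . ' 2>&1');

    if ($report === null) {
        die("Could not run the log analyzer\n");
    }

    if (preg_match('/requests:\s*(\d+)/i', $report, $m)) {
        echo 'Total requests: ' . $m[1] . "\n";
    } else {
        echo "Could not find the figure in the report\n";
    }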

I believe a PECL extension for log analysis would be very advantageous. Written at the C level, its performance could be on par with existing solutions, and it could provide a flexible API for extracting the desired data. The underlying code could potentially be written as a separate library first, with a PECL extension that merely wraps it. When rolling their own solution, many developers dump their raw data into a database first in order to tap its analytic capabilities rather than trying to replicate them in their own code. As such, I wonder if it wouldn’t be worthwhile to have the API use a variant of SQL to allow the user to specify what data should be extracted.
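
To make that a little more concrete, here is a rough sketch of what userland code might look like; every function, constant, and the query dialect shown below are hypothetical, invented purely to illustrate the SQL-variant idea.

    <?php
    // Hypothetical usage of the proposed extension; none of this exists yet.
    // log_open(), LOG_FORMAT_COMBINED, and log_query() are invented names,
    // and the query dialect is only meant to illustrate the SQL-variant idea.
    $log = log_open('/var/log/httpd/access_log', LOG_FORMAT_COMBINED);

    // An SQL-like dialect lets the caller say exactly what to extract rather
    // than being limited to whatever reports a canned analyzer ships with.
    $result = log_query($log,
        'SELECT path, COUNT(*) AS hits
           FROM requests
          WHERE status = 200
          GROUP BY path
          ORDER BY hits DESC
          LIMIT 10');

    foreach ($result as $row) {
        printf("%s: %d hits\n", $row['path'], $row['hits']);
    }

    log_close($log);

One way to avoid writing a query engine from scratch would be to load the parsed records into an in-memory SQLite database and let it do the heavy lifting, though that is only one possible design.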

So what do you think? If you were a potential user, what would optimal test cases for such an API look like? The strength of open source software lies in the ability of its developers to offer and consider multiple perspectives. As such, if I’m going to take on this project, I’d like to hear what others have to say.

Update: I’ve just been told that development on phpOpenTracker has recently resumed after about a three-year hiatus. If you’re interested in a PHP-based web analytics solution, it might be worth a look.

Update 2: If you tried to get to the visitors link above and found it broken, it’s been fixed. Sorry about that.

2 Comments

  1. me says:

    What about http://www.phpmyvisites.net? It’s a very nice tool licensed under the GPL.

  2. Cups says:

    Sounds very interesting – here are some muddled thoughts.

    Experience tells me there is a lot of redundant work done in the name of stats, yet when I really need them, I either end up back in the log files in Vi or just can’t find what I need succinctly, wading through screen after screen of graphs.

    From a developer POV I really only look at logs when an exception is identified: lots of hits from a particular IP, or a massive ramp-up in requests to a page that normally has few hits. I want to filter out all the image hits, etc.

    I want it to contact me in the manner I require when an exception I have programmed into it happens.
    For my clients I will want different things. I have set up I don’t know how many stats procedures using analog, webtrends, etc., and the fact is that busy people just aren’t that interested in stats per se. They want the headlines sent to them when there is something worth telling them, and THEN they want to drill down/through to glean some supposed “fact”.

    I guess this means the API might start to look like:

    setResetFrequency($daterange);
    setAlertType("text", "007712345678");
    getTopRequests($daterange, $count /*, $directory is optional */);
    setIpCountTrigger($daterange, 100)->setAlertType("email", "me@this.com");

    Is that the kind of thing you want to write unit tests for?

    A good example is local elections, which only happen every 4 years. Nobody is interested in them at all, yet for a week prior, during, and a week after, everyone wants to know who was looking at particular results and why.

    So, I’d like to add to my CMS a couple of lines like:

    $counts = new stats($daterange, "/logfile/[pattern]");

    echo 'Between ' . $daterange[0] . ' and ' . $daterange[1] . ' this page has been requested ' . $counts->page($_SERVER['PHP_SELF']) . ' times';

    I rather like the idea of stats appearing on webpages as metadata; I like the immediacy and relevance, and it also provides a parallel means of navigation, which I am into.

    SQL, do you mean as in SQLite? Is that instead of a cache?

    How about stipulating PHP 5+, and using APC instead of/as well as a database? Or then again using SQLite’s in-memory cache.