As the Extension Categorization section of the PHP manual and the Wikipedia entry for PHP will attest, one of PHP’s greatest strengths is its level of integration with third-party libraries. If the functionality you need isn’t included in the officially supported extensions, chances are good you can find it in a package on PEAR or PECL. However, there are no guarantees, as I was reminded this past week when my curiosity was piqued enough to motivate me to check for a particular extension.
Log analysis is a fairly common task in the field of web development, most often analysis of web server traffic logs or what Wikipedia refers to as web analytics. PHP has no officially supported extensions designed specifically for log analysis. There are no related extensions in PECL. The only remotely related extension in PEAR is PEAR_Log, which is for generating logs rather than parsing or analyzing them. In short, there is no common solution here.
At this point, most people choose to do one of two things. The first is to roll their own solution in PHP. While this has the advantage that it can be tailored to the specific data the programmer is trying to extract, it takes time to develop and carries the overhead of being written in an interpreted language.
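To give a rough idea of what a roll-your-own parser involves, here is a minimal sketch, written in Python purely for brevity rather than PHP. It assumes the logs are in the Common Log Format; the names parse_line and count_statuses and the sample lines are made up for illustration and don't come from any existing package.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format):
#   host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent; treat it as zero bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


def count_statuses(lines):
    """Tally HTTP status codes, silently skipping lines that fail to parse."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[entry["status"]] += 1
    return counts


# Made-up sample lines for illustration only.
SAMPLE_LINES = [
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.1 - - [10/Oct/2000:13:55:40 -0700] "GET /missing.html HTTP/1.0" 404 -',
    'not a log line',
]

if __name__ == "__main__":
    print(count_statuses(SAMPLE_LINES))
```

Even this small version shows the trade-off: the parsing is easy to tailor, but every line passes through a regular expression in an interpreted language, and that cost grows with the size of the logs.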
The other common approach is to use a third-party software package to do the log parsing and analysis and to have PHP capture its output and extract whatever is needed from there. Common examples of such software that are open source include analog, awstats, webalizer, and visitors. This generally has somewhat better performance than the first approach (as most such software is C-based) but is limited to the data and output formats that the software provides, which can make it inflexible and difficult or tedious to integrate with.
I believe a PECL extension for log analysis would be very advantageous. Written in C, it could perform on par with existing solutions while providing a flexible API for extracting the desired data. The underlying code could potentially be written as a separate library first, with the PECL extension merely wrapping it. When rolling their own solution, many developers dump their raw data into a database first in order to tap its analytic capabilities rather than trying to replicate them in their own code. As such, I wonder if it wouldn’t be advantageous to have the API use a variant of SQL to allow the user to specify what data should be extracted.
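The database habit mentioned above might look something like the following sketch, again in Python for brevity. It loads already-parsed hits into an in-memory SQLite database and lets SQL do the aggregation. The hits table, its columns, the helper names, and the rows are all hypothetical; this is not a proposal for what the extension's API would actually be, only an illustration of the kind of query a SQL-like interface would need to express.

```python
import sqlite3

# Made-up, already-parsed hits: (host, path, status, size in bytes).
ROWS = [
    ("192.0.2.1", "/index.html", 200, 2326),
    ("192.0.2.1", "/about.html", 200, 1500),
    ("192.0.2.7", "/index.html", 200, 2326),
    ("192.0.2.7", "/missing.html", 404, 0),
]


def load_hits(rows):
    """Create an in-memory database and fill a hypothetical 'hits' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (host TEXT, path TEXT, status INTEGER, size INTEGER)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", rows)
    return conn


def top_paths(conn, limit=10):
    """Return (path, hit count, total bytes) for successful requests."""
    query = (
        "SELECT path, COUNT(*) AS n, SUM(size) AS bytes "
        "FROM hits WHERE status = 200 "
        "GROUP BY path ORDER BY n DESC, path ASC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    for path, n, total in top_paths(load_hits(ROWS)):
        print(path, n, total)
```

The appeal is that filtering, grouping, and ordering are all stated in one declarative query, which is exactly the part developers would otherwise have to reimplement by hand.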
So what do you think? If you were a potential user, what do you think optimal test cases for such an API would look like? The strength of open source software lies in the ability of its developers to offer and consider multiple perspectives. As such, if I’m going to take on this project, I’d like to hear what others have to say.
Update: I’ve just been told that development on phpOpenTracker has recently resumed after about a three-year hiatus. If you’re interested in a PHP-based web analytics solution, it might be worth a look.
Update 2: If you tried to get to the visitors link above and found it broken, it’s been fixed. Sorry about that.