Posts tagged ‘MySQL’

Database Testing with PHPUnit and MySQL

Update 2012/01/15: I finally got around to submitting a patch to document this feature in the PHPUnit manual. Sebastian has merged it, so it will hopefully be available in the online manual soon.

Update #2 2012/01/23: I got around to checking the online version of the manual and the current build includes my patch. Enjoy.

I recently made a contribution to the PHPUnit project that I thought I’d take a blog post to discuss. One of the extensions bundled with PHPUnit adds support for database testing. This extension was contributed by Mike Lively and is a port of the DbUnit extension for the JUnit Java unit testing framework. If you’re interested in learning more about database unit testing, check out this presentation by Sebastian Bergmann on the subject.

One of the major components of both extensions is the data set. Database unit tests involve loading a seed data set into a database, executing code that performs an operation on that data set such as deleting a record, and then checking the state of the data set to confirm that the operation had the desired effect. DbUnit supports multiple formats for seed data sets. The PHPUnit Database extension includes support for DbUnit’s XML and flat XML formats plus CSV format as well.

If you’re using MySQL as your database, CSV has been the only format supported by both the mysqldump utility and the PHPUnit Database extension up to this point. My contribution adds support for its XML format to the extension. While this support was developed to work in the PHPUnit 3.4.x branch, it won’t be available in a stable release until 3.5.0. In the meantime, this is how you can use it now.

  1. Go to the commit on Github and apply the additions and modifications included in it to your PHPUnit installation.
  2. From a shell, get your XML seed data set and store it in a location accessible to your unit test cases.
    mysqldump --xml -t -u username -p database > seed.xml
  3. Create a test case class that extends PHPUnit_Extensions_Database_TestCase. Implement getConnection() and getDataSet() as per the documentation where the latter will include a method call to create the data set from the XML file as shown below.
    $dataSet = $this->createMySQLXMLDataSet('/path/to/seed.xml');
  4. At this point, you can execute operations on the database to get it to its expected state following a test, produce an XML dump of the database in that state, and then compare that dump to the actual database contents in a test method to confirm that the two are equal.
    $expected = $this->createMySQLXMLDataSet('/path/to/expected.xml');
    $actual = new PHPUnit_Extensions_Database_DataSet_QueryDataSet($this->getConnection());
    // Specify a SELECT query as the 2nd parameter here to limit the data set, else the entire table is used
    $actual->addTable('tablename');
    $this->assertDataSetsEqual($expected, $actual);

That’s it! Hopefully this proves useful to someone else.

Natural Ordering in MySQL

I ran into an instance recently where I wanted to implement natural sorting of a result set in MySQL. When you’re dealing with numerical strings or strings with a common non-numeric prefix, the common solution of casting the order column to an integer by adding zero to it works fine. However, if neither of the aforementioned conditions is the case, it takes a little more work.

What actually happens when you add zero to a non-numeric column depends on the characters at the beginning of the column value. If the column does not begin with a sequence of one or more numeric characters, then adding zero to that column produces zero. (Ex: “dog” + 0 = 0) If the column does begin with numeric characters, then adding zero to it produces the sequence of numeric characters up to the first non-numeric character in the original value or the end of the value, whichever comes first. (Ex: “12 dogs” + 0 = 12) An example might be the easiest way to illustrate this.

mysql> SELECT name+0<>0, name+0, name 
    -> FROM `recommendation` 
    -> ORDER BY name+0<>0 DESC, name+0, name;
+-----------+--------+------------------------+
| name+0<>0 | name+0 | name                   |
+-----------+--------+------------------------+
|         1 |      3 | 3 month follow-up      | 
|         1 |      6 | 6 month follow-up      | 
|         1 |     12 | 12 month follow-up     | 
|         0 |      0 | Intervention           | 
|         0 |      0 | Observation            | 
|         0 |      0 | Specialty Consultation | 
+-----------+--------+------------------------+
6 rows in set (0.00 sec)

The first ORDER BY clause checks the string to see if it begins with numeric characters, then places results for those that do first. If you prefer that numeric results appear after non-numeric results, then you can exclude this clause.

The second ORDER BY clause orders the numeric results by casting them to integers and ordering by those integers.

The third clause orders the non-numeric results by the original column value.

And that’s all there is to it. Hope this proves helpful to someone.

EAV Modeling – Square Peg in a Round Hole?

So I got the June 2008 issue of php|architect (or volume 7 issue 6 for those of you who track it that way) in recently. Right off, I found the cover article on EAV modeling to be of interest seeing as my current employment is in the medical IT industry and I’d never heard of this technique for storing data. I actually more or less knew what it was, but had never put a name to the face so to speak.

The mental image that came to me when reading about this approach to data modeling was taking the traditional relational table and turning it on its head. Despite what the Wikipedia article on the topic might tout early on, there are disadvantages to using the EAV approach. EAV actually has to circumvent, work around, or reimplement features that most mainstream database servers today provide “for free” to the traditional relational counterparts of EAV in order to get equal functionality. These include native data type validation and data type-specific operations without explicit typecasting (if you’re not separating EAV values by data type), row-level referential integrity, and schema metadata. EAV also adds a dimension of complexity to query construction in an era where storage is becoming cheaper and database technologies are evolving. It may work, but I don’t foresee it scaling very well for larger systems. In short, it seems an attempt to force a square peg into the round hole that are traditional relational database systems.

In a MySQL world, there are alternative approaches for deploying DDL modifications. One is to implement master-slave replication to propagate DDL modifications and load balancing to maintain uptime as changes are propagated from server to server. Another is to use MySQL Proxy to direct queries to servers hosting unmodified schemas and queue DML operations in the binary log while DDL modifications are made. Once DDL is complete, the server goes into “read only” mode while queued DML operations are applied and incoming DML operations received during that time are blocked until the queue is empty. (This may be a potential point of improvement.)

Outside of MySQL, there are document-focused database systems such as Apache Lucene and its current .NET and PHP ports as well as Apache CouchDB. While some of these are still a little early in development, I see them as being more ideal for applications demanding more fluid data storage and hope that the development of similar solutions continues.

Graphs and Relational Databases

Situations involving hierarchical data and relational databases are quite common in web applications. Trees lend themselves quite well to providing organizational structures a web site, such as sitemaps and breadcrumb trails. A slightly less common and different type of situation, where the application is just as useful and a solution is a bit more complex to derive, is one involving graphical data (as in graphs, not graphics) and relational databases. These situations have issues like the shortest path problem and find their solutions in graph theory such as the A* algorithm or Dijkstra’s algorithm. An example of such a situation is an airline web site that requires the ability to locate connecting and round-trip flights and find the flight path with the lowest cost in terms of time or ticket price.

If you use MySQL, this chapter from “Get It Done with MySQL 5” is a fairly verbose but comprehensive guide to using MySQL to store graphical data. It includes background information such as terminology used in graph theory and has numerous implementation examples of adjacency list graph models, nested set graph models, and breadth-first and depth-first graph search algorithms.

For Oracle users, there’s a slightly more application-oriented tutorial that assumes more theoretical knowledge on the part of the reader. It shows that Oracle’s hierarchical data features unfortunately can’t be used in cases where cycles might exist in graphs (which is handy if you’re trying to detect them) and then goes on to show an implementation that uses temporary tables to store a summary of a graph analysis. A worthy side note is that part 2 of that tutorial deals with a more specialized approach using state machines that may or may not be applicable to your situation.

If you’d like more information on this topic, a good place to look is your local university. Most with a computer science program offer a course in theory of computation, which deals with topics like these as well as context-free grammars in the context of developing programming languages. Even if you never actually use this information to develop a language yourself, it can still serve a good purpose: it make you more informed when engaging in discussions about language development, and it can increase your appreciation for the beauty of a language from a user perspective.

PHP, MySQL, and Oracle: An Odd Triangle

Maggie Nelson posted a blog entry recently with a review of an article entitled “Database Design for PHP Programmers” in the February 2008 edition of php|architect magazine. In it, she remarks on the article being MySQL-oriented and how limited MySQL explain plan support is compared to Oracle. I’ve had some thoughts in my head for a while that are related to these points, so I finally decided to, knock on wood, put pen to paper.

First, I’ll agree with Maggie that Oracle beats MySQL hands down in a number of areas. Explain plans in Oracle are significantly more detailed and informative than they are in MySQL. Oracle offers features to support hierarchical data, whereas the most MySQL can recommend is to use the nested set model. Oracle also fully supports set operators, while MySQL still only supports unions. Oracle offers Oracle XE and Oracle SQL Developer to lower the barrier of entry for new developers. MySQL itself is free, but MySQL only currently offers similar developer-targetted features in the form of four separate tools, one of which (Workbench) is still very much in beta. These are just a few places where Oracle comes out the clear victor against MySQL. From a developer perspective, it’s no contest.

The story doesn’t end there, unfortunately. The major issue with Oracle is licensing, which still costs a pretty penny these days. Oracle is lowering this barrier a little to include midsize companies, but obviously MySQL still comes out more cost-effective. MySQL itself isn’t completely innocent of causing licensing issues, but it has at least remained free and compatible with PHP in terms of licensing up to this point. The LAMP platform became popular with good reason: it gives smaller companies a scalable, zero-cost, effective means of rapidly developing applications. At the moment, the most Oracle can claim is two out of those three points.

In addition to licensing, Oracle also comes with additional administration costs. I’ll use the managed server hosting provider Rackspace, with which I’ve had a fair amount of experience, as an example of this. They provide hosting for some fairly big names, Apple being one of them. Take a look at the database services they offer. They support both MySQL and Oracle. If you look closely, though, you’ll notice this statement on the Oracle page: “Oracle Server Support is not available to our Managed Support Level customers. For information about our Intensive Support Level, please click here.” I think that about sums up my point.

I would actually like to see us switch to Oracle where I work. As a developer, I prefer its feature set, enough so that I’m willing to overlook some of its nastier qualities. However, with the additional costs that would come with licensing and administration by Rackspace (which is still very reasonable for what they provide), I don’t think the cost difference is one that I can justify. Until the licensing and administration requirements that come with it are lessened, I don’t believe Oracle is going to be an ideal solution for me or the average PHP developer.

Additionally, though I’m still a bit skeptical, the Sun buyout of MySQL could accelerate MySQL’s development and in time put it closer to being on par with Oracle’s feature set. Only time will tell on that one, but I’d say it’s just as much of a possibility as Oracle expanding its offerings to include smaller companies. It’s still too early after the buyout to make any remotely definitive predictions as to its long-term effects on MySQL as a product.

But, as Dennis Miller would say, “That’s just my opinion, I could be wrong.” What’s your take?

NULLification

I’ve seen some “interesting” things during my time with database systems, but the one that takes the cake by far is variations in how NULL is interpreted. I’m going to provide some examples to showcase what I’m talking about using Oracle and MySQL, being that my experience is mostly with those two particular systems. Examples given are run on Windows XP SP2 using Oracle Express Edition 10.2.0.1.0 and MySQL Community Edition 5.0.45.

The stated intention of the existence of NULL is to convey the absence of any value. Both Oracle and MySQL say as much (and I particularly like MySQL’s explanation of the reasoning behind this). Oracle, however, immediately goes on to contradict that principle and maintain that a character value with a length of zero is considered to be equivalent to NULL.

Oracle: SELECT CONCAT(‘test’, NULL) FROM dual; -> ‘test’
MySQL: SELECT CONCAT(‘test’, NULL); -> NULL

This behavior manifests itself in some other “interesting” ways.

Oracle: SELECT ‘true’ FROM dual WHERE LENGTH(”) IS NULL; -> ‘true’
MySQL: SELECT LENGTH(”); -> 0

If you’re wondering why I’ve formatted my Oracle query differently in this example, it’s because this particular release of SQL*Plus appears to not want to give me a definitive answer. If I use the equivalent Oracle query, I simply get blank space where I would expect to see NULL. If you invert the WHERE clause in the query, you’ll see what I mean. I wonder why this is?

To make things worse, it is only indirectly noted in the Oracle LENGTH function documentation in the statement that passing NULL to LENGTH will result in NULL. You have to read the section on NULL to find out that the empty string is equivalent to NULL, then put two and two together in order to figure this out. Oracle has made the statement that it is possible this behavior will change in the future. Given backward compatibility-related implications, however, I highly doubt that.

Strings aren’t the only area to which I have an objection with respect to NULL; numbers are, as well. Prior to MySQL 5.0.13, the following example was handled in what I believed was “the right way.” In versions 5.0.13 and later, it was changed to use the same logic as Oracle.

Oracle: SELECT ‘true’ FROM dual WHERE GREATEST(1, NULL) IS NULL; -> ‘true’
MySQL: SELECT GREATEST(1, NULL); -> NULL

What both do now is cause NULL to be the result if any one operand in the expression is NULL. In dealing with a general expression, I can understand this. When that expression deals with a logical set of associated row- or column-wise values, which this and some other functions do, the absence of a value for a single object in the set should not cause this to happen.

MySQL isn’t even consistent about this; in a section on common problems with NULL in its documentation, it’s stated that “Aggregate (summary) functions such as COUNT(), MIN(), and SUM() ignore NULL values.” In another section on working with NULL values, it’s stated that “Two NULL values are regarded as equal in a GROUP BY.” NULL conveys the absence of a value, so how can this be?! Both Oracle and MySQL have a set of functions and operators for dealing with NULL. At least these treat NULL consistently.

In short, I’m not saying NULL shouldn’t exist. I know it has its uses, such as rewriting queries to avoid using subqueries in order to improve efficiency. I’m saying that mainstream database systems should bear in mind the reason for the existence of NULL when deciding how any given function call or expression should handle encounters with it. Until they do, all I can do is recommend caution when dealing with fields that may contain NULL and querying against them. Be cognizant of these fields and operations involving them and make use of your respective database’s functions and operators for handling these cases so as to avoid unexpected results. Don’t let your data be nullified!