Posts tagged ‘Databases’

Natural Ordering in MySQL

I ran into an instance recently where I wanted to implement natural sorting of a result set in MySQL. When you’re dealing with numerical strings or strings with a common non-numeric prefix, the common solution of casting the order column to an integer by adding zero to it works fine. However, if neither of the aforementioned conditions is the case, it takes a little more work.

What actually happens when you add zero to a non-numeric column depends on the characters at the beginning of the column value. If the column does not begin with a sequence of one or more numeric characters, then adding zero to that column produces zero. (Ex: “dog” + 0 = 0) If the column does begin with numeric characters, then adding zero to it produces the sequence of numeric characters up to the first non-numeric character in the original value or the end of the value, whichever comes first. (Ex: “12 dogs” + 0 = 12) An example might be the easiest way to illustrate this.

mysql> SELECT name+0<>0, name+0, name 
    -> FROM `recommendation` 
    -> ORDER BY name+0<>0 DESC, name+0, name;
+-----------+--------+------------------------+
| name+0<>0 | name+0 | name                   |
+-----------+--------+------------------------+
|         1 |      3 | 3 month follow-up      | 
|         1 |      6 | 6 month follow-up      | 
|         1 |     12 | 12 month follow-up     | 
|         0 |      0 | Intervention           | 
|         0 |      0 | Observation            | 
|         0 |      0 | Specialty Consultation | 
+-----------+--------+------------------------+
6 rows in set (0.00 sec)

The first ORDER BY clause checks the string to see if it begins with numeric characters, then places results for those that do first. If you prefer that numeric results appear after non-numeric results, then you can exclude this clause.

The second ORDER BY clause orders the numeric results by casting them to integers and ordering by those integers.

The third clause orders the non-numeric results by the original column value.

And that’s all there is to it. Hope this proves helpful to someone.

EAV Modeling – Square Peg in a Round Hole?

So I got the June 2008 issue of php|architect (or volume 7 issue 6 for those of you who track it that way) in recently. Right off, I found the cover article on EAV modeling to be of interest seeing as my current employment is in the medical IT industry and I’d never heard of this technique for storing data. I actually more or less knew what it was, but had never put a name to the face so to speak.

The mental image that came to me when reading about this approach to data modeling was taking the traditional relational table and turning it on its head. Despite what the Wikipedia article on the topic might tout early on, there are disadvantages to using the EAV approach. EAV actually has to circumvent, work around, or reimplement features that most mainstream database servers today provide “for free” to the traditional relational counterparts of EAV in order to get equal functionality. These include native data type validation and data type-specific operations without explicit typecasting (if you’re not separating EAV values by data type), row-level referential integrity, and schema metadata. EAV also adds a dimension of complexity to query construction in an era where storage is becoming cheaper and database technologies are evolving. It may work, but I don’t foresee it scaling very well for larger systems. In short, it seems an attempt to force a square peg into the round hole that are traditional relational database systems.

In a MySQL world, there are alternative approaches for deploying DDL modifications. One is to implement master-slave replication to propagate DDL modifications and load balancing to maintain uptime as changes are propagated from server to server. Another is to use MySQL Proxy to direct queries to servers hosting unmodified schemas and queue DML operations in the binary log while DDL modifications are made. Once DDL is complete, the server goes into “read only” mode while queued DML operations are applied and incoming DML operations received during that time are blocked until the queue is empty. (This may be a potential point of improvement.)

Outside of MySQL, there are document-focused database systems such as Apache Lucene and its current .NET and PHP ports as well as Apache CouchDB. While some of these are still a little early in development, I see them as being more ideal for applications demanding more fluid data storage and hope that the development of similar solutions continues.

PHP 5.2 and SQL Server 2000 on Windows XP

Recently at work, I was tasked with finding a method to retrieve data from a third-party SQL Server 2000-based system into our own MySQL-based application. It’s worth noting that both system are behind their own firewalls and I was trying to bear security in mind as I did this.

First, I had to actually get into the system with the SQL Server database. For this, I used Citrix GoToAssist. The company behind the software makes their money off of hosting the servers it uses and selling access to them as a service. The client software uses HTTPS for security and to get around firewall configurations. While their software is nice, I’d be interested to see if anyone knows of any OSS equivalents of it.

Once I had access to the system, I needed something akin to phpMyAdmin or Oracle SQL Developer in order to see what resources were available on the SQL Server itself. For that, I found a local copy of EMS SQL Manager already on the system. This wasn’t the most full-fledged product I’d seen for this type of purpose, but it was free, included everything I needed, and did the job in a pinch.

After doing some reconnaisance on the database structure using the information that had been given to me by the third-party vendor, I downloaded the ZIP archive containing the standard build of PHP 5.2 for Windows. Side note: one of the things I like about PHP on Windows is that, for CLI purposes, it only takes decompressing the ZIP archive, tweaking the configuration file to get the settings and extensions you want, and executing the binary to get it up and running.

With my SQL Server experience being dated by about six years, I started throwing PHP code at the system to see what would stick. I noticed that the server was already set up to accept trusted connections, and being that I was running the script on the local system, this made it likely that authentication wouldn’t present any issues, or so I thought.

I created my PDO instance like so:

$db = new PDO('mssql:localhost;dbname=...');

And then attempted to call a particular stored procedure that I’d been told would have some of the data I was looking to extract. I was surprised to get this in response when calling PDO::errorInfo().

Array
(
[0] => HY000
[1] => 10007
[2] => Could not find stored procedure ... [10007] (severity 5)
[EXEC ...]
[3] => -1
[4] => 5
)

Now I had just been in EMS SQL Manager and seen the stored procedure myself, so I knew it was there. I tried using the sa account, but that didn’t seem to work either. After some digging, I found that I had to create an account for the current Windows user on the current network domain in order to make PHP capable of seeing the stored procedure when using a trusted connection. Once I’d created the account and given it access to execute the specific stored procedure I was trying to call, I tried again and PDO::errorInfo then gave me this.

Array
(
[0] => HY000
[1] => 10007
[2] => Unicode data in a Unicode-only collation or ntext data
cannot be sent to clients using DB-Library (such as ISQL) or
ODBC version 3.7 or earlier. [10007] (severity 5) [EXEC ...]
[3] => -1
[4] => 5
)

Apparently there are issues with the MS SQL stand-alone PHP extension and SQL Servers using unicode collation. Some more digging turned up that the only way to get around this was to use ODBC. Once I did that, I found that I was no longer getting an error when trying to call the stored procedure.

$db = new PDO('odbc:Driver={SQL Server};Server=localhost;
Database=...;Trusted_Connection=yes;');

Though I knew of stored procedures conceptually, my previous experience with SQL Server had never included using them. As such, I wasn’t familiar with the syntax and came across something rather strange while trying to troubleshoot it: if the arguments to a stored procedure are surrounded by parentheses in the  query calling that stored procedure, the code silently fails and returns no results.

// $stmt = $db->prepare('EXEC ...(?, ?)'); // Fails silently
$stmt = $db->prepare('EXEC ... ?, ?'); // Works as expected

Last on the list of “interesting things I experienced” while on this little trek wasn’t related to the database, but to what was happening when I attempted to push data from the SQL Server machine to our LAMP machine. I was using the PHP serialize() function to encode the data into a string, then using the streams HTTPS wrapper to send it via a POST request to a PHP file on the LAMP machine. This meant I wouldn’t have to poke a hole in the firewall on either side since both were already open to web traffic. When attempting to run the code for this, I got an error on the client side:

PHP Warning:  file_get_contents(): SSL: An existing connection 
was forcibly closed by the remote host.
PHP Warning: file_get_contents(https://...): failed to open
stream: HTTP request failed!

The cause for this wasn’t immediately apparent. Upon checking Apache error logs, I came across this:

PHP Fatal error:  Allowed memory size of 16777216 bytes 
exhausted (tried to allocate 3381676 bytes)

The actual issue had to do with how much data I was trying to send; it was exceeding the value specified in the memory_limit configuration setting, which as it turned out was set to the recommended value for a significantly older version of PHP. After updating it to a more current recommended value and bouncing Apache, all was well.

Hope the details of my experiences prove helpful to someone else. Thanks to those who provided assistance along the way.

Graphs and Relational Databases

Situations involving hierarchical data and relational databases are quite common in web applications. Trees lend themselves quite well to providing organizational structures a web site, such as sitemaps and breadcrumb trails. A slightly less common and different type of situation, where the application is just as useful and a solution is a bit more complex to derive, is one involving graphical data (as in graphs, not graphics) and relational databases. These situations have issues like the shortest path problem and find their solutions in graph theory such as the A* algorithm or Dijkstra’s algorithm. An example of such a situation is an airline web site that requires the ability to locate connecting and round-trip flights and find the flight path with the lowest cost in terms of time or ticket price.

If you use MySQL, this chapter from “Get It Done with MySQL 5” is a fairly verbose but comprehensive guide to using MySQL to store graphical data. It includes background information such as terminology used in graph theory and has numerous implementation examples of adjacency list graph models, nested set graph models, and breadth-first and depth-first graph search algorithms.

For Oracle users, there’s a slightly more application-oriented tutorial that assumes more theoretical knowledge on the part of the reader. It shows that Oracle’s hierarchical data features unfortunately can’t be used in cases where cycles might exist in graphs (which is handy if you’re trying to detect them) and then goes on to show an implementation that uses temporary tables to store a summary of a graph analysis. A worthy side note is that part 2 of that tutorial deals with a more specialized approach using state machines that may or may not be applicable to your situation.

If you’d like more information on this topic, a good place to look is your local university. Most with a computer science program offer a course in theory of computation, which deals with topics like these as well as context-free grammars in the context of developing programming languages. Even if you never actually use this information to develop a language yourself, it can still serve a good purpose: it make you more informed when engaging in discussions about language development, and it can increase your appreciation for the beauty of a language from a user perspective.

NULLification

I’ve seen some “interesting” things during my time with database systems, but the one that takes the cake by far is variations in how NULL is interpreted. I’m going to provide some examples to showcase what I’m talking about using Oracle and MySQL, being that my experience is mostly with those two particular systems. Examples given are run on Windows XP SP2 using Oracle Express Edition 10.2.0.1.0 and MySQL Community Edition 5.0.45.

The stated intention of the existence of NULL is to convey the absence of any value. Both Oracle and MySQL say as much (and I particularly like MySQL’s explanation of the reasoning behind this). Oracle, however, immediately goes on to contradict that principle and maintain that a character value with a length of zero is considered to be equivalent to NULL.

Oracle: SELECT CONCAT(‘test’, NULL) FROM dual; -> ‘test’
MySQL: SELECT CONCAT(‘test’, NULL); -> NULL

This behavior manifests itself in some other “interesting” ways.

Oracle: SELECT ‘true’ FROM dual WHERE LENGTH(”) IS NULL; -> ‘true’
MySQL: SELECT LENGTH(”); -> 0

If you’re wondering why I’ve formatted my Oracle query differently in this example, it’s because this particular release of SQL*Plus appears to not want to give me a definitive answer. If I use the equivalent Oracle query, I simply get blank space where I would expect to see NULL. If you invert the WHERE clause in the query, you’ll see what I mean. I wonder why this is?

To make things worse, it is only indirectly noted in the Oracle LENGTH function documentation in the statement that passing NULL to LENGTH will result in NULL. You have to read the section on NULL to find out that the empty string is equivalent to NULL, then put two and two together in order to figure this out. Oracle has made the statement that it is possible this behavior will change in the future. Given backward compatibility-related implications, however, I highly doubt that.

Strings aren’t the only area to which I have an objection with respect to NULL; numbers are, as well. Prior to MySQL 5.0.13, the following example was handled in what I believed was “the right way.” In versions 5.0.13 and later, it was changed to use the same logic as Oracle.

Oracle: SELECT ‘true’ FROM dual WHERE GREATEST(1, NULL) IS NULL; -> ‘true’
MySQL: SELECT GREATEST(1, NULL); -> NULL

What both do now is cause NULL to be the result if any one operand in the expression is NULL. In dealing with a general expression, I can understand this. When that expression deals with a logical set of associated row- or column-wise values, which this and some other functions do, the absence of a value for a single object in the set should not cause this to happen.

MySQL isn’t even consistent about this; in a section on common problems with NULL in its documentation, it’s stated that “Aggregate (summary) functions such as COUNT(), MIN(), and SUM() ignore NULL values.” In another section on working with NULL values, it’s stated that “Two NULL values are regarded as equal in a GROUP BY.” NULL conveys the absence of a value, so how can this be?! Both Oracle and MySQL have a set of functions and operators for dealing with NULL. At least these treat NULL consistently.

In short, I’m not saying NULL shouldn’t exist. I know it has its uses, such as rewriting queries to avoid using subqueries in order to improve efficiency. I’m saying that mainstream database systems should bear in mind the reason for the existence of NULL when deciding how any given function call or expression should handle encounters with it. Until they do, all I can do is recommend caution when dealing with fields that may contain NULL and querying against them. Be cognizant of these fields and operations involving them and make use of your respective database’s functions and operators for handling these cases so as to avoid unexpected results. Don’t let your data be nullified!

Supporting Hierarchical Data Sets

If you deal with relational database systems on any semi-regular basis, you’ve probably had to support hierarchical data sets before. In genealogical terms, each record may have a logical “parent” record and zero or more “child” records. The approach that you’ve likely taken to support these data sets has been to include a foreign key column in the table in question that references the primary key column in the same table. This approach is called the adjacency list model, where the name is taken from the same concept in graph theory. This model is simple and intuitive, but it has a drawback: to obtain any significant portion of the hierarchy stored in a table can take up to as many queries as there are levels of depth in the hierarchy.

There is an alternative approach, called the nested set model, which finds its basis in set theory. Its development is credited to a man by the name of Joe Celko, who’s written a number of books related to SQL and database design. The basis of the approach is this: each record has numerical left and right bounds that represent the boundaries of a set, where everything within that set is a descendant of that record. By employing some simple joins, most of the subsets of data commonly desired in a hierarchical data set can be retrieved with a single query, which is much more efficient in terms of the number of queries required to fulfill an individual request.

If MySQL is your database server of choice, its main web site has an excellent article that outlines what the nested set model is and how to implement it within MySQL. This article on Sitepoint also explains it and its examples use PHP in conjunction with MySQL. There’s also another great article on developer.com that gives a visual walk through of the nested set model and its implementation in Oracle for use in generating breadcrumb trail navigation on a web site. Danne Lundqvist also has a blog entry describing his experiences in using PHP and JavaScript (specifically Mootools) to transfer data back and forth between the nested set model and an ordered depth model. These articles explain the concepts well enough that I doubt I’d do much better in trying to reiterate them here.

I found out about the nested set model through Elizabeth Smith, one of my fellow developers on the Forkr project. One of my tasks while working on the project was to adapt information from various resources, including code she’d written to implement a combined adjacency list/nested set model approach on a PostgreSQL database, to a component for Forkr to support hierarchical data sets. Why a combined approach? There are several reasons why that is actually better than each individual approach by itself.

  1. Some of its common queries, mostly those involving only two levels worth of data (i.e. a parent and its children), are more simple and efficient in the adjacency list model than in the nested set model.
  2. Most sites that support hierarchical data sets already use the adjacency list model, so roughly half the work of implementing a combined approach is already done.
  3. The adjacency list model is fairly easy to understand and, when effectively laid side-by-side with the nested set model, makes it somewhat easier to follow when looking at data that uses a combined approach.
  4. When developing an application that uses the nested set model, the logic which controls the bound values for each record can be error-prone early in development. By also maintaining a foreign key column that points back to the parent, the bound values can easily be regenerated at any time.

My work in Forkr is still very much in an alpha stage, but once it’s completed I’ll likely post another blog entry on this topic that outlines how to implement the nested set model with specific SQL examples. So, keep your eyes peeled for it.

Update: Wow… apparently I unknowingly struck a nerve. :P The earlier comments are correct in that manipulating nested set bounds does indeed require updates on many records of a table. The extra speed in retrieving the hierarchy has to come from somewhere, right? The nested set model is admittedly better for instances where the tree isn’t likely to change often or have multiple potential points of modification that can run concurrently. My apologies for not making that clear earlier on.

Why I Still Hate System Administration

So, Chris Cornutt‘s comment on my last entry regarding LD_LIBRARY_PATH did turn out to be my issue. Once it was properly set, libphp5.so was properly able to see needed libraries and Apache was able to boot successfully.

Some time later, I was able to get the PDO and PDO_INFORMIX extensions from PECL installed. I was able to install the Informix Client SDK, despite post-installation configuration on the client system being a bit tedious to wade through.

As of the end of the day Wednesday, though, I still couldn’t successfully get PHP and Informix speaking on the same wavelength. I have a feeling, though, that this is more than likely due to an issue with Informix than with PHP.

I’ve tried logging in as both root and as the user who owns all of the tables in the database I’m trying to access. However, neither prevents this error from appearing when I attempt to connect to the server from PHP.

Fatal error: Uncaught exception 'PDOException' with message 'SQLSTATE=HY000, SQLDriverConnect: -329 [Informix][Informix ODBC Driver][Informix]Database not found or no system permission.'

I’ve used the dbaccess utility that comes with Informix on the database server box to confirm that both logins are valid and can see the database I’m trying to connect to. I’ve already checked from both the server and client boxes to confirm that the hostname being used maps to the correct IP address on the LAN, so the issue is not a host identity crisis.

It seems odd that I would be able to access the database with no issues on the database server, but that may just be a lack of understanding of Informix user permissions on my part. Oh well… I’ll be hitting it again tomorrow, so wish me luck and throw in pointers if you’re familiar with Informix.

Update: A coworker of mine did eventually find the problem, which had to do with weirdness in the Informix database server configuration. If you didn’t know, Informix stores its data in a fashion similar to that of MySQL: each database has its own directory and each table one or more files within that directory. Apparently, because of the server configuration, when specifying a database, the entire path to the database directory had to be provided instead of just the database name, hence the “Database not found” error. Now I get to engage in the exciting task of porting a large C-based CGI application over to PHP. Joy…

Why I Hate System Administration

I got assigned to a new project at work recently. At first, I was delighted, because I knew that the project requirements would allow me to use PHP. My job generally involves Oracle Forms and Reports or PL/SQL data cartridges, so going back to my language of choice seemed like it would be a refreshing change. The client, however, has rather interesting requirements regarding the server environment.

First, they’re running on Sun hardware and as such require that they maintain a Sun operating system, namely Solaris 10, only that hardware. Second, their applications are currently running off of an Informix 7 database server.

There is currently only one PHP extension for interfacing with Informix that is actively maintained: PDO_INFORMIX, which is in PECL and is maintained by IBM (who bought out Informix shortly after the client got this database server). The latest release of this extension was made in February 2007 and it relies on PDO 1.0, which is about 3 versions back from the current version of PDO. That’s not to mention that you need the Informix Client SDK installed on the system to install PDO_INFORMIX.

The only Solaris builds of PHP are on the Freeware for Solaris web site and the only build that will run on Apache 2.0.59 (what’s currently running on the client’s production server) is PHP 5.2.1, which has the latest version of PDO built in by default. There’s no way to override this with an installation of the separate PECL extension, at least that I know of. The only way to get around it is to recompile PHP from source and disable the default inclusion of PDO.

I spent today trying to figure out how to get PHP to compile from the source tarball offered on that web site so that I could disclude PDO from being built into the interpreter and then install it as a separate PECL extension to allow me to install PDO 1.0 to allow PDO_INFORMIX to be installed and work correctly. No matter what I tried, I got this error message:

ld.so.1: httpd: fatal: relocation error: file /usr/apache/libexec/libphp5.so: symbol sapi_module: referenced symbol not found

I’ve been told (and I believe it) that libphp5.so is expecting there to be some Apache files somewhere and obviously just isn’t finding them, even though I’ve used the –with-apache2 and –with-apxs2 options when configuring PHP for installation from source and tried installing Apache from source as well.

The last thing I did before leaving was start a fresh installation of Solaris 10 to a VMware disk, which I intend to make a snapshot and backup copy of before I try again. I’m hoping that my initial attempts to install PHP and Apache from their Solaris packages is causing some sort of conflict, though I have to say I have my doubts. I’ll continue searching for answers tomorrow. In the meantime, any sagely advice in the comments would be welcome.