Supporting Hierarchical Data Sets

If you deal with relational database systems on any semi-regular basis, you’ve probably had to support hierarchical data sets before. In genealogical terms, each record may have a logical “parent” record and zero or more “child” records. The approach that you’ve likely taken to support these data sets has been to include a foreign key column in the table in question that references the primary key column in the same table. This approach is called the adjacency list model, where the name is taken from the same concept in graph theory. This model is simple and intuitive, but it has a drawback: to obtain any significant portion of the hierarchy stored in a table can take up to as many queries as there are levels of depth in the hierarchy.

There is an alternative approach, called the nested set model, which finds its basis in set theory. Its development is credited to a man by the name of Joe Celko, who’s written a number of books related to SQL and database design. The basis of the approach is this: each record has numerical left and right bounds that represent the boundaries of a set, where everything within that set is a descendant of that record. By employing some simple joins, most of the subsets of data commonly desired in a hierarchical data set can be retrieved with a single query, which is much more efficient in terms of the number of queries required to fulfill an individual request.

If MySQL is your database server of choice, its main web site has an excellent article that outlines what the nested set model is and how to implement it within MySQL. This article on Sitepoint also explains it and its examples use PHP in conjunction with MySQL. There’s also another great article on developer.com that gives a visual walk through of the nested set model and its implementation in Oracle for use in generating breadcrumb trail navigation on a web site. Danne Lundqvist also has a blog entry describing his experiences in using PHP and JavaScript (specifically Mootools) to transfer data back and forth between the nested set model and an ordered depth model. These articles explain the concepts well enough that I doubt I’d do much better in trying to reiterate them here.

I found out about the nested set model through Elizabeth Smith, one of my fellow developers on the Forkr project. One of my tasks while working on the project was to adapt information from various resources, including code she’d written to implement a combined adjacency list/nested set model approach on a PostgreSQL database, to a component for Forkr to support hierarchical data sets. Why a combined approach? There are several reasons why that is actually better than each individual approach by itself.

  1. Some of its common queries, mostly those involving only two levels worth of data (i.e. a parent and its children), are more simple and efficient in the adjacency list model than in the nested set model.
  2. Most sites that support hierarchical data sets already use the adjacency list model, so roughly half the work of implementing a combined approach is already done.
  3. The adjacency list model is fairly easy to understand and, when effectively laid side-by-side with the nested set model, makes it somewhat easier to follow when looking at data that uses a combined approach.
  4. When developing an application that uses the nested set model, the logic which controls the bound values for each record can be error-prone early in development. By also maintaining a foreign key column that points back to the parent, the bound values can easily be regenerated at any time.

My work in Forkr is still very much in an alpha stage, but once it’s completed I’ll likely post another blog entry on this topic that outlines how to implement the nested set model with specific SQL examples. So, keep your eyes peeled for it.

Update: Wow… apparently I unknowingly struck a nerve. :P The earlier comments are correct in that manipulating nested set bounds does indeed require updates on many records of a table. The extra speed in retrieving the hierarchy has to come from somewhere, right? The nested set model is admittedly better for instances where the tree isn’t likely to change often or have multiple potential points of modification that can run concurrently. My apologies for not making that clear earlier on.

7 Comments

  1. You should also mention that inserting a node in costs more in a nested set tree because it has to update more rows in the table than with the adjacency model. So it might be that for applications where you have more write operations than read operations the nested can hurt.

  2. foo says:

    Remember that updates are costly with Nested Sets. You can use the features that the RDBMS offers you (ORACLE – CONNECT BY, PostgreSql – connectby (contrib/tablefunc.sql), MSSQL – CTE) and stick with just the adjacency model. Tree width and depth also are important.

  3. Robert Janeczek says:

    NestedSet can be a straight path to hell. I haven’t seen any article describing how to deal with multiple updates done at the same time and this is something you can easily overlook. In the simplest implementation concurent updates just break the hierarchy, losing branches etc. This requires implementing locks around most of operations on tree (transactions themselves are not enough!), which slows everything down.

    There is additional implicit problem with large hierarchies. Inserts require on average reordering half of the tree, which causes lots of updates. Updating a node in Postgres invalidates a row and adds new version of it at the end of the table. This destroys half of index caches and pretty much requires VACUUM ANALYZE to keep things working.

    Be very careful with NestedSet. We’ve used it for over a year large-scale project and I’d say half of problems with the entire software were caused by NestedSet, which is now being removed gradually.

  4. Watch out with nested sets when you often need to query lists of nodes with their level of depth included. Those queries take O(n^2) runtime (indices omitted) and tend to get very slow with big trees.

  5. Are Pedersen says:

    Another way instead of the adjacency model or nested sets is the transitively closed relation. (http://www.dbazine.com/oracle/or-articles/tropashko4)

    It’s really a “helper” table, or cache table if you like.
    I’we been using it for years and it scales very well.

  6. […] Turland wrote on his blog, “[the nested set] model is simple and intuitive, but it has a drawback: to obtain any […]

  7. Dusan says:

    @Robert Janeczek:

    I would be very interested to know, what was the solution that replaced the nested set afterwards.