Third Normal Form is Snake Oil

Step right up, ladies and gentlemen, and I will sell you the solution to all of your database needs. That’s right, it’s Doctor Codd’s Third Normal Form, guaranteed to cure all ailments of the schemata, pragmata, and performata. Doctor Codd’s Form will make your data performant and compressive. Accept no substitutes or imitators; Doctor Boyce’s lamentable attempts cannot soothe your aches and pains like Doctor Codd’s Third Normal Form. Why, with just a simple application of Doctor Codd’s Third Normal Form, thrice daily, you’ll be jumping around to the tune of normal forms and transactions in no time!

Sound Familiar?

Anyone pushing a single idea is pushing snake oil, plain and simple. They’re selling you a warm and fuzzy feeling that you’ll make your problems go away by following their simple prescriptions. Deviation from the remedy will, of course, result in problems, failure, and potentially phlebitis.

Can I Cure You?

No, I can’t. Well, I can, but I’m not going to. Not yet, at least. You need to pay attention, first.

The Forms, Both Magnificent and Normal, Are Not A Panacea

Slavish adherence to normalization is bad for your health. There are as many reasons to not normalize data as there are reasons to normalize your data. Don’t believe me? What if I asked you to design a database to persist items that we might sell in a store?

It’s easy, at first, to design an items table with a few columns to describe the main properties that we want to persist about an item in our store. Problems begin when different departments in the store need to save different properties, and different parts of our IT systems need different views of the data. While adding a column is trivial on a small database, adding a column to a large database is decidedly non-trivial. Eventually the design boils down to an items table and a generic item_properties table, and at that point the database becomes impossible to query reasonably.
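
To make the failure mode concrete, here’s a minimal sketch of where that design usually ends up. The schema and property names are hypothetical, and SQLite stands in for whatever RDBMS you actually run; the pain is the same everywhere, because every property you want back as a column turns into another conditional aggregate or another self-join.

```python
# Hypothetical items / item_properties (entity-attribute-value) schema,
# sketched with SQLite so it runs anywhere. Names are illustrative only.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (
        item_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE item_properties (
        item_id        INTEGER REFERENCES items(item_id),
        property_name  TEXT,
        property_value TEXT
    );
""")

# Reassembling one logical row means pivoting: every property a department
# wants back as a column becomes another MAX(CASE ...) expression.
rows = con.execute("""
    SELECT i.item_id,
           i.name,
           MAX(CASE WHEN p.property_name = 'weight_kg' THEN p.property_value END) AS weight_kg,
           MAX(CASE WHEN p.property_name = 'color'     THEN p.property_value END) AS color,
           MAX(CASE WHEN p.property_name = 'ship_box'  THEN p.property_value END) AS ship_box
    FROM items AS i
    LEFT JOIN item_properties AS p ON p.item_id = i.item_id
    GROUP BY i.item_id, i.name
""").fetchall()
print(rows)
```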

A Solution Most Surprising

We can solve this problem a few ways, but with Microsoft’s Hadoop announcements, it makes sense to look at what the non-relational world can offer. HBase is a real-time column-oriented database that runs on top of Hadoop.

HBase is helpful for modeling dynamic properties because of its flexible data model. While HBase does have tables, rows, and columns, there are some powerful differences. HBase’s columns are grouped into column families – logical groupings of columns that are declared when a table is created. Individual columns can be added on the fly once a column family exists.

Jumping back to our example, instead of modeling items and item_properties tables, we can create an items table and create column families to store properties specific to a department or to a common purpose. Rather than create many tables, we can add a shipping_info column family, an accounting column family, and a sales_promotion column family. Over time this flexible data model can be used to populate reporting tables in an enterprise data warehouse. Rather than focus initial efforts on building a robust general purpose schema in an RDBMS, it’s easy to create a flexible schema in HBase and pull out the data we need for reporting at a later time.
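
As a rough sketch of what that looks like from client code (using the happybase Python client here; the host, row keys, and column qualifiers are all made up for illustration), the column families are declared once, and individual columns simply appear whenever a department first writes them:

```python
# Minimal sketch with the happybase HBase client; the Thrift host, table,
# and qualifier names below are hypothetical.
import happybase

connection = happybase.Connection('hbase-thrift-host')  # assumed HBase Thrift endpoint

# Column families are declared up front, one per department or purpose.
connection.create_table(
    'items',
    {
        'shipping_info':   dict(),
        'accounting':      dict(),
        'sales_promotion': dict(),
    },
)

items = connection.table('items')

# Individual columns (qualifiers) are created on the fly just by writing them.
items.put(b'sku-1001', {
    b'shipping_info:weight_kg': b'1.2',
    b'accounting:unit_cost':    b'4.75',
})
items.put(b'sku-1002', {
    b'sales_promotion:holiday_discount_pct': b'15',  # brand-new column, no ALTER needed
})
```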

A Final Commentary on Data

Denormalization doesn’t have to be a dirty word. There are many reasons to denormalize data. Ultimately, the decision to shred data apart should depend not on blind adherence to the principles of normalization but on the needs of the applications that consume the data. If you have a log file processing application, does it make sense to read log files from disk into a relational database? Shredding every log entry into multiple columns doesn’t make sense when log files are only infrequently processed and are used mainly to produce aggregations.

Even when you eventually need to query the log file data, there are tools suited to performing SQL-like operations across flat files. Hive provides a SQL-like querying layer on top of the Hadoop framework, making it possible to run bulk queries across large volumes of data stored in flat files and spread across many servers.
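
Here’s a rough idea of what that looks like in practice, sketched with the PyHive client against a HiveServer2 endpoint; the host, table layout, and HDFS path are all assumptions. The point is that Hive only needs to be told how the flat files are delimited and where they live:

```python
# Minimal sketch with the PyHive client; host, schema, and HDFS path are hypothetical.
from pyhive import hive

cursor = hive.connect(host='hive-server', port=10000).cursor()

# Point an external table at the raw, tab-delimited log files already in HDFS.
# Nothing is shredded into a relational schema up front; the files stay as they are.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        log_time STRING,
        severity STRING,
        message  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/logs/web/'
""")

# Run the occasional aggregation only when somebody actually asks for it.
cursor.execute("""
    SELECT severity, COUNT(*) AS entries
    FROM web_logs
    GROUP BY severity
""")
for severity, entries in cursor.fetchall():
    print(severity, entries)
```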

Know how data is used; know the problem that the business wants to solve. Let the principle of consumption drive the structure of your information. You will thank me, some day, for freeing you from the false rigor of normalization.


13 Comments

  • Another solution is to use the CQRS design pattern. That is, command query responsibility segregation. Using this you can satisfy both fast read and write requirements. Using Service Broker you can queue the data that is required for reporting and use this to populate a separate database from your operational data store. In your reporting database you can de-normalize and aggregate your data to your heart’s content in order to make reports both easier to write and faster to run.

    Reply
    • CQRS is an alternate solution. I’ve never seen CQRS used in a way that solves the entity-attribute-value problem where there is an unknowable number of attributes. RDBMSes don’t like it when you keep adding nullable columns to really wide tables, either, and at a certain point they simply refuse to store any more. Reporting on a mess of NULLable columns in an RDBMS is really painful, too.

      Reply
  • Brandon Leach
    November 7, 2011 7:12 am

    Great points, Jeremiah! Design should be based on needs, not zealot-like adherence.

    Your opener reminds me of the barber showdown scene from Sweeney Todd, just without all the homicidal angst.

    Reply
  • Anything that will solve all your problems without requiring effort will cause another problem. If nothing else, it will cause unemployment.

    Reply
  • Let’s hope that nobody reads this without realizing that it is written ‘tongue-in-cheek’!

    Reply
    • Well, despite the obvious trolling of the title, I wouldn’t say that this was tongue in cheek. BCNF is arguably more important than third normal form, and 4NF and 5NF are usually attained without much extra effort. On the flip side, for some use cases normalization is the devil.

      After all, if someone didn’t think normalization was a bad idea, what would all of the business intelligence people do?

      Reply
      • Normalization is far from evil. In most cases it’s a good idea. But it’s not bad to look at other options, either, if you think they have benefits. I look at BI as a different case. The goal is to glean information from data. Normalization is more about organizing that source data.

        Reply
        • This.

          OLTP and OLAP systems are mutually exclusive. Normalization (among other things) helps to keep anomalies out of your transactional data so that when you do denormalize it into an OLAP system what you get back is accurate.

          “Normalize until it hurts, denormalize until it works” is still valid advice for OLTP systems.

          It’s too bad so many transactional systems are created by people who don’t understand the reasons behind things like normalization.

          Reply
  • Jeremiah,

    This post does read like snake-oil 😉

    Your 3NF argument here is a side effect of an object-to-role transformation, and hence not a good example of the dangers of normalisation but of the dangers of abstraction (remember good ol’ EAV?). It is actually a problem of decomposition & transformation that most traditional DBMSes (SQL Server included) fail on, because they physically implement their relations 1-to-1 with the logical view of those relations. Both NoSQL (Hadoop) and SQL DBMS solutions like MonetDB use column-like approaches to solve these issues and keep going FAST. But this has nothing to do with SQL, DBMSes, or relational theory (multirelations solve this issue in a mathematically elegant way) but purely with SQL DBMS physical implementations (and also SQL limitations).

    Reply
    • “This post does read like snake-oil” — my thoughts exactly; putting words in the mouths of people who can demonstrate that 3NF is just sound engineering.
      There are performance reasons why HADOOP et al make unupdatable copies of dependent data, or drop ACID constraints. That’s not an advantage; it’s a feechur. Most of this was hashed to death in the temporal-SQL developments of the 80’s and 90’s. Same sort of noises were made about MDX and dimensional databases — from those who thought a SQL “relation” was a two-dimensional object. The principles stand: use 3NF to avoid some problems known to have been around longer than you probably have.

      Reply
  • In the “Item” example given, you say that the database becomes “impossible to query reasonably”. There is a pattern for exclusive subcategories where the unique properties for a base class table are stored in child tables, with a foreign key to the parent base table. Triggers are required in SQL to enforce this type of referential integrity, but the queries are very straightforward and performant. It works for Items, Documents, and most objects I have encountered where subcategories have distinct properties. Unfortunately, while the queries are straightforward, they are not automagically generated for this pattern by most of the CRUD tools and frameworks. SQLMETAL can be altered to support this pattern, however.
    It may however turn out that snake oil (from the right snake) actually does cure some important ills.

    Reply
  • I enjoyed the article. The theme I embrace is that you should never do something just because it is what you have been doing. With technology and the needs of the business constantly changing, you need to be able to look at your environment with clear eyes and ask why you are doing this and why you are not doing something else instead. I can’t tell you how often I see someone’s eyes glaze over when these questions are asked.

    Reply
