Third Normal Form is Snake Oil

Step right up, ladies and gentlemen, and I will sell you the solution to all of your database needs. That’s right, it’s Doctor Codd’s Third Normal Form, guaranteed to cure all ailments of the schemata, pragmata, and performata. Doctor Codd’s Form will make your data performant and compressive. Accept no substitutes or imitators; Doctor Boyce’s lamentable attempts cannot soothe your aches and pains like Doctor Codd’s Third Normal Form. Why, with just a simple application of Doctor Codd’s Third Normal Form, thrice daily, you’ll be jumping around to the tune of normal forms and transactions in no time!

Sound Familiar?

Anyone pushing a single idea is pushing snake oil, plain and simple. They’re selling you a warm and fuzzy feeling that you’ll make your problems go away by following their simple prescriptions. Deviation from the remedy will, of course, result in problems, failure, and potentially phlebitis.

Can I Cure You?

No, I can’t. Well, I can, but I’m not going to. Not yet, at least. You need to pay attention, first.

The Forms, Both Magnificent and Normal, Are Not A Panacea

Slavish adherence to normalization is bad for your health. There are as many reasons to not normalize data as there are reasons to normalize your data. Don’t believe me? What if I asked you to design a database to persist items that we might sell in a store?

It’s easy, at first, to design an items table with a few columns to describe the main properties that we want to persist about an item in our store. Problems begin when different departments in the store need to save different properties, and different parts of our IT systems need different views of the data. While adding a column is trivial on a small database, adding a column to a large database is decidedly non-trivial. Eventually the design boils down to an items table and a generic item_properties table, and at that point the database becomes impossible to query reasonably.
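
To make the failure mode concrete, here’s a minimal sketch of where that design usually ends up. The schema and property names are hypothetical, and SQLite stands in for whatever RDBMS you actually run; the pain is the same everywhere, because every property you want back as a column turns into another conditional aggregate or another self-join.

```python
# Hypothetical items / item_properties (entity-attribute-value) schema,
# sketched with SQLite so it runs anywhere. Names are illustrative only.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (
        item_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE item_properties (
        item_id        INTEGER REFERENCES items(item_id),
        property_name  TEXT,
        property_value TEXT
    );
""")

# Reassembling one logical row means pivoting: every property a department
# wants back as a column becomes another MAX(CASE ...) expression.
rows = con.execute("""
    SELECT i.item_id,
           i.name,
           MAX(CASE WHEN p.property_name = 'weight_kg' THEN p.property_value END) AS weight_kg,
           MAX(CASE WHEN p.property_name = 'color'     THEN p.property_value END) AS color,
           MAX(CASE WHEN p.property_name = 'ship_box'  THEN p.property_value END) AS ship_box
    FROM items AS i
    LEFT JOIN item_properties AS p ON p.item_id = i.item_id
    GROUP BY i.item_id, i.name
""").fetchall()
print(rows)
```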

A Solution Most Surprising

We can solve this problem a few ways, but with Microsoft’s Hadoop announcements, it makes sense to look at what the non-relational world can offer. HBase is a real-time column-oriented database that runs on top of Hadoop.

HBase is helpful for modeling dynamic properties because of its flexible data model. While HBase does have tables, rows, and columns, there are some powerful differences. HBase’s columns are grouped into column families – logical groupings of columns that are declared when a table is created. Individual columns can be added on the fly once a column family exists.

Jumping back to our example, instead of modeling items and item_properties tables, we can create an items table and create column families to store properties specific to a department or to a common purpose. Rather than create many tables, we can add a shipping_info column family, an accounting column family, and a sales_promotion column family. Over time this flexible data model can be used to populate reporting tables in an enterprise data warehouse. Rather than focus initial efforts on building a robust general purpose schema in an RDBMS, it’s easy to create a flexible schema in HBase and pull out the data we need for reporting at a later time.
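
As a rough sketch of what that looks like from client code (using the happybase Python client here; the host, row keys, and column qualifiers are all made up for illustration), the column families are declared once, and individual columns simply appear whenever a department first writes them:

```python
# Minimal sketch with the happybase HBase client; the Thrift host, table,
# and qualifier names below are hypothetical.
import happybase

connection = happybase.Connection('hbase-thrift-host')  # assumed HBase Thrift endpoint

# Column families are declared up front, one per department or purpose.
connection.create_table(
    'items',
    {
        'shipping_info':   dict(),
        'accounting':      dict(),
        'sales_promotion': dict(),
    },
)

items = connection.table('items')

# Individual columns (qualifiers) are created on the fly just by writing them.
items.put(b'sku-1001', {
    b'shipping_info:weight_kg': b'1.2',
    b'accounting:unit_cost':    b'4.75',
})
items.put(b'sku-1002', {
    b'sales_promotion:holiday_discount_pct': b'15',  # brand-new column, no ALTER needed
})
```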

A Final Commentary on Data

Denormalization doesn’t have to be a dirty word. There are many reasons to denormalize data. Ultimately, the decision to shred data apart should depend not on blind adherence to the principles of normalization but on the needs of the applications that consume the data. If you have a log file processing application, does it make sense to read log files from disk into a relational database? Shredding every log entry into multiple columns doesn’t make sense when log files are only infrequently processed and are used mainly to produce aggregations.

Even when you eventually need to query the log file data, there are tools suited to performing SQL-like operations across flat files. Hive provides a SQL-like querying layer on top of the Hadoop framework, making it possible to run bulk queries across large volumes of data stored in flat files and spread across many servers.
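
Here’s a rough idea of what that looks like in practice, sketched with the PyHive client against a HiveServer2 endpoint; the host, table layout, and HDFS path are all assumptions. The point is that Hive only needs to be told how the flat files are delimited and where they live:

```python
# Minimal sketch with the PyHive client; host, schema, and HDFS path are hypothetical.
from pyhive import hive

cursor = hive.connect(host='hive-server', port=10000).cursor()

# Point an external table at the raw, tab-delimited log files already in HDFS.
# Nothing is shredded into a relational schema up front; the files stay as they are.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        log_time STRING,
        severity STRING,
        message  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/logs/web/'
""")

# Run the occasional aggregation only when somebody actually asks for it.
cursor.execute("""
    SELECT severity, COUNT(*) AS entries
    FROM web_logs
    GROUP BY severity
""")
for severity, entries in cursor.fetchall():
    print(severity, entries)
```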

Know how data is used; know the problem that the business wants to solve. Let the principle of consumption drive the structure of your information. You will thank me, some day, for freeing you from the false rigor of normalization.


13 Comments

  • Another solution is to use the CQRS design pattern. That is, command query responsibility segregation. Using this you can satisfy both fast read and write requirements. Using Service Broker you can queue the data that is required for reporting and use this to populate a separate database from your operational data store. In your reporting database you can de-normalize and aggregate your data to your heart’s content in order to make reports both easier to write and faster to run.

    Reply
    • CQRS is an alternate solution. I’ve never seen CQRS used in a way that solves the entity-attribute-value problem where there is an unknowable number of attributes. RDBMSes don’t like it when you keep adding nullable columns to really wide tables, either, and at a certain point they simply refuse to store any more. Reporting on a mess of NULLable columns in an RDBMS is really painful, too.

      Reply
  • Brandon Leach
    November 7, 2011 7:12 am

    Great points, Jeremiah! Design should be based on needs, not zealot-like adherence.

    Your opener reminds me of the barber showdown scene from Sweeney Todd, just without all the homicidal angst.

    Reply
  • Anything that will solve all your problems without requiring effort will cause another problem. If nothing else, it will cause unemployment.

    Reply
  • Let’s hope that nobody reads this without realizing that it is written ‘tongue-in-cheek’!

    Reply
    • Well, despite the obvious trolling of the title, I wouldn’t say that this was tongue in cheek. BCNF is arguably more important than third normal form, and 4NF and 5NF are usually attained without much extra effort. On the flip side, for some use cases normalization is the devil.

      After all, if someone didn’t think normalization was a bad idea, what would all of the business intelligence people do?

      Reply
      • Normalization is far from evil. In most cases it’s a good idea. But it’s not bad to look at other options, either, if you think they have benefits. I look at BI as a different case. The goal is to glean information from data. Normalization is more about organizing that source data.

        Reply
        • This.

          OLTP and OLAP systems are mutually exclusive. Normalization (among other things) helps to keep anomalies out of your transactional data so that when you do denormalize it into an OLAP system what you get back is accurate.

          “Normalize until it hurts, denormalize until it works” is still valid advice for OLTP systems.

          It’s too bad so many transactional systems are created by people who don’t understand the reasons behind things like normalization.

          Reply
  • Jeremiah,

    This post does read like snake-oil 😉

    Your 3NF argument here is a side effect of an object-to-role transformation, and hence not a good example of the dangers of normalisation but of the dangers of abstraction (remember good ol’ EAV?). It is actually a problem of decomposition & transformation that most traditional DBMSes (SQL Server included) fail on, because they physically implement their relations 1-to-1 with the logical view of those relations. Both NoSQL (Hadoop) and SQL DBMS solutions like MonetDB use column-like approaches to solve these issues and keep going FAST. But this has nothing to do with SQL, DBMSes, or relational theory (multirelations solve this issue in a mathematically elegant way) but purely with SQL DBMS physical implementations (and also SQL limitations).

    Reply
    • “This post does read like snake-oil” — my thoughts exactly; putting words in the mouths of people who can demonstrate that 3NF is just sound engineering.
      There are performance reasons why HADOOP et al make unupdatable copies of dependent data, or drop ACID constraints. That’s not an advantage; it’s a feechur. Most of this was hashed to death in the temporal-SQL developments of the 80’s and 90’s. Same sort of noises were made about MDX and dimensional databases — from those who thought a SQL “relation” was a two-dimensional object. The principles stand: use 3NF to avoid some problems known to have been around longer than you probably have.

      Reply
  • In the “Item” example given, you say that the database becomes “impossible to query reasonably”. There is a pattern for exclusive subcategories where the unique properties for a base class table are stored in child tables, with a foreign key to the parent base table. Triggers are required in SQL to enforce this type of referential integrity, but the queries are very straightforward and performant. It works for Items, Documents, and most objects I have encountered where subcategories have distinct properties. Unfortunately, while the queries are straightforward, they are not automagically generated for this pattern by most of the CRUD tools and frameworks. SQLMETAL can be altered to support this pattern, however.
    It may however turn out that snake oil (from the right snake) actually does cure some important ills.

    Reply
  • I enjoyed the article. The theme I embrace is that you should never do something just because it is what you have been doing. With technology and the needs of the business constantly changing, you need to be able to look at your environment with clear eyes and ask why you are doing this and why you are not doing something else instead. I can’t tell you how often I see someone’s eyes glaze over when these questions are asked.

    Reply
