supermikenews

My views on software, programming, Linux, the Internet, government, taxation, working, and life.

Friday, July 14, 2006

Why Database Table Denormalization Is Just As Important As Normalization


When it comes to databases, everyone wants to make them run faster. You may crack open books and read advice on the subject from Computer Science guys with Master's Degrees or even PhDs, and you might think they are absolutely 100% correct and could never tell you a lie. Well, I'm about to set you straight.

Most often the topic you hear about is normalization, which is where you work, to a greater or lesser degree, toward the goal of reducing redundant information across two or more tables. For a great explanation of normalization, check out these links:

The idea behind normalization is that if you don't have to write the same data to more than one table, the database should run faster, right?
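
To make that concrete, here's a minimal sketch, using Python's built-in sqlite3 module and a made-up customer/order schema, of the same data stored denormalized (customer details repeated on every order row) versus normalized (customer details stored once and referenced):

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Denormalized: customer details are repeated on every order row.
    conn.execute("""
        CREATE TABLE orders_flat (
            order_id      INTEGER PRIMARY KEY,
            customer_name TEXT,
            customer_city TEXT,
            product       TEXT
        )""")

    # Normalized: customer details live in one place, orders reference them.
    conn.execute("""
        CREATE TABLE customers (
            customer_id   INTEGER PRIMARY KEY,
            customer_name TEXT,
            customer_city TEXT
        )""")
    conn.execute("""
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers (customer_id),
            product     TEXT
        )""")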

In fact, now that I'm a manager of developers (as well as a coder myself), I often see that developers right out of college fall into two camps. One camp doesn't have a clue about how to optimize database design. The other camp knows nothing except normalization and wants to take it to the farthest extreme possible. They praise their teachers for teaching them this and think their computer science teachers knew exactly what they were talking about when they preached only the virtues of normalization. In fact, they may think every computer science teacher they had was an absolute genius who should have won the Nobel Prize or something.

However, I'm a seasoned programmer. My resume is not filled with the quick-hit resume stuffers and office politics games that others have played to land Senior Developer or Application Architect positions. My resume is filled with blood, sweat, and tears over more than a decade. My resume is not about buzzwords that should have dried up in the dot com implosion, such as "I write synergistic XML designs that federate your supply chains." I also didn't do something as stupid and time-wasting as going all the way to a Master's Degree in Computer Science before ever learning anything about real world programming, then landing a development management job and thinking I'm a know-it-all, even though that sort of resume often has nothing on it besides a short project here or there from college. That's not me. My work is about real world programming.

So back to that question. Does database normalization always make sense? Does reducing redundancy in database table design, which does indeed mean the least redundant data written to the hard drive, always make the most sense? The answer is no. It doesn't always make sense. Database denormalization is also important. In fact, you can even mix the two in the same database, and some newbies to development, or, scarier still, seasoned senior developers, might not even know that.

No, database normalization does not always make the most sense. Denormalization is at least worth considering when either of these two cases occurs:

  • When the time it takes the database to handle all the extra tables involved in a write transaction is greater than the time it takes to write the same data to one table, or to fewer tables.

  • When you want faster reads for commonly selected data.


Let's put this into easier language. Got something you want to write to the database and your normalization rules tell you to write it into, say, 8 tables instead of 4? Great, create one database design with that and time it. Then, time it against 7 tables, then 6, then 5, then 4, then at least 3. Which time is faster? Okay, you've just denormalized a little and found a speed improvement. Good for you! Did that sting? Do you need a bandaid?
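
Here's a rough sketch of the kind of timing test I mean, using Python and an in-memory SQLite database with made-up table names; your real schema and your real database server will behave differently, so run the same kind of test there before deciding anything:

    import sqlite3, time

    def time_writes(setup_sql, insert_row, rows=20000):
        # Build a fresh in-memory database and time how long the inserts take.
        conn = sqlite3.connect(":memory:")
        conn.executescript(setup_sql)
        start = time.perf_counter()
        for i in range(rows):
            insert_row(conn, i)
        conn.commit()
        return time.perf_counter() - start

    normalized = """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, product TEXT);
    """
    denormalized = """
        CREATE TABLE orders_flat (id INTEGER PRIMARY KEY, name TEXT, city TEXT, product TEXT);
    """

    def insert_normalized(conn, i):
        # The normalized design touches two tables per logical record.
        conn.execute("INSERT INTO customers VALUES (?, ?, ?)", (i, "name", "city"))
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (i, i, "widget"))

    def insert_denormalized(conn, i):
        # The denormalized design writes one wide row.
        conn.execute("INSERT INTO orders_flat VALUES (?, ?, ?, ?)", (i, "name", "city", "widget"))

    print("normalized writes  :", time_writes(normalized, insert_normalized))
    print("denormalized writes:", time_writes(denormalized, insert_denormalized))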

And that covers writes, but what about reads? Well, you'll often find that database reads take longer when you have to join all these tables in a normalized design. A good example is a CRM or ERP product where there's a dashboard a user sees right after logging in, or a product inventory or point of sale application where a list of data comes up on every login. Can you imagine the sheer number of reads this puts on your database? And if the reads are heavy, an extremely normalized database design may slow them down far more than you think.
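
To see what I mean, here's a sketch (same made-up schema as in the write test above) of the kind of query that runs on every login in the two designs; the normalized version pays for a join on every page view, while the denormalized version reads one wide table. Load them up with a realistic amount of data and time both:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers   (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
        CREATE TABLE orders      (id INTEGER PRIMARY KEY, customer_id INTEGER, product TEXT);
        CREATE TABLE orders_flat (id INTEGER PRIMARY KEY, name TEXT, city TEXT, product TEXT);
    """)

    # Normalized read: the join happens on every page view.
    joined_read = """
        SELECT o.id, c.name, c.city, o.product
        FROM orders AS o JOIN customers AS c ON c.id = o.customer_id
        ORDER BY o.id DESC LIMIT 50
    """

    # Denormalized read: one wide table, no join, at the cost of repeating name/city.
    flat_read = "SELECT id, name, city, product FROM orders_flat ORDER BY id DESC LIMIT 50"

    print(conn.execute(joined_read).fetchall())
    print(conn.execute(flat_read).fetchall())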

Instead, try stepping backwards a bit through the normal forms. If you're in sixth normal form, try everything from fifth down to first normal form, or no normal form at all. Which improves your read speed the most?

Okay, you can already see the problem with this, I'm sure. Denormalization has its faults too: although it may speed up your reads a great deal, it may slow down your writes too much. So you come to a crossroads. You have to create a table design from one of the following models:

  • Optimize mostly for table reads, not writes, and forget about it. This is likely not what you want to do unless your database is read-only.

  • Optimize mostly for table writes, not reads, and forget about it. Again, this is also not what you want to do unless you're storing customer transactions in a queue before they get written somewhere else.

  • Back off the normal form on some or all of your tables, resorting either to a lesser normal form or to no normal form at all. This may help you strike a balance, optimizing both reads and writes at the same time.

  • Split the database conceptually, so that certain tables take the writes in a decent normal form and certain tables serve the reads in a lesser normal form, each chosen to optimize application speed. The problem, then, is synchronizing data between these two sets of tables (there's a rough sketch of this right after the list), and that's a separate problem for you to consider.

  • Split the database physically into two or more databases, so that some are better for table writes, using a higher normal form, and some are better for table reads, using a lesser (or no) normal form. You will then have to use database replication tools, which come with many database server products, to synchronize data on a schedule between the two or more databases.
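
Here's a rough sketch of that conceptual split, again with made-up tables: the normalized tables take the writes, a denormalized table serves the reads, and a periodic job rebuilds the read table from the write tables. A real system might use triggers or the database's replication tools instead of a full rebuild:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers   (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
        CREATE TABLE orders      (id INTEGER PRIMARY KEY, customer_id INTEGER, product TEXT);
        CREATE TABLE orders_flat (id INTEGER PRIMARY KEY, name TEXT, city TEXT, product TEXT);
    """)

    def refresh_read_table(conn):
        # Rebuild the read-optimized table from the write-optimized tables.
        conn.execute("DELETE FROM orders_flat")
        conn.execute("""
            INSERT INTO orders_flat (id, name, city, product)
            SELECT o.id, c.name, c.city, o.product
            FROM orders AS o JOIN customers AS c ON c.id = o.customer_id
        """)
        conn.commit()

    refresh_read_table(conn)  # run this from a scheduled job, not on every request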


The decision is up to you. You see, normalization is actually a spectrum running from no normalization at all to fifth normal form. On the left, if you imagine this in your mind, the spectrum is fully denormalized, with no normalization whatsoever. On the right, it is normalized to fifth normal form. Your designs will have to lean to the right or to the left on different sets of tables, depending on the application features they serve. Your designs may also have to use conceptually split table designs in the same database, or split the database into multiple databases that each lean to the left or right of that spectrum.

Another important consideration is database manageability. Is it more important for you to have a database that other programmers can understand and improve than to have a database that is fastest overall on writes, reads, or both? The decision is yours. However, I've seen some frightful designs that fill a printout the size of a common doorway with tiny little tables and lines going everywhere to show the relations. I don't find that kind of design very useful, and it often marks a fifth normal form zealot who doesn't have a clue when it comes to real world programming.

In the first two links I provided for you at the top of this article, you'll be hard-pressed to find much talk of denormalization. The PhDs who write about database optimization seem to give it little more than a couple paragraphs, oddly enough. However, your speed tests can drive you to do the right thing.

So now, it didn't hurt to have a blogger without a Master's Degree in Computer Science teach you something new, did it? Like I said, you don't need a bandaid, do you?

2 Comments:

  • At Fri Jul 14, 11:23:00 PM MST , Blogger supermike said...

    One impolite individual on reddit.com tried to say that since I didn't say one thing about data integrity, I didn't know what I was talking about. Of course I think about data integrity. I use normalization with foreign keys for that, yes. However, in some cases where I know that selects are important, I either do not push that data all the way to the heavily normalized side, or I may build a table with as much information in it as possible, even if it is redundant, just so the selects are faster. I may run a data integrity risk by doing that, but that's a risk I sometimes must take when I'm faced with trying to speed up the application. It means I have to put other controls in the application, or in stored procedures, to try to ensure data integrity is maintained by another means besides normalization.
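
    Here's a rough sketch of the kind of control I mean, with a made-up schema and a SQLite trigger: a redundant customer_city column is kept on the orders table for fast selects, and the trigger keeps it in sync when the customer record changes.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT);
            CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, customer_city TEXT);

            -- Keep the redundant copy of city consistent with the customers table.
            CREATE TRIGGER sync_city AFTER UPDATE OF city ON customers
            BEGIN
                UPDATE orders SET customer_city = NEW.city WHERE customer_id = NEW.id;
            END;
        """)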

     
  • At Tue Aug 01, 05:17:00 PM MST , Anonymous Anonymous said...

    This paper identifies as a candidate for denormalization any situation where three or more join operations are required to produce a specific output.

    http://www.siue.edu/~dbock/cmis564/denormal.htm

     
