Appendix: traditional normalization

Normalization is usually thought of as a process of applying a set of rules to your database design, mostly to achieve minimum redundancy in the data. Most textbooks present this as a three-step process, with correspondingly labeled “normal forms,” which can be applied in an almost algorithmic sequence.

  • In theory, you could start with a single relation scheme (sometimes called the universal scheme, or U) that contains all of the attributes in the database—then apply these rules recursively to develop a set of increasingly normalized sub-relation schemes. When all of the schemes are in third normal form, the whole database is properly normalized. In practice, you will more likely apply the rules gradually, refining each relation scheme as you develop it from the UML class diagram or ER model diagram. The final table structures should be the same no matter which method (or combination of methods) you’ve used.

  • Since most developers will use traditional terms, you should know how the design patterns that you have learned will lead to the same (normalized) results, as shown in the following table:

    Normal forms

    First normal form (1NF)
      • Traditional definition: all attributes must be atomic, and no repeating groups
      • As presented here: eliminate multi-valued attributes, and eliminate repeated attributes

    Second normal form (2NF)
      • Traditional definition: first normal form, and no partial functional dependencies
      • As presented here: eliminate subkeys (where the subkey is part of a composite primary key)

    Third normal form (3NF)
      • Traditional definition: second normal form, and no transitive functional dependencies
      • As presented here: eliminate subkeys (where the subkey is not part of the primary key)

    Some textbooks discuss “higher” normal forms, such as BCNF (Boyce-Codd), 4NF, 5NF, and DKNF (domain-key). These topics are properly covered in a more advanced course or tutorial.
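
    To make the three steps concrete, the following sketch (in generic SQL, with illustrative table and column names that are not taken from this tutorial) shows the fully normalized result of decomposing a single flattened order-entry scheme:

        -- Suppose we start from one flattened scheme whose composite key is
        -- (order_id, item_no):
        --
        --   order_line(order_id, item_no, qty, cust_id, cust_name, zip, city, state)
        --
        -- 1NF: store one row per item rather than repeated item1, item2, ... columns.
        -- 2NF: cust_id, cust_name, and zip depend only on order_id (part of the
        --      composite key); that partial dependency moves into its own table.
        -- 3NF: city and state depend on zip, which is not part of any key; that
        --      transitive dependency also moves into its own table.

        CREATE TABLE zipcodes (
            zip   CHAR(5)     PRIMARY KEY,
            city  VARCHAR(40) NOT NULL,
            state CHAR(2)     NOT NULL
        );

        CREATE TABLE customers (
            cust_id   INTEGER     PRIMARY KEY,
            cust_name VARCHAR(40) NOT NULL,
            zip       CHAR(5)     REFERENCES zipcodes (zip)
        );

        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            cust_id  INTEGER REFERENCES customers (cust_id)
        );

        CREATE TABLE order_lines (
            order_id INTEGER REFERENCES orders (order_id),
            item_no  INTEGER,
            qty      INTEGER NOT NULL,
            PRIMARY KEY (order_id, item_no)
        );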

    Denormalization

    One major premise of this tutorial is that you should learn to develop the “best” possible design—which really focuses on the database structure itself. By doing this, you should be able to avoid many of the problems, bugs, inconsistencies, and maintenance nightmares that frequently plague actual systems in use today.

  • However, your database will always be part of a larger system, which will include at least a user interface and reporting structure, perhaps with a large amount of application code written in a language such as Java or C++. Your database could also be the back-end of a Web site, with both middle-tier business logic and front-end presentation code dependent on it. It is not uncommon for developers to “break the rules” of database design in order to accommodate other parts of a system.

  • An example of denormalization, using our “phone book” problem, would be to store the city and state attributes in the basic contacts table, rather than making a separate zip codes table. At the cost of extra storage, this would save one join in a SELECT statement. Although this would certainly not be needed in such a simple system, imagine a Web site that supports thousands of “hits” per second, with much more complicated queries needed to produce the output. With today’s terabyte disk systems, it might be worth using extra storage space to keep Web viewers from waiting excessively while a page is being generated. On the other hand, similarly increasing processor power makes it less likely that this tradeoff will actually have to be made in practice.
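
    As a quick sketch of that tradeoff (assuming hypothetical contacts and zipcodes tables, with illustrative column names), the normalized design needs a join that the denormalized one avoids:

        -- Normalized design: city and state are stored only in the zipcodes
        -- table, so listing contacts with their cities requires one join.
        SELECT c.last_name, c.first_name, z.city, z.state
          FROM contacts c
          JOIN zipcodes z ON z.zip = c.zip;

        -- Denormalized alternative: city and state are repeated in the
        -- contacts table itself, trading redundant storage for a join-free query.
        SELECT last_name, first_name, city, state
          FROM contacts;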

  • The key to successful denormalization is to make sure that end users of the system never have to manually duplicate or maintain the redundant data. Possible techniques for doing this include using materialized views, writing triggers (code executed by the database itself—not available on all systems), or writing application code that takes care of it at data-entry time.
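
    For instance, a trigger could fill in the redundant columns automatically whenever a contact row is inserted or its zip code changes. The sketch below assumes PostgreSQL 11 or later and the same hypothetical contacts and zipcodes tables as above; on systems without triggers, a materialized view or application code at data-entry time would play the same role:

        -- Copy city and state from zipcodes into the contacts row, so the
        -- person entering data never types (or mistypes) them by hand.
        CREATE FUNCTION fill_city_state() RETURNS trigger AS $$
        BEGIN
            SELECT z.city, z.state
              INTO NEW.city, NEW.state
              FROM zipcodes z
             WHERE z.zip = NEW.zip;
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER contacts_fill_city_state
            BEFORE INSERT OR UPDATE OF zip ON contacts
            FOR EACH ROW
            EXECUTE FUNCTION fill_city_state();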