Appendix: traditional normalization
Normalization is usually thought of as a process of applying a set of rules to
your database design, mostly to achieve minimum redundancy in the data. Most textbooks
present this as a three-step process, with correspondingly labeled
“normal forms,”
which could be done in an almost algorthmic sequence.
In theory, you could start with a single relation scheme (sometimes called
the universal scheme, or U) that contains all of the attributes in the database—then apply
these rules recursively to develop a set of increasingly-normalized sub-relation schemes. When all
of the schemes are in third normal form, then the whole database is properly normalized. In
practice, you will more likely apply the rules gradually, refining each relation scheme
as you develop it from the UML class diagram or ER model diagram. The final table structures
should be the same no matter which method (or combination of methods) you’ve used.
Since most developers will use traditional terms, you should know how
the design patterns that you have learned will lead to the same (normalized) results, as shown in the
following table:
Normal forms
Normal form |
Traditional definition |
As presented here |
First normal form (1NF) |
All attributes must be atomic, and
No repeating groups |
Eliminate
multi-valued attributes, and
Eliminate
repeated attributes |
Second normal form (2NF) |
First normal form, and
No partial functional dependencies |
Eliminate
subkeys (where the subkey is part of a composite primary key) |
Third normal form (3NF) |
Second normal form, and
No transitive functional dependencies |
Eliminate
subkeys (where the subkey is not part of the primary key) |
Some textbooks discuss “higher” normal forms, such as BCNF (Boyce-Codd),
4NF, 5NF, and DKNF (domain-key). These topics are properly covered in a more advanced course or tutorial.
Denormalization
One major premise of this tutorial is that you should learn to develop the “best”
possible design—which really focuses on the database structure itself. By doing this, you should be
able to avoid many of the problems, bugs, inconsistencies, and maintenance nightmares that frequently
plague actual systems in use today.
However, your database will always be part of a larger system, which will
include at least a user interface and reporting structure, perhaps with a large amount of application
code written in a language such as Java or C++. Your database could also be the back-end of a Web
site, with both middle-tier business logic and front-end presentation code dependent on it. It is not
uncommon for developers to “break the rules” of database design in order to accommodate
other parts of a system.
An example of denormalization, using our “phone book” problem, would be to
store the city and state attributes in the basic contacts table, rather than making a separate zip codes table.
At the cost of extra storage, this would save one join in a SELECT statement. Although this would
certainly not be needed in such a simple system, imagine a Web site that supports thousands of “hits”
per second, with much more complicated queries needed to produce the output. With today’s
terabyte disk systems, it might be worth using extra storage space to keep Web viewers from waiting
excessively while a page is being generated. On the other hand, similarly-increasing processor power makes it less
likely that this tradeoff will actually have to be made in practice.
The key to successful denormalization is to make sure that end users of the system
never have to manually duplicate or maintain the redundant data. Possible techniques for doing this
include using materialized views, writing triggers (code executed by the database itself—not available
on all systems), or writing application code that takes care of it at data-entry time.