Design pattern: subkeys (the zip code)

(Wayne Dick and Tom Jewett)

Redundancy, Subkey, Functional Dependencies, Normalization

Introduction

One of the major goals of relational database design is to prevent unnecessary duplication of information (redundancy). In fact, this is one of the main reasons for using a relational database instead of a “flat file” that stores all information in one table. Sometimes we will design a class that seems to be correct, only to find out in the relation scheme or in the table itself that we have a problem.

Example: Contacts

Almost every personal productivity program today includes some sort of contact manager. A “contact” is a person who could be a business associate or simply a friend or family member. Many of these programs have a very simplistic one-table model for the contact information, which probably has a corresponding UML class diagram as shown below (omitting phone numbers for the moment)

Contact class diagram
Contacts UML class diagram. Other views of this diagram: Large image - Data dictionary (text)

It may not be obvious that this model has a problem, until you look at the Contacts table with some typical data filled in, as illustrated below.

Contacts table with typical rows
firstNamelastNamestreetzipCodecitystate
GeorgeBarnes1254 Bellflower90840Long BeachCA
SusanNoble1515 Palo Verde90840Long BeachCA
ErwinStar17022 Brookhurst92708Fountain ValleyCA
AliceBuck3884 Atherton90836Long BeachCA
FrankBorders10200 Slater92708Fountian ValleyCA
HannaDiedrich1699 Studebaker90840Long BeachCA

Notice the repeated information in the city and state attributes. This is not only redundant data; it might also be inconsistent data. (Can you spot the “typo” above?)

Functional dependencies, subkeys, and lossless join decomposition

To understand why we have a problem, we first have to understand the concept of a functional dependency (FD), which is simply a more general concept of the super key constraint. If X and Y are sets of attributes, then the notation X→Y is read “X functionally determines Y” or “Y is functionally dependent on X.” This defines a constraint on any table whereby if any two rows have the same value for X, then the two rows must have the same value for Y. That is, given a value for X (no matter how many rows may have that same value), it maps to only one unique value for Y.

Detecting redundancy

A super key always functionally determines all of the other attributes in a relation (as well as itself). If you are not quite sure why this is the case, consider a table in which there are two rows with the same value for the super key and apply the definition of FD above. Informally, we say a super key FD is a “good” FD. A “bad” FD happens when we have an attribute or set of attributes that functionally determine some but not all of the attributes in the relation. We call such set of attributes a subkey of the relation, as it determines only a subset of the attributes, not all of the attributes, the way a super key does.

Thus, a subkey dependency enables us to detect when a relation scheme has redundancy; that is, when a table using the scheme will contain unnecessary duplication of information. A relation scheme has redundancy whenever there is a subkey dependency. In such a case, any table following the relation scheme will redundantly store information about the attributes determined by the subkey.

Preventing/Removing redundancy

There is a very simple 4-step way to fix the problem with a relation scheme that has redundancy, as detected by the presence of a subkey. In the steps given below, the scheme with redundancy is R and the subkey is represented by the FD W→Z, where each of W and Z is a set of attributes that is a subset of the attributes in R.

  1. Replace R by two schema, R1 and R2 as described in the following steps.
  2. Assign R1 the attributes in the union of the attributes in the subkey FD. That is R1={W ∪ Z}. Since W→Z, by definition, W is a superkey of R1.
  3. Assign R2 the set of attributes {R - Z}, that is, all the attributes in R except those in Z, the attributes on the right-hand-side of the subkey FD
  4. Both R1 and R2 share W in common. In R1, W is a superkey and in R2 it becomes a foreign key.

The “bad” subkey dependency has been removed because we’ve moved the attributes that were functionally dependent on W to another scheme, and we’ve made W the super key of that scheme.

The example below illustrates this process.

Example: detecting and removing redundancy in Contacts

In our contacts example we started above, the zipCode is a subkey of the Contacts relation scheme. It is not a super key for the entire table, but it functionally determines the city and state, zipCode→{city, state}. (If you know the zip code, you can always find the city and state, although, you might need all nine digits instead of the five shown in the sample table.) The opposite is not true, because a city has more than one zip code, like Long Beach CA in this example.

We can depict the functional dependencies visually in the relation scheme, as shown in the following figure.

Contact relation scheme with functional dependencies
Contacts relation scheme with functional dependencies. Other views of this diagram: Large image - Data dictionary (text)

Applying the 4-step process described above, we will replace the single relation scheme database with two relation schema. For readability, we keep the same name for one of our scheme; thus, we go from the relation scheme diagram above to the one shown in the next figure. In this revised model, we will have a many-to-one relationship between the two schema since one has a FK that references the PK of the other scheme.

Revised contact relation scheme
Relation scheme diagram after removing redundancy due to subkey dependency. Other views of this diagram: Large image - Data dictionary (text)

The new Contacts table will look like the old one, minus the city and state fields. The new ZipLocations table, shown below, contains only one row per zip code. Joining this table to the Contacts (on matching zipCode pk-fk pairs) will produce the same information that was in the original table. This property, whereby the information modeled in one database is not lost when the database is redesigned as we've done here is formally called the lossless join property of a decomposition of the original table.

Zip locations
zipCodecitystate
90840Long BeachCA
90836Long BeachCA
92708Fountain ValleyCA

Subkeys and normalization

Normalization means following a procedure or set of rules to insure that a database is well designed. Most normalization rules are meant to eliminate storing redundant information (that is, unnecessary duplicate information) in the database. The procedure presented in the previous section removes redundancy that is always a consequence of subkeys. In fact, that procedure is one stage of the BCNF decomposition algorithm which will be covered in another article.

If there are no subkeys in any of the tables in your database, you have a well-designed model according to what is usually called third normal form, or 3NF. Actually, 3NF permits subkeys in some very exceptional circumstances that we won’t discuss here; the strict no-subkey form is formally known as Boyce-Codd normal form, or BCNF.

Some textbooks use the terms partial FDs and transitive FDs. Both of these are subkeys—the first where the subkey is part of a primary key, the second where it isn’t. Both can be eliminated by the procedure presented in this article.

Correcting the UML class diagram

When we find a subkey in a relation scheme or table, we also know that the original UML class was badly designed. The problem, always, is that we have actually placed two conceptually different classes in a single class definition.

In the contacts example, a zipCode is not strictly an attribute of the Contact class. It is an attribute of a ZipLocation class, which we can describe as “a geographical location whose boundaries have been uniquely defined by the U.S Postal Service for mail delivery.”

The zipCode is an external key, created by the USPS for the convenience of its sorting machinery (not the postal customers). The ZipLocation class has the additional attributes of the city and state where it is located; in fact, it also has the attributes needed to precisely describe its boundaries, although we certainly do not need to represent these in our database. The geographical boundaries would form the “real” descriptive CK if they were included. As always, we need to describe the association between ZipLocations and Contacts.

  • Each Contact lives in one and only one ZipLocation
  • Each ZipLocation may be home to many Contacts

As with all one-to-many associations, the association itself identifies which ZipLocation a Contact lives in. If we had started with this class diagram and used our previously defined procedures to map it to the relational model, we would have produced exactly the same relation scheme that we developed with the normalization process above. This speaks to the importance of a well-designed UML class diagram; though, to become a good designer, one must build hundreds of database models and be exposed to a variety of different contexts or enterprises.

Revised contact class diagram
UML class diagram revised to model city/state/zipcode information. Other views of this diagram: Large image - Data dictionary (text)