SQL technique: union and minus

Set operations on tables

Some students initially think of the join as being a sort of union between two tables. It’s not (except for the schemes). The join pairs up data from two very different tables. In RA and SQL, union can operate only on two identical tables. Remember the Venn-diagram representation of the union and minus operations on sets. Union includes members of either or both sets (with no duplicates). Minus includes only those members of the set on the left side of the expression that are not contained in the set on the right side of the expression.

Set operations

Both sets, R and S, have to contain objects of the same type. You can’t union or minus sets of integers with sets of characters, for example. All sets, by definition, are unordered and cannot contain duplicate elements.

SQL and RA set operations treat tables as sets of rows. Therefore, the tables on both sides of the union or minus operator have to have at least the same number of attributes, with corresponding attributes being of the same data type. It’s usually cleaner and more readable if you just go ahead and give them the same name using the AS keyword.

Union

For this example, we will add a Suppliers table to our sales data entry model. “A supplier is a company from which we purchase products that we will re-sell.” Each supplier suppliers zero to many products; each product is supplied by one and only one supplier. The supplier class attributes include the company name and address, plus the name and phone number of the representative from whom we make our purchases.

We would like to produce a listing that shows the names and phone numbers of all people we deal with, whether they are suppliers or customers. We need rows from both tables, but they have to have the same attribute list. Looking at the relation scheme, we find corresponding first name, last name, and phone number attributes, but we still need to show what company each of the supplier representatives works for.

Set operations

We can create an extra column in the query output for the Customers table by simply giving it a name and filling it with a constant value. Here, we’ll use the value 'Customer' to distinguish these rows from supplier representatives. SQL uses the column names of the first part of the union query as the column names for the output, so we will give each of them aliases that are appropriate for the entire set of data.

Build and test each component of the union query individually, then put them together. The ORDER BY clause has to come at the end.

        SELECT cLastName AS "Last Name", cFirstName AS "First Name",
          cPhone as "Phone", 'Customer' AS "Company"
        FROM customers
        UNION
        SELECT repLName, repFName, repPhone, sCompanyName
        FROM suppliers
        ORDER BY "Last Name";

Phone list

Bradley	Jerry	888-736-8000	Industrial Tool Supply
Dick	Wayne	562-777-3030	Customer
Jewett	Tom	714-555-1212	Customer
Monge	Alvaro	562-333-4141	Customer
O'Brien	Tom	949-567-2312	Bosch Machine Tools

Minus

Sometimes you have to think about both what you do want and what you don’t want in the results of a query. If there is a WHERE clause predicate that completely partitions all rows of interest (the result set) into those you want and those you don’t want, then you have a simple query with a test for inequality.

The multiplicity of an association can help you determine how to build the query. Since each product has one and only one supplier, we can partition the Products table into those that are supplied by a given company and those that are not.

        SELECT prodName, sCompanyName
        FROM Products NATURAL JOIN Suppliers
        WHERE sCompanyName <> 'Industrial Tool Supply';

Contrast this to finding customers who did not make purchases in 2002. Because of the optional one-to-many association between Customers and Orders, there are actually four possibilities:

1. A customer made purchases in 2002 (only).
2. A customer made purchases in other years, but not in 2002.
3. A customer made purchases both in other years and in 2002.
4. A customer made no purchases in any year.

If you try to write this as a simple test for inequality,

        SELECT DISTINCT cLastName, cFirstName, cStreet, cZipCode
        FROM Customers NATURAL JOIN Orders
        WHERE TO_CHAR(orderDate, 'YYYY') <> '2002';

you will correctly exclude group 1 and include group 2, but falsely include group 3 and falsely exclude group 4. Please take time to re-read this statement and convince yourself why it is true!

We can show in set notation what we need to do:

{customers who did not make purchases in 2002} = {all customers} − {those who did}

There are two ways to write this in SQL.

The easiest syntax in this case is to compare only the customer IDs. We’ll use the NOT IN set operator in the WHERE clause, along with a subquery to find the customer ID of those who did made purchases in 2002.

        SELECT cLastName, cFirstName, cStreet, cZipCode
        FROM Customers
        WHERE custID NOT IN
          (SELECT custID
          FROM Orders
          WHERE TO_CHAR(orderDate, 'YYYY') = '2002');

We can also use the MINUS operator to subtract rows we don’t want from all rows in Customers. (Some versions of SQL use the keyword EXCEPT instead of MINUS.) Like the UNION, this requires the schemes of the two tables to match exactly in number and type of attributes.

        SELECT cLastName, cFirstName, cStreet, cZipCode
        FROM Customers
        MINUS
        SELECT cLastName, cFirstName, cStreet, cZipCode
        FROM Customers NATURAL JOIN Orders
        WHERE TO_CHAR(orderDate, 'YYYY') = '2002';

Other set operations

SQL has two additional set operators. UNION ALL works like UNION, except it keeps duplicate rows in the result. INTERSECT operates just like you would expect from set theory; again, the schemes of the two tables must match exactly.