Database "Key" Subtleties
There is a lot of confusing documentation, opinion and debate when it comes to database terminology, database design dos and don'ts and best practices. This is especially prevalent when it comes to the subtle differences between seemingly similar relational database "key" terminology/constructs, such as candidate keys, super keys, surrogate keys, natural keys, primary keys, alternate keys, unique keys, unique constraints, unique indexes and foreign keys.
I will attempt to uncover some of the subtle, but very pertinent, differences between these confusing concepts and terminology.
Candidate, Primary, Alternate and Unique Keys
Very simply, any combination of one or more attributes (columns), whose combined value is unique among all tuples (rows) is a candidate key (CK) on a relation (table). That is, any unique column or unique combination of columns is a candidate key on a table.
The primary key (PK) of any table is any candidate key of that table which the database designer arbitrarily designates as "primary". The primary key may be selected for convenience, comprehension, performance, or any other reasons. It is entirely proper (albeit often inconvenient) to change the selection of primary key to another candidate key.
The alternate keys (AK) of any table are simply those candidate keys which are not currently selected as the primary key. Exactly one of those candidate keys is chosen as the primary key and the remainder, if any, are then called alternate keys. An alternate key is a function of all candidate keys minus the primary key.
There can be many candidate keys on a table. The terminology "candidate key" implies that one of these keys can be selected as the primary key -- there can be only one primary key on a table (more about primary keys below). That is, all of the candidate keys are potential primary keys, hence the name "candidate key". It is theoretically possible that a table has no candidate keys and therefore no primary key. This is a hot topic of debate, since many relational database purists insist that all tables must have a primary key, and therefore must have at least one candidate key.
Many data modelling products do not adequately represent the concept of a candidate key. This is unfortunate as much metadata in the model definition is lost as a result.
For example:
Lets say there is an Employee table, a Skill table and an EmployeeSkill association table associating skills with employees:
EmployeeSkill
In the table above, the primary key (PK) is across both the EmployeeId and the SkillId columns.
Lets say that later on we decide to add a database generated, numeric integer, surrogate key column called EmployeeSkillId and designate it as the new primary key:
EmployeeSkill
| PK | EmployeeSkillId |
| | EmployeeId |
| | SkillId |
By adding the EmployeeSkillId column and changing the primary key to it, the model has lost the notion that EmployeeId and SkillId together form a candidate key (alternate key) and are therefore unique. That is, their uniqueness is no longer automatically enforced by the model.
In the database model, the candidate keys (CK) should first be represented and then one of those candidate keys marked as the primary key, thus making the remaining candidate keys alternate keys (AK) as follows:
| CK1 | EmployeeSkillId |
| CK2 | EmployeeId |
| CK2 | SkillId |
The model metadata would now be as follows:
| PK (CK1) | EmployeeSkillId |
| AK1 (CK2) | EmployeeId |
| AK1 (CK2) | SkillId |
This ensures that the EmployeeId/SkillId composite, alternate key still enforces uniqueness in the model. If we change the primary key to the second candidate key, there is sufficient metadata to preserve the notion of a unique EmployeeSkillId alternate key.
| AK (CK1) | EmployeeSkillId |
| PK (CK2) | EmployeeId |
| PK (CK2) | SkillId |
A characteristic of a candidate key is that it should never allows NULLs. This is important, because a candidate key is a potential primary key, and a primary key must never be NULL because it is used to identify a row. Likewise an alternate key must never allow NULLs for the same reason.
A candidate key must be unique. That is, a candidate key identifies a row uniquely, and therefore so does a primary key and an alternate key.
A very important, but often overlooked, notion of a candidate key, and therefore a primary key and an alternate key as well, is the notion of immutability.
A candidate key must always be immutable!
There is a distinct difference between truly, uniquely identifying a row and simply distinguishing a row from other rows. For instance a company has a name and a registration number. Both are unique and "identifying". But, the name is not immutable, since a company can change its name, while the registration number is immutable as the registration number never changes. Thus, although the company name can be used as a unique key to identify the company at a point in time, it cannot TRUELY identify the company through its entire lifetime -- as the name can change over time, and thus is not a candidate key (or a primary or alternate key for that matter).
This is an important and probably the most significant distinction between a candidate key (and therefore a primary and alternate key as well) and a unique key.
By enforcing that all candidate keys (primary and alternate keys) are immutable, unique and not nullable, they are truly identifying. This is an important characteristic of a candidate key, since a primary key and/or alternate keys can be used to persist references to an entity outside of the database model -- perhaps in another database. For this reason, primary and alternate keys should never be updated. Even if you allow cascaded updates within your data model, you cannot know who is holding a reference to your primary or alternate keys outside of your data model.
If you did allow updating of a primary or alternate key, it would imply that you had some knowledge outside of the data model, that the row before the update and after the update represents the same entity. This implies that your data model is wrong, because this knowledge is not represented in the data model. In fact there would be no distinction between an update and a delete followed by an insert.
In summary, a candidate key must have the following properties:
1. Must NOT be NULL. You must not have the notion of a missing identity for an entity.
2. Must be unique.
3. Must be immutable.
One candidate key must be selected as the primary key -- all other candidate keys become the alternate keys, which adhere to the same rules as a candidate key.
A unique key, that is not a candidate key (primary key or alternate key), is mutable.
Unique Index versus Unique Constraint
To define a unique key (not a primary key or an alternate key), SQL Server offers two implementations, a unique index and a unique constraint. Although these are both implemented in SQL Server as unique indexes under the hood, there are important differences between the two.
I have seen many posts touting that unique constraints and unique indexes are essentially the same. Many posts even highlight the implementation differences between the two concepts, such as FILLFACTOR that can be specified on a unique index, but not on a unique constraint. However, none of the posts I have read, have ever highlighted the most important and significant difference between these two concepts.
Conceptually, a unique constraint is a schema enforcing construct. That is, it enforces uniqueness on the column or columns to which it applies. A unique index implies something completely different from a modelling point of view. An index implies something added in addition to the schema definition, to improve performance.
Conceptually a constraint is a real schema modelling construct, which enforces integrity in the model, while an index is created to improve query performance. Thus I might decide to drop a unique index and create another index, while performance tuning, and inadvertently break the constraints of the data model schema.
If something is truly unique with respect to the conceptual/logical data model, then it must be defined as a unique constraint to adequately represent the notion of the constraint. However, if you define an index across one or more columns to improve performance, and that index happens to include a column or columns, which make it unique, then you should define this as a unique index, in addition to the unique constraint on the unique columns to preserve the conceptual/logical data model.
For example:
Employee
If IdNumber in the Employee table above is unique, it should be defined as a unique constraint in the data model.
If you often query on FirstName and return the IdNumber, you may choose to create an index, to improve performance, on FirstName and IdNumber. This index is unique because it included the IdNumber column which is unique and should be created as a unique index in addition to the unique constraint.