Database Fundamentals #26: The Primary Key War

There is a war about primary keys in the database world. There are two camps. The first camp believes that primary keys should only ever be created on meaningful information. For example, there is an ISO standard for the abbreviation of state names in the United States. You could create a table for looking up state names and make the primary key that abbreviation because it is guaranteed to be unique. The other camp believes that primary keys should never be created on meaningful information because, meaningful information is subject to change and you don’t want your primary keys to be changing. A changing primary key means changing all the tables that are related to that value. This camp believes that all primary keys should be artificial. As far as I’m concerned, both camps are 100% correct. My own personal preference is towards the artificial key camp, but I have built systems with natural keys and they can work just fine. The examples we’re using in these blog posts are for artificial keys, but that is just me staying in my comfort zone. I am not attempting to advocate for one camp or the other. I’ll leave it to others to have that fight. I just want to make you aware that you will see both types in different databases. You may see both types in the same database. In upcoming examples, I will even show you something that is a combination of both types.

One point in favor of natural keys, just because you have an artificial key, you’re not done. You still must define the natural key and set it up as a candidate, or alternate, key in order to be sure that you’re meeting the business requirements for uniqueness as well as simply uniquely identifying the row. If you do not do this, then it will be possible to violate business rules such as having more than one state abbreviation with identical values. So, while artificial keys are easier to set up and maintain, and can perform better, you have the added overhead of having to also set up and maintain the natural key anyway. Don’t ever forget this point if you do choose to work with artificial keys.

One point in favor of artificial keys, data compliance through mechanisms like the GDPR requires you to change the data going to non-production environments so that privacy is maintained. A natural key could also be personally identifying information. Having to change all primary key and foreign key values across your database could be a massive undertaking. Artificial keys, used appropriately, completely avoid this issue.

Conclusion

Artificial vs. Natural keys is not a hill I wish to die on. Nor should you. There are so many more important things to worry about when building your database. The next post in the series will be on the scripts necessary to define primary keys.

7 thoughts on “Database Fundamentals #26: The Primary Key War

  • Sean Redmond

    In the case of something not expected to change soon (or possibly marginally in our life times), like the states in the U.S., either artificial or natural keys (say the official two-letter code) would do me fine, although I would take the artificial key as a matter of habit. This also applies to other enumeration-type tables that are generally very static.

    What is more important to me is the datatype chosen. If it is to be an artificial key then TINYINT, if it is to be a natural key, then CHAR(2). This key could be well be used in a table with a 100 million rows and then every byte in the primary key adds on extra 100 million bytes. That is an extra 12,207 pages *per byte* or en extra 36,621 pages if the developer has set the datatype to INT on the grounds of old habits.

    And, in this case, if there is a database with a 100-million row table that has state foreign key field, there is a case to be made for the 2-byte CHAR(2) — it may very well save many needless joins in queries when only the 2-character state code is required — against the smaller number of pages saved with the TINYINT primary key.

  • Andy Eggers

    I agree with you; but my interest is in the speed of data retrieval.

    1. Create a large table with an identity field for a Primary key and a clustered index on that field. Perform timed queries to
    retrieve data from the table.
    2. Create a same large table without an identity field and clustered index. Perform timed queries to retrieve data from the
    table.
    3. “The Why”: If you create a table with an Identity field, SQL Server uses 4 Bytes to keep track of each row of data in the
    table.
    If you create the same table with No identity field, then SQL Server uses a hidden Varchar (128) field or 128 Bytes to keep
    track of each row of data in the table. This hidden field is a RID (Row Identification Number) which contains the Database
    File number: Page number and Slot number of each record in a Table.

    Here are two Queries on your table with the no identity field, which shows the RID’s in your table:

    SELECT sys.fn_PhysLocFormatter(%%physloc%%) AS [File:Page:Slot], * FROM Your_TableName ;

    SELECT * FROM Your_TableName CROSS APPLY sys.fn_PhysLocCracker(%%physloc%%);

    Me, I always feel the need for speed.

    • And on that very strict comparison, no arguments. However, what about a situation where the identity column isn’t used to retrieve the data. Let’s say, just as an example, two foreign key values, which also define a natural key, both INT, are the most common path to the data. Does it still make sense to make an identity column on the table? If so, does it make sense to make that idenity column the clustered index? I’d argue no to both and could provide good tests to back it up.

      In short, this stuff is nuanced. There’s the general rule, which you’re on and I agree with. Then there’s specifics, and where things get tough.

  • I only saw this today so am a bit late chiming in. In my mind it’s very much like so many things in the database world. It depends. In this case it depends on the goal for the design. I like to try and use natural keys when they apply but I work on multi-tenant systems so this is largely not practicable. Alternate keys composed of a combination of naturally meaningful data and automatically populated numeric values are most often used.
    No design should be governed solely by dogmatic design rules. The design should be governed by the goals, performance and flexibility needed in adding, updating and retrieving the data.

    my 1.2 cents worth.
    Richard.

    • Honestly, couldn’t agree more. If all my mistakes and screw ups have taught me anything, it’s that there’s often a reasonable compromise that’s going to work better overall than a dogmatic standard.

  • David

    It should be pointed out that GDPR applies equally to artificial keys as to natural ones. But you make the important point well: there is really no “war” here because it’s a phoney dilemma. Natural keys are virtually essential to satisfy almost any accurate representation of the real world. Given that you will or should have natural keys in your database it makes no difference which keys you call “primary”. Keys matter but primary keys are an issue of little or no importance in the majority of cases.

Please let me know what you think about this article or any questions:

This site uses Akismet to reduce spam. Learn how your comment data is processed.