SQL Server training for developers: primary keys & indexes

I had to do some developer training last week and I wrote up a paper on the basics of primary keys and indexes. Sure, there’s tons of similar stuff around the net, but this is MINE, baby.

Our Table: Phone Numbers

For our evil training purposes, let’s say we work for the phone company, and we need a database table with phone numbers. We need to track:

  • Phone number (required)
  • Billing contact last name (required)
  • Billing contact first name (required)
  • Business name (optional)
  • Business category (restaurant, dog groomer, auto dealer, etc)
  • Address 1
  • Address 2
  • City
  • State
  • Zip
  • Service start date

(Sometimes a person or a business will have multiple phone numbers, but for the sake of this training, let’s keep it a simple flat table.)

We will never have two records in here with the same phone number. We have to tell our database about that by making the phone number our primary key. When we make the phone number the primary key, we’re telling SQL Server that there can be no duplicate phone numbers.

That means every time a record is inserted or updated in this table, SQL Server has to check to make sure nobody exists with that same phone number.

As of the year 2000, there were about 360,000 people in Miami. Throw in businesses, and let’s say our table has 500,000 records in it.

That means by default, every time we insert one eensy little record, SQL Server has to read half a million records just to make sure nobody else has the same phone number!

Every 1 write = 500,000 reads.

Well, that won’t work, will it? So let’s organize our table in the order of phone number. That way, when SQL Server inserts or updates records, it can quickly jump to the exact area of that phone number and determine whether or not there’s any existing duplicates.

This is called setting up a primary CLUSTERED key. It’s called clustered because – well, I have no idea why it’s called clustered, but the bottom line is that if you could look at the actual hard drive the data was stored on, it would be stored in order of phone number.

This is different from a primary NONCLUSTERED key – nonclustered means it has nothing to do with the way the data is stored on the physical hard drive.

So to recap:

  • Primary Clustered Key means 1 write = a few reads
  • Primary Nonclustered Key means 1 write = 500,000 reads

Sounds like a black and white decision, right? It usually is. There are ways to make primary nonclustered keys fast, but it needs to be a very careful decision not taken lightly, because there are a heck of a lot of ways to make them dead slow.

But let’s get back to our phone number table. Now we have the table organized by phone number, and if we want to find people by phone number, it’ll be very fast. While our computer systems will usually need to grab people’s data by phone number, our customers and end users often need to get numbers by other ways. That’s where indexes come in.

The White Pages: A Lookup Index

Our customers constantly need to find people’s phone numbers by their name. They don’t know the phone number, but they know the last name and first name. We would create an index called the White Pages:

  • Billing contact last name
  • Billing contact first name
  • Phone number

That index would save people a ton of time. Think about how you use the white pages:

  • You scan through pages looking at just the letters at the top until you get close
  • When you get close, you open up the full book and jump to the right letters
  • You can quickly find the right single one record

Now think about how you would do it without the White Pages. Think if you only had a book with 500,000 records in it, organized by phone number. You would have to scan through all 500,000 records and check the last name and first name fields.

The database works the same way, except it’s even worse! If a developer wrote a SQL query looking for the phone number, it would look like this:

SELECT PhoneNumber FROM Directory WHERE LastName = ‘Smith’ AND FirstName = ‘John’

That doesn’t say select the top one – it says select ALL of them. When you, as a human being, go through that list of 500,000 phone numbers, you would stop when you thought you found the right John Smith. The database server can’t do that – if it finds John Smith at row #15, it doesn’t matter, because there might be a few John Smiths. Whenever you do a table scan and you don’t specify how many records you need, it absolutely, positively has to scan all 500,000 records no matter what.

If the database has an index by last name and first name, though, the database server can quickly jump to Smith, John and start reading. The instant it hits Smith, Johnathan, it knows it can stop, because there’s no more John Smiths.

Covering Fields: Helping Indexes Work For You

But that’s not always enough. Sometimes we have more than one John Smith, and the customer needs to know which John Smith to call. After all, if your name was John Smith, and the phone book didn’t include your address, you’d get pretty tired of answering the phone and saying, “No, you want the John Smith on Red Road. He’s 305-838-3333.”

So we would add the Address 1 field in there too.

  • Billing contact last name
  • Billing contact first name
  • Address 1
  • Phone number

Do we absolutely need the address in our index for every query? No, but we include it for convenience because when we DO need it, we need it bad. And if we DON’T need it, it doesn’t really hurt us much.

This is called a covering index because it covers other fields that are useful.

Adding the address field to our index does make it larger. A phone book without addresses would be a little thinner, and we could pack more on a page. We probably don’t want to include the Address 2 field, because the Address 1 field is enough to get what we need. The database administrator has to make judgement calls as to which fields to use on a covering index, and which ones to skip.

When building covering indexes, the covering fields go at the end of the index. Obviously, this index would suck:

  • Billing contact last name
  • Address 1
  • Billing contact first name
  • Phone number

We don’t want all of the Smiths ordered by their address, and then a jumbled mess of first names. That wouldn’t be as fast and easy to use. That’s why the covering fields go at the end, and the names go first – because we use those.

Selectivity: Why the Last Name Goes First

If you wanted to search for Brent Ozar in the phone book, you look in the O’s for Ozar first, and then you’ll find Ozar, Brent. This is more efficient than organizing the phone book by first name then last name because there are more unique last names than first names. There are probably more Brents in Miami than Ozars.

This is called selectivity. The last name field is more selective than the first name field because it has more unique values.

For lookup tables – meaning, when users need to look up a specific record – when you’ve narrowed down the list of fields that you’re going to use in an index, generally you put the most selective field first.

Indexes should almost never be set up with a non-selective field first, like Gender. Imagine a phone book organized by Gender, Last Name, First Name: it would only be useful when you wanted a complete list of all women in Miami. Not that that’s a bad thing – but no matter how much of a suave guy you think you are, you don’t really need ALL of the women in Miami. This is why non-selective indexes aren’t all that useful on lookup tables.

This rule is really important for lookup tables, but what if you aren’t trying to look up a single specific record? What if you’re interested in a range of records? Well, let’s look at…

The Yellow Pages: Another Index

When we need to find a dog groomer, we don’t want to go shuffling through the white pages looking for anything that sounds like a dog groomer. We want a list of organized by business category:

  • Business Category
  • Business Name
  • Address 1
  • Phone Number

Then we’ll look at the list of businesses, see which name sounds the coolest and which address is closest to ours, and we’ll call a few of them. We’ll work with several of the records.

Here, we’re searching for a range of records, not just a single one.

Notice that we didn’t put the most selective field first in the index. The field “Business Name” is more selective than “Business Category”. But we put Business Category first because we need to work with a range of records.

When you’re building indexes, you not only need to know what fields are important, but you have to know how the user is fetching records. If they need several records in a row next to each other, then it may be more helpful to arrange the records like that by carefully choosing the order of the fields in the index.

When in doubt, experiment. Create two or more indexes with the same fields, but in different orders. Then check your query execution plan to see which index gets used.

One Phone Book per City: Partitioned Tables

When we’re dealing with a lot of data, sometimes we partition the same data into multiple tables. The phone company separates its customer list into cities, for example, so that there are separate tables for Miami and for Fort Lauderdale.

When customers want to look up a phone number, they grab the Miami phone book. They don’t want to look in the Fort Lauderdale phone book for someone who lives in Miami.

But how do they know which book to use?

They look at the front of the book, where it says “Miami, Florida”. This is a constraint: this book is constrained to Miami numbers only.

Constraints Define Partitions

Constraints help us because it means we don’t have to put fields like State in our index. When our users are looking at a single entry in the phone book, they don’t have to wonder which State the entry is for. The book is constrained to State = ‘FL’.

Databases need constraints too. We might create a table called OrderHistory_FL, and put a State field in it. However, our poor little database doesn’t know that only FL orders are stored in this table, because the table name doesn’t mean anything. We have to specifically create a constraint on the state field.

When the database server builds query execution plans, it looks at the constraints on each table and uses that knowledge to help build a better execution plan. If it knows that it will only find states of ‘FL’ in a particular table, then it knows it doesn’t have to do any work when the query says WHERE STATE = ‘FL’. The database already knows that all records in that table match the constraint – just like if I tell you to find me all of the John Smiths in Miami, you don’t have to call each of the John Smiths in the phone book and ask them if they’re actually in Miami.

Previous Post
Quibbles with the Cingular 8125
Next Post
Why open source will triumph: one big community

8 Comments. Leave new

  • Hi Brent

    Great Blog. I like above that you point out times (with a covering index, and same can apply to a clustered index) when the most selective column does not necessarily need to be left most – very few articles on indexes comment on this. Love the admin stuff lots.

    Two quickies:
    1)
    * Primary Clustered Key means 1 write = a few reads
    * Primary Nonclustered Key means 1 write = 500,000 reads
    Either I misunderstand or this is incorrect. A primary key is enforced by an index (clustered or nonclustered). The engine will never have to scan the leaf level of the table so there should never be 500, 000 “reads”.

    2)
    You can make your covering indexes ever-so-slightly narrower by omitting the clustered index columns – these are in the leaf level of the index anyway (as a pointer to the data in the clustered index) so don’t need to be in the index nodes if all you are doing is returning the value.

    Cheers
    Dan

    Reply
  • This entire article should really be removed or corrected, as it’s false and highly misleading the way it is.


    •Primary Clustered Key means 1 write = a few reads
    •Primary Nonclustered Key means 1 write = 500,000 reads
    … There are ways to make primary nonclustered keys fast

    All primary keys will be supported by an index. Any pk, clustered or not, will by definition be unique, so it will *always* be fast for a single-value lookup, clustered or not.

    A phone book table should be clustered by ( LastName, FirstName ) [for technical reasons, you might add your own “uniquifier”, but that’s not needed in an intro article), because a phone book is searched by LastName and, optionally, FirstName.

    Physical phone books are also “clustered” by (LastName, FirstName) for the same reason: because that’s how people look up phone numbers.

    Reply
  • Hi Brent! Love this article and your site, webcasts, presentations… everything! lol

    Quick note, I found a typo where you say “But we put Business Name first because we need to work with a range of records.” I think you meant Business Category 🙂

    That’s all! Thanks, this was the clearest article on clustered vs. non-clustered indexes I’ve read!

    Reply
  • Thanks for the write-up. I’m new to database development and was confused about indexing and constraints and just the concept of it. Your phone book analogy helps to get a basic concept for my mind to wrap around. I’m sure there’s more to be understood, but at least I understand terms and the basics. Appreciate it!

    Reply
  • Steven Johnson
    April 22, 2019 4:34 pm

    Brent, one of the best non-high level description of indexes! Great job

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Menu
{"cart_token":"","hash":"","cart_data":""}