How to Think Like the SQL Server Engine: Included Columns Aren’t Free.

In our last cliffhanger episode, I said that if we ran this query:

And we had this index:

Then we would have to do all of these things:

  1. Look up user #643 on the clustered index by doing a seek – because thankfully our kind developer included the clustered primary key in the where clause
  2. (If the relevant clustered index pages for this user aren’t in memory, fetch them from disk)
  3. Get user #643’s LastAccessDate, Id, DisplayName, and Age so that we can look them up quickly on the gray index
  4. Look up that LastAccessDate, Id, DisplayName, and Age in the IX_LastAccessDate_Id_DisplayName_Age
  5. (If the relevant nonclustered index pages aren’t in memory, fetch them from disk)
  6. Lock these rows that need to be updated
  7. Write to the transaction log file indicating what we’re going to change
  8. Change the data page(s) for the clustered index
  9. Change the data page(s) for the nonclustered index, including moving a row around to different pages if necessary (like if the key values changed enough to merit being on an entirely new page)
  10. Later – asynchronously – write the changed data pages to disk

So what parts can we skip if we just include DisplayName and Age?

Are included columns somehow lighter weight? Let’s say we have this index:

Let’s examine the work required with the old index:

  1. Look up user #643 on the clustered index by doing a seek – because thankfully our kind developer included the clustered primary key in the where clause – we still have to do this
  2. (If the relevant clustered index pages for this user aren’t in memory, fetch them from disk) – we still have to do this
  3. Get user #643’s LastAccessDate, Id, DisplayName, and Age so that we can look them up quickly on the gray index – we still have to do this, but we only need LastAccessDate and Id – not a real reduction in work here
  4. Look up that LastAccessDate, Id, DisplayName, and Age in the IX_LastAccessDate_Id_DisplayName_Age – we still have to do this, it’s just that we’re looking it up in IX_LastAccessDate_Id_Includes
  5. (If the relevant nonclustered index pages aren’t in memory, fetch them from disk) – we still have to do this
  6. Lock these rows that need to be updated – we still have to do this
  7. Write to the transaction log file indicating what we’re going to change – we still have to do this
  8. Change the data page(s) for the clustered index – we still have to do this
  9. Change the data page(s) for the nonclustered index, including moving a row around to different pages if necessary (like if the key values changed enough to merit being on an entirely new page) – we still have to do this – it’s just that we don’t have to move the row around on the page…or do we?
  10. Later – asynchronously – write the changed data pages to disk – we still have to do this

Even before I get tricky with the answer, you can see that we still have to do every single one of those steps! (And just to be clear, there are even more steps than this in dealing with updates, but these are by far the biggest ones, and I’m concentrating on the bulk of the workload here, not tiny edge case differences – let’s focus on the big picture for now.)

The only possible reduction in these 10 steps is a slight tweak to #9, the ability to skip moving rows around between pages.

Well, we still save a little in step 9, right?

Keep thinking, dear reader, by examining the index page involved. User #643, Sarcastic, is at the top right of the page:

If you update Sarcastic to be 31 years old instead of 32, does he move on the page?

If you updated his name to be Optimistic instead of Sarcastic, do you have to move him around?

Even if you somehow changed his Id, do you have to move him then?

As long as the first column or two (or both) are fairly unique, the sorting on the third, fourth, fifth, etc keys on an index don’t matter nearly as much as the first column or two in the keys. This is one of the (many) reasons why folks often suggest that the first and second keys on an index should be very selective: they should help you quickly narrow down your search space. If that first key or two changes, then sure, we’re going to need to move a row around – but for subsequent keys, not quite so much, assuming that you did a good job of picking your first and second keys. We dig into picking index key column order in way more detail in my Fundamentals of Index Tuning class.

There are absolutely cases in this big, wide, wonderful world where it makes a stunning difference whether the 3rd and 4th columns are keys as opposed to included columns. However, they’re probably not the cases that you’re facing. In your environment, out of the 10 steps involved with updating a field, the big question isn’t, “Should this column be in the key or the includes?”

The big question is, “Should this column even be in the index at all?”

In a perfect world, we don’t index very hot data: data that changes constantly, forcing us to take locks on not just the clustered index, but all of the nonclustered indexes that contain that data, too. The more columns you include – especially hot columns – the more blocking problems you’re going to run into, and the slower your storage becomes (because you’re doing so many writes per update.)

In that perfect world, we probably wouldn’t index LastAccessDate, for example – a value that changes every single time a user logs in or visits a page. After a few days of users logging in, imagine what this 8KB page above is going to look like – and we’ll talk about that soon, too.

The point to take away:
design the right leaf pages first.

When you’re designing an index, start by asking:

  1. “Does this index enable me to seek to just the rows I need, and only read those, and no extras?”
    • “If not, what are the extra reads, and am I okay with that?”
  2. “Does this index give me all the columns I need?”
    • “If not, what are the additional key lookups required, and am I okay with that?”

Don’t get hung up on keys versus includes before you answer those first four questions. Once you’ve answered those, then you can start obsessing about keys vs includes – but so often, I see people obsessing over what to put in the includes, thinking that it’s somehow going to be noticeably cheaper for performance purposes. Most of the time, it ain’t – but even worse, they haven’t designed the index to answer those four questions above to begin with.

And I get it – it’s hard, especially when the missing index recommendations seem to be working against you. We’ll talk about that later this week.

Previous Post
Contest: Guess the SQL Server 2022 Release Date
Next Post
Announcing a New Live Online Class: How I Use the First Responder Kit

3 Comments. Leave new

  • This is super annoying. I never questioned the guidance before that includes were less expensive and have wasted many hours over the last few years trying going into the blitzcache records to figure out what can be included and what has to be keyed. It seems like the only practical use is to be able to optimize update statements when you commonly update or read more than 32 columns in one query, but don’t do that on the clustered index for some reason. To me this sounds like it would normally only come as a result of a major failure in table and application design. Index tuning and implementation very well may be the least of your problems.

    Reply
  • In my understanding Keys are Key :-). But more than that, only a person who know the system behave can improve over “missing index recommendation gotchas….” , or to count with people like you guys than can lead us to a success index strategy. Asking the right questions.

    Reply
  • Ehsan Pourtorab
    November 11, 2019 7:21 pm

    I see that you’ve addressed all controversies on the last post and that’s amazing, thanks. However, I’ll still add non-key columns to INCLUDE, but I won’t overthink about it anymore.
    Thank you.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.