Unique Indexes and Row Modifications: Weird

Confession time

This started off with me reading a blurb in the release notes about SQL Server 2016 CTP 3.3. The blurb in question is about statistics. They’re so cool! Do they get fragmented? NO! Stop trying to defragment them, you little monkey.

Autostats improvements in CTP 3.3
Previously, statistics were automatically recalculated when the change exceeded a fixed threshold. As of CTP 3.3, we have refined the algorithm such that it is no longer a fixed threshold, but in general will be more aggressive in triggering statistics scans, resulting in more accurate query plans.

I got unnaturally excited about this, because it sounds like the behavior of Trace Flag 2371. Anyone who has taken a bite out of a terabyte database probably knows about this one. Ever try waiting for statistics to automatically update on a billion row table? You’re gonna need a crate of Snickers bars. I’m still going to write about the 2016 stuff, but I caught something weird when I was working on a way to demonstrate those thresholds. And that something was how SQL tracks modifications to unique indexes. It freaked me out for, like, days.

We’re gonna need a couple tables

But they’ll be slightly different. It’s the only way to really show you how weird it gets inside SQL’s head.

Table 1 has a clustered PK on the ID column. It has a non-unique, nonclustered index on DateFiller and TextFiller.

Table 2 has the same structure, but the clustered PK is on ID and DateFiller. Same nonclustered index, though.
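
The original create scripts aren't reproduced here, but a minimal sketch of the structure described above might look like this. The Nuisance and Nuisance2 names come from later in the post; the column types and constraint names are my assumptions.

    -- Table 1: clustered PK on ID alone
    CREATE TABLE dbo.Nuisance
    (
        ID BIGINT NOT NULL,
        DateFiller DATETIME NOT NULL,
        TextFiller VARCHAR(100) NOT NULL,
        CONSTRAINT PK_Nuisance PRIMARY KEY CLUSTERED (ID)
    );

    CREATE NONCLUSTERED INDEX IX_Nuisance
        ON dbo.Nuisance (DateFiller, TextFiller);

    -- Table 2: clustered PK on ID, DateFiller
    CREATE TABLE dbo.Nuisance2
    (
        ID BIGINT NOT NULL,
        DateFiller DATETIME NOT NULL,
        TextFiller VARCHAR(100) NOT NULL,
        CONSTRAINT PK_Nuisance2 PRIMARY KEY CLUSTERED (ID, DateFiller)
    );

    CREATE NONCLUSTERED INDEX IX_Nuisance2
        ON dbo.Nuisance2 (DateFiller, TextFiller);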

All this code works, I swear. Let’s drop a million rows into each.
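
A load along these lines would do it. The five distinct DateFiller values are an inference from the modification counts discussed later in the post; everything else is just one way to stamp out a million rows.

    -- Stamp out 1,000,000 rows. The modulo keeps DateFiller down to
    -- exactly five distinct values, which matters later on.
    WITH Numbers AS
    (
        SELECT TOP (1000000)
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
        FROM sys.messages AS m1
        CROSS JOIN sys.messages AS m2
    )
    INSERT dbo.Nuisance (ID, DateFiller, TextFiller)
    SELECT n,
           DATEADD(DAY, n % 5, '20160101'),
           REPLICATE('A', 50)
    FROM Numbers;

    -- Same data, second table
    INSERT dbo.Nuisance2 (ID, DateFiller, TextFiller)
    SELECT ID, DateFiller, TextFiller
    FROM dbo.Nuisance;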

Now let’s take a basic look at what’s going on in our indexes and statistics. We just created tables! And inserted a million rows! Each! That has to count for something, right? Here’s a query to check that kind of thing.
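
The original query isn't reproduced here; a rough equivalent that compares the row counts the storage engine sees against what the statistics objects know, via sys.dm_db_stats_properties, looks something like this:

    -- One row per index, with the index's row count next to what its
    -- statistics object has actually sampled.
    SELECT t.name AS table_name,
           i.name AS index_name,
           p.rows AS rows_in_index,
           sp.rows AS rows_in_stats,
           sp.rows_sampled,
           sp.modification_counter,
           sp.last_updated
    FROM sys.tables AS t
    JOIN sys.indexes AS i
        ON i.object_id = t.object_id
    JOIN sys.partitions AS p
        ON p.object_id = i.object_id
       AND p.index_id = i.index_id
    JOIN sys.stats AS s
        ON s.object_id = i.object_id
       AND s.stats_id = i.index_id
    OUTER APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE t.name IN (N'Nuisance', N'Nuisance2')
    ORDER BY table_name, i.index_id;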

Holy heck, why don’t we have any statistics? The indexes tracked our million modifications from the insert, but the statistics aren’t showing us anything. They’re all NULL! Right now, SQL has no idea what’s going on in here.

Empty inside

At least, until it has to. If we ran a query with a WHERE clause, an initial statistics update would fire off. Hooray. SQL is lazy. We can skip all that fuss and just update manually. I want a FULLSCAN! No fullscan, no peace. Or something.
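
Two statements and we’re done:

    UPDATE STATISTICS dbo.Nuisance WITH FULLSCAN;
    UPDATE STATISTICS dbo.Nuisance2 WITH FULLSCAN;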

If we go back to our DMV query, the stats columns will at least not be NULL now. It will show 1,000,000 rows sampled, and no modifications, and the last stats update column will have a date in it. Wonderful. You don’t need a picture of that. Conceptualize. Channel your inner artist.

Weir it all gets whered

Let’s think back to our indexes.

  • Nuisance has the clustered PK on ID
  • Nuisance2 has the clustered PK on ID, DateFiller
  • They both have non-unique nonclustered indexes on DateFiller, TextFiller

One may posit, then, that they could let their workloads run wild and free, and that SQL would dutifully track modifications and trigger automatic updates when necessary. This is being run on 2014, so we don’t expect the dynamic threshold stuff. The rule that applies to us here, since our tables have more than 500 rows, is that if 20% of the table + 500 rows changes, SQL will consider the statistics stale and trigger an update the next time a query runs against the table and uses those statistics.
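
For our million-row tables, that math works out to (20% × 1,000,000) + 500 = 200,500 modifications before SQL considers the statistics stale.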

But, but, but! It does not treat all modifications equally. Let’s look at some examples, and then buckle in for the explanation. No TL;DR here. You must all suffer as I have suffered.

We’ll start with an update of the nonclustered index on Nuisance.

We use @@ROWCOUNT to verify the number of rows that were updated in the query. Got it? Good. It should show you that 100,000 rows were harmed during the filming of that query. Poor rows.
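
The exact statement is an assumption on my part, but based on the plan described below (date math, a REPLACE, and predicates in the WHERE clause), it was something in this spirit:

    -- Touch both nonclustered key columns for 100k rows.
    -- The expressions and predicate here are guesses at the original.
    UPDATE dbo.Nuisance
    SET DateFiller = DATEADD(DAY, 1, DateFiller),
        TextFiller = REPLACE(TextFiller, 'A', 'B')
    WHERE ID <= 100000;

    SELECT @@ROWCOUNT AS rows_updated; -- expect 100000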

Here’s the execution plan for it. Since we don’t have a kajillion indexes on the table, we get a narrow plan. There are some Compute Scalars to handle the date math, the REPLACE, and the predicates in our WHERE clause. It’s all in the book. You should get the book.

ACTUAL EXE-CUTIE-PIE

At this point, if you run the DMV query, you should see 100,000 modifications to the nonclustered index on Nuisance. Not enough to trigger an update, but we don’t care about that in this post. It makes sense though, right? We updated 100k rows, SQL tracked 100k modifications.

What if we run the same update on Nuisance2? We still only update 100k rows, but our execution plan changes a little bit…

Split! Sort! Collapse! Fear! Fire! Foes!

And now we have TWO HUNDRED THOUSAND MODIFICATIONS?

What in the wide world of sports?

This is how SQL handles updates on columns with unique constraints, which we’ll get to. But let’s look at a couple other updates first!

If we go back and update just the ID column of Nuisance, something really cool happens.
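
Again, this is a sketch rather than the original statement; shifting every ID up by one is what produces the behavior described next:

    -- 1 through 1,000,000 becomes 2 through 1,000,001
    UPDATE dbo.Nuisance
    SET ID = ID + 1;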

Two is the loneliest number

It only took two modifications to update one million rows in the clustered index. We still had to update all million rows of the nonclustered index (+1, I’m guessing, to insert the new row for ID 1,000,001).

That’s because, if you’ve been paying attention, nonclustered indexes carry all the key columns of your clustered index. We updated the clustered index, so we had to update our nonclustered index. If we had multiple nonclustered indexes, we’d have to update them all. This is why many sane and rational people will tell you not to pick columns you’re going to update as your clustered index key.

If you’re still looking at execution plans, you’ll see the split/sort/collapse operators going into the clustered index again, but only split and sort going into the nonclustered index update.

Oh, yeah. That update.

If we run the same update on Nuisance2, and check back in on the DMVs, it took a million modifications (+5 this time; due to the data distribution, there are net 5 new rows, since there are exactly five unique values in DateFiller). But at least it didn’t take 2 million modifications to update it, right?

I still can’t do math.

Bring it on home

Why are there such big differences in the modification counts?

For the update to the ID column of Nuisance, it only took two modifications. This is because of the split/sort/collapse operations.

Split takes the update, and, as the name implies, splits it into inserts and deletes. If you think about what it would look like to change 1 through 1,000,000 to 2 through 1,000,001, it really is only two modifications:

  1. Delete row 1
  2. Insert row 1,000,001

All the other numbers in the range already exist, in order. That’s what the Sort does, basically: it orders the values, along with whether each one needs an insert or a delete. The final operation, Collapse, removes the duplicate actions; you don’t need to delete and re-insert every number that’s already in place.

Unfortunately, for Nuisance2, it results in doubling the modifications required. This is true for the clustered index update, where DateFiller is the second column, and the nonclustered index update, where DateFiller is the leading column.

It doesn’t appear to be the data distribution or the data type of the column that causes the doubled work. As things stand in this demo, there are only five unique values in DateFiller. I tried it with all-unique values, and I tried DATE and BIGINT for the data type, but in every scenario SQL tracked 2x the number of modifications to each index.

Takeaways

I’m all for unique indexes! I’m even okay with two column PK/clustered indexes. But be really careful when assigning constraints, and make sure you test your workload against them. While they may obviously help read queries, there’s some cost to maintaining them when modifying data.

What I didn’t mention this whole time, because I didn’t want it to get in the way up there, was how long each update query took. So I’ll leave you with the statistics time and IO results for each one.
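
If you want to reproduce those numbers yourself, wrap each update like so:

    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;

    -- run one of the updates here

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;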

Thanks for reading!

Brent says: go back and read this again, because you didn’t digest it the first time. Plus, trust me, the time it takes you to read is nowhere near what it took for Erik to get to the root cause on this. (We saw the play-by-play unfold in the company chat room.)


3 Comments

  • Paul Connors
    March 11, 2016 3:15 am

    Apologies in advance if the Blackadder reference is lost on you… but…. Darling, That was a great article!

  • That was awesome, but…. Huh?!

    /me goes to get more caffeine and goes back to re-read

  • George Palacios
    March 14, 2016 10:34 am

    Holy crap that’s clever. Is this an artifact of how the initial data input spread the data out though?

    What happens if we declare the ID column using BIGINT IDENTITY (1,2) instead, and then add 1 to the column?

    I’m going to play around with this, but damn I never thought the engine would be that clever.

