Upsizing the SQL ConstantCare® Database Servers

If you’re looking for SQL Server news, just skip this post. I’m going to talk about general business & capacity planning type stuff here today.

Our SQL ConstantCare® customers install an app that polls diagnostic data from their SQL Servers and Azure SQL Databases, upload it to us daily, and then automatically get advice on what they should do next.

SQL ConstantCareWe use Amazon Aurora PostgreSQL as the back end, and I wrote about that design decision last year. I followed up later with a discussion of our hosting costs. Throughout 2018 and most of 2019, the production servers were:

  • Two r4.xlarges in a cluster (for high availability)
  • 2 physical cores, 4 with hyperthreading on
  • 30GB RAM

That served us really well, except for the times when Richie needed to vacuum out deleted rows. PostgreSQL keeps deleted rows inside the user database (just like SQL Server 2019’s new Accelerated Database Recovery), and the automatic vacuuming processes can’t always keep up, especially with app designs like ours. We only keep a rolling 30 days of data, and we roll through billions of rows per month in some of the tables, so there’s a lot of deleting going on.

When Richie needs to do big vacuums manually, he spins up the biggest RDS instance available: the r5.24xlarge with 96 cores, 768GB RAM, and 14,000 Mbps storage throughput. (One of the slick advantages of Aurora’s architecture is that since the compute and storage are completely separate, you can amp up to insane performance quickly without waiting for storage to catch up.) At only $14 per hour, that’s worth it to make maintenance tasks fly temporarily.

However, our workloads are growing.

We’ve been adding new features that gather more data and run more queries against it, like our index recommendations and multi-server emails. I found it incredibly hilarious that our most resource-intensive Postgres query by far was the query that analyzed your indexes and told you what to add. And you know what that query needed? Indexes. Of course.

By October, our reserved instances had expired. With Amazon, you can reserve your database server capacity for 1-3 years, even prepay in advance if you want, and get deeper discounts. In the hosting cost discussion post, I talked about how I’d decided to reserve a pair of db.r4.xlarge instances for production for 1 year – but that reservation was up. I didn’t want to make a decision back then, though, because…

We’ve been adding users and servers: the population had been fairly stable through 2018/2019 because I didn’t promote SQL ConstantCare® too hard, and like any SaaS product, we had a churn of subscribers. New people would sign up, and older users would let their subscription lapse once they’d taken care of the basic issues on their servers. However, with the annual Black Friday sale, I knew a big population jump would come – I just didn’t know if it would be big or fast.

So we decided to wait out through November to see how the sales & installations went. Now, in early December, we’re approaching 500 users sending in data on almost 4,000 servers. I expect those numbers to grow over the coming year as more Black Friday sale buyers start to leverage their newly purchased goodies. Over 1,000 folks have active subscriptions that include SQL ConstantCare as a benefit, and we’ve seen that folks tend to monitor just a server or two at first, then gradually add more of their servers over time as they see the value in the service. I wouldn’t be surprised if we were analyzing over 10,000 servers by this time next year.

So what should we use next?

Richie and I talked through our options:

Switch to Azure SQL DB or Azure PostgreSQL: I’m only mentioning this because I know you’re gonna suggest it in the comments, but no, this didn’t make any financial sense. Changing back ends or cloud providers would have been an expense with no appreciable benefit to end users.

Switch to Aurora Serverless: it’s still PostgreSQL-compatible, so no code changes required, but the Serverless version can automatically scale up and down very quickly. This would be really cool for a business like ours where the database server is idle for hours at a time, then gets hammered when lots of users send in data on a timer:

Database server CPU

Database server CPU%

That graph is from our new upsized primary – it used to be banging up against 100% for hours at a time during peak loads – but you get the idea. Serverless seems like a good idea here, but the Aurora Serverless limitations had just enough gotchas that we couldn’t quite migrate production to it yet. (Once a couple of those limitations are gone, though, we might be able to cost-effectively deploy SQL ConstantCare® in the EU. Fingers crossed.)

Stay with on-demand (not reserved) instances: this didn’t make sense because the savings are so large on reserved instances. You can easily save 35%-45% if you’re willing to lock in rates for a while:

Buy reservations for larger instances: instead of the r4.xlarge (4 cores, 30GB RAM), we could commit to larger instance sizes:

  • r5.xlarge: 4 cores, 32GB RAM = $5,540/year for 2 reserved instances (current state)
  • r5.2xlarge: 8 cores, 64GB RAM = $11,082/year
  • r5.4xlarge: 16 cores, 128GB RAM = $22,162/year
  • r5.12xlarge: 48 cores, 384GB RAM = $66,486/year

Since there’s nothing in between the 4xlarge and 12xlarge, that made my decision process pretty easy: we went for a pair of 4xlarge reservations for the next year. It lets Richie focus on building features rather than troubleshooting Postgres performance problems that neither of us are particularly good at, and it also let me add more queries into my Power BI dashboard that I use for analyzing client data quickly. That even drove improvements to the way I do my consulting process, and I’ll be updating that product early next year.

The database is still by far, far and away the most expensive thing in our entire infrastructure. It’s not even close: serverless apps are just so, so affordable to host compared to fixed assets like database servers. I love seeing the same problem that my clients see: the decision about when it makes more sense to hire a performance tuner versus just throw hardware at it. SQL ConstantCare®’s asynchronous queue-based serverless architecture definitely helps keep database costs low, but there’s going to come a time in the next couple/few years when we’ll have to bring in a serious Postgres performance tuner to help push our bill back down. (And no, I’m not gonna be the person who does it!)

What Should We Change About This Year’s Data Professional Salary Survey?

Every year, we run a Data Professional Salary Survey here. Thousands of y’all have answered – here are the past results in Excel – and it helps everybody get a better idea of how overpaid they are. (Did I word that right?)

Here are the questions I’m planning to ask this year:

  • What’s your total salary in US dollars, annual before taxes?
  • Your country
  • (Optional) Postal/zip code
  • Primary database you work with
  • Years that you’ve worked with this database
  • Other databases you worked with in 2019
  • Job type: (FTE, FTE of consulting/contracting company, independent, part time)
  • Job title
  • (Optional) Other job duties
  • Do you manage other staff
  • Years of doing this job
  • At how many companies have you held this job?
  • How many other people on your team do the same job as you?
  • How many database servers does your team work with?
  • What’s the newest version of SQL Server you have in production?
  • What’s the oldest version of SQL Server you have in production?
  • Highest level of higher eduction completed
  • If you have a college degree, is it computer-related?
  • Do you hold industry certifications?
  • How many days per week do you work from home?
  • What is the population of the largest city within 20 miles of where you work?
  • Employer sector (private, government, non-profit, etc.)
  • What are your career plans for the year 2020?
  • To which gender do you most identify?

If there’s any changes you want to make, leave a comment and we’ll talk through it.

Updated First Responder Kit and Consultant Toolkit for December 2019

It’s the hap-happiest season of all. To get the new version:

Consultant Toolkit Changes

When you have to troubleshoot deadlocks, wouldn’t it be nice to get the deadlock graphs all saved off nicely in separate XDL files that you can open with Plan Explorer? Good news! For SQL Server 2012 & newer, The Excel file now has a new Deadlock Summary and Deadlock Detail tabs that show the output of sp_BlitzLock, plus the deadlock graphs are written to separate xdl files. (This does mean there is duplication in the output: there’s a copy of the deadlock graph saved for every query in the deadlock. I can live with that for a first version, though.) Here’s what the output folder looks like:

Consultant Toolkit output folder

Other changes this month:

  • Improvement: on the HA-DR tab, the list of AG replicas is now sorted by AG name, primary first, then the sync replicas first, then by replica name.
  • Plus includes the below updates from the First Responder Kit.

sp_BlitzCache Changes

  • Improvement: @BringThePain = 1 no longer forcibly sets @Top to the max. Instead, it’s used as a confirmation switch as in the other First Responder Kit procs. For example, if you set @SortOrder = ‘all’ and @Top > 10, and you didn’t set @SkipAnalysis = 1, then we require you to set @BringThePain = 1 to make sure you understand that processing will be slow. (#2172)

sp_BlitzFirst Changes

sp_BlitzIndex Changes

  • Improvement: when you run it for table-level details, like sp_BlitzIndex @TableName = ‘Users’, the first result set with the list of existing indexes has a new last column with the drop T-SQL. Useful for rapidly building undo/redo scripts. (#2177)
  • Fix: when saving the output to table, index definitions longer than 4,000 characters were truncated. Changed output table definition to NVARCHAR(MAX) for the Drop_tsql, Create_tsql, and index_definition columns, and cried just a little. (#2183, thanks Helio Lino de Almeida Filho for the bug report.)

SqlServerVersions Changes

  • Improvement: added SQL Server 2019, updated a few versions. (#2181, thanks sm8680 for the bug report and Emanuele Meazzo for the fix.)

sp_DatabaseRestore Changes Awaiting Review

There are two pull requests with improvements, but I’ll be honest: I’m not really qualified to do code review on these because I don’t personally use sp_DatabaseRestore. (I just don’t have a use for it myself since most of my work focuses on performance tuning.) If you can give these authors a hand by testing it and making notes in the Github issues, we can get these merged into the main branch:

For Support

When you have questions about how the tools work, talk with the community in the #FirstResponderKit Slack channel. If you need a free invite, hit Be patient – it’s staffed with volunteers who have day jobs.

When you find a bug or want something changed, read the file.

When you have a question about what the scripts found, first make sure you read the “More Details” URL for any warning you find. We put a lot of work into documentation, and we wouldn’t want someone to yell at you to go read the fine manual. After that, when you’ve still got questions about how something works in SQL Server, post a question at and the community (that includes us!) will help. Include exact errors and any applicable screenshots, your SQL Server version number (including the build #), and the version of the tool you’re working with.

SQL ConstantCare® Population Report: Fall 2019

Our SQL ConstantCare® service lets users send data in about their SQL Servers and get health & performance advice via email. Users can opt in to share their anonymized data to get better recommendations overall.

With 3,568 database servers sharing data as of November 28, there’s a lot of useful information in here for the community. Understanding what servers are out there will help you put context around where your own servers rank, and helps the community better build training material that’s relevant for most users. Let’s start with a simple one:

What versions are the most popular?


Or if you’re into percentages:

A few things jump out at me right away:

  • SQL Server 2008 has more adoption than Azure SQL DB and Azure Managed Instances combined.
  • SQL Server 2019 has only 1 production instance sharing data in the population.
  • Almost 50% of the population are already on SQL Server 2016 & 2017. (That’s better than I expected!)
  • About 1 in 10 SQL Servers are no longer under support (and don’t even get me started on patching levels.)

Now, I’ll be the first to tell you that SQL ConstantCare® users simply aren’t an accurate representation of the population overall. Quest Spotlight Cloud’s metrics as of 2019/11/28 look a little different: they have 4% of the population on SQL Server 2008, but only 1% of the population on SQL Server 2017. My guess is that that page isn’t kept up to date.

I’m looking forward to trending this data over time to see the rate at which people adopt newer versions of SQL Server. I’m going to publish these quarterly. In the next couple/few, I’ll stick with this same presentation format, but once we’ve got some timeline-friendly data, I’ll look at more interesting ways to present it.

I’m not sure if people who use SQL ConstantCare® are more or less likely to jump to newer versions faster. I’d like to think that you, dear reader, are more likely to be on the cutting edge. However, I might have a couple of different kinds of readers: one conservative group that likes mainstream versions and monitoring tools, and another group that likes cutting edge stuff and doesn’t want SQL ConstantCare®.

But what about development? Are people maybe testing their apps on newer versions?

What versions are people using in development?

It’s interesting how different this mix is than the last chart! 50% of all Development Edition servers are running SQL Server 2016, and 11% are on 2017.

Twice as many people are using SQL Server 2019 in development as there are in production. (I’ll leave the math to you, dear reader.)

For the rest of the census report, I’m going to focus on just the production servers.

How much data do servers have?

I think this chart really helps to set the rest of them in perspective:

A lot of the SQL ConstantCare® population have what I would consider to be a fairly small server: 30% of them are hosting <25 GB data.

About 15% have 1TB of data or more, but even though that number sounds low, the population is still large. That still gives us about 500 servers’ worth of metadata to help guide folks who are in those larger tiers. There are a dozen instances in the population with over 10TB – good times there.

How much hardware do production servers get?

This really helps to reset folks’ expectations around what’s normal for SQL Servers out there. There are a huge, huge amount of production SQL Servers that just don’t need a lot of CPU power. Almost 70% have 1-8 cores, while only 10% of servers have more than 24 CPU cores.

Given that almost 50% of the population have 1-4 cores, that isn’t too surprising. But what if we exclude the folks using 1-4 cores?

Okay, now that’s much more like it: 68% of that population is using 64GB of RAM or more. Good to see folks investing in the resource that has an amazing impact on performance, especially given the SQL Server licensing costs involved.

What questions would you ask the data?

If there’s something you’d like to see analyzed, leave a note in the comments. To make it more likely that I’ll actually run your experiment, include why you want to know it, and even better, your guess as to what the data’s going to say. (That’ll make it a more fun game to play, heh.) I’ll take my favorites and do a followup post with those answers.

SQL Server Doesn’t Encrypt Data By Default.

Create a table and insert a couple of rows:

Then back up the database without using compression, and open up the backup file with a hex editor:

The same trick works on the data file, too.

That’s it.

That’s all. That’s the post.

Look, not every post is groundbreaking: sometimes it’s just about writing down something that I had to explain to a client, and I’m pretty sure I’m going to have to explain it again. PII is like a sexually transmitted infection: it gets around in ways that you’re just not prepared to face.

How to Use CrystalDiskMark 7 to Test Your SQL Server’s Storage

I bet you wanna know whether your storage is hot or not, and there’s no quicker, easier way to get a rough idea of your storage’s capabilities than to fire up CrystalDiskMark. This month, CrystalDiskMark released an all-new version 7, and it makes for even better testing.

Start by downloading it and installing it – sadly, there’s no zip file version anymore that you can stick on a network share and run remotely.

After it launches, click the Profile dropdown at the top:

The Peak Performance + Mix setting runs a pretty cool mix of tests that will push your storage hard. Note that I don’t try to get CDM to replicate exactly how SQL Server does IO: I’m just trying to get a quick 5-minute idea of whether my storage is hot or not.

Then, across the top there will be 4 dropdowns:

  1. 1 = the number of tests you want to run. If you only run one test, it’ll finish quickly, but it’ll be vulnerable to other activity running on the server at that moment. If you’re testing an already-in-production server, just do 1 test, but if you’re testing the bejeezus out of one before going live, use more tests.
  2. 1GiB = the test file size. A small 1GiB file will finish the test quickly, but if your storage has a caching component, you might get deceivingly fast results. I like using a 64GiB file if I’ve got the luxury of more time for testing, but it really will write out 64GiB of data, so make yourself some coffee.
  3. S: = the drive letter or folder you want to test. You wanna test where the data files, log files, and TempDB live in order to get a good idea of how everything performs.
  4. R70%/W30% = for the mixed test, the amount of reads vs writes happening simultaneously.

If you want to more closely replicate how your own server does IO, use the mix dropdown to pick a ratio of reads vs writes that mirrors your own server’s workload, and use the Queues and Threads window to pick the amount and sizes of IO. I cover those in my Mastering Server Tuning class, but if you haven’t been through that, you can just stick with the defaults for now.

Click the All button, and all of the tests will run.

When it finishes, click on the text entry box at the bottom of CrystalDiskMark, and you can type in some descriptive info about your storage, then hit enter. This is useful for when you take screenshots like these.

Here’s local ephemeral NVMe storage in an AWS instance:

And here’s general purpose (GP2) SSD – note that throughput on this goes up as drive size goes up, and this drive is only 500GB, so it’s pretty cripplingly slow:

When you’re trying to figure out whether your SQL Server’s storage is any good, just compare it to your laptop:

This is especially important with cloud servers: so often, you can sink a ton of monthly fees into storage that performs far, far worse than a $50 USB3 thumb drive. (I’m certainly not saying you shouldn’t go to the cloud – I’m just saying you want to know the strengths and challenges of the infrastructure life dealt ya so you can understand why query tuning and index tuning pay off so well in the cloud.)

New Multi-Server Emails for SQL ConstantCare®

SQL ConstantCareWith our SQL ConstantCare® service, you install a service that polls your SQL Servers once a day, and then we send you a daily email telling you the most important tasks to perform on that server to make it faster and more reliable.

We’re all about minimizing the amount of emails you get from us – I simply don’t wanna bombard your email inbox. I don’t wanna have the kind of service where you end up setting up an Outlook rule just to filter all the monitoring emails to go to a folder – because we all know how that story ends.

So now, we can cut down those emails even further. If you’ve got multiple servers, we can send you just one email per day with the advice for all of your servers in a single email.

To opt into the new consolidated emails, fill out the download form, and choose one combined email:

I honestly don’t expect everybody to choose this (and it’s not the default) because I’ve been surprised by how many folks have the SQL ConstantCare® emails set up to automatically create help desk tickets for them. See, some companies have email addresses that will automatically create a ticket whenever it receives an email. That way, they can assign new SQL ConstantCare® alerts and advice to specific people, and track the progress they’re making on solving that particular issue. Bravo, them!

Wanna save 50% off SQL ConstantCare®?
It’s on sale until Friday!

Our annual Black Friday sale is on now through Friday. Here’s how you can save big on SQL ConstantCare®:

  • SQL ConstantCare® by itself: $495 just $247.50
  • Level 1 Bundle: $1,385 just $295. Includes SQL ConstantCare®, plus the Recorded Class Season Pass and the Consultant Toolkit
  • Level 2 Bundle: $5,380 just $995. Includes all of the above, plus a Live Class Season Pass where you build your own VM to follow along. Attend all of my online training classes for an entire year, including the new weekend versions!
  • Level 3 Bundle: $7,380 just $1,995. Includes all of the above, plus I give you a VM in the cloud during training classes so you don’t have to lift a finger. Well, until the labs start, and then you have to lift a whole lot of fingers to keep up!

How to Think Like the Engine: When a Seek Isn’t

In our last episode, I introduced the concept of scan predicates: execution plan operations that weren’t able to seek directly to the rows they needed. Let’s take another query:

If we ONLY have the gray pages index on LastAccessDate, Id, DisplayName, and Age, our query plan looks like this:

I’m going to narrate this from bottom up because it makes for easier storytelling:

  1. Seek Predicates: SQL Server was able to seek on this index to a specific date/time. It jumped to the first user whose LastAccessDate = 1800/01/01. That’s, uh, not going to eliminate a lot of users because after all, Stack Overflow wasn’t around in the days of covered wagons. Now, SQL Server has “seeked” to that user – but now the real work begins.
  2. Predicate: now SQL Server examines every row to check its DisplayName, looking for people with DisplayName = ‘Brent Ozar’.
  3. Number of Rows Read: 299,398 – because we had to examine every single row in the table to find Brent.

You see an Index Seek here, but it is not a lightweight operation. We read every single stinkin’ row of the table.

“Then why does SQL Server call it a seek?”

When SQL Server builds execution plans, it’s going on all the information it knows at the time – which isn’t necessarily much, and can also be out of date. We talked about how SQL Server uses statistics to build plans, but statistics aren’t updated in real time.

Sure, the last time SQL Server updated statistics on this table, it knew that nobody had a LastAccessDate prior to 1800/01/01 – but it can’t guarantee that the data distribution hasn’t changed since the last time statistics were updated. Hell, it even gives you the ability to disable statistics updates, so the stats might be WAY out of date.

Granted, you know that based on the column name “LastAccessDate,” that nobody is going to access in the past. The data just doesn’t work that way. But SQL Server doesn’t know that unless you tell it – and that’s where things like constraints come in, telling SQL Server business rules about your data.

SQL Server has to just say, “Well, I know that I have an index that starts with LastAccessDate, and they’re looking for a specific LastAccessDate range. I can jump to that range and bypass any extra data that’s been added since the last time statistics were updated.”



Technically, this query plan works totally fine – there’s nothing wrong with it. But there sure is something misleading about that execution plan, isn’t there? The operator says “Index Seek”, which implies rows are being eliminated, and the “1 of 3” makes it sound lightweight too. That thick arrow is your only clue that there’s more work involved here than first meets the eye.

Oh, that thick arrow size? On actual plans, it would look like it relates to the data coming out of that execution plan operator. That’s not what it means at all – on actual plans, it refers to the amount of data that was read by that operator. The thicker the arrow, the more work is being done.

So when you see a thick arrow on an actual plan, look in the direction that the arrow is pointing away from.

(How is that for messed up? Even the arrow is drawing your attention away from the very problem. I tell you, sometimes I think Microsoft does this stuff just to give me job security.)

How to Think Like the Engine: Index Column Order Matters a LOT.

We’ve been working with the clustered index of the Users table, which is on the Identity column – starts at 1 and goes up to a bajillion:

dbo.Users clustered index

And in a recent episode, we added a wider nonclustered index on LastAccessDate, Id, DisplayName, and Age:

Whose leaf pages look like this:

So what’s going to happen when we run this query:

DisplayName is in the nonclustered index,
but we can’t use it for seeks.

If you look at the gray pages of the index, they’re organized by LastAccessDate, then Id, then DisplayName. We can’t use the gray pages to seek directly to Brent.

In this case, most people think SQL Server is going to scan (not seek) the entire clustered index (white pages.) If we don’t have an index that directly supports our query, we think SQL Server’s going to resort to scanning the table. Good news, though: it’s way smarter than that.

SQL Server says, “If I have to scan the entire table to find some rows, what’s the narrowest/smallest copy of the table that I could scan in order to achieve my objectives?” After all, it can use the gray index – it’s just going to scan it, the same way it would scan the clustered index. Which index is smaller/larger? Let’s use sp_BlitzIndex to find out:

  • The CX/PK is the clustered index, and it’s 58.1MB.
  • The nonclustered index (gray pages) is 12.5MB.

Both of them have the same number of rows (299,398) – so all of the rows in the table are included in both objects. Both of them have everyone’s DisplayName. So if you’re looking for people by DisplayName, you would much rather scan 12.5MB of pages than 58.1MB of pages – and SQL Server feels the same way. It uses the gray pages:

Number Of Rows Read = 299,398 because SQL Server had to scan the entire index in order to find the Brent Ozars. Only one of us showed up. (Thank goodness.)

Estimated Number of Rows Read = 299,398 because SQL Server knew it was gonna read the whole table to find me. If we’d have said SELECT TOP 1, then SQL Server would have been able to quit as soon as it found a matching row, so it might have been able to read less rows.

Predicate = @1 is a little tricky. Note that it says Predicate, not Seek Predicate. The word “Predicate” by itself means SQL Server is looking for this data, but it couldn’t seek directly to the rows we were looking for – because DisplayName wasn’t first in the index.

To get a seek predicate,
we need an index that starts with DisplayName.

Let’s create one, and rerun our query:

Now that’s better: SQL Server is able to seek directly to DisplayName.

Number of Rows Read = 1, and so does Actual Number of Rows. That means you’ve got a really good index for your query’s goals: SQL Server is able to read exactly as many rows as it needs, no more.

We now have a Seek Predicate, not just a plain ol’ Predicate. I really wish that instead of just saying “Predicate”, SQL Server would call them Scan Predicates, because when you see the word “Predicate” by itself, you’re likely reading more rows than you actually need – and it’s a sign that you could improve performance by doing index tuning.

When you’re designing indexes, the first column matters a lot.

You’ll hear folks say that the first column should be really selective, or really unique. That’s kinda sorta true, but not really – we go into more details on that in my Fundamentals of Index Tuning class.

What IS important is that the first column needs to be something you’re filtering on. If my queries aren’t filtering by LastAccessDate – like the example in this post – then an index that starts with LastAccessDate isn’t going to be very helpful, even though LastAccessDate is extremely selective (unique).

However, even if you’re seeking on the first column doesn’t mean that you’re done with index design. A single query can get a mix of seek predicates and scan predicates, too – more on that in the next episode.

Writing Diagnostic Queries is Hard Because SQL Server Still Has (Tiny) Bugs.

When I first got started working with SQL Server, I assumed it was flawless because it came in a sealed box from a ginormous corporation. I figured they’d tested every edge case, and that there was no way things were gonna go sideways.

These days, even though I read CU KB articles and giggle about their contents over at, I’m still really happy with the quality of SQL Server. There are bugs, sure, but there are a lot of people doing really crazy stuff with databases and expecting it to work – even when there’s no way someone could possibly have thought to test that particular combination of craziness.

So when I run into a problem, I start by assuming it’s a problem with my code, because let’s be honest: it usually is. Almost always. 99.999% of the time, probably. (Hey, whaddya know, I’m achieving five nines.)

But every now and then, it’s actually SQL Server.

And when it is, I try to boil the problem down to the simplest possible code that anyone can walk through in a matter of seconds on their own machine in order to see the problem. I want someone at Microsoft to be able to highlight the code, copy it, paste it in, and go, “Oh whoa, wait a minute, he’s right.”

Here are two examples.

sys.dm_db_persisted_sku_features ignores isolation level requests.

This one bit me in sp_Blitz: I had a client that was rebuilding large columnstore indexes, and whenever they did, their job to log sp_Blitz to table would get blocked, and cause cascading blocking issues in other sessions. At first I thought c’mon, no way – I am really, really religious about using dirty reads in our procs because this ain’t financial data, and I’m fine with inaccurate results here under heavy contention. But blocking? That’s bad.

Here’s how it happens. In one session, create a clustered columnstore index:

Then in another session, in the same database, try to query that DMV:

I know, those two isolation level hints are redundant – I’m just proving the point that it’s really a bug. It gets blocked, which shouldn’t happen. Other DMVs don’t work this way.

Once I had it boiled down to that simple repro script, I filed a bug report with Microsoft, and added code in sp_BlitzIndex to simply skip that DMV by default. (You can change that by setting @SkipBlockingChecks = 0.) I wasn’t sure how quickly Microsoft would fix the bug, or how far back they’d backport the fix.

It’s been a month now without a word from Microsoft on the bug report, so I don’t hold out a lot of hope for that one. (It doesn’t help that I accidentally flagged the idea as inappropriate just now, and I have no way to unflag it, HAHAHA. Man, I *really* hate the user interface for

sys.identity_columns has the same problem.

Similar repro:

Then in another window:

That gets blocked, and I filed a bug report for that one too, and a Github issue for sp_BlitzIndex that I’ll work on at some point, but that one’s less urgent since you can’t (currently) set up Agent jobs to log sp_BlitzIndex to table. This one’s going to be a lot more challenging though, as you can read about in the Github issue – the workaround’s going to have to be a lot more ambitious, I think.

Are these big bugs? Not at all – but if you’re writing diagnostic queries like sp_Blitz and friends, they can turn into hours of troubleshooting trying to figure out what’s going wrong – especially given how widespread these scripts are these days.

Announcing Weekend Versions of My Mastering Classes

Last week, I noticed that an awful lot of you were buying my Level 2 and Level 3 Bundles with your own personal money (not company cards), so I wondered – if your company wasn’t paying for it, were they giving you time off? So I polled to ask if you were interested in taking the Mastering classes over the weekend, and an overwhelming majority of you wanted weekend versions.

So here you go, the first weekend rotation:

To make your weekend learning easier, I’m compressing these down to 10AM-4PM Eastern time (rather than the normal 9AM-5PM time.) It won’t change the amount of content, but I’ll just make the bio and lunch breaks way shorter since you’ll be at home on Sat/Sun, and won’t have to deal with as much noise from your coworkers – although you WILL have that pesky home life to deal with.

How much will they cost?
That’s a good question.

Normally I’d want to charge a lot more for weekend work because frankly, I’d rather goof off on my weekends.

But on the flip side, I think you would rather goof off on your weekends too, and those of you who are buying this with your own cash – you would probably rather use your money for something else, too. You’re choosing to invest your weekend time and your personal money with me.

Live Class Season Pass 2018And that’s kinda humbling.

And I kinda wanna reward you for that, not penalize you for it.

So for now at least, during the Black Friday sale, the weekend classes are included in the Level 2 & Level 3 Bundles and the Live Class Season Pass, and they’re the same price as regular classes. After the Black Friday sale finishes, and based on how registration & attendance goes, I may revise that – I may no longer include the weekend classes in the season passes. For now though, pile on in!

I’m coming to SQL Saturday Stockholm! Registration is open now.

Hello again, Sweden! I’m coming to SQL Saturday Stockholm in May 2020.

I’ll be teaching my Performance Tuning in 21 Demos workshop:

Performance Tuning in 21 DemosYou’re a developer or DBA who wants to tune indexes and queries to make your application go faster. You don’t like learning by watching slides or reading white papers – you want to jump right into the code and see what happens. You’re the kind of person who loves watching YouTube videos that start with, “Hold my beer and watch this.” Attendees get a year of my Consultant Toolkit.

These pre-cons are an awesome deal for one day of in-person SQL Server training, and they usually sell out, so I try to get the word out as early as I can:

See you in Scandinavia!

SQL ConstantCare® Now Gives Index Advice, Too.

When I was a DBA, I got so sick and tired of monitoring tools and scripts yelling at me to make a million different changes – most of which didn’t really have any measurable impact on performance or reliability. Sure, I want things to go fast, but I only have so many hours in the day – and are users really going to notice if, say, I turn Lock Pages in Memory on? Especially when I’m not having any memory pressure to begin with?

SQL ConstantCareSo when we built SQL ConstantCare®, we only wanted to give you advice that users would actually notice, or that would save your job.

We wanted to focus on really big bang for the buck.

Indexes are big bang for the buck, but they’re hard to get right.

Especially when SQL Server’s missing index recommendations suck so much. So now, we’re going to make life easier for you: starting now, SQL ConstantCare® watches the missing index advice coming out of SQL Server, tracks it over time, and then recommends the most important index changes that can make a real performance difference.

We’re still dealing with data coming from SQL Server’s missing index DMVs, though, so we’re starting really paranoid:

We’re only focusing on small tables: under 100M rows and under 10GB in size. Again, starting on the paranoid side. As we build up expertise, we’ll also examine your SQL Server edition to see whether we should do the index build online, and based on your TempDB metrics & size, whether the SORT_IN_TEMPDB option should be used.

We make sure the same index recommendation shows up across multiple days. We want to avoid the problem where someone runs a few analysis queries on just one day, and then never does that work again. Going even further, we make sure that the query involved is running at least 100x per day. (Could less-frequent queries benefit from indexes? Absolutely – but we’re starting paranoid.)

We make sure the index isn’t a duplicate or narrower subset. If you’ve already got an index on DisplayName, LastAccessDate, then SQL Server may still ask for an index on DisplayName alone. We ignore that recommendation because while it might make queries marginally faster, it would make your delete/update/insert workloads slower, and we’re just not fans of slower.

We make sure the table already has a clustered rowstore index. If you’ve set up a table as a heap, we’re not going to recommend that you index it yet – although as we gain more confidence in our recommendations, we’ll start recommending clustered indexes for tables where it looks like someone just made an innocent mistake, like a heap with a nonclustered primary key. We’re also not recommending indexes yet for columnstore tables – even though your workload might actually need one.

We focus on tables with <5 nonclustered indexes. We don’t want to put you in a position where you’ve added so many indexes that your delete/update/insert performance suffers terribly. (If you’ve been through my Mastering Index Tuning class, you’ll know why.) We’re not automatically de-duping indexes (yet) to get you down to a lower number of indexes, but that’ll be in the works this year as well.

If the index recommendation has >5 columns, we remove the includes. We don’t wanna double the size of your table, so we’re starting with index recommendations with the 5 & 5 Guideline from Mastering Index Tuning. There can absolutely be cases where wider indexes make sense – but we’re not going to automate those yet.

We don’t include big (>200 byte) columns in includes. SQL Server has a nasty habit of throwing everything and the kitchen sink into your index, but if you want this Big Data™, you’re gonna have to do a key lookup to get it.

We limit the index changes per day. We only take the top 10 missing index recommendations on the server overall, and no more than 1 per table. We’re trying to limit the blast radius if a well-meaning admin just takes everything we say verbatim and hits execute. After all, when you add indexes, you tend to have wide-ranging changes in execution plans, and suddenly SQL Server will come up with better ways to do all kinds of queries.

Yeah, we’re being pretty paranoid.

When Richie started building the code for this, he joked that it was really hard to get the recommendation to actually happen! I’d put so many exclusions in the spec that it was really hard to build a workload to actually trigger it! And sure enough, since the recommendation went live a week ago, only a handful of index recommendations gone out. We’re definitely starting on the shy side. Over the coming weeks, we’ll watch how the recommendations go, and gradually ease up on the restrictions to get more indexing goodness out there in the wild.

The resulting index recommendations are in your prescription T-SQL file in your SQL ConstantCare® emails. Just like many of our recommendations these days, they’re ready to run – just copy ’em into SSMS or ADS, review ’em to make sure you’re comfortable, and hit execute.

To get ’em, sign up for SQL ConstantCare® or get one of our Black Friday Bundles. If you’re already a subscriber, upgrade to the latest exe – we added some stuff in this month’s release to get even more indexing goodness.

The Many Problems with SQL Server’s Index Recommendations

These days, I jump back & forth a lot between SQL Server and Postgres. (We use AWS Aurora Postgres to store SQL ConstantCare®‘s data.) Whenever I come back to the sweet, sweet graphical execution plans in SQL Server Management Studio, I breathe a sigh of relief. Dang, these things are so much easier for me to interpret. It’s like coming home.

Graphical plans include missing index recommendations, telling you how you should store your data for faster queries:

And if you don’t have time to review one query at a time, SQL Server makes wide-ranging analysis easy too, letting you query dynamic management views like sys.dm_db_missing_index_details to get index recommendations for the entire database. You can even use tools like sp_BlitzIndex to analyze and report on ’em.


Both of these – the index recommendations in the query plan and the ones in the DMVs – suffer from some pretty serious drawbacks.

They’ll recommend huge indexes. They don’t give a damn about the overhead of multiplying the table’s size. They’re razor-sharp focused on improving the response time of queries that need to find rows, but they don’t care about how much slower your deletes, updates, and inserts go.

The suggested key columns aren’t even in order. I know, you think they have something to do with selectivity, but they don’t: they’re just a comma-delimited list of columns ordered by equality vs inequality search, then by their field order in the table. Don’t believe me? Here’s a query to reproduce it.

The indexes may not even get used. Because the key columns aren’t necessarily in the right order, there are times where you’ll see SQL Server recommend an index, only to realize that the key order doesn’t even make sense for the filtering, grouping, joining, and ordering that the query really does. But does SQL Server go back and tell you to drop that index? Nope – it’ll just keep recommending indexes to support your hoarding habit.

The indexes may make the query slower. I’m not talking about unintended side effects where one query gets faster and another gets slower – I’m talking about SQL Server recommending an index for one specific query, and then that very query gets slower and uses more CPU. That demo’s always an eye-opener in my Fundamentals of Index Tuning class.

They’ll recommend redundant indexes. You’ll often see SQL Server recommend one index on ColumnA, ColumnB, and then another separate index recommendation for just ColumnA, and then yet another recommendation on ColumnA + include ColumnB. Even worse, they’ll recommend indexes that are subsets of existing indexes, or tell you to make a wider index (but not tell you to drop a narrower subset index that would be replaced by the new one.)

They’ll recommend indexes on huge tables. Whether the table has ten thousand rows or ten billion rows, you get the same basic index recommendations. I dunno about you, but I gotta approach indexing a lot more carefully when I know that the index would take an hour to create.

They’ll recommend nonclustered (but not clustered) indexes on heaps. I can almost hear SQL Server talking itself through this. “Well, sure, the entire table is literally organized in random order, but … I sure would like a copy of it sorted by just the primary key.”

They’ll recommend indexes for rarely-executed queries. If you let an analyst loose in a database, they’ll run all kinds of crazy queries in just one day’s time. SQL Server sees those crazy queries, recommends all kinds of indexes to make ’em go faster, and then…stores those recommendations, showing them over and over again whenever you check index design for a table, even weeks or months later. Sure, the index recommendations will get reset eventually, but…

They don’t tell you when this data was reset. You have no idea how much history you’re looking at – maybe a day’s worth of queries, maybe six months.

They don’t let you reset the data on your terms. Sure, you could restart the entire SQL Server, or set the database offline, but who on earth can actually do that just to do index analysis? It’s left up to you, dear reader, to log this data to a table regularly and then do differentials on it to understand when the indexes would have been used.

“But it gets better in Azure SQL DB, right?”

I got really excited when Microsoft announced automatic indexing in Azure SQL DB. After all, if the robots are going to do our job for us, they’re gonna need better data to work with, right? I excitedly devoured the white paper about it to see what new instrumentation they’d added in order to help solve the problem.

And they didn’t actually fix the missing index DMVs.

They just wrote code to work around the limitations.

The bad news is that they’re probably not going to fix the broken missing index DMVs anytime soon. Sure, we got really incremental improvements in SQL Server 2019, but they don’t address any of the above stuff. That’s a bummer.

The good news is that if they can work around it, so can we. I’ll tell you about the newest feature of SQL ConstantCare® in the next post.

Announcing a New Live Online Class: How I Use the First Responder Kit

During the hands-on labs for my Mastering classes, when I’m answering an attendee question, this kind of thing happens a lot:

  • Attendee: “How would I find the query causing that missing index recommendation?”
  • Me: “Oh, good question – you can use the sp_BlitzCache @SlowlySearchPlansFor parameter. Here’s how it works.” (Does a quick demo)
  • Attendees in unison in Slack: “WHOA I HAD NO IDEA THAT PARAMETER EXISTED”
  • Me: “Yeah, there are so many parameters. One of these days, I need to teach a one-day class just on the First Responder Kit’s parameters and how you use them.”
  • Attendees: “OMG YOU MUST DO THAT”

How I Use the First Responder KitSo for 2020, I put together a new live class:

How I Use the
First Responder Kit

You’ve downloaded sp_Blitz, sp_BlitzIndex, and sp_BlitzCache from the First Responder Kit, and you’ve run them a few times. You know there’s a ton of advice in there, but…it’s kinda overwhelming. Where should you even start? What should you tackle first? Are there parameters that would help diagnose your specific situation?

In this one-day course, I’ll walk you through running them on your production server, and we’ll interpret your results together. Space is limited in these particular live classes because I’ll be giving you advice on your specific server.

Here’s how the day will go:

  • You’ll run sp_Blitz to do a server-wide health check
  • You’ll run sp_BlitzFirst to look at your server’s wait stats and storage throughput since it started up
  • You’ll graph the server’s performance over time with the Power BI Dashboard for DBAs
  • You’ll run sp_BlitzCache to find the queries causing your top wait types, using my favorite parameters for deeper diagnosis and trending
  • You’ll run sp_BlitzIndex to diagnose the most urgent indexing issues specifically related to your top wait types
  • You’ll write up a report to hand off to the rest of your team about what you learned about your server today, and what actions everyone needs to take in order to see a performance boost

Throughout the day, as you have questions about the data you’re seeing, you can post screenshots in the class’s Slack channel and get advice from me (and the rest of the students!) You’ll learn not just from your own questions, but also from the data that the other students are sharing about challenges they’re facing, too.

Registration is open now – but seating is limited in this class because I want to make sure to have plenty of time to answer individual attendee questions about their particular server’s issues.

For those of you with the Live Class Season Pass, Level 2 Bundle, or Level 3 Bundle, registration is of course completely free for you, since you get access to all of my live classes while you’re subscribed.

If you only got the Level 1 Bundle and you’re feeling buyer’s regret for not going bigger, good news: as of today, if you go to the Level 2 Bundle or Level 3 Bundle pages, you get $295 off during our Black Friday month sales too. I got tired of fielding emails that said, “Hey, I’m really wishing I’d have gone for that Level 2 – it’s just too good to pass up,” so now you can do self-service upgrades. Enjoy!

Update Nov 13: well, that was fast! The first one already sold out, so I added a date in January. If that one sells out, I won’t add another one until after the April class date.

How to Think Like the SQL Server Engine: Included Columns Aren’t Free.

In our last cliffhanger episode, I said that if we ran this query:

And we had this index:

Then we would have to do all of these things:

  1. Look up user #643 on the clustered index by doing a seek – because thankfully our kind developer included the clustered primary key in the where clause
  2. (If the relevant clustered index pages for this user aren’t in memory, fetch them from disk)
  3. Get user #643’s LastAccessDate, Id, DisplayName, and Age so that we can look them up quickly on the gray index
  4. Look up that LastAccessDate, Id, DisplayName, and Age in the IX_LastAccessDate_Id_DisplayName_Age
  5. (If the relevant nonclustered index pages aren’t in memory, fetch them from disk)
  6. Lock these rows that need to be updated
  7. Write to the transaction log file indicating what we’re going to change
  8. Change the data page(s) for the clustered index
  9. Change the data page(s) for the nonclustered index, including moving a row around to different pages if necessary (like if the key values changed enough to merit being on an entirely new page)
  10. Later – asynchronously – write the changed data pages to disk

So what parts can we skip if we just include DisplayName and Age?

Are included columns somehow lighter weight? Let’s say we have this index:

Let’s examine the work required with the old index:

  1. Look up user #643 on the clustered index by doing a seek – because thankfully our kind developer included the clustered primary key in the where clause – we still have to do this
  2. (If the relevant clustered index pages for this user aren’t in memory, fetch them from disk) – we still have to do this
  3. Get user #643’s LastAccessDate, Id, DisplayName, and Age so that we can look them up quickly on the gray index – we still have to do this, but we only need LastAccessDate and Id – not a real reduction in work here
  4. Look up that LastAccessDate, Id, DisplayName, and Age in the IX_LastAccessDate_Id_DisplayName_Age – we still have to do this, it’s just that we’re looking it up in IX_LastAccessDate_Id_Includes
  5. (If the relevant nonclustered index pages aren’t in memory, fetch them from disk) – we still have to do this
  6. Lock these rows that need to be updated – we still have to do this
  7. Write to the transaction log file indicating what we’re going to change – we still have to do this
  8. Change the data page(s) for the clustered index – we still have to do this
  9. Change the data page(s) for the nonclustered index, including moving a row around to different pages if necessary (like if the key values changed enough to merit being on an entirely new page) – we still have to do this – it’s just that we don’t have to move the row around on the page…or do we?
  10. Later – asynchronously – write the changed data pages to disk – we still have to do this

Even before I get tricky with the answer, you can see that we still have to do every single one of those steps! (And just to be clear, there are even more steps than this in dealing with updates, but these are by far the biggest ones, and I’m concentrating on the bulk of the workload here, not tiny edge case differences – let’s focus on the big picture for now.)

The only possible reduction in these 10 steps is a slight tweak to #9, the ability to skip moving rows around between pages.

Well, we still save a little in step 9, right?

Keep thinking, dear reader, by examining the index page involved. User #643, Sarcastic, is at the top right of the page:

If you update Sarcastic to be 31 years old instead of 32, does he move on the page?

If you updated his name to be Optimistic instead of Sarcastic, do you have to move him around?

Even if you somehow changed his Id, do you have to move him then?

As long as the first column or two (or both) are fairly unique, the sorting on the third, fourth, fifth, etc keys on an index don’t matter nearly as much as the first column or two in the keys. This is one of the (many) reasons why folks often suggest that the first and second keys on an index should be very selective: they should help you quickly narrow down your search space. If that first key or two changes, then sure, we’re going to need to move a row around – but for subsequent keys, not quite so much, assuming that you did a good job of picking your first and second keys. We dig into picking index key column order in way more detail in my Fundamentals of Index Tuning class.

There are absolutely cases in this big, wide, wonderful world where it makes a stunning difference whether the 3rd and 4th columns are keys as opposed to included columns. However, they’re probably not the cases that you’re facing. In your environment, out of the 10 steps involved with updating a field, the big question isn’t, “Should this column be in the key or the includes?”

The big question is, “Should this column even be in the index at all?”

In a perfect world, we don’t index very hot data: data that changes constantly, forcing us to take locks on not just the clustered index, but all of the nonclustered indexes that contain that data, too. The more columns you include – especially hot columns – the more blocking problems you’re going to run into, and the slower your storage becomes (because you’re doing so many writes per update.)

In that perfect world, we probably wouldn’t index LastAccessDate, for example – a value that changes every single time a user logs in or visits a page. After a few days of users logging in, imagine what this 8KB page above is going to look like – and we’ll talk about that soon, too.

The point to take away:
design the right leaf pages first.

When you’re designing an index, start by asking:

  1. “Does this index enable me to seek to just the rows I need, and only read those, and no extras?”
    • “If not, what are the extra reads, and am I okay with that?”
  2. “Does this index give me all the columns I need?”
    • “If not, what are the additional key lookups required, and am I okay with that?”

Don’t get hung up on keys versus includes before you answer those first four questions. Once you’ve answered those, then you can start obsessing about keys vs includes – but so often, I see people obsessing over what to put in the includes, thinking that it’s somehow going to be noticeably cheaper for performance purposes. Most of the time, it ain’t – but even worse, they haven’t designed the index to answer those four questions above to begin with.

And I get it – it’s hard, especially when the missing index recommendations seem to be working against you. We’ll talk about that later this week.

Would you attend the Mastering classes on a weekend?

I notice that a lot of you are springing for my Black Friday deals with your own money rather than the company’s money. That got me wondering – if you’re paying for it out of pocket, maybe your company isn’t going to give you free time at work to attend the class either.

So, should I run a weekend version of my 3-day Mastering classes? Let’s find out with a poll:

Contest: Guess the SQL Server 2020 Release Date

When will Microsoft officially release SQL Server 2020 for download? The dust is just barely starting to settle on the 2019 box, so it’s time to guess the next one.

  • Leave one – and only one – comment here in YYYY/MM/DD format with your release date guess. If you leave multiple comments, only the first/earliest one is going to count.
  • “The release date” is the date that Microsoft announces that the final RTM bits will be downloadable to the public from (Not the date they make the announcement, and we’re only talking the release-to-manufacturing public version, not a preview, CTP, RC, CU, or SP.)
  • Closest to win, without going over, wins a Live Class Season Pass and a Recorded Class Season Pass.
  • In the event of a tie (multiple people guessing the same date), the earlier comment wins both the live & recorded pass, and subsequent folks only win a Recorded Class Season Pass.
  • Only comments more than 48 hours earlier than Microsoft’s public release announcement will count. If Microsoft makes their announcement, and you run over here trying to leave a fast comment with the release date, not gonna take it.
  • If Microsoft announces two release dates – like one for Windows, and one for Linux – then we’ll pick a separate winner for each. (But you only get to leave one date in the comments.)

Place your bets!

How to Think Like the SQL Server Engine: Should Columns Go In the Key or the Includes?

In our last episode, in between crab rangoons, I had you create either one of these two indexes:

And I said that the leaf pages of either index would look the same:

In terms of the space they take up on the leaf pages, it doesn’t matter whether columns are in the keys of an index or in the includes. It’s the same amount of space on the leaf pages. To see it, run sp_BlitzIndex focused on that table:

And note that the two highlighted indexes at the bottom – which are the two indexes I told you to choose from – take 12.4MB and 12.5MB. The difference in size has to do with the tree pages that help SQL Server get to a specific leaf page. If you wanna learn more about that, pick up a book on SQL Server internals – any of them will do, since this part of the topic hasn’t changed in decades.

Somehow, people think included columns are cheap.

I don’t know how this happened, but an urban legend developed around INCLUDE columns that they’re somehow blazing fast and inexpensive. To understand why, think about this query:

If we ONLY have this index:

Then here’s how the update’s execution plan looks:

Remember, we read plans right to left, so here’s the work we did, although not necessarily in order – and note that I’m including more details than the plan shows graphically because there’s more work to be done than is immediately visually apparent:

  1. Look up user #643 on the clustered index by doing a seek – because thankfully our kind developer included the clustered primary key in the where clause
  2. (If the relevant clustered index pages for this user aren’t in memory, fetch them from disk)
  3. Get user #643’s LastAccessDate, Id, DisplayName, and Age so that we can look them up quickly on the gray index
  4. Look up that LastAccessDate, Id, DisplayName, and Age in the IX_LastAccessDate_Id_DisplayName_Age
  5. (If the relevant nonclustered index pages aren’t in memory, fetch them from disk)
  6. Lock these rows that need to be updated
  7. Write to the transaction log file indicating what we’re going to change
  8. Change the data page(s) for the clustered index
  9. Change the data page(s) for the nonclustered index, including moving a row around to different pages if necessary (like if the key values changed enough to merit being on an entirely new page)
  10. Later – asynchronously – write the changed data pages to disk

What’s that, you say? You didn’t see all that on the plan? Well, a lot of it is hidden at first glance, and you have to hover your mouse over various tooltips (or infer stuff to figure it out.) For example, here’s the nonclustered index update, hidden in the clustered index update:

Do included columns require less work?

You tell me: take a look at that 10-point checklist above, and tell me what would be different if we had this index:

And we ran this specific query:

And I’ll give you a hint: the user involved is at the top right of this gray page:

So you can work through the query yourself with a pen and paper to figure out how much better off we’d be if DisplayName and Age were only included columns rather than part of the key. You can rant and throw things in the comments, and I’ll follow up with the answers in our next episode in this series.