Building SQL ConstantCare: Let’s Tune an Index in Postgres.

This week, our SQL ConstantCare® back end services started having some query timeout issues. Hey, we’re database people – we can do this, right? Granted, I work with Microsoft SQL Server most of the time, but we host our data in Amazon Aurora Postgres – is a query just a query and an index just an index regardless of platform? Follow along and let’s see how your SQL Server knowledge translates across platforms.

Finding AWS Aurora Postgres queries that need indexing help

I started the process by going into AWS Performance Insights, their free database performance monitoring tool for all RDS-hosted instances – including MySQL, Postgres, SQL Server, and Aurora. It comes with 7 days of performance history, which is plenty enough to help with most real time troubleshooting.

It’s not exactly intuitive to find: in the RDS console, when looking at the monitoring graphs for your instance, you have to hit the Monitoring dropdown, and choose Performance Insights:

I switched my time scale to 24 hours, saw an area of the chart I wanted to zoom into, and then selected that area of the graph with my mouse (trackpad, whatever):

The graph switches to the wait stats for that specific time range, and on the right side, the default slicing is by waits, so I can investigate where the server’s bottleneck is:

Or if I want to figure out which queries used the most resources overall, I can change that slice to queries:

And I get a color-coded list of which queries used the most resources. That’s pretty doggone easy. Looks like I need to do some tuning for the query that checks to see whether your heaps need to be rebuilt – and now things start to get a little more complicated.

Designing an index based on the SQL query

Here’s the first part of the query that I need to tune – it’s a CTE that kicks off the rule query, and it looks for heaps that have had more than 100,000 forwarded fetches since the last time we analyzed ’em a week ago:

The 3082 and 3053 are numbers I’m using to replace variables – normally, this code is driven by variables for the specific collection of DMV data that we’re analyzing, and the collection done around one week prior.

That sys_dm_db_index_operational_stats table only has one nonclustered index:

Isn’t that if not exists syntax nifty? Postgres has so many cool shortcuts. And isn’t that index name terrible? This is what happens when you let Richie name things.

Anyhoo, let’s look at what the index has, versus what the query needs:

  • connection_id = this is helpful, because we’re seeking to just one connection_id, meaning one collection of DMV data for one server on one day
  • user_id: uh oh, our query isn’t using this at all, but it’s the second column in our index, meaning from this point forward, the data is essentially unsorted
  • object_id, index_id: the table & index we’re analyzing (useful)
  • leaf_insert_count: we’re not using that at all here (but we might be for other queries)
  • And these fields aren’t in the index, but the query needs them: database_id, partition_number, forwarded_fetch_count

So I have a few things I could do here:

  • I could add user_id into the query – because I do know it at runtime, and I could just as easily pass it in to make sure that SQL Server understood the rest of the ordering key fields could still be used as-is, but the index still won’t work without putting database_id in the index.
  • I could build a new additional nonclustered index based on connection_id, database_id, object_id, index_id, and partition_number, possibly including forwarded_fetch_count. You don’t usually think of adding indexes tailored to fit every query, but this table essentially exists to satisfy just a handful of queries, so it might be realistic. The production table has 350M rows though, so I don’t wanna go wild and crazy with indexes here.
  • I could tweak both the query and the index by adding user_id into the query, and then adding database_id, partition_number, and forwarded_fetch_count to the index. The key ordering on there would be a little tricky, and in a perfect world, I’d need to see all of the other queries that touch this table. Because Richie is what’s known in our industry as a 10X Full Stacks Developer™, he’s got all of our queries in source control, so it’s easy for me to do a quick search in the repo to find all queries that touch a table.

I chose to add a new index to perfectly cover this query – and based on my quick search of the repo, I don’t think the other index will be helpful to any of the other existing queries. Here’s the new index:

(I honored Richie’s naming convention, but Postgres is shortening the names of the indexes, so I just embraced that.) That index is fairly straightforward, but to learn more about how I design indexes without looking at the query plan first, check out my Fundamentals of Index Tuning course.

Next, I need to see whether the query plan will use my index or not.

Finding out if the query uses my index, and how much it helps

Our production Aurora database is around 2TB, and obviously I don’t wanna go creating indexes on the fly in there. I want to create indexes in development first, and make sure that the query actually picks those indexes up, and see what kind of a difference it makes in the plan.

Using Azure Data Studio, to get an estimated query plan in Postgres, you highlight the query and then click the Explain button, which gives you:

<sigh> I know. Text. You’re probably used to graphical plans in tools like SQL Server Management Studio, SentryOne Plan Explorer, or JetBrains DataGrip, but I’m living in the dark ages over here in ADS. You can also use the explain command to dump out results in JSON format, and then copy/paste it into online visual plan viewers like PEV or explain, but those say that ADS’s output isn’t valid JSON, so, uh, yeah. I just use text. (I’m not sure if the problem is Aurora or ADS.)

Anyhoo, some of the important things to notice from the text are:

  • There are two index scans, one for each connection_id: these are what you and I would consider index seeks in SQL Server because SQL Server would seek to one particular connection_id and then start reading, but then it’s going to scan all of that connection_id’s data because the rows aren’t sorted in a useful way.
  • Those two index scans each have a cost of 1,113 query bucks.
  • The query’s total cost is 2,227-2,244 query bucks.

Next, I created the index, and then I asked Aurora to explain the plan to me again:

The takeaways:

  • The index accesses are now called “Index Only Scan” (and they use my new index) – because now the index completely covers the query, so there are no key lookups
  • Each index scan only costs 0.43-739.92 query bucks
  • The query’s total cost is 0.85-1494.62 query bucks

Ta-dah! Now I feel confident adding that index in production. Well, not me, personally – I’m going to add an issue in Github explaining what I want to do and why, and then open a pull request with the new index. See, I don’t actually have write permissions in production. Only Richie does. I’m the pointy-headed boss around here, and the last thing I should be doing is changing stuff in production.

Your SQL Server knowledge helps you elsewhere.

At the end of the day, a relational database is a relational database. There are differences in the ways that you interact with them, like the tools you use and the format of the output, but it’s not rocket surgery. If you had to take on a new database platform, you could! There are interesting differences between SQL Server and Postgres – the vacuum process is an interesting one, for example – but overall, they’re surprisingly similar.

Performance Tuning Means 3 Things

  1. Managing what requests are made
  2. Managing the efficiency of each request
  3. Managing the hardware capacity available to satisfy requests

And that’s it.

In the grand scheme of things, those are the only 3 knobs you get.

Oh sure, from the outside, performance tuning looks more complicated than that, but at the end of the day, it always comes back to those 3 knobs and knowing which ones you can turn.

Managing what requests are made and when starts with understanding your workload: analyzing the queries that are running, where they’re coming from, what the data is being used for, and why the request is being made. The easiest free way to start with that is sp_BlitzCache, and one of my favorite ways to start the discussion is by using @SortOrder = ‘executions’ to show the most frequent query requests. Time after time, folks just flat out don’t know what the app is actually requesting from the SQL Server – and when we start asking why, the tuning process often progresses to … just not making those requests, like caching static configuration data in the app tier.

Managing the efficiency of each request starts with measuring what the queries are doing, and how the database server is configured to respond to those requests. Tuning efficiency can mean changing indexes, isolation levels, SQL Server settings, or even tuning the query itself.

Managing the hardware capacity available starts with measuring how overloaded the server is. I like using wait time per core per second to figure out if there’s smoke coming out of the server, or if it’s sitting around bored.

As a consultant, I like giving clients a couple/few different options that fall into different categories. For example, if they’re querying the database for static configuration data thousands of times per second, they could either cache it in the app tier, or continue to invest more in hardware capacity to sustain those requests.

When you compare caching to cha-ching, suddenly folks get it. They’d rather go to the registered cache than the cash register. For a good primer on caching in .NET, check out Nick Craver’s post on how Stack Overflow does caching.

Can SELECT * Make a Query Go…Faster?!?

Most of the time, you’ll hear advice that you should select as few columns as possible so that SQL Server can do less work.

However, one of my clients hit a neat edge case scenario where SELECT * queries actually ran faster than picking a subset of columns.

I can’t share the exact query, but I can share a query that can at least illustrate the underlying concept.

If you’ve seen my “But It Worked in Development!” – 3 Hard SQL Server Performance Problems session from the 2017 PASS Summit, you may remember that SQL Server has some odd behavior when it comes to estimating how much data will come out of a query. Starting at around the nine minute mark, I run this query:

The Stack Overflow database doesn’t ship with an index on Reputation by default, so SQL Server needs to sort all of the users by reputation in order to find the top 250. The Sort operator in this plan is going to need a memory grant:

In order to calculate the memory grant needed by the sort, SQL Server looks at the amount of data that it thinks is coming out of the Users table. That guess, as I talk about in the PASS Summit session, is wildly incorrect because it’s based on the data type sizes, not the data that’s actually in the table. Because it thinks this table has 37GB (it really only has 1GB), SQL Server overestimates the memory grant:

SQL Server desires a memory grant of 49GB, but “only” grants 10GB because that’s the max that my SQL Server will allow due to its hardware size. The query only uses 1.5GB of memory – the sort has plenty of memory in order to do its job. Too much, in fact, but who’s counting?

But if we change it to just SELECT Id…

Instead of using SELECT *, then now SQL Server is going to estimate that a much lower amount of data is coming out of the Users table, and now the sort spills to disk:

Because SQL Server underestimated the memory grant this time, only asking for 600MB of RAM.

In this query’s case, the spill doesn’t really matter – the SELECT Id is still faster. I’d say that’s true in 99.99% of cases out in the real world, too: you only want to select the specific columns you need. However, this is just a neat introduction to a complicated issue: trying to get memory grants to be high enough to handle your data, but not so high that they lead to RESOURCE_SEMAPHORE poison waits under concurrent load like I talk about in the PASS Summit session.

Building SQL ConstantCare®: Which Recommendations Do People Mute?

Our SQL ConstantCare® service sends you daily recommendations for specific tasks that will improve your SQL Server’s health and performance.

Next to each recommendation, there’s a Mute link that you can click on to either mute the recommendation entirely, or mute it for specific databases, indexes, or queries:

When we brought out that feature last year, I was really curious to see what recommendations people would mute. Before the Mute links came out, users had always been able to mute things by hitting reply and telling me what to mute, but I guessed that they’d get more ambitious with muting stuff once they were allowed to mute it themselves without talking to me. You can read what they were muting back then.

Here’s what you’re muting:

  1. Consider Enabling Query Store – now muted by 151 users. I’ve blogged about why people aren’t using Query Store, and this is still a bit of a bummer. I think we could turn this number around, but users need to see value out of the feature. It’s just not easy enough yet for most folks to go from turning it on, to configuring it properly, to getting value out of it.
  2. Check for Corruption ASAP – 110 users. Keep in mind that this isn’t the number of people who have been alerted about the lack of corruption checking, and then actually fixed it. These are just the people who said, “Yeah, I know I’m not checking for corruption, and you can mute that recommendation.” That’s why I’m asking for SQL Server to just start automatically check data integrity in the background. It’s too important to leave to end users to implement.
  3. Transaction Log Larger than Data File – 104 users. I need to go back and tweak the code on this rule because I’ve seen some commonalities: for example, people seem to run into this problem a lot on the SSIS DB and want to ignore it.
  4. Set Fill Factor Higher – 65 users. This rule currently fires for fill factors 79% and lower. I may tweak the threshold on this one too.
  5. Check Unusual Database States – 60 users. Dang, y’all really like keeping offline databases around. I think I wanna tweak this one to just alert when a formerly-online database drops offline.
  6. Take a Full Backup – 59 users. COME ON NOW.
  7. Move User Databases Off the C Drive – 56 users. That’s fair: once we’ve found this problem, you probably don’t want to hear about it repeatedly. I’m not sure there’s value in keeping this rule around because it doesn’t look like a lot of people are willing to fix it.
  8. Take a Log Backup – 44 users. We’re only alerting you for databases in full recovery model that haven’t had a log backup in several days. Right now, we don’t have logic to detect if you’re offloading log backups to a readable secondary, but based on this high number of mutes, that’s something we need to build.
  9. Move TempDB Off the C Drive – 42 users. See #7.
  10. Enable Checksums for Corruption Detection – 39 users. This one kinda surprises me because it’s so safe, easy, and fast. I think I need to work on my communication about the rule to explain why it’s so important.

Three Reasons to (Temporarily) Change Max Worker Threads

When you go to the deli, there’s a “take a number” dispenser where you tear off a number and wait for your number to be called.

If you walked into a deli and you saw 50 people lined up all waiting their turn, and you went to take a number and the dispenser was completely empty, you’d say to yourself, “I think I’ll come back later.”

The deli could solve the problem by getting a larger take-a-number dispenser – or could they? All that would let you do is grab a higher number, but it’s not like they’d be able to start working with you any quicker. More numbers don’t let the people behind the counter actually do more work. In fact, it’s almost the opposite: it just means that the queues are going to grow larger and larger.

SQL Server is the same way:

  • The workers behind the counter are your CPU cores.
  • When a query wants to start, it grabs a numbered ticket.
  • If there aren’t any tickets left, there’s a problem – but to be honest, there was already a problem if the deli was chock full of angry customers waiting their turn.

If a user walks into your SQL Server and there aren’t any tickets left, you run into a poison wait called THREADPOOL. To a user, THREADPOOL feels like the SQL Server isn’t responding to network requests – but it’s just because the workers behind the counter are really busy right now, and so many customers have piled up already.

You could solve the problem (except it’s not really solving the problem) by adding more numbered tickets. In SQL Server, the max worker threads setting is how many tickets are available. In the max worker threads documentation page, Microsoft explains that for most servers (64-bit operating systems with 5-63 CPU cores), the default max worker threads is 512 + ((logical CPUs -4) * 16):

  • 8 cores: 576 max worker threads
  • 16 cores: 704
  • 32 cores: 960
  • 128 cores: 4,480
  • 256 cores: 8,576

In theory, if you had 8 workers at your deli, and you wanted thousands of customers to be able to grab a number and sit around waiting forever to be called, you could set max worker threads to be, say, 8,576. However, in most scenarios, this is just going to mean a serious performance resource problem and a whole lot of pissed-off customers. Trust me, your sandwiches aren’t worth waiting that long.

There are 3 unusual customers who often stand around without ordering food.

SQL Server has a few features that consume worker threads even when they’re not doing anything:

  1. Availability Groups – which consume threads based on the number of databases you’re protecting
  2. Database mirroring – similar to AGs
  3. Query notifications – which let apps subscribe to query updates, but consumes threads based on the number of SqlDependencies (times the number of servers subscribed)

In our deli analogy, these 3 features can consume all of your ticket numbers – even when they’re not actually doing any work. They’re considered active, running queries – even though they’re just standing around the deli, and whenever they’re called on, they just say, “Go on to the next customer – I haven’t figured out which cheese I want yet.”

If you’re backed into a corner where you’re experiencing worker thread exhaustion and THREADPOOL waits – running out of numbers to start new queries – just because these 3 features have consumed all your worker threads, then you could temporarily solve that problem by increasing max worker threads. That will let other customers go ahead and start placing orders – and at first, it’ll seem like you’ve averted disaster, and your deli can make money again.

But sooner or later, those 3 customers WILL order food.

And that’s when your deli is going to be in a world of hurt.

Great example: I’ve had several clients with thousands of databases protected with Always On Availability Groups spread across several replicas. They all had to raise max worker threads to a ginormous number just so that they could continue to service end user queries. They told themselves, “It’s fine, it’s not like all of these databases are changing data at the same time – most of the time, those worker threads are idle, so it doesn’t matter.”

Until one of the replicas went offline for more than a few minutes.

Then, as soon as the replica came back online, all of the databases on the primary replica had been active over the last few minutes, and all of them had changes they needed to push over to the replica. All of a sudden, the deli was chock full of angry customers, all of which wanted to place an order RIGHT NOW, and CPU went to 100% in a really ugly way, with new queries timing out and experiencing THREADPOOL waits.

When you buy a bigger number dispenser, you also need to start architecture discussions.

The bigger number dispenser buys you time, but only as long as not all the customers wanna place an order. The very fact that you had to buy more numbers is your big sign that at some point, workloads can be unsustainable, and you’re going to need to figure out how to:

  • Reduce the number of potential active worker threads, or
  • Get yourself more CPU cores (workers behind the deli counter), or
  • Divide the work across more SQL Servers

When you’re making these decisions, stop to talk to the business about their future growth plans. Think about the number of databases you’ll have 3 years from now, and the number of query notifications you plan to add. If those numbers are up in the thousands, it’s time to revisit the architecture strategy.

[Video] You shouldn’t dynamically create/alter/drop triggers.

Let’s get one thing out of the way first: this isn’t about the relative good or evil of triggers. I kinda like triggers, in fact – they’re good for enforcing business logic in a hurry when you can’t redo code across an entire application. Sure, in a perfect world, we’d always have the code do whatever the code needs to do – but there are times when you can’t, and triggers come in handy then.

Just as you need to be judicious about what you do inside a trigger, you also need to be careful about when you actually deploy one, and you should probably never dynamically create/alter/drop a trigger. See, when you create, alter, or drop a trigger, you’re changing the table’s structure, which requires a higher level of locking than you might expect.

To demonstrate it, I’m going to create a trigger on the Votes table in the Stack Overflow database at the same time end user activity is happening, and show the blocking:

Here’s the demo code that I use in the video:

So what should you do instead? Treat trigger changes just like any other kind of deployment of a table change, like adding a column or dropping a column:

  • Try to schedule it off-hours during a low load period.
  • Consider the blocking implications of changing the object.
  • Only do it when there’s a human being around to watch what’s happening, and cancel the deployment if it’s causing a huge disruption.
  • And maybe most obviously, only do it once. Don’t automate the same change over and over if you can avoid it – it’s just too disruptive.

The 2020 Data Professional Salary Survey Results Are In.

We asked what you make, and 1,734 of you in 63 countries answered. Altogether, you made $170,753,936 this year. Damn, y’all are bringing home the bacon!

Download the 2020, 2019, 2018, & 2017 results in Excel.

A few things to know about it:

  • The data is public domain. The license tab makes it clear that you can use this data for any purpose, and you don’t have to credit or mention anyone.
  • The spreadsheet includes the results for all 4 years (2017-2020.) We’ve gradually asked different questions over time, so if a question wasn’t asked in a year, the answers are populated with Not Asked.
  • The postal code field was totally optional, and may be wildly unreliable. Folks asked to be able to put in small portions of their zip code, like the leading numbers.
  • Frankly, anytime you let human beings enter data directly, the data can be pretty questionable – for example, there were 14 folks this year who entered annual salaries below $500. If you’re doing analysis on this, you’re going to want to discard some outliers.

I did some quick slicing and dicing, focusing on SQL Server DBAs in the United States, and deleted the top & bottom 10 outliers (because they were kinda bananas.) Looks like DBA salaries are up again this year:

We had a lot more responders this year overall – responses went down in 2019, possibly because we got a little overzealous with asking a lot of questions. We took a lot of questions out this year because they just weren’t proving to be relevant, but if you want to slice & dice through past history on those, they’re still included in the spreadsheet output.

We did ask a couple of new questions though: what’s the newest version of SQL Server you have in production?

That’s awesome! 11% of respondents are already on SQL Server 2019. I know what you’re gonna ask: are those people paid more? For that, I’ll point you to the spreadsheet to do your own analysis, because you’d also probably want to filter for job positions you’re interested in, countries, etc. I don’t know much about stats, but I know that as sample size gets smaller, that kind of data – like salaries – is less useful, so be careful.

And the oldest version you have in production:

Ouch. 49% of shops are still running unsupported versions in production. But hey, at least the adoption rate of SQL Server 2019 (the newest version in the top table) has already caught up with the combined installations of SQL Server 2000 & 2005!

Hope this helps make your salary discussions with the boss a little bit more data-driven, and hope it helps justify keeping your salary competitive. If you’ve got questions about the data, go ahead and jump in – download the 2020, 2019, 2018, & 2017 results in Excel. Don’t ask me to slice and dice the numbers for you – you’re a data professional, remember? Get in there and analyze it to answer your own questions. You’ve got this!

If you write a blog post about the data, let me know – I’ll post a followup next Tuesday with a link to your blog, and that’ll help drive traffic to you.

My 11 Favorite 2019 Blog Posts

Yesterday, I showed the stuff you read the most, but enough data – now, let’s talk about feelings. These were the 11 posts I remember with the most fondness, in date order. I did indeed like some of the ones from yesterday’s list a lot, but I’m going to set those aside.

Erik and Tara are Heading Out (Jan 3) – It doesn’t seem right to have a 2019 retrospective without mentioning how this year started. It wasn’t a fun blog post to write, nor a fun business decision to make. However, looking back as the year closes out, I can say with more confidence that it was the right business decision to make, and I’m glad I did it.

Consultants: want a utility to gather SQL Server data? (Jan 8) – As soon as I decided to stop growing our own consulting work force, I decided to start selling the consulting tools that we’d built. The Consultant Toolkit is now used by hundreds of consultants around the world. That’s kinda awesome.

How fast can a $21,468/mo Azure SQL DB load data? (Feb 13) – spoiler alert: slower than a USB thumb drive.

Can deleting rows make a table…bigger? (Mar 5) – Anytime I can help Michael J. Swart solve a problem, I’m all over it. Plus, this issue pops up again in a new way with SQL Server 2019’s Accelerated Database Recovery, but I’ll leave that one for 2020.

Fixing Performance By Throwing Cloud At It (Apr 23) – This was one of those posts that just seemed obvious to me, but it came up repeatedly with clients, so I figured I’d write it down. I wasn’t proud of that post at the time – it just seemed so obvious – but I heard from a few people afterwards that they’ve integrated the term “throwing cloud at it” into their vocabulary, so that’s kinda cool.

15 Reasons Your Query Was Fast Yesterday, But Slow Today (Apr 18) – Even today, it’s surprisingly hard to get definitive answers from monitoring tools when this question comes up.

Pop Quiz: SQL 2000 Bug or SQL 2017 Bug? (May 1) – I knew this post wouldn’t be timeless when I wrote it, but hot dog, I really enjoyed writing it.

Yep, Developers are Using Unmasked Production Data. (July 5) – I get so frustrated when I hear trainers/presenters/bloggers/idealists talk about how developers should be using purpose-built-from-scratch data sets with no real customer data in ’em. The real world just rarely works that way. Sure, I get that it should work that way, but it’s just not reality today, so I wanted to poll the audience to get some hard numbers for proof.

The 6-Month DBA Training Plan (July 15) – Wow, time flies: I first wrote this 8 years ago as an email autoresponder. I kept getting emails asking me to update some of the content, so in the summer, I figured I’d sit down and just turn the private autoresponder into a public series of blog posts, and also update the autoresponder content while I was at it. I thought it would be quick and easy. I was wrong.

“But NOLOCK Is Okay When The Data Isn’t Changing, Right?” (Aug 5) – Simple demo & video proving why you probably shouldn’t be using NOLOCK.

So, uh, you can, like, draw me in SSMS. (Nov 16) – Because it’s amazing to have someone do something like this:

Portrait in T-SQL

Your Ten Favorite Blog Posts from 2019

Here are the 2019 posts you visited the most this year:

10. The Silent Bug I Find in Most Triggers – Once you know how to recognize this bug, you’ll see it everywhere. I wrote the post as an interactive Azure Data Studio notebook, something I thought I’d be doing a lot in 2019, but the lack of execution plan support in notebooks put the kibosh on that one. I don’t expect Microsoft to solve that anytime soon – there can’t be a lot of us who need query plan support in notebooks, just presenters. (I’m not being sarcastic there – the number of presenters who would use ADS for this kind of thing is really, really tiny compared to the number of, say, data scientists who use ADS daily. I love ADS and use it daily myself.)

9. Should we use stored procedures or queries built in the app? – Client DBAs ask this one a lot because they believe if they simply require untrained developers to start writing T-SQL instead of using ORMs, the queries will somehow magically be instantly faster, as if developers are just going to naturally write fast T-SQL.

8. SSMS v18 and Azure Data Studio: The Golden Age of SQL Server Tooling – How the hell did this end up in the top 10?!?

7. [Video] What’s New in SQL Server 2019 – When SQL Server 2014 released, the cardinality estimation changes caused legendary surprises for early adopters, especially the ones who didn’t test their workloads. (I get it – workload testing is hard.) As 2019 approached, I thought, “Can I write a demo for SQL Saturday attendees that will prevent a similar problem with 2019’s new engine features by teaching them while making them laugh?” This session went over really well at SQL Saturdays, so I figured I’d record it for the blog.

Big tequila

6. Free SQL Server Load Testing Tools – Everyone thinks they wanna load test their SQL Server until they realize how freakin’ hard it is. Classic example: try load testing a query with a DELETE in it, and it’s only gonna really show load one time before you have to reset the database.

5. What’s Better, CTEs or Temp Tables? – I field this question constantly.

4. How to Think Like the SQL Server Engine, Post 1 – This fall, I took my Engine presentation videos series and wrote them out as blog posts. I’ll also rerecord these with better A/V quality in 2020.

3. There’s a bottleneck in Azure SQL DB storage throughput. – When I discovered this, I mostly just wanted to document it for other Azure users so they understood why they weren’t getting the performance they wanted. Some blog posts are evergreen – always growing in popularity because they’re timeless topics. This one, not so much – it got a lot of attention (including from Microsoft) right when it went live, but then tapered off over time:

2. Which version of SQL Server should you use? – I put a ton of work into this one, and have already updated it since 2019 went live. This is an example of a more evergreen topic where traffic grows over time.

1. How to fix the error “String or binary data would be truncated” – If you ask Google a question, and if it thinks it can give you a snippet of text from a search result as the answer, then it displays that answer directly in your search results, like this:

You can write blog posts specifically for that kind of search optimization, so I decided to give that a shot with this blog post about the string or binary data error. In one way, the experiment worked – it got a lot of hits – but I forgot to follow up later to see if the right snippet showed up as an answer in Google’s results. It didn’t:

Ah, well. Gives me something to play around with in 2020.

How to Make SELECT COUNT(*) Queries Crazy Fast

When you run a SELECT COUNT(*), the speed of the results depends a lot on the structure & settings of the database. Let’s do an exploration of the Votes table in the Stack Overflow database, specifically the 2018-06 ~300GB version where the Votes table has 150,784,380 rows taking up ~5.3GB of space.

I’m going to measure each method 3 ways:

  • How many pages it reads (gauged with SET STATISTICS IO ON)
  • How much CPU time it uses (gauged with SET STATISTICS TIME ON)
  • How fast it runs

Don’t obsess over small differences between the operations – I’m writing this blog post fast & furious to show you the big-picture differences, and to show you how my thought process works when comparing the different operations. In your own environment, for the tables you’re trying to count and the hardware you’re using and the version you’re on and the phase of the moon, you’re going to get different results, and that’s fine. There are also other ways to measure these methods depending on your own performance requirements: memory grants, ability to run without blocking, and even the accuracy of the results under concurrency. For the sake of these tests, I’m not going to talk about isolation levels or blocking.

I’m running these tests on SQL Server 2019 (15.0.2070.41) on an 8-core VM with 64GB RAM.

1: Plain ol’ COUNT(*) with only a clustered rowstore index, compatibility level 2017 & prior

The Votes table is only ~5.3GB, so I’m able to cache the whole thing in my SQL Server. Even after the query runs the first time and the data’s cached in RAM, this still ain’t fast:

  • Pages read: 694,389
  • CPU time: 14.7 seconds of CPU time
  • Duration: 2 seconds

2: Compatibility level 2019 (batch mode on rowstore indexes)

SQL Server 2019 introduces batch mode operations on rowstore indexes, previously only available on columnstore indexes. The payoff here is pretty awesome, even though we’re still dealing with just the rowstore index:

  • Pages read: 694,379
  • CPU time: 5.2 seconds
  • Duration: 0.7 seconds

Presto change-o – CPU just instantly drops thanks to batch mode. This isn’t obvious in the execution plans until you start hovering your mouse over individual operators:

Batch mode is a great fit for a lot of reporting-style queries doing aggregates over a lot of data.

3: Add nonclustered rowstore indexes, but use 2017 & prior’s row mode

I’m going to create an index on each of the 5 columns of the Users table, and then compare their sizes with sp_BlitzIndex:

Check out the number of rows in each index versus its size. When SQL Server needs to count the number of rows in the table, it’s smart enough to look at which object is the smallest, and then use that one for the count. Indexes can have varying sizes depending on the datatypes of the contents they’re indexing, the size of each row’s contents, the number of nulls, etc:

I’ll go back to 2017 compat level (removing batch mode operations) and then run the count:

SQL Server chooses to use the BountyAmount index, one of the smaller 2GB ones:

Which pays off in reading less pages, but we’re still performing the same count of 150M rows, so the CPU time & duration don’t really change:

  • Pages read: 263,322
  • CPU time: 14.8 seconds
  • Duration: 2 seconds

If you want lower CPU time & duration, you really need to approach the count differently – and that’s where batch mode operation helps.

4: 2019’s batch mode with nonclustered rowstore indexes

So now let’s try batch mode operation with the indexes in place:

It still uses the BountyAmount index and does the same number of reads as #3, but we get the lower CPU time & duration from step #2:

  • Pages read: 694,379
  • CPU time: 4.3 seconds
  • Duration: 0.6 seconds

So far, that’s the winner. But remember that batch mode originally went live with columnstore indexes, which are awesome tools for reporting-style queries….

5: Nonclustered columnstore index with batch mode

I’m purposely running in 2017 compat mode here because I want to make it clear where the awesomeness is:

The execution plan has our fancypants new columnstore index scan operator, and all of the operators in the plan are in batch mode:

I have to change my units of measure here:

  • Pages read: 73,922
  • CPU time: 15 milliseconds
  • Duration: 21 milliseconds

Hubba hubba. Let’s put that in perspective: I know some developers who try to hit system tables in order to count rows quickly, and they can’t even generate speedy results like this.

So to make SELECT COUNT(*) queries fast, here’s what to do:

In descending order of preference & speed, with the best results first:

  1. Get on SQL Server 2017 or newer, and put a columnstore index on the table.
  2. Get on any version that supports batch mode on columnstore indexes, and put a columnstore index on the table – although your experiences are going to vary dramatically depending on the kind of query you have. To learn more about the specifics, read Niko’s series on columnstore indexes, specifically the posts with the word “batch” in the title.
  3. Get on SQL Server 2019 or newer, and put your database in compat level 150 (2019) – even with rowstore indexes, you can still cut your CPU usage dramatically thanks to batch mode on rowstore. This one’s really easy to do – it probably requires the least changes to your application and database schema – but it’s just that you won’t have the amazingly millisecond-fast responses that a columnstore index can get you.

Here are the demo scripts if you wanna play around with ’em, like adapting ’em to work on your own tables.

My Home Office Studio Setup: Winter 2019

I run a lot of online training classes, and some of you who do webcasts from home might be interested to see how all this works. I did pretty well this year, so I upgraded my home office setup to sneak some expenses in on my taxes. Here’s what I’m running at the moment:

Desktop: Apple Mac Pro (2019) – the big investment this year. 16 cores, 96GB RAM, AMD Radeon Pro Vega II 32GB, and 4TB SSD, all of which are pretty easily expandable. I took the plunge here to keep my training class workflow going as quickly as possible. This isn’t about SQL Server at all: it’s about the fact that when I do live online video classes, I need a machine fast enough to stream video & audio live, and record in 4K, and encode other recordings, all without making any fan noise whatsoever.

I tend to teach in 45-60 minute modules, and then take a 15-20 minute bio break. In each bio break, I rapidly edit the video recording of the last hour, put in chroma key effects, clean up the audio, and kick off the encoding process. The encoding can take 15-20 minutes since the source material can be 4K depending on which screen I recorded in that last module. I wanted to be able to let the encoding & uploading happen even while I was teaching the next module, but I just couldn’t do that on my MacBook Pro – the CPUs would max out.

Could I do it by cobbling together a home-built machine (or multiple machines) with water cooling and silent fans? Maybe, but then I’m spending time troubleshooting desktop problems, and they always seem to crop up right when you least want ’em, right? I wanted something bulletproof that just worked.

CPU usage while encoding videos, recording, and streaming

The Mac Pro’s excellent silent thermal management easily handles 16 physical cores. In the screenshot at right, my CPU rarely hits 50% – leaving me plenty of headroom to start experimenting with live video effects. I’ve never heard the fans, so my training class audio is great, and the streaming & recording never stutters at all despite the simultaneous encoding & uploading. (In downtown San Diego, we get 1Gbps fiber for $80/month and no caps.)

The desktop was expensive, but totally worth the investment. Now, at the end of each day’s classes, I only have to edit the last module & encode it, and I’m done for the day – rather than having to queue up a whole bunch of encodes. I had almost 100 students in last week’s Mastering Index Tuning class, and the hardware investments let me just focus on the students, not my studio gear. Worth it.

Microphone: Sennheiser AVX plugged into a FocusRite Clarett 2Pre – the Clarett is the little red box sitting atop the Mac Pro, and if you squint, you can see the Sennheiser and its little blue antenna. I wanted a wireless lavalier microphone so I could move around a lot while presenting, and could walk away from the computer during bio breaks without constantly unplugging/plugging in a microphone.

Next up, the desk:

Ring light: Neewer 18″ – casts an even, flattering, dimmable light on the face. In the Amazon pictures, it shows a phone mount, and you can certainly do that – it’s popular for things like beauty podcasters – but you can just unscrew the phone mount, and in its place, mount the…

Camera: Logitech Brio 4K – USB-C webcam that’s pretty good, although not great. Hanselman’s post about webcams shows how to use a “real” camera that outputs HDMI, but I haven’t been able to find a setup that works with GoToWebinar, which is kinda picky about the kind of camera inputs that it’ll take. I’m kinda-sorta on the lookout for a replacement to GoToWebinar in 2020, something with better GDPR compliance and the ability to use any webcam, including the Elgato CamLink. For now, this is the easiest way to get good-enough video that can be simultaneously used by both ScreenFlow (which I use for screen recording) and GoToWebinar.

Displays: ViewSonic 32″ 4K and AOC 22″ – the ViewSonic was on sale for $340 and the AOC was $99. While I might splurge on an Apple desktop, ain’t no way I was going to blow $5K on an Apple Pro Display and another $1K for the stand. I was tempted, but not that tempted.

  • When I’m lecturing with slides, my main monitor has the PowerPoint presenter view, and the students see the 1080p feed on the AOC monitor. (Any higher resolution is just a waste of bandwidth for the attendees.) I record the 1080p screen. I also put an iPad & keyboard on my desktop (not pictured) showing the Slack feed with the student questions & discussion. If I need to put a link in the Slack chat, I can do that quickly on the iPad without disrupting the screen recording.
  • When I’m doing demos, I escape out of PowerPoint, and I use a 1920×1080 RDP or VMware session on the main monitor, just under the ring light & webcam. (That way, I’m still looking at the camera when I’m working on the demos.)

Green screen: Elgato Green Screen MT – mounted to the wall behind me, and pulls down like a projector screen. In post-production, when I’m editing the recordings for Instant Replay here on the site, I can use Screenflow’s chroma key effect to make my video overlay on top of the slides and SSMS. I light the green screen with a pair of Emart LED lights, which I like because they don’t get warm.

Screen capture: Telestream ScreenFlow – it’s like Camtasia, but I found it much easier to use and more powerful. It lets me record my desktop, webcam, and audio locally while I’m broadcasting, thereby getting a much higher quality than just getting the GoToWebinar recordings (which are heavily compressed.) After recording stops, I can edit in ScreenFlow, use audio plugins like iZotope RX7, make my background disappear with chroma key effects, add callouts and transitions, and much more. The only criticism I have of ScreenFlow is really small: it only records one videocamera at a time. I’d like to step away from the computer from time to time during the webcasts, and do a fireside chat story time with the attendees. That’s doable with GoToWebinar – I can switch camera feeds – but not doable with ScreenFlow yet.

Control panel: Elgato Stream Deck XL – I bought this earlier in the year, and I still haven’t gotten around to configuring it yet, but it’s on my list for the next round of Mastering classes. The Stream Deck has 32 customizable keys that I should (in theory) be able to customize to play sound effects, put gifs into Slack, award trophies for questions, etc. I’m curious to see how it cooperates with apps while PowerPoint is running fullscreen – I may need to start running PowerPoint in a VM instead.

Voting Open for #SQLBits Sessions, and Here’s My Oddball Idea

Voting is open now for SQLBits sessions, and I have to warn you that there are a LOT of sessions to choose from. Bits doesn’t just flat out take the top vote-getting sessions – they use votes as part of the session selection factor – and I don’t want you to think I’m campaigning for votes.

In fact…it’s kinda the opposite.

See, this year, Bits suggested that we offer multi-part sessions: several sessions in a row that build on each other. A few years back, I gave a half-day 500 Level Guide to Career Internals session at the PASS Summit, and it was really fulfilling. I figured, why not try breaking that up and updating it? Here’s the new 3-part series:

They’re meant to build on top of each other, layering in more details in each subsequent session. If these get picked, I fully expect attendance to drop down through the sessions – my goal is to give people enough information to decide whether the next session is right for them. Not everyone is going to be willing to take this kind of journey, but at least attendees will feel more comfortable about making the compromises necessary to get financial independence earlier.

So if you want those sessions, great – vote for ’em. But if you don’t, take a few moments to search through the abstract list and cast votes for sessions you DO want to see, whether they’re mine or not. Picking the right material is hard – conference organizers have a hell of a job – and this is the part where you can help. I know it’s not easy – there are a lot of submissions – but you really can help.

The Two Ways to Fix Non-Sargable Queries

In the database world, when we say something isn’t sargable, what we mean is that SQL Server is having a tough time with your Search ARGuments.

What a sargable query looks like

In a perfect world, when you search for something, SQL Server can do an index seek to jump directly to the data you want, then read only the rows you want, and nothing more. As an example, say we’ve got an index on DisplayName, and we look for users with a specific display name:

Then our execution plan shows a nice, tidy index seek with one row read and one row actually produced:

What a non-sargable query looks like

Say I don’t trust the data in my database, and I ask SQL Server to strip off the leading & trailing spaces:

Now, my execution plan looks a lot worse: I’m reading a lot more rows than I’m actually producing.

The two ways to fix it

If you want to read less data, you can either:

  • Change the query to make it sargable, or
  • Pre-bake data on disk to match what you’re looking for

Changing the simple query in this case is pretty easy: we’d just remove the LTRIM/RTRIM. (I’ll talk about more complex situations at the end of the post.)

Pre-baking the data on disk can be as simple as adding a computed column, and then indexing it:

Now my execution plan is awesome again, reading just one row and producing one row:

My favorite part is that we didn’t even have to edit the query! The query still refers to LTRIM(RTRIM(DisplayName)). That good ol’ SQL Server looks at the query and says, “Ah, they’re looking for LTRIM/RTRIM DisplayName, and it just so happens I’ve already pre-baked that data on disk!” He uses the index on the trimmed column even though we didn’t ask for it.

My least favorite part is what happens if I flip the order of LTRIM/RTRIM, or if I use the new TRIM command:

SQL Server’s all, “Whoa, I’ve never seen functions like that in a WHERE clause, and I don’t have any of that data pre-baked,” so he ends up scanning the table again:

So if my users were regularly running non-sargable queries using all 3 methods (LTRIM/RTRIM, RTRIM/LTRIM, and TRIM), then I would either have to create 3 different indexes (bad idea), or else create just one of ’em, and incentivize my users to query the data in a consistent way.

To learn more about this technique and how to solve more complex non-sargable queries, check out the Mastering Index Tuning module on filtered indexes, indexed views, and computed columns.

Join Me in London for My SQLBits Pre-Con Workshop: Mastering Index Tuning

Mastering Index TuningI’m teaching an all-new hands-on workshop, Mastering Index Tuning. I’m taking my favorite, toughest modules of my 3-day online class and condensing them down into a 1-day workshop.

You’ll learn quick, easy techniques to use the right nonclustered indexes, filtered indexes, indexed views, and columnstore indexes that will work on SQL Server 2019, 2017, 2016, and even Azure SQL DB.

The most important stuff:

  • SQLBits is 31 Mar to 4 April at ExCel London.
  • The full registration is just £999 for the whole event, including pre-con workshops!
  • If you want to follow along during the workshop, you can bring a fully-charged laptop with the Stack Overflow database (any size will work) and you’ll be able to solve the same problems I’m solving onstage.
  • Bonus #1: Attend my pre-con workshop, and I’ll give you a free year of access to my Consultant Toolkit, the very same app I use to gather data from my own clients.
  • Bonus #2: After the workshop, you’ll even be able to attend a 3-hour online webcast where I’ll show you how I use the Consultant Toolkit for index tuning.

I’m really excited, and I can’t wait to share my latest class with you in London.

Registration is open now, and my pre-cons usually sell out – get your seat quickly. See you there!

Microsoft’s Guidance on How to Set MAXDOP Has Changed

For years, SQL Server’s Maximum Degree of Parallelism (MAXDOP) – the number of cores that a parallel query could use – defaulted to 0, meaning unlimited.

This led to rampant CXPACKET waits because queries went wild with parallelism. Overly wild, like me after six shots of tequila within an hour wild. (It’s not my fault: tequila seems to be my only hangover-free liquor, and when you know you’re not gonna get a hangover…well, buckle up.)

And for years, Microsoft’s knowledge base article 2806535 – the only KB article number I know by heart – basically said set MAXDOP to the number of cores in a processor, up to 8, but no higher than 8. That led to a lot of really awkward discussions around logical vs physical cores (because the KB didn’t say) and how to handle VMs, especially VMs that could move across hosts.

This year, though, Microsoft updated that post to make it a lot more detailed, including different advice for SQL 2008-2014 versus 2016 & newer. Here’s the decision grid for 2016:

This is kind of good news and bad news: the good news is that Microsoft is giving you more details, but the bad news is that your life isn’t getting easier. You still have to deal with the many ways to set & override MAXDOP, figuring out how many NUMA nodes your server has, and monitoring for changes when the sysadmins shut down the server, reconfigure the VMware hosts, and boot you up with a different sockets/cores mix.

Down the road, I look forward to the day when database administrators don’t have to care about this kind of thing because SQL Server sets it by default on startup and adapts it based on the server’s workload. Right now, it’s pretty important, as evidenced by this rather odd line in the KB article:

Each (execution plan) step will use one CPU or the number of CPUs that is specified by MAXDOP and never anything in between.

Are You Underpaid? Let’s Find Out: The Data Professional Salary Survey is Open.

It’s time for our annual salary survey to find out what data professionals make. You fill out the data, we open source the whole thing, and you can analyze the data to spot trends and do a better job of negotiating your own salary:

I make my query bucks the hard way

Take the Data Professional Salary Survey now. The survey is now closed.

The anonymous survey closes Sunday, January 5, 2020. The results will be completely open source, and shared with the community for your analysis.

Want to write up your analysis and get new readers for your blog? Check out the format of the results as they come in, and click File, Download to get them in Excel format. Get your data flows ready, and then check back here on Tuesday, January 7 for the final results. We’ll gather the list of blog posts as they come out, and then on the following Tuesday, January 15th, we’ll publish a followup post with links to yours. (It helps you get more readers.)

Thanks for your help in giving everybody in the community a better chance to talk honestly with their managers about salary.

Upsizing the SQL ConstantCare® Database Servers

If you’re looking for SQL Server news, just skip this post. I’m going to talk about general business & capacity planning type stuff here today.

Our SQL ConstantCare® customers install an app that polls diagnostic data from their SQL Servers and Azure SQL Databases, upload it to us daily, and then automatically get advice on what they should do next.

SQL ConstantCareWe use Amazon Aurora PostgreSQL as the back end, and I wrote about that design decision last year. I followed up later with a discussion of our hosting costs. Throughout 2018 and most of 2019, the production servers were:

  • Two r4.xlarges in a cluster (for high availability)
  • 2 physical cores, 4 with hyperthreading on
  • 30GB RAM

That served us really well, except for the times when Richie needed to vacuum out deleted rows. PostgreSQL keeps deleted rows inside the user database (just like SQL Server 2019’s new Accelerated Database Recovery), and the automatic vacuuming processes can’t always keep up, especially with app designs like ours. We only keep a rolling 30 days of data, and we roll through billions of rows per month in some of the tables, so there’s a lot of deleting going on.

When Richie needs to do big vacuums manually, he spins up the biggest RDS instance available: the r5.24xlarge with 96 cores, 768GB RAM, and 14,000 Mbps storage throughput. (One of the slick advantages of Aurora’s architecture is that since the compute and storage are completely separate, you can amp up to insane performance quickly without waiting for storage to catch up.) At only $14 per hour, that’s worth it to make maintenance tasks fly temporarily.

However, our workloads are growing.

We’ve been adding new features that gather more data and run more queries against it, like our index recommendations and multi-server emails. I found it incredibly hilarious that our most resource-intensive Postgres query by far was the query that analyzed your indexes and told you what to add. And you know what that query needed? Indexes. Of course.

By October, our reserved instances had expired. With Amazon, you can reserve your database server capacity for 1-3 years, even prepay in advance if you want, and get deeper discounts. In the hosting cost discussion post, I talked about how I’d decided to reserve a pair of db.r4.xlarge instances for production for 1 year – but that reservation was up. I didn’t want to make a decision back then, though, because…

We’ve been adding users and servers: the population had been fairly stable through 2018/2019 because I didn’t promote SQL ConstantCare® too hard, and like any SaaS product, we had a churn of subscribers. New people would sign up, and older users would let their subscription lapse once they’d taken care of the basic issues on their servers. However, with the annual Black Friday sale, I knew a big population jump would come – I just didn’t know if it would be big or fast.

So we decided to wait out through November to see how the sales & installations went. Now, in early December, we’re approaching 500 users sending in data on almost 4,000 servers. I expect those numbers to grow over the coming year as more Black Friday sale buyers start to leverage their newly purchased goodies. Over 1,000 folks have active subscriptions that include SQL ConstantCare as a benefit, and we’ve seen that folks tend to monitor just a server or two at first, then gradually add more of their servers over time as they see the value in the service. I wouldn’t be surprised if we were analyzing over 10,000 servers by this time next year.

So what should we use next?

Richie and I talked through our options:

Switch to Azure SQL DB or Azure PostgreSQL: I’m only mentioning this because I know you’re gonna suggest it in the comments, but no, this didn’t make any financial sense. Changing back ends or cloud providers would have been an expense with no appreciable benefit to end users.

Switch to Aurora Serverless: it’s still PostgreSQL-compatible, so no code changes required, but the Serverless version can automatically scale up and down very quickly. This would be really cool for a business like ours where the database server is idle for hours at a time, then gets hammered when lots of users send in data on a timer:

Database server CPU

Database server CPU%

That graph is from our new upsized primary – it used to be banging up against 100% for hours at a time during peak loads – but you get the idea. Serverless seems like a good idea here, but the Aurora Serverless limitations had just enough gotchas that we couldn’t quite migrate production to it yet. (Once a couple of those limitations are gone, though, we might be able to cost-effectively deploy SQL ConstantCare® in the EU. Fingers crossed.)

Stay with on-demand (not reserved) instances: this didn’t make sense because the savings are so large on reserved instances. You can easily save 35%-45% if you’re willing to lock in rates for a while:

Buy reservations for larger instances: instead of the r4.xlarge (4 cores, 30GB RAM), we could commit to larger instance sizes:

  • r5.xlarge: 4 cores, 32GB RAM = $5,540/year for 2 reserved instances (current state)
  • r5.2xlarge: 8 cores, 64GB RAM = $11,082/year
  • r5.4xlarge: 16 cores, 128GB RAM = $22,162/year
  • r5.12xlarge: 48 cores, 384GB RAM = $66,486/year

Since there’s nothing in between the 4xlarge and 12xlarge, that made my decision process pretty easy: we went for a pair of 4xlarge reservations for the next year. It lets Richie focus on building features rather than troubleshooting Postgres performance problems that neither of us are particularly good at, and it also let me add more queries into my Power BI dashboard that I use for analyzing client data quickly. That even drove improvements to the way I do my consulting process, and I’ll be updating that product early next year.

The database is still by far, far and away the most expensive thing in our entire infrastructure. It’s not even close: serverless apps are just so, so affordable to host compared to fixed assets like database servers. I love seeing the same problem that my clients see: the decision about when it makes more sense to hire a performance tuner versus just throw hardware at it. SQL ConstantCare®’s asynchronous queue-based serverless architecture definitely helps keep database costs low, but there’s going to come a time in the next couple/few years when we’ll have to bring in a serious Postgres performance tuner to help push our bill back down. (And no, I’m not gonna be the person who does it!)

What Should We Change About This Year’s Data Professional Salary Survey?

Every year, we run a Data Professional Salary Survey here. Thousands of y’all have answered – here are the past results in Excel – and it helps everybody get a better idea of how overpaid they are. (Did I word that right?)

Here are the questions I’m planning to ask this year:

  • What’s your total salary in US dollars, annual before taxes?
  • Your country
  • (Optional) Postal/zip code
  • Primary database you work with
  • Years that you’ve worked with this database
  • Other databases you worked with in 2019
  • Job type: (FTE, FTE of consulting/contracting company, independent, part time)
  • Job title
  • (Optional) Other job duties
  • Do you manage other staff
  • Years of doing this job
  • At how many companies have you held this job?
  • How many other people on your team do the same job as you?
  • How many database servers does your team work with?
  • What’s the newest version of SQL Server you have in production?
  • What’s the oldest version of SQL Server you have in production?
  • Highest level of higher eduction completed
  • If you have a college degree, is it computer-related?
  • Do you hold industry certifications?
  • How many days per week do you work from home?
  • What is the population of the largest city within 20 miles of where you work?
  • Employer sector (private, government, non-profit, etc.)
  • What are your career plans for the year 2020?
  • To which gender do you most identify?

If there’s any changes you want to make, leave a comment and we’ll talk through it.

Updated First Responder Kit and Consultant Toolkit for December 2019

It’s the hap-happiest season of all. To get the new version:

Consultant Toolkit Changes

When you have to troubleshoot deadlocks, wouldn’t it be nice to get the deadlock graphs all saved off nicely in separate XDL files that you can open with Plan Explorer? Good news! For SQL Server 2012 & newer, The Excel file now has a new Deadlock Summary and Deadlock Detail tabs that show the output of sp_BlitzLock, plus the deadlock graphs are written to separate xdl files. (This does mean there is duplication in the output: there’s a copy of the deadlock graph saved for every query in the deadlock. I can live with that for a first version, though.) Here’s what the output folder looks like:

Consultant Toolkit output folder

Other changes this month:

  • Improvement: on the HA-DR tab, the list of AG replicas is now sorted by AG name, primary first, then the sync replicas first, then by replica name.
  • Plus includes the below updates from the First Responder Kit.

sp_BlitzCache Changes

  • Improvement: @BringThePain = 1 no longer forcibly sets @Top to the max. Instead, it’s used as a confirmation switch as in the other First Responder Kit procs. For example, if you set @SortOrder = ‘all’ and @Top > 10, and you didn’t set @SkipAnalysis = 1, then we require you to set @BringThePain = 1 to make sure you understand that processing will be slow. (#2172)

sp_BlitzFirst Changes

sp_BlitzIndex Changes

  • Improvement: when you run it for table-level details, like sp_BlitzIndex @TableName = ‘Users’, the first result set with the list of existing indexes has a new last column with the drop T-SQL. Useful for rapidly building undo/redo scripts. (#2177)
  • Fix: when saving the output to table, index definitions longer than 4,000 characters were truncated. Changed output table definition to NVARCHAR(MAX) for the Drop_tsql, Create_tsql, and index_definition columns, and cried just a little. (#2183, thanks Helio Lino de Almeida Filho for the bug report.)

SqlServerVersions Changes

  • Improvement: added SQL Server 2019, updated a few versions. (#2181, thanks sm8680 for the bug report and Emanuele Meazzo for the fix.)

sp_DatabaseRestore Changes Awaiting Review

There are two pull requests with improvements, but I’ll be honest: I’m not really qualified to do code review on these because I don’t personally use sp_DatabaseRestore. (I just don’t have a use for it myself since most of my work focuses on performance tuning.) If you can give these authors a hand by testing it and making notes in the Github issues, we can get these merged into the main branch:

For Support

When you have questions about how the tools work, talk with the community in the #FirstResponderKit Slack channel. If you need a free invite, hit Be patient – it’s staffed with volunteers who have day jobs.

When you find a bug or want something changed, read the file.

When you have a question about what the scripts found, first make sure you read the “More Details” URL for any warning you find. We put a lot of work into documentation, and we wouldn’t want someone to yell at you to go read the fine manual. After that, when you’ve still got questions about how something works in SQL Server, post a question at and the community (that includes us!) will help. Include exact errors and any applicable screenshots, your SQL Server version number (including the build #), and the version of the tool you’re working with.

SQL ConstantCare® Population Report: Fall 2019

Our SQL ConstantCare® service lets users send data in about their SQL Servers and get health & performance advice via email. Users can opt in to share their anonymized data to get better recommendations overall.

With 3,568 database servers sharing data as of November 28, there’s a lot of useful information in here for the community. Understanding what servers are out there will help you put context around where your own servers rank, and helps the community better build training material that’s relevant for most users. Let’s start with a simple one:

What versions are the most popular?


Or if you’re into percentages:

A few things jump out at me right away:

  • SQL Server 2008 has more adoption than Azure SQL DB and Azure Managed Instances combined.
  • SQL Server 2019 has only 1 production instance sharing data in the population.
  • Almost 50% of the population are already on SQL Server 2016 & 2017. (That’s better than I expected!)
  • About 1 in 10 SQL Servers are no longer under support (and don’t even get me started on patching levels.)

Now, I’ll be the first to tell you that SQL ConstantCare® users simply aren’t an accurate representation of the population overall. Quest Spotlight Cloud’s metrics as of 2019/11/28 look a little different: they have 4% of the population on SQL Server 2008, but only 1% of the population on SQL Server 2017. My guess is that that page isn’t kept up to date.

I’m looking forward to trending this data over time to see the rate at which people adopt newer versions of SQL Server. I’m going to publish these quarterly. In the next couple/few, I’ll stick with this same presentation format, but once we’ve got some timeline-friendly data, I’ll look at more interesting ways to present it.

I’m not sure if people who use SQL ConstantCare® are more or less likely to jump to newer versions faster. I’d like to think that you, dear reader, are more likely to be on the cutting edge. However, I might have a couple of different kinds of readers: one conservative group that likes mainstream versions and monitoring tools, and another group that likes cutting edge stuff and doesn’t want SQL ConstantCare®.

But what about development? Are people maybe testing their apps on newer versions?

What versions are people using in development?

It’s interesting how different this mix is than the last chart! 50% of all Development Edition servers are running SQL Server 2016, and 11% are on 2017.

Twice as many people are using SQL Server 2019 in development as there are in production. (I’ll leave the math to you, dear reader.)

For the rest of the census report, I’m going to focus on just the production servers.

How much data do servers have?

I think this chart really helps to set the rest of them in perspective:

A lot of the SQL ConstantCare® population have what I would consider to be a fairly small server: 30% of them are hosting <25 GB data.

About 15% have 1TB of data or more, but even though that number sounds low, the population is still large. That still gives us about 500 servers’ worth of metadata to help guide folks who are in those larger tiers. There are a dozen instances in the population with over 10TB – good times there.

How much hardware do production servers get?

This really helps to reset folks’ expectations around what’s normal for SQL Servers out there. There are a huge, huge amount of production SQL Servers that just don’t need a lot of CPU power. Almost 70% have 1-8 cores, while only 10% of servers have more than 24 CPU cores.

Given that almost 50% of the population have 1-4 cores, that isn’t too surprising. But what if we exclude the folks using 1-4 cores?

Okay, now that’s much more like it: 68% of that population is using 64GB of RAM or more. Good to see folks investing in the resource that has an amazing impact on performance, especially given the SQL Server licensing costs involved.

What questions would you ask the data?

If there’s something you’d like to see analyzed, leave a note in the comments. To make it more likely that I’ll actually run your experiment, include why you want to know it, and even better, your guess as to what the data’s going to say. (That’ll make it a more fun game to play, heh.) I’ll take my favorites and do a followup post with those answers.