Updated First Responder Kit and Consultant Toolkit for September 2019

This month’s release has a lot of new feature goodness: the Consultant Toolkit can automatically upload the results to Amazon S3, plus sp_BlitzCache adds a new sort order to better catch unparameterized queries burning up CPU, a new warning for selects that are doing writes (other than >500MB spills, which we’ve been warning you about for years), and support for SQL Server 2019’s air_quote_actual plans.

To get the new version:

Consultant Toolkit Changes

Updated the First Responder Kit with this month’s script updates, plus:

  • Improvement: the –upload command line switch now uploads the data directly to a bucket in your Amazon S3 account. This helps you set up a scheduled task at the client’s office to send you the data weekly or daily – enabling you to keep tabs on their SQL Server’s health without having to constantly VPN in. See the S3 uploading documentation for more details.
  • Improvement: added Plans by Query Hash tab using sp_BlitzCache’s new @SortOrder = ‘query hash’ option. This is basically the same as @SortOrder = ‘cpu’, but it filters for queries with multiple plans in the cache with the same query hash.
  • Fix: on the Databases tab, on the Total line, the “# of Data Files” and “LOB Reserved GB” columns were swapped. This only affected the Total line though, not the individual databases.
  • Fix: on the Databases tab, data sizes are now populated for databases with NVARCHAR names.
  • Fix: formatting improvements.

sp_Blitz Changes

  • Improvement: added Uninstall.sql for those of you who wanna nuke it from orbit. (#2080, thanks ShawnPOWER for the request and Emanuele Meazzo for the code.)

sp_BlitzCache Changes

sp_BlitzFirst Changes

  • Fix: @CheckProcedureCache was being skipped for some time zones, especially British Summer Time users. I had to fix this before stepping foot on the Isle of Man again, naturally. (#2096, thanks PlanetMatt for the delightfully easy-to-reproduce bug report.)

sp_BlitzIndex Changes

  • Improvement: faster compression analysis on partitioned databases. (#2065, thanks Erik Darling.)

sp_DatabaseRestore Changes

  • Fix: last month’s release broke on case-sensitive systems, fixed. (#2097, thanks sm8680 for the bug report.)

For Support

When you have questions about how the tools work, talk with the community in the #FirstResponderKit Slack channel. If you need a free invite, hit SQLslack.com. Be patient – it’s staffed with volunteers who have day jobs.

When you find a bug or want something changed, read the contributing.md file.

When you have a question about what the scripts found, first make sure you read the “More Details” URL for any warning you find. We put a lot of work into documentation, and we wouldn’t want someone to yell at you to go read the fine manual. After that, when you’ve still got questions about how something works in SQL Server, post a question at DBA.StackExchange.com and the community (that includes us!) will help. Include exact errors and any applicable screenshots, your SQL Server version number (including the build #), and the version of the tool you’re working with.

DBA Training Plan 23: Congratulations! Now, It’s Your Turn.

I bet you think you suck at databases.

You’re sitting there on your computer, reading all kinds of blogs, watching presentations online, and you’re thinking, “Everybody knows more about databases than I do.”

We all feel that way. There’s even a name for it: Impostor Syndrome.

I have it too. For example, about a decade ago, when I headed to Redmond to attend Microsoft’s $20,000 Certified Master program, I kept thinking, “Everybody else there has gotta be different than me. I’m not really supposed to be there, for sure, but at least I’ll be able to spend a few weeks rubbing elbows with people who really are good with databases.” The instructors told us that the program was designed so that when Microsoft had a really ugly customer support situation, they could send out a Microsoft Certified Master, and know for sure that the problem was gonna get handled, no matter how severe it was.

When I passed all 3 exams and the lab, my opinion of myself still didn’t change. I still thought I sucked at databases.

Today, in the year 2019, I still think the same way: I don’t know enough. It’s just that now, I can say that I’ve made a lot more mistakes, and I just like to help other people avoid the same mistakes that I’ve made.

You’re just like me.
Except you need to start sharing.

If you’ve been reading my blog posts, even just the 20-some posts in my DBA Training Plan, you already know a lot more than most database administrators. Seriously.

What, you think other people with your job aren’t Googling their way through their daily work? Go read the questions at DBA.StackExchange.com or StackOverflow.com. We’re all winging this together. We’re all facing an overwhelming number of databases, applications, error messages, and email alerts. We’re all trying to find help in any way that we can get it.

You can offer some of that help.

There are so many ways you can give back to the community: blog, present, answer questions, record YouTube videos, chat on SQLhelp, or contribute improvements to open source tools.

Giving back pays you back, too.

Oh sure, there’s a fantastic feeling when you see the light go on in someone’s eyes because they suddenly grasp a new concept, or when you watch the web site visitor numbers go up. (Slowly. Very slowly.) But there’s much more than that. The rest are selfish benefits, and I’m going to tout them here in an effort to win you over into the Contributor Camp.

You get recognized when you’re in the speaker room. When you introduce yourself at events, people will recognize your name and thank you for your contributions – because the speakers needed your help, too.

People will start seeking you out. They’ll say hi to you at events, follow you on Twitter and Instagram, and get to know you as a person. You’ll build relationships that will last the rest of your life.

Companies will start seeking you out, too. I can’t emphasize enough how important this is for your career. Go read Rock Stars, Normal People, and You to learn how it turned my own job hunting around. If you’ve already read that, go re-read it. No, I’m not fishing for web site hits – I’m doing just fine, thank you, hahaha.

People will leave comments. Well, okay, this one’s kind of a mixed bag, because sometimes they thank you for all your hard work to make their life easier, but sometimes…sometimes they won’t.

Yes, people will criticize your work.
It happens to all of us.

Sometimes, they’re going to point out an error in your work. You’re going to feel like an idiot, but…you already feel like an idiot, right? The only difference is, it’s going to happen in public. But “public” is just your blog or a Stack Overflow answer or a Github issue. It’s not like it’s happening on national TV. We’re a tiny industry, and nobody’s gonna remember that one time you made a mistake – especially if you handle it by testing your work afterwards to double-check, thanking the commenter, and integrating the correction into your work.

Other times, they’re mistaken, and this is kind of a mixed bag too: sure, you can be proven correct, but you still have to go the extra mile of working with the commenter to explain yourself in more detail.

Other times, there’s no right or wrong answer – the work is a matter of opinion – and that can be kinda grueling, too. But even then, be excited and thankful because your work prompted them to get up off their rears and leave a comment.

Let’s do this.

Don’t let the prospect of comments trigger your impostor syndrome: you can do this. It’s time to decide how you’re going to give back and make a plan to pull it off. To help, here’s my PASS Summit 2016 half-day session, 500-Level Guide to Career Internals:

DBA Training Plan 22: Where to Learn More

Whew. We’ve worked through a couple dozen posts together in this DBA Training Plan, and you’re still just getting warmed up.

First, pick your specialty. Here are some of the more popular specializations in the SQL Server business:

  • Production administration – managing uptime, building new boxes, troubleshooting clusters, setting up storage, designing virtualization infrastructures
  • Performance tuning – taking queries from others and making them faster via rewriting the queries, tuning indexes, and tweaking SQL Server settings
  • Business intelligence – working with end users to get what they need from the mysterious database using SQL knowledge plus visualization tools
  • SQL development – building long, complex stored procedures and functions to transform and return data

There are no wrong answers here because all of these will have great long-term markets. You need to pick the one that calls the most to you because you’ll need to spend your spare time learning it. It’s rare to find a company willing to invest in your future skills, so your career needs to be in a specialization that you’re genuinely excited to study. After all, it’s hard to suck it up and study after hours when you don’t really love the topic.

Next, pick your favorite learning method:

Reading – books, articles, blog posts. Just be aware that the shorter the delivery method, the more you’re probably going to have to consume to get the desired effect. My favorite free reading resource that walks the line between free/short and expensive/long is SQL Server Central’s Stairways. They cover a single topic well from start to finish, and they’re arranged in single-serving chunks. It’s about as deep as you can get without paying for a book, but when you’re ready for books, check out the free book library from Red Gate.

If you want an absolute firehose of blog posts, here is a Feedly collection of the blogs I subscribe to, and an OPML file that you can import into your feed reader.

Watching videos – aside from our YouTube channel, my next favorite free video resource is SQLbits. It’s a UK conference that records videos at all of their sessions and makes them freely available to the public online. You can also search by speakers.

In-person training classes – start by finding your local chapter of PASS, the Professional Association for SQL Server. These local user groups meet monthly, and volunteer speakers talk about a topic for about an hour. Then search for upcoming SQLSaturdays – they’re free one-day conferences with multiple tracks on SQL Server. You may have to travel to get to one, but it’s still cheaper than any other one-day training class option.

When you’ve exhausted those free resources, or you need to go deeper into a topic than blog posts and community videos can take you, we’re here – that’s where our paid options come in. For example, our Senior DBA class covers advanced real-world experience on high availability and disaster recovery topics, and our SQL Server Performance Tuning class shows you how to save valuable time and make the server go faster immediately. Learn more about our training now.

DBA Training Plan 21: Building a New SQL Server

I have a secret setup checklist.

Oh sure, our free First Responder Kit has a SQL Server setup checklist that walks you through a few things you need to do before you install, plus a few things to do immediately afterwards. That’s good. It’s actually pretty darned good.

But it’s not the secret checklist.

Gather round and cover your monitor for a minute because I’m about to share the good stuff.

1. Hardware and Windows prep. Apply the latest BIOS and firmware to the hardware. Install Windows, then apply all patches. Install and configure your server vendor’s hardware monitoring, and the monitoring tools your sysadmins use. For clusters, run the Validation Wizard, make sure all tests pass, and save a copy of the validation report.

2. Smoke test it. Run CrystalDiskMark against all of the storage (local and SAN) and make sure it meets your expectations. (See my Building the Fastest Servers session for more details.) If it’s a physical server, test it by removing and/or disabling the network, storage, and power cables individually, and make sure that all of the server’s connections are fully redundant. If it’s a virtual server, do those same things to the host. Yes, your VM admins are going to be concerned – but honestly, they haven’t tested any of this stuff in the last year either. They’re just assuming it’s going to work, and you know what they say about assuming: you make an ass out of you and Uma Thurman.

3. Install SQL Server and the latest CUs. Here’s where our setup checklist comes in handy, but you’re not done. Restore your existing databases and time how long that takes. It’s your one chance to know how long a production database restore will take on the new hardware, and that helps you find out if you can meet your RPO/RTO goals. Configure your maintenance jobs – like Ola Hallengren’s scripts – and time how long they take. Try performance tuning the backups by striping them across multiple files – often splitting the backups across 4 files will make them complete 4x faster. Try performance tuning the index rebuilds by doing sorts in TempDB, and see if that helps.
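
For example, here’s a rough sketch of a striped, timed backup – the database name, paths, and file names are placeholders, so swap in your own:

    -- Back up across 4 files and print progress so you can time it:
    BACKUP DATABASE YourDatabase
        TO DISK = N'D:\Backups\YourDatabase_1.bak',
           DISK = N'D:\Backups\YourDatabase_2.bak',
           DISK = N'D:\Backups\YourDatabase_3.bak',
           DISK = N'D:\Backups\YourDatabase_4.bak'
        WITH COMPRESSION, STATS = 10;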

4. Smoke test SQL Server. Set up your high availability and disaster recovery features like log shipping or AlwaysOn Availability Groups, and fail the databases over between production and DR. Test a planned failover and document all of the steps involved along with how much time they take. Include copious screen shots. Your goal is to make a planned failover easy enough that any of your company’s admins can do it without knowing SQL Server. (Hey, the more you document, the more time you can take off for vacations.) Then try the same with an unplanned failover – yank the power cables out of the production servers (or hosts) and do a failover. This is usually a more manual process with more steps.

5. White-board out the failure possibilities. Diagram out all of the moving parts in the architecture – shared drives, cluster IPs, service accounts – and think about what would happen if each part fails. What would failure symptoms look like? What troubleshooting steps would help you discover that this was the real root cause? If you’re not sure how you’d recognize a failure of that component, now is your chance to force the failure and plan out your troubleshooting steps.

Only then do you start to go live on the new infrastructure.

This sounds like a lot of work, and it is. I’m not saying this as a consultant who wants to bill you a bunch of money – this isn’t usually the kind of work that I tackle myself. Instead, I work with you to build a list like this, and then you go off and execute the plan. You’re already comfortable with each of these tasks in theory – you just need to do them hands-on. Don’t just assume that the hardware will react a certain way when there’s a failure – go test it, find out for sure, and when disaster strikes, you’ll react with calm confidence.

DBA Training Plan 20: Planning Your Next Server

When the business decides it’s time to give up fixing the old server because it’s not fast enough or reliable enough, stop. The worst thing you can do is jump directly to the hardware vendor’s site, check a few boxes, and order a new shiny server that ends up with exactly the same problems as the last server.

1. Get the business’s RPO and RTO goals in writing. These two numbers determine whether you build a cluster, an AlwaysOn Availability Group, a log shipping backup, or a pair of tin cans connected with string. Learn what RPO and RTO are here.

2. Find the right SQL Server feature for your RPO/RTO goals. For example, if you’re not allowed to lose any data, and you need to fail over in less than 1 minute, then your options include failover clustered instances, AlwaysOn AGs with synchronous mirroring, or synchronous database mirroring. Use Page 2 of the HA/DR worksheet linked in the last post to figure that part out, or watch the HA/DR architecture modules of the Senior DBA Class.

3. Review the current server’s wait stats. To help figure out what kind of hardware you need for tomorrow, look at the pains you’re experiencing today. For example, if you’re suffering from PAGEIOLATCH% waits because you can’t cache enough data in RAM, the next server may need significantly more memory (or faster storage, or time spent tuning indexes & queries.) If you’re suffering from LCK% waits because the apps are holding locks on their side, though, you may not see relief no matter what kind of hardware you throw at it.

4. Benchmark your current hardware’s speed. I’ve written about how to check performance on a new SQL Server, but this is also really useful when you’re transitioning away from an old server. For example, take your backup & CHECKDB job runtimes, and the new server should be able to perform at least that quickly (if not more so, depending on what your bottleneck was in the above step, and which kind of bottleneck you’re trying to fix with the new server.)

5. Get the server’s proposed expiration date.  Servers are like milk cartons: both need an expiration date printed clearly on the outside. Ask the business users, “How long does this server need to last?” They usually say, “Forever,” and start laughing, but once the laughing stops, here’s how to handle the discussion.

6. Ask the developers, “What features are we adding during that time?” I don’t expect them to have specific line-by-line answers about which features they’re adding over the next 3 years, but I do expect them to have rough architecture ideas. Are we going to add some kind of big data analysis projects? Machine learning in the database? R, Python, or Java running inside it? Storing new kinds of data, like tracking every web site click?

7. Ask the business, “How much will load grow during that time?” Are we expecting to acquire our next several larger competitors, or maybe conquer sales in a new country? Or are we just expecting to tread water and only deal with incremental growth?

For all of the above questions, the more specific the answers are, the more precise your server sizing can be.

My new hardware

In reality, they’re very rarely specific.

And that’s fine!

But it just means that your server design needs to be more flexible: we need to design something that we expect will change rapidly over time as we learn more about the answers above. That’s one of the reasons why I really love building new environments as Availability Groups up in the cloud: it’s so much easier to add in new, more powerful replicas, join them to the cluster, and then get rid of the old ones. Your server design can leverage the flexibility of the cloud to handle new features or workloads, and you can throw cloud at performance issues.

How to Set Up Your Own Mastering Class VM

When you buy one of my Live Class Season Passes, you can save a lot of money by building your own VM to follow along with the labs. Or, you might just wanna re-run the labs later to see if you can do a better job, or if you’re still mastering a concept.

You’re going to need:

  • A server
  • The Stack Overflow database
  • The indexes & scripts set up
  • Then ideally, back the database up, and set up an Agent job to rapidly restore the database between labs
  • SQLQueryStress

Let’s talk through each piece.

Building a SQL Server

To get a rough idea of how much hardware to use, let’s look at the AWS EC2 VMs I give the class students:

  • Mastering Index Tuning & Mastering Query Tuning use an i3.xlarge: 4 cores, 30GB RAM, local NVMe SSD storage
  • Mastering Server Tuning uses an i3.2xlarge: 8 cores, 61GB RAM, local NVMe storage – we use more cores & RAM here because we run heavier stress tests, since we’re doing server-level discussions

Each server needs at least 500GB local SSD space to deal with the 350GB Stack Overflow database, index creation space, TempDB, backups, and restores. It needs to be fast storage, too: for a rough idea, you’ll need to be able to restore the ~350GB backup in 15 minutes or less. (I’m just using restore speed as one example here – you’ll be doing lots of storage-intensive tasks, like creating indexes on large tables.)

If you want to check your storage speed before downloading the Stack database, run a test with CrystalDiskMark. You want at least 1,000 MB/sec for sequential reads and writes. As an example, my laptop’s SSD speeds are shown at right, and it would be fine for the labs.

Can you get by with less hardware? Sure, but of course your performance will be different than what we’re seeing in class. I actually don’t think that’s a bad thing – every server out there is different – but just be aware that it’ll pose additional challenges for you if you try something like 4 cores, 8GB RAM. Your index creations will be terribly slow, and you probably won’t be able to keep up in class.

Can you use more hardware? Yep, and as long as you’re not caching the entire database in RAM, you’ll probably still have the same basic challenges that we tackle in all of the classes.

After building the server, install:

Getting the Stack Overflow database

Download the 2018-06 version of the Stack Overflow database: 38GB torrent (magnet.) If you’re not familiar with BitTorrent, here are detailed instructions, but just make sure to get the Stack version mentioned earlier in this paragraph – it’s important so you reproduce demos the same way. Your company may block BitTorrent – many do – but it’s the most efficient way to get a >300GB database out there. There aren’t a lot of file hosts willing to share a file that big, heh. You may need to download it from home.

When you extract it, it’ll expand to a ~304GB SQL Server database that you can attach. There are other smaller versions of the database too, but be aware that you’ll get radically different query plans, and some of the demos won’t work the same way since we’re querying for specific date ranges.

It ships in SQL 2008 compatible format, but you usually wanna learn on the most recent Cardinality Estimator (CE). To set that, in SSMS, right-click on the database, go into Options, and change Compatibility Level to the most current version you’re on – but not SQL Server 2019, which behaves quite differently – hold off on that compat level until SQL Server 2019 is released. I’m updating the classes with more modules for SQL Server 2019’s behavior, but if you try that now, you’re going to get wildly unpredictable behavior compared to the rest of your classmates.
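
If you’d rather script that than click through SSMS, it’s one line – a small sketch, assuming 140 (SQL Server 2017) is the newest pre-2019 level available on your build:

    -- Bump the database up from the SQL 2008 compat level it ships in:
    ALTER DATABASE StackOverflow SET COMPATIBILITY_LEVEL = 140;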

Setting Up the Stored Procs and Indexes

We use the First Responder Kit scripts for performance analysis, and we use the below procs to help the demos along. Create these in the Stack Overflow database:

Don’t install that in production, bucko.

Next, run the index creation scripts below. It’s going to take a while, depending on how fast your lab’s storage is. After creating these indexes, we’ll take a backup of the database so that you can restore it each time to set back to a known starting point – rather than creating the indexes every time you restart labs, which can take tens of minutes.

Back It Up and Set Up a Restore Job

During class, between each lab, the students restore their database to this starting point. For sheer performance, you want to run the backup across 4 files – SQL Server backups & restores can actually go faster when you stripe them across more files, even when writing to the same storage.

Then, set up an Agent job to restore them. Here’s the script I use – just modify the file path locations for your backups. Note that mine also has a step to set Cost Threshold for Parallelism and MAXDOP back to normal defaults for the VM we’re working with in the cloud.

In my example below, I’m doing my restores from E:\MSSQL\BACKUP – you may need to modify your backup path and file names.
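
That script isn’t reproduced here, but the heart of it is a striped restore along these lines – the file names are assumptions, so match them to whatever your backup step actually produced:

    -- Restore over the existing database from the 4 striped backup files:
    RESTORE DATABASE StackOverflow
        FROM DISK = N'E:\MSSQL\BACKUP\StackOverflow_1.bak',
             DISK = N'E:\MSSQL\BACKUP\StackOverflow_2.bak',
             DISK = N'E:\MSSQL\BACKUP\StackOverflow_3.bak',
             DISK = N'E:\MSSQL\BACKUP\StackOverflow_4.bak'
        WITH REPLACE, STATS = 10;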

Test your restore job to get an idea of how long it’ll take – if it takes more than 15 minutes, you probably want to use a faster VM to be able to keep up in class. If the restore takes, say, 20-30 minutes, you’re going to have a tough time keeping up when you’re doing storage-intensive things like creating indexes.

SQLQueryStress for load testing

SQLQueryStress is an easy, free, open source load testing utility. Download the latest version (zip), and extract it to a folder. You don’t have to run a setup or anything – it’s just a single .exe file that you can run whenever you want to do a load test.

Then, download my load test files and extract those to the same folder. Your folder will end up looking like this:

SQLQueryStress folder

And that’s it! See you in class, and hope you have as much fun as Vadim had.

#TSQL2sday: I Just Wish SQL Server Could Restore a Single Object from Backup.


For this month’s T-SQL Tuesday, Kevin Chant asked us to dream up our fantasy SQL Server feature. I’m sure people are gonna ask for flying cars and gene splicing, and that’s awesome. I’m excited to hear the kinds of things they dream up.

Me? I’ve got a really simple ask.

(When you’re talking to Microsoft people, you have to use “ask” as a noun.)

Today, if you have problems with a single file in your multi-file database, you can do an online file restore like this:
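
Here’s a simplified sketch of what that looks like – the database, logical file, and backup names are made up for illustration:

    -- Restore just the one damaged data file, leaving the rest of the database alone:
    RESTORE DATABASE MyDatabase
        FILE = N'MyDatabase_Data2'
        FROM DISK = N'D:\Backups\MyDatabase_Full.bak'
        WITH NORECOVERY;
    -- Apply the transaction log backups taken since then...
    RESTORE LOG MyDatabase
        FROM DISK = N'D:\Backups\MyDatabase_Log1.trn'
        WITH NORECOVERY;
    -- ...and finally bring it online:
    RESTORE DATABASE MyDatabase WITH RECOVERY;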

That one file is restored from backup, and you can update it by applying subsequent transaction logs, and then eventually bring it online. You can restore individual (or multiple) pages, too. But the thing is – I rarely need to restore a file, or specific pages.

So my ask is:
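
Something along these lines – and to be clear, this syntax is entirely imaginary; no version of SQL Server supports it:

    -- Hypothetical syntax: pull one object straight out of an existing backup.
    RESTORE OBJECT dbo.Users
        FROM DISK = N'D:\Backups\MyDatabase_Full.bak'
        WITH REPLACE;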

I just need a single object, like a table someone dropped, or a stored procedure from a mucked-up deployment that somebody didn’t really save in source control the way they were supposed to. I know, there are gonna be foreign key issues, and I’m fine with the keys being marked as untrusted at that point.

Like it too? Vote for Gonzalo Bissio’s feature request.

What’s It Like to Have a Live Class Season Pass?

I don’t usually post product reviews here on the blog, but Vadim Kulikov (aka BlondeDBA in Slack) recently finished all of the Mastering classes, and he took the time to write a really detailed review. Here it is, unedited:


I heard some folks refer to Brent’s training as “Brent Ozar’s Full Experience”. I really like this description, as it accurately describes the training. For some of us who have been in the DBA industry for some time, training is a mandatory part of the job if you want your skills to remain relevant. However, I noticed that most other trainings that I have taken have one common issue, they help to introduce the subject but you don’t retain much due to the lack of hands-on labs and /or long lecturing often reciting from books-online that provides little practical value. In my opinion, Microsoft is not always the best source of information about SQL Server.

One main thing that sets Brent’s training apart from other is the tremendous practical value. The training is very hands-on. The tools that his team developed are the actual tools Brent uses himself in his consulting business. The class introduces the tools and teaches how to properly use them. You learn how to identify bottleneck(s) and how to develop an approach to come up with the most effective solution for the real-world performance tuning scenarios. I purchased the annual pass and went through all the classes. Below is the list of my favorite things about Brent Ozar’s online training:

1. The class provides hands-on material and demonstrates a comprehensive approach to troubleshooting performance problems and identifying root cause. You will learn how to efficiently (with set time limit) improve the situation or resolve discovered bottlenecks completely.

2. In addition to being technically skilled, Brent is a fantastic instructor. Either its speaking with “Clippy” voice or “taking on a personality” of the SQL Server Engine Optimizer, sharing war stories from his consulting gigs and adding constant humor (at times self-deprecating) keeps your attention and remarkably doesn’t get old. I have gone through close to a hundred of hours of Brent’s online training (both free and paid) and often run into the same jokes over and over again and still find them hilarious, which adds levity (Brent does get into complex performance tuning scenarios that can be intimidating) and also helps to remember the material. Brent is very creative with his presentation, often uses analogy with car performance or hospital triage situations which I found to be helpful to understand the subject and retain it for a long time.

3. The classes are well structured and tend to build on previously presented material. So, its recommended to take the fundamental classes first to feel more comfortable with more advanced topics. I would also recommend to go through all the specified pre-requisite training to keep up with the class. While lectures have perfect tempo (not too fast not too slow) but the sheer amount of information presented might be overwhelming for the 1st time. I take notes during the lecture and save the Slack chat as Brent addresses other participants’ questions in real time.

4. The ability to watch the recorded class before and after (with purchased annual pass) is by far the best option which sets this training apart from every other I have seen. I replaced my “Audible” with “Brent Ozar” subscription, so I can play the courses on my way to work.

5. The Labs are well-thought-out and could be quite challenging. They reinforce the presented material and also demonstrate the effective use of the First Responder Kit tools. You get to use beefed up VMs with large amount of cores and RAM to bring the lab environment very close to the actual production systems you encounter in the field. I love the option to download the different sizes of StackOverflow database to fit your local environment constraints, so you can practice with the labs at your own pace later on your own server.

Every little aspect of this training appears to be carefully considered and planned. Brent started to record with a green screen behind him, so you don’t have to manage multiple screens when viewing the recorded class later. Little detail but makes a big difference and contributes to a pleasant experience.

6. I purchased the annual pass, so could take all the classes I want multiple times. I love that the class starts so early (6am PST), as it give me an opportunity to do at least the first part of my training undisturbed. Most of us have full time DBA jobs and training is often considered a luxury, that you are supposed to arrange on your own time, so “it doesn’t interfere with your DBA responsibilities.” Re-taking classes helps to capture the parts you missed the 1st time due to the pressures of your job.

7. Another valuable aspect of Brent Ozar training is that you get to ask questions, at times outside the context of the actual class. The feedback you get – provided you can concisely articulate your issue – is very helpful especially if you can’t afford to hire Brent as a consultant.

8. There are many Senior level DBAs who enter the training and contribute to the class by asking interesting questions or share their own experience deploying a particular solution. You can benefit from someone else’s success or failure. It also helps to see how your skills measure up to other DBAs.

9. Brent is constantly improving his content factoring-in students’ feedback. So, you benefit by taking the same class multiple times. I would, however, wish that the classes’ schedule could be a little bit more aligned with the progression of difficulty of a particular class, so you start with basic like “Fundamentals of Index Tuning ”, then progress to “Mastering Index Tuning” for example, and then followed by the most challenging “Mastering Server Tuning”

I highly recommend the training. You will be amazed how much you didn’t know. It’s kind of embarrassing…


Wow! Thanks, Vadim – I’m honored that you took the time to write this out. Wanna find out what it’s like for yourself? Check out the Live Class Season Pass, and you can also save a lot of money by building your own lab server. Enjoy!

DBA Training Plan 19: The L-Word, Licensing

You’ve gotten through a lot of this training. You’ve tried fixing the problem with indexes, query tuning, and judicious application of various configuration switches. You’ve been watching webcasts and using free scripts, but…

Things just aren’t getting better.

Before you start browsing your cloud vendor’s sizing price list or your favorite server vendor’s build-and-price site, we need to talk about the L-word: licensing. It’s a really complex subject, but we’re going to try to boil it down as simply as possible.

You pay by the CPU core. Even if you’re deploying an older version of SQL Server, you still have to buy and pay maintenance on the current version of SQL Server. That means you’re using core-based licensing whether you like it or not. In the good old days, we’d buy the biggest CPUs possible and clap our hands when the server sat around at 0% CPU, but those days are over. These days we want as few cores as possible – but cores that go as fast as possible.

SQL Server Standard Edition is about $2k USD per core, and it’s limited to 24 cores and 128GB of memory. Even if you deploy just 8 cores (like 2 quad-core processors), that’s $16k of licensing. 64GB of memory is really cheap – you’d be crazy to deploy SQL Server on a physical box with anything less than 64GB of memory. Even on virtual machines, keep things in perspective – 128GB RAM is way, way cheaper than a core of licensing when you’re licensing VMs individually with Standard Edition.

SQL Server Enterprise Edition is about $7k USD per core, but much like Brent Ozar Unlimited, it has no limits. (Okay, actually, both us and Enterprise Edition have some limits.) Enterprise Edition adds a lot of the cool features you really want, like online index creation & rebuilds, readable secondaries in AlwaysOn Availability Groups, and Transparent Data Encryption. However, at $7k per core, you need to be really, really careful about how many cores you buy – most of the time, folks buy way too many cores and don’t spend enough on memory or storage throughput.

Virtualization – if you have less than 4-5 VMs, you’re usually better off buying Standard Edition at the guest level. Once you have more than 4-5 VMs, buy two (or more) virtualization hosts dedicated just to databases. License Enterprise Edition at the host level, and you can run as many guests as you want on those hosts. However, you need to buy Software Assurance (it’s like maintenance) so that you can use License Migration and move your VMs around from host to host. Useful in case one of your hosts fails.

Containers are just like VMs, which makes the whole container deployment thing a really awkward discussion. Microsoft is all, “Just use a bunch of containers!” but you either have to track their licensing individually with Standard Edition, or take every container host where your containers run, and license ’em all with Enterprise Edition. Come true-up and auditing time, you’ll have a lot of interesting questions around, “So, how many SQL Server containers have you been running, and on which hosts?” The easiest answer is to have a dedicated Kubernetes cluster where your SQL Server containers run, and license those hosts with the $7K/core SQL Server Enterprise Edition. When I say the “easiest answer,” I’m referring to your job and mine. The poor finance people, on the other hand, those folks are screwed when they have to write that check. Because of that, I have a tough time selling people on containers for production servers – unless they’re getting free site licenses from Microsoft in exchange for publicizing containers. Those folks seem to be all over it. Go figure.

SQL Server Developer Edition has the same capabilities as Enterprise Edition, but it’s not to be used for production. It’s free, so the rules for virtualization are a little different here. You don’t want to intermingle your Developer Edition VMs on hosts that you’re licensing with SQL Server Enterprise Edition because you’d just be wasting licensing fees. Put these Developer Edition VMs in with the rest of your servers.

Everything’s negotiable if you’re big. If you’re a service provider or a large enterprise, you may have different types of licensing agreements like SPLA or EA. Have their people take your people out to golf (I hear that’s how this stuff works), and people get drunk, and contracts get signed.

If the licensing costs scare you, that’s where cloud alternatives to SQL Server come in. All the cloud vendors will rent you licensing by the hour, included with your VM pricing. It’s not to say that’s cheaper – over the long term, if you have predictable needs, it’s not – but it can let your accounting team classify the expense differently, and sometimes that’s a good thing.

For the gritty details, check out Microsoft’s SQL Server Licensing Guide PDF. It’s not short, but it’s definitive, and it’s pretty easy to read. If you’ve got any interpretation questions, you have to ask your Microsoft contacts, not folks in the community, and not salespeople. Nobody else can give you answers that will stand up when the auditor comes knocking on your door.

Comments are disabled on this post. That’s because in the past, I’ve learned that anytime I post about licensing, I get dozens of licensing questions in the comments. For licensing questions, contact your Microsoft sales representative. I’m not doing their work for free, ha ha ho ho.

DBA Training Plan 18: Managing Concurrency

Locking isn’t a problem – queries are gonna lock data. As long as nobody else wants the lock, you can take your sweet time locking it, and it won’t show up as a problem. (That’s one of the many reasons queries work so well in development, and then hit performance issues in production.)

Blocking, now that’s where the problem shows up. Blocking means that someone is holding a lock, and they’re blocking someone else from getting the data they need. Even worse is “deadlock” – that term strikes fear in the hearts of database administrators. Deadlock issues strike without warning, kill transactions dead, and slip away without a trace. Because they’re so transient, they’re hard to troubleshoot.

Good news, though – this stuff really isn’t hard to track down once you use a methodology. Here’s mine: first, identify if blocking is slowing things down.

  1. Run sp_BlitzFirst @SinceStartup = 1 to see if LCK% waits are a significant problem for you, or if your server is bored. Note the number of lock waits – if you’ve only had a few dozen over the course of a day, then it might have just been a few queries held up when someone was trying to create or rebuild an index. If you see hundreds (or heaven forbid, millions or billions) of these waits per day, especially with long (seconds or more) average wait times, then we’ve probably got a problem.
  2. Run sp_WhoIsActive and see if LCK% waits are showing up for queries running right now. This will help you identify which queries are involved, and what databases they’re running in.
  3. Finally, run sp_BlitzIndex @GetAllDatabases = 1 and look for “Aggressive Indexes” warnings pointing to the tables that are involved in the blocking. These indexes/tables aren’t necessarily the problem – they’re just involved with the blocking, just like you are. We wouldn’t fire you just because you’re involved with the blocking. (Well, maybe we would.) Interpreting these Aggressive Indexes warnings can be a little tricky (just like interpreting those Human Resources warning emails you keep getting.)
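
Here’s what those three checks look like as copy-paste commands. (sp_WhoIsActive is Adam Machanic’s free procedure – it’s installed separately from the First Responder Kit.)

    -- 1. Are lock waits a significant problem since startup?
    EXEC sp_BlitzFirst @SinceStartup = 1;
    -- 2. Which queries are waiting on locks right now, and in which databases?
    EXEC sp_WhoIsActive;
    -- 3. Which tables & indexes are involved in the blocking?
    EXEC sp_BlitzIndex @GetAllDatabases = 1;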

Mitch Bottell is preparing to take an exclusive lock on some BBQ at SQL Saturday Sacramento.

Once you’ve identified how big the problem is, which queries are involved, and what tables & indexes they’re trying to lock onto, then we can start fixing the root cause of the problem. There are three ways to do it, from easiest to most work required:

  1. Tune your indexes – get rid of the indexes you don’t need, and add the right indexes to help queries jump in and out quickly without holding huge locks to do table scans.
  2. Tune your queries – get rid of implicit conversions, functions in the WHERE clause, and other SARGability foes that cause your tables to scan an entire index rather than doing a seek. The less data we can read & lock, the more predictable and small our locks become.
  3. Change your isolation level – let queries cooperate with RCSI or SNAPSHOT isolation instead of waiting in line for locks. That last one is the easiest in theory, but just make sure you read that entire post (and the linked ones in it) to understand what you’re getting into, since it can backfire.
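
For reference, turning on RCSI is a one-liner – but again, read up on the tradeoffs (it uses the version store in TempDB, and it can surprise apps that rely on blocking behavior) before flipping it in production:

    -- Readers see the last committed version of rows instead of waiting on writers' locks:
    ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON
        WITH ROLLBACK IMMEDIATE;  -- boots out other sessions so the change can take effect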

DBA Training Plan 17: Should You Partition Your Tables?

In the last episode, I talked about knowing when it’s time to scale out: identifying when our data is getting to be so large that we have to split it across multiple servers, and I explained why that is so challenging. But what about table partitioning – SQL Server’s ability to break up a single table into smaller ones on the same server?

Great volumes have been written about table partitioning. It’s a complex feature and you can read for days to just understand how you might apply it. But will it improve performance for you? Table partitioning produces great benefits for some applications, but causes giant headaches for others.

What Table Partitioning Does

Table partitioning allows tables or indexes to be stored in multiple physical sections: a partitioned index is like one large index made up of multiple little indexes. Each chunk, or partition, has the same columns – just a different range of rows. In theory, SQL Server handles this transparently for you: a partitioned heap, clustered index, or non-clustered index can be referenced as a single structure although it’s stored in independent physical partitions. In other words, in theory you don’t need to change any code in the calling applications. (Is this true in reality? More on this later.)

Let’s say we want to partition the Stack Overflow Posts table – that’s where the questions and answers are stored. Every question & answer is considered a Post. I’m going to use the 50GB 2013 version of the Stack Overflow database because the Posts table is starting to grow: it’s 17,142,169 rows taking up 37GB of space on disk – that’s most of the database, so breaking it up will be a performance boost, right?

Let’s say we want to partition them by CreationDate, and we want to store each year’s data in a separate file. We’ll need to start by creating filegroups for each year:
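
The exact script isn’t reproduced here, but the filegroup setup looks roughly like this, repeated once per year – the filegroup names, file path, and sizes are just examples:

    -- One filegroup plus one data file per year of Posts data:
    ALTER DATABASE StackOverflow ADD FILEGROUP Posts2008;
    ALTER DATABASE StackOverflow ADD FILE
        (NAME = N'Posts2008', FILENAME = N'M:\MSSQL\DATA\Posts2008.ndf', SIZE = 5GB)
        TO FILEGROUP Posts2008;
    -- ...repeat for 2009 through 2013.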

You’ll notice in this example that I’m using the same drive (M) for all of the files – in partitioning, though, it’s common to put the different years of data on different drives, and even stripe the data across multiple files. Let’s keep it simple for this demo though.

Next, I need to create a partition function and a partition scheme to tell SQL Server which data belongs on which drive/file:
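
Again, a simplified sketch rather than the original script – one boundary per year, with each range mapped to its matching filegroup:

    CREATE PARTITION FUNCTION pfPostsByCreationDate (DATETIME)
        AS RANGE RIGHT FOR VALUES
        ('2009-01-01', '2010-01-01', '2011-01-01', '2012-01-01', '2013-01-01');

    CREATE PARTITION SCHEME psPostsByCreationDate
        AS PARTITION pfPostsByCreationDate
        TO (Posts2008, Posts2009, Posts2010, Posts2011, Posts2012, Posts2013);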

Partitioning pros will note that I’m violating all kinds of best practices. Partitioning pros should also note that I’m trying to teach the basics of partitioning in a really short amount of space. Partitioning pros should go write their own blog posts.

Finally, we move the table over onto the new storage structure:
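
Roughly speaking, it goes like this – the constraint and index names are assumptions, so check yours with sp_help first:

    -- Drop the existing clustered primary key...
    ALTER TABLE dbo.Posts DROP CONSTRAINT PK_Posts__Id;
    -- ...rebuild the clustered index on the partition scheme, keyed on CreationDate...
    CREATE CLUSTERED INDEX CIX_Posts_CreationDate
        ON dbo.Posts (CreationDate)
        ON psPostsByCreationDate (CreationDate);
    -- ...and put the primary key back as a nonclustered index. A unique index that
    -- doesn't include the partitioning column can't be aligned with the scheme,
    -- so this one lands back on the PRIMARY filegroup:
    ALTER TABLE dbo.Posts ADD CONSTRAINT PK_Posts__Id
        PRIMARY KEY NONCLUSTERED (Id) ON [PRIMARY];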

To pull that off, I had to drop the primary key.

This table – like most tables – has a clustered primary key, and SQL Server doesn’t let me just move the primary key over to the new storage structure. Instead, I have to:

  • Remove any foreign key relationships that reference this table
  • Drop the primary key
  • Create a new clustered index on the partition scheme (which splits the data across multiple files)
  • Create a primary key again
  • Recreate the foreign key relationships

For more details about why I had to do all that, see this excellent Stack answer by Remus Rusanu, and make sure to read the linked Microsoft article, Special Guidelines for Partitioned Indexes.

See, partitioned indexes come with a whole lot of gotchas:

<sigh>

Partitioned tables work, but…they’re a lot of hard work.

So should you partition your tables?

If you’re dealing with a 1TB+ data warehouse that can do sliding window loads, possibly.

If you’re dealing with a smaller one, or you can’t do sliding window loads, partitioned views are usually a better fit.

If you’re doing transactional workloads, and you can’t guarantee that the partitioning key will be in most of your WHERE clauses, then no, partitioning isn’t a good fit. Plain ol’ index and query tuning will be a better match.

DBA Training Plan 16: Is It Time to Scale Out?

In our last episode, we found out what your server was waiting on. In the closing remarks, I mentioned that you should find out which queries are causing the problem, and focus your mitigation efforts there. You might have circled back to the episode where we covered sp_BlitzCache, used a @SortOrder parameter that lines up with your top wait type, and…

Now you’re frustrated.

Because you’re looking at all these bad queries and thinking, “Isn’t there an easier way? Can’t I just throw hardware at this?”

Scaling out reads with replicas:
splitting queries between servers

When you want to copy the exact same production data to a second server to offload reporting queries, some of the costs are immediately obvious:

  • Hardware and storage – even if you’re running it in a virtual machine, you need to account for the costs of, say, 4 cores and 32GB RAM. Not only will you need storage for the databases, but you’ll also need to decide whether this server gets backed up, and copied to a disaster recovery data center.
  • Software licensing – Standard Edition is ~$2k per core, and Enterprise Edition is ~$7k per core. Toss in Windows (especially now that it’s licensed per-core), your management/backup/antivirus tools, and your monitoring software.
  • Project planning – you’ll need to design how to get the data from production to the reporting server, like with Always On Availability Groups, log shipping, or transactional replication.
  • App modifications – the app running reporting queries will need a new connection string. Even with Always On Availability Groups, reads aren’t automatically offloaded to readable replicas – you have to use the connection string parameter ApplicationIntent = ReadOnly to tell SQL Server that you promise not to try to write anything. If you have a single app that does both reads and writes, and you only want to offload some of the queries, you’ll need to go through the code to switch those queries over to the new connection string.

But some of the costs come as cruel surprises:

  • Adding a troubleshooting process – sooner or later, the data replication process will break. Depending on the method (AGs, log shipping, replication) and failure type, it’ll fail in different ways – maybe all of the data is old, maybe just some of it is, or maybe the reports aren’t accessible at all. You’ll want to list out the failure methods and explain what symptoms will look like. This helps business users recognize when their reports are wrong, and react appropriately. If you don’t do this step, then after the first failure, people are just always going to expect that there’s a bug in the report data.
  • Prepare for failure – for each of those failure methods, decide how you’re going to react. For example, if AG replication breaks and reports are out of date, will you point reports at the primary until the problem is resolved, or will users just have to deal with unavailable reports while you troubleshoot or resync the replicas? If you don’t do this step, then you’re going to be winging it every time, and you’ll look unprepared while reports are wrong or down.
  • Set realistic expectations for RPO and RTO – based on your process and preparation, make sure the business users understand how long their reports will be down when things break.
  • Measure the overhead of replication – AGs and transactional replication can add performance slowdowns beyond what the reports used to cost. For example, if you were only running a few reports an hour, and only hitting a subset of the data, then suddenly replicating every individual delete/update/insert operation can have a huge overhead.
  • Add monitoring – you need to start monitoring how far behind the reporting server is, and how performance is doing on both. Performance troubleshooting becomes a lot harder, too – for example, when you’re doing index tuning, you have to combine data across both the primary and the reporting servers in order to find the right mix of indexes across the board.

And at the end of all this, all you’ve got is a way to split out the read-only queries. Those might not even be your biggest problem – you might need to split up writes between servers, and that gets REALLY hard.

Scaling out writes with sharding:
splitting data between servers

Sharding is the technique of splitting your data across multiple servers, called shards. You have the same exact database structure on each server, but you split up customers between the shards – for example, maybe the US customers are on one shard, and the European customers are on another shard.

Sharding gets trickier when you have to keep data in sync between shards. For example, you might run a web site, and you want to keep all of the product and pricing data exactly the same across shards. You may also need to combine data back from all of the shards into one, like for reporting purposes.

Other techniques for splitting load across servers involve replication with multiple masters – multiple servers that can accept writes at any given time. If you’re interested in that topic, check out Many Masters, One Truth. (Spoiler: it’s a lot harder than it sounds.)

But the real bottom line with sharding, and the reason implementation is so hard, is that your application has to know which shard to query. If you’re connecting all of the apps to a single SQL Server and routing requests from there, you’re right back where you started: bottlenecked on a single SQL Server.

This whole post was a trick.

None of the scale methods are really easy. Whether you’re convincing the boss to buy more memory, taking an outage to upgrade SQL Server, or rewriting the app’s back end to accommodate sharding, you’re going to be putting in serious work.

That’s why I put so much emphasis on tuning indexes and queries. It’s hard to make your server 100x faster – but it’s really easy to make your database design and queries suddenly 100x more efficient. Fixing queries and tables and indexes is indeed hard work – but it’s totally worth it. The better you get at these techniques, the farther you can go in your career.

DBA Training Plan 15: What’s Your SQL Server Waiting On?

You’re here, dear reader, because you weren’t “classically trained” as a database administrator. You didn’t graduate from the University of Sciencing Computerses with a Master’s of Transactional Processing.

You probably got your start as a developer or sysadmin, and gradually fumbled your way here. You’re used to monitoring stuff from the OUTSIDE using things like perfmon counters or CPU graphs.

SQL Server has a way, way, way better tool: wait stats. Whenever your queries need to wait on something – storage, locks, memory, whatever – SQL Server tracks what it’s waiting on, and for how long. It’s like you’re seeing INSIDE the server, with all kinds of possible debug points being tracked.

All you have to do is ask SQL Server, “What have you been waiting on?” by running this:
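
The original query isn’t included here, but a bare-bones version of it looks like this:

    -- Cumulative waits since the last SQL Server restart, worst first:
    SELECT wait_type, wait_time_ms, waiting_tasks_count, signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;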

And presto, you get back something that looks like this:

sys.dm_os_wait_stats

And, uh, it’s…confusing. It’s hard to understand, filled with misleading harmless stuff, and it includes all wait time since startup – which includes overnight jobs, backups, checking for corruption, etc. So I wrote something better.

The better way: sp_BlitzFirst

Go get sp_BlitzFirst from our free First Responder Kit, and run it like this:
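
The exact parameters aren’t shown here, but something like this – expert mode with the default 5-second sample – produces the wait stats section below:

    EXEC sp_BlitzFirst @ExpertMode = 1;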

Here’s what it does:

  1. Takes a snapshot of a bunch of system data (including sys.dm_os_wait_stats)
  2. Waits 5 seconds
  3. Takes another snapshot
  4. Compares the difference between those 2 snapshots to tell you what’s happening on the server (without starting anything expensive and slow like a trace or XE session)

Here’s what the output looks like – and focus in on the area I’m pointing at with the spiffy red arrow:

sp_BlitzFirst wait stats

The “WAIT STATS” section lists what your server was waiting on during that 5 seconds:

  • Wait_type: cryptic name from Microsoft
  • wait_category: more human-friendly description
  • Wait Time: how long we spent waiting on that thing. Generally, the top couple of things should get most of your focus, and this is what I’m sorting the results by.
  • Number of Waits, Avg ms Per Wait: helps you understand if this is something that’s not happening often (but sucks when it happens), or if it’s a death-by-a-thousand-cuts scenario
  • URL: a link to SQLskills’ wait types library so you can learn more about each particular wait type

Now, the trick you just learned only shows you performance right now. Your next question is going to be, “How can I track this stuff over time?” You can set up an Agent job to run sp_BlitzFirst every 15 minutes and log this data to a table so you can trend it over time:
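
The job step boils down to a call like this – the DBAtools database name is just an example, and sp_BlitzFirst will create the output table for you on the first run:

    -- Run this from an Agent job every 15 minutes:
    EXEC sp_BlitzFirst
        @OutputDatabaseName = N'DBAtools',
        @OutputSchemaName   = N'dbo',
        @OutputTableName    = N'BlitzFirst';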

You can learn more about doing that, and how to use Power BI to read the data, in my post on the free Power BI Dashboard for DBAs.

The modern method of SQL Server performance tuning

I can’t emphasize this enough: screw CPU %, page life expectancy, disk queue length, or any of those other metrics your grandpa told you to watch. If you’re only looking at a few metrics at a time, you’re looking at SQL Server from the outside.

It’s time to get inside the server:

  1. Ask SQL Server, “What have you been waiting on?”
  2. Find the top queries & issues causing that wait
  3. Focus your tuning efforts on those to give you the biggest bang for the buck

Obviously, I can’t teach you about every possible wait type bottleneck on your server in this DBA Training Plan series, but I can get you started on the right steps on the journey. Here are your next steps for learning:

  • Run sp_BlitzFirst @SinceStartup = 1, and look at the top wait types on your servers. Read the SQLskills wait library data for your top waits, and search the blog here too.
  • Free Troubleshooting SQL Server book – once you know your top wait types, pop this book open and turn to the relevant chapter for techniques on how to mitigate it. (It’s also got chapters on typical production outage issues, too.)
  • Fundamentals of Server Tuning class – where I teach you how to use sp_BlitzFirst to measure your server, and we talk through the top wait types on your server together.
  • Mastering Server Tuning class – 3-day hands-on class where you’re given tough workload challenges and you have to fix ’em fast.

DBA Training Plan 14: POP QUIZ!

AH-HA! You didn’t know there would be one of these, did you, hotshot? You’ve been reading along in the series, just smiling and nodding, pretending you’re keeping up, but not really doing the homework. BUSTED! Put your books away and get your pencils out.

Backup & Recovery Questions:

  1. How many production SQL Servers do you have?
  2. What’s the RPO and RTO of your most important production server?
  3. Did your backups take the normal amount of time last night?
  4. When was the last time DBCC CHECKDB successfully finished in production?
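
If that last one stumps you, newer builds (2016 SP2 and later, if memory serves) will tell you per database – a quick sketch:

    -- When did CHECKDB last finish cleanly for this database?
    SELECT DATABASEPROPERTYEX(N'YourDatabase', 'LastGoodCheckDbTime') AS last_good_checkdb;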

Security Questions:

  1. How many different people are sysadmins in production?
  2. Do they each know that they’re sysadmins, and take care to avoid accidents?
  3. How many of your databases hold personally identifiable data like credit card numbers, social security numbers, and passwords?
  4. If someone gets hold of one of those database backups, can your company’s data go public?
  5. Have you informed your managers of that risk, or will they blame you?

Monitoring Questions:

  1. When a database server runs out of drive space, who gets emailed?
  2. Do at least two different people get the email in case one is on vacation or unavailable?
  3. What actions will you take to fix the situation?
  4. Are those actions documented so that everyone who gets the email can take action quickly?

Index Questions:

  1. Does every table in production have a clustered index?
  2. For any exceptions (heaps), do you have a plan to fix them?
  3. Which table in production has the most indexes, and why?
  4. Which frequently queried tables in production have the fewest indexes, and why?
  5. How are you managing index fragmentation?
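
For the clustered index questions, one quick way to spot heaps (user tables with no clustered index) in the current database:

  SELECT OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
         OBJECT_NAME(i.object_id) AS table_name
  FROM sys.indexes AS i
  JOIN sys.tables AS t ON i.object_id = t.object_id   /* user tables only */
  WHERE i.index_id = 0;                               /* index_id 0 = heap */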

The Right Answers

There’s no one right answer for any of these questions, but some answers are more wrong than others. Database administration is a journey, and not everyone in the company is going to appreciate the work you’re putting in. You can spend weeks or months trying to improve your answers on these. No matter how big your company is and how many database administrators you have, you’re probably never going to be truly happy with your answers here.

It's a trick question.

Final question: which one of these is Blanc de Blancs?

You’re not aiming for perfect.

You’re aiming for good enough that your managers accept the base of your Database Hierarchy of Needs pyramid, and that you feel confident in tackling the higher levels like performance and future-proofing.

Next week’s email is going to start digging into performance, and we’re going to ignore the lower levels of the pyramid – but that doesn’t mean that part of your journey is over. Print out this email, cut out the question list, and scribble in a few thoughts. Pin it up on your wall, and a few months from now, when you’re feeling overconfident that your environment is awesome, check that list again. Refresh your memory with the links on the right side of this email.

When an outsider comes in, like a support engineer or a consultant, they’re going to start at the base of your pyramid first. Before they start to help, they have to make sure your data is backed up and checked for corruption. They can’t go making changes without having a safety net.

And you shouldn’t either.

Now might be a good time to revisit some of the posts in the DBA Training Plan series.

Updated First Responder Kit and Consultant Toolkit for August 2019

sp_DatabaseRestore can now restore to Azure blobs and change the database owner after a restore, and everything else gets bug fixes.

To get the new version:

Consultant Toolkit Changes

No functionality changes this month, just updating the First Responder Kit with the below changes.

sp_Blitz Changes

  • Fix: the power mode check now includes the label for Windows 10’s new Ultimate Performance Power Mode, which surely goes up to eleven. (#2044, thanks B. Tyler White for the code.)
  • Improvement: added syntax highlighting in the Github repo. (This affects all scripts, not just sp_Blitz.) (#2090, thanks Konstantin Taranov.)

sp_BlitzCache Changes

  • Improvement: if 75% of your plan cache is new as of today, raise the priority on the plan cache time summaries to 1 (instead of 254.) (#2052)
  • Fix: removing air_quote_actual plans for now. This new feature in SQL Server 2019 & Azure SQL DB simply isn’t working consistently even in 2019 CTP 3.2, so disabling for now. (#2022, thanks Jonathon Wyza for the bug report and Shane Holder for the debugging.)
  • Fix: skip AG secondary databases that don’t allow readable connections. (#2072, thanks Adrian Buckman for the code contribution.)
  • Fix: running sp_BlitzCache from multiple sessions no longer duplicates the missing index count on plans. (#2070)

sp_BlitzFirst Changes

  • Fix: skip AG secondary databases that don’t allow readable connections. (#2072, thanks Adrian Buckman for the code contribution.)
  • Fix: was throwing an int overflow on boxes with over 2TB RAM. (#2060, thanks Dan Andrei Stefan for the bug report.)
  • Fix: index reorgs were being reported as DBCC operations due to a bug in sys.dm_exec_requests. (#2062, thanks Erik Darling for the bug report.)

sp_BlitzIndex Changes

  • Fix: when saving sp_BlitzIndex output to table, if you had an index with a definition over 4,000 characters long, it was throwing the legendary data-would-be-truncated error. Could happen if you had a lot of includes with very long column names, or with columnstore indexes. (#2076, thanks Scotti85 for the bug report.)
  • Fix: if the @IgnoreDatabases parameter was broken across multiple lines (like if you copy/pasted from a spreadsheet), they weren’t all getting ignored. (#2053, thanks Erik Darling for the bug report.)

sp_DatabaseRestore Changes

  • Improvement: new @DatabaseOwner parameter lets you set the database owner after the restore finishes. (#2081, thanks gdoddsy for the code contribution.)
  • Improvement: ability to restore to an Azure blob target. (#2067, thanks John McCall for the code contribution.)

For Support

When you have questions about how the tools work, talk with the community in the #FirstResponderKit Slack channel. If you need a free invite, hit SQLslack.com. Be patient – it’s staffed with volunteers who have day jobs.

When you find a bug or want something changed, read the contributing.md file.

When you have a question about what the scripts found, first make sure you read the “More Details” URL for any warning you find. We put a lot of work into documentation, and we wouldn’t want someone to yell at you to go read the fine manual. After that, when you’ve still got questions about how something works in SQL Server, post a question at DBA.StackExchange.com and the community (that includes us!) will help. Include exact errors and any applicable screenshots, your SQL Server version number (including the build #), and the version of the tool you’re working with.

DBA Training Plan 13: Why Do Cached Plans Go Bad?

It’s a little bit of a trick question: what could make a good query plan turn bad?

In the last episode, we used sp_BlitzCache to spot the most resource-intensive, longest-running query plans in your server’s cache. Go run those queries again, right now, and I bet you’re going to see plans in the cache that use parameters. Maybe they’re stored procedures, or maybe they’re parameterized SQL.

To explain it, I’m going to need to run a few demos. I’m going to use the StackOverflow database – particularly, the Users table that I demo in How to Think Like the Engine. I’m going to start with an index on the Reputation field:
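
Something along these lines – a plain single-column nonclustered index (the IX_Reputation name is the one you’ll see in the plans below):

  CREATE INDEX IX_Reputation ON dbo.Users (Reputation);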

That index is ONLY on the Reputation field – so it can be used for the query below, but it’s not a fully covering index:
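
The query is roughly:

  SELECT *
  FROM dbo.Users
  WHERE Reputation = 2;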

That query finds all of the data for users whose Reputation score = 2. There’s not a lot of folks in the Stack database that match – the default Reputation score is 1, and people either stay there, or they start working their way up the charts.

Here’s what the query plan looks like:

Index seek with key lookup

SQL Server does an index seek on our IX_Reputation index to find the 5,305 rows that match, then does a key lookup to get the SELECT * part (because the index doesn’t cover all those fields.) Look at the execution plan, hover your mouse over the index seek, and you’ll see that SQL Server expected 5,305 rows – and 5,305 actually came back. Awesome.

Now let’s try that query looking for Reputation = 1:
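
Same query, just a different literal:

  SELECT *
  FROM dbo.Users
  WHERE Reputation = 1;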

The Plan with the Scan

Note that even though SQL Server auto-parameterized the query (that’s the @1 part at the top), SQL Server chose a different execution plan. This time, the actual plan shows that SQL Server expected 3mm rows to come back – so here, it makes more sense to do a clustered index scan rather than first make a list of the users that match, then do 3mm key lookups to get the SELECT * part. SQL Server is using our index’s statistics to guess how many rows will come back.

The same query can produce 2 different plans with 2 different parameters.

(More complex queries can even produce more different plans than that.)

Let’s put it in a stored procedure and see what happens.

This stored procedure is pretty simple:
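
A minimal sketch of it – the usp_UsersByReputation name is a placeholder of mine, but the shape is the important part: a single @Reputation parameter feeding the same SELECT *:

  CREATE OR ALTER PROCEDURE dbo.usp_UsersByReputation
      @Reputation INT
  AS
  BEGIN
      SELECT *
      FROM dbo.Users
      WHERE Reputation = @Reputation;
  END
  GO

The two test executions then look like this:

  EXEC dbo.usp_UsersByReputation @Reputation = 1;   /* compiles and caches a plan on first use */
  EXEC dbo.usp_UsersByReputation @Reputation = 2;   /* reuses whatever plan is already cached */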

Run it with @Reputation = 1, and you get the clustered index scan:

Perfect plan for big data

Then run it with @Reputation = 2, and you get…wait a minute…

Scan, but not as bad as you think

You get the execution plan from the first pass. That’s because SQL Server caches stored procedure execution plans – it builds a plan for the first set of parameters that happen to get passed in when the plan needs to be built, then caches that same plan to reuse over and over. The plan will stay in cache until you reboot Windows, restart the SQL Server service, rebuild indexes, update statistics, run DBCC FREEPROCCACHE, etc.

Here, that’s not such a big deal. I know, you see clustered index scan, and you think performance is bad – but it’s not really that big of a deal:

  • @Reputation = 1 with index scan – does about 80k logical reads, takes about 30 seconds (but mostly because SSMS has to render 3mm rows)
  • @Reputation = 2 with index scan – does about 80k logical reads, takes about a second (because there are only 5,305 rows)

If you look at the actual plan for @Reputation 2 here, and hover your mouse over the Clustered Index Scan operator, you’ll notice that SQL Server doesn’t just save the plan – it also saves the estimates. We’re expecting 3.3mm rows to come back here – even though only 5,305 do. Who cares, though? Overestimating is awesome, right?

But then something goes wrong.

Somebody:

  • Restarts Windows
  • Restarts the SQL Server service
  • Frees the procedure cache (for example, with the command sketched after this list)
  • Puts the server under memory pressure (thereby pushing this plan out of cache)
  • Doesn’t run the query for a while
  • Rebuilds indexes on the Users table
  • Updates statistics on the Users table
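
If you want to reproduce the flip on purpose in a demo, the easiest of those to trigger yourself is freeing the cache (be careful: this evicts every cached plan on the instance, so don’t run it casually in production):

  DBCC FREEPROCCACHE;   /* the next execution of the proc compiles a brand-new plan */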

And somehow the execution sequence is reversed. First, we run it for @Reputation = 2:

The seek shall inherit the mirth

We get an execution plan beautifully designed for tiny amounts of data. Hover your mouse over the index seek, and you’ll see that SQL Server accurately expects that only 5,305 rows will be returned. With the index seek, we only do 16,268 logical reads – even less than before! Great! Now that plan is in the cache.

You can hear the train coming. Let’s run it for @Reputation = 1:

We reuse the plan for tiny data

SQL Server uses the cached execution plan, but it’s ugly, which means:

  • We do an index seek, plus 3.3mm key lookups
  • We do a staggering 10,046,742 logical reads (up from 80k) due to those repeated key lookups
  • We only estimate 5,305 rows will come back, which means if we had added joins or sorts in this query, they would have spilled to disk
  • We can’t see the terrible awfulness in the plan cache, which only shows estimates, not actuals

This is parameter sniffing: good plans turning bad.

SQL Server builds one execution plan, and caches it as long as possible, reusing it for executions no matter what parameters you pass in.

If, for some reason, the plan disappears from memory, the very next set of parameters determines the new execution plan.

Your next question is, “So how do I get the good plan back?” And here are the next steps on your learning journey:
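
As a quick stopgap while you work through that, one common move is to mark the procedure for recompilation so its next execution builds a fresh plan – sketched here against the placeholder procedure name from earlier:

  EXEC sp_recompile N'dbo.usp_UsersByReputation';   /* next run compiles a new plan from whatever parameters arrive */

That doesn’t fix parameter sniffing, of course – it just rolls the dice again on whichever parameters show up next.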

Research Paper Week: Constant Time Recovery in Azure SQL DB

Let’s finish up Research Paper Week with something we’re all going to need to read over the next year or two. I know, it says Azure SQL DB, but you boxed-product folks will be interested in this one too: Constant Time Recovery in Azure SQL DB by Panagiotis Antonopoulos, Peter Byrne, Wayne Chen, Cristian Diaconu, Raghavendra Thallam Kodandaramaih, Hanuma Kodavalla, Prashanth Purnananda, Adrian-Leonard Radu, Chaitanya Sreenivas Ravella, and Girish Mittur Venkataramanappa (2019).

This one covers a new feature for both Azure SQL DB and SQL Server that:

  • Stores the version store in the user database file(s) rather than TempDB
  • Facilitates near-instant transaction rollbacks, even for huge transactions
  • Lets you clear the transaction log much faster, even when transactions are open

Those of you who are familiar with the magic of Read Committed Snapshot Isolation (RCSI) and its use of the version store in TempDB are probably slapping your heads right about now going, “Ah, it’s obvious!” You’re going to be able to hit the paper running – no need to keep reading.

For the rest of you, I’ll try to condense the relevant parts of Kendra’s intro to RCSI here in a few sentences. When you enable RCSI or SI, SQL Server automatically starts storing versions of rows in TempDB. When you modify a row (and therefore hold a lock on it), SQL Server stores the original, unmodified version of that row over in TempDB. That means if someone wants to read (not write) the row you currently have locked, SQL Server can simply hand them the unmodified copy out of TempDB and skirt around your lock. This means writers don’t block readers, which is what makes RCSI so massively popular for fixing blocking issues.
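
For reference, turning RCSI on for a database is a one-liner – the database name here is a placeholder, and the ROLLBACK IMMEDIATE clause kills open transactions rather than waiting on them, so pick your moment:

  ALTER DATABASE [YourDatabase]            /* placeholder name */
      SET READ_COMMITTED_SNAPSHOT ON
      WITH ROLLBACK IMMEDIATE;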

The downsides of RCSI include more work happening in TempDB, which causes two problems: slower TempDB access and larger TempDB files. For years, folks have asked for the ability to use a different TempDB per user database to work around problems like this, but Microsoft went in a totally different direction and decided to let you store the version store in the user database instead.

This is way better than separate TempDBs (for this purpose) because having the original rows inside the user database opens up new possibilities for faster transaction rollback and quicker transaction log truncation, as the authors go into in the paper.

If you’re considering enabling Constant Time Recovery in SQL Server 2019 (when it ships), you should read the paper to understand the risks and drawbacks of a feature like this. This is the kind of feature that will put the cutting in cutting edge: I would fully expect to see a lot of rollback, concurrency, and backup/restore bugs fixed in the early Cumulative Updates for 2019. I’m not saying it’s a bad feature – it certainly looks cool – but there are lots of risks when you make a change this big.

If you enjoy this paper, you’ll probably also enjoy Socrates: The New SQL Server in the Cloud. It explains how Azure SQL DB Hyperscale works, and boy, does it have intriguing new ways of handling data and log files.

I hope you enjoyed Research Paper Week – I certainly did – and I want to stop for a moment to thank all those academics and research professionals out there who’ve done such a fantastic job putting very complex thoughts down into easier-to-understand words for the rest of us. These papers involve hundreds or thousands of hours worth of research, action, writing, and testing, and they’re all available to you for free. How awesome is that?

Research Paper Week: In-Memory Multi-Version Concurrency Control

If you’ve been doing performance tuning for several years, or graduated from my Mastering Server Tuning class, you’ve come across Read Committed Snapshot Isolation, aka RCSI, aka multi-version concurrency control, aka MVCC, aka optimistic concurrency. It’s not the way SQL Server ships by default, although it is the default for Azure SQL DB, and it’s part of the magic in how you can query an Availability Group readable secondary even when it’s applying updates to the tables behind the scenes.

Today’s paper turns it up several notches: PDF: An Empirical Evaluation of In-Memory Multi-Version Concurrency Control by Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo (2017).

In-Memory Multi-Version Concurrency Control

This white paper compares and contrasts the methods used by SQL Server (Hekaton In-Memory OLTP), Oracle, Postgres, MySQL, SAP HANA, and others. It doesn’t really focus on SQL Server rowstore RCSI, though – the only SQL Server thing it focuses on is Hekaton.

It’s only 12 pages, but buckle up: these are very information-dense pages. By page 2, they’re already breaking out each competitor by its protocol, version storage, garbage collection method, and index management (logical vs physical pointers.)

By page 3, you’re facing stuff like this:

Page 3

Do you need this white paper’s contents in order to be a good production DBA, development DBA, or database developer? Absolutely not. This white paper is more for the curious amongst us, those folks who want to know how the internals work and why Microsoft made the design decisions that they did. It’s the kind of white paper I’ve only started digesting, and it will probably take me several more readings before I’d even feel comfortable discussing its contents.

But the cool thing about having a blog like this is that a lot of y’all are smarter than me, so I present this white paper for your amusement. Enjoy.

People who liked this paper (uh, me) also enjoyed this one from Microsoft Research: Improving Optimistic Concurrency Control Through Batching and Operation Reordering by Bailu Ding, Lucja Kot, and Johannes Gehrke (2019). That’s right: taking groups of transactions that are happening around the same time, and reordering them to avoid problems with lock conflicts!

Research Paper Week: Automatic Indexing in Azure SQL DB

Before I give you the link to the next research paper in this week’s series, I wanna give you a few questions to think about:

  • If you had to build something to automatically add indexes to SQL Server databases, where would you start?
  • What data would you use to make your decisions?
  • How would you test your decisions without implementing them in production?
  • How would you apply your decisions?
  • What’s the worst that could happen if your decisions were wrong?
  • How would you find out if your decisions were wrong?
  • What might your users do that would break your hard work?

Now, think about the same points – but for automatically dropping indexes. What data might you base your decision on? What’s the worst that could happen? How would you find out if the worst had happened, and could you correct the situation?

Seriously, print out this blog post or copy/paste those questions into an email, send it to yourself, and just think about it when you’re standing in a line or riding the train. Don’t cheat by reading the white paper first – spend some serious spare time thinking about how you’d answer those questions.

This paper covers those answers.

After you’ve spent a day or two thinking about the answers, read this: (PDF) Automatically Indexing Millions of Databases in Microsoft Azure SQL Database by Sudipto Das, Miroslav Grbic, Igor Ilic, Isidora Jovandic, Andrija Jovanovic, Vivek R. Narasayya, Miodrag Radulovic, Maja Stikic, Gaoxiang Xu, and Surajit Chaudhuri.

Automatically Indexing in Azure SQL DB

This easy-to-read 12-page paper is wonderfully candid about how Azure’s auto-indexing does what it does, the challenges it faces, how it’s succeeded, and times when it’s failed.

I don’t want to give away any spoilers here, but trust me when I say that if you do a lot of index tuning – especially graduates of my Mastering Index Tuning class – you’re going to recognize the challenges in this work, and you’re going to identify a lot with how Clippy runs into problems. Indexing is seriously complex, and there’s no magic answer. Azure SQL DB faces the same problems you do, and it’s making a really cool set of first steps to improve the situation – but with tools you might not expect.

I’ll talk spoilers in the comments as folks write in – feel free to leave spoilers in the comments, but hold off on reading them until you’ve taken a day or two to think about your answers and then read the white paper. This thought exercise really is worth the time. (I say this because I’ve thought a lot about these problems over the years!)
