Unindexed Foreign Keys Can Make Performance Worse.

I know. You, dear reader, saw that title and you came in here because you’re furious. You want foreign key relationships configured in all of your tables to prevent bad data from getting in.

But you gotta make sure to index them, too.

Let’s take the Stack Overflow database – I’m using the 50GB Stack Overflow 2013 version for this post. I’m going to start with no nonclustered indexes whatsoever, and then I’m going to add in foreign keys between the Users table and several tables where users add content, like Badges, Comments, Posts, and Votes:
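Here’s roughly what that looks like – the constraint names are my own, but the columns match the Stack Overflow schema:

ALTER TABLE dbo.Badges   ADD CONSTRAINT FK_Badges_Users   FOREIGN KEY (UserId) REFERENCES dbo.Users (Id);
ALTER TABLE dbo.Comments ADD CONSTRAINT FK_Comments_Users FOREIGN KEY (UserId) REFERENCES dbo.Users (Id);
ALTER TABLE dbo.Posts    ADD CONSTRAINT FK_Posts_Users    FOREIGN KEY (OwnerUserId) REFERENCES dbo.Users (Id);
ALTER TABLE dbo.Votes    ADD CONSTRAINT FK_Votes_Users    FOREIGN KEY (UserId) REFERENCES dbo.Users (Id);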

Now, let’s say my application wants to delete a specific user, id # 26837. First, the app starts by deleting all of the rows in the content tables for this user:
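The deletes look something like this:

DELETE dbo.Badges   WHERE UserId = 26837;
DELETE dbo.Comments WHERE UserId = 26837;
DELETE dbo.Posts    WHERE OwnerUserId = 26837;
DELETE dbo.Votes    WHERE UserId = 26837;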

That does table scans, and let’s say we’re okay with that for some reason because our application rarely does deletes. Now that the user’s rows are gone from the content tables, let’s go back and delete the User row:
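Something like:

DELETE dbo.Users WHERE Id = 26837;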

It’s just deleting one row, right? How bad can it be?

Terrible, as it turns out. SQL Server wants to double-check all of the content tables to make sure User 26837 doesn’t own any Badges, Comments, Posts, or Votes. We do a ton of logical reads, too:

The first solution you probably think of: index all of the foreign key columns:
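For the Stack Overflow schema, that would be something along these lines – the index names are mine:

CREATE INDEX IX_Badges_UserId     ON dbo.Badges (UserId);
CREATE INDEX IX_Comments_UserId   ON dbo.Comments (UserId);
CREATE INDEX IX_Posts_OwnerUserId ON dbo.Posts (OwnerUserId);
CREATE INDEX IX_Votes_UserId      ON dbo.Votes (UserId);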

That makes the deletes go way faster, but it comes at a few costs: slower inserts & updates to maintain those indexes, a 3% larger database size in this case, 3% longer maintenance jobs, etc.

At one recent engagement, we simply couldn’t use that solution. They already had serious performance problems during inserts: their hardware couldn’t handle any additional slowdowns during inserts & updates, and they already had extensive blocking & writing problems. Other solution options included:

  • Queueing the deletes for later – in this Stack Overflow example, say the application doesn’t directly delete the Users rows, and instead, adds rows to a queue table to be handled later. During a maintenance window, we could do the deletes in batches, dropping groups of users across the tables with only one pass.
  • Drop the foreign keys – at this particular client, the keys were configured with NOCHECK anyway due to past bad data, and there wasn’t a way we could fix that quickly enough. In that case, dropping the foreign keys made the deletes instantly faster, which was part of the overall solution in getting enough pain relief to go through their busy season.

Do I wish every database had perfect data, perfect foreign keys, perfect indexes, and perfect hardware to sustain all of the load? Absolutely. But in real life, constraints aren’t black and white: they’re fifty shades of … oh, you know.


How to Create a Table with a Partitioned Clustered Columnstore Index


If you need to create a table and you want it to be partitioned right from the start, AND you want it to have a clustered columnstore index, here’s a creation T-SQL example:
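A minimal sketch – the table, columns, and partition function/scheme names are all made up for illustration:

CREATE PARTITION FUNCTION pf_SaleDate (DATETIME2) AS RANGE RIGHT FOR VALUES ('2020-01-01', '2021-01-01');
CREATE PARTITION SCHEME ps_SaleDate AS PARTITION pf_SaleDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.Sales
(
    SaleId   BIGINT NOT NULL,
    SaleDate DATETIME2 NOT NULL,
    Amount   MONEY NOT NULL,
    INDEX CCI_Sales CLUSTERED COLUMNSTORE   -- the index name goes in here
) ON ps_SaleDate (SaleDate);                -- the partition scheme goes out here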

The partition scheme name goes on the outside of the parenthesis, the index name goes on the inside, and the order of the whole thing is a little counter-intuitive.

That’s it. That’s the entire post. I just had to do that this morning and had a tough time finding an answer online, so figured I’d blog it. There you go.


Free Webcast Wednesday: Pushing the Envelope with Indexing for Edge Case Performance

Most of the time, conventional clustered and non-clustered indexes work just fine – but not all the time. When you really need to push performance, hand-crafted special index types can give you an amazing boost. Join Microsoft Certified Master, Brent Ozar, to learn the right use cases for filtered indexes, indexed views, computed columns, table partitioning and more.

This is a free webcast on Wednesday, September 16th at noon Eastern.

Register now to save your spot. See you there! If you can’t be there live, a link to the recording will be sent to the registrants.


Updated First Responder Kit and Consultant Toolkit for September 2020

Let’s give a big round of virtual germ-free applause to Adrian Buckman, Erik Darling, Garry Bargsley, Greg Dodd, Ian Manton, John McCall, Kuldar Andares, Pierre Letter, Stephen Anslow, Tom Butler, and Walden Leverich for volunteering their time to help make your job easier this month. Thanks to them, the September release is chock full of features – I’m not even going to try to sum them up here, and you should read the release notes. Kudos to them!

To get the new version:

Consultant Toolkit Changes

I updated it to this month’s First Responder Kit, but no changes to querymanifest.json or the spreadsheet. If you’ve customized those, no changes are necessary this month: just copy the /Resources/SQLServerSetup checklist folder over your existing files to replace the SQL scripts.

sp_Blitz Changes

  • Improvement: new check looks for Query Store problems where is_query_store_on reports 1, but sys.database_query_store_options doesn’t agree. (#2525, thanks Walden Leverich.)
  • Improvement: more readable info in the drive free space checks. (#2557, thanks Pierre Letter.)
  • Fix: sp_Blitz was failing on Amazon RDS SQL Servers because it was trying to run sp_validatelogins, which Amazon doesn’t allow. (#2526, thanks Erik Darling.)
  • Improvement: now alerts if there’s even just 1 memory dump in the last year (as opposed to 5). (#2564, thanks Garry Bargsley.)

sp_BlitzCache Changes

  • Improvement: @OutputServerName parameter lets you push the sp_BlitzCache results to a central server via linked server. The linked server will need to already exist and have RPC enabled. Linked servers don’t support writing via XML, so when writing out to a linked server, the query plan column is created as NVARCHAR(MAX) rather than XML. (#2528, thanks Pierre Letter.)
  • Fix: “Many Duplicate Plans” check was reporting numbers higher than 100%. I didn’t do a great fix, but if anybody wants to dig into the real underlying problem, they can take the code from issue 2480 and apply it to this same problem. I’d be glad to take a pull request for that. (#2532, thanks Tom Butler.)
  • Fix: removed unqualified join to get the Microsoft data migration wizard to stop complaining. (#2558, thanks Ian Manton.)

sp_BlitzFirst Changes

  • Improvement: new warnings for queries running with memory grants > 15% of the buffer pool. You can configure that threshold with the @MemoryGrantThresholdPct parameter, too. (#2505, thanks Adrian Buckman.)
  • Fix: the recently-updated-statistics check (ID 44) was being blocked by index rebuild operations due to sys.dm_db_stats_properties ignoring isolation level hints. (#2548, thanks Adrian Buckman.)

sp_BlitzIndex Changes

  • Improvement: missing indexes now show both the data type for the column, and have commas in between the columns. (#2522, thanks Greg Dodd.)
  • Improvement: if lock escalation has been disabled on a table, add text mentioning that. (#2553, thanks Stephen Anslow.)
  • Fix: if Clippy came up with a missing index request more than 4,000 characters long, we were truncating it. Now, we preserve Clippy’s insanity in all its blazing glory. (#2533, thanks Kuldar Andares for the report and Greg Dodd for the fix.)
  • Fix: suggested index names are now capped at 128 characters long. This can produce duplicate index names, but frankly, that’s the price you should pay for creating column names that long, buddy. (#2552, thanks Greg Dodd.)
  • Fix: @Mode = 3 (missing indexes compilation) had a few extra diagnostic columns in the output left over from working on last month’s release. Removed those. (#2539)

sp_BlitzWho Changes

  • Improvement: when you use the @OutputTable parameters, sp_BlitzWho now automatically creates a deltas view like the other procs. The deltas view shows the last collection of each query on a session, so you see the total CPU/reads/time impact of each query. Not collected across multiple sessions – just helps you identify long-running queries that aren’t showing up in the plan cache, like stuff with option recompile hints or servers with a lot of memory pressure. (#2315, thanks Adrian Buckman.)
  • Fix: when a database has RCSI turned on, some DMV (don’t ask me which one) isn’t always reporting all transactions are in RCSI. John McCall came up with a slick workaround to report the right isolation levels. (#2534, thanks John McCall.)

sp_DatabaseRestore Changes

  • Fix: if you pass in an empty string for the @RestoreDatabaseName parameter, we use the existing database name instead. (#2536, thanks Greg Dodd.)

SqlServerVersions Changes

  • Improvement: when creating the SqlServerVersions table, we now also add extended properties for documentation. (#2574, thanks John McCall.)

For Support

When you have questions about how the tools work, talk with the community in the #FirstResponderKit Slack channel. If you need a free invite, hit SQLslack.com. Be patient – it’s staffed with volunteers who have day jobs.

When you find a bug or want something changed, read the contributing.md file.

When you have a question about what the scripts found, first make sure you read the “More Details” URL for any warning you find. We put a lot of work into documentation, and we wouldn’t want someone to yell at you to go read the fine manual. After that, when you’ve still got questions about how something works in SQL Server, post a question at DBA.StackExchange.com and the community (that includes me!) will help. Include exact errors and any applicable screenshots, your SQL Server version number (including the build #), and the version of the tool you’re working with.

Watch Me Working On It

If you’d like to see me working on a few of this month’s issues, you can check out my live streams from my Twitch, YouTube, or Facebook channels:

 


“Not to write any procedure over 50 lines”

In Joe Celko’s Stairway to Database Design series, he writes:

The rules of thumb for T-SQL are not to write any procedure over 50 lines (one page)

This seems so radical at first, but it has precedent in NASA’s 10 Rules for Developing Safety-Critical Code:

  1. Avoid complex flow constructs, such as goto and recursion.
  2. All loops must have fixed bounds. This prevents runaway code.
  3. Avoid heap memory allocation.
  4. Restrict functions to a single printed page.
  5. Use a minimum of two runtime assertions per function.
  6. Restrict the scope of data to the smallest possible.
  7. Check the return value of all non-void functions, or cast to void to indicate the return value is useless.
  8. Use the preprocessor sparingly.
  9. Limit pointer use to a single dereference, and do not use function pointers.
  10. Compile with all possible warnings active; all warnings should then be addressed before release of the software.

See that #4? If your T-SQL was safety-critical – if folks’ lives depended on the accuracy and speed of your query results – then NASA would suggest you break it down.

Don’t get me wrong – I can’t possibly work that way, nor do I think many of you can work that way either. As I write this, the current sp_Blitz is 9,210 lines long. sp_Blitz is a great example of something that, were it mission-critical, would need to be refactored into much smaller, more easily testable chunks. But I still love the 50-line suggestion because it gives us pause before we tack another hundred lines onto an already-ginormous spaghetti monster.


Upcoming Free Webcast: 500-Level Career Guide: Building Your Brand

At SQLBits 2020 on Saturday, October 3 at 11:10AM Eastern, 8:10AM Pacific (calendar), I’m presenting a session called 500-Level Career Guide: Building Your Brand. Here’s the abstract:

If you want to work for yourself, you need to be able to sell yourself. I know, I hate it too – it feels gross when I write it that way. I’m Brent Ozar. You recognize my name, that’s why you wanna learn this from me.

Doesn’t that sound terrible? It sounds cocky. Arrogant. American.

But if you’re going to sell your services, you’re going to have a much easier time if people recognize you by name – and even better, seek you out by name. Just as customers seek out brands like Lego, Adidas, and Coke, you want them to seek you out by name too.

You’ll learn my 3-word exercise for building your brand, and then learn how to become known for that brand in our industry without spending any money on advertising. If you use this exercise and then stay true to it in your blog posts, tweets, and presentations, you’ll have customers pouring in without having to make a single phone call or email.

To attend, register here for free for SQLBits. Saturday is a completely free day packed with sessions from Bob Ward, Buck Woody, Paul Randal, and much more. The conference is in UK time, though, and my session’s actually in the last slot of the day. If you miss it, though, good news: SQLBits records their Saturday sessions and makes ’em publicly available for free later.

Here are the session resources:

My online bookmarks at Pinboard – especially the bookmarks tagged business, career, inspiration, management, marketing, presenting, startup, streaming, and writing.

Books I recommend, in order:

People who have done it and shared:

See you in the webcast. Don’t forget to register!


“UPDATE, INSERT, and DELETE are not normally processed in parallel”

Years ago, when troubleshooting performance, I stumbled across this Microsoft documentation on parallel query processing that says:

Certain types of statements cannot be processed in parallel unless they contain clauses, however. For example, UPDATE, INSERT, and DELETE are not normally processed in parallel even if the related query meets the criteria. But if the UPDATE or DELETE statements contain a WHERE clause, or an INSERT statement contains a SELECT clause, WHERE and SELECT can be executed in parallel. Changes are applied serially to the database in these cases.

Thank goodness I bookmarked it, because I moved on with my life and totally forgot about it until recently. I had to troubleshoot a couple of different slow deletes, and in both cases, parallelism questions came up. To help tell the story, I’ve reconstructed versions of them with the large Stack Overflow database.

I figured I should blog about this because I polled y’all on Twitter, and you weren’t sure about it either:

And there was even disagreement between those of y’all who WERE sure, hahaha, so let’s walk through it.

Problem 1: Cleaning Out a Table

Say you need to periodically erase the contents of a table and start over, and your app doesn’t have the permissions required to do a truncate table. Right now, all you’re doing is a simple delete:
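The query is about as simple as it gets:

DELETE dbo.Posts;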

The execution plan isn’t pretty:

What 148,723 Query Bucks looks like

For this demo, I’m using the 40 million row, 105GB Posts table, and it’s got 13 indexes on it. This database is the starting point for the Mastering classes – I drop different indexes on it depending on which lab folks are doing – and it happens to nicely replicate the kind of problem the client faced.

The plan has no parallelism operators, which in turn heats up one CPU core, which then sets 148,723 query bucks on fire:

Bonfire of the vanities

Watching the live query plans with sp_BlitzWho, we can see SQL Server working through the plan, one index at a time, sorting millions of rows of data. One index at a time. Two minutes, Turkish.

This query is single-threaded, but…could we get a parallel operator anywhere in the plan, and if so, will the query go faster? Well, we can kinda sorta gain parallelism by injecting a meaningless filter that matches all rows. In that case, SQL Server uses parallelism to “seek” on an index, discovers that all rows match, and then sets about doing the delete. However, there’s no parallelism in the delete operation itself, just during the meaningless operation to “find” the rows that match, which is of course all of them. The delete doesn’t finish any faster.
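If you’re curious, the stunt looks something like this – Id is the clustered primary key on Posts, so the filter matches every row:

DELETE dbo.Posts WHERE Id > 0;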

Anyhoo, setting aside stunt examples, back to the problem itself. I can totally understand the developer’s frustration because with this kind of query, you would think that SQL Server would:

  • Take out an exclusive lock on the table
  • Start deleting the contents of all of the indexes in parallel (since every row in every index has to be deleted, and we have an exclusive lock on the table, so it’s not like we need to worry about Halloween protection)

In that case, it helps to reset perspectives and say, “The real problem here isn’t parallelism: we just shouldn’t be doing all this work at all, let alone spreading it across multiple cores.” Better solutions for this particular problem:

  • Use K. Brian Kelley’s post on truncating tables with the least permissions, and then ideally, use a cert-signed stored procedure to do the dirty deed
  • If you can’t truncate the table, drop it and create a new one in its place – which has all kinds of problems around security, foreign keys, objects that may have been added to the table (like indexes and triggers)
  • If you can’t do any of that, at least drop all the indexes first, then recreate them after the table is empty (which is what we ended up doing in this case, bizarrely, because it presented the least security issues for that particular situation – sometimes the right solution isn’t politically feasible)

Problem 2: Cleaning Out Parts of a Table

Truncating was a no-brainer there, but things got tougher with the second query. Say we have another table – I’ll use the 66M row, 22GB Comments table in this example. Say we regularly run a clean-out project that removes all comments whose score is lower than the overall average:
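The delete looks roughly like this:

DELETE dbo.Comments
WHERE Score < (SELECT AVG(Score * 1.0) FROM dbo.Comments);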

And say we don’t want to add a Score index just for the delete process, because the app wouldn’t use it for any other purpose. The plan starts with two scans of the Comments table – one scan to find the average score, and another scan to find the rows we want to delete:

Off to a rough start

And the whole plan goes single-threaded despite costing 49,345 Query Bucks. Ouch. (Amusingly, it’s very easy to get parallelism in delete queries like this if you reference other tables, like if I deleted all Comments rows with a lower score than the average Posts score.)

In this case, it made sense to break the query up into two phases:

  1. Find the average Comments score (and that query plan goes parallel, plus probably makes sense to do separately ahead of time anyway)
  2. Delete all comments with a score lower than that (that query plan stays single-threaded, but it doesn’t really matter – it’s just a heck of a lot of work)
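A rough sketch of that two-phase approach – the variable name and data type are mine:

DECLARE @AvgScore DECIMAL(10, 2);

SELECT @AvgScore = AVG(Score * 1.0)
FROM dbo.Comments;          -- phase 1: this plan goes parallel

DELETE dbo.Comments
WHERE Score < @AvgScore;    -- phase 2: still single-threaded, but a much simpler plan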

That change alone got us across the finish line in that case.

If Those Solutions Hadn’t Been Enough

For further tuning, we could think about solutions like:

  • Partitioning by Score ranges
  • Designing two separate tables: one where all new Comments go into, and another where Comments are moved into after they meet a certain Score threshold, and then truncating the low-score table to speed the delete process.
  • Instead of deleting, building a CommentsToKeep table, migrating over the desired rows, dropping the old Comments table, and renaming CommentsToKeep as Comments

All of those solutions have a heck of a lot of drawbacks, though – and the main point of this post was just to point out that just because something is a lot of work doesn’t mean that it’s going to go parallel.

If You’re Interested In This Kinda Thing

You’ll probably also like Paul White’s series on the Halloween Problem and the new Using Batches to Do A Lot of Work Without Blocking module of my Mastering Query Tuning class. In addition, here’s a fun thought exercise: if you have to do a lot of these deletes on different tables, are you better off running them:

  • Serially, one after another, in a long command/batch/proc, or
  • In parallel, in different sessions, each using different CPU cores?

The answer would depend on things like:

  • How fast your IO subsystem is – like whether one core can max it out
  • What kind of HA/DR solution you’re using – as in, do you really want to flood your multi-subnet AG with 100GB of logged changes in a matter of minutes
  • Whether blocking will rear its ugly head – like if you have foreign keys or triggers

[Video] Office Hours: Open Q&A About SQL Server

Office Hours Podcast

For those of you who miss our old Office Hours webcasts and feel lonely during the quarantines, I’m doing my part to help with free weekend webcasts. For the next couple of Saturday & Sunday mornings from 6AM-8AM Pacific (iCal), you can catch me on my Twitch, YouTube, or Facebook channels. I’ll be:

I totally understand that weekend mornings aren’t a great time fit for everybody’s schedule, but it’s the time that works best for my own schedule (given my existing work.) But if you can’t make it live, no worries – all of the streams will be recorded and available in Twitch, YouTube, and Facebook. If I have a particularly good stream, I’ll put it here on the blog as well.

Wanna see what they’re like? Here’s this past Sunday’s stream when I did an open Q&A:

Grab the iCal file to get a reminder. They’re still on for the next two weekends, but then I’m taking a break to teach my Mastering Parameter Sniffing class over the weekend – I do that for Live Class Season Pass holders who paid out of their own pocket because their companies won’t pay or let them take time off.


What’s The First Thing You Tell Your Underling?

You were just given permission to hire an underling, and today’s their first day reporting to you.

What’s the first thing you warn them about?

I asked on Twitter, and the replies were great:

Chris and I had a followup discussion because I was curious to hear if he meant “do” to mean accept one, give one, or both. He meant both, and I agree. Plus, if you understand why, you’ll do a better job of executing the task and handling any surprises that come up during execution.

I’d extend that to include, “The data model speaks the truth, but the data often lies.”

I love that last one because it teaches the underling to take over as much as they can. It’s heartwarmingly encouraging, and we should all be that way to our underlings.

Y’all have such great advice that you deserve underlings.


SQLBits Attendees: Here Are Your Prerequisites for Mastering Index Tuning.

Howdy folks. It’s time for SQLBits!

This year, I’m teaching a one-day Mastering Index Tuning workshop from 9AM-5PM London time, and you can join in online. Pricing is £300 for the one-day workshop, or £800 for a full conference pass. You can register here.

Once you’re registered, here’s what you need to do:

What’s that, you say? You can’t buy my Fundamentals class because you’re in the EU, and we have that whole pesky GDPR problem? Okay, here’s what we’ll do: I’ve opened that class up for free until SQLBits. Go watch now, no registration required.

See you in the workshop!


SQL ConstantCare® Population Report: Summer 2020

Ever wonder how fast people are adopting new versions of SQL Server, or what’s “normal” out there for SQL Server adoption rates, hardware sizes, or numbers of databases? Let’s find out in the summer 2020 version of our SQL ConstantCare® population report.

The most popular version of SQL Server is 2016. The combination of 2014, 2016, and 2017 makes up 75% of market share right now:

But companies are gradually adopting newer versions and replacing their older servers. Here’s how market share changed over the last 6 months:

SQL Server Version Adoption Rate

This quarter, shops continued to gradually replace their older versions:

  • SQL Server 2019: 4% of the market, up from 1% 6 months ago
  • SQL Server 2017: 24% of the market, up from 18% 6 months ago
  • SQL Server 2016: fairly static at 34%
  • SQL Server 2014, 2012, and 2008R2 all declined, losing share to 2017 & 2019
  • Azure SQL DB and Managed Instances didn’t make any headway, still at 1% of the total population

My thoughts:

  • SQL Server 2016 is still the juggernaut, with 1 in 3 instances overall.
  • SQL Server 2019 still has less market adoption than SQL Server 2008R2.

So why is Azure SQL DB making so little headway here?

In this data, Azure SQL DB & MI still have less market adoption than SQL Server 2008, a 12-year-old and wildly unsupported version, let alone 2008R2.

I’ll be the first to point out that there’s a natural bias in the source of the data. Our SQL ConstantCare® service lets users send data in about their databases and get health & performance advice via email. SQL ConstantCare® isn’t a full picture of the SQL Server ecosystem: it’s marketed to you, my readers, and specifically the subset of readers who want an inexpensive second opinion about their database performance & health. Conceivably, people running in Azure SQL DB don’t need a second opinion about their database’s health: the patching, uptime, and backups are all handled by Microsoft, and you can’t get much better than that.

However, Azure SQL DB performance – just like any database’s performance – is still a real problem, and I know that from the consulting side of the business. You can scale a long way just by throwing money at it, but eventually your wallet starts to buckle, and you need to fix the queries and/or the indexes.

If I’m wrong – and I certainly could be, I just don’t know – Microsoft could fix this perception by publishing adoption data. SQL Server phones home by default, so they could easily publish adoption rates for different versions. (If they just published download numbers, that wouldn’t really mean anything – lots of folks download stuff they never use.) The fact that they don’t publish metrics makes me suspect that my report’s numbers do give a rough idea of the wider population: 2016 is the big behemoth right now, and the adoption rates for 2019 and Azure SQL DB just don’t hold a candle to it.

I still think platform-as-a-service databases like Azure SQL DB and AWS Aurora are the future kings of the market – emphasis on future. And at this adoption rate, the future is still at least 5-10 years away.

I’ll ask you, dear reader: why aren’t you migrating to Azure SQL DB?


Watch Brent Tune Queries at SQLSaturday Oslo

This morning, I had the honor of presenting virtually at SQLSaturday Oslo. The session title, Watch Brent Tune Queries, explains it all. Enjoy!

In the session, I used the 2018-06 Stack Overflow database, but any recent full-sized copy (not the 2010-2013 copies) will work. Here’s the query I worked with:

And for more sessions like this, check out the Watch Brent Tune Queries series page.


3 Ways to Run DBCC CHECKDB Faster

In my last post, I talked about how you can get 43%-67% faster backups. Today, I’m using that same SQL Server setup to discuss how throwing more CPU cores might help you check for corruption faster if you’ve already got modern fast storage. I don’t expect everyone to go to this level of detail by any means, but it’s the kind of tuning you can do when you’re facing multiple terabytes of production data per server and shrinking maintenance windows.

You need to check for corruption regularly because SQL Server isn’t doing it for you (although I’ve asked Microsoft to do that, and you can vote for it.) All kinds of things can cause corruption: storage issues, storage networking issues, and even SQL Server bugs like this or this or this or this or this.

However, folks are often hesitant to run DBCC CHECKDB because it has a reputation for being resource-intensive. Good news, though: there are a few easy ways to influence CHECKDB’s impact.

Performance tweak #1:
run it across more cores.

In Standard Edition, CHECKDB is single-threaded. That’s because you don’t need fast corruption checking in Standard Edition because it doesn’t have any corruption bugs at all. Nope, none whatsoever. Those of you on Standard Edition can just ignore this tweak and move right on to the next one.

Enterprise Edition has parallel corruption checking because it’s so much more vulnerable to corruption bugs (is that right? that must be what it is – surely it can’t be a money grab) and since SQL Server 2016, CHECKDB has accepted a MAXDOP parameter that lets you set the number of CPU cores involved. You can also hint it upwards, higher than the server-level MAXDOP, so that it can use more threads even on servers hosting single-threaded apps.
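Since SQL Server 2016, the syntax looks like this – the database name just matches the Stack Overflow setup used for these tests, and the MAXDOP number is whatever you want to try:

DBCC CHECKDB (N'StackOverflow') WITH MAXDOP = 64;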

The more you use, the more you save, but the story here is a little trickier than the multi-file backup post. Here, I not only examined CHECKDB’s runtime, but also the amount of wait time generated by the command, and the wait time ratio. Wait time ratio means for every minute on the clock, how many minutes did SQL Server spend waiting? The higher that number goes, the uglier your performance is.

MAXDOP | Minutes | Wait Time, Minutes | Wait Time Ratio
1      | 53      | 4                  | 0
2      | 37      | 121                | 3
4      | 26      | 155                | 6
8      | 18      | 212                | 12
12     | 14      | 234                | 16
16     | 14      | 332                | 23
24     | 12      | 354                | 30
32     | 11      | 462                | 41
48     | 9       | 494                | 56
56     | 8       | 564                | 65
64     | 7       | 579                | 75
72     | 13      | 1,052              | 79
80     | 17      | 1,409              | 84
96     | 19      | 1,879              | 99

Keep in mind that this particular SQL Server has 48 physical cores, 96 with hyperthreading turned on.

At MAXDOP 64, we’re checking 56GB of data per minute, and:

  • CHECKDB runs 86% faster than when it’s single-threaded
  • 59% faster than MAXDOP 8 (which is what a lot of folks would be at if they set server-level MAXDOP here)
  • 48% faster than MAXDOP 16
  • Wait time ratio is 75, meaning for every minute on the clock, we’re generating 75 minutes of wait time, most of which is CXPACKET and CXCONSUMER
  • The 64 CPU cores involved stick at 100% for most of the duration (thanks to our awesome hardware provisioning)

I’m certainly not saying MAXDOP 64 makes the most sense for all hardware, but in this scenario, it would let us keep the maintenance window as short as possible, assuming we have 7-10 minutes each night where we can sustain heavy loads.

MAXDOP 96 runs slower, not faster, and the CPU usage makes the box feel pretty unusable:

Performance tweak #2:
check for less (or more) corruption problems.

By default, CHECKDB checks for the most common issues, but you can make it go way faster if you just ask it to check the checksums on each page. This doesn’t catch logical issues, like corrupt statistics, but it at least gets you the earliest warning signs of storage corruption.

To do this, use the WITH PHYSICAL_ONLY switch – it doesn’t do nearly as much CPU work.
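The command looks like this – same Stack Overflow database as the earlier tests:

DBCC CHECKDB (N'StackOverflow') WITH PHYSICAL_ONLY, MAXDOP = 8;

With that switch, the performance results look different, too: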

MAXDOP | Minutes | Wait Time, Minutes | Wait Time Ratio
1      | 17      | 1                  | 0
2      | 12      | 35                 | 3
4      | 8       | 41                 | 5
8      | 6       | 47                 | 8
12     | 5       | 61                 | 12
16     | 5       | 74                 | 15
24     | 5       | 113                | 24

For this particular server, as I threw more cores at it, the only things that went up were my parallelism waits. The sweet spot here was around 8-12 cores.

But this leads to an interesting comparison:

  • PHYSICAL_ONLY with 12 cores: takes 5 minutes, only checks page checksums
  • FULL CHECKDB with 64 cores: takes 7 minutes, checks everything

On this particular server, as long as I’m running Expensive Edition, I just don’t think it would make sense to use the PHYSICAL_ONLY setting: if I’m already taking a 7-minute slowdown, and I’ve got the CPU cores available (and they’re not otherwise busy), I might as well just check everything. If this was truly a 24/7 environment where I couldn’t handle a 7-minute slowdown, then…I’m probably offloading CHECKDB to another AG replica anyway, especially given the wonderful 2019 licensing changes that make it free.

Furthermore, I might even want to use the EXTENDED_LOGICAL_CHECKS parameter. It catches even more corruption culprits, albeit at the cost of higher runtimes:

MAXDOP | Minutes | Wait Time, Minutes | Wait Time Ratio
1      | 53      | 4                  | 0
2      | 40      | 130                | 3
4      | 29      | 175                | 6
8      | 17      | 198                | 11
12     | 16      | 292                | 17
16     | 14      | 321                | 22
24     | 10      | 327                | 31
48     | 9       | 492                | 57
56     | 11      | 731                | 60
64     | 15      | 1,047              | 69
96     | 19      | 1,880              | 99

It’s interesting to note that the CPU sweet spot for this particular box was 64 cores for the regular CHECKDB, but around 24 cores for EXTENDED_LOGICAL_CHECKS, producing the nicest balance between short runtimes and an overloaded, unresponsive server. (I’m abbreviating some of the results here.) That sweet spot is going to depend on not just your hardware, but also the contents of your database and which SQL Server features you’re using, which brings me to…

Performance tweak #3:
make your database smaller.

The number of tables you have AND the number of indexes on ’em both affect CHECKDB’s speed. All of the tests above involved the 390GB 2020-06 Stack Overflow database, which ships with no nonclustered indexes whatsoever.

To make the database more complex, I added:

  • 80 3-column indexes
  • 2 indexed views
  • 2 nonclustered columnstore indexes
  • Bringing the database from 390GB to 560GB (up about 44%)

And then ran a few of the tests again:

  • Physical_only, 16 cores: 7.2 minutes – up ~60% from 4.5 minutes, bigger than the 44% database size growth, but I wouldn’t read too much into one-off tests – I wouldn’t use these results to say definitively that physical_only performance doesn’t scale linearly with database size.
  • Regular CHECKDB, 64 cores: 30.4 minutes – up dramatically from 7.5 minutes with no indexes, and wait time ratio remained around 75, so the server was really hustling the whole time.
  • Extended logical checks, 64 cores: 35.3 minutes – up from 14.9 minutes, but the extended checks runtime penalty (compared to regular CHECKDB at 30.4 minutes) wasn’t as big here as it was for the test with no nonclustered indexes at all. This lines up with what I see out in the field: extended logical checks aren’t that big of a deal on real-life databases with indexes, and they’re totally worth it if you’re taking the gamble on indexed views.

These timings are why it often makes sense to run CHECKDB with the PHYSICAL_ONLY switch during the week in order to get that blazing fast runtime with hardly any CPU usage, and then upgrade to the resource-intensive full CHECKDB with EXTENDED_LOGICAL_CHECKS on the weekend, throwing lots of CPU cores at it to get the work over as quickly as possible.

Didn’t know about this? You might have missed out on some of the other things we talk about in our Fundamentals of Database Administration class.


Back Up SQL Server 43%-67% Faster by Writing to Multiple Files.

Even if you’re not writing to different physical drives, you can get faster backups with Microsoft SQL Server when you back up to multiple files. Your exact numbers are going to vary based on your server and your utter inability to rub two sticks together to make fire, but let’s see how I do.

Here’s the hardware I’m using:

The i3en.metal has 8 7.5TB NVMe SSDs. The Stack Overflow database ships on 4 data files and 1 log file, so I’m putting each of the data files on a separate NVMe SSD. That means the database in this post probably isn’t configured exactly the same as yours, but the results are still relevant. I’ve seen similar performance gains for larger databases that have just one data file in ’em.

During the tests, I use between 1 and 8 backup files, and in each test, I explain whether all of the backup files are on the same drive or on different ones – but in all cases, they’re on different drives from the data & log files.

Each of the NVMe drives is pretty quick by itself:

CrystalDiskMark

I wanted to use this monstrously big hardware (especially relative to the database size) to illustrate that we’re not talking about hardware limitations here. Even if you’re on big metal, multiple backup files can help. Also, all of these tests have backup compression turned on, and I wanted to remove any hint of a CPU bottleneck.

You, dear reader, are going to ask all kinds of “but what about” questions. My role here is not to do your own backup testing for you: my role is to inspire you to do your own testing in your own environment so you can find the right answers to your own questions.

Test round 1:
Backing up to multiple files on one volume

In this test, the backup files were on one of the local NVMe SSDs, but not the same volume that the database’s data & log files lived on.
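The commands look something like this – the paths are made up, and the only thing that changes between tests is the number of DISK lines:

BACKUP DATABASE StackOverflow
  TO DISK = N'F:\MSSQL\Backup\StackOverflow_1.bak',
     DISK = N'F:\MSSQL\Backup\StackOverflow_2.bak',
     DISK = N'F:\MSSQL\Backup\StackOverflow_3.bak',
     DISK = N'F:\MSSQL\Backup\StackOverflow_4.bak'
  WITH COMPRESSION, STATS = 10;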

  • 1 backup file: 9.1 minutes
  • 2 backup files: 7.9 minutes (13% faster)
  • 4 backup files: 5.1 minutes (43% faster)
  • 8 backup files: 6.2 minutes (32% faster)

This isn’t meant to be a definitive, “Always use 4 backup files” kind of post, but just a starting point as to how much quicker your backups can get with this one easy change. As with all of the metrics in this post, the idea is to get you started on the road of testing your backups for quick wins on large servers.

Test round 2:
Backing up to files on different volumes

In this round, each backup file was on its own local NVMe SSD – until we got to 8 files, since I’ve only got 4 extra NVMe drives on the i3en.metal:

  • 1 backup file: 9.1 minutes
  • 2 files, 2 volumes: 6.9 minutes (24% faster)
  • 4 files, 4 volumes: 4.3 minutes (53% faster)
  • 8 files, 4 volumes: 3.0 minutes (67% faster)

Spreading the load over different backup volumes gave me double the speed improvements that I got when writing to a single volume. In this impractical example, I’m using local NVMe, and your server probably isn’t going to have that. However, you can get similar results by writing your backup files to different storage targets that have their own individual throughput that isn’t collectively throttled.

In the prairie days when your grandmother backed up her SQL Server, she hesitated to write her backups over the network because her network consisted of two tin cans connected with string. Today, thanks to the simple ec2instances.info, you can see the network bandwidth available to your instance type, and it ain’t no string:

Nice cans

This is why I love the i3 series: tons of local solid state, plus great network throughput. The network bandwidth alone isn’t enough, of course: you also have to provision fast storage targets on the other side if you need to back up a multi-terabyte backup. Some shops use i3en servers as staging area file servers – landing their backups there to get ’em written quickly, and then migrating the backups to more cost-effective and redundant storage for the longer term. (There are more cost-effective instance types if your backup needs are smaller, of course.)

Test round 3:
Backing up to NUL:

When you’re doing backup speed testing, you can back up to DISK=’NUL:’ and SQL Server doesn’t actually write the backup file to disk – it just discards the data. This helps measure how fast SQL Server’s backup processes can read the data from disk and compress it on this particular server.
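For example, the 4-file version looks like this – same database and compression settings, just different targets:

BACKUP DATABASE StackOverflow
  TO DISK = N'NUL:', DISK = N'NUL:', DISK = N'NUL:', DISK = N'NUL:'
  WITH COMPRESSION;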

  • 1 NUL file: 9.4 minutes
  • 2 NUL files: 6.8 minutes (28% faster)
  • 4 NUL files: 4.3 minutes (54% faster)
  • 8 NUL files: 2.9 minutes  (70% faster)

The numbers here are useful in comparison to test 2’s numbers: when writing to very fast backup volumes, you can actually approach the speed of simply discarding the data! In the 8-file example, if I just throw the data away by backing up to NUL, I finished in 2.9 minutes. Actually writing the data out to local NVMe SSDs got me there in 3.0 minutes. I’ll take it.

This is performance testing,
not configuration advice.

When your databases are just 50-250GB, and when your company is only active 9AM-5PM, this kind of thing doesn’t matter much. As your databases grow, and as your company demands constant uptime with short recovery times, this stuff matters a lot. It’s awesome that with just a little effort, you can dig in and start optimizing your backups for quicker recovery and less impact to your end users.

To start your own backup tuning project, check out Nic Cain’s automated backup tuning scripts. He automates the process of running dozens of backups with different settings – way more than just file count – in order to find the combination of parameters that works best given your database contents, where your data & log files live, and where you want your backups to be written.

In shops with ambitious RPO/RTO goals, you can use this same level of tuning for your restores, too. I wrote about that in my Choosing and Testing a Cloud Vendor post when I helped a company analyze the best VM architecture for their SaaS hosting needs.

Didn’t know about this? You might have missed out on some of the other things we talk about in our Fundamentals of Database Administration class.


Updated Fundamentals of Database Administration Class: Just $39

You’re a systems administrator or developer who’s been stuck in charge of the SQL Server. You have a couple of production SQL Servers, no high availability or disaster recovery, and around 100GB of data or less on each server.

You’ve never been to a “real” SQL Server training class, and you’ve been Googling your way through problems alright, but you’re starting to realize that this SQL Server thing is kinda complicated.

Let’s get you up to speed, quickly, with the most important things to learn before they catch you by surprise and ruin your day. No theory here, only the real deal production stuff, pronto.

You’ll learn:

  • How to configure your backups
  • How data gets corrupted, and how to detect corruption before your users do
  • How to apply patches (and which ones to apply)
  • How to configure a new SQL Server
  • How to use Agent jobs, and why you should probably push back when your users ask for one

This class is opinionated. I’m not reading you the documentation out loud: I’m telling you the things that have burned me over the years, just like a senior database administrator taking you under their wing.

This is a recorded class, and you can start streaming the recordings as soon as you check out. Recorded Class Season Pass and Live Class Season Pass holders can start watching it right away, too.

The class is normally $89, but to celebrate the relaunch, it’s just $39 this month. Enjoy!


Half Of You Don’t Understand Variables and Transactions.

Do table variables and variables respect transactions? If I set the value of a variable during a transaction, and I roll it back, what happens?

I polled y’all on Twitter:

Cory Booker is concerned about your transactions

I can’t bring myself to make a joke about the popular vote, so let’s move on and just point out that the correct answer is 200,000.

What about table variables?

When I run this query, how many rows will the SELECT return?

I polled y’all on Twitter, and here were the results:

Even fewer of you knew that the correct answer is 3 rows:

Variables ignore transactions.
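Here’s a minimal repro – not the exact queries from the polls, but it shows both behaviors at once:

DECLARE @i INT = 0;
DECLARE @t TABLE (Id INT);

BEGIN TRAN;
    SET @i = 200000;
    INSERT @t (Id) VALUES (1), (2), (3);
ROLLBACK;

SELECT @i AS VariableValue;                    -- 200000: the rollback did not undo it
SELECT COUNT(*) AS TableVariableRows FROM @t;  -- 3: table variables shrug off rollbacks, too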

I can understand why so many of y’all don’t know because none of this is on the variables documentation page. I’m kidding, of course. We both know you don’t read the documentation, either.

This is why I teach training classes. You should probably take one.


sp_Blitz Pro Tips for Managing Multiple Servers Centrally

sp_Blitz is a free health check script that gives you a prioritized list of things to fix on your SQL Server. Here’s what the output looks like:

Most of the time, you’ll want to work through the warnings in the order that they show up in the results, in priority order.

How to Skip Checks Across Your Estate

But sometimes, there are checks or databases that you want to skip. For example, say this server that we’re working with here is an Availability Group primary, but I’ve chosen to offload my backups to the AG secondary. Since sp_Blitz can’t necessarily connect to all your AG replicas and figure out where you’re doing your backups, you may want to skip the backup checks on this server.

To do that, create a table to hold the list of checks you want to skip, and put rows into it:
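For example – the database and table names here are hypothetical, but sp_Blitz expects these three columns:

CREATE TABLE DBAtools.dbo.SkipChecks
(
    ServerName   NVARCHAR(128),
    DatabaseName NVARCHAR(128),
    CheckID      INT
);

INSERT INTO DBAtools.dbo.SkipChecks (ServerName, DatabaseName, CheckID)
VALUES (N'SQL2019-AG1', NULL, 1);   -- hypothetical: skip check 1 on this one server, all databases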

The CheckID column refers to the list of sp_Blitz checks by priority. You can also scroll to the right in sp_Blitz’s results and look at the CheckID column to see the number of the one you want to skip.

  • If you want to skip a check on all servers, all databases, then leave the ServerName & DatabaseName null.
  • If you want to skip a check on one server, but all the databases on it, put in its ServerName, but leave DatabaseName null.
  • If you want to skip a check on a particular database name, but all of your servers, populate the DatabaseName, but leave the ServerName null. (Like if you want to skip checks on all of your ReportServer databases.)
  • If you want to skip ALL checks on a particular database, populate the DatabaseName (and/or ServerName), but leave CheckID null.

Then, when you run sp_Blitz, point to your skip-checks table:
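Using the hypothetical names from the table above:

EXEC dbo.sp_Blitz
    @SkipChecksDatabase = N'DBAtools',
    @SkipChecksSchema   = N'dbo',
    @SkipChecksTable    = N'SkipChecks';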

And presto, no more backup warnings:

As of the August 2020 release, you can also centralize this skip checks table by putting it in a central location, setting up a linked server pointing to your central location, and then using the @SkipChecksServer parameter:
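That looks something like this – the linked server name is made up:

EXEC dbo.sp_Blitz
    @SkipChecksServer   = N'CentralMonitoringServer',
    @SkipChecksDatabase = N'DBAtools',
    @SkipChecksSchema   = N'dbo',
    @SkipChecksTable    = N'SkipChecks';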

How to Centralize Results Across Your Estate

As long as you’re centralizing your management of checks to skip, you can also centralize the results. sp_Blitz has parameters to write the output to a database, and it works across linked servers, too:
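A sketch with made-up names – @OutputServerName is the linked server pointing at your central box:

EXEC dbo.sp_Blitz
    @OutputServerName   = N'CentralMonitoringServer',
    @OutputDatabaseName = N'DBAtools',
    @OutputSchemaName   = N'dbo',
    @OutputTableName    = N'BlitzResults';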

sp_Blitz will create the table if it doesn’t already exist (but the linked server, database, and schema have to already exist.)

Set that up in a daily Agent job or run it centrally from PowerShell, and you can have all your servers phoning home into one table on a regular basis. You can query the results to slice & dice ’em by priority, or find out what problems are new in your environment today as opposed to yesterday:

The output table has the same results as sp_Blitz, but adds two columns: ServerName and CheckDate, so you can query data by server or by most recent check dates.
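For example, here’s a quick look at the last day’s findings across the estate – the table name matches the sketch above, and the other columns are sp_Blitz’s regular output columns:

SELECT ServerName, Priority, FindingsGroup, Finding, DatabaseName, Details, URL
FROM DBAtools.dbo.BlitzResults
WHERE CheckDate >= DATEADD(DAY, -1, GETDATE())
ORDER BY ServerName, Priority;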


sp_BlitzIndex Pro Tips for Consultants & Contractors

sp_BlitzIndex analyzes the design & health of your clustered & nonclustered rowstore indexes and gives you suggestions on how to get higher performance without changing your queries.

If you run it with no parameters, you get advice on the indexes in the current database:

Click to zoom

You get warnings like:

  • High value missing index – it tells you what table & columns to consider indexing, how many times the index would have been used, and how many indexes already exist on the table. That’s useful for quickly deciding whether you should consider adding that index as-is, or whether you might need to review the other indexes that exist on the table already, and maybe de-duplicate them.
  • Borderline duplicate indexes: in that screenshot, the Badges table has two indexes that both start with UserId. One of them can be flat out dropped because it’s a narrower subset of the wider index, and it’s not getting any reads anyway. Dropping it will cut 27M rows and 380MB from the database, making your maintenance jobs faster, too.

The defaults are a great start, but if you’re a consultant or contractor, you can do way better with a few parameters.

How to list all of the indexes in all databases:
sp_BlitzIndex @Mode = 2, @GetAllDatabases = 1

When I want to work on a client’s indexes, but I don’t want to be connected directly to their server, this is a lifesaver. I can dump a complete inventory of their indexes out into a spreadsheet, and then I have all kinds of details about the indexes that already exist:

Click to embiggen

Earlier, when I said the “high value missing index” warnings meant I may need to add a new index to a table – well, I need to find out what indexes already exist on that table. Armed with the contents of this inventory, I can answer that question without being connected to the client’s server.

There are dozens of columns in this, but here are some of my favorites:

  • Rows, Reserved MB – I sort these in descending order and ask the client, “Were you aware that you’ve got 75GB in this table called Sales_Backup, and 50GB in a table called WebSiteClickLog? Should we maybe clean those out and cut our backup & restore times?” Every single time, folks are just shocked about what they’ve got left lying around in the database.
  • Total Reads, User Updates – when I put this into Excel, I’ll filter for Total Reads = 0, and then sort by User Updates descending. This helps me identify the indexes I need to drop quickly in order to reduce blocking problems. Speaking of which…
  • Row Lock Wait ms, Page Lock Wait ms – along with averages on these, too, which makes it really easy to spot where my blocking is happening. When combined with the Aggressive Indexes warning, it’s a piece of cake to solve a client’s blocking issues just armed with the contents of this sp_BlitzIndex output.
  • Page IO Latch Wait ms – ever wonder which tables are getting scanned on disk and not cached in RAM? Well, now you know.
  • Drop TSQL, Create TSQL – when you need to build a list of commands to run at the client, this gets you a head start, plus helps you build the undo script to back out your changes.

You can also use the @OutputDatabaseName, @OutputSchemaName, and @OutputTableName parameters to write the @Mode = 2 results into a table. This is useful if you need to track index utilization over time to see whether an index is really getting used, or whether it should be dropped. (For example, some of my clients run this daily in an Agent job, and that way they can check index usage even if a server has been rebooted.)
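A sketch of that setup, with hypothetical output names:

EXEC dbo.sp_BlitzIndex
    @Mode = 2,
    @GetAllDatabases = 1,
    @OutputDatabaseName = N'DBAtools',
    @OutputSchemaName   = N'dbo',
    @OutputTableName    = N'BlitzIndex_Mode2';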

How to get all missing indexes in all databases:
sp_BlitzIndex @Mode = 3, @GetAllDatabases = 1

Clippy’s missing index suggestions aren’t great, but they’re sure better than trying to come up with ideas yourself when you’re on a time crunch. Exporting them all out makes it easier for you to do performance tuning disconnected, too:

Click to anti-shrink

Note that at the top, you also get the server’s days of uptime. This way, you can be sure that what you’re looking at is safe: you wouldn’t want to base your missing index recommendations on a server that’s only been up briefly, or one that’s been up too long (since the queries might have only run weeks ago, not in the current app build.)


How to get all this in Excel:
The Consultant Toolkit

I don’t have time to connect to a client’s server, run a bunch of diagnostic scripts, and copy/paste the data into Excel. That’s why we built The Consultant Toolkit: it does all that for you.

I simply have my clients run it – client-friendly instructions are included – and then send me the output zip file that it creates. (You can even have it automatically upload the zip file to your Amazon S3 bucket if you like.) Then, I’ve got:

  • An Excel file with all kinds of diagnostics, including sp_Blitz, sp_BlitzCache, wait stats, deadlocks, and more
  • Execution plans
  • Deadlock graphs

And I can start getting to work making their server faster without even connecting to it. This helps me deliver more value to clients, more quickly. Here’s how it works:

This week, it’s a lower cost to you, too: add it to your cart and then use coupon code StartConsulting for 75% off your first year, bringing it down to just $123.75. European Union customers can buy here, and the coupon code works there too.


Updated First Responder Kit and Consultant Toolkit for August 2020

A couple of big new features this month: sp_Blitz’s @SkipChecksServer parameter lets you centralize a list of checks that you want to skip across different servers, and sp_BlitzIndex’s missing index recommendation list now includes the datatypes, making it easier to do index analysis for clients even when you’re not connected to their servers. I have two blog posts queued up for next week to get you up to speed.

Known issue: sp_Blitz does not work in Amazon RDS in this version. It’s been corrected in the dev branch and will be in the September release.

To get the new version:

Consultant Toolkit Changes

I updated it to this month’s First Responder Kit, but no changes to querymanifest.json or the spreadsheet. If you’ve customized those, no changes are necessary this month: just copy the /Resources/SQLServerSetup checklist folder over your existing files to replace the SQL scripts.

sp_AllNightLog Changes

  • Fix: if @BackupPath is set to null during installation, throws an error and stops now rather than failing silently. (#2474, thanks Curtis Browne.)

sp_Blitz Changes

  • Improvement: @SkipChecksServer is now implemented, so you can have sp_Blitz check a centralized SkipChecks table for a list of checks, databases, and servers to skip across your entire estate. (#2496, thanks Pierre Letter.)
  • Fix: reprioritize a few of the checks so that I can say with a straight face that you should drop everything and focus if you have alerts between priorities 1-50. (#2492)

sp_BlitzCache Changes

  • Improvement: add plan_generation_num to ExpertMode output and to the logged table results. Adds the column to existing tables if necessary. (#2514)
  • Improvement: add query_hash to the default output. Been using this a lot lately during my parameter sniffing classes. (#2513)
  • Fix: @SortOrder = ‘query hash’ wasn’t working on case-sensitive servers. (#2482, thanks Adrian Buckman.)
  • Fix: USERSTORE_TOKENPERM calculation was incorrect for pages_kb. (#2487, thanks Mike Vernon.)
  • Fix: Single-use plans were reporting over 100%. (#2480, thanks Simon Yeats.)

sp_BlitzIndex Changes

  • Improvement: added new @ShowAllMissingIndexRequests parameter to show ’em even if the query was only run a few times, or had low impact. (#2498, thanks valued customer Greg Dodd.)
  • Improvement: added data types to missing index requests. (#2227 and #2512, thanks Greg Dodd and Oleg Strutinskii.)

sp_BlitzLock Changes

  • Fix: databases & tables with periods in the name wouldn’t show correctly in the detail results. (#2452, thanks Filip Cornelissen.)

sp_BlitzWho Changes

  • Fix: sleeping sessions no longer show as accruing wait time. (#2465, thanks Adrian Buckman.)
  • Fix: parallel blocked sessions weren’t showing the blocking spid. (#2489, thanks Adrian Buckman.)

sp_ineachdb Changes

SqlServerVersions Changes

  • Fix: The SqlServerVersions table has been updated with a few SQL Server 2012 security patches. Not new, just missed ’em in the original table. (#2481, thanks TheVladdestVlad.)
  • Fix: Uninstall now removes the SqlServerVersions table. (#2502, thanks rebambus, and congrats on your first Github pull request ever!)

For Support

When you have questions about how the tools work, talk with the community in the #FirstResponderKit Slack channel. If you need a free invite, hit SQLslack.com. Be patient – it’s staffed with volunteers who have day jobs.

When you find a bug or want something changed, read the contributing.md file.

When you have a question about what the scripts found, first make sure you read the “More Details” URL for any warning you find. We put a lot of work into documentation, and we wouldn’t want someone to yell at you to go read the fine manual. After that, when you’ve still got questions about how something works in SQL Server, post a question at DBA.StackExchange.com and the community (that includes me!) will help. Include exact errors and any applicable screenshots, your SQL Server version number (including the build #), and the version of the tool you’re working with.

Watch Me Working On It

If you’d like to see me working on a few of this month’s issues, you can check out my live streams every Saturday & Sunday morning from 9AM-11AM Eastern (iCal) on my Twitch, YouTube, or Facebook channels.

 

