it’s not glamorous

And on your list of things that aren’t going fast enough, it’s probably pretty low. But you can get some pretty dramatic gains from indexes that cover columns you’re performing aggregations on.

We’ll take a quick walk down demo lane in a moment, using the Stack Overflow database.

query outta nowhere!



SELECT [v].[UserId], [v].[BountyAmount], SUM([v].[BountyAmount]) AS [BountyTotal]
FROM [dbo].[Votes] AS [v]
WHERE [v].[BountyAmount] IS NOT NULL
GROUP BY [v].[UserId], [v].[BountyAmount]

Looking at the plan, it’s pretty easy to see what happened. Since the data is not ordered by an index (the clustered index on this table is on an Id column not referenced here), a Hash Match Aggregate was chosen, and off we went.

Look how much fun we're having.

Look how much fun we’re having.

Zooming in a bit on the Hash Match, this is what it’s doing. It should look pretty familiar to you if you’ve ever seen a Hash Match used to JOIN columns. The only difference here is that the Hash table is built, scanned, and output. When used in a JOIN, a Probe is also built to match the Residual buckets, and then the results are output.

It's basically wiping its hands on its pants.

It’s basically wiping its hands on its pants.

It took quite a bit of activity to do a pretty simple thing.

Table 'Votes'. Scan count 5, logical reads 315406, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 3609 ms, elapsed time = 1136 ms.

Since this query is simple, our index is simple.

[BountyAmount], [UserId]

I’m using the BountyAmount column in the first position because we’re also filtering on it in the query. We don’t really care about the SUM of all NULLs.

Taking that new index out for a spin, what do we end up with?

Stream Theater

Stream Theater

The Hash Match Aggregate has been replaced with a Stream Aggregate, and the Scan of the Clustered Index has been replaced with a Seek of the Non-Clustered Index. This all took significantly less work:

Table 'Votes'. Scan count 1, logical reads 335, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 278 ms.

Zooming in on the Stream Aggregate operator, because we gave the Hash Match so much attention. Good behavior should be rewarded.

You make it look so easy, Stream Aggregate.

Filters, filters, filters

If we want to take it a step further, we can filter the index to avoid the NULLs all together.

[BountyAmount], [UserId]
) WHERE [BountyAmount] IS NOT NULL

This results in very slightly reduced CPU and IO. The real advantage of filtering the index here is that it takes up nearly 2 GB less space than without the filter. Collect two drinks from your SAN admin.

Table 'Votes'. Scan count 1, logical reads 333, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 233 ms.

And, because I knew you’d ask, I did try making the same index with the column order reversed. It was not more efficient, because it ended up doing a Scan of the Non-Clustered Index instead, which results in a bit more CPU time.

If you like this sort of thing, you might be interested in our Advanced Querying & Indexing class this August in Portland, OR.

4 comments ↑ Back to top

Growing up as a young boy, my elementary teachers always warned me about the perils of putting stuff in master. But what’s the real deal? Why is this such a bad idea?

System databases are often located on space-limited drives. The default installation parameters for SQL Server throw master, model, and msdb all on the C drive. I’d rather not create tables that might grow when I’m not looking, and then possibly run my server’s limited boot drive out of space. Starting Windows with a full C drive is not a lot of fun.

In the event of a disaster, we don’t usually restore master. While you can indeed restore one server’s master database over to another, it’s just not a good idea. You’re failing over to another server because things are already going wrong – why make things worse by doing something you rarely practice? Plus, if you’re failing over due to corruption, you probably don’t want to bring over a possibly corrupt master database.

Even if you could restore it, you’ll lose data. You can’t do transaction log backups for the master database, so you’ll lose whatever changes were made since the last full backup.

Now, having said all this, I’m completely okay with putting utility stored procedures in the master database – things like sp_WhoIsActive, sp_Blitz®, and the like. After all, those objects take hardly any space, and we don’t need to restore them over to our DR environment.

Kendra says: You may have more in your master database than you think! It just takes one ‘oops’ on a deployment script, and there you go.


We just launched our new DBA’s Guide to SQL Server High Availability and Disaster Recovery – a 6-hour online video course that teaches you about clustering, AlwaysOn AGs, quorum, database mirroring, log shipping, and more.

7 comments ↑ Back to top

But Kendra, it can’t be that hard… after all, we have synchronous modes in Database Mirroring and Availability Groups, right?

Synchronous Commit doesn’t mean “zero data loss”

When we think about limiting data loss, the first thing we think of is a technology that lets us reduce points of failure. If every transaction must be written to two separate storage systems, we have a pretty good chance to have no data loss, right?


Let’s say you’re using a SQL Server Availability Group in SQL Server 2014 with a synchronous replica to do this. The secondary replica fails and is offline, but you don’t lose quorum. If you want 0 data loss, the primary needs to stop accepting writes immediately, right?

It doesn’t do that. The primary replica keeps going and writes can continue. Here’s what that looks like:

Synchronous AG Exposed

Tabloid headline: Synchronous AG Running Exposed! Scandal!

You could write custom scripts to detect the situation and stop the primary replica, but there’s a couple of problems with that. First, you’re offline, and you probably don’t want that. And second, it’s going to take some time to get that done, and that means that you don’t have zero data loss– you could lose anything written in the meanwhile. You could add another synchronous commit replica, but there’s obvious cost and support impacts, and you still aren’t guaranteed zero data loss.

Synchronous writes don’t necessarily guarantee zero data loss, you’ve got to dig into the details.

This stuff isn’t obvious

I’ll be straight up: I’ve been working with high availability and disaster recovery for a long time, and I hadn’t actually thought very critically about this until a recent chat room conversation with Brent discussing why it’s not super easy for cloud hosting providers to offer zero data loss in return for a lot of dollar signs.

Crazy facts: you can learn things from chat rooms and from the cloud. Who knew?


If data loss is important to you, don’t just assume that you’ve got it under control because you’re paying a vendor to take care of it. If you look closely, you may find that nothing’s working like you think! When your data is important, you need to make someone responsible for ensuring that you’re meeting your RPO and RTO, and have them prove that it works on a scheduled basis. Their job title doesn’t have to be “Database Administrator,” but they need to work for you, and they need to take their responsibility seriously.

Want to Learn More About High Availability and Disaster Recovery?

We just launched our new DBA’s Guide to SQL Server High Availability and Disaster Recovery – a 6-hour online video course that teaches you about clustering, AlwaysOn AGs, quorum, database mirroring, log shipping, and more.

1 comment ↑ Back to top

With SQL Server AlwaysOn Availability Groups, you can offload backups to a replica rather than running them on the primary. Here’s how to do it:

1. Install Ola Hallengren’s utility scripts on all of the replicas. During the install, it creates a laundry list of SQL Agent jobs, but doesn’t set up schedules for any of them. More on that in a couple of steps.

Backup preference settings

Backup preference settings

2. Set your backup preferences. In SSMS, right-click on your Availability Group, click Properties, and click the Backup Preferences pane.

The first option, “Prefer Secondary,” means that your backups will be taken on a secondary server unless all secondaries go offline, at which point they’ll be taken on the primary. There’s some risk here: if communication falls behind, your secondary may be running backups of old data, as Anthony Nocentino explains. In that case, you won’t get backup failure alerts, but you won’t be able to meet your RPO. Monitoring for that is an exercise left for the reader.

In the “Replica backup priorities” window, rank your replicas to choose who will do the backups first.

Say I have three servers – two in my primary data center, and one in my disaster recovery site. I’d rather have my full backups running in my primary data center because if I need to do a restore, I want the backups nearby. (You can also run backups in both places – and I would – but more on that in a future post.)

To configure that, I’d set priorities as:

  • SQLPROD1 and SQLPROD2 (my primary data center replicas) – both 50 priority
  • SQLDR1 (my disaster recovery replica) – priority 40

3. Test your backup preferences. Run this query on each replica:

SELECT d.database_name, 
  sys.fn_hadr_backup_is_preferred_replica (d.database_name) AS IsPreferredBackupReplicaNow
FROM sys.availability_databases_cluster d

This returns a list of databases in an AG, and whether or not they’re the preferred backup replica right now. Check that on all of your replicas to make sure backups are going to run where you expect, and if not, revisit your backup preferences in the last step.

4. Configure Ola’s Agent full backup jobs. On any replica, in the Agent job list, right-click on the “DatabaseBackup – USER_DATABASES – FULL” job, click Properties, click Steps, and edit the first step. By default, it looks like this:

sqlcmd -E -S $(ESCAPE_SQUOTE(SRVR)) -d master -Q “EXECUTE [dbo].[DatabaseBackup] @Databases = ‘USER_DATABASES’, @Directory = N’C:\Backup’, @BackupType = ‘FULL’, @Verify = ‘Y’, @CleanupTime = NULL, @CheckSum = ‘Y’, @LogToTable = ‘Y'” -b

You need to change these parts:

  • @Directory – set this to your backup path. I like using a UNC path that all of the replicas can access.
  • @Verify – I’d recommend turning this off to make your backup jobs go faster. If you really want to verify your backups, restore them on another server.
  • If you want to run the backups on a secondary replica rather than the primary, add a parameter for @CopyOnly=’Y’
  • If you only want to back up specific databases, modify the @Databases parameter. I don’t like doing that – I’d rather have one job for all of my backups. The way this is set now (USER_DATABASES), this one job will back up all of my databases that aren’t in an AG, plus it’ll back up the AG-contained databases where this replica is the preferred backup right now.

With AlwaysOn AGs, replicas can only run copy-only backups, and people often think that’s a problem. It’s only a problem if you want to use differential backups – otherwise, with AGs, it doesn’t affect transaction log handling at all. I don’t recommend using differentials with AlwaysOn AGs, but if you insist on doing it, you’ll be running your full backups on your primary replica.

Other parameters you may want to set:

  • @CleanupTime – how many hours of backup files you want to keep around
  • @Compress – Ola inherits the default compression setting at the server level, but you can hard code this to Y if you want to make sure backups are always compressed

So here’s what my Agent job script ends up looking like, with my changed parts in bold:

sqlcmd -E -S $(ESCAPE_SQUOTE(SRVR)) -d master -Q “EXECUTE [dbo].[DatabaseBackup] @Databases = ‘USER_DATABASES’, @Directory = N’\\FILESERVER1\SQLBackups’@CopyOnly=’Y’, @CleanupTime=48, @Compress=’Y’, @BackupType = ‘FULL’, @Verify = ‘N’, @CleanupTime = NULL, @CheckSum = ‘Y’, @LogToTable = ‘Y'” -b

5. Copy these changes to all replicas and test the full jobs. Make sure each replica that isn’t supposed to run the fulls, doesn’t, and the replica that IS supposed to run the fulls, DOES. In really mission-critical environments where we’re building the new AG servers from the ground up, we actually fail the AG around to different servers to test behavior when different servers are the primary – and when entire data centers go down.

6. Repeat the setup with your log backup jobs. Right-click on the “DatabaseBackup – USER_DATABASES – LOG” job and click Properties, Steps, and edit the first step. Set the @Directory and @Verify steps as we did earlier.

Here’s where things get a little tricky – you don’t have to add @CopyOnly=’Y’ for the log backup steps. There’s no such thing as a copy-only log backup in an Availability Group secondary, much to my dismay.

You might also consider setting the @ChangeBackupType parameter to Y. By default, if Ola can’t do a transaction log backup (like if it’s a brand new database that has never had a full backup before), then the log backup is skipped. If you set @ChangeBackupType=’Y’, then Ola will do a full backup in that situation, and then do a log backup. However, if it’s a large database, this might take a while to perform the full, and this will tie up your log backup job while it runs. Say the full takes 20 minutes to perform – this might blow your RPO/RTO commitments.

7. Copy these changes to all replicas and test the jobs. Same thing here that we did with the full backups.

8. Configure your DBCC CHECKDB jobs. You need to check for corruption on any server where you’re running backups – here’s why.

9. Design your monitoring. Sadly, SQL Server doesn’t centralize backup history, so it’s up to you to poll all of your replicas to find out where backups are happening for any given Availability Group. In one case, I had a DBA change the backup preferences and Ola’s job settings incorrectly, and all of the backup jobs were succeeding – but none of them were backing up one of his Availability Groups.

10. Set yourself a weekly reminder to test restores. AG backups are notoriously complex, and if you cared enough to set up this whole expensive infrastructure, then you should care enough to test it. Make sure you have a good, restorable set of backups.

Kendra says: In a complex environment, I’m a fan of some paid third party tools that help you control the compression on your backups and which can keep a central repository of backup history, to help you monitor. Good news: Ola’s solution integrates with them! You can have your backup cake and eat it, too.

Want to Learn More About AlwaysOn Availability Groups?

We just launched our new DBA’s Guide to SQL Server High Availability and Disaster Recovery – a 6-hour online video course that teaches you about clustering, AlwaysOn AGs, quorum, database mirroring, log shipping, and more.

14 comments ↑ Back to top

HADRQuizThink you know everything there is to know about high availability and disaster recovery?

Head on over to our new free quiz to find out! Take the SQL Server High Availability and Disaster Recovery Quiz.

0 comments ↑ Back to top

This morning, Microsoft announced a new standalone SQL Server Management Studio download:

This is our first effort to release SQL Server Management Studio (SSMS) in a mechanism outside of the SQL Engine releases. Our goal is to update this frequently with new features, fixes and support for the newest SQL Server features in SQL Server Engine and Azure SQL Database.

Awww yeah! Lightweight installer, and here’s the best part: there’s a built-in check for updates.

Check for Updates in SSMS

Check for Updates in SSMS

That’s right: Microsoft might start shipping updates to SSMS outside of the regular SQL Server Service Pack scenario!

This telegraphs a couple of interesting things: first, they had to do it because Azure SQL Database ships updates much faster than SQL Server does, so this gives Microsoft a way to enable those updates in SSMS without releasing a separate service pack.

But more interestingly – to me at least – is that this means SQL Server Management Studio is now considered The Way to Manage Azure SQL Database. There isn’t a separate tool coming, nor will designer/developer tools like Visual Studio become a management tool.

SSMS is finally getting some much-needed love. Rejoice, dear readers.

25 comments ↑ Back to top

It’s your fault if the SQL Server goes down.

You don’t build a lot of clusters or AGs, and you aren’t confident that your SQL Servers will survive a patching problem, a SAN outage, or a corruption hiccup. Brent and Kendra’s new 6-hour online course has advanced modules on:

  • Choosing the right HA/DR features for your needs
  • Learning pitfalls to avoid so you can be confident your server will stay up
  • Troubleshooting advanced features when they fall apart

And to celebrate, it’s 50% off Tuesday only – and the discount will drop by 10% each day.

  • Wednesday, 40% off
  • Thursday, 30% off
  • Friday, 20% off
  • Saturday, just 10% off

Every day, the price goes up. Isn’t shrinking awful?

Better move fast. And hey, it’s better to ask for forgiveness than permission. The clock is ticking down for the next price increase…

2 comments ↑ Back to top

When the DBA reports to you, ask yourself two questions:

  1. “What would I fire the DBA for?”
  2. “How can I prove that they’re not doing that?”

There’s a lot of possible answers to #1, many of which involve cocaine and the office copier, but let’s focus on one I hear a lot: an inability to restore data that they claimed was backed up.

Backups are the foundation of our Hierarchy of Database Needs. They’re incredibly important, but yet they often get ignored because nobody’s banging on the DBA’s cubicle telling them to focus on backups. Instead, users are screaming about performance or too-tight security.

It’s up to you, their manager, to enable them to get the basics down. Here’s how to do it.

1. Ask for a report of most recent backups, by database.

KPI: Coffee Cups Consumed (by André Freitas)

KPI: Coffee Cups Consumed (by André Freitas)

Call in the DBA and give them a list of columns you want in the report:

  • Server name
  • Database name
  • Database size
  • Purpose (production, QA, development)
  • Last full backup finish date/time
  • Last log backup finish date/time
  • Last successful DBCC CHECKDB finish date/time
  • Last date/time the backups were restored for testing

They can produce the report in any format that works for them, but they have to be able to produce it on demand with less than five minutes of manual work. (Ideally, they’d give you a solution you can run on your own, without telling them, but let’s start with something easy.)

Tell them this first task is about getting the lay of the land first, not fixing the problems. Oh, they’re going to find problems alright – but focus on just getting a clear picture of the work ahead first.

This task may take them a day or two because they have to assemble an exact list of the servers they’re responsible for, plus build the report process. (The good DBAs reading this are thinking, “I could do this in fifteen minutes!” If you’re just saying that now, that means you haven’t done it yet, and you’re not quite as good as you think. The great DBAs are thinking, “Yeah, that’s about how long it took me to get it right the first time.”)

2. Help the DBA come up with an improvement plan.

They’re going to be ashamed/disappointed by some of these numbers. (Or if they’re not, there’s your answer about their fit for the DBA position.)

Let them build a back-of-the-napkin mini-project plan to improve the metrics for your most important server. Give them that time – typically there’s a day or two of work involved – and after they finish, review what worked about the mini-project and what didn’t. Use that knowledge to build out a plan for the rest of the servers.

Your job is to run interference for them while they get the backups and DBCCs under control. When the DBA gets “urgent” help requests from end users, here’s what I coach them to say:

“Right now, I’m focused on a project to get our backups under control. If this issue is more important than getting our production data backed up, let’s go into my manager’s office together and talk through it. If not, I hate to say this, but I’ll need you to file a help desk ticket and I’ll get to it once we’ve got the backups under control.”

Yes, this means admitting that the backups aren’t under control. It’s the only way end users will accept a “no” from the DBA.

After this project finishes, run the report again, talk through the results, and then take the results to the business users who are storing data on these servers.

3. Ask the business if this amount of data loss is acceptable.

Take the backup numbers to the business along with our free RPO/RTO Worksheet. By default, the business usually wants zero data loss and zero downtime, but as good IT managers, we have to communicate the realities of today’s databases.

Armed with the costs and timelines in that worksheet, bring written RPO goals back to the DBA. (Yes, to be a good manager takes work, and these are political problems, not technical problems – save your DBA for the technical problems.) Sometimes, this will mean bringing budget numbers back as well – if the business wants tighter RPO/RTO goals than the current infrastructure provides, the DBA will need resources to build the infrastructure improvements.

4. Measure the gap between IT’s RPO goals and the DBA’s delivery.

Remember that report the DBA built? Add two columns:

  • IT’s goal for this database’s data loss (in minutes/hours) – you and the DBA come to an agreement on a realistic goal. In underfunded shops, this is sadly less ambitious than the business’s goals. (Example: “The business wants us to lose no more than 1 minute of data, but I understand that they’re not giving you space for log backups, so let’s aim for losing no more than 1 hour of data.”) In well-funded shops, the IT goal may actually be more ambitious than the business’s goal.
  • The gap between the DBA’s delivery and IT’s goal – if the last successful backup was 45 minutes ago, but we’re aiming for <30 minutes of data loss, we’re in the red by 15 minutes. This metric is how we measure DBAs.

As long as this gap is in the black, the DBA is doing a good job on backups. (Note: I’m not even talking about restore tests or DBCCs here – this is just what I consider table stakes on managing a DBA.)

When this number goes into the red, the DBA should stop focusing on anything else – anything – and focus exclusively on getting the data protected again.

For bonus points, trend the total server count, database count, and total data size over time. This helps your user base understand that the infrastructure will come under more pressure as the data size grows – you can only cram so much data into a server before performance issues arise.

In subsequent posts in this series, I’ll be exploring how to measure a DBA’s workload, the infrastructure’s complexity, RTO delivery, and other boring management topics. Buckle up.

Want to Learn More About Keeping Your Databases Online?

We just launched our new DBA’s Guide to SQL Server High Availability and Disaster Recovery – a 6-hour online video course that teaches you about clustering, AlwaysOn AGs, quorum, database mirroring, log shipping, and more.

26 comments ↑ Back to top

We’ve just updated our First Responder Kit to include new versions of these tools:

sp_Blitz® v41 – June 18, 2015:

  • Added check 162 for CMEMTHREAD waits on servers with >= 8 logical processors per NUMA node.
  • Added check 159 for NUMA nodes reporting dangerously low memory in sys.dm_os_nodes.
  • Added check 161 for a high number of cached plans per KB 3026083.
  • Fixed a bug in the SkipChecks routines. Reported by Kevin Collins.
  • Backup-to-same-drive-as-databases check (93) now includes the number of backups that were done so you can tell if it was a one-off problem, or if all backups are going to the wrong place.
  • Bug fixes and improvements.

sp_BlitzCache™ v2.4.6 – June 18, 2015:

  • Temporary object cleanup will actually occur – thanks to Bob Klimes for
    spotting this
  • Adding memory grants to expert mode and export to excel mode
  • Parent object name is now displayed next to statements that come from a
    stored procedure
  • Run clean up in ##bou_BlitzCacheProcs before executing – this should
    prevent duplicate records from building up over multiple executions on
    the same SPID.
  • Added a @sql_handle parameter to filter on queries from a specific
    sql_handle or sql_handles
  • Added support for filtering on database name

How will the sp_BlitzCache™ changes look?

When you have a stored procedure with an expensive statement inside it, you’ll now see the parent procedure’s name in the Query Type column:


Database name filtering works just like you’d expect – use the @database_name parameter to limit the analysis to a single database.

EXEC sp_BlitzCache @hide_summary = 1, 
                   @results = 'expert', 
                   @database_name = 'AdventureWorks2014' ;

And, finally, if you want to look for a single stored procedure or batch of statements, you can grab the SQL handle. A SQL handle identifies a procedure or batch, but they’re only included in the expert mode results. Use @results = 'expert' and scroll very far to the right in the results. You’ll see a bunch of binary gibberish in there – one of the columns will be SQL Handle. Grab that and add a @sql_handle parameter to your execution of sp_BlitzCache™, just like this:

EXEC sp_BlitzCache @hide_summary = 1, 
                   @results = 'expert', 
                   @sql_handle = '0x030005006D9E2E1F6CCBE500BBA4000001000000000000000000000000000000000000000000000000000000';

What next?

If you’ve already subscribed to email updates about our tools, you got an email this morning with a direct link to the zip file. For the rest of you, hit up the download/EULA page, and enjoy! Hope this makes your database work easier.

13 comments ↑ Back to top
Commit message: Duck!

Commit message: Duck!

Everyone’s code is terrible. But exactly how terrible is yours?

First Circle: Code Limbo

We can’t say this is good code, but for the most part nobody notices it.

Second Circle: Code Lust

Cool function, bro!

Third Circle: Gluttonous Code

There’s always one simple piece of logic that’s been written hundreds of times, all over the codebase, but in slightly different ways. And then, one day, you need to change it.

Fourth Circle: Angry Code

Your application has periodic timeouts, which disappear as soon as users report them.

Fifth Circle: Greedy Code

This code was only supposed to run once when a user logged in and their profile was loaded. But sp_BlitzCache® says it runs 4,000 times per minute when you’re hardly doing anything.

Sixth Circle: Code Heresy

This code looks incredibly bad. It violates common sense, but it works right now. You’re afraid to touch it, for fear it will burst into flames.

Seventh Circle: Violent Code

Your application is deadlocking, and literally killing itself.

Eighth Circle: Code Fraud

That day you find out that large portions of your source code don’t actually belong to your company. And now you need to fix it.

Ninth Circle: Treacherous Code

When your database corrupts itself.

8 comments ↑ Back to top