I’ve been doing SQL Critical Care® work with clients for over a year now. It’s absolutely true that every client’s situation is different and needs special attention. However, I’ve found that there’s a short list of issues that nearly all clients have in common. Drawing from that list, here are the top three high-impact, low-effort areas you can work on today to tune up your SQL Server.
Check power savings everywhere
I recently had a client whose CPU would peg at 100% for short stretches even though the server was under a light load. After some exploration, we found they had power savings turned on. In the BIOS. Because we couldn’t get to the BIOS remotely (and we didn’t want to reboot the server right then and there), we used a free tool called CPU-Z to watch core speeds. Can you spot the difference in speeds?
That’s a whole lot of CPU speed you paid for and aren’t getting! When checking power savings, make sure you check all of the following:
- Windows Power Options (under Control Panel)
- If a physical server, also check the BIOS at startup
- If a virtual server, check:
- hypervisor power settings/performance plan
- BIOS at startup of the hypervisor host server
One other thing: make sure you don’t have outdated BIOS firmware. That can have some nasty CPU consequences too.
Changes to the BIOS require a server restart, so plan accordingly.
How to tell if it worked: Using CPU-Z, watch the Core Speed box at lower left. If it deviates from the core speed in the Specification box by more than approximately 1%, there’s probably power savings turned on somewhere.
Address hot missing index requests
Every time a query runs and wants an index that doesn’t exist, SQL Server files that missing index request away. You need to know which missing indexes are being requested and how helpful that index will be. There are DMVs you can query for this information, but my favorite method is sp_BlitzIndex®. It will tell you the missing index definition, as well as three numbers:
- Impact – the difference this index is expected to make in query cost.
- Avg. Query Cost – the price in magical Microsoft Query Bucks as determined by SQL Server.
- Uses – the number of times this index would have been used.
The product of these three numbers (Impact x Cost x Uses) equals the Estimated Benefit.
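If you'd rather peek at the raw DMVs, here's a rough sketch of the same arithmetic. It assumes "Uses" means seeks plus scans, which may not match sp_BlitzIndex®'s exact calculation, and the column aliases are mine.

```sql
-- Sketch: approximate Impact x Cost x Uses from the missing index DMVs.
-- Assumption: "Uses" = user_seeks + user_scans.
SELECT TOP (20)
    d.statement                                   AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.avg_user_impact                             AS impact,
    s.avg_total_user_cost                         AS avg_query_cost,
    s.user_seeks + s.user_scans                   AS uses,
    s.avg_user_impact * s.avg_total_user_cost
        * (s.user_seeks + s.user_scans)           AS estimated_benefit
FROM sys.dm_db_missing_index_group_stats AS s
JOIN sys.dm_db_missing_index_groups AS g
    ON g.index_group_handle = s.group_handle
JOIN sys.dm_db_missing_index_details AS d
    ON d.index_handle = g.index_handle
ORDER BY estimated_benefit DESC;
```

Treat the output as a list of candidates to test, not a to-do list.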
Because these numbers get reset with every service restart, we need to factor in uptime as well. If you’ve been up for seven days or more with your average user load during that time:
- Est. Benefit > 1,000,000: Keep an eye on this index.
- Est. Benefit > 10,000,000: Try it out in a dev environment and see how it does.
- Est. Benefit > 100,000,000: Try it out in a dev environment — TODAY.
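Not sure how long the instance has been up? A quick sketch for checking, using the start time SQL Server records for itself (the days_since_restart alias is mine):

```sql
-- When did the SQL Server service last start?
-- DATEDIFF(DAY, ...) counts calendar-day boundaries crossed, not full 24-hour periods.
SELECT sqlserver_start_time,
       DATEDIFF(DAY, sqlserver_start_time, SYSDATETIME()) AS days_since_restart
FROM sys.dm_os_sys_info;
```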
I can’t recommend outright that you deploy these missing indexes to production, just that you test them. One reason: it may be more efficient to alter a very similar existing index rather than add the new one. You’ll have to decide what’s best, but don’t let the 100M+ indexes go unnoticed.
And in case you’re wondering, the record high I’ve seen for the Estimated Benefit number is just over 14,000,000,000 (billion). Let’s hope you can’t beat that. (You don’t want to.)
How to tell if it worked: The missing index recommendation should go away in sp_BlitzIndex® results, and the query that asked for it should decrease in CPU time and cost.
Raise cost threshold for parallelism
SQL Server’s default Cost Threshold for Parallelism (CTP) is 5. Microsoft has left this setting untouched since the 20th century.
(Don’t laugh — that was at least sixteen years ago.)
Face it — pretty much everything that old needs an update. Your hairstyle. Your clothes. Your car. Your server’s CTP is no different. CPUs are way more powerful today than they were in the late 90’s. With CTP set to 5, chances are a lot of queries are going parallel when they don’t have to. Raise the bar to 25 or even 50 (as always, test this in dev first). Unlike licensing additional cores, changing CTP settings is absolutely free.
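Here's a minimal sketch of the change itself. The value 50 is just the upper end of the range suggested above, not a universal recommendation, and the script assumes you have permission to run sp_configure.

```sql
-- Check the value currently in use.
SELECT name, value_in_use
FROM sys.configurations
WHERE name = N'cost threshold for parallelism';
GO
-- CTP is an advanced option, so advanced options must be visible before changing it.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
GO
-- Assumption: 50 is the value you settled on after testing in dev.
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;
GO
```

The change takes effect without a service restart.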
How to tell if it worked: You should see a drop in CXPACKET waits, along with some query plans no longer showing parallelism.
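One hedged way to watch for that drop: the wait stats DMV is cumulative since the last restart (or since the stats were last cleared), so capture the numbers before the change and again after a comparable stretch of workload, then compare the difference.

```sql
-- Snapshot of CXPACKET waits; run before and after the change and compare.
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'CXPACKET';
```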
There you have it: three simple and quick checks to speed up your SQL Server. For more ideas, try Five SQL Server Settings to Change.
Brent says: Listen, people, this is free advice. Don’t pay us to tell you this. Just do it today. Then pay us anyway. Actually, ignore this post.
At first glance, SQL Server’s transactional replication seems like it’s useful for moving data around in all sorts of situations: it works in Standard Edition, it’s not synchronous, and you can have multiple subscribers.
Why People Want Replication to Test and Pre-Production Environments
Setting up a test environment sounds simple at first. But it’s pretty tricky. Here are typical requirements:
- Data in the test environment needs to be fresh, recent production data to support query tuning
- Data and schema in the test environment need to be writable for testing purposes
- Data and schema in the test environment need to be periodically reset back to production-like configuration
The first of those requirements is why replication sounds like a great idea. There are a lot of gotchas even with that one, but the other two requirements are where replication clearly doesn’t meet the bar.
Replication isn’t Friendly to Changing Subscriber Tables
Transactional replication won’t like it if you go around changing data and schema on its subscriber tables. If you change the data, updates and deletes flowing through replication may start to fail. Well, let’s just be realistic: they WILL fail, and fixing it will be manual and a major pain point.
Schema changes are even dicier. You can’t just drop a replicated column or change its data type. You need to modify replication for that, and that means adding filters to the publication.
Resetting Replication isn’t Fun, Either
Re-initializing a replication subscriber is often very impactful on the publisher: running a replication snapshot is an intensive, time consuming process that causes CPU, IO, and lock overhead on the publisher. (Warning: this snapshot is a totally different thing from the database snapshot I talk about a bit later. This one basically copies out all the contents of the published articles on the publisher into flat files, copies everything, and then loads it into the subscriber.)
You can initialize a replication subscriber from a full backup, but that’s not lightweight or fast for large databases, and it’s very dependent on great custom scripting and timing. And usually people are looking at replication because they want to avoid big backups and restores.
What About a Creative Solution Involving Replication?
Refreshing test and staging environments is a tough problem. So tough that people often try to get creative, like this:
- ProductionDB in Denver is a transactional replication publisher on PROD01
- SubscriberDB in Portland is a transactional replication subscriber on STG01
- Periodically, some process is run against subscriber DB to reset other databases on STG01. Ideas usually involve database snapshots, backups, and custom scripts.
But when you dig into the details, this is never great. Here’s why.
Transactional Replication Impacts the Publisher, Your Developers, and Your DBAs
If there’s one thing that most DBAs and developers can agree on, it’s that replication has made them work late and curse a lot.
Transactional replication requires that every published table have a primary key. This may mean schema changes just to get it set up.
Replication slows down your software deployment, because it’s easy to mess up publishing tables when you make schema changes to them. The only way to not mess this up is to also implement replication in your development environment, which developers hate. And in our scenario, where the test database is the subscriber, what would that even mean? Recursive replication torture?
Replication impacts performance on the publisher, particularly under high load. It frequently requires tuning storage, setting up a scale out distributor ($$$ hardware and licensing $$$), fine tuning your publisher and subscriber properties, implementing custom monitoring, and basically having at least one team member obtain a Masters Degree in Replication at the School of Late Nights. I went to that school a long time ago, and I didn’t keep the yearbook, but I did write about it here.
Database Snapshots are Read-Only
There’s no such thing as a writable database snapshot in SQL Server, even though it’s an Enterprise Edition feature. (Fancy SANs have options for that, SQL Server itself does not.) That pretty much takes most of the uses out of it for a staging environment.
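To make that concrete, here's a sketch with made-up names: a database called StagingSource with a single data file whose logical name is StagingSource_Data, and a snapshot path on D:. None of these come from a real environment.

```sql
-- Hypothetical names and path; the NAME value must match the source database's logical data file name.
CREATE DATABASE StagingSource_Snap
ON (NAME = StagingSource_Data,
    FILENAME = N'D:\Snapshots\StagingSource_Snap.ss')
AS SNAPSHOT OF StagingSource;
GO
-- See how SQL Server reports the snapshot: it points back at its source database,
-- and is_read_only shows whether writes are allowed.
SELECT name, source_database_id, is_read_only
FROM sys.databases
WHERE name = N'StagingSource_Snap';
```

Testers can query the snapshot, but they can't change data or schema in it, which is exactly what a staging environment needs.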
If You’re Writing a Custom Script in the Staging Environment, Why Use Replication?
If you have to basically write a custom ETL to read from the subscriber to refresh staging databases, transaction log shipping is much easier to manage than replication, and it allows a read only standby mode on the subscriber.
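For reference, the read-only standby mode comes from how each log backup is restored. A minimal sketch, assuming a database named StagingCopy that was already restored WITH NORECOVERY or STANDBY; the database name and both paths are hypothetical:

```sql
-- Hypothetical database name and file paths.
-- STANDBY leaves the database readable between log restores;
-- the undo file holds uncommitted work so the next log can still be applied.
RESTORE LOG StagingCopy
FROM DISK = N'\\backupshare\ProductionDB\ProductionDB_log.trn'
WITH STANDBY = N'D:\Standby\StagingCopy_undo.tuf';
```

Readers get kicked out while each log restore runs, so schedule restores around the custom refresh script.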
If You’re Using Backup and Restore, Replication is Also Overkill
While it’s technically possible to back up a replication subscriber and restore it, then remove replication from it, it’s not all that great. This is a lot of complexity without a lot of payoff.
- You’ve still got the time to do the backup and restore (which you were trying to avoid in the first place)
- You haven’t validated that the database you’ve restored has a schema that matches production (it could be very different on the replication subscriber)
You’re much better off basing the restore off production in this case. Even if the pre-production and testing databases are in a different datacenter, you can optimize the types of backups used, compression and network copy time, and restore schedule. Even setting up multiple test instances that restore “in rounds” and which can be used at different times a day is often operationally preferable to the performance, monitoring, and caretaking needs of replication.
There’s No Easy Answer to this Problem
Refreshing staging and development environments doesn’t have a single right answer. This is done many different ways based on application requirements. Just don’t pin all your hopes on replication and start investing in it without looking at the details of exactly how it will work — because it’s probably not what you think it is at first. Always make sure you factor in:
- Performance impact on the publisher and hardware/storage/licensing costs to mitigate this
- Any restrictions or requirements on schema changes at the publisher and impact on code development
- Operational cost of upkeep of the technology
Don’t give up! Having staging and development environments with recent data that works well for you is totally possible, you just need to consider more technologies than replication.
We’ve got lots of resources, for money and for love. Or just for free.
- Watch Brent’s free video on HA/DR basics
- Read more free articles about replication
- Buy our 6+ hour online course on High Availability and Disaster Recovery in SQL Server
- Get custom advice from us directly about your environment in our SQL Critical Care® Service
- Attend our in-person course for Senior DBAs to learn advanced tricks to wrangle your SQL Servers
Brent says: Yeah, replication isn’t what I call “refreshing.”
But Kendra, it can’t be that hard… after all, we have synchronous modes in Database Mirroring and Availability Groups, right?
Synchronous Commit doesn’t mean “zero data loss”
When we think about limiting data loss, the first thing we think of is a technology that lets us reduce points of failure. If every transaction must be written to two separate storage systems, we have a pretty good chance to have no data loss, right?
Let’s say you’re using a SQL Server Availability Group in SQL Server 2014 with a synchronous replica to do this. The secondary replica fails and is offline, but you don’t lose quorum. If you want 0 data loss, the primary needs to stop accepting writes immediately, right?
It doesn’t do that. The primary replica keeps going and writes can continue. Here’s what that looks like:
You could write custom scripts to detect the situation and stop the primary replica, but there are a couple of problems with that. First, you’re offline, and you probably don’t want that. Second, it’s going to take some time to get that done, which means you don’t have zero data loss: you could lose anything written in the meantime. You could add another synchronous commit replica, but there are obvious cost and support impacts, and you still aren’t guaranteed zero data loss.
Synchronous writes don’t necessarily guarantee zero data loss; you’ve got to dig into the details.
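One way to dig in is to ask the primary what it currently thinks of each replica. This is a sketch rather than a monitoring solution, and the column aliases are mine. Run it on the primary replica.

```sql
-- Run on the primary: which databases on which replicas are actually synchronized right now?
SELECT ag.name                     AS availability_group,
       ar.replica_server_name,
       ar.availability_mode_desc,          -- SYNCHRONOUS_COMMIT or ASYNCHRONOUS_COMMIT
       DB_NAME(drs.database_id)    AS database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
JOIN sys.availability_groups AS ag
    ON ag.group_id = drs.group_id
ORDER BY ag.name, ar.replica_server_name, database_name;
```

A replica configured for synchronous commit that isn't reporting a synchronized state is the situation described above: the primary keeps accepting writes that exist in only one place.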
This stuff isn’t obvious
I’ll be straight up: I’ve been working with high availability and disaster recovery for a long time, and I hadn’t actually thought very critically about this until a recent chat room conversation with Brent discussing why it’s not super easy for cloud hosting providers to offer zero data loss in return for a lot of dollar signs.
Crazy facts: you can learn things from chat rooms and from the cloud. Who knew?
Need to protect your data? You need a full time employee who is responsible for that.
If data loss is important to you, don’t just assume that you’ve got it under control because you’re paying a vendor to take care of it. If you look closely, you may find that nothing’s working like you think! When your data is important, you need to make someone responsible for ensuring that you’re meeting your RPO and RTO, and have them prove that it works on a scheduled basis. Their job title doesn’t have to be “Database Administrator,” but they need to work for you, and they need to take their responsibility seriously.
Want to Learn More About High Availability and Disaster Recovery?
We just launched our new DBA’s Guide to SQL Server High Availability and Disaster Recovery – a 6-hour online video course that teaches you about clustering, AlwaysOn AGs, quorum, database mirroring, log shipping, and more.
Head on over to our new free quiz to find out! Take the SQL Server High Availability and Disaster Recovery Quiz.
When the DBA reports to you, ask yourself two questions:
- “What would I fire the DBA for?”
- “How can I prove that they’re not doing that?”
There are a lot of possible answers to #1, many of which involve cocaine and the office copier, but let’s focus on one I hear a lot: an inability to restore data that they claimed was backed up.
Backups are the foundation of our Hierarchy of Database Needs. They’re incredibly important, yet they often get ignored because nobody’s banging on the DBA’s cubicle telling them to focus on backups. Instead, users are screaming about performance or too-tight security.
It’s up to you, their manager, to enable them to get the basics down. Here’s how to do it.
1. Ask for a report of most recent backups, by database.
Call in the DBA and give them a list of columns you want in the report:
- Server name
- Database name
- Database size
- Purpose (production, QA, development)
- Last full backup finish date/time
- Last log backup finish date/time
- Last successful DBCC CHECKDB finish date/time
- Last date/time the backups were restored for testing
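As a starting point for the DBA, here's a sketch of a per-server query that covers the first few columns using the backup history SQL Server keeps in msdb. It doesn't cover purpose, the last CHECKDB, or restore test dates, which have to come from elsewhere. Backup history also includes copy-only backups and history rows for databases restored from other servers, so treat the dates as a first pass.

```sql
-- Sketch: last full and log backup finish times per database, from msdb backup history.
SELECT @@SERVERNAME AS server_name,
       d.name       AS database_name,
       sz.size_mb,
       bk.last_full_backup,
       bk.last_log_backup
FROM sys.databases AS d
LEFT JOIN (
    -- size is stored in 8 KB pages
    SELECT database_id,
           SUM(CAST(size AS bigint)) * 8 / 1024 AS size_mb
    FROM sys.master_files
    GROUP BY database_id
) AS sz
    ON sz.database_id = d.database_id
LEFT JOIN (
    SELECT database_name,
           MAX(CASE WHEN type = 'D' THEN backup_finish_date END) AS last_full_backup,
           MAX(CASE WHEN type = 'L' THEN backup_finish_date END) AS last_log_backup
    FROM msdb.dbo.backupset
    GROUP BY database_name
) AS bk
    ON bk.database_name = d.name
WHERE d.name <> N'tempdb'
ORDER BY d.name;
```

A NULL in either backup column means msdb has no record of that backup type for the database.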
They can produce the report in any format that works for them, but they have to be able to produce it on demand with less than five minutes of manual work. (Ideally, they’d give you a solution you can run on your own, without telling them, but let’s start with something easy.)
Tell them this first task is about getting the lay of the land, not fixing the problems. Oh, they’re going to find problems all right, but the focus for now is just getting a clear picture of the work ahead.
This task may take them a day or two because they have to assemble an exact list of the servers they’re responsible for, plus build the report process. (The good DBAs reading this are thinking, “I could do this in fifteen minutes!” If you’re just saying that now, that means you haven’t done it yet, and you’re not quite as good as you think. The great DBAs are thinking, “Yeah, that’s about how long it took me to get it right the first time.”)
2. Help the DBA come up with an improvement plan.
They’re going to be ashamed/disappointed by some of these numbers. (Or if they’re not, there’s your answer about their fit for the DBA position.)
Let them build a back-of-the-napkin mini-project plan to improve the metrics for your most important server. Give them that time – typically there’s a day or two of work involved – and after they finish, review what worked about the mini-project and what didn’t. Use that knowledge to build out a plan for the rest of the servers.
Your job is to run interference for them while they get the backups and DBCCs under control. When the DBA gets “urgent” help requests from end users, here’s what I coach them to say:
“Right now, I’m focused on a project to get our backups under control. If this issue is more important than getting our production data backed up, let’s go into my manager’s office together and talk through it. If not, I hate to say this, but I’ll need you to file a help desk ticket and I’ll get to it once we’ve got the backups under control.”
Yes, this means admitting that the backups aren’t under control. It’s the only way end users will accept a “no” from the DBA.
After this project finishes, run the report again, talk through the results, and then take the results to the business users who are storing data on these servers.
3. Ask the business if this amount of data loss is acceptable.
Take the backup numbers to the business along with our free RPO/RTO Worksheet. By default, the business usually wants zero data loss and zero downtime, but as good IT managers, we have to communicate the realities of today’s databases.
Armed with the costs and timelines in that worksheet, bring written RPO goals back to the DBA. (Yes, to be a good manager takes work, and these are political problems, not technical problems – save your DBA for the technical problems.) Sometimes, this will mean bringing budget numbers back as well – if the business wants tighter RPO/RTO goals than the current infrastructure provides, the DBA will need resources to build the infrastructure improvements.
4. Measure the gap between IT’s RPO goals and the DBA’s delivery.
Remember that report the DBA built? Add two columns:
- IT’s goal for this database’s data loss (in minutes/hours) – you and the DBA come to an agreement on a realistic goal. In underfunded shops, this is sadly less ambitious than the business’s goals. (Example: “The business wants us to lose no more than 1 minute of data, but I understand that they’re not giving you space for log backups, so let’s aim for losing no more than 1 hour of data.”) In well-funded shops, the IT goal may actually be more ambitious than the business’s goal.
- The gap between the DBA’s delivery and IT’s goal – if the last successful backup was 45 minutes ago, but we’re aiming for <30 minutes of data loss, we’re in the red by 15 minutes. This metric is how we measure DBAs.
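The gap column is simple arithmetic: minutes since the last backup minus the goal in minutes. Here's a sketch, assuming the goals live in a table variable I made up for illustration (@RpoGoals) and that "last backup" means the most recent backup of any type recorded in msdb.

```sql
-- Hypothetical goals table; in practice this would be a permanent table the manager and DBA agree on.
DECLARE @RpoGoals TABLE (
    database_name    sysname NOT NULL PRIMARY KEY,
    rpo_goal_minutes int     NOT NULL
);
INSERT INTO @RpoGoals (database_name, rpo_goal_minutes)
VALUES (N'StackOverflow', 30);   -- example goal: lose no more than 30 minutes

-- Positive gap = in the red; zero or negative = in the black.
-- A NULL gap means msdb has no backup history for that database at all.
SELECT g.database_name,
       g.rpo_goal_minutes,
       DATEDIFF(MINUTE, MAX(b.backup_finish_date), GETDATE()) AS minutes_since_last_backup,
       DATEDIFF(MINUTE, MAX(b.backup_finish_date), GETDATE())
           - g.rpo_goal_minutes                               AS gap_minutes
FROM @RpoGoals AS g
LEFT JOIN msdb.dbo.backupset AS b
    ON b.database_name = g.database_name
GROUP BY g.database_name, g.rpo_goal_minutes;
```

With the numbers from the example above, a last backup 45 minutes ago against a 30 minute goal gives a gap of 15.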
As long as this gap is in the black, the DBA is doing a good job on backups. (Note: I’m not even talking about restore tests or DBCCs here – this is just what I consider table stakes on managing a DBA.)
When this number goes into the red, the DBA should stop focusing on anything else – anything – and focus exclusively on getting the data protected again.
For bonus points, trend the total server count, database count, and total data size over time. This helps your user base understand that the infrastructure will come under more pressure as the data size grows – you can only cram so much data into a server before performance issues arise.
In subsequent posts in this series, I’ll be exploring how to measure a DBA’s workload, the infrastructure’s complexity, RTO delivery, and other boring management topics. Buckle up.
Everyone’s code is terrible. But exactly how terrible is yours?
First Circle: Code Limbo
We can’t say this is good code, but for the most part nobody notices it.
Second Circle: Code Lust
Third Circle: Gluttonous Code
There’s always one simple piece of logic that’s been written hundreds of times, all over the codebase, but in slightly different ways. And then, one day, you need to change it.
Fourth Circle: Angry Code
Your application has periodic timeouts, which disappear as soon as users report them.
Fifth Circle: Greedy Code
This code was only supposed to run once when a user logged in and their profile was loaded. But sp_BlitzCache® says it runs 4,000 times per minute when you’re hardly doing anything.
Sixth Circle: Code Heresy
This code looks incredibly bad. It violates common sense, but it works right now. You’re afraid to touch it, for fear it will burst into flames.
Seventh Circle: Violent Code
Your application is deadlocking, and literally killing itself.
Eighth Circle: Code Fraud
That day you find out that large portions of your source code don’t actually belong to your company. And now you need to fix it.
Ninth Circle: Treacherous Code
When your database corrupts itself.
The QUERYTRACEON hint can be particularly useful for testing the new cardinality estimator in SQL Server 2014 on individual queries using Trace Flag 2312, if your database is still using compatibility level 110:
SELECT ClosedDate, Id
FROM dbo.Posts
WHERE LastActivityDate > '2013-09-05 11:57:38.690'
OPTION (QUERYTRACEON 2312);
GO
Conversely, if you’re using the new estimator everywhere by having database compatibility set to 120, you can use the old estimator for an individual query by using QUERYTRACEON 9481:
SELECT ClosedDate, Id
FROM dbo.Posts
WHERE LastActivityDate > '2013-09-05 11:57:38.690'
OPTION (QUERYTRACEON 9481);
GO
There’s a little problem. Turning on a trace flag requires high permissions. Let’s say that I’m attempting to use this hint from an account which doesn’t have superpower permissions. Here we create a login and user for app_account and grant it data reader, and execute on the dbo schema:
USE [master]
GO
CREATE LOGIN [app_account] WITH PASSWORD=N'DontBeLikeMeUseWindowsAuth',
    DEFAULT_DATABASE=StackOverflow, CHECK_EXPIRATION=ON, CHECK_POLICY=ON
GO
USE StackOverflow;
GO
CREATE USER [app_account] FOR LOGIN [app_account];
GO
ALTER ROLE [db_datareader] ADD MEMBER [app_account];
GO
GRANT EXECUTE ON SCHEMA::dbo TO [app_account];
GO
We can now impersonate app_account to run some tests!
EXECUTE AS LOGIN='app_account';
GO
Meet Error 2561, User Does Not Have Permission to Run DBCC TRACEON
When I try to run the query, things screech to a halt:
Even if I try to get tricky, I can’t sneak this into a temporary procedure as app_account, either:
Stored Procedures to the Rescue
We need a little help from a stored procedure. We’re going to revert back to our sa role, and create a procedure around our query:
REVERT
GO
CREATE PROCEDURE dbo.RecentPosts
    @DateVal DATETIME
AS
    SELECT ClosedDate, Id
    FROM dbo.Posts
    WHERE LastActivityDate > @DateVal
    OPTION (QUERYTRACEON 2312, RECOMPILE);
GO
/* Now we go back to running as app_account again */
EXECUTE AS LOGIN='app_account';
GO
Due to the magic of ownership chaining, we can now run the query as app_account:
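The call itself is nothing special. A sketch, reusing the date literal from the earlier queries and assuming the session is still impersonating app_account:

```sql
-- Still impersonating app_account from the EXECUTE AS above.
EXEC dbo.RecentPosts @DateVal = '2013-09-05 11:57:38.690';
GO
-- Switch back to the original login when you're done testing.
REVERT;
GO
```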
This Means You Don’t Have to Give Your Applications Sysadmin Rights
… but you may have problems if those applications need to run a lot of adhoc queries.
Want to learn more about statistics and the new cardinality estimator? Join me and Jeremiah Peschka for our Advanced Querying and Indexing course!
Want to test your knowledge on indexes and query tuning in SQL Server?
Curious how your query tuning style compares to others?
Looking to have a little fun?
Interested in a coupon code?
Your Monday just got better. Check out our new free online quizzes right now:
Find and Fix Your Worst Query Patterns (Sponsored by Idera Software)
Friday, June 12, at 11:00AM Central
You’d love to make your queries faster, but you’re not sure what to change. Kendra Little will teach you how to identify bad patterns in the execution plans of the biggest, baddest queries running against your SQL Server. You’ll see Kendra triage bad queries in action and get an easy-to-use reference you can use to diagnose problems in your query execution plans back at work. Register here!
Missed the Webcast or Want to Learn More on the Content?
- Index articles: http://www.brentozar.com/sql/index-all-about-sql-server-indexes/
- Scalar functions and the observer effect: http://www.brentozar.com/archive/2015/02/measuring-performance-can-slow-sql-server/
- Statistics, execution plans, and parameter sniffing: http://www.brentozar.com/archive/2013/11/why-parameter-sniffing-can-slow-down-queries-video/
- Crazy joins and cardinality estimation: http://www.brentozar.com/sql/index-all-about-sql-server-indexes/
We had some great questions about trivial execution plans in SQL Server in our Advanced Querying and Indexing class a few weeks ago. Here’s a little glimpse into what we talked about.
For really simple queries, SQL Server can use “trivial optimization”. If there’s a very limited number of ways to run the query, why do a bunch of fancy, CPU burning cost-based optimization? Just chuck the plan at the query and let it go!
Downside: Trivial plans don’t ask for indexes
With a trivial plan, SQL Server never gets to the phase of optimization where it would ask for an index. This means you might have a really frequently executing query running against your SQL instance, burning lots of CPU which might be vastly improved by an index, but nothing ever registers in the missing index dynamic management views.
How Trivial Does A Query Need to be for this Optimization?
I find trivial plans frequently, but it’s also a little trickier than you’d think. Let’s look at some examples:
I’m starting off with a simple query, running against a large table:
SELECT Id
FROM dbo.Posts
WHERE ParentId=3;
GO
The Posts table has only a clustered primary key on the Id column. This query gets FULL optimization, and SQL Server asks for an index.
Why wasn’t this plan trivial? Well, even though SQL Server didn’t have its choice of indexes to use, notice the parallel lines in there. SQL Server did have to decide whether or not to use more than one processor!
Let’s Take Away SQL Server’s Choices
Sorry SQL Server, I’m writing this blog post, and I know that I can remove this query’s chance to go parallel by raising one setting: Cost Threshold for Parallelism. This setting is the “estimated cost bar” for who gets to use multiple cores. I’m going to pump this setting up so my query doesn’t qualify.
/* This is an advanced option: 'show advanced options' must already be set to 1 */
exec sp_configure 'cost threshold for parallelism', 500
GO
reconfigure
GO
Be careful, this setting impacts all queries on the instance, and maxdop hints don’t override it.
Now, rerunning my query:
This time I got trivial optimization – the query’s estimated cost for running a single threaded plan doesn’t even qualify for parallelism, so that choice doesn’t exist.
No missing index shows up, even though SQL Server did 346,913 logical reads for this query.
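If you want to hunt for these in your own plan cache, the optimization level is recorded in the plan XML as the StatementOptmLevel attribute. Here's a rough sketch that lists cached statements with trivial plans, heaviest readers first. Shredding plan XML can be expensive on a busy server, so try it somewhere safe first.

```sql
-- Sketch: cached plans containing a statement that got TRIVIAL optimization.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20)
       qs.execution_count,
       qs.total_logical_reads,
       st.text AS query_text,
       qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE qp.query_plan.exist('//StmtSimple[@StatementOptmLevel="TRIVIAL"]') = 1
ORDER BY qs.total_logical_reads DESC;
```

A trivial plan with a big logical read count and a high execution count is a good candidate for a manual indexing review, since the missing index DMVs won't flag it.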
What if I Add an Unrelated Nonclustered Index?
Now that we have a trivial plan, let’s give SQL Server a choice– but not a very good choice. We’re going to create an unrelated non-clustered index:
CREATE NONCLUSTERED INDEX ix_Posts_LastActivityDate
    ON dbo.Posts (LastActivityDate)
GO
Our query doesn’t reference the LastActivityDate column at all.
Rerunning our query…
Full optimization is back! Even though the LastActivityDate index seems really unrelated to what we’re doing, just adding it puts us back in FULL optimization.
Let’s Clean Up That Cost Threshold
Otherwise I’ll forget about it and not understand the weird plans on my test machine later. I’m just setting it back to the default here:
exec sp_configure 'cost threshold for parallelism', 5
GO
reconfigure
GO
Takeaways: Beware the Creeping Trivial Plan
While most of your tables may be indexed well, it’s easy for code changes to let poorly indexed queries creep in. SQL Server’s default setting of 5 for “Cost Threshold for Parallelism” is generally far too low for modern processors, but understand that raising it may increase your chances of getting trivial execution plans, which won’t ever ask for indexing help.
Want to find out if you’ve got trivial plans among your top queries? Our free sp_BlitzCache™ procedure sniffs around in your query XML and warns you right away.