We were looking at magnetic poetry words on a public bathroom wall (true story, insert TempDB joke here) and thought, “You know what would be really awesome? If there was one of these for databases.”
So we made ’em.
Each of us put together our own favorite database words, so we each have our own sheet.
To get one, just track us down at a conference or user group. This week, Doug & Kendra will be at the PASS Summit in Seattle, and I’ll be at SQL Intersection in Vegas. The more of us you meet, the more words you’ll be able to work with, since we’ll each carry our own sheets.
Share your favorites with us on Instagram or Twitter with the tag #SQLpoetry. We’ll pick out our favorites and send you a little something special.
Woohoo, it’s another half of the year which means it’s time for another version of sp_BlitzCache®. The last version came out on June 18th, and it’s been about 4 months. In that time frame, you have requested some new features, and they’ve slowly trickled into sp_BlitzCache®.
If you don’t want to wait, head over to the download page and get yourself an sp_BlitzCache®!
In no particular order:
Raul Gonzalez pointed out that if you set all of the parameters to NULL, sp_BlitzCache® will fail. This happens by default when you script execution through SSMS. Why SSMS does this is beyond me, but there you have it. sp_BlitzCache® will now get really upset and fail to run with a nice error message if you set its parameters to NULL.
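To see the behavior for yourself, script the execution from SSMS, or run something like this (using the @top and @sort_order parameters mentioned elsewhere in these notes):

```sql
/* This mimics what SSMS scripts by default: every parameter explicitly NULL.
   The new version raises a clear error instead of failing mysteriously. */
EXEC dbo.sp_BlitzCache @top = NULL, @sort_order = NULL;
```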
You guys are smart. I mean really smart. To help you keep being smart, I added a bunch of columns to expert mode. Ready?
- Compilation memory
- Compilation CPU time
- Compilation wall clock time
- Query hash and query plan hash
- Cached plan size
- Serial desired and required memory
sp_BlitzCache® used to pull back a lot of data by default. I mean… a lot of data. This could cause execution time to go through the roof on busy systems and systems with big plans. The default value for @top has been reduced to 10. You can still set it to whatever you want if you miss the old behavior.
There’s also a new @skip_analysis option. This bypasses the XML shredding that goes on in the execution plans. When you skip the XML analysis, you won’t find out if you have plan warnings or missing indexes, but your executions of sp_BlitzCache® will be much faster and you’ll use less CPU. Use this on heavily resource-constrained systems.
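Put together, a lightweight run on a resource-constrained box looks like this (parameter names as described above):

```sql
/* Keep the default of 10 plans and skip the XML shredding entirely --
   faster and lighter on CPU, but no plan warnings or missing index info. */
EXEC dbo.sp_BlitzCache @top = 10, @skip_analysis = 1;
```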
Speaking of XML – the XML checks were re-written to be smarter and require less CPU time. We still have to scan the XML, but I’ve re-written the XQuery in ways that shouldn’t need anywhere near as much CPU time as before. The upside is that this also makes it easier to add more checks to sp_BlitzCache®.
There are two big changes to how results are sorted. One for speed and one for functionality.
An additional sort order option has been added for recent compilation. If you want to sort by recent compiles, just run EXEC sp_BlitzCache @sort_order = 'compiles'
Some sort orders have always been slow (I’m looking at you, “executions per minute”). New changes push the sort earlier in the query and, in testing, improve performance, especially on busy systems.
Making Big Numbers Smaller
There was a potential for the totals temporary table to have integer overflows. I fixed that.
Basically, there was a significant chance that you might run into problems where you’ve burned through more than 600 hours of CPU time (that’s 2,147,483,647 milliseconds BTW). To help you avoid errors when you’re actually using your CPUs, I added a bunch of math to make numbers smaller. No, really, that’s about all that I did.
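A minimal sketch of the kind of math involved – casting to BIGINT (or dividing milliseconds down into bigger units) before aggregating keeps a total from blowing past 2,147,483,647:

```sql
/* Totals from a busy plan cache can exceed the INT ceiling; widening the
   type before the SUM (and reporting in seconds) keeps the numbers small. */
SELECT SUM(CAST(qs.total_worker_time AS BIGINT)) / 1000000 AS total_cpu_seconds
FROM sys.dm_exec_query_stats AS qs;
```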
Back in November of 2006, Microsoft released “SQL Server 2005 Waits and Queues” as a best practices article. Many people in the SQL Server community have referenced this article as a great place to get started with understanding waits in SQL Server. Heck, I’ve recommended it on a regular basis.
The last time I recommended it I told the recipient, “This is great, but only read it from page X to Y.”
After nearly 10 years, this paper is showing its age. As you read this critique, and if you read the paper, it’s important to consider the historical context. “SQL Server 2005 Waits and Queues” is the first of its kind for SQL Server.
The idea of the article is that developers and DBAs should use the waits information to identify areas of slow performance in SQL Server (waits) and then associate that with performance monitor counters (queues). The authors imply a methodology that starts with waits, correlates through other data, and, if followed correctly, eventually arrives at a solution for your performance problems.
The core concepts of “SQL Server 2005 Waits and Queues” are solid; queries wait, SQL Server records the waits, you interpret the waits. The overall methodology that the authors propose is a good one, too. Gathering evidence and correlating the evidence with other signifiers is a sound way to approach problem solving.
The devil is in the details. And, in this case, the devil has a hangover.
The “SQL Server 2005 Waits and Queues” paper starts off benignly enough. There are places where it sounds downright revolutionary. Even today very few people consider things like:
We outline a methodology that considers performance from the perspective of application, SQL Server, and correlates this to the system or resource perspective.
The Purpose section goes on to mention “This paper outlines a best practice methodology and provides guidelines and thought processes to identify poorly performing applications”. If “SQL Server 2005 Waits and Queues” stuck to defining process and high-level guidelines, this paper would be gold. The process that the authors outline takes a top-down approach that allows an observer to start at a system level, rapidly identify problem areas, drill down to the root cause, and then solve the problem.
Immediately after the Purpose section, the authors dive into their performance methodology. Waits are described as the observed symptom of a problem and perfmon counters are the proof. Or, to look at it another way, waits are what the application observes and perfmon counters are resource waits.
In a stroke of genius, there’s a lengthy example where the authors explain how correlating between lock waits and perfmon counters can help an observer determine if there’s really a problem going on. The crux of the argument is that “associations or correlations allow us to determine relevancy to overall performance”.
Averages are all that we have to go on when passively observing SQL Server. But in many cases, averages are misleading. In the example given, the authors mention looking at average lock time over a given window (the use of averages comes up a lot in this paper). It’s important to remember that averages can be heavily skewed by outliers.
Remember, the methodology described in this paper is a good first glance, but you can’t use it to rule out a specific problem. You can only use this methodology to determine which areas of SQL Server are a performance problem.
The methodology section doesn’t skimp on details. It provides a description of the SQL Server query wait model (running, waiting, runnable) and examples of how a query can move through this.
Unfortunately, it’s in this section where the first hints that something isn’t right show up. “A low signal (where signal is less than 25% of the total waits) to resource wait ratio indicates there is little CPU pressure.” The problem isn’t with the statement itself; if queries are spending 25% of their time waiting to get on a CPU, you’ve got problems. The problem is that the statement comes with no context – there’s no information about why the signal-to-resource wait ratio is bad or what should be done about it. This problem persists throughout much of the paper: it’s assumed that the reader knows where to go for verification of the absolute ratios that are provided.
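For context, the ratio the paper is talking about is easy to compute yourself:

```sql
/* Signal wait time is time a task spent runnable -- waiting for a CPU after
   its resource wait had already been satisfied. */
SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)
            AS DECIMAL(5, 2)) AS pct_signal_waits
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;
```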
The authors of “SQL Server 2005 Waits and Queues” attempt to provide performance tuning blueprints for both OLTP and data warehouse applications. If you’ve ever built a house, you know that blueprints change the minute you go out in the real world, and these blueprints are no exception. When I was reviewing this article, I scrawled in the margin “here’s where things go south for a while”.
The blueprints attempt to set up a list of scenarios that DBAs need to watch out for. These are typically based on hard and fast thresholds for perfmon counters and DMVs – more on that in a minute.
Although many people have pure data warehouse systems, it’s increasingly rare to find a purely transactional system. A pure transactional system is entirely single item select, insert, update, and delete; there are no reports. This was uncommon back in 2005/2006 and it’s even less common today – business users insist on viewing live data and we have no choice but to deliver. It’s left up to the reader to determine how to synthesize appropriate metrics based on the pure OLTP and pure data warehouse blueprints.
Here’s my first major issue with this article: hard and fast advice is rarely correct. The OLTP blueprint contains a huge amount of advice that could put users in a lot of trouble. In no particular order, developers will run into advice like:
- Have three or fewer indexes on frequently updated tables.
- Queries with more than 4 joins can be too normalized for high OLTP scalability.
- Avoid scans on tables since a missing index flushes the cache.
- Page Life Expectancy should always be over 300.
- Page Life Expectancy should never drop by more than 50%.
- Reads should take 4-8 ms without any IO pressure.
Take that in for a minute.
All of these statements are listed in the “SQL Server 2005 Waits and Queues” article. They’re all presented as hard rules to follow, and as indicators of performance problems. But there’s no evidence given for any rule being true, nor is there any condition given for re-evaluating them. The authors do provide a single-line caveat: “the values in Value column are good starting point. The actual values will vary.” At least their bases are covered for being wrong.
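For what it’s worth, the counter several of these rules lean on is trivial to check, which makes the hard thresholds all the more tempting to misuse:

```sql
/* Page Life Expectancy, straight from the performance counters DMV --
   a single number, easy to grab, and dangerously easy to turn into a rule. */
SELECT [object_name], counter_name, cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy';
```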
Computers have changed significantly in the last 10 years. SQL Server has seen 4 releases since this paper was published and we’re about to be on SQL Server 2016. The query optimizer has seen improvements, storage devices have seen improvements, and memory has seen improvements. When this paper was released, a dual core Xeon was a high end server CPU. In 2015 I have a quad core processor in my phone; it’s not uncommon to see 4, 6, or 8 core processors in modern servers.
As you read through the blueprints, it’s important to remember how things have continued to change in the SQL Server world. The first rule for data warehousing insists that data warehouses should have more indexes, not fewer – whereas a lot of modern thinking on data warehouses runs counter to this, focusing instead on making sure disk throughput is fast enough.
Other gems from the data warehouse blueprint include:
- Avoid plan re-use.
- If the top 2 sys.dm_os_wait_stats values contain IO waits, there’s a blocking bottleneck. (Never mind that a data warehouse will usually have more data than RAM and will need to read from disk.)
- Avoid big scans because a missing index flushes the cache.
- “The time taken for plan generation of a large query is not nearly as important as having the right plan”.
It’s noble that the authors make these attempts, but the blueprint requires a significant amount of customization and modification to make it worth considering, much less using in production.
The real value of the blueprint comes from suggested correlations between counters and waits, but you need to have solid baselines and a good understanding of how different parts of SQL Server work in order to synthesize this into something meaningful.
By the way, this only gets us to page 16 of 94.
The next major section deals with finding the typical resource bottlenecks in a SQL Server. This is a pretty boring section, truthfully. It contains descriptions of problem areas (“memory”, “IO stalls”) and a sample query that will help the reader look into the problem.
The biggest specific complaint I have about this section is that the authors insist on using Page Life Expectancy as a meaningful performance indicator.
Outside of that, there’s a glaring general concern – at no point does this paper tell the reader what to do with the information gathered by these queries. It’s left up to the reader to interpret IO stalls, memory pressure, missing indexes, and query plan problems. During a presentation I’ll sometimes answer an unforeseen question with “that’s an exercise left up to the reader,” but readers of 100-page best practices articles come expecting best practices, not homework.
It’s also in this section where vague guidelines start to creep in. Readers are instructed to “make sure that the [blocked process] threshold is not too low” – the default for this is 5 seconds, while earlier in the article the recommendation is 30 seconds. But no guidance is provided to help a DBA determine the right threshold.
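For reference, the threshold in question is a server-level setting, measured in seconds:

```sql
/* Setting the blocked process threshold to the 30 seconds the article
   recommends elsewhere: */
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'blocked process threshold', 30;
RECONFIGURE;
```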
And, ultimately, that’s my chief complaint – there’s very little reasoning given around the numbers presented in this best practices article. It’s assumed that the reader knows SQL Server well enough to read between the lines, refine existing numbers, and synthesize additional performance metrics. When supporting information is provided, it’s often in the form of another best practices article. These vary in length from 20 pages to 100 pages – I don’t know about you, but the idea of reading War & Peace: Best Practices Edition over the weekend doesn’t sound that fun.
Starting on page 22, this best practices paper turns into a list of SQL Server wait types and correlating performance monitor counters. This goes from page 22 – 79. At that point, the “queues” portion kicks in and then you’re looking at performance monitor counters. I used to recommend that people read through the list of waits to get a better idea of what’s going on inside SQL Server.
Don’t use this part of the article. Instead, read the Books Online documentation for sys.dm_os_wait_stats. It has a much better explanation of individual waits.
Nowhere in this article do the authors explain which waits are ignorable and which aren’t.
SOS_SCHEDULER_YIELD is given just as much weight as QPJOB_KILL. It’s left to you to determine if any of these waits are significant. Thankfully, we’ve got a wait stats script that will ignore many waits for you.
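A minimal sketch of that kind of filtering – the exclusion list here is illustrative, not exhaustive:

```sql
/* Benign background waits drown out the interesting ones; filter a few
   of the usual suspects before sorting by total wait time. */
SELECT wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'SQLTRACE_BUFFER_FLUSH', N'REQUEST_FOR_DEADLOCK_SEARCH',
                        N'BROKER_TASK_STOP', N'CLR_AUTO_EVENT')
ORDER BY wait_time_ms DESC;
```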
What’s worst of all is that some of the advice in this section is just plain crazy.
- ASYNC_IO_COMPLETION – This is your SQL Server asynchronously writing to disk. To solve the problem, the authors suggest you check for memory pressure and bad query plans.
- CXPACKET – The guidance for this wait type has been superseded by KB 2806535.
- DBTABLE – This wait is labeled “internal only”, but additional information is given on correlating it to performance problems. Should I even worry if it’s internal only? Can I even do anything about it?
- Failpoint – if you made it to the Failpoint, you clearly haven’t reached your own internal fail point.
- LATCH_x – SQL Server 2005 introduced sys.dm_os_latch_stats, which helps a DBA determine which latches are causing problems. Does this paper point the reader to the right DMV? Nope. It just says “Possibly the most common case is contention on internal caches (not the buffer pool pages), especially when using heaps or text.” PAGELATCH_UP means you should partition the table, BTW.
- LCK_x – If you see LCK_x waits, you should check for memory pressure. Unfortunately, we’re left to our own devices to figure out how. If you’ve been taking this article at its word, you’re probably running for the First Church of Page Life Expectancy to find everything else.
- OLEDB – This is explained to be largely caused by traces and linked servers. The mitigation: set up more traces, figure out where your client apps sit and whether or not they’re loading files, or query sys.dm_exec_requests to find out which SPIDs are seeing this wait. Nowhere does it mention stopping your traces and not using linked servers.
- WRITELOG – These waits suggest looking in sys.dm_io_virtual_file_stats (good) and query plan output (bad). Raise your hand if you’ve ever successfully used a query plan to spot I/O problems.
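That DMV, at least, is worth the trip – here’s one way to look at log write latency with it:

```sql
/* Average write stall per log file -- high numbers here line up with
   WRITELOG waits far more reliably than anything in a query plan. */
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.name AS log_file,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
WHERE mf.type_desc = 'LOG';
```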
Surely the queues section must be better… He said, hopefully.
- Queue lengths are regarded as important. Disk queue length is an especially spurious metric given modern storage technology. Formerly alarming numbers are now considered anemic – it’s not uncommon to see recommended queue depths as high as 128 or 256.
- “Under typical circumstances reads should take 4-8ms” – they go on to remind you to check with your vendor. Which is good, because 4-8ms of storage latency is either some kind of local solid state storage or else the time it takes your HBA to acknowledge a write.
- Page Faults/sec – this metric is trash. It measures when your CPU can’t find a page in memory OR when the page isn’t in the cache the CPU thought it should’ve been in. This is normal. Also – you can’t control it. Even with all of your data in memory, your server will still show page faults occurring.
- Pages/sec – this measures reading data to and from memory. Once again, this is junk. Microsoft PFE Clint Huffman debunked Pages/sec as a meaningful counter back in 2009.
- Page Splits/sec is a counter to watch, right? Well, any addition of a page to the end of an index is technically a page split. So is splitting a page. So we don’t know if these are good or bad, but they’re happening!
- Context Switches/sec is mentioned as something to watch, too. According to the paper, if this goes over 20,000 there’s a problem, and if it goes over 50,000 there’s a big CPU problem. One time I opened a program and it produced 33,000 context switches per second. Time to throw this laptop away.
At this point, I literally can’t even.
If you’re reading “SQL Server 2005 Waits and Queues” for the first time, read pages 1-7 and then skip to the conclusion. Ignore everything in between.
It’s important to consider this article in a historical perspective – it’s the first serious attempt at documenting a tuning process for SQL Server that combines perfmon counters and wait stats. In today’s context, the concrete advice from the article doesn’t hold up; but the concepts and broad methodology are still useful for identifying SQL Server problems.
If you are looking for guidance on tuning SQL Server using wait stats, more accurate information can be found in SQL Server Books Online and third party sites (like this blog).
You work with Microsoft SQL Server – maybe as a developer, report writer, analyst, sysadmin, or DBA.
You want to learn more about how to make it faster and more reliable.
But your company can’t afford training. Maybe it’s a charity that needs to save every dollar to make a difference in the lives of others. Or maybe it’s a non-profit arts foundation, or a small startup that’s barely making ends meet.
We want to hear from you. We’re going to pick 25 SQL Server professionals to get access to a tailored set of training classes that focus on issues relevant to each person’s pain points.
The fine print:
- You must already have a job working with SQL Server.
- You must work for a foundation, non-profit, charity, or similar company that’s doing good work. It can totally be a for-profit company, just as long as they’re making a difference. (If you work for Ginormous Profitable Global Corporation, forget it.)
- Your company or government rules must allow you to receive free or discounted training. (Some companies prohibit their employees from accepting gifts.)
- You can be anywhere in the world. (Some of our scholarships will involve in-person training, and some will be online video training.)
Do you like to shake sticks?
Because we’ve got more upcoming SQL Server webcasts than you can shake a stick at:
Easy Architecture Design for HA and DR – Clustering, log shipping, mirroring, AlwaysOn Availability Groups, replication – database administrators have so many confusing choices. Microsoft Certified Master Brent Ozar will cut through all the confusion and give you a simple worksheet. You’ll learn how to get the right RPO and RTO numbers in writing from the business users, then turn those requirements into the right features for you.
How to Think Like the SQL Server Engine – normally $29, and for one day we’re giving it to you completely free. When you pass in a query, how does SQL Server build the results? We’ll role play: Brent Ozar will be an end user sending in queries, and you’ll be the SQL Server engine. Using simple spreadsheets as your tables, you’ll learn how SQL Server builds execution plans, uses indexes, performs joins, and considers statistics. This session is for DBAs and developers who are comfortable writing queries, but not so comfortable when it comes to explaining nonclustered indexes, lookups, sargability, fill factor, and corruption detection.
What’s New in SQL Server 2016? – SQL Server 2016 is just around the corner. There are a lot of new features in the box, and improvements to old ones. There’s a lot to learn, and you need to start planning before SQL Server 2016 hits the streets. In this webcast, Jeremiah Peschka will share a high level overview of what’s next in SQL Server 2016.
Prove It!: Collecting the Right Metrics to Show Performance Gains – It’s easy to say adding more memory, faster drives, or a new index will help. How do you prove it really did help? What numbers and counters should I be watching? Doug Lane will explain which numbers matter when trying to show your changes were worth the cost.
Register now by just filling out your name/company/email. See you there!
Your users are frustrated because the app is too slow, but you can’t change the queries. Maybe it’s a third party app, or maybe you’re using generated code, or maybe you’re just not allowed to change it. Take heart – there’s still hope.
I do this every week, and I’ll share my proven methodologies to performance tune with indexes, SQL Server configuration switches, and hardware.
In this one-day workshop on Friday, December 4 at the Microsoft office in Chevy Chase, you’ll learn:
- Simplified internals – how SQL Server turns your query into an execution plan
- Wait stats solutions – a list of fixes for most common bottlenecks
- Plan cache analysis – how to discover the most resource-intensive queries so you know if indexes and hardware will work
- Monitoring metrics – what DMVs and Perfmon counters to monitor, and what thresholds to alert on
- Hardware warning signs – how to prove your server is undersized for your workloads
- Tools and scripts – the best free and paid tools to make your tuning job easier
This session is for developers, DBAs, and consultants who have to make SQL Server go faster. You should be comfortable writing queries and creating tables, but not as confident about interpreting SQL Server’s DMVs and diagnostic data. I’ll train you the same way I train my own clients.
This session will be demo-heavy, and you can play along. Bring a laptop running SQL Server 2005 or newer, with at least 100GB of free drive space. We’ll distribute the Stack Overflow demo databases on USB drives for you to restore, and you can follow along with the demos. You’ll get the load test scripts, run them against your laptop, and see how we solve various challenges.
BUT FRANCE HAS A PONY
I really like SQL Server. Most of the time. Okay, so most of the time I like SQL Server most of the time. Don’t get me wrong, if I had to go back through the career-time continuum and pick an RDBMS to work with, I’d probably still choose it over Oracle. Probably. And, because I don’t exclusively grow facial hair from my neck, I wouldn’t be allowed to choose PostgreSQL. They’d kick me off the mailing list.
Just kidding. You’re all handsome rogues. We could have had a nice life together, staring longingly into each other’s shoes and trying to implement parallelism.
I’d have DB2 here, but the cost of entry to the Developer Edition is rather steep. So, you know, I’m sure it’s great! But no. Though I would be really happy if Microsoft implemented ANSI Standard constructs into T-SQL half as fast as IBM does.
I have poked at Oracle and PostgreSQL a bit, and found they have some really cool stuff. Heresy, right?
Check out some of these Oracle gadgets and tell me they wouldn’t make your life a whole lot easier.
In no particular order:
Table restores! Built in! I’m very surprised we never got a feature like this. You can do it with a 3rd party tool like Dell LiteSpeed.
Adaptive Plans! Go to the link and read the second paragraph. Read it twice. Wipe the drool off your face.
In-Database Row Archiving! You know all that stuff you do with partitions that Oracle already does better? Where you’re basically praying for partition elimination to not undo the two weeks of work you put in to setting up this partitioned table that developers are writing horrible MERGE upserts to? Yeah. You can just tell the engine to not pay attention to rows you don’t care about anymore when it accesses the index. Fancy that.
Bitmap Indexes! It’s kind of like a filtered index, except for all values of a highly non-selective column.
Materializing CTEs! Even though it’s undocumented, we use plenty of undocumented stuff in SQL Server to get the job done. This is really cool to me, since I’ve discussed this limitation in CTEs before. I’d love to see a way to do this in SQL Server with the same behavior, without having to create temp tables. It would be a nice way to get around issues with caching statistics for temp tables, especially since MS is still fixing bugs around temp tables.
Are there more? Yeah, but this is a blog. Go grab a VirtualBox and read the documentation if you’re interested in learning more.
Doctors Without Borders is a truly awesome charity. Medical professionals volunteer their own personal time to go to war-torn countries and solve issues of world health.
SQL Server community member, MCM, and all-around-good-guy Argenis Fernandez organizes Argenis Without Borders, a giving event to help. Last year, the SQL Server community helped raise $13,000 for this well-deserving charity.
Now it’s your turn. Please donate $25 if you can. It would mean a lot to me, and I know it’d mean a lot to people all over the world who urgently need medical help. You’ve got a great job – it’s your turn to give back.
One of your SQL Servers is going to fail.
When one of your AG members goes down, what happens next is just like opening a new SSMS window and typing BEGIN TRAN. From this moment forwards, the transaction log starts growing.
And growing. And growing.
SQL Server is waiting for that failed replica to come back online and start synchronizing again. The transaction log has to contain all of the data necessary to bring that once-failed replica back up to speed, to the current point in time. It doesn’t matter whether this replica is sync or async, important failover partner or an unimportant reporting server.
After every 4 hours of outage troubleshooting (at 4 hours in, 8 hours, 12 hours), I look at the facts:
- How much free space do I have left on the log file drive?
- How long will it take to reseed the failed replica by using a full backup?
- Based on these, should I remove the failed replica from the Availability Group?
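Two quick checks that answer those questions fast during an outage:

```sql
/* Which databases are holding onto log because a replica is down? */
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE log_reuse_wait_desc = N'AVAILABILITY_REPLICA';

/* How full is each log file, as a percentage? */
DBCC SQLPERF (LOGSPACE);
```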
This means at minimum, my AG replicas all need enough space to have 4 hours of transaction log activity. I can buy myself more time by disabling index defrag/rebuild jobs as soon as a replica fails, too. (In a perfect world, I’d code that into my index rebuild jobs, but that’s left as an exercise for the reader.)
After the emergency finishes, I don’t bother shrinking the log files back down – because after all, another replica is going to fail. It’s not a matter of if – it’s just a matter of when.
Knowing that all this is going to happen, it makes sense to pre-grow/size your transaction log files large enough to handle at least 4 hours of activity without a log backup. After all, when I’m troubleshooting an availability emergency, I don’t want to waste time troubleshooting a performance emergency. If my AG databases are constantly freezing up while they grow the log files, it’s going to be a bad distraction from the problem I really need to focus on.
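As a sketch (the database and file names here are placeholders – size to your own 4-hour log volume):

```sql
/* Pre-grow the log once, in one big chunk, so it never has to autogrow --
   and freeze queries -- in the middle of an availability emergency. */
ALTER DATABASE YourAGDatabase
MODIFY FILE (NAME = N'YourAGDatabase_log', SIZE = 50GB);
```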