Merry Christmas, students! I very much appreciate that so many of you showed up in your bathrobes, because now I don’t feel so guilty teaching this class in mine. My Christmas gift to you is a shiny, new blog post. Try to contain your excitement.
This week, we’ve covered the basics of what SQL Server needs to store, and places we can store it. Today we’ll discuss what happens when storage goes bad – and it definitely goes bad.
Drives Go Bad When You’re Not Looking
After you’ve written data to storage, you probably assume it just stays there, staying exactly the same, waiting to be read.
If only that were true.
When drives begin their death spiral down the toilet bowl, sometimes it’s a long journey. At first, just a few blocks might go bad – but we won’t know about it if we’re not actively reading those blocks. Right now, in your servers, a hard drive might be in the process of self-destruction, taking your precious data with it.
The best defense is a good offense: actively checking your database content. SQL Server’s DBCC commands (yes, that’s an RAS) will go out and check the contents of your database to make sure everything’s still hunky-dory. The bigger your databases are, the more time it takes to check them, so DBCC has options that will let you just check the most important parts. For example, if a non-clustered index goes bad, we’re not too concerned because we can just drop that index and recreate it again. All of the index’s data is stored in the underlying table anyway, right? We should be able to recover in a matter of minutes.
If only that were true.
Where there’s smoke, there’s fire. If a component in your storage subsystem has started going bad, your first signs might be damage to a non-clustered index. Your proactive DBCC checks will catch that, and you’ll repair it, but you’re only fixing the symptom – not the disease. Storage corruption rarely just happens once and disappears. When your DBCC checks return problems, you do need to fix the symptom, but you also need to jump in and do root cause analysis. If you’re using a SAN with storage shared between several servers, start by checking other servers on that same SAN to see if they’re also having corruption problems.
Want to learn more about DBCC? Paul Randal’s blog posts on DBCC will teach you more than you wanted to know. When he was at Microsoft, he wrote the code behind DBCC.
SAN Drives Change When You’re Not Looking
Storage Area Networks are engineered for extreme reliability; if any one component in a properly configured SAN goes down, the servers won’t go down with it. They’ll be able to reroute their traffic automatically between multiple networks, multiple controllers, and multiple drive enclosures. SAN admins can even do some routine maintenance during business hours without anyone even noticing, like recabling switches, flashing firmware versions, adding additional drives to RAID arrays, or changing cache settings. Well – in theory, no one notices, but the performance changes can impact your servers. Do you know if your SQL Server’s transaction log array is 10% slower than yesterday? 20%? 30%?
The most advanced environments have even started using storage virtualization, which works just like our server virtualization technologies. Server virtualization abstracts the hardware away from the operating system, letting us move servers around. Storage virtualization lets SAN admins move your data from one array to another, like from RAID 10 to RAID 5, from SSD to SATA, or even from an old SAN to a new SAN. The benefit is that they can decommission old hardware without taking huge outages to rebuild servers from scratch. The drawback is that they can do this without telling you, on the fly. You might have fantastic performance on Monday, abysmal performance on Tuesday while they move your data, and then middling performance on Wednesday.
The solution to all of this? Constant performance monitoring. Learn how to use Perfmon to baseline your storage performance, and check your storage metrics regularly. Even if you don’t have the time to use Perfmon, you can start by monitoring your full backup times. If they jump by 20% overnight, there’s a good chance your SAN administrator just pulled a fast one – or rather, a slow one – on you. Talk to your storage folks to find out why performance is suddenly suffering, and use your Perfmon and backup times to prove it.
Virtualization Makes It Harder to Look
Virtualization abstracts your hardware away from the operating system. It’s much harder for you to know whether your CPU performance issues are caused by:
- Your virtual server’s load
- The host’s hardware capabilities
- Load from other virtual servers
Or some combination of the above. Storage performance is even tougher to troubleshoot and measure, because the bottlenecks might be:
- Your virtual server’s needs
- The host hardware’s connection to the SAN (iSCSI, FC speeds)
- The host hardware’s multipathing configuration (active/active, round robin, Russian Roulette, etc)
- Other virtual servers sharing the same SAN arrays as you
- Other virtual servers sharing the same connection to the SAN arrays as you (competition for paths)
- The storage controllers and arrays
I don’t have a simple answer to solve your virtualization storage performance problems, but I do have one important recommendation: get good at performance monitoring before you start virtualization projects. If you can’t accurately identify the performance bottlenecks of non-virtual machines, you’re not going to get any better when they’re virtualized.
Summing Up the Series: Storin’ Ain’t Easy
SQL Server DBAs think they hold a patent on the answer, “It depends,” but storage administrators were using that phrase before DBAs were even a glimmer in…wait, let’s draw the line here before I start talking about your mothers. I could go on and on for dozens of pages about storage and how it affects SQL Server.
Want to learn more about SQL Server, raid levels, SAN storage, Windows storage configuration, and more? Check out Professional SQL Server 2008 Internals and Troubleshooting. My 40-page chapter on storage goes into more detail, but frankly, the other chapters are even better! There’s in-depth details on memory, processors, tracing, and more, and it’s got a ton of information you won’t find in other SQL Server books – especially the storage, latching, and Extended Events chapters.
It’s not for beginners – it’s targeted at senior DBAs, and people who want to become senior DBAs. If you liked this series, I know you’ll like the book.
During a recent project I started to see longer % disk times in my Perfmons. It allowed me to raise issue with the storage team and find that we had 4 drives in a nose dive with bad blocks on our SAN. Your Perfmon articles have been pure gold in obtaining empirical data I can use.
Glad I could help, sir!
I’ll have to invest some time to read all articles behind the links, since dm_io_virtual_file_stats tells us we have, depending on the servers and databases involved, average read and write latencies up to 400ms and our SQL consultant confirmed our problems using SQLIO.
Since our SAN and network admins do not see unacceptable wait times or congestions, I think there might be buffering issues somewhere, since ca 30 host (with 1GB connections) hosting ca 300 guest Operating systems (with virtual 10GB connections) talk with 1 SAN (1GB connection). I hope the planned upgrade to 10GB at host and SAN side will have major impact.
Anyway, I was looking for something else. I read your post regarding the importance of index rebuilds, but I wonder what’s the impact, besides cache hit ratio might be lower in the morning. I can imagine that index rebuilds and NTFS defragmentation have huge impact when addressing the harddisk of my laptop due to more sequential IO’s (so, less random IO’s), but I wonder about the impact on our SAN since the SAN decides where to write the bytes (as I understand updates are written on other blocks to assure snapshot mechanisms) and since I don’t understand the concept of sequential read might compatible with RAID setups. Do you have links, articles, books regarding this?
Thanks a lot,
Hi, Peter. It varies on the type of SAN, type of workload, and even the edition of SQL Server. I usually treat this on a case-by-case basis with clients: if storage read waits are their biggest problem, then we look at everything in the storage stack. How are we fetching the data? Are we using the right indexes? Is storage configured correctly? Fragmentation is just one piece of this puzzle.
I might be a bit late in visiting your site but excellent explanations and very useful articles.
If the SQL server provisioned SAN drives (Data, Log, TempDB, Index) fill-up, What do you recommend for drive expansions ?
Vineel – thanks, glad you like the site! Unfortunately, that’s a little beyond the scope of something I can answer quickly in a comment. Check out our SAN best practices post here for starters: https://www.brentozar.com/sql/sql-server-san-best-practices/