Merry Christmas, students! I very much appreciate that so many of you showed up in your bathrobes, because now I don’t feel so guilty teaching this class in mine. My Christmas gift to you is a shiny, new blog post. Try to contain your excitement.
Drives Go Bad When You’re Not Looking
After you’ve written data to storage, you probably assume it just stays there, staying exactly the same, waiting to be read.
If only that were true.
When drives begin their death spiral down the toilet bowl, sometimes it’s a long journey. At first, just a few blocks might go bad – but we won’t know about it if we’re not actively reading those blocks. Right now, in your servers, a hard drive might be in the process of self-destruction, taking your precious data with it.
The best defense is a good offense: actively checking your database content. SQL Server’s DBCC commands (yes, that’s an RAS) will go out and check the contents of your database to make sure everything’s still hunky-dory. The bigger your databases are, the more time it takes to check them, so DBCC has options that will let you just check the most important parts. For example, if a non-clustered index goes bad, we’re not too concerned because we can just drop that index and recreate it again. All of the index’s data is stored in the underlying table anyway, right? We should be able to recover in a matter of minutes.
If only that were true.
Where there’s smoke, there’s fire. If a component in your storage subsystem has started going bad, your first signs might be damage to a non-clustered index. Your proactive DBCC checks will catch that, and you’ll repair it, but you’re only fixing the symptom – not the disease. Storage corruption rarely just happens once and disappears. When your DBCC checks return problems, you do need to fix the symptom, but you also need to jump in and do root cause analysis. If you’re using a SAN with storage shared between several servers, start by checking other servers on that same SAN to see if they’re also having corruption problems.
Want to learn more about DBCC? Paul Randal’s blog posts on DBCC will teach you more than you wanted to know. When he was at Microsoft, he wrote the code behind DBCC.
SAN Drives Change When You’re Not Looking
Storage Area Networks are engineered for extreme reliability; if any one component in a properly configured SAN goes down, the servers won’t go down with it. They’ll be able to reroute their traffic automatically between multiple networks, multiple controllers, and multiple drive enclosures. SAN admins can even do some routine maintenance during business hours without anyone even noticing, like recabling switches, flashing firmware versions, adding additional drives to RAID arrays, or changing cache settings. Well – in theory, no one notices, but the performance changes can impact your servers. Do you know if your SQL Server’s transaction log array is 10% slower than yesterday? 20%? 30%?
The most advanced environments have even started using storage virtualization, which works just like our server virtualization technologies. Server virtualization abstracts the hardware away from the operating system, letting us move servers around. Storage virtualization lets SAN admins move your data from one array to another, like from RAID 10 to RAID 5, from SSD to SATA, or even from an old SAN to a new SAN. The benefit is that they can decommission old hardware without taking huge outages to rebuild servers from scratch. The drawback is that they can do this without telling you, on the fly. You might have fantastic performance on Monday, abysmal performance on Tuesday while they move your data, and then middling performance on Wednesday.
The solution to all of this? Constant performance monitoring. Learn how to use Perfmon to baseline your storage performance, and check your storage metrics regularly. Even if you don’t have the time to use Perfmon, you can start by monitoring your full backup times. If they jump by 20% overnight, there’s a good chance your SAN administrator just pulled a fast one – or rather, a slow one – on you. Talk to your storage folks to find out why performance is suddenly suffering, and use your Perfmon and backup times to prove it.
Virtualization Makes It Harder to Look
Virtualization abstracts your hardware away from the operating system. It’s much harder for you to know whether your CPU performance issues are caused by:
- Your virtual server’s load
- The host’s hardware capabilities
- Load from other virtual servers
Or some combination of the above. Storage performance is even tougher to troubleshoot and measure, because the bottlenecks might be:
- Your virtual server’s needs
- The host hardware’s connection to the SAN (iSCSI, FC speeds)
- The host hardware’s multipathing configuration (active/active, round robin, Russian Roulette, etc)
- Other virtual servers sharing the same SAN arrays as you
- Other virtual servers sharing the same connection to the SAN arrays as you (competition for paths)
- The storage controllers and arrays
I don’t have a simple answer to solve your virtualization storage performance problems, but I do have one important recommendation: get good at performance monitoring before you start virtualization projects. If you can’t accurately identify the performance bottlenecks of non-virtual machines, you’re not going to get any better when they’re virtualized.
Summing Up the Series: Storin’ Ain’t Easy
SQL Server DBAs think they hold a patent on the answer, “It depends,” but storage administrators were using that phrase before DBAs were even a glimmer in…wait, let’s draw the line here before I start talking about your mothers. I could go on and on for dozens of pages about storage and how it affects SQL Server.
Want to learn more about SQL Server, raid levels, SAN storage, Windows storage configuration, and more? Check out Professional SQL Server 2008 Internals and Troubleshooting. My 40-page chapter on storage goes into more detail, but frankly, the other chapters are even better! There’s in-depth details on memory, processors, tracing, and more, and it’s got a ton of information you won’t find in other SQL Server books – especially the storage, latching, and Extended Events chapters.
It’s not for beginners – it’s targeted at senior DBAs, and people who want to become senior DBAs. If you liked this series, I know you’ll like the book.