A reader wrote in and asked:
We’re running SQL Server with blades and a NetApp SAN. We have a few hundred databases from 100mb to 200gb. All data, logs, tempdb, etc. are located in the same 30-disk pool.
Apparently this was setup using NetApp’s guidelines. NetApp recommendations are to put everything in one aggregate, or a couple of aggregates with the caveat that performance will suffer if you don’t use all disks in the same aggregate. They actually recommend putting data and log files together!
Have you ever seen anything like this before? Is this a stable system or is it craziness?
By the way, your blog is absolute genius! I love everything you’ve ever read, and your SQL Server expertise is surpassed only by your good looks!
(Okay, so maybe I added that last paragraph myself.)
First off, yeah, I’ve seen that before, and it’s not craziness. There’s times when this is the right way to do a SAN configuration, and there’s times where it’s wrong. Some SAN vendors (like EMC and NetApp) recommend this type of configuration as a default, and they only recommend changing it when you can justify better performance in a dedicated drive setup – meaning, physical drives are dedicated to specific tasks, like a six-disk raid 10 setup for your transaction logs.
Before I give an answer for this one specific scenario, let’s talk about the decision factors.
When you’re choosing between two things, you want to be as specific as possible about the options. If someone asks you what’s better, a Chevy or a Mercedes, you can’t just say Mercedes. They may be asking about a late-model Corvette versus a clapped-out 1970′s Mercedes wagon. Or that Mercedes might be pristinely restored, whereas the Corvette’s been in an accident. You get the idea. If you were choosing between cars, you’d want to know the age, the mileage, and what the person wanted the car for – performance, family travel, reliability, etc.
When you’re choosing between SQL Server storage options, here’s what you need to know about each of your choices:
- The number of drives – sounds simple, but try to get that information out of your SAN administrator in a shared-drive environment. Then, just for laughs, ask them, “If I was going to take my drives out of the shared pool and switch to my own dedicated spindles, how many spindles will I get?” It’s probably going to be a pretty small number, and you’ll need to know this before you decide to switch.
- The RAID level – some SAN vendors will say that the RAID level doesn’t matter anymore, and in huge shared pools, that’s vaguely true. A 30-drive RAID 5 and a 30-drive RAID 10 are going to overwhelm your HBAs anyway. However, in dedicated drive setups, we’re probably talking about much lower numbers of drives, and then it starts to matter.
- The peak load windows for other apps on those drives – if you’re sharing drives with a couple of tiny servers, and you’re the biggest load in the group, then this probably isn’t an issue. On the other hand, if you’re sharing drives with a large ERP system with hundreds of users that all log in around 8am, you’re going to want to expect that.
- The backup method – if you’re using SAN snapshots, you want as many spindles as possible to make your backups less invasive. I’m not saying SAN snapshot backups are invasive, but if you suddenly present that snapshot to another server on the SAN, your load on those hard drives just doubled. That changes your SAN bottleneck, and for this, you’ll want shared spindles.
- The current performance bottleneck – I could write a whole session just on finding your current SAN bottleneck. Oh wait – I did, and I’m delivering it via videoconference soon! Well, to condense it into a single bullet point, you need to find out where your performance bottleneck is, because it may not be your hard drives. You want to focus on your bottleneck first, eliminate it, and then move on to the next bottleneck. If the current one is the drive arrays, then even moving to dedicated spindles may not be the answer if we don’t get enough of them (see our first bullet point).
- And there’s more – like the amount of time you want to spend managing the SAN, whether you want to use SQL Server partitioning, and so on – but these are a good start.
Now that we know the questions, let’s look at some of the answers for this one scenario.
The number of drives – right now you’ve got 30 drives in a single pool for a cluster. If you carved it up into dedicated drives, you might do something like this:
- 12 drives in a raid 10 for data
- 10 drives in a raid 10 for logs
- 8 drives in a raid 10 for TempDB
I’m just pulling these out of the air to illustrate that you’ve got a lot of drives, and you can play around with the config here. To make a good design decision, you’d want to know the read/write mix and the activity on TempDB. Bottom line, you’ve got enough drives that you could get some performance with dedicated drives instead of shared. If you were running, say, 10 drives altogether in the pool, then the decision changes.
The backup method – NetApp’s snapshot solution has some great SQL Server integration, so just generally, if you’re using NetApp, I would use a shared drive config. You may not be using their snapshots now, but as your databases get bigger, it’s nice to have the snapshot option available. If your backup window gets out of control, call your NetApp guys about getting a demo.
The current performance bottleneck – it takes a lot of work to figure this out, but the key word in the question was “blades”. I love blades, but I’ve seen a few implementations where people have shoved a bunch of SQL Servers in a single blade chassis, only put two HBAs in each blade, and only used two fiber cables to connect the entire blade chassis to the SAN. That means all of the servers are choking on a small amount of bandwidth.
When we look at these answers together, I’d say that the shared drives are not holding back this server’s performance. Before changing from shared to dedicated drives, I’d add HBAs in each SQL Server, enable & test multipathing software, and connect the blade chassis to the SAN infrastructure with as many fiber cables as feasible. Otherwise, changing from shared to dedicated spindles for this setup won’t make a performance difference.
At the same time, I would only make these changes if you’re seeing disk performance bottlenecks on the SQL Server. (This reader mentioned that he used Quest Spotlight, and it’s easy to see this in Spotlight.)
Does that mean NetApp’s original recommendation was right? Not necessarily, but here’s why they do it: most people have setups just like this. People will throw hundreds of thousands of dollars into a storage controller and a bunch of drive enclosures, only to kill the performance by not giving it any bandwidth. The more time and effort you put into the setup around the edges, the more it’ll pay off in performance.