We’ve got a client that does big batch jobs every day, loading hundreds of gigabytes of data or more in short bursts. They were frustrated with slow performance on the batch jobs, and after we performed our SQL Critical Care® with ’em, it was really clear that their hardware was the bottleneck. They were using a virtual server backed by an iSCSI SAN, and they were getting bottlenecked on reads and writes. We could put some more memory in it to cache more data, preventing the read problem, but we would still get bottlenecked trying to write lots of data quickly to the shared storage.
We recommended two things: first, switch to a standalone bare metal SQL Server (instead of a virtual machine), and second, switch to cheap commodity-grade local solid state storage. Both of those suggestions were a little controversial at the client, but the results were amazing.
Why We Switched from VMware to Bare Metal
Theoretically, virtualization makes for easier high availability and disaster recovery. In practice, there are some situations – like this one – where it doesn’t make sense.
In the event of a failure, 15-30 minutes of downtime were acceptable. The server was important, but not mission-critical. In the event of an outage, they didn’t mind manually failing over to a secondary server. This meant we could avoid the complexity of a failover cluster and shared storage.
Slow performance was not acceptable during normal production. They wanted to put the pedal to the metal and make an order-of-magnitude improvement in their processing speeds with as few code changes as possible.
They weren’t going to pay a lot for this
muffler server. They’re a small company with no full time DBA and no glut of servers laying around. Buying a server was a big deal – we only had one shot to buy a server and get it right the first time. In this move, we were able to free up VMware licensing for other guests, too.
What We Designed: Dell R720 with Local SSDs
The Dell R720 is a 2-processor, 2-rack-unit server with room for 16 2.5″ drives across the front of the server, and two RAID controllers. It’s got room for up to 768GB of memory. It’s my favorite 2-processor SQL Server box at the moment.
I’m not against shared storage – I love it – but when I’m dealing with large batch jobs, a limited budget, and no clustering requirement, it’s tough to beat local SSDs. The R720 lets us use a big stack of 2.5″ solid state drives with two RAID controllers for processing data. Quantity is important here since SSDs tend to be relatively small – 512GB or less. Some larger drives exist, like the Crucial M4 960GB, but performance doesn’t quite match the 512GB-class yet.
The Dell R720XD is a similar server, but it’s absolutely slathered with drive bays, handling up to 26 2.5″ drives. While that sounds better – especially with today’s fastest SSD drives still being a little size-constrained – the R720XD only has one RAID controller instead of the R720’s two.
For our Plan B – where we’d fail over if the primary server died – we actually stuck with a virtual server. We built a small 2-vCPU, 8GB RAM guest with SQL Server. We keep it current using the database backups from the primary server. Remember, this application is batch-oriented, so we just need to run backups once a day after the batch completes, and then restore them on the secondary server. When disaster strikes, they can shut down the VMware guest, add more CPU and memory power to it, and it’s off and running as the new primary while they troubleshoot the physical box. It’s not as speedy as the primary physical box, but that’s a business decision – if they want full speed, they can easily add a second physical box later.
Picking Solid State Drives for the Database Server
When picking drives to populate the R720’s 16 bays, that’s where the tough decision comes in. You’ve got three options:
1. Use Dell-approved, Dell-sold drives. These are ridiculously, laughably, mind-bogglingly expensive given the size and performance:
A 400GB MLC drive is $1,200 rack, so filling all 16 bays would cost $19,200. To put things in perspective, the server itself is about $10k with 2 blazing fast quad-core CPUs, 384GB of memory, and spinners on the fans, so buying Dell’s SSDs triples the cost of the server.
2. Use commodity off-the-shelf SSD drives. In the latest Tom’s Hardware SSD Hierarchy (scroll halfway down that page to see the tiers), the Samsung 840 512GB drive is in the top tier. Anandtech’s review called it “the fastest consumer SSD we’ve ever tested.” It’s available on Amazon for under $500, less than half the cost of the Dell drive, meaning we could fill the R720 with 8TB of smokin’ fast storage for under $8k, plus leave a couple of hot spares on the shelf.
There are risks with this approach – Dell won’t guarantee that their controller and their software will work correctly with this combination. For example, during our load testing, the DSM SA Data Manager service repeatedly stopped, and we couldn’t always use the Dell OpenManage GUI to build RAID arrays.
3. Ignore the drive bays, and use PCI Express cards. Drives from Fusion-IO, OCZ, and Intel bypass the RAID controller altogether and can deliver even faster performance – but at the cost of higher prices, smaller space, and tougher management. You can’t take four of these drives and RAID 10 them together for more space, for example. (Although that’s starting to change with Windows 2012’s Storage Spaces, and I’m starting to see that deployed in the wild.)
For our design, we ended up with:
- Dell PowerEdge R720 with 2 quad-core CPUs, 384GB memory – $10k
- 16 Samsung 840 Pro 512GB SSDs – $8k
- Hardware total: under $20k
- SQL Server Enterprise Edition licensing for 8 cores – $56k
Kinda keeps things in perspective, doesn’t it? The hardware seems insanely overpowered until you look at how much licensing costs. At that point, why wouldn’t you buy this kind of hardware?
Why I Load Test SSDs in RAID Arrays
The R720 has two separate RAID controllers, each of which can see 8 of the Samsung drives. The drawback of this server design is that you can’t make one big 16-drive RAID 10 array. That’s totally okay, though, because even just a couple of these race car drives can actually saturate one RAID controller.
I wanted to find out:
How few drives can we get away with? For future client projects, if we didn’t need to fill up the drive bays in order to get capacity, could we saturate the controllers with just, say, 4 drives instead of 8? Can we leave enough space to have hot spare drives? I run the performance tests with 2, 4, 6, and 8 SSD drives.
How much of a performance penalty do we pay for RAID 5? RAID 10 splits your drive capacity in half by storing two copies of everything. RAID 5 lets you store more data – especially important on limited-capacity solid state drives – but is notoriously slow on writes. (Thus, the Battle Against Any Raid Five.) But what if the drives are so fast that the controller is the bottleneck anyway?
Should we turn the controller caching on or off? RAID controllers have a very limited amount of memory (in our case, 1GB) that can be used to cache reads, writes, or both. In the past, I’ve seen SSD-equipped servers actually perform slower with the caching enabled because the caching logic wasn’t fast enough to keep up with the SSDs. Dell’s recent PowerEdge controllers are supposed to be able to keep up with today’s SSDs, but what’s the real story?
Does NTFS allocation unit size still matter? In my SQL Server setup checklist, I note that for most storage subsystems, drives should be formatted with 64K NTFS allocation units for maximum performance. Unfortunately, often we get called into client engagements where the drives are already formatted and the database server is live in production – but the NTFS allocation unit is just 4K, the default. To fix that, you have to reformat the drives – but how much of a difference will it make, and is it worth the downtime?
The answers to these questions change fast, and I need to check again about once a quarter. When I need to double-check again, and I’m working with a client on a new server build with all SSDs, I offer them a big discount if I can get remote access to the server for a couple of days.
Load Test Result Highlights
Turning off read caching didn’t affect performance. The controller’s small cache (1GB) just isn’t enough to help SQL Servers, which tend to cache most of their data in memory anyway. When we need to hit disk, especially for long sustained sequential reads, the controller’s minimal cache didn’t help – even with just 4 SSDs involved.
The controller’s write caching, however, did help. Write throughput almost tripled as opposed to having caching disabled. Interestingly, as long as write caching at the controller was enabled, it didn’t matter whether read caching was enabled or not – we saw the same benefit. I would expect higher write throughput if all of the 1GB of cache was available to cache writes, but that doesn’t appear to be the case with the R720’s controllers at least.
NTFS allocation unit size made no difference. This combination of RAID controller and drives is the honey badger of storage – it just don’t care. You can leave the default caching settings AND the default NTFS allocation unit size, and it’s still crazy fast.
In RAID 10, adding drives didn’t improve random performance. We got roughly the same random performance with 4 drives and 8 drives. Sequential read throughput improved about 35% – good, but maybe not worth the financial cost of doubling the number of drives. Sequential writes saw a big boost of about 60%, but keep in mind that sustained sequential writes is a fairly rare case for a database server. It’s unusual that we’re not doing *any* reads, and we’re writing our brains out in only one file.
SSD speeds still can’t beat a RAMdrive. With SQL Server Standard Edition being confined to just 64GB of memory, some folks are choosing to install RAMdrive software to leverage that extra cheap memory left in the server. If your queries are spilling to TempDB because they need memory for sorts & joins, this approach might sound tempting. Microsoft’s even got an old knowledge base article about it. The dark side is that you’re installing another software driver on your system, and I always hate doing that on production systems. Just for giggles, I installed DataRam’s RAMdisk for comparison. The SSDs are on the left, RAMdisk on the right, and pay particular attention to the bottom row of results:
The bottom row, 4K operations with a queue depth of 32, is vaguely similar to heavy activity on multiple TempDB data files. This particular RAMdrive software manages about 4x more write throughput (and IOPs as well) than the RAID 10 array of 8 drives. (For the record, a RAID 0 array of 8 drives doesn’t beat the RAMdrive on random writes either.)
And finally, here’s the performance penalty for RAID 5. RAID 10 is on the left, 5 on the right. Same number of drives, same cache settings, same allocation unit settings:
It’s not a surprise that RAID 5 is faster for reads, but in this round of testing, it was even faster for sequential and large writes. The only place where RAID 5 takes a hit: the bottom right, 4K writes with queue depth 32.
If you’re running a Dell PowerEdge R720 loaded with Samsung 840 Pro SSDs, you’re probably better off with RAID 5 than RAID 10 – but why?
Load Test Lowlights: The Controller
We’re not just testing the drives here – we’re also testing the RAID controller. To get a true picture, we have to run another test. On the left is a RAID 10 array with 8 drives. On the right, just one drive by itself:
The top two lines are sequential performance, and the RAID array helps out there as you would expect. Having more drives means more performance.
The jawdropper hits in the bottom half of the results – when dealing with small random operations, more drives may not be faster. In fact, the more drives you add, the slower writes get, because the controller has to manage a whole lot of writes across a whole bunch of drives.
See, we’re not just testing the drives – we’re testing the RAID controller too. It’s a little computer with its own processor, and it has to be able to keep up with the data we’re throwing at it. In the wrong conditions, when it’s sitting between a fast server and a fast set of solid state drives, this component becomes the bottleneck. This is why Dell recommends that if you’re going to fill the server with SSDs, and you want maximum performance, you need to use the R720 instead of the R720xd. Yes, the R720xd has more slots – but it only has one RAID controller, so you’re going to hit the controller’s performance ceiling fairly quickly. (In fact, the controller can’t even keep up with a single drive when doing random writes.)
This is why, when we’re building a new SQL Server, we want to test the bejeezus out of the drive configurations before we go live. In this particular scenario, for example, we did additional testing to find out whether we’d be better off having multiple different RAID arrays inside the same controller. Would two four-drive RAID 10 arrays striped together be faster than one eight-drive RAID 10 array? Would we be better off with a single RAID 1 for TempDB, and a big RAID 5 for everything else? Should we make four mirrored pairs, and then stripe them together in Windows?
You can’t take anything for granted, and you have to redo this testing frequently as new RAID controllers and new SSD controllers come out. In testing this particular server, for TempDB-style write patterns, a RAID 0 stripe of two drives was actually slower than a single drive by itself!
So as Jezza says on Top Gear, and on that bombshell, it’s time to end. Good night!