Solid state drives (SSDs) have gotten a lot of press because they can be dramatically faster than magnetic hard drives. They use flash memory instead of spinning magnetic platters. Since any bit of memory can be accessed at any time without moving a hard drive’s head around, SSD random access is insanely fast.
How fast are SSDs? So fast that good ones overwhelm the capacity of the connection between the server and the drive. The SATA bus maxes out at around 300 MB/s, and a good SSD can saturate that connection. In order to get your money’s worth out of an SSD, you have to connect it with something faster than SATA. Value is especially important given the pricing of solid state drives – more on that in a minute.
Fusion-IO ioDrives get around this limitation because they’re not SATA drives; they plug directly into your server’s much faster PCI Express bus. These cards can push several times more data per second than SATA drives can. Other vendors using this approach include OCZ Z-drives and RAMSAN. Of course, this connection method only pays off when the drive uses top-notch memory chips, and after briefly testing some of Fusion-IO’s products in their lab, I can vouch that they’re using the good stuff.
How Fast Are Fusion-IO Drives?
As a former SAN administrator, I’m anal retentive about reliability and uptime. I’ve heard FusionIO drives sold as a “SAN in your hand,” but with a single drive, there’s not enough failure protection for my personal tastes. I wouldn’t run any storage device without redundancy, so I ran most of my tests in a RAID 1 configuration – a mirrored pair of Fusion-IO SSDs. Keep in mind that since these devices have their own built-in controllers, any RAID setup must be a software RAID setup managed by Windows. Software RAID has a bad reputation, but it’s the only choice available when working with these drives. I was initially worried about the performance impact of software RAID, but I didn’t have anything to worry about.
I tested several different ioDrive models using my SQLIO scripts as seen on SQLServerPedia and got blazing results. Here’s a fairly typical set of results from a pass doing random reads:
C:\SQLIO>sqlio -kR -t2 -s120 -o8 -frandom -b64 -BH -LS P:\SQLIO\TestFile1.dat sqlio v1.5.SG using system counter for latency timings, 2929716 counts per second 2 threads reading for 120 secs from file P:\SQLIO\TestFile1.dat using 64KB random IOs enabling multiple I/Os per thread with 8 outstanding buffering set to use hardware disk cache (but not file cache) using current size: 10240 MB for file: P:\SQLIO\TestFile1.dat initialization done CUMULATIVE DATA: throughput metrics: IOs/sec: 22787.54 MBs/sec: 1424.22 latency metrics: Min_Latency(ms): 0 Avg_Latency(ms): 0 Max_Latency(ms): 106 histogram: ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+ %: 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I know what you’re saying, though. “Wow, Brent, that’s faster than any SAN I’ve ever seen before, but solid state drives are slower for writes, right?” Yes, here’s a set of results for writes:
C:\SQLIO>sqlio -kW -t2 -s120 -o1 -frandom -b64 -BH -LS P:\SQLIO\TestFile1.dat sqlio v1.5.SG using system counter for latency timings, 2929716 counts per second 2 threads writing for 120 secs to file P:\SQLIO\TestFile1.dat using 64KB random IOs enabling multiple I/Os per thread with 1 outstanding buffering set to use hardware disk cache (but not file cache) using current size: 10240 MB for file: P:\SQLIO\TestFile1.dat initialization done CUMULATIVE DATA: throughput metrics: IOs/sec: 10114.24 MBs/sec: 632.14 latency metrics: Min_Latency(ms): 0 Avg_Latency(ms): 0 Max_Latency(ms): 54 histogram: ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+ %: 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Write performance was about half as much as read performance, but that wasn’t because Fusion-IO drives were significantly slower in writing. They managed the same 500-600 MBs/sec write performance when using one drive alone (rather than RAID). When I put them in a RAID 0 (striped) configuration, I frequently got over 1,000 MBs/sec. To put things in perspective, the fastest solid state drives in Anandtech’s recent showdown only achieved 337 MBs/sec in their absolute best case scenarios.
The only way to outperform a Fusion-IO drive is to invest six figures in a SAN and hire a really sharp SAN admin. These drives consistently outperformed every storage I’ve ever seen short of SANs wired up with true active-active multipathing software, because typical 4Gb fiber connections can’t sustain this kind of throughput. Random access, sequential access, lots of threads, few threads, reads, writes, you name it – once we had them configured properly, I couldn’t get them to perform any slower than 550 MBs/sec, which is faster than you can drive 4Gb fiber. Connecting directly to the PCI Express bus really pays off here, and makes everything simple.
Configuring SQL Server on a SAN is hard. To really wring the best performance out of it, you have to design the pathing, the queue lengths on the host bus adapters, the RAID arrays, the SAN caching, the database filegroups and files, and sometimes SQL Server table partitioning in order to get everything to work in concert. Fusion-IO drives return SQL Server configuration to its easiest: plug it in, put data and logs on the same array, and just create databases with the GUI, using one data file and one log file. As long as your hardware environment is solid, it’s pretty darned hard to screw up a database configuration on Fusion-IO drives.
The Drawbacks of PCI Express Drives
Connecting data storage devices as PCI Express cards isn’t all unicorns and rainbows. Since they’re PCI Express cards, they can’t be used in blades unless you’ve got special PCI Express expansion blades.
These cards are connected directly to one server, which means they can’t be used in clustering environments. If your data is mission-critical, you’re probably using a cluster to protect yourself from server failures. Protecting from failures is even more important when you’ve got a single point of failure – like, well, a single PCI Express card with all your data. If something should happen to go wrong with one of these devices, you can’t simply fail over to another server unless you’re doing synchronous database mirroring, something I rarely see in production.
You can get some form of protection by using software RAID 1: building a simple mirror with Windows between two Fusion-IO drives in the same server. Whenever data is written to the volume, Windows will automatically write it to both Fusion-IO drives. Software RAID gets a bad rap, but in my brief testing, I saw no performance penalty when using this configuration.
However, when a drive fails, you probably won’t be hot-swapping these while the server is in operation. With server-class RAID arrays, you can pull a failed hard drive out of a server and replace it on the fly. The RAID controller will rebuild the array while the server is still online. Data access speeds will be slower while the array is rebuilt, but at least the server can stay up the entire time without an outage. Not so with PCI Express cards: the server will have to be pulled out of the rack and opened up in order to access the drives. This requires careful cabling – something I don’t see often in datacenters.
And Yes, You Need to Be Paranoid
During my testing, before Fusion-IO ironed out all of the configuration issues, I kept having drives drop offline. Normally I’d blame my own stupidity, but my tests were run in FusionIO’s datacenter, on their servers, configured by their staff. I connected via remote desktop, set up SQLIO per my SQLIO tutorial at SQLServerPedia, and ran the tests. During my tests, ioDrives appeared to have failed and FusionIO staff had to replace them. It took several weeks for us to narrow down several unfortunate problems.
If you truly try to push your IO subsystems to the limit, a Fusion-IO subsystem will expose more weaknesses than other storage subsystems because it has so much more throughput. Some of the problems included motherboard issues, driver problems, OS configuration errors, and even insufficient power supplies that couldn’t handle the load of multiple drives.
Buyers need to be aware that this is a version 1 product with version 1 best practices and documentation. When you put something like this into your infrastructure, make sure you’re actually adding reliability. In my post about adding reliability to your infrastructure, I pointed out:
“The only way a twin-engine plane is more reliable is if just one of the two engines is enough to power the airplane safely. If the airplane requires both engines in order to maneuver and land, then the second engine didn’t add reliability: it just added complexity, expense and maintenance woes.”
The ironic part about my FusionIO testing woes was that they only happened in RAID scenarios. The drives were actually more reliable without RAID – when I added RAID, I could knock a drive offline in minutes. The faster we wanted to go, the more careful the team had to be with other pieces of the infrastructure.
FusionIO drives solve a real problem, and they can deliver staggering performance, but just like any other new technology, you should test them thoroughly in your own environment before deploying them in production. Make sure to test them in the exact configuration you plan to deploy – if you’re going to deploy them in a RAID configuration, test them that way, rather than testing individual drives and assuming they’ll hold up in RAID configs. In the case of Fusion-IO drives, you should probably even test using similar power supplies to production in order to improve your odds.
Where I’d Use Fusion-IO Drives in Database Servers
If you’re experiencing heavy load problems in TempDB, and if you’re not using a cluster, a Fusion-IO drive can probably solve the problem with less engineering effort than any other solution. Simply shut down the server, drop in an ioDrive, change SQL Server’s TempDB location to point to the ioDrive, and start SQL Server up again. Your TempDB IO won’t travel over the same storage paths that your database needs, which frees up more bandwidth for your data and log traffic. TempDB requests stay on the PCI Express bus and don’t hit your SAN.
If you’ve got power, cooling, and space constraints in your datacenter, but you need to add more storage performance (but not necessarily capacity), a Fusion-IO drive makes sense. You’ll gain more storage throughput this way than by adding several shelves of bulky, hot hard drives that require lots of power. On a cost-per-gigabyte basis, this won’t make economic sense, but if you’re buying storage for performance needs, the cost equation is different.
If you need to scale an OLTP database that doesn’t require high availability, you might consider skipping clustering and trying database mirroring instead. Use two SQL Servers in the same datacenter, both equipped with Fusion-IO drives, and make sure they’ve got plenty of bandwidth between them to keep up with synchronous (or asynchronous) mirroring. You could argue that this provides a higher availability than clustering, since it uses two different storage back ends. I’ve had SANs fail, and I can see how this might be attractive in some environments. StackOverflow strikes me as an excellent candidate – the databases are small enough to fit on Fusion-IO’s drives, and the servers are rack-mounted as opposed to blades.
I’m also intrigued at the ioDrive’s potential to offload transaction log load. I can envision a scenario where databases are in full recovery mode, but the load of logging and transaction log backups is starting to put a strain on the server’s IO. Moving the transaction logs onto the Fusion-IO drive eases the load on the SAN (not just the hard drives, but the cabling between the servers and the SAN controller).
I wish I had SQL Server benchmarks for these scenarios to share with you, but the testing process ended up taking several weeks, and I ran out of time. Thankfully, Paul Randal is blogging about his experiences with Fusion-IO drives.
My verdict: these drives can solve some tough challenges. I’m not saying that because I’m being paid to, either; Fusion-IO was gracious enough to give me access to their labs for my testing, but I didn’t get compensated. Quite the opposite – I sank a lot of time into this project. Folks who follow me on Twitter may remember that I struggled with Fusion-IO during the initial stages of the testing as we went through one hardware failure after another. After the problems we encountered and the weeks of investigation, I’m glad to finally be able to say without hesitation that you should check out FusionIO’s products. Their throughput may push the rest of your infrastructure to its limits, but hey – that can be a good problem to have!
This post is a part of T-SQL Tuesday, which is focusing on storage/IO this week. If you liked this post, head over there to see what else is happening in the blogosphere this week!
