
Storage vendors brag about the IOPS that their hardware can provide. Cloud providers have offered guaranteed IOPS for a while now. It seems that no matter where we turn, we can’t get away from IOPS.

What Are You Measuring?

When someone says IOPS, what are they referring to? IOPS is an acronym for Input/Output Operations Per Second. It’s a measure of how many physical read/write operations a device can perform in one second.

IOPS are relied upon as an arbiter of storage performance. After all, if something has 7,000 IOPS, it’s gotta be faster than something with only 300 IOPS, right?

The answer, as it turns out, is a resounding “maybe.”

Most storage vendors perform their IOPS measurements using a 4k block size, which is irrelevant for SQL Server workloads; remember that SQL Server reads data 64k at a time (mostly). Are you slowly getting the feeling that the shiny thing you bought is a piece of wood covered in aluminum foil?

Those 50,000 IOPS SSDs are really only going to give you 3,125 64KiB IOPS. And that 7,000 IOPS number that Amazon promised you? That’s in 16KiB IOPS. When you scale those numbers to 64KiB IOPS it works out to 1,750 64KiB IOPS for SQL Server RDS.
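
If you want to run the numbers yourself, here’s a quick back-of-the-napkin sketch (Python; the function name is made up, and the vendor figures are just the examples above):

```python
def scale_iops(vendor_iops, vendor_block_kib, target_block_kib=64):
    # Rough ceiling: assumes the device moves the same number of bytes per
    # second regardless of block size. Real hardware rarely behaves this
    # nicely - see the comments below for the caveats.
    return vendor_iops * vendor_block_kib / target_block_kib

print(scale_iops(50_000, 4))   # 50,000 4 KiB IOPS -> 3125.0 64 KiB IOPS
print(scale_iops(7_000, 16))   # 7,000 16 KiB IOPS -> 1750.0 64 KiB IOPS
```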

[Image: Latency as illustrated by ping]


Latency vs IOPS

What about latency? Where does that fit in?

Latency is a measure of the duration between issuing a request and receiving a response. If you’ve ever played Counter-Strike, or just run ping, you know about latency. Latency is what we blame when we have unpredictable response times, can’t get to Google, or when I can’t manage to get a headshot because I’m terrible at shooters.

Why does latency matter for disks?

It takes time to spin a disk platter and it takes time to move the read/write head of a disk into position. This introduces latency into rotational hard drives. Rotational HDDs have great sequential read/write numbers, but terrible random read/write numbers for the simple reason that the laws of physics get in the way.

Even SSDs have latency, though. Within an SSD, a controller is responsible for a finite number of chips. Some SSDs have multiple controllers, some have only one. Either way, a controller can only pull data off of the device so fast. As requests queue up, latency can be introduced.

On busy systems, the PCI-Express bus can even become a bottleneck. The PCI-E bus is shared among I/O controllers, network controllers, and other expansion cards. If several of those devices are in use at the same time, it’s possible to see latency just from access to the PCI-E bus.

What could trigger PCI-E bottlenecks? A pair of high end PCI-E SSDs (TempDB) can theoretically produce more data than the PCI-E bus can transfer. When you use both PCI-E SSDs and Fibre Channel HBAs, it’s easy to run into situations that can introduce random latency into PCI-E performance.
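
To put rough numbers on it, here is a sketch where every figure is a made-up assumption for illustration (check your own server’s block diagram before drawing conclusions):

```python
# Hypothetical figures, purely for illustration.
pcie2_lane_mb_s = 500                 # PCIe 2.0: roughly 500 MB/s per lane, each way
shared_uplink_lanes = 8               # lanes shared by the storage and network devices
flash_card_mb_s = 2_000               # one high-end PCI-E flash card, sequential read
fc_hba_mb_s = 800                     # one 8 Gb Fibre Channel HBA

demand_mb_s = 2 * flash_card_mb_s + fc_hba_mb_s      # 4,800 MB/s requested
supply_mb_s = shared_uplink_lanes * pcie2_lane_mb_s  # ~4,000 MB/s available
print("oversubscribed:", demand_mb_s > supply_mb_s)  # True -> queuing, i.e. latency
```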

What About Throughput?

Throughput is often measured as IOPS * operation size in bytes. So when you see that a disk is able to perform X IOPS or Y MB/s, you know what that number means – it’s a measure of capability, but not necessarily timeliness. You could get 4,000 MB/s delivered after a 500 millisecond delay.

Although throughput is a good indication of what you can actually expect from a disk under perfect lab test conditions, it’s still no good for measuring performance.

Amazon’s SQL Server RDS promise of 7,000 IOPS sounds great until you put it into perspective. 7,000 IOPS * 16KiB = 112,000 KiB per second – that’s roughly 110 MB/s. Or, as you or I might call it, Gigabit Ethernet.
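
Worked out in Python (a sketch, not a benchmark; the exact figure depends on whether you count decimal or binary megabytes):

```python
def iops_to_mb_per_sec(iops, block_kib):
    # Throughput = IOPS * operation size, expressed in decimal MB/s.
    return iops * block_kib * 1024 / 1_000_000

rds = iops_to_mb_per_sec(7_000, 16)
print(f"{rds:.0f} MB/s")        # ~115 MB/s
print(f"{rds * 8:.0f} Mbit/s")  # ~918 Mbit/s - right around Gigabit Ethernet
```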

What Does Good Storage Look Like?

Measuring storage performance is tricky. IOPS and throughput are a measurement of activity, but there’s no measure of timeliness involved. Latency is a measure of timeliness, but it’s devoid of speed.

Combining IOPS, throughput, and latency numbers is a step in the right direction. It lets us combine activity (IOPS), throughput (MB/s), and performance (latency) to examine system performance. 

Predictable latency is incredibly important for disk drives. If we have no idea how the disks will perform, we can’t predict application performance or commit to acceptable SLAs.

In its Systematic Look at EC2 I/O, Scalyr demonstrates that drive latency varies widely in EC2. While these numbers will vary across storage providers, keep in mind that latency is a very real thing, and it can cause problems for shared storage and dedicated disks alike.
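
If you want to see your own latency distribution instead of trusting a brochure, a rough sketch along these lines is a starting point (Python; the file name is hypothetical, and for a real test you’d want a file much larger than RAM, or O_DIRECT, so the OS page cache doesn’t flatter the numbers):

```python
import os
import random
import statistics
import time

PATH = "iotest.dat"     # hypothetical test file, ideally much larger than RAM
BLOCK = 64 * 1024       # 64 KiB, roughly what SQL Server reads at a time
SAMPLES = 1_000

size = os.path.getsize(PATH)
latencies_ms = []
with open(PATH, "rb", buffering=0) as f:
    for _ in range(SAMPLES):
        offset = random.randrange(size - BLOCK)
        f.seek(offset - offset % BLOCK)   # align to the block size
        start = time.perf_counter()
        f.read(BLOCK)
        latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
print("median:", statistics.median(latencies_ms))
print("p99   :", latencies_ms[int(0.99 * SAMPLES)])
```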

What Can We Do About IOPS and Latency?

The first step is to make sure we know what the numbers mean. Don’t hesitate to convert the vendor’s numbers into something relevant for your scenario. It’s easy enough to turn 4k IOPS into 64k IOPS or to convert IOPS into MB/s measurements. Once we’ve converted to an understandable metric, we can verify performance using SQLIO and compare the advertised numbers with real world numbers.

But to get the most out of our hardware, we need to make sure that we’re following best practices for SQL Server setup. Once we know that SQL Server is set up well, it’s also important to consider adding memory, carefully tuning indexes, and avoiding query anti-patterns.

Even though we can’t make storage faster, we can make storage do less work. In the end, making the storage do less gets the same results as making the storage faster.

  1. Nice writeup, I also don’t want to see folks scammed by a single metric without context.
    “It’s easy enough to turn 4k IOPS into 64k IOPS…”
    Simply converting 4k IOPs to 64k IOPs by dividing by 16 does not account for important factors such as CPU/controller utilization, buffer utilization, and round trips. It’s extremely important, for example, when using an on-board PCIe flash accelerator card to know the page size options and select the best one for the workload and system. There’s a current card that is formatted to 4k pages by default but allows 64k page formatting… many SQL Server workloads will perform far better with 64k pages. It’s also very important to evaluate whether a change to the fibre channel HBA maximum transfer size will benefit performance. If SQL Server issues 1 MB IOs and the HBA splits them into 512k IOs… the extra round trips make a difference.

    • Thanks for the additional details.

      I opted for the simple route since device manufacturers are the initial parents of lies and untruths. RAID stripe size, HBA transfer size, iSCSI packet size, and all kinds of other things can play a big part in IO performance.

      • Ha! I thought about this some more and realized that with a very reasonable assumption, you could simply divide the vendor’s 4k IOPs number by 16 and come up with an estimated CEILING for the 64k IOPs, or perform other such simple conversions. The assumption is that the vendor would ONLY report the IOPs from the most favorable IOPs*IOsize combination.
        However, it would be hard to predict how far below that ceiling a different IO size would land. And who knows… maybe there is a vendor that, when faced with significant differences among IOPs*IOsize combination, would actually report the whole story?

        • Actually, dividing 4kB random IOPS by 16 does not necessarily produce a meaningful ceiling estimate for 64kB random IOPS. On a pure-disk system, where disk seek latency is the primary limit on IOPS, that may be a reasonable assumption.
          On ION’s SR-71mach4 all-SSD RAID server, we find that 4kB random reads execute at just over 1 million IOPS while 64kB random reads are performed at 131 thousand IOPS, or more than double what would be predicted when dividing by 16. In an all-SSD system doing that many random IOPS, the limit at 4kB is not the performance of the individual drives and certainly not “seek” time. The limit on performance for 4kB random reads is much more related to the performance of system processors and RAID controller processors. The result is that for random read performance, bandwidth (MBps), that is, IOPS * block size, improves significantly as block size increases.

  2. IOPS are more about random seeks. For sequential reads the number is entirely meaningless. You get the same MB/s with any (reasonable) blocksize and therefore with any IOPS you want.

    On a magnetic disk the 4k IOPS number will be very comparable to the 8k IOPS number because the disk seek makes up ~99% of the cost. The seek takes 10 ms and the additional 4k read takes 4 KB / (100 MB/s) = 0.04 ms (IOW nothing).

    The model will be entirely different for each disk type and vendor. To make a decision you need to understand the model. This article however makes very broad statements which do not hold in general. This is really a nasty source for out-of-context quotes.

    • Hey Tobi, thanks for your contribution. I’m glad that you pointed out that to read 8K, I could spend 10 ms (or 99.6% of the duration) waiting on a drive head to move into position. It really is important that people know how long I/O could take in a worst case scenario.

      In database applications, or really any application in a multi-user environment, can you tell me the last time you encountered purely sequential I/O?

      • Jeremiah: “In database applications, or really any application in a multi-user environment, can you tell me the last time you encountered purely sequential I/O?”
        Yep. Database backup :) To a lesser extent integrity checks and index maintenance. And to the least extent of these – SQL Server readahead and Oracle multi-block reads. Use startup option -E to alter SQL Server proportional fill behavior and you might just see a whole lot more sequential IO :)
        Tobi: “IOPS are more about random seeks. For sequential reads the number is entirely meaningless.” For spinning media I agree with the first sentence. For flash, if every IO passed to the device is the same size as the block/page, it’s usually a draw. For spinning media AND flash I disagree that IOPs are meaningless for sequential IO. A counterpoint is database backup. If IOPs were meaningless… Microsoft wasted time allowing a configurable max transfer size parameter for the T-SQL backup command… and I wouldn’t have had as much fun showing how increasing the fibre channel HBA max transfer size (thus lowering the number of IOPs for the backup) changed the behavior of SQL Server backup with respect to bandwidth and latency. :)
        http://sql-sasquatch.blogspot.com/2013/07/mssql-backups-check-hba-maximum.html

        • Good points on your part.

          My main point is: understand the execution model and you can predict performance. Without talking about a specific model I would not make specific statements.

          In your article you made improvements after determining what affects the underlying IO size. That was a good approach. SANs and RAID arrays can need a large IO size because they stripe IOs over multiple drives, so the physical IOs are a fraction of the requested IO size. Again, the model explains the observations and allows for making predictions.

      • And yes, IOPS are a scam :) Glad someone is calling it out.

  3. Depending on how the device work (for both HDD and SSD devices) doing 16x 4k reads and 1x 64k read might take significantly different amounts of time. Just multiplying to convert one to the other may well give you meaningless numbers.

  4. Pingback: Stuff that makes us laugh VI: Return of the lulz! - Page 146 - Project Reality Forums

  5. Nice Article.

    I would add that IOPS and throughput measure different things. IOPS have a direct impact on an IO subsystem’s capability to handle concurrent IO requests. IOPS are generally more relevant to a system’s performance than throughput when a large number of database applications are hosted on a single subsystem: that is the case as physical consolidation increases. In that regard, the key factor of the IO subsystem becomes a purely physical issue, depending on the controller’s algorithmic ability to effectively buffer striped IO operations on the spindles. Traditional SAN vendors are now totally overwhelmed economically by the appearance of new technologies which basically do much better for a much lower price.

  6. So does the performance noticeably go up when you reformat a drive from default 4k to 64k or is it just that SQL uses the space more efficiently and it’s a ‘little’ better?

    • With some storage subsystems, I’ve seen as much as a 50% jump. With others (like NetApp or some SSDs) there’s been no difference.

    • You’ll get much more efficient I/O – when SQL Server performs one logical 64KB I/O operation, it’ll be one physical I/O operation too. Lining everything up with the correct formatting (check with your vendor) can make a huge difference – as Brent points out.
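
      If you’re not sure what allocation unit size a volume actually ended up with, here’s a quick Windows-only sketch that reads it through the Win32 GetDiskFreeSpaceW call (Python; the E: drive letter is just an example – point it at your data drive):

      ```python
      import ctypes

      def allocation_unit_bytes(root="E:\\"):
          # Cluster (allocation unit) size = sectors per cluster * bytes per sector.
          sectors_per_cluster = ctypes.c_ulong(0)
          bytes_per_sector = ctypes.c_ulong(0)
          free_clusters = ctypes.c_ulong(0)
          total_clusters = ctypes.c_ulong(0)
          ok = ctypes.windll.kernel32.GetDiskFreeSpaceW(
              ctypes.c_wchar_p(root),
              ctypes.byref(sectors_per_cluster),
              ctypes.byref(bytes_per_sector),
              ctypes.byref(free_clusters),
              ctypes.byref(total_clusters),
          )
          if not ok:
              raise ctypes.WinError()
          return sectors_per_cluster.value * bytes_per_sector.value

      print(allocation_unit_bytes("E:\\"))  # 65536 on a volume formatted with 64 KB clusters
      ```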

  7. Hi, thanks for this blog.
    We have EMC with Fastvc; the storage team wants new hardware and wants a plan from me.
    The issue is that I am working to decrease IO operations these days – so the business is up but the IOPS are down.
    What should I do?
    pini

    • Krisher – that’s a great question. A lot of our clients come to us for help building storage and hardware budgets. We’ve got a customized consulting product to help with that, and you can learn more by clicking Contact at the top of our site, or emailing us at Help@BrentOzar.com. (It’s tough to find folks who will help you build a hardware budget for free – although of course your vendors will be more than happy to do that for you, and you can guess that the budgets will be fairly high, hahaha.)

  8. I agree completely that just IOPS, or just MBps, or even just latency, is not very meaningful without context. To evaluate how meaningful a certain claim is, you really need to see the detail of IOPS AND MBps AND latency along with the context of block size and queue depth and a description of the storage system under test. Full IOmeter results and a description of the hardware and software configuration are a good start.

    My company, ION Computer Systems, has a good number of storage benchmarks posted, and they include an overview of the system along with block size, read and/or write, random or sequential, and queue depth. And they include a link to the IOmeter detailed output for the test.

    One more note: on SSDs, or SSD RAID servers like ION’s SR-71mach4 server, another important aspect of “context” is the duration of the test and the warm-up, or ramp, before the test. Disk-based systems can yield consistent results if run long enough to saturate the effects of the caches along the way. SSD-based systems need longer – hours, under tests that include writes – for the individual drives to reach steady-state performance.

  9. Pingback: What Makes Up Networked Storage Throughput? - softwareab

  10. I may have lost this in translation, but apparently looking at latency alone is not enough to gauge disk performance… So I have to ask, what is a “qualitative” way of measuring the performance of your SAN? What do I need to take into account?

    • You need to take a look at operations/sec, throughput, and latency. As Keith Josephson points out in a previous comment, you also need to be aware of block size and queue depth, as well as the other activity on the storage device.

      In a perfect world, you would collect this information and look at the average, median, standard deviation, and a variety of percentiles in order to get an accurate understanding of storage performance over time.

      There’s some great information in the comments that will help you put together an assessment of your storage.

      There are also several examples and samples over in our article How Fast is Your SAN?.
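
      As a tiny illustration of that “perfect world” summary (a sketch – the sample latencies are made up; feed in whatever per-read numbers you actually collected):

      ```python
      import statistics  # Python 3.8+ for quantiles()

      # Hypothetical per-read latencies, in milliseconds.
      samples = [4.2, 5.1, 3.9, 6.8, 4.4, 5.0, 47.3, 4.1, 5.6, 4.8]

      print("average:", statistics.mean(samples))
      print("median :", statistics.median(samples))
      print("std dev:", statistics.stdev(samples))
      pct = statistics.quantiles(samples, n=100)   # 1st..99th percentile cut points
      print("p95    :", pct[94])
      print("p99    :", pct[98])
      ```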

  11. Pingback: Coding for SSDs – Part 2: Architecture of an SSD and Benchmarking | Code Capsule

  12. Pingback: Coding for SSDs – Part 3: Pages, Blocks, and the Flash Translation Layer | Code Capsule

  13. Pingback: Coding for SSDs – Part 4: Advanced Functionalities and Internal Parallelism | Code Capsule

  14. Pingback: Coding for SSDs – Part 5: Access Patterns and System Optimizations | Code Capsule

  15. No metric should be taken without a grain of salt.

    64K is the standard extent size for SQL Server.
    It’s not the case for other RDBMSes.
    It’s optimized for HDDs.
    All current RDBMSs are tweaked to exploit sequential IO at the expense of random IO, which has horrible performance on spinning media.

    Now, if you consider an RDBMS that has been optimized for the performance pattern of solid state media, not only are you going to take full advantage of the stated burst 50K+ random IOPS, but you will also drop all the now inefficient spinning media tricks that make up most of the RDBMS storage layer for another speedup.

    Having tested SSD with current RDBMSs I was very disappointed by the performance increase, but to be fair this is due to those tools being stuck in the spinning rust era, not to the SSDs not performing, as their raw performance is a measurable fact.

  16. It doesn’t matter how SQL Server wants to read blocks, all that matters is how the drive was formatted by the OS. If it was in 4K blocks, you get 16 IOPS in SQL Server, like it or not. And WTF is a KiB? And why would you use it right next to a MB? Stick to Counter-Strike and getting your latency down to a minimum, kid.

    • Hi “Butch”,

      Thanks for your insightful and meaningful contribution to the world at large. It’s greatly appreciated!

      Since you asked so nicely, a KiB is a kibibyte or 1,024 bytes. An MB is a megabyte or 1,000 KB or 1,000,000 bytes.

      Keep living the dream!
