Let’s oversimplify the bejeezus out of this complex problem. Suspend your disbelief for a second and work with me:
We have a database server hosting just one 100GB table. Sure, in reality, we’ve got lots of databases and lots of tables, but we’re going to keep this simple. We’ve got a simple sales table that stores a row for each sale we’ve ever had. We don’t have any indexes: this is just 100GB of raw data in our clustered index.
Our database server has 32GB of memory. Some of that is going to be used by the operating system, drivers, the database software, and that bozo who keeps remoting into the server and playing Angry Birds, but again, we’re going to keep this really simple and pretend that all 32GB of memory is actually used for caching data. We don’t have enough to cache the entire 100GB table, though.
A user runs a query that needs to scan the entire table. They want sales numbers grouped by year, by region. In decision support systems, users run all kinds of wacko queries, and we can’t build indexes to support all of them, but even if we could, we’re keeping this scenario simple and assuming that we have 100GB of raw data and no indexes whatsoever. To satisfy this query, we have to read all 100GB of data.
Before our query can finish, we have to read 68GB of data from disk. That’s our 100GB table minus 32GB of it that happens to be cached in memory. I’m assuming that we’ve got a warm cache here with some 32GB of the data in memory, although I don’t know which 32GB, and it doesn’t really matter. We can’t fit 100GB of data in a 32GB bag.
The user wants the query to finish in 10 seconds or less – preferably much less. Presto: now we know how fast storage needs to be. We need to be able to read 68GB of data in less than 10 seconds, which works out to at least 6.8GB per second. We can test our storage to see whether it meets that number using my recent post on how to check your SAN speed with CrystalDiskMark.
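Here’s that arithmetic as a quick Python sketch. The function and parameter names are made up for illustration, and I’m treating 1GB as 1,000MB to keep the numbers round:

```python
def required_throughput_mb_per_sec(table_gb, cache_gb, seconds):
    """MB/sec the storage must deliver so a full table scan finishes in time."""
    # Only the part of the table that isn't already cached has to come from disk.
    uncached_gb = max(table_gb - cache_gb, 0)
    # Assumption: 1GB = 1,000MB, to keep the numbers round.
    return uncached_gb * 1000 / seconds


# The scenario above: 100GB table, 32GB cached, 10-second target.
print(required_throughput_mb_per_sec(100, 32, 10))  # 6800.0
```

Change any of the three inputs (smaller table, more memory, more patience from the user) and the storage requirement moves with it.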
The Magic SAN Speed Formula
The final formula is beautifully simple: how much time do we have, and how much data do we need to read? The business is responsible for telling us that first number, but the second number is a heck of a lot harder to gather. We have to put ourselves into the above scenario and boil things down to the simplest possible illustration of the worst case scenario.
How much memory is available for caching data? Use these simple DMV queries to find out how much memory each database is using, and even better, how much each object in each database is using. You might be surprised at how little memory is available for caching because your server needs so much memory for other tasks like keeping the OS’s lights on and sorting your query data. This is why I’m so emphatic that you should never remote desktop into a SQL Server – by launching programs there, you’re consuming very valuable memory.
How big is the biggest table we need to query? Use this DMV query to calculate the size of all the tables in your database – both with and without indexes. The results help explain why more indexes aren’t necessarily better: they’re all competing for the same memory. When I’ve got two overlapping indexes that are both getting used, I’m cutting my cache capabilities.
Can we use an index to satisfy the query? Sometimes the answer to faster storage is writing better queries that can leverage indexes rather than doing table scans. This is why it’s important to understand sargability and implicit conversions.
How much of this data can we guarantee will be in cache? Think worst case scenario: other queries may be running, or other databases on the system might be more active and taking over the cache. The more memory I put in the server, and the more I isolate performance-critical databases away from the rest, the more I can guarantee fast queries by caching data.
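To make the worst-case thinking concrete, here’s a rough Python sketch. The names and the 50% figure are my own assumptions for illustration, not measurements: the idea is just that you only get credit for the slice of memory you can count on holding this table’s data.

```python
def worst_case_throughput_mb_per_sec(biggest_table_gb, memory_gb,
                                     guaranteed_cache_share, seconds):
    """Storage speed needed when only part of memory is reliably ours.

    guaranteed_cache_share is a hypothetical knob: the fraction of server
    memory (0.0 to 1.0) we can count on holding this table's data while
    other queries and databases compete for the cache.
    """
    guaranteed_cache_gb = memory_gb * guaranteed_cache_share
    # Whatever isn't guaranteed to be cached has to be read from disk.
    gb_from_disk = max(biggest_table_gb - guaranteed_cache_gb, 0)
    # Assumption: 1GB = 1,000MB, to keep the numbers round.
    return gb_from_disk * 1000 / seconds


# Same 100GB table and 32GB server, but assume only half the memory
# is reliably caching this table.
print(worst_case_throughput_mb_per_sec(100, 32, 0.5, 10))  # 8400.0
```

The less cache you can guarantee, the closer you get to reading the whole table from disk every time.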
Microsoft’s Reference Architecture Specs for SAN Speeds
Microsoft’s Fast Track Data Warehouse systems are purpose-built database servers that ship with everything you need to get fast performance. They’re available from hardware partners like Dell, HP, and IBM, and Microsoft works with ‘em to make sure you’ll get the speed you need.
The Fast Track reference architectures assume that we can’t satisfy queries via indexes, and they don’t even try to cache the data in memory. They just flat out assume queries will be performed using table scans, so they require very high speed storage performance:
“…this system architecture is called the Core-Balanced Architecture. This balanced approach begins with what is called the CPU core consumption rate, which is the input capacity that each CPU core can handle as data is fed to it.”
This is a really different approach, and it starts to explain SQL Server 2012’s licensing of around $7k per core for Enterprise Edition. If you’re going to pay big money for 40 cores of that licensing, wouldn’t it make sense to ensure that those CPUs can actually do work? By specifying a minimum IO throughput per core, Microsoft guarantees that the server could actually get busy. Otherwise, we’re harnessing expensive thoroughbred racehorses to a crappy chariot. The Fast Track Configuration Guide even goes so far as showing you how to calculate a Maximum Consumption Rate and a Benchmark Consumption Rate for your system before going live. (I love Microsoft.)
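A back-of-the-napkin Python sketch of the core-balanced idea follows. The $7k price and 40 cores come from the paragraph above, but the per-core consumption rate is a placeholder I made up: plug in the number you actually measure using the Fast Track Configuration Guide’s method.

```python
def license_cost_usd(cores, price_per_core_usd=7_000):
    """Rough Enterprise Edition licensing cost at about $7k per core."""
    return cores * price_per_core_usd


def io_needed_to_feed_cores_mb_per_sec(cores, per_core_rate_mb_per_sec):
    """Storage throughput needed to keep every core fed with data."""
    return cores * per_core_rate_mb_per_sec


cores = 40
# Hypothetical per-core consumption rate, NOT a Microsoft figure.
assumed_rate_mb_per_sec = 200

print(license_cost_usd(cores))  # 280000
print(io_needed_to_feed_cores_mb_per_sec(cores, assumed_rate_mb_per_sec))  # 8000
```

If the storage can’t deliver the second number, some of the money behind the first number is paying for cores that sit around waiting.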
In a typical customer environment I worked with recently, the IO subsystem was able to deliver 300-400MB/sec. By using the questions above and looking at Microsoft’s Fast Track reference architectures, we calculated that they needed closer to 4,000MB/sec in order to satisfy their end user requirements for query times. Put another way, if we didn’t change any of the other variables, we needed to make the storage at least ten times faster. Obviously, making that kind of improvement ain’t easy or cheap – and suddenly we got buy-in from management to change some of the other variables.
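The gap is easy to check with the numbers from that engagement. A tiny Python sketch (the function name is mine):

```python
def speedup_needed(required_mb_per_sec, current_mb_per_sec):
    """How many times faster the storage has to get."""
    return required_mb_per_sec / current_mb_per_sec


required = 4000  # MB/sec needed to hit the query time targets
for current in (300, 400):  # measured range of the existing IO subsystem
    print(current, round(speedup_needed(required, current), 1))
# 300 13.3
# 400 10.0
```

Even at the generous end of what the existing storage could do, that’s a tenfold gap.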
When you see the whole picture – licensing, storage throughput, query design, and end user requirements – it’s much easier to find the right way to get faster performance. Sometimes it’s insanely fast IO throughput like Microsoft’s Fast Track solution, and sometimes it’s rewriting queries to improve index utilization. Showing the real cost of storage throughput helps justify why query writers need to step back and rewrite troublesome parts of the app.