SQL University: Where We Store Stuff

By Brent Ozar · December 23, 2009 · 18 comments

Welcome back, class. In Monday’s session, we discussed the basics of SQL Server storage – things like the data file, log file, and tempdb. We talked about why we might need to separate those things from each other in order to get more performance. In today’s lecture, I’ll be covering where we store this stuff and why it matters.

In The Beginning, God Made Magnetic Hard Drives

You kids are living in a magical time; you’re old enough that you’ll remember when data was stored on spinning metal platters, yet young enough that you’ll watch the transition to stashing it in memory chips. (Some of us are even old enough to remember computers that booted off floppy drives, but I digress.)

If you pop the hood on a 3.5″ magnetic hard drive, you’ll see something like the picture shown here. The round silver platter is like a vinyl record, and the metal pointer-looking thing is like a the record needle. Wait, what? You don’t know what a phonograph is? Well, I suppose that’s what I get for trying to use 19th century technology as a simile in the 21st century.

The pointer-looking thing (known as the “head”) changes the magnetic state of small areas of the drive. If the area is charged one way, it represents a zero, and if it’s charged the other way, it represents a one. As the platter spins around, the head moves over different areas of the drive where different data is stored. The head can also move left or right to access closer in to the center of the platter, or further out toward the edges. Over time, hard drive technology has improved by making the platter spin faster (thereby covering more data in less time) and packing more magnetic on/off areas on each platter (thereby letting us store more stuff in the same physical space.)

There’s a few problems with these drives. First off, there’s a lot of moving going on; the platter spins at speeds of up to 15,000 RPM. To put that in perspective, it’s faster than the engine spins in your car. If any dust gets into the drive, or if a part comes loose inside, things will crash around inside the drive and tear it apart. We might not find out about this damage until it’s too late – our data will get damaged as more parts of the platter start to fail, but unless we’re constantly re-reading everything on the drive, we won’t know about it. The drive’s moving parts themselves can also wear out, such as the motor that spins the platter or the motor that moves the head around. In order to fix those parts, we would have to crack open the drive, which would introduce dust and cause a failure.

While these speeds are fast in the mechanical world, they’re downright tortoise-like compared to the rest of our computer. In the milliseconds that it takes to move the head from one point to another on the drive, our computer’s processors sit around waiting for what feels like eons. This is why computers use memory to cache things that are also stored on the drives – it’s dramatically faster for us to read this data out of memory than it is to fetch it from spinning metal platters.

Want to learn more about relative speeds inside computers? Peruse the Build-a-PC Bandwidth Chart, which compares bus speeds of processors, memory, networks, and storage.

Two Heads Are Better Than One

Redundant Array of Inexpensive Phonographs makes sure the beats go on and on to the break of dawn — Redundant Array of Inexpensive Phonographs ensures the beats go on and on to the break of dawn

To mitigate the failure and speed problems of these drives, sysadmins combined multiple hard drives together into Redundant Arrays of Inexpensive Drives (RAID). There are several ways to configure groups of drives, and the simplest method is called RAID 1. It uses two hard drives that always contain exactly the same set of data. Whenever the computer writes data to one drive, it also writes that same data to the other drive. The duplicate-write process can either be managed in hardware or at the operating system level. When one of the drives fails, the other drive is unaffected, so our data is still available and our server stays online.

When both drives are working fine, then we can do our reads from either drive. If we need to read a 10 megabyte file from storage, we can read 5 megabytes from one drive and 5 megabytes from the other drive, which could make our reads twice as fast. If two drives still don’t give us the performance we need, there are other RAID levels like RAID 5 and RAID 10 that can handle more drives for more speed. (The level names don’t have anything to do with the number of drives involved.)

Want to learn more about RAID levels? The Wikipedia articles on RAID and on standard RAID levels are very well-written.

The more performance we want, the more hard drives we have to use – but most servers only come with a few hard drive bays. The solution is to move to a Storage Area Network (SAN), a system with hundreds of hard drives, a very smart computer to track what they’re doing, and a dedicated network to connect the servers to their hard drives. SANs are big (typically fridge-sized or bigger), expensive (at least $100k, but often $1m or more), and they can be very fast. Can be, but may not be – more on that in a minute.

Even dozens or hundreds of drives in a RAID configuration aren’t always fast enough, and that’s where the newest drive technology comes in.

Solid State Drives (SSDs)

USB thumb drives and memory cards for cameras represent a new kind of storage that uses transistors instead of magnetic charges to store data. This flash memory uses an electric charge to change the state of a transistor much like our magnetic hard drives used power to flip magnetic bits, and the on/off state of each bit of flash memory represents 0s and 1s. Just as magnetic hard drives don’t need power in order to maintain their state, flash drives don’t need electricity to maintain the state of the transistors. This means that unlike your computer’s RAM memory that gets erased whenever you power off, the flash memory saves its state over time. Eventually, as this technology got faster and cheaper, manufacturers could build entire hard drives out of these chips.

Early generations of solid state drives performed worse than conventional magnetic hard drives, but subsequent generations have dramatically improved. SSDs can read any stored data immediately without waiting for platters to spin or heads to move, so they can really shine at random access of data. Compare these two charts from Tom’s Hardware covering average read speeds (higher is better):

So we’ve established that they’re faster (in part because they don’t have moving parts), but just because they don’t have moving parts doesn’t mean they don’t wear out. Each bit of storage can only be electronically flipped so many times before it breaks. To account for that, SSDs try to balance out which parts of the drive are written to, and will move your data around in order to wear it out evenly. “Hot” areas that have been written many times may be moved to another part of the drive, thereby leveling out the wear. Drive manufacturers also build in extra blank space on the drive to give it extra space to take over when other areas are worn out, thereby extending the drive’s life.

Want to learn more about wear leveling? Anandtech’s article on choosing the best SSD covers the subject in exhaustive detail and explains that the OS and drivers also matter.

We can protect ourselves from drive failure by using SSDs in RAID configurations the same way we did with magnetic hard drives. In fact, SSDs may be simple drop-in replacements for some servers, since SSDs are available with SATA interfaces just like your desktop’s hard drive. However, when we start talking RAID, we usually talk about connecting to our hard drives with something faster than SATA. As we start to throw more and more drives at the problem, whether they’re magnetic or solid state, we run into a new problem. We have to make sure that the connection between our drives and our computer is fast enough to handle all of the drives at once.

Homework Time: Studying Multipathing & Shared Drives

For those of you who are using a SAN, thinking of using a SAN, or hoping to get a job somewhere with a SAN, there’s a lot to learn. Thankfully for me, I’ve already written a ton of it, and here comes your homework:

Stop whining – just because it’s a holiday week doesn’t mean you can stop learning. After consuming all that information, you should have a basic grasp of what makes storage fast or slow, and how to find out your own storage performance.

Putting It All Together: How Fast Is My Storage?

Your storage speed is a combination of the drive type, the RAID level, and the connection between the servers and the drives. (It depends on even more than that, like whether the drives are shared or dedicated, but let’s focus on the basics for now.)

Drive types, from fastest to slowest:

Good solid state drives
15k magnetic drives (meaning they’re spinning at 15,000 RPM)
7200rpm drives
5400rpm drives
Bad solid state drives

There’s some variations in here – for example, there are really good 5400rpm drives that outperform bad 7200rpm drives, and so on, but again, let’s focus on the big picture.

RAID levels, from fastest to slowest:

RAID 10 or RAID 5 (depending on whether you’re reading or writing)
RAID 1 (a pair of drives in a mirror)
No RAID at all (a single drive, and yes, people do still do this – especially with solid state drives like Fusion-io)

You in the back – sit back down. Yes, I understand that there are dramatic performance differences between RAID 5 and RAID 10 in many scenarios, but this is a beginner’s class. If you want to teach, get your own course.

Connection methods, from fastest to slowest:

Properly configured SANs with true active/active multipathing
Properly configured SANs without active/active multipathing
Locally attached storage
Poorly configured SANs

The bean counters in the class will notice that these variables are also listed in another order: from most expensive to cheapest. Good storage ain’t cheap.

You, dear student, probably don’t have the responsibility of architecting your entire SQL Server storage solution from scratch. You’re probably stuck with a few existing options for your servers, like a couple of arrays from your SAN admin or asking for different RAID configurations. Your simple answer is to use SQLIO to benchmark your storage performance and find out how fast each of your options are.

If you have a fixed number of drives, but you have the ability to carve them up into different groups, work with your SAN administrator to try different combinations. For example, if you had 20 hard drives, you could measure the read/write performance of each of the following:

14 drives in a RAID 10 plus 6 drives in a RAID 10
14 drives in a RAID 5 plus 6 drives in a RAID 10
2 sets of 7 drives each in a RAID 5, plus 6 drives in a RAID 10

Why would you try these different combinations? Because your next decision – and this is where “it depends” becomes an art form – is where to place your SQL Server files.

Where SQL Server Meets Storage

Assuming you have multiple storage options with different performance profiles available to a given SQL Server, the simplest guideline is:

Transaction logs go on the fastest writing storage (typically your fastest drives in a RAID 10 array)
Data files go on the fastest reading storage

The problem is that both of those lines say “fastest.” When DBAs approach SAN admins in a dark alley with a storage requirement, we always say the same thing: “Give us the fastest storage you have. All of it.” Since SAN admins aren’t prone to emptying their wallets for us, we have to learn how to compromise. We need to identify as many of our stored things as possible that don’t need speed, and then peel those out into separate requirements. In the case of a multi-terabyte data warehouse, those requirements might look like this:

Transaction log files – extremely fast writes, needs to be on a RAID 10 array.
Primary data files – tables with frequently changing dimensions, last 12 months of sales data. Needs to be on a RAID 10 array.
Archive data files – historical sales data >1 year old that makes up 75% of our data volume, but is rarely changed. Can go on a slower RAID 5 array.

Then inside each of those needs, the DBA must decide whether to use a single file on a single array, or multiple files on multiple arrays. That decision is determined by the number of cores involved, the types of queries, and even the multipathing configuration on the database server. For example, if a server is connected to its SAN with several 1 gig iSCSI links that don’t have true active/active multipathing, the DBA and SAN admin may determine that they need multiple data files on multiple volumes just to get more throughput – regardless of CPU count or query type.

This is a lot of work to do on a server-by-server, database-by-database basis. That’s why in the vast majority of cases, it’s best to start simple: just put the data and log files onto the fastest drives possible, and then over time, monitor your storage performance. I recommend using Perfmon to periodically check your Sec/Reads and Sec/Write numbers as discussed in my Perfmon article. When your queries are experiencing IO bottlenecks and you trace the problem back to slow Sec/Reads and Sec/Writes metrics, then it’s time to tune storage. Beef up that particular storage array, separate out what’s being stored on that array onto different arrays, or figure out how to make that array go faster (like with multipathing).

Hopefully I’ve started to open your eyes about the ways to tweak your storage options and what you put on that storage. The options are staggering, and the more you play with it, the more you can eke out every last bit of performance.

Our final post on Friday will cover some scary storage scenarios. (Yes, it’ll go live on Christmas Day, but only because I’m scheduling it in advance.)

Continue to Part 3: Scary Storage Scenarios

Free, 3× a week

Get my new posts by email

Three posts a week, plus a Monday roundup of the best database news from around the web.

18 comments

Mark Richman

December 23, 2009 at 7:31 am

“If you want to teach, get your own course.” – Well said, my friend!

Can you recommend a make and model SSD to put in RAID 10 on my desktop? I’m looking for 2TB total storage, so I guess four 1TB SSDs would do the trick.

Reply
1. Brent Ozar
  
  December 23, 2009 at 7:36 am
  
  The best answer for that is the Fusion-io ioDrive Duo right now. 640gb capacity in each PCI Express slot, and you bypass the slow SATA interface. As long as you’re using it in a non-clustered environment, you won’t mind that it isn’t shared between multiple systems (like a cluster). Of course, it ain’t cheap on a per-gig basis, but I’m hearing raves about their performance.
  
  Reply
2. Brent Ozar
  
  December 23, 2009 at 7:36 am
  
  Forgot to include the link to the Fusion-io:
  
  http://www.fusionio.com/products/iodriveduo/
  
  Reply
3. Mark Richman
  
  December 23, 2009 at 7:48 am
  
  Well, what about a SATA SSD solution? I’ve got 4 SATA drive bays available on my Mac Pro. Obviously, this will not be a SQL Server application 😉
  
  Reply
  1. Brent Ozar
    
    December 23, 2009 at 7:51 am
    
    I really like the benchmarks on Tom’s Hardware for that:
    
    http://www.tomshardware.com/charts/2009-flash-ssd-charts/benchmarks,48.html
    
    But make *sure* the Mac Pro’s SATA controller will actually handle the speed. I’ve heard horror stories about the Macbook Pros not being able to handle the additional speed. I haven’t worked directly with any Mac Pro users, so I can’t say about those SATA controllers.
    
    http://www.macsimumnews.com/index.php/archive/macosg_macos_x_is_having_trouble_handling_ssd_hard_drive_speeds
    
    Reply
Dave Schutz

December 23, 2009 at 7:40 am

BrentO,
I can remember running programs off floppy disks and records on phonographs! Is there a reason you skipped NAS and jumped right to SAN? SAN is sexier and fancier but NAS can have lots of disks for cheaper cost. Some of us smaller organizations cannot afford SAN yet. Also are you going to cover which RAID is best for OLAP or OLTP configurations?
I read an article on Fusion-io claiming each drive had 6 GB throughput and they put multiple drives together in a box giving 1 TB throughput. If I can get that with a 48 core Intel chip for Christmas I’d be a happy boy!

Happy Holidays!

Reply
1. Brent Ozar
  
  December 23, 2009 at 7:43 am
  
  Dave – because SQL Server can’t store live databases on a NAS. You can run virtual machines with the entire server stored in the NAS, but that’s outside of the DBA’s world, so I didn’t cover that. (Database server performance in that kind of setup is generally horrendous, too.)
  
  I’m not covering which RAID is best for OLAP or OLTP because as I discussed in the article, there’s no one answer. The best thing for anybody is a RAID 10 of the fastest drives on the planet – but that has nothing to do with your budget. It all comes down to what you can afford, what you already have in place, and what your application does.
  
  Reply
Grant Fritchey

December 23, 2009 at 8:23 am

Boot off of floppy disks… luxury. Try booting off a casette player some time.

Excellent explanation.

Reply
1. Mark Richman
  
  December 23, 2009 at 8:58 am
  
  ,8,1 FTW!
  
  Reply
  1. David Stein
    
    December 23, 2009 at 10:16 am
    
    ,8,1 Brings back some Commodore 64 Memories. The 1541 Hard Drive was almost as bad as a tape drive in my opinion.
    
    Reply
  2. SQLDenis
    
    December 23, 2009 at 1:21 pm
    
    Yes I remember that and if you didn’t have a TURBO LOADER it would take 35 minutes to load a 40K game. I wrote about the C128 here: http://blogs.lessthandot.com/index.php/ITProfessionals/EthicsIT/what-was-your-first-computer-and-what-we
    
    Reply
2. David Stein
  
  December 23, 2009 at 10:15 am
  
  Oh yeah, we had a TRS 80 Color Computer CoCo with tape drive. Mind numbingly slow.
  
  Great article as usual Brent.
  
  Reply
3. Gail
  
  December 23, 2009 at 12:15 pm
  
  I remember doing that. When I was about 10 years old….
  
  Reply
jason

December 29, 2009 at 6:32 am

So with the ssd’s, Would we now put everything on the same disk? Meaning data files, tempdb, and log files? I thought about this and I don’t see why you wouldn’t, unless of course of money/space issues.

Reply
1. Brent Ozar
  
  December 29, 2009 at 6:46 am
  
  Jason – it depends. As of this writing (12/2009), there’s cost/capacity issues with SSDs, and there’s also compatibility issues. Most server vendors still aren’t shipping all of their servers with solid state drives, and they’re not certifying many SSDs on their compatibility lists, so you can run into problems. If you put them directly into your server, then you can’t do clustering. If you put them into shared storage (like a SAN), then you have to be on the SAN’s compatibility list, and that’s rather difficult right now.
  
  Reply
  1. jason
    
    December 29, 2009 at 7:15 am
    
    I guess it was more of a theoretical question for now. “In the future” with ssd’s, is the best practice of separate drives going out the window?
    
    Reply
    1. Brent Ozar
      
      December 29, 2009 at 7:19 am
      
      Jason – for entry-level databases, like databases under 100gb, that practice is pretty much already out the window. For larger databases, though, it still makes sense, because you can need more throughput than even SSDs can deliver.
      
      Reply
      1. jason
        
        December 29, 2009 at 9:37 am
        
        Thanks Brent!
        
        I took a brisk walk with the dog this morning and thought about this a bit more and decided I still need redundancy, Raid levels (as you mentioned) are still relative, and I may want to partition off certain files. Also had the thought that bigger and better hardware is going to turn into bigger(resource intensive) and better applications. So for now I guess this is all exciting but still quite up and coming.