Automated Tiered Storage for Databases


Database administrators want just one kind of storage: fast.

Companies want just one kind of storage: cheap.

Since fast storage isn’t cheap, and we can’t put everything on slow storage either, we end up buying different tiers of storage. Let’s start by pretending that we’re buying a new HP DL370 G6 to act as a data warehouse server.  It can hold up to 14 3.5″ hard drives – so what drives should we buy if we want to use tiers?

All the better to store you with
HP DL370 G6: Our local tiered storage candidate

Tier 1: Mirrored pair of Solid State Drives – SSDs are still pretty expensive, so we only buy two of these.  We’re going to store mission-critical data on our SQL Server, so we buy two of these drives and mirror them.  That way, if either drive fails (and SSDs certainly do fail), our server won’t go down.  However, SSDs have limited capacity, so if we buy two 256GB drives in a mirrored pair, we’re only going to get 256GB of usable very-fast capacity.  SSDs are blazin’ fast at random reads and writes because they have no moving parts – they can jump to any portion of the drive at any time.  In our sample server, we’re going to use these drives for TempDB because we’ve determined that our application hammers TempDB hard.

Tier B: Four 15K drives in RAID 10 – 15k refers to how fast the drives spin.  At 15,000 revolutions per minute, the platters are flying, and lots of data is passing under the hard drive heads.  These drives can read or write a lot of data very fast when they’re dealing with sequential data, because the drive heads don’t have to move around.  They’re not as good with random reads and writes as SSDs, which have no moving parts.  Our 15K drives are larger than SSDs – we can afford to buy the 450GB models, and with four of those in a RAID 10, that gives us about 900GB of unformatted capacity.  (You can learn more about calculating RAID sizes in the Wikipedia RAID article.)  In our sample server, we’ll use these drives for our transaction logs.

Tier III: Eight cheap 7200RPM SATA drives in RAID 5 – These only spin at 7,200 RPM, roughly half the speed of the above drives, which means data is passing under the hard drive heads more slowly.  Just as slower cars are cheaper than faster cars, slower hard drives are cheaper too.  We’re going to buy the 2TB models, and with 8 of those in a RAID 5, that gives us 14TB of unformatted capacity.  RAID 5 can be much slower for writes, but it can be fast for large reads.  Our database stores a lot of wide data because we need to store XML and files in the database, but we don’t update that data very often.  It’s mostly kept for archive purposes.  We really do need that much space for our user databases.

This is manual tiering with local storage: carefully determining your application’s needs and then crafting a storage solution to fit.  It requires picking the right server, the right drives, the right RAID configuration, and then finally, configuring the application (SQL in this case) to store the right data in the right places.

This is a pain in the ass.

Very few people do this because there’s so much labor involved if you want to do it right.  If you just make some guesses about loads, buy any old drives, and slap ’em into any old RAID configuration, then you’re locked into bad performance.  Six months down the road when you realize the application isn’t really hammering TempDB, and that the real speed bottleneck is one particular set of tables in a user database, you’re going to have a rough time reconfiguring all this.  Worst case scenario, we’re talking about backing up all of the databases, reformatting everything, and restoring it all from backup.  The bigger your databases get, the more painful this becomes.

In-Drive Storage Tiering

Whenever there’s a pain, vendors step in and offer to take your money help.  One of the early successful hybrids was Seagate’s Momentus XT – a 2.5″ laptop drive that partnered 4GB of SSD memory with a 500GB hard drive in one package.  The drive learned which data you accessed the most frequently, then cached that data on the SSD.  The hybrid drive sped up routine actions like reboots and application launches.

If it was shipping, I'd show a picture of it in my hand.
OCZ RevoDrive Hybrid

The newest contender in the market, the OCZ RevoDrive Hybrid, is a PCI Express drive like Fusion-IO, but it combines a 100GB SSD with a 1TB 2.5″ laptop drive.  These aren’t shipping yet, but the press release holds an interesting nugget:

“Advanced caching algorithms learn user behavior and adapt storage policies to ensure optimal performance for each individual user, maximizing productivity for the most demanded programs and applications.”

The word caching implies that like the Momentus XT, the RevoDrive Hybrid is writing everything to the magnetic drive, but just caching frequently accessed data on the SSD for faster reads.  However, further down in the release, they brag:

“In addition, the drive not only eliminates the SATA bottleneck unleashing ground-breaking bandwidth up to 910MB/s, but also features up to 120,000 IOPS (4K random write) for high transactional workloads delivering true SSD-level performance.”

The phrase random write implies that writes will hit the SSD first, and then later be pushed down to the hard drive.  You can’t get 120,000 IOPs of 4K random writes on a 2.5″ magnetic drive.  This also means that if you tried to continuously do 4k random writes, you might be able to fill the available space on the 100GB SSD, and then performance would slow down as the RevoDrive was forced to migrate data down to the hard drive.

I’m not suggesting you use either of these solutions on a production SQL Server, but I’m showing them to introduce you to the concept of tiered storage.  In a single drive, some of your data might reside on fast solid state memory, and the rest would live on slower, more capacious magnetic Frisbees.

SQL Server Thinks Drives is Drives

Tiered storage can be much cheaper, easier, and more effective than partitioning.The good news is that your database server doesn’t need to be configured for anything in order to use this storage.  In our data warehouse example, we could use several of these RevoDrive Hybrids to store our sales table, which has hundreds of millions of rows of history.  It documents our company’s history going back for the last ten years, and our executives have told us we’re not allowed to archive any data.  They want to be able to run reports to compare this Christmas season’s sales against the last ten Christmases.

Before automated storage tiering, this type of table was painful for DBAs.  How do we decide which data to put on which drives?  If we only keep the most recent year of data on fast storage and the rest on slow storage, then our users are going to be screaming during every holiday season.  They’ll all try to run reports comparing this year against the last ten years, and the reports will run dead slow.  We scream, “WE CAN’T STORE ALL THAT DATA ON FAST DRIVES!” but actually…we can.

With automated storage tiering, the storage gear constantly watches which parts of the drive are being accessed, and moves them to faster/pricier storage.  It also watches which parts aren’t being frequently accessed, and moves them to slower/cheaper storage.  Depending on your environment, this tiering management might be done in different places:

DBAs need to know about this technology because it can help you avoid complicated setups like table partitioning or guessing what needs a fast TempDB versus what needs fast transaction logs.  When you know how tiered storage works and when you know how to test it, you’ll be able to adapt to changing performance needs faster.

Previous Post
How Do You Mask Data for Secure Testing?
Next Post
Does This Transaction Log Make My Database Look Fat?

24 Comments. Leave new

  • Hey Brent,

    thank you for this article. One thing though: what about storage array controller caching? My fear is that a fast tier 1 might constantly spill the cache and therefore tier 2/3 might be even more slower. Of course I’m talking about cheaper das/san solutions without a manageable caching strategy.

    What are your thoughts on that?

    Thank you!

    • Vadim – I’m not quite sure what you mean by “a fast tier 1 might constantly spill the cache.” Can you try asking the question a different way? Thanks!

      • Oh ok, maybe this is just something I misunderstand. I’ll try that another way:

        Suppose we have only two tiers: tier1 for performance critical data. tier 2 for the rest.

        In an OLTP scenario I would guess: tier1 for tempdb (lots of r/w) and log (w), tier2 for data (more reads/but some writes as well).

        But we have only one DAS/SAN array cache block (simple, low/mid-range storage solution).
        So I am afraid the tier 1 data will fully saturate the cache and slow down the tier 2 performance even more.

        To summarize: what effect would tiering have on the storage caching algorithms?

        Thank you!

        • Vadim – gotcha, I think I’m not explaining well enough.

          Tiered storage is a feature that’s built into the storage itself. If you’re using a low/mid-range storage solution, it’s probably not tiered. When you say it’s DAS/SAN, those are two different things. What’s the exact make and model of your storage?

  • Hey Brent,

    well actually I meant a storage which is capabel of tiering. For example a HP EVA 6300 or even an MSA for that matter. They take hdds and ssds. Yes, there is no AUTO-tiering available (as in an HP XP).

    Simple example basically would be to create couple of SSD based LUNs and some HDD based LUNs.

    An here I’m not sure how array caching will work.

    The procedure is the same as yours basically.

    sry for taking up your time with my questions, still struggling with the concept of manual tiering (and my english) 🙂

    Thank you!

    • Ah, okay, that’s not automated tiering. For manual tiering with most storage arrays, things work exactly the same whether you’re using different tiers or the same tier. If you use one SSD array and one HDD array, you’re still splitting up the cache the same as if you had two HDD arrays. It depends on how the storage processor works. Some divide cache by array, some by LUN, some by server. You’ll need to check with your SAN admin to see how yours works.

  • Brent,

    Any experience or thoughts on whether having sufficient RAM on the local server would negate the need for or not completely utilize the auto-tiered storage? It would seem to me that once the data has been read from disk, SQL would store it in memory and would try to read from that first before it would go back to SAN based cache or say the SSD tier on a SAN. Not to mention, at least in the EMC world (as I understand) blocks of data get moved in 1 GB chunks from tier to tier and I wouldn’t think that would be able to happen fast enough to derive any real benefit.

    • Jared – good question. Despite having a lot of memory, there are still things that hit the disk. The two most frequent ones are the transaction log (because writes have to hit the disk) and spills (when SQL doesn’t allocate enough memory to run a query and needs to use TempDB instead). We cover both of those in detail in our book –

  • Brent,

    I already own that book….bought it about a week after it came out. AWESOME resource, one of, if not the best technical books I have had the pleasure of reading. I actually pulled both my tempdb and log files back to separate local raid 10 storage on 15k drives. We are in the process of designing a new SAN infrastructure and I am just really curious on whether our particular databases will see any marked improvement from these new auto-tiering SANs. I also think that SQL will out smart any flash cache on the san used for caching frequently read data because SQL server already manages that itself with RAM. I know every database is different but I was wondering if you had (or anyone else) had experience with these new drives or SANs. I also wonder how SAN based read cache handles dirty pages from SQL server? When the page gets flush from local cache, will the SAN be smart enough to do that as well?

    • Jared – you’ve got a whole lot of questions in there, but I’ll take a quick stab at a few:

      “I am just really curious on whether our particular databases will see any marked improvement from these new auto-tiering SANs.”

      To give you that answer, I’d need to know a lot more about your particular databases. I usually spend 2-3 days working with a client to develop the right storage strategy for their database needs.

      “I also think that SQL will out smart any flash cache on the san used for caching frequently read data because SQL server already manages that itself with RAM.”

      It depends on how much memory you put in the SQL Server, how you carve up your LUNs, how the SAN divides cache (between servers or LUNs), and whether you ever need to reboot the SQL Server.

      “I was wondering if you had (or anyone else) had experience with these new drives or SANs.”

      Yep, absolutely, I’ve got lots of clients using them.

      “I also wonder how SAN based read cache handles dirty pages from SQL server? When the page gets flush from local cache, will the SAN be smart enough to do that as well?”

      When a page is dirty, SQL Server eventually writes it back to disk. The SAN learns of the change at that point. SQL won’t read a dirty page off disk because it knows it has changes in memory that haven’t been committed to disk yet.

      Hope that helps!

  • Ah…the “it depends” answer. I completely agree with your responses. I still have a little apprehension on letting anything “automatically” move my data base on some algorithm (over simplified I know). Mostly paranoid on my part I guess. I am still very intrigued by these new drives and new auto-tiering SANs.

    Thanks for your comments and keep doing what you do. You are a very inspiring individual!

  • Hi Brent,
    Have you chnged your thinking on automated storage tiering and databases since writing this article? Im curious to know what the current thinking is as I have tried this sort of thing with Oracle on Unix a few times and not been convinced. The bits I like are that the writes ALWAYS occur on SSD and are then migrated down the tiers by the SAN. The problem is with reads. The database optimiser is always going to be out of date. It has no idea how fast the underlying storage is for any particular access plan. Therefore it cannot produce the most efficient access plan for your data. Unless the DB optimiser can integrate with the SAN controllers having automated tiering is going to produce very poor query performance. Conversely having manual tiering and a good DBA means your db stats are a good representation of the access speed of the data. If you migrate a DB file between tiers you regenerate the stats after. Simple.
    Agreed the access profile of your DB plays a part. If you only ever perform writes then auto tiering is a god send. If your access is more balanced then discuss….
    What are your thoughts on this?

    • Nick – well, let’s take a few steps even farther back. When you get an execution plan for a query, that plan doesn’t even take into account whether the data is cached in memory. Think about the difference between memory access versus drive access, and does the tiering of drives really matter? No, that pales in comparison to the difference between memory or drives. Because of that, I don’t get too worried about the optimizer tweaking plans based on the speeds of the drives. When in doubt, cache it in memory. 😀

  • Thomas Rhea
    July 17, 2013 4:05 am

    Hi Brent – we have an new EMC automated storage array and are in the process of looking at layout of files in regards to putting them on new LUNS. However from what has been described to me data is moved around the tiers at a block level depending on how hot they are and as such I am struggling to understand what advantages there are in spending time looking at things like sys.dm_io_virtual_file_stats to work out how to divide up the files across LUNS?

    This then leads me on to file groups and whether there is any advantage now other than perhaps for recovery in looking at putting tables into filegroups and across LUNS?

    Hopefully that makes sense.



    • Hi, Thomas. This is beyond the scope of something I can answer quickly in a blog comment, unfortunately. You’re at the point of personalized systems design where it usually makes sense to bring in someone (EMC or a consultant) to have a day-long conversation.

  • Hi
    In our environment, we have Vmaxe storage and earlier Tempdb was on this storage when a query was taking approx 600 ms.
    Now tempdb has been moved to Local SSD and query takes about 4 ms.

    Any ideas ?

  • Brent,

    To be brief, we are moving from a Clariion CX-480 to potentially a VNX5400. Our CX-480 has been carved up for performance with many spindles per dedicated LUN to guarantee sql performance. The VNX utilizes automated tiering which is new to us. EMC is saying we don’t need all the physically dedicated LUNs anymore with high spindles to guarantee performance. We can now rely on automated tiering within a storage pool to move the hot data to faster storage as needed and colder data to slower storage as needed. Essentially they are saying we can get better performance with less spindles. This is frightening. Especially if we spend the $$$ to upgrade our SAN and get less performance.

    Is automated tiering just something DBA’s have to take on faith works? Or can you suggest a way to test that performance will not be less than what we currently have?

    • Evan – let me turn around and answer that with a couple of questions.

      1. Your CX-480 – does it have solid state drives?

      2. Have you tried replacing your desktop or laptop’s magnetic hard drive with a solid state drive? What was your experience like?

      • 1. We utilize FAST Cache in our CX-480 for “some”, not all, sql LUNs. We also have a Violin Flash array where Tempdb currently sits.

        2. Yes my laptop currently runs on a Liteon LCT-128M3S SSD. Read/Write operations for my local usage is MUCH faster.

        I am sold on the performance of SSD. When it comes to using storage pools for sql I am concerned with contention from high IO jobs running simultaneously against the same spindles which previously had their own dedicated spindles to run on. Our sql environment is 7TB. An educated guess is 2-3TB of that needs to have high performance when it’s being accessed.

        • You nailed it when you said these two things:

          “dedicated spindles” – stop thinking about storage as spindles, and start thinking about IOPs, throughput, and latency. Dedicated spindles were the only crappy shortcut we had to guarantee performance, but those days are over.

          “2-3TB of high performance reads” – that’s what memory is for. If you’re continuously reading the same 2-3TB of data, and that data is broken up across multiple SQL Servers, how much RAM do they have?

  • Dedicated spindles – I guess old habits die hard. Thanks for the positive reinforcement.

    Believe it or not, we have 6.7TB of data spread across 68 databases on one instance of SQL 2008. The instance runs on a clustered Win2k8R2 Dell R820 with 128GB of RAM that can only have downtime on Saturdays.

    Along with refreshing our storage needs, we are struggling to find a way to break up this beast. We have four departments hitting this server with critical SPs and code heavily engrained from 3rd party applications and past employees. The same databases are access by each department. Some hit the same tables, some hit their own tables. (Yes, I’m pretty good at tracking deadlocks and blocking). Finding a starting point for unraveling is challenging. Find a solution that every department can agree on is even more challenging.

    Any experience with this kind of situation?


Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.