The old advice went something like this: “Put your data and log files on separate drives and your server will be more reliable. If you lose the data drive, you can still do a tail-of-the-log backup, and you won’t lose any data.”
It’s advice. But is it actually good advice?
Let’s think through it.
- If SQL Server loses connection to shared storage, you’re still screwed. No surprise there.
- If it loses connection to just one volume, and it happens to be the log file volume…you’re still screwed.
- But if it happens to lose connection to just your data file volume, you’re safe! Well, you’re down, but you didn’t have any data loss (assuming you know how to do a tail-of-the-log backup).
At first, that sounds like you’ve cut your risks by 50% – but let’s dig deeper. This scenario correctly assumes that a single volume can fail. I’ve certainly had that happen:
- A SAN admin accidentally remapped one of my volumes to a different server
- A messed-up snapshot ran out of space (the SAN admin had accidentally taken a snapshot of one of my server’s volumes)
- A RAID array became corrupted
I can’t even begin to estimate how often these things happen, so just to pick a number, let’s say any given volume has a mean time between failures of one year.
So it’s time for a quiz:
- If you put all of a SQL Server’s data files & logs on a single volume, how many failures will that server experience per year?
- Bonus question: what kinds of data loss and downtime will each of those failures have?
- If you split a SQL Server’s data files onto one volume, and log files onto another volume, how many failures will that server experience per year?
- Bonus question: what kinds of data loss and downtime will each of those failures have?
Think carefully about the answers – or read the comments to see someone else’s homework, hahaha – before you move on.
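If you’d rather check your homework with arithmetic than with the comments, here’s a minimal sketch. It assumes the made-up number above – each volume independently averages one failure per year – and that losing the log volume means no tail-of-the-log backup:

```python
# Assumption from the post: each volume averages one failure per year.
per_volume_failures_per_year = 1.0

# Layout A: data files and logs together on a single volume.
layout_a_failures = 1 * per_volume_failures_per_year  # 1 failure/year
# Every failure takes out both data and log, so each one means
# downtime AND data loss (the log is gone, so no tail-of-the-log backup).

# Layout B: data files on one volume, log files on another.
layout_b_failures = 2 * per_volume_failures_per_year  # 2 failures/year
# Half of those failures hit the log volume (downtime + data loss);
# the other half hit the data volume (downtime, but no data loss).

print(layout_a_failures, layout_b_failures)  # 1.0 2.0
```

Splitting the files doubles the number of failure events per year – it just changes what each event costs you.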
“I disagree with your volume failure rate idea.”
I hear some of you saying, “Wait – I believe failure rates are not tied to volumes. It’s not that each volume can fail – it’s that a server will have a failure rate. I believe a server will lose a volume once a year.”
OK, hotshot, let’s say that once a year (again, just picking a number), your server will lose one of its volumes. In that case, which design would give you the least data loss and downtime?
- Just 1 volume, with both your data & logs on it
- 2 volumes, 1 with data files and 1 with logs
- 10 volumes, 9 with data and 1 with logs
- 100 volumes, 98 of which are empty, then 1 with data and 1 with logs
If you’re arguing for answer #2, keep in mind that when your server has its annual volume failure, you stand a 100% chance of downtime and a 50% chance of data loss.
Whereas #4 has a 2% chance of downtime and a 1% chance of data loss. Brilliant! 1,000 empty volumes probably equals five 9s of uptime, woohoo!
Except now you stand a 100% chance of getting a bottle to the face from your storage admin. Ask for this configuration, and they’ll be happy to explain why adding more empty volumes to your server doesn’t magically protect any valuable volume.
If you still have a single point of failure in your log file volume, adding other volumes to do other things doesn’t help you. You’re just adding more single points of failure with their own failure rates.
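The per-layout numbers above can be checked with a few lines of Python. This sketch assumes the “one volume failure per server per year” model, with each volume equally likely to be the one that fails, downtime whenever the failed volume holds data or log files, and data loss whenever it holds the log:

```python
from fractions import Fraction

def annual_risk(total, critical, log):
    """One random volume fails per year, each volume equally likely.

    total    -- volumes attached to the server
    critical -- volumes holding data and/or log files (failure = downtime)
    log      -- volumes holding log files (failure = data loss,
                because without the log there's no tail-of-the-log backup)
    """
    p_downtime = Fraction(critical, total)
    p_data_loss = Fraction(log, total)
    return p_downtime, p_data_loss

# (total volumes, critical volumes, log volumes) for the four layouts
layouts = {
    "#1: 1 volume, data + logs":        (1, 1, 1),
    "#2: 2 volumes, data | logs":       (2, 2, 1),
    "#3: 10 volumes, 9 data | 1 log":   (10, 10, 1),
    "#4: 100 volumes, 98 empty":        (100, 2, 1),
}

for name, (total, critical, log) in layouts.items():
    down, loss = annual_risk(total, critical, log)
    print(f"{name}: downtime {float(down):.0%}, data loss {float(loss):.0%}")
```

It reproduces the figures in the post – #2 comes out at 100% downtime and 50% data loss, #4 at 2% and 1% – and it also makes the empty-volume trick obvious: the only numbers the padding changes are the denominators.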
On a related note, check out my 2009 post, Adding Reliability to Your Infrastructure.