I’m proud to say that I’ve only lost one job in my life. I was a grocery store bagger when the store came under pressure to increase their percentage of minority workers. They laid off all of the Caucasian baggers, but I’m not bitter. I actually loved bagging groceries, and if I hadn’t been laid off, I’d still be the guy asking you “Paper or plastic?”
I probably didn’t deserve to lose that job, but I’ve deserved to lose others. I’ve just been lucky. Through my career, I’ve seen a lot of database administrators go through some dicey situations that cost them their jobs. Maybe not right away – but once these situations happen, it’s only a matter of time until management finds a legally defensible reason to swing the axe. These situations often involve just nine letters.
The First Three Letters: RPO
Recovery Point Objective (RPO) measures how much data you’d lose if your server went down and you brought up the secondary server. It’s measured in time, not megabytes. The less time you have, the more frequently you have to get the data to a secondary location.
In theory, if you’re doing transaction log backups every 5 minutes to a network share, so you’ve got a 5-minute RPO.
In theory, “in theory” and “in practice” are the same. In practice, they are different.
In practice, you’re unlucky, and your transaction log backups will fail before the engine itself does. I know this because it’s happened to me and my clients. In one case, the volume for backups filled up and the transaction log backups stopped working. Since the log backups didn’t run, the transaction logs didn’t clear, and they continued to grow. The Gremlin of Disaster arrived, sowing corruption in his wake. By the time the dust cleared, we’d lost over an hour of data – permanently.
If you’re backing up to a local drive and the RAID controller fails, check out this helpful article. You’ll be needing it, because with a failed RAID controller, you can’t access any of the data on the local drives. Before you try pulling the hard drives out of the failed server and putting them into another server, check out this article. After all, you’re gambling – gambling that the RAID controller firmware matches between servers and that it’ll be able to read and rebuild the newly inserted drives. Guess wrong, and it will format the drives, thereby erasing your backups.
Savvy readers will detect a sense of acerbic humor in the post so far – even more biting than my usual pieces. Brace yourself, because it’s going to get worse.
The Next Three Letters: RTO
Recovery Time Objective is how much time you’ve got to get your server’s act together. Some companies measure it from the time you learn about the problem, some measure when the service is first unavailable.
The clock’s starting point matters a lot. When things go wrong, SQL Server Agent might just email the failsafe operator – if you’ve gone to the trouble of setting one up and configuring alerting. Even then, you probably won’t notice the email because you’ll be drunk at the pub having a nutritious dinner with your lovely family. An hour later, when you realize that your cell phone is set on vibrate, you’ll panic, run to the car, fire up your laptop, and try connecting over your phone’s 1-bar internet connection. Should we start the clock when the server started failing, when the email was sent to you, when you saw the email, or when you finally connected to the VPN? If the business wants to start the clock when the server stops, you have to account for the meatbag reaction time, and with most of us, there’s a pretty long delay. (If you’re the kind of DBA who sets up rules to move automated alerts into a separate email folder, you’re off to a bad start.)
If our idea of disaster protection is restoring the databases onto another server, we might assume that the RTO just needs to cover the length of time to do a full restore. Not so fast, Speedy McPinto – here’s a few things you’ll need to do to get the server online:
- Provision enough space to do the restore – do you know how much space you need on each drive?
- Make sure it’s on the right version/build of SQL Server – sometimes we have to install service packs, cumulative updates, or hotfixes to get production to work right. There’s no guarantee our standby box has those same binaries.
- Get the sp_configure settings right – if production had a certain maxdop, for example, it might fail terribly on the new hardware.
- Get security working correctly – you did script out the logins and passwords, didn’t you?
- Configure TempDB correctly – I’ve seen people restore production databases onto dev environments where TempDB was just configured with a few gigabytes of storage, and it instantly started growing when handling production load. TempDB ran out of space, and the server went Pete Tong.
- Restore the database – and hope to hell it restores correctly the very first time.
The less time you have, the more tasks you have to perform ahead of time, and the more you have to document, test, and rehearse the steps you’re going to perform at the moment of disaster.
The Most Important Three Letters: CYA
Right now, all of your users believe your RPO and RTO is zero. They believe you’ve got everything under control, and that when disaster strikes, you’re going to instantly bring up another server with no data loss whatsoever. We both know the truth, though – you spend your time at work checking blogs, not backups, so we’ve got some work to do.
Start with a statement of fact that avoids blame: “Right now, if we lost the production system at 3:45 PM, it would take me at least four hours to bring another copy online, and that copy would have lost all data between 3PM and 3:45PM. Is that acceptable?”
They’re going to say, “ZOMG, NO WAY!!!1! IT HAS TO BE BETTER THAN THAT!”
You say, “I thought you’d say that, because I’m uncomfortable with that too.” (See what we did there? We just put you on their side.) “Here are our options to improve our RPO and RTO.”
To give them options, check out my free video on HA/DR Options: Clustering, Log Shipping, and Replication. I cover the pros and cons of all your high availability and disaster recovery options including database mirroring, SAN replication, virtualization, and more. I won’t show you how to implement each of those (hey, I’ve only got 30 minutes) but I’ll give you enough information to make a recommendation to your business users.


Gail December 22, 2011 | 8:36 am
And for anyone who thinks (or insists) that RAID controllers don’t fail in SANs and, even if they did they’re teamed so there’s redundancy.. Well, I have a story of a 600GB database completely lost with no recovery that you might like to hear
Brent Ozar December 22, 2011 | 8:38 am
Yep, I agree wholeheartedly. I’ve lost entire racks of storage. Good times…
Robert L Davis December 22, 2011 | 8:51 am
Gail, that reminds me of a major disaster I dealt with. We had a drive fail in an array and when they opened it up to replace the failed drive, they discovered the hard way that Dell had forgotten to screw the drives in place and all drives came tumbling out.
Gail December 22, 2011 | 9:22 am
Ow, ow, ow!
jonmcrawford December 22, 2011 | 9:28 am
…I’m sorry, I was reading your post, but got completely distracted by the aptly captioned Pinto, reminding me of driving that around in high school, and then a Mercury Bobcat (knock-off Pinto) once the first died. What were you saying?
Claire December 22, 2011 | 9:51 am
When I first read the title, I thought you meant written letters, and got very excited. But acronyms are cool, too!
Tough luck on the grocery bagging job loss
Wes MacDonald December 22, 2011 | 11:13 am
Great advice! I am definitely attending your webinar.
David Eaton December 23, 2011 | 10:00 am
Well I have been the lucky one. The most data I have ever lost was a 10 minute window. I always lobbied hard and successfully for high availability resources and using RPO and RTO as my basis, never had to CYA.
The key is picking the right hardware and setting it up correctly. If you are using a Fibre Channel connection to your SAN do you have a redundent FC switch?
Does you Array/SAN have Dual Controllers that are hot swapable?
Do you have extra NICS in the server going to different swiches?
Clustering has always been the most reliable solution if it is setup correctly. I have had to go into shops where they thought Mirroring would solve all thier issues, but in fact, it complicated the situation because the Users and Security settings were not synced between the servers.
It is all about a firm discipline on setup and configuration, and implenting the proper alerting system.
By the way, the 10 minutes I lost was not due to hardware or server issues, it was a careless developer who deleted the entire customer table by accident, 10 seconds after my last log backup.
Why did I lose 10 minutes? Well that is becuase I was in my car on the way to the datacenter to build another production cluster. And it took them that long to figure it out and call me.
The tool to make life so much easier here is one that does object level restores. It found the backup and all the rans action logs and fixed the Customer table in 4 minutes. I am not sure I can propmte tools here so if I hear from Brent etc that it is accpetable, I will post it.
Build the infrastructure right, and you can recover in a very short time.
David Eaton December 23, 2011 | 10:02 am
Sorry for the typos……
Calin December 26, 2011 | 11:08 am
David, there are many levels of “doing right” and “firm discipline”. We just had major issues with NetApp and Brocade. Was it NetApp’s fault? Was it Brocade? Well, from our customers perspective, it was us. Lesson learned
David Eaton December 28, 2011 | 12:29 pm
Was the Brocade Fabric load balanced and set for failover?
Was the NetApp setup for High Availability?
I am not saying hardware failures do not exist, but the infrastructure can be built to minimize that occurance.
There are, as you say, many levels that need to be considered down to the type of RAM in the Server and the Multipath client used for the HBA connection to the Fiber Fabric.
The real issue is the difference of management weighing the costs of doing it right versus the potential costs of taking shortcuts. I have to make the best of what I am given to work with like everyone else. Sometimes I get what I think is needed and sometimes not. But I try to follow the same approach as best I can in all cases.
Anup December 29, 2011 | 1:33 pm
Great post Brent! As you had said in one of your early webcasts,its hard time when the CTO taps on your back when you are dealing with a downtime and reminds about RPO and RTO:)
Marc Jellinek January 1, 2012 | 10:27 am
I’ve always told my customers that you have three options:
- you can have it done fast
- you can have it done cheap
- you can have it done right
… now pick two.
Invariably, they pick fast and cheap. So I get a physical signature on a document where they accept the risks associated with their choices, and I don’t start work without that signature.
Amazing how often they now want to discuss “right”.
I’ll often ask embarrassing questions like:
- When is the last time you tested your restore process to a bare-iron machine?
- How many configurations (OS, Database and associated service packs) do you have in production, test and development?
- How do you measure RPO and RTO?
- What is your target for RPI and RTO?
- What measures have you put in place to achieve your stated goals?
- Have you historically met your stated goals?
- What are the costs associated with not meeting your stated goals (lost transactions, lost business, man-hour costs, good will costs)?
- Is your business prepared to absorb those costs?
Usually, I’m dealing with someone non-technical. So I ask them what happened the last time they didn’t have access to their primary mode of transportation (cars for suburbanites, public transportation for city-dwellers). They’ll tell me how they rented a car, were supplied with a “loaner” car, or instead of taking the bus, they took the subway.
I bring up the point that they had redundancy options, with associated levels of cost and convenience.
I’ll then ask them what redundancy options they have in place for their business or primary mode of making money. .
Aside from whatever local HA/DR measures have been put in place, I’m looking options for offsite replication to “cloud” databases like Azure. Has anyone else explored this as an option? I know there are some limitations (database size, costs, etc), but I haven’t had a chance to dig deep yet. Any guidance?
Brent Ozar January 1, 2012 | 10:31 am
Marc – typically pure cloud databases like SQL Azure have too many limitations to function as HA/DR to onsite SQL Servers. You can’t do transaction log backups or database mirroring to SQL Azure. You can use these techniques with hybrid cloud solutions like Amazon EC2. The pros & cons of these are beyond what we can cover in comments, but you can start your research with this post by Jeremiah:
http://www.brentozar.com/archive/2011/09/sql-server-ec/
Pingback: Link Round Up – January 2012 Edition « SQL Feather and Quill