In our last post in this series, I talked about why you should be using Ola Hallengren’s maintenance script for backups and sp_DatabaseRestore for faster restores. These two tools have implications for how you build, back up, recover, and fail over your SQL Server.
There’s a saying in the cloud: treat your servers like cattle, not like pets.
What this means for you, in graduating level of difficulty through your upcoming career:
1. Server setup isn’t done until you’ve tested and timed a point-in-time rebuild & restore. When you’re done setting up the full and log backup jobs on the new server, but before you’ve gone live, restore your real production databases. Run the full and log backup jobs, producing a few sets of backup files in their soon-to-be-normal location.
Now here comes the fun part: shut the server down, and start the clock. Your mission is to bring all of the databases online, with their databases up to a specific point in time – say, 5 minutes ago. Do it, and time how long it takes you. No, you’re not allowed to turn that original server back on – he’s broken. (Say for example that he’s having problems with a Windows patch, and the business is tired of waiting for the sysadmins to figure it out – they need you to start Plan B now.)
You’ll learn a lot as you step through this. It feels overwhelming at first, because you’ve probably been building your servers manually, stepping through the SQL Server install GUI manually, setting up things like Cost Threshold for Parallelism, your favorite trace flags, and SQL Agent jobs. That was fine when you weren’t a Database Administrator – but now that you’re moving towards becoming a full time DBA, you need to think about a better long term plan. That means…
2. Servers should be built from well-documented checklists. As you go through phase 1, you’ll realize that you really need to be writing this stuff down as you go. Documentation is a gift to your future self – not just in the sense that it makes your job easier, but it lets you pass work off to others when you need help. During an emergency, when you’ve lost 19 SQL Servers at once (hoo boy, August 2007), you’re able to hand different checklists off to different coworkers and say, “You, follow these checklists and tell me when you’re done.”
This affects your daily database management, too: I’ve seen DBAs apply a trace flag or sp_configure setting during an emergency in order to fix something, but then not document what they just did. When that SQL Server goes down hard, and they build a new one, they simply don’t remember every setting they made to the old server along the way. When the new server comes up online, and has the same performance issues they had a long time ago that required the unique setting, they have to go through the same old troubleshooting process. Bottom line: don’t make undocumented changes to servers – leave a bread crumb trail behind in your server’s documentation or build checklist instead.
I’ve heard folks say, “It’s no big deal: I can just copy the settings from the old server to the new one.” Not when the main server goes down hard, you can’t, nor when you’re dealing with a disaster recovery situation because your primary data center or availability zone is down.
3. Long term, servers should be built and user databases restored with automation. As you start to manage dozens of database servers, checklists don’t scale. You need a way to turn manual steps into automated actions so that you’re not sitting around waiting for the next progress bar to move to the end. However, this is where my advice starts to get a little fuzzy: the SQL Server product has sprawled out to a huge coverage area over the last decade (SSAS, SSIS, SSRS, clustering, Availability Groups, Linux, and now Kubernetes) and there isn’t a single desired-state installation & configuration tool that covers the whole surface area.
If you want to get to automated builds & restores, the closest thing right now is DBAtools & PowerShell, but just be aware that while we call this “automation,” it’s more about automation tooling than just something you download and click Go. Think of it as a box of tools, and you still have to build out an automation solution that works well for you. Thankfully, the DBAtools community is welcoming & hard-working, and they’re constantly making the tools better.
4. Even longer term, if it’s automated, it’s testable. For decades, serious database administrators have automated their own fire drill testing. They’ve built standalone SQL Servers that restore their production backups, run CHECKDB against them, and then send reports of success or failure to the DBA. This gives you a canary-in-the-coal-mine approach to detecting backup problems early.
This post is a little shock-and-awe.
But it’s also a fork in the road: do you want to specialize in being a production database administrator (who makes sure the SQL Server is online and the database is accessible), or do you want to specialize in database development & performance tuning, making sure the queries run quickly?
These are really two different careers: people who tell you they do both are lying, because they suck at one of them (or maybe even both of ’em.)
If you want to specialize in production database administration, your future learning path focuses on things like PowerShell, scripting builds, monitoring for server configuration changes, applying patches in a controlled and recoverable manner, and engineering reliability into your infrastructure. If you think this is the route you want to take, pick up the book Database Reliability Engineering to get a glimpse of what your next ten years will look like. I don’t think there’s ever been a more exciting time to take that leap.
If you want to specialize in performance tuning, we’ll be talking more about that over the coming weeks as well – stay tuned.
If you’re not sure yet, that’s okay. You can always come back to this fork again later. For several years, I chose to focus on production database administration because I really liked building servers and troubleshooting ’em. Then back around 2015, I came to the fork again and decided to give up on that line of work and focus on performance tuning instead. I love the feeling of making a user’s query faster, or even better, making an entire workload go faster. (Plus I was sick and tired of getting called after hours for emergencies!)
In the next post, we’ll talk about the next job duty for production DBAs: checking for corruption.