Training Plan for a SysAdmin to Manage Availability Groups

How hard is it for a systems administrator who’s used to running SQL Server on Windows Clusters to tackle Availability Groups? Our example system administrator knows a bit of TSQL and their way around Management Studio, but is pretty new to performance tuning.

Well, it might be harder than you think. First, let’s look at the skills needed to succeed. Then let’s tackle a sample training plan to get those skills.

Do You Have the Skills to Manage AGs? Take this Test!

sp_Blitz® Sanity Test

Incident Management Skills Test

  • Do you regularly use techniques to learn what’s running and how long it’s been waiting at any given point in time that are NOT SQL Server’s Activity Monitor or sp_who/sp_who2?  (Sorry for the bias, but Activity Monitor is full of lies, and those SQL Server 2000 tools don’t really help anymore.)
  • Do you have a documented incident response process to collect data when performance or availability suffers that at least 2 people can run?
  • Do you regularly execute a process to perform root cause analysis when performance is bad or the SQL Server is offline? (Only answer “yes” if it has successfully led to a real root cause 3 times in the last year.)

Performance Tuning Skills Test

  • Do you regularly analyze your SQL Server wait statistics and baseline them? (A “yes” means that you know what your top 10 bottlenecks mean, what “poison” waits are and what they mean if they show up, and you could quickly identify any changes in your top waits and react.)
  • Do you know the top 10 queries that run against your SQL Server by CPU, and have you tuned their indexes?

SQL Server Scripting/Configuration Skills Test

  • Do you have the scripting skills to be able to handle managing jobs across all nodes? Jobs must all run everywhere, and be smart enough to know who is the primary replica.
  • Do you have the scripting skills and security knowledge to be able to manage linked servers and logins across all replicas? (Don’t say yes if you haven’t done it before: it’s trickier than you think!)
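
To give a feel for why logins across replicas are tricky, here is a minimal sketch in Python. It assumes you have already pulled each replica's login names and SIDs (for example from sys.server_principals) into a plain dictionary; the function name, the sample logins, and the shortened SID strings are all hypothetical. SQL-authenticated logins created separately on each replica get different SIDs, so a login can exist by name on both sides and still not match after a failover.

```python
from typing import Dict, List


def find_login_drift(primary: Dict[str, str], secondary: Dict[str, str]) -> List[str]:
    """Compare {login_name: sid_hex} maps from two replicas and list the differences."""
    problems = []
    for name, sid in sorted(primary.items()):
        if name not in secondary:
            problems.append(f"{name}: missing on secondary")
        elif secondary[name] != sid:
            # Same name, different SID: database users will not map after failover.
            problems.append(f"{name}: SID mismatch")
    for name in sorted(set(secondary) - set(primary)):
        problems.append(f"{name}: exists only on secondary")
    return problems


# Hypothetical sample data standing in for the name and sid columns
# of sys.server_principals on each replica (SIDs shortened for readability).
primary_logins = {"app_user": "0x01AA", "report_user": "0x02BB", "etl_user": "0x03CC"}
secondary_logins = {"app_user": "0x01AA", "report_user": "0x09FF", "old_user": "0x04DD"}

for line in find_login_drift(primary_logins, secondary_logins):
    print(line)
```

This only covers names and SIDs. A real check would likely also look at password hashes, server roles, and permissions, which is part of why this question is harder than it sounds.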

Windows Clustering Skills Test

  • Have you successfully executed basic Windows clustering management tasks with the version of Windows you plan to go live on? These tasks are: configuring quorum, getting clean cluster validation reports, and using Failover Cluster Manager.
  • Have you practiced advanced cluster rescue scenarios at least twice within the past year? (These tasks are: forcing quorum, and generating and reading cluster logs.)
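
If quorum is still fuzzy, the basic arithmetic is worth having in your head before you practice forcing it. The sketch below, in Python, assumes the simple majority rule: the cluster stays up only while strictly more than half of the configured votes are online. The function names are hypothetical, and the sketch deliberately ignores dynamic quorum and any vote adjustments that newer versions of Windows can make, so treat it as a learning aid rather than a model of your cluster.

```python
def has_quorum(total_votes: int, online_votes: int) -> bool:
    """True when strictly more than half of the configured votes are online."""
    if total_votes <= 0 or not 0 <= online_votes <= total_votes:
        raise ValueError("votes must satisfy 0 <= online_votes <= total_votes and total_votes > 0")
    return online_votes > total_votes // 2


def failures_tolerated(total_votes: int) -> int:
    """How many votes can be lost while a simple majority is still online."""
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    return (total_votes - 1) // 2


# A 3-vote cluster survives losing one vote, but not two.
print(has_quorum(3, 2), has_quorum(3, 1))
# Adding a fourth vote does not buy any extra tolerance.
print(failures_tolerated(3), failures_tolerated(4), failures_tolerated(5))
```

The second line of output is the reason an even number of votes is usually paired with a witness: four votes tolerate no more failures than three.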

Testing and Patching Skills Test

  • Are you subscribed to and do you actively review all hotfixes that come out for your version of Windows Failover Clustering and SQL Server?
  • Do you test and apply hotfixes and selective updates to a test environment on a regular basis as part of a patching deployment for production?

Here’s the deal: to do well with Availability Groups you need to honestly say yes to every single question on this list.

Closing the Gap: A Training Plan

The right training plan for Availability Groups includes learning from others as well as hands-on experience. To make this work, you need to be devoting at minimum 12 full days per quarter to learning. That’s roughly one full day a week, and even then, your time may be very, very tight.

First Quarter: Learn the Pitfalls of AGs and Build Your Incident Response Process

If you didn’t score two “yes’s” on the sp_Blitz® sanity test, that’s priority #1. Make sure you have completely covered that section before moving on.

Next, make sure you understand why you need to learn so much, and also make sure you really need AGs! Dig in and learn the basics about Availability Groups and why they’re so tricky, and how to choose the right HA/DR solution in SQL Server. Options:

Next, get working in production and build your incident response kit. It will take work and time to get incident response and root cause analysis really working, but this is an absolute prerequisite for working with AGs.

Second Quarter: Baseline and Analyze Wait Stats and Top Queries

Focus on establishing lightweight monitoring and baselining your top waits and queries in SQL Server.
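
As a rough illustration of what baselining waits means, here is a minimal Python sketch. It assumes you have taken two snapshots of cumulative wait time per wait type (the wait_type and wait_time_ms columns of sys.dm_os_wait_stats) and loaded them into dictionaries; the function name and the sample numbers are made up. Those counters are cumulative since the last restart or manual clear, so the interesting number is the difference between two snapshots, not the raw totals.

```python
from typing import Dict, List, Tuple


def top_wait_deltas(before: Dict[str, int], after: Dict[str, int], top_n: int = 3) -> List[Tuple[str, int]]:
    """Return the top_n wait types by growth in wait time (ms) between two snapshots."""
    deltas = {}
    for wait_type, ms_after in after.items():
        delta = ms_after - before.get(wait_type, 0)
        # A negative delta suggests the counters were reset between snapshots;
        # this sketch simply skips those instead of trying to correct them.
        if delta > 0:
            deltas[wait_type] = delta
    # Largest delta first; ties are broken alphabetically for stable output.
    return sorted(deltas.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical snapshots of cumulative wait_time_ms, taken some time apart.
snapshot_before = {"CXPACKET": 1000, "PAGEIOLATCH_SH": 500, "HADR_SYNC_COMMIT": 200, "WRITELOG": 300}
snapshot_after = {"CXPACKET": 1500, "PAGEIOLATCH_SH": 520, "HADR_SYNC_COMMIT": 2200, "WRITELOG": 900}

for wait_type, delta_ms in top_wait_deltas(snapshot_before, snapshot_after):
    print(wait_type, delta_ms)
```

Store the deltas on a schedule and you have a baseline: when a wait type that normally sits at the bottom of the list jumps to the top, you can see it quickly.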

Third Quarter: Build a Prototype and Get Advanced Training

At this point, you’ve learned a lot about how to understand what SQL Server is telling you. It’s time to start understanding Availability Groups specifically.

  • Review your notes on what to look out for in AGs, and make sure you plan to use the versions of Windows and SQL Server that are really right for you.
  • Set up a prototype environment. You need the same hardware and storage that you’re going to use in production. This isn’t a temporary environment, either: to live well with AGs you need a pre-production environment!
  • Restore full-sized production databases to your AG environment and generate activity against them.
  • Make the environment fail! You need to cause failures in multiple places and validate that your incident and performance management tools really work in this brave new world.
  • Oh, and work on all those scripting skills, too.

At this point in the process, it’s time to learn from others and really make sure you’re on the right track. You can do this in multiple ways:

  • Attend a SQL Server conference that lets you attend “lessons learned” style panels– such as the SQL PASS Summit
  • Attend an advanced SQL Server DBA training class like our Senior DBA course

Fourth Quarter: Reassess Your Progress and Fill the Gaps

You’ve come a long way. But you still had a day job to do the whole time, with other tasks in it.

Step back and re-take the quiz. How are you executing on all of those tasks, and how many people can execute on them? Return to the areas where you’ve scored the weakest and build those skills up.

Yep, It Takes a Year

Availability Groups are a very cool, cutting edge tool in SQL Server. They’re expensive, but people are willing to pay the price because availability and performance really matter for them.

But if you’re not careful, the tool that you’re using to try to improve performance can slow you down, or take you offline entirely. And if you don’t have really strong incident management and performance tuning skills, you’ll have no idea if the problem is the AG or something completely unrelated.

The good news is that if you have a foundation in system administration, you know some TSQL, and you really focus your efforts, you can learn these skills in a year!

Not Sure Where to Find the Time?

Show your manager this post and talk about your answers to the questions above, and the fact that you need to fill the gap. Be clear about areas where you and your team are weak and strong, and what risks you’re concerned about if you take on more complex technology that you can’t handle. It may not always be obvious, but usually your manager wants you to succeed as much as you do!


4 Comments

  • Another great post!

  • Tangentially germane to this post… I am right in the middle of implementing 2 AGs on a new 3-node cluster (UCSs with 2x 6TB fusion io cards and .75 TB of RAM), so I have done a crash course with all of your information on this site. So THANK YOU, you have saved my bacon.

    I did have a couple of questions, so I attended Office Hours today, but I don’t think my questions were going through… I was asking about implementing 2 AGs on the 3 nodes and how to separate out the primaries, read-only secondaries, backups, and SSRS on the cluster (barf, but I’m just the DBA). My questions in the chat stayed gray. I could see the others’ questions turn green, I think as they were answered. It’s cool if you guys saw them but did not answer; I’m just wondering if you saw them at all?

