You’re a database administrator, Windows admin, or developer. You want to build a Microsoft SQL Server environment that’s highly available, and you’ve chosen to use Always On Availability Groups.
In this white paper we built with Google, we’ll show you:
- How to build your first Availability Group in Google Compute Engine
- How to test your work with four failure simulations
- How to tell whether your databases will work well in GCE
In theory, setting up an AG is pretty easy. Slap a couple of Windows VMs together, throw the clustering feature in, next your way through a couple of wizards, and you’re done.
In practice, building a reliable AG – especially in the cloud, where you’re dealing with software-defined networks spanning multiple subnets – is another matter entirely. That’s why this isn’t a little 5-page white paper – this thing is a 97-page monster, and we worked closely with the Google Compute Engine team as we wrote it.
It’s in our free First Responder Kit, or get it from Google and let us know what you think.
Maybe you should give up on convincing people to call it “Full Recovery Model.” Even the Microsoft wizards call it “Full Recovery Mode.”
That’s a little nit picky isn’t it? The database properties page under Options does refer to it as ‘Recovery model’.
@Brian: if he would call it “mode” would make me think that Brent is not familiar with the correct terms and I would be very disappointed…
@Brent: you are doing a great job!
Hahaha, thanks Gerald.
My comment was mostly intended to point out how far Brent has to go on his campaign, rather than seriously proposing that he give up.
Regardless, I’ll respond to Gerald: I was in no way suggesting Brent start calling it, “mode.” I was merely pointing out that eliminating “mode” is going to be pretty tough, when even Microsoft staffers working on SQL get it wrong.
Anyhow, Brent doesn’t merely use the correct terms. He also celebrates every time someone else uses the correct terms.
Conspiracy theory: The secret reason Jes left is that Brent told her she was no longer welcome after reading https://www.brentozar.com/archive/2012/08/whos-taking-your-backups-no-really-whos-taking-your-backups/ .
Brian – nah, you haven’t been around me long enough. I called it mode for years, and I only switched when Tara joined. She’s the one who’s been conditioning me, hahaha.
Yeah it’s me with the “model” vs “mode” thing. It just drives me crazy. I’m weird like that. Carry on.
Also, for the record: Jes was awesome and totally eligible for rehire. 😀
Why the recent uptick in the use of the “jazz-hands” Brent Ozar avatar on this site?
Because I like it. 😀
Love the jazz hands!
wow. Great post, BOU!
This is a well written, easy to follow ‘White Paper’. Nicely done Tara. Kudos to the other guy.
Is there anywhere to read about the “complex networking issues” that necessitate the need for the VMs to be on separate subnets? Also, am I correct in assuming then that this extends to a new subnet for every server/instance I would like to add in to my Availability Group?
Tom – I haven’t seen any public documentation on it, no. This is the goofy thing with the cloud – often it’s changing so fast that it can be hard to find good technical documentation, especially if you want internals on how someone’s proprietary cloud works. I’m with you in that I find it really interesting, but your best bet for stuff like that is to go to the vendor’s conferences, and talk to the staff directly there. They can often give you some good inside information. (I had a blast at Google Cloud Next last month doing that.)
Based on my understanding, GCE network is acting as layer 3 switch. It cannot see the IP address of the nodes, so you had to use router to bridge the packets from outside of VM. That’s why you had to put another nodes to another subnet. Say if you had three nodes you will have three routers for three nodes. And three listener routers (not AG listener) for three nodes. Listener Router will connect GCE network and the VM. I could be wrong. I will do more research on this.
To update my findings, I believe the reasons that different nodes had to stay in different subnets were that GCE network is special. Today I’ve tested to put two nodes, listener, cluster IP all inside one subnet. I could setup windows cluster, setup AlwaysOn , create Listener. However, after creating listener, though I could failover to another replica, the listener cannot be recognized by name. That’s why Google document said we had to put nodes in different subnets.
I will blog this in detail later.
OK, I too was having issues and it’s partly down to obsolete documentation and GCP not supporting gratuitous ARPs, which you need to resolve by routing virtual IPs yourself — the problem you have with the above documentation is that GCP now auto-routes subnets, which voids the documented process — as the stated 16-bit subnet covers off any host-specific routes you wish to create.
The first solution to this I found was a bit hacky (to say the least), in that you create a 24-bit subnet in GCP, but then NETSH the SQL instances with 16-bit subnet masks; then you create the SQL listener IP GCP routes. However, as that breaks common networking logic, the more sanitised way of working this out is to use secondary routes designated to your listener IPs upon creating the subnets.
I’ve found you only need to create routes for the listener IPs, as the WSFC can failover using the virtual IPs despite not having to designate GCP routes (may be due to the the listener IPs “sharing” port 1433).
I’ve only tested on a single-region 2-SQL node environment, so my next tests will incorporate a third-node in a different region (apparently upon investigating the above, I’ve discovered Azure has issues with subnets across regions, so different vendor approaches to cloud routing seems to be causing issues at different points in the configuration).
Rob – yeah, this is part of the fun with working in the cloud: things just change so fast.