If you’re using SQL Server 2012′s hot new AlwaysOn Availability Groups feature, your databases will go offline when your network connection does – even if you’re using asynchronous replication.
This is not a bug. This is working as designed – and it’s important to understand the underlying concepts.
AlwaysOn Availability Group functionality relies on Windows Failover Clustering technology to know when things are going well, or when the poop has hit the fan. A core concept of failover clustering is quorum – the voting mechanism that lets each individual node understand whether it’s online or isolated. Windows Server 2008 has a variety of quorum methods, but in the vast majority of configurations, each server needs to be able to see the network in order to reserve its IP address, network name, and see other nodes in the cluster.
Take the following scenario:
- SQL2012PROD1 – primary active node. All read/write connections are going here.
- SQL2012PROD2 – secondary node with asynchronous replication. I could let users query this server, or just let it run in standby – it doesn’t matter for this scenario.
If I disconnect the network connection for SQL2012PROD1 – even for a brief moment – all of the databases in my availability groups roll back all open transactions and then go offline. The informational messages in the SQL Server event log are shown at right for humor purposes.
The SQL Server itself is still up – but the databases in the availability group aren’t because they have a dependency on the availability group’s listener and IP address. Since those aren’t available without a valid network connection, the databases are taken offline.
This is a dramatic departure from database mirroring or replication on a standalone (not clustered) database server. Both of those technologies leave the primary SQL Server’s databases up and running when the network drops.
Bottom line – when deploying AlwaysOn Availability Groups, make sure you understand the risks of everything presented in the cluster validation wizard. Some of the alerts (like some storage alerts) can be skipped for shared-nothing AlwaysOn Availability Group deployments, but others (like the network redundancy alert) definitely can’t. Just a general tip: mission-critical SQL Servers should be connected to two separate network switches, as should all other mission-critical servers. That way when one network switch fails, the mission-critical servers can still all talk to each other and nothing will go down.


kevin June 26, 2012 | 9:03 am
Hi Brent – I need your help. Want to run sql queries from my desktop PC (SQL 2008 R2) via ServerA (SQL 2008 R2) on ServerB (SQL 2008 R2). Is this possible, if it is how dio you do it ?
Brent Ozar June 26, 2012 | 9:13 am
Hi, Kevin. That question is completely valid, but it’s totally unrelated to this post. It sounds like you’re struggling to find the right place to get answers. There’s a saying – if you give a man a fish, he eats for a day, but if you teach him how to catch fish, he’ll never go hungry again. Rather than giving you the fish, I’d rather teach you how to go fishing. Try posting this question on http://StackOverflow.com or http://www.SQLServerCentral.com. Those are vibrantly active question-and-answer sites that can get you help from lots of qualified professionals quickly, for free. Try that, and if you don’t get a good answer within a day or two, let me know and we can help refine your question.
Fernando Jacinto Alvarez June 26, 2012 | 5:58 pm
Try configure a linked server between serverA and B
Richard M June 26, 2012 | 11:03 am
Brent, so with a network blip your availability groups go down. If network is reestablished, do the Availability Groups resync or do you manually have to do it from scratch? This of course poses an issue with multi-terrabyte DB’s…..
Brent Ozar June 26, 2012 | 11:43 am
Richard – thankfully, when the network is reestablished, the Availability Groups resync automatically. Whew!
Richard M June 26, 2012 | 11:57 am
Whew, really….. we’re starting to see if we can setup a small lab here with a couple of VM servers to set this up, and do these type of tests….
Thanks!
Michael Irwin June 26, 2012 | 11:40 am
That’s a very big catch, Brent. Something further: assuming that you have two physically separated sites for an Availability Group. Then you’re relying on (in the simplest case) two NIC cards, two sets of cabling, two routers or switches, and so on … until you hit the outside world! Then what?
For this, you should also have two ISPs (Comcast and Verizon, for example). I’ve worked for companies where the ISP connection went down and there had to be physical attention to fix it (someone dug it up!). Needless to say, that didn’t happen overnight!.
Brent Ozar June 26, 2012 | 11:45 am
Michael – that wouldn’t actually be a problem. In that case, as long as your primary server can still contact things like its own DNS server and gateway, you’d be fine. Typically in multi-site environments you have a DNS server on each side, so a temporary blip wouldn’t force the cluster offline as long as you configured quorum correctly.
Michael Irwin June 26, 2012 | 12:11 pm
Or, at least, only one side would go down – right? So then the HA functionality would be working for the customers. Thanks. Like I said, great catch! Thanks!
Anup Warrier June 26, 2012 | 5:26 pm
This is an excellent point Brent,thanks for sharing !
Alex June 27, 2012 | 6:13 am
Hi Brent,
I am a huge fan of MS SQL Server. I work with SQL Server and Oracle, and I think Oracle database is more complex to setup and to manage.
In spite of this difference, Oracle RAC is a unique feature that MS still does not has similar and I do not think allways on is the answer.
Do you think MS thinks about something similar to RAC?
Brent Ozar June 27, 2012 | 10:36 am
Alex – that’s an interesting question. What’s the business problem that you need to solve? Rather than comparing features, let’s talk about the business use case that you’re faced with.
Alex June 27, 2012 | 11:20 am
Reduncancy is the key feature of RAC. Some people think that RAC can enhance performance, but in some cases, it slows down data access. It should be used with care.
Business continuity can be guaranteed with RAC because, if one of the nodes crashes, other node(s) can still handle connections. It does not dependes on one primary node.
In spite of that, with cloud solutions and database in cloud, I do not see much more use of this feature.
Do you think cloud solutions could kill MS clusters and Oracle RAC use?
Si June 28, 2012 | 8:47 am
Alex – I happen to think one of the the important features of RAC is the ability to scale out. From a commercial perspective, this is important on Oracle because Oracle do not allow you to turn off cores on an x86 server using BIOS settings. Thus, the only real way to “pay as you grow” (without purchasing Oracle’s database appliance) is to use RAC and add nodes as you grow.
With SQL Server, this hasn’t been so much of an issue historically – you scale up with SQL, so you purchase the right size server. The issue is now that CPUs have many cores and SQL 2012 is licensed by core. Microsoft do NOT give you flexibility to turn off cores in a CPU (like Oracle) so unfortunately right now, you have to make sure you buy the correct size server. If your workload grows, then you may well need to throw the old server in the bin and buy a new (bigger) one. There is no “add another node” option in SQL unless you can distribute individual databases across different SQL instances.
Adam October 22, 2012 | 8:29 pm
So do you recommend teaming the NICs and plugging them into two separate switches?