Why Australian Azure SQL DBs Went Down for 8+ Hours

On August 30, Azure’s Australia East data center had a big problem, affecting customers like Bank of Queensland and Jetstar. Here’s the timeline:

  • 30 August 2023 @ 08:41 – Voltage sag occurred on utility power line
  • 30 August 2023 @ 08:43 – Five chillers failed to restart
  • 30 August 2023 @ 10:30 – Storage and SQL alerted by monitors about failure rates
  • 30 August 2023 @ 10:57 – Initial Cosmos DB impact detected via monitoring
  • 30 August 2023 @ 11:15 – Attempts to stabilize the five chillers were unsuccessful after multiple chiller restarts
  • 30 August 2023 @ 11:34 – Decision was made to shut down infrastructure in the two affected data halls
  • 30 August 2023 @ 20:29 – All but two SQL nodes recovered
  • 31 August 2023 @ 04:04 – Restoration of Cosmos DB accounts to Australia East initiated
  • 31 August 2023 @ 04:43 – Final Cosmos DB cluster recovered, restoring all traffic for accounts that were not failed over
  • 31 August 2023 @ 08:45 – All external customer accounts back online and operating from Australia

Note that at 11:34, the decision was made to shut down infrastructure without Microsoft failing your databases over elsewhere. If you were an Azure SQL DB or Cosmos DB user, and you weren’t paying for replicas in another data center, it was up to you to follow Microsoft’s disaster recovery guidance.

Controversial opinion: I actually love that and I think it’s great.

I see a lot of Azure SQL DB users make the mistake of assuming that Azure includes disaster recovery, but it does not. It’s on you, and as a result, you save money. (Same thing in AWS Aurora PostgreSQL.) I’m sure there are plenty of small business databases that don’t need disaster recovery within a day or two. Heck, even Bank of Queensland probably has some databases that fit into that category, although… probably not as many as actually went down, hahaha.
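
If you’re wondering what following that guidance actually involves, the basic building block is active geo-replication: you stand up a second logical server in another region ahead of time, on your own bill, and add a readable secondary for each database that matters. Here’s a minimal T-SQL sketch, with made-up server and database names, assuming the partner server in the other region already exists:

    -- A rough sketch, not Microsoft's exact guidance: add a geo-secondary
    -- for one database with active geo-replication. Run this while connected
    -- to the master database on the PRIMARY logical server.
    -- [dr-server-syd2] is a hypothetical server you already created in
    -- another region, say Australia Southeast.
    ALTER DATABASE [SalesDB]
        ADD SECONDARY ON SERVER [dr-server-syd2];

That secondary costs money every month whether or not you ever fail over to it, which is exactly the trade-off: you save money by skipping it, right up until you need it.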

There’s a problem with that, though: Microsoft didn’t notify affected customers about which of their databases were down, or that the customers should start their DR processes. Microsoft couldn’t notify customers because … they didn’t know who those customers were. Microsoft’s Azure status history doesn’t let you easily link to a single event, but if you expand the outage on 30 Aug, the preliminary writeup is really detailed, and explains why they were flying blind:

From a SQL perspective… Some databases may have been completely unavailable, some would have experienced intermittent connectivity issues, and some databases would have been fully available. This uneven impact profile for databases in the degraded ring, meant that it was difficult to summarize which customers were still impacted, which continued to present a challenge throughout the incident.

Boy, I have been there. When multiple databases and servers go down, one of the first things management wants to know is, “Which specific apps are down?” When you can’t answer that question, it makes management pretty nervous, and adds even more stress to the situation.

As we attempted to migrate databases out of the degraded ring, SQL did not have well tested tools on hand that were built to move databases when the source ring was in degraded health scenario. Soon this became our largest impediment to mitigating impact.

It might be tempting to point and say, “Well, Microsoft, you should have that” – and they should – but I don’t see a lot of shops with well-tested automated DR failover tools.
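
On the customer side, part of the reason is that the per-database piece is deceptively simple, so nobody budgets for the orchestration around it. As a rough sketch (active geo-replication again, with a hypothetical database name), promoting one secondary is a single statement run against the secondary server:

    -- A hedged sketch of manually promoting one geo-secondary.
    -- Connect to the master database on the SECONDARY logical server.
    ALTER DATABASE [SalesDB] FAILOVER;
    -- If the primary region is hard down and you accept possible data loss:
    -- ALTER DATABASE [SalesDB] FORCE_FAILOVER_ALLOW_DATA_LOSS;

Now multiply that by every production database you have, plus logins, firewall rules, and connection strings, all while the business is screaming, and you can see why well-tested failover automation is rare.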

I’ve long said that Azure SQL DB does a better job of database administration than not having a DBA altogether, and this is a good example. Customers who didn’t have a DBA wouldn’t have been any better off managing their own DR in a situation like this, and frankly, most customers who do have a DBA wouldn’t have been better off either. (If you smugly think you’d be fine, point to your current list of production servers & databases, and prove that every single database you have is also synced with DR. Go ahead. I’ll wait.)
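
If you do want to check, and you’re on Azure SQL DB with active geo-replication (failover groups ride on top of the same plumbing), here’s a rough starting point. Run it in the master database of each logical server; any database that comes back with NULLs in the partner columns has no geo-secondary at all:

    -- A hedged sketch: which databases on this logical server actually have
    -- a geo-replication partner?
    SELECT d.name AS database_name,
           grl.partner_server,
           grl.partner_database,
           grl.replication_state_desc
    FROM sys.databases AS d
    LEFT OUTER JOIN sys.geo_replication_links AS grl
        ON grl.database_id = d.database_id
    WHERE d.name <> N'master'
    ORDER BY d.name;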

Because every DB moved required manual mitigation via scripts, it seriously undermined our ability to move fast even once impacted DBs were identified, and DB moves were scheduled.

Elsewhere in the post, they mention that over 250,000 databases were involved in just one of these troubled rings of databases. You just can’t manually do anything with 250,000 databases, so I can only imagine how stressful it was to try to write the automation code under fire. Props to the folks working that night.

Overall, this kind of incident – and how Microsoft responded to it afterwards – is why I think that if you don’t have a DBA, you could do a lot worse than relying on Microsoft, Amazon, and Google doing that job for you instead. Platform-as-a-Service lets someone else stress out about the outage, troubleshoot it as quickly as they can, then build better processes to shorten the next outage.

7 Comments

  • D365 applications that are controlled by Microsoft also went down. And for these applications, Microsoft doesn’t provide access to the related SQL Azure instance, so the only option you had was to just wait.

  • Whenever I go to work for a new customer, I have a range of questions for that first week. One is: what is your DR plan, and when was it last tested? Too many times, on seeing the answer, I have to reply: “That’s not a DR Plan, that’s a DR Wishlist.”

  • Ashton Tate V1.0 DBA
    September 8, 2023 1:46 am

    Ouch, and we are looking at rehearsing DR soon, including some Azure. Now how do we emulate that?

  • I just want to add that even if you did have failover groups in place, they didn’t fail over. In the final PIR, Microsoft puts their hand up about this, but it’s still disappointing: yes, you were able to fail over manually, but when you go to the effort of setting up DR that someone else ultimately throws the switch on, it’s frustrating when they don’t bother.
    Microsoft did slightly better in last week’s US East outage, where customers who had failover groups in place did have their failover occur… 8 hours into the 12-hour outage.
