On August 30, Azure’s Australia East data center suffered a major outage, affecting customers like Bank of Queensland and Jetstar. Here’s the timeline:
- 30 August 2023 @ 08:41 – Voltage sag occurred on utility power line
- 30 August 2023 @ 08:43 – Five chillers failed to restart
- 30 August 2023 @ 10:30 – Storage and SQL alerted by monitors about failure rates
- 30 August 2023 @ 10:57 – Initial Cosmos DB impact detected via monitoring
- 30 August 2023 @ 11:15 – Attempts to stabilize the five chillers were unsuccessful after multiple chiller restarts
- 30 August 2023 @ 11:34 – Decision was made to shut down infrastructure in the two affected data halls
- 30 August 2023 @ 20:29 – All but two SQL nodes recovered
- 31 August 2023 @ 04:04 – Restoration of Cosmos DB accounts to Australia East initiated
- 31 August 2023 @ 04:43 – Final Cosmos DB cluster recovered, restoring all traffic for accounts that were not failed over
- 31 August 2023 @ 08:45 – All external customer accounts back online and operating from Australia East
Note that at 11:34, the decision was made to shut down infrastructure without Microsoft failing your databases over elsewhere. If you were an Azure SQL DB or Cosmos DB user and you weren’t paying for replicas in another data center, it was up to you to follow Microsoft’s disaster recovery guidance.
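For Azure SQL DB, that guidance boils down to active geo-replication or failover groups: if you’ve paid for a secondary in another region, *you* initiate the failover from the secondary’s side. A minimal sketch, assuming a hypothetical database name and that you’ve already provisioned a geo-secondary:

```sql
-- Connect to the master database on the SECONDARY server.
-- [MyAppDb] is a hypothetical database name for illustration.

-- Planned failover: waits for full synchronization, no data loss.
ALTER DATABASE [MyAppDb] FAILOVER;

-- If the primary region is down (as in this outage), a forced failover,
-- which can lose transactions not yet replicated to the secondary:
ALTER DATABASE [MyAppDb] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```

The catch, of course, is that none of this helps if you never set up the secondary in the first place, which is exactly the point of the next paragraph.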
Controversial opinion: I actually love that and I think it’s great.
I see a lot of Azure SQL DB users make the mistake of assuming that Azure includes disaster recovery, but it does not. It’s on you, and as a result, you save money. (Same thing in AWS Aurora PostgreSQL.) I’m sure there are plenty of small business databases that could tolerate being down for a day or two. Heck, even Bank of Queensland probably has some databases that fit into that category, although… probably not as many as actually went down, hahaha.
There’s a problem with that, though: Microsoft didn’t notify affected customers about which of their databases were down, or that the customers should start their DR processes. Microsoft couldn’t notify customers because … they didn’t know who those customers were. Microsoft’s Azure status history doesn’t let you easily link to a single event, but if you expand the outage on 30 Aug, the preliminary writeup is really detailed, and explains why they were flying blind:
> From a SQL perspective… Some databases may have been completely unavailable, some would have experienced intermittent connectivity issues, and some databases would have been fully available. This uneven impact profile for databases in the degraded ring, meant that it was difficult to summarize which customers were still impacted, which continued to present a challenge throughout the incident.
Boy, I have been there. When multiple databases and servers go down, one of the first things management wants to know is, “Which specific apps are down?” When you can’t answer that question, it makes management pretty nervous, and adds even more stress to the situation.
> As we attempted to migrate databases out of the degraded ring, SQL did not have well tested tools on hand that were built to move databases when the source ring was in degraded health scenario. Soon this became our largest impediment to mitigating impact.
It might be tempting to point and say, “Well, Microsoft, you should have that” – and they should – but I don’t see a lot of shops with well-tested automated DR failover tools.
I’ve long said that Azure SQL DB does a better job of database administration than having no DBA at all, and this is a good example. Customers who didn’t have a DBA wouldn’t have been any better off managing their own DR in a situation like this, and frankly, most customers who do have a DBA wouldn’t have been better off either. (If you smugly think you’d be fine, point to your current list of production servers & databases, and prove that every single database you have is also synced with DR. Go ahead. I’ll wait.)
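If you want to take that challenge seriously in Azure SQL DB, a starting point is checking which databases have no geo-replication link at all. A rough sketch, run in the master database of each logical server (the join on `database_id` is my assumption about how you’d match these views up; verify against your own environment):

```sql
-- Run in the master database of each Azure SQL logical server.
-- Lists user databases that have no geo-replication partner anywhere.
SELECT d.name AS database_without_dr
FROM sys.databases AS d
LEFT JOIN sys.geo_replication_links AS g
       ON g.database_id = d.database_id
WHERE d.name <> 'master'        -- skip the system database
  AND g.database_id IS NULL;    -- no geo-secondary configured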
> Because every DB moved required manual mitigation via scripts, it seriously undermined our ability to move fast even once impacted DBs were identified, and DB moves were scheduled.
Elsewhere in the post, they mention that over 250,000 databases were involved in just one of these troubled rings of databases. You just can’t manually do anything with 250,000 databases, so I can only imagine how stressful it was to try to write the automation code under fire. Props to the folks working that night.
Overall, this kind of incident – and how Microsoft responded to it afterwards – is why I think that if you don’t have a DBA, you could do a lot worse than relying on Microsoft, Amazon, and Google doing that job for you instead. Platform-as-a-Service lets someone else stress out about the outage, troubleshoot it as quickly as they can, then build better processes to shorten the next outage.