The postmortem is FANTASTIC: open, honest (at least it reads that way), and goes into enough technical detail to satisfy a wide variety of readers from managers to technical implementers.
This section explains a lot about their HA/DR strategy:
Why didn’t VSTS services fail over to another region? We never want to lose any customer data. A key part of our data protection strategy is to store data in two regions using Azure SQL DB Point-in-time Restore (PITR) backups and Azure Geo-redundant Storage (GRS). This enables us to replicate data within the same geography while respecting data sovereignty. Only Azure Storage can decide to fail over GRS storage accounts. If Azure Storage had failed over during this outage and there was data loss, we would still have waited on recovery to avoid data loss.
To rephrase, in the event of losing a region, the plan was to restore from backups. That’s absolutely fair, and it’s probably the same disaster recovery plan your company has, dear reader. Don’t get all high-and-mighty on me now – I like that plan just fine for disasters, and it’s the same thing we designed for our Faux PaaS project.
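A restore-from-backups plan has a real, calculable worst case for data loss. Here’s a back-of-the-envelope sketch; the function name and the specific numbers (5-minute log backups, 15-minute geo-copy lag) are illustrative assumptions, not VSTS’s actual figures:

```python
# Back-of-the-envelope worst-case RPO for a restore-from-backups DR plan.
# The numbers below are illustrative assumptions, not real VSTS figures.

def worst_case_rpo_minutes(log_backup_interval_min, geo_copy_lag_min):
    """Data written just after the last geo-replicated log backup is lost:
    you can lose up to one full backup interval, plus however long it
    takes that backup to land in the paired region."""
    return log_backup_interval_min + geo_copy_lag_min

# e.g. log backups every 5 minutes, ~15 minutes for the copy to replicate
print(worst_case_rpo_minutes(5, 15))  # → 20 minutes of potential data loss
```

Run that math with your own backup schedule, and make sure the business has seen the answer.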
But I want to draw your attention to what their plan didn’t include: synchronous Availability Groups across data centers.
Cross-data-center synchronous AGs work great in theory, but usually fall down in practice. Your applications just don’t want to wait until a write is committed across two different data centers. I’ll let Microsoft explain why:
However, the reality of cross-region synchronous replication is messy. For example, the region paired with South Central US is US North Central. Even at the speed of light, it takes time for the data to reach the other data center and for the original data center to receive the response. The round-trip latency is added to every write. This adds approximately 70ms for each round trip between South Central US and US North Central. For some of our key services, that’s too long. Machines slow down and networks have problems for any number of reasons. Since every write only succeeds when two different sets of services in two different regions can successfully commit the data and respond, there is twice the opportunity for slowdowns and failures. As a result, either availability suffers (halted while waiting for the secondary write to commit) or the system must fall back to asynchronous replication.
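To see how brutal that 70ms round trip is, do the arithmetic on a single serial writer: every synchronous commit waits for the remote ack, so per-connection throughput is capped by local commit time plus the cross-region round trip. A quick sketch (the 1ms local commit time is an assumption; the 70ms figure is from the postmortem):

```python
# With synchronous commit, a single serial writer can't start its next
# write until the remote replica acknowledges the current one, so its
# throughput ceiling is 1000 ms / (local commit + round trip).
# The 1 ms local commit time is an assumption; 70 ms RTT is from the post.

def max_serial_writes_per_sec(local_commit_ms, round_trip_ms):
    return 1000.0 / (local_commit_ms + round_trip_ms)

print(round(max_serial_writes_per_sec(1.0, 0.0)))   # local-only: ~1000/sec
print(round(max_serial_writes_per_sec(1.0, 70.0)))  # cross-region sync: ~14/sec
```

One connection drops from roughly a thousand writes per second to about fourteen. Concurrency hides some of that, but no amount of hardware makes the speed of light faster.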
That’s Microsoft talking.
Microsoft can’t get sync AGs to work for them in a way that makes them happy.
Before you design a DR plan aiming for zero data loss using synchronous AG replication, make sure you build a solid proof of concept, and load test it with production-quality workloads. Make sure your end users will accept the latency slowdowns – or if they won’t, make sure they sign off on the RPO and RTO involved with a single-data-center solution. The time to learn these numbers isn’t when the hurricane is approaching, or when you’re writing a postmortem about your own apps.
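The proof-of-concept part is worth automating: capture per-write commit latencies under a production-like load test, then fail the test if the 95th percentile blows past what the application owners signed off on. A minimal sketch, where the sample latencies and the budgets are hypothetical:

```python
# Sketch of a sign-off check for a sync-AG proof of concept: gather
# per-write commit latencies from a load test and compare the p95
# against the latency budget the app owners agreed to.
# Sample data and thresholds below are hypothetical.

import statistics

def p95_ms(latencies_ms):
    # statistics.quantiles with n=20 returns 19 cut points;
    # the last one is the 95th percentile
    return statistics.quantiles(latencies_ms, n=20)[-1]

def latency_sign_off(latencies_ms, p95_budget_ms):
    return p95_ms(latencies_ms) <= p95_budget_ms

# ~1 ms local commits plus a simulated 70 ms cross-region round trip
samples = [71 + (i % 10) for i in range(100)]  # 71..80 ms spread
print(latency_sign_off(samples, 100))  # within a 100 ms budget → True
print(latency_sign_off(samples, 75))   # within a 75 ms budget  → False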