Today’s a Good Day to Talk to Your Manager About Disaster Recovery.
Last night, two major IT disasters struck:
- Microsoft Azure’s Central US region went down for about 4 hours. The official post-mortem isn’t out yet, but rumor has it that while decommissioning legacy storage services, the product group deleted the wrong thing.
- CrowdStrike pushed a bad update, leading to blue screens of death on Windows systems worldwide, affecting banking, healthcare, airlines, and more.
If you were affected by one of those outages, you have my warmest virtual hug. At times like this, the stress level can be really tough, and I hope you can take care of yourself. Remember that your own self-worth is not determined by the IT solutions you work on.
If you weren’t affected by one of those outages, it’s a good time to spend an hour writing up a few things:
- Which of our production services are hosted entirely in a single region, availability zone, or data center?
- How are we monitoring the status of that single point of failure? If there’s a widespread outage like that, how much time are we going to waste troubleshooting our own services when there’s a bigger problem?
- When our single-region or single-AZ production services go down, what users/customers would be affected?
- How will we communicate the outage to those affected users? Can we write that notification ahead of time so that it’s ready to go quickly in the event of the next disaster like this?
- How much would it cost us (monthly or annually) to add in a second region or availability zone for protection from these kinds of incidents?
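The first question above – which services are pinned to a single region or AZ – lends itself to a quick script you can run against your own inventory. Here’s a minimal sketch; the service names, regions, and the `placements` mapping are all made up for illustration, so substitute whatever your deployment inventory actually looks like:

```python
# Sketch: flag production services deployed to exactly one region/AZ.
# The placements dict below is hypothetical example data.

def single_region_services(placements):
    """Return the names of services hosted in exactly one region or AZ."""
    return sorted(name for name, regions in placements.items()
                  if len(set(regions)) == 1)

placements = {
    "billing-api":  ["us-central"],               # single region: at risk
    "auth":         ["us-central", "us-east"],    # multi-region
    "reporting-db": ["us-central"],               # single region: at risk
}

print(single_region_services(placements))  # ['billing-api', 'reporting-db']
```

Whatever that list turns out to be, it’s the first thing to paste into the write-up for your manager – every entry on it is a single point of failure.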
Summarize that, pass it up to your manager in writing, and it’ll help them have discussions this morning with their managers and executives. Today, a lot of business folks are going to be asking questions, and having these answers will help get you the resources you want.
(Or, it’ll help you feel more comfortable that the business understands the risks of putting all their eggs in a single basket, and that when that basket breaks, it’s not your fault. You warned ’em, and they chose not to spend the money to double-up on baskets.)
Hi! I’m Brent Ozar.
I make Microsoft SQL Server go faster. I love teaching, travel, cars, and laughing. I’m based out of Las Vegas. He/him. I teach SQL Server training classes, or if you haven’t got time for the pain, I’m available for consulting too.
10 Comments
You’ve been trying to warn everyone for years by having sp_Blitz report the CrowdStrike Falcon driver as a “dangerous third-party module”
I have indeed!
is there any elaboration/link on why it’s a dangerous module?
The original issue and pull request are described at https://github.com/BrentOzarULTD/SQL-Server-First-Responder-Kit/issues/3147 : “[CrowdStrike’s Modules] were highlighted in a recent RCA with Microsoft to investigate why an AG went offline.”
https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/performance/performance-consistency-issues-filter-drivers-modules lists CrowdStrike’s drivers as potentially causing problems.
It isn’t unique to CrowdStrike – McAfee, Sophos, and Cylance (and others) all have these drivers as well, and all are equally pernicious. Given the nature of some of the outages some of them have caused, there is obviously minimal testing happening, if any.
Their patch management is also an abomination. Something as invasive to a system as god-level-privileged third-party software that also includes kernel-mode code should never just go out to all devices concurrently. No sysadmin would ever do that with something as mundane as desktop updates, but third-party AV vendors will big-bang a bad update to thousands (or millions) of devices without hesitation. I have even had updates that vendors had announced as bad – instructing organizations not to install them – get pushed out through automatic update when our updates were behind for some reason.
I am absolutely not in the “do not install AV” camp – but the more that AV vendors appeal to management (vs. IT and compliance), the more catastrophic the rare AV problems seem to become.
For what it is worth, I am becoming a bigger and bigger fan of Defender. Not an enthusiastic one: the onboarding process is horrible, the licensing (mainly the packaging of it) is utterly confusing, the documentation is terrible, and the dashboards are mediocre at best. But I don’t remember a time they have pushed out a high-impact bad update; when you correctly configure a rule or exception, it reliably adheres to it; and Microsoft doesn’t send nontechnical salesmen to butter up management about new features that aren’t relevant to the environment, or features that have already been vetted as nonfunctional/impractical and have to be endlessly met about and argued over.
Well, just like other AVs, it’s basically a virus with a couple of major differences: you _volunteer_ to install it _and_ pay for it.
This goes triple for municipalities. A disaster can take out power, water, roads, even the hospital and jail, but whatever tent the city puts up afterwards is where FEMA will start. Yes, that could start with a generator, a few servers, and ham radio operators.
Yes, that means “installation media” for .DLLs, .EXEs, config files, notes on names, etc.
That said – PLEASE start with yourself and your loved ones, because if you, the DBA, don’t show up, a lot less gets done. YOU are important to more than just your family. My motto is: “Those who prepare suffer far less than those who do not.”
Yes, having Disaster Recovery (DR) in the cloud is a great idea. From a few conventions, the trend is “citizens have Starlink and sat comms – cities, jails, and police do not.” Yes, a couple of schools did, but they are not on the same network. Funny how that mimics fire drills: OSHA requires a plan but does not require that it be practiced.
I don’t even believe in region redundancy anymore – at least not from the same cloud provider.
Granted, most organizations I implement at don’t want to build a redundant AWS/Azure environment, and do it all on Azure or all on AWS – I still remember the Microsoft datacenter outage in Austin in 2017 or 2018.
I had 3 clients at the time who had paid for DR, and it didn’t work for any of them; Microsoft had so severely oversubscribed the South Central region that the failover to Minneapolis just didn’t happen. One of my clients was toward the end of a business cycle where they onboard tens of thousands of new customers over a period of a few weeks. All three clients relied on email to coordinate DR efforts. Exchange Online was the platform for all of them, and since it was georedundant, it was believed to be the most reliable and hardened against the most likely types of disasters – except the outage took out a major portion of mailboxes in South Central, its failover didn’t work either, and for the mailboxes that did work, delivery was extremely slow and unreliable.
I think that’s absolutely a fair concern. Microsoft has consistently had oversubscription issues – if you watch the Azure subreddit, there’s a constant stream of people having problems spinning up various VM types or services in different data centers due to capacity issues.
Even worse – I just remembered this – when Spectre and Meltdown were announced a few months after that outage, South Central was still oversubscribed. The updates Microsoft pushed out to their virtual hosts to patch Spectre and Meltdown made some of the guests unstable, and with the number of guests rebooting, the region couldn’t handle it; if I remember correctly, servers with 2 or 4 cores would not come back up after hard-booting them.