In the cloud, treat your servers like cattle, not like pets.
In the cloud, systems administration is very different from the on-premises stuff you’re used to. When you build VMs in the cloud with Infrastructure-as-a-Service (IaaS, meaning AWS EC2, GCE, or Azure VMs), you expect them to die. It’s just a matter of time. If you’re lucky, it’ll be years from now, but if you’re unlucky, it’ll be tomorrow.
This sort of thinking drove Netflix to create the Chaos Monkey. In their 2010 post 5 Lessons We’ve Learned Using AWS, they wrote:
“One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”
That’s right: they have a tool that randomly terminates production instances.
Imagine being an admin in an organization that runs the Chaos Monkey. You start to think completely differently: that precious server you’re about to put into production was born to fight an enemy that’s out to kill it. The enemy will win the battle, but not necessarily the war.
You have to be able to lose soldiers – individual servers – at any time, but design your infrastructure in a way that the entire application, whatever service you’re providing, will win the war overall.
That’s why cloud sysadmins have started treating infrastructure as code.
Don’t think of VMs as servers you build.
Think of them as applications you deploy.
In companies like this, from the moment the VM powers on, everything you do to get it ready for production needs to be scripted, repeatable, and eventually, automated to the point where it happens without human involvement. And when you’ve got this much scripting in play, that also means controlling the source code just like you would an application’s code.
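Even something as simple as keeping a server’s desired state in a version-controlled file changes the workflow. Here’s a minimal sketch of the idea – all of the names and values are hypothetical, standing in for whatever your real build pipeline consumes:

```python
import json

# Hypothetical desired-state definition for one class of SQL Server VM.
# Because this lives in source control, every change is reviewed,
# diffable, and repeatable - just like application code.
DESIRED_STATE = {
    "os": "Windows Server 2016",
    "sql_version": "2016",
    "cumulative_update": "CU5",
    "availability_group": "AG1",
    "replica_type": "async",
}

def render_build_manifest(state):
    """Serialize the desired state so an automated build job can consume it."""
    return json.dumps(state, indent=2, sort_keys=True)

print(render_build_manifest(DESIRED_STATE))
```

The point isn’t the format – it’s that the server’s configuration now has a commit history instead of living in one admin’s head.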
You rarely see Chaos Monkeys around database servers.
We database administrators have always treated our servers like the most special of pets. We pick just the right breed, we give them special names, we train them very carefully, we teach them tricks, and we build a personal bond with them. When they die, we’re heartbroken, and we feel like we have to start over from scratch.
Most SQL Server shops, big and small, simply aren’t prepared to do all of this in an automated fashion:
- Deploy Windows
- Install & configure failover clustering so the server can join an AG
- Install SQL Server and configure Always On High Availability
- Join the right cluster
- Install all your utility stored procedures, backup jobs, and Agent alerts
- Restore the relevant databases
- Join the right Always On Availability Group with the right replication type (sync vs async)
- Modify the read-only routing lists on all the replicas
This stuff is hard to do manually, let alone automatically. Therefore, we think of our servers as precious hand-crafted pets rather than cattle we could just lose at any time without heartbreak.
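Sketched as code, that checklist might look something like this. Every function and step name here is hypothetical – in real life each step would shell out to cloud APIs, PowerShell, or unattended installers – but the shape is the point: each step becomes a call you can rerun, not a runbook a human follows:

```python
def build_replica(server_name, ag_name, sync_mode):
    """Hypothetical end-to-end build of one Availability Group replica.

    Each step would invoke your real deployment tooling; here they're
    placeholders that record what would run, in order.
    """
    steps = [
        f"deploy Windows on {server_name}",
        f"install and configure failover clustering on {server_name}",
        f"install SQL Server and enable Always On on {server_name}",
        f"join {server_name} to the right cluster",
        "install utility stored procedures, backup jobs, and Agent alerts",
        f"restore the relevant databases onto {server_name}",
        f"join {server_name} to {ag_name} as a {sync_mode} replica",
        "update read-only routing lists on all replicas",
    ]
    completed = []
    for step in steps:
        print(f"==> {step}")  # real code: invoke the tooling, verify results
        completed.append(step)
    return completed

# One button press: stand up a new async replica for AG1.
build_replica("SQL03", "AG1", "async")
```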
That’s what a Platform-as-a-Service does for you.
Azure SQL DB and Amazon RDS do all this stuff for you at the swipe of a credit card. They take away the plumbing parts of database administration that suck: building, backing up, patching, corruption repair, etc. For any new app builds today, I recommend thinking about PaaS first. (It’s what we use for our own development.) But PaaS isn’t a fit for every app yet – take kCura’s Relativity, for example:
- Missing features – for example, Relativity relies on linked server queries. Amazon RDS has a kind of sketchy implementation, and Azure SQL DB requires you to pre-define individual table structures. Neither of those works with the way Relativity is built today, and the code changes required to support either platform would be fairly expensive. (That’s not the only one, obviously, but I’m keeping the examples simple for this post. I can already hear the armchair architects going, “just tell the stupid developers to rewrite their stupid app.” That ain’t how the real world works.)
- Capacity/performance limitations – Azure SQL DB maxes out at 4TB per database, and Amazon RDS maxes out at 30 databases per instance. Both of those present problems for Relativity.
- Exporting raw data – Relativity users want to be able to sync on-premises versions of the data, sometimes migrating from on-prem to the cloud, and sometimes migrating back out. With PaaS, this involves long outages that aren’t acceptable to attorneys in the middle of frantic case review.
So in this case – as I’ve seen with a few other clients – PaaS isn’t quite ready today to handle the challenges of a global ISV with a mature, profitable application. They simply can’t hit the pause button on building new features and take time out to do an expensive back-end rewrite. (Although in the case of kCura, they’re moving some parts of the data out of SQL Server where it makes sense, thereby making the app easier to handle in PaaS-like environments.)
Hmm. If Azure SQL DB and Amazon RDS aren’t a good fit yet, but we want bulletproof reliability and automatic scaling, and we gotta use the boxed product (Microsoft SQL Server), what do we do?
What would SQL Server need to go up against the Chaos Monkey?
What if, at the push of a button, we could deploy a VM with the right Windows config, set up clustering, get SQL Server installed correctly, restore the right databases, and join an Availability Group?
Sure, that would help us defend against the Chaos Monkey because if any one instance disappeared, we’d be able to rapidly stand up its replacement just by hitting the button. The other AG members would cover for it in the meantime.
Even better, what if we mounted that button in a place that anyone – or anything – could push? What if we enabled our monitoring tools to push that button for us when things were going wrong?
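A minimal sketch of that idea: the monitoring loop decides when to push the button. The names here are hypothetical stand-ins for your monitoring probe and your build pipeline:

```python
FAILURE_THRESHOLD = 3  # consecutive failed health checks before we rebuild

def should_press_the_button(health_history, threshold=FAILURE_THRESHOLD):
    """Return True when the most recent `threshold` checks all failed.

    health_history is a list of booleans, oldest first; in real life it
    would come from your monitoring tool's probe of the replica.
    """
    recent = health_history[-threshold:]
    return len(recent) == threshold and not any(recent)

def watch(replica_name, health_history, rebuild):
    """If the replica looks dead, push the button: rebuild() stands in
    for the automated pipeline that deploys its replacement."""
    if should_press_the_button(health_history):
        print(f"{replica_name} looks gone - pressing the button")
        rebuild(replica_name)
```

The threshold matters: you want the system to shrug off one flaky probe, but not to babysit a dying replica for hours the way a human would.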
That button would be good for a whole lot of bonus stuff.
That button could completely change how we do patching, for example. Rather than patch an existing server, just define (in code) the appropriate patch levels, and hit the button. Let the system stand up a new replica with the right patch levels. When it’s good to go, simply fail over to it, and then delete the old replica. Cattle, not pets – that cow did a good job, but it’s not needed anymore, and we can just make it go away.
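Sketched in code – a toy model, with hypothetical build/failover/destroy hooks standing in for real tooling – desired-state patching looks like this:

```python
def needs_replacement(replica, desired):
    """Cattle-style patching: a replica whose patch level drifts from the
    desired state isn't patched in place - it's flagged for replacement."""
    return replica["cumulative_update"] != desired["cumulative_update"]

def rolling_patch(replicas, desired, build, failover_to, destroy):
    """Replace each stale replica: stand up a patched one, fail over to
    it, then delete the old one. The three callables are hypothetical
    hooks into your deployment pipeline."""
    for replica in list(replicas):
        if needs_replacement(replica, desired):
            new = build(desired)   # stand up a replica at the new patch level
            failover_to(new)       # once it's synchronized, move traffic over
            destroy(replica)       # the old cow did its job; let it go
```

Notice there’s no “apply CU in place” step anywhere – drift from the desired state always ends in replacement, never repair.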
The button would change how troubleshooting works. Having problems with a janky replica? Can’t figure out why it’s failing or throwing strange errors? Just hit the button, and a new one appears to replace it.
SQL Server DBAs have a hard time with this button.
We’re so attached to our pets that we have a hard time saying, “Aw, screw it, just stand up another SQL Server and kill that troublesome one.”
We need to flip things around:
Ops and cloud admins are all about putting a lot of work into building that button up front, and then hitting it as often as necessary.
That button is what kCura’s Mike Malone termed a Faux PaaS: it’s like Azure SQL DB’s Platform-as-a-Service, but something you build and manage yourself. It’s a SQL Server service that stands a much better chance of overall success against the Chaos Monkey. Sure, you could build this kind of thing on-premises, but the cloud’s easier deployment APIs make this more feasible.
I’ve had clients build this kind of thing privately, but I’m excited that for the first time, I’m on a project that I can talk about publicly. Over the coming weeks, I’ll talk about why you might build something like this, design considerations, and common gotchas. Next week, we’ll cover choosing the right instance types, storage, and backup location for your RPO/RTO goals. Along the way, I’ll even share new open source tools with you to make the journey easier.