Building a Faux PaaS, Part 1: The SQL Server DevOps Scene in 2017

Last Updated June 8, 2017

In the cloud, treat your servers like cattle, not like pets.

In the cloud, systems administration is very different than the on-premises stuff you’re used to. When you build VMs in the cloud with Infrastructure-as-a-Service (IaaS, meaning AWS EC2, GCE, or Azure VMs), you expect them to die. It’s just a matter of time. If you’re lucky, it’ll be years from now, but if you’re unlucky, it’ll be tomorrow.

This sort of thinking drove Netflix to create the Chaos Monkey. In their 2010 post 5 Lessons We’ve Learned Using AWS, they wrote:

“One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”

That’s right: they have a tool that randomly terminates production instances.

Imagine being an admin in an organization that runs the Chaos Monkey. You start to think completely differently: that precious server you’re about to put into production was born to fight an enemy that’s out to kill it. The enemy will win the battle, but not necessarily the war.

You have to be able to lose soldiers – individual servers – at any time, but design your infrastructure in a way that the entire application, whatever service you’re providing, will win the war overall.
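To make the idea concrete, here’s a toy sketch of it in Python. This is not Netflix’s tool, just an illustration: the fleet names, `chaos_monkey`, and `service_is_up` are all made up for this example, and “killing” an instance is only removing it from a set.

```python
import random


def chaos_monkey(instances, rng):
    """Remove one randomly chosen instance from the fleet and return its name."""
    victim = rng.choice(sorted(instances))  # sorted so a seeded run is repeatable
    instances.remove(victim)
    return victim


def service_is_up(instances, minimum=1):
    """The service wins the war as long as enough instances are still standing."""
    return len(instances) >= minimum


if __name__ == "__main__":
    fleet = {"sql-a", "sql-b", "sql-c"}  # hypothetical instance names
    victim = chaos_monkey(fleet, random.Random())
    print(f"Chaos Monkey killed {victim}; service up: {service_is_up(fleet)}")
```

The point of the exercise is the second function: you lose a soldier every time the first one runs, and the design goal is that `service_is_up` keeps returning True anyway.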

That’s why cloud sysadmins have started treating infrastructure as code.

Don’t think of VMs as servers you build.
Think of them as applications you deploy.

In companies like this, from the moment the VM powers on, everything you do to get it ready for production needs to be scripted, repeatable, and eventually, automated to the point where it happens without human involvement. And when you’ve got this much scripting in play, that also means controlling the source code just like you would an application’s code.

You rarely see Chaos Monkeys around database servers.

We database administrators have always treated our servers like the most special of pets. We pick just the right breed, we give them special names, we train them very carefully, we teach them tricks, and we build a personal bond with them. When they die, we’re heartbroken, and we feel like we have to start over from scratch.

Most SQL Server shops big and small simply aren’t prepared to do all of this in an automated fashion:

Deploy Windows
Install & configure failover clustering so the server can join an AG
Install SQL Server and configure Always On High Availability
Join the right cluster
Install all your utility stored procedures, backup jobs, Agent alerts
Restore the relevant databases
Join the right Always On Availability Group with the right replication type (sync vs async)
Modify the read-only routing lists on all the replicas

This stuff is hard to do manually, let alone automatically. Therefore, we think of our servers as precious hand-crafted pets rather than cattle we could just lose at any time without heartbreak.
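As a rough sketch, the list above could be modeled as an ordered pipeline of re-runnable steps. Everything here is an assumption for illustration: the step names just mirror the list, `Replica` and `build_replica` are hypothetical, and `run_step` is a placeholder where real cloud API calls, PowerShell, or T-SQL would go.

```python
from dataclasses import dataclass, field

# Step names mirror the list above; the work behind each one is a placeholder.
BUILD_STEPS = [
    "deploy_windows",
    "configure_failover_clustering",
    "install_sql_server",
    "join_cluster",
    "install_utility_procs_and_jobs",
    "restore_databases",
    "join_availability_group",
    "update_read_only_routing",
]


@dataclass
class Replica:
    name: str
    completed: list = field(default_factory=list)


def run_step(replica, step):
    """Placeholder: a real step would call a cloud API, PowerShell, T-SQL, etc."""
    print(f"[{replica.name}] {step}")


def build_replica(replica, steps=BUILD_STEPS, runner=run_step):
    """Run each step in order, skipping steps that already completed.

    A step is only recorded after it succeeds, so if one raises, re-running
    build_replica picks up where the last attempt stopped.
    """
    for step in steps:
        if step in replica.completed:
            continue
        runner(replica, step)
        replica.completed.append(step)
    return replica


if __name__ == "__main__":
    build_replica(Replica("sql-new-1"))  # hypothetical server name
```

Tracking completed steps per replica is one way to get the “repeatable” part: the same script can be run against a half-built server without redoing the work that already succeeded.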

That’s what a Platform-as-a-Service does for you.

Azure SQL DB and Amazon RDS do all this stuff for you at the swipe of a credit card. They take away the plumbing parts of database administration that suck: building, backing up, patching, corruption repair, etc. For any new app builds today, I recommend thinking about PaaS first. (It’s what we use for our own development.)

But for existing apps, PaaS has a few showstoppers. Let’s take kCura Relativity, an app I’ve blogged about before. Here’s why they can’t just switch to PaaS hosting:

Missing features – for example, Relativity relies on linked server queries. Amazon RDS has a kind of sketchy implementation, and Azure SQL DB requires you to pre-define individual table structures. Neither of those works with the way Relativity is built today, and the code changes required to support either platform would be fairly expensive. (That’s not the only one, obviously, but I’m keeping the examples simple for this post. I can already hear the armchair architects going, “just tell the stupid developers to rewrite their stupid app.” That ain’t how the real world works.)
Capacity/performance limitations – Azure SQL DB maxes out at 4TB per database, and Amazon RDS maxes out at 30 databases per instance. Both of those present problems for Relativity.
Exporting raw data – Relativity users want to be able to sync on-premises versions of the data, sometimes migrating from on-prem to the cloud, and sometimes migrating back out. With PaaS, this involves long outages that aren’t acceptable to attorneys in the middle of frantic case review.
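If you wanted to run that same kind of checklist against your own app, a minimal sketch might look like the one below. The two limits are the figures quoted above as of this writing and will change over time; the `Workload` fields, `paas_blockers`, and the sample numbers are hypothetical, and this only covers the first two showstoppers.

```python
from dataclasses import dataclass

# Limits as quoted in this post (2017 figures); check current vendor docs.
AZURE_SQL_DB_MAX_DB_TB = 4
RDS_MAX_DATABASES_PER_INSTANCE = 30


@dataclass
class Workload:
    largest_database_tb: float
    databases_per_instance: int
    uses_linked_servers: bool


def paas_blockers(workload):
    """Return a list of reasons this workload can't simply move to PaaS."""
    blockers = []
    if workload.uses_linked_servers:
        blockers.append("relies on linked server queries")
    if workload.largest_database_tb > AZURE_SQL_DB_MAX_DB_TB:
        blockers.append("exceeds Azure SQL DB per-database size limit")
    if workload.databases_per_instance > RDS_MAX_DATABASES_PER_INSTANCE:
        blockers.append("exceeds Amazon RDS databases-per-instance limit")
    return blockers


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    app = Workload(largest_database_tb=6, databases_per_instance=200,
                   uses_linked_servers=True)
    for reason in paas_blockers(app):
        print("Blocker:", reason)
```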

So in this case – as I’ve seen with a few other clients – PaaS isn’t quite ready today to handle the challenges of a global ISV with a mature, profitable application. They simply can’t hit the pause button on building new features, and take time out to do an expensive back end rewrite. (Although in the case of kCura, they’re moving some parts of the data out of SQL Server where it makes sense, thereby making the app easier to handle in PaaS-like environments.)

Hmm. If Azure SQL DB and Amazon RDS aren’t a good fit yet, but we want bulletproof reliability and automatic scaling, and we gotta use the boxed product (Microsoft SQL Server), what do we do?

What would SQL Server need to go up against the Chaos Monkey?

What if, at the push of a button, we could deploy a VM with the right Windows config, set up clustering, get SQL Server installed correctly, restore the right databases, and join an Availability Group?

Sure, that would help us defend against the Chaos Monkey because if any one instance disappeared, we’d be able to rapidly stand up its replacement just by hitting the button. The other AG members would cover in its place for a while.

Even better, what if we mounted that button in a place that anyone – or anything – could push? What if we enabled our monitoring tools to push that button for us when things were going wrong?
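Here’s a rough sketch of what “letting the monitoring tool push the button” could look like: compare the replicas you want against the replicas that are actually healthy, and push the button once for each one that’s missing. The names `reconcile`, `is_healthy`, and `push_button` are hypothetical; in real life the health check and the button would be calls into your monitoring and deployment tooling.

```python
def reconcile(desired_count, replicas, is_healthy, push_button):
    """Keep the healthy replicas and push the button once per missing one.

    replicas     -- names of the replicas we currently know about
    is_healthy   -- callable(name) -> bool, supplied by monitoring
    push_button  -- callable() -> name of a freshly built replica
    """
    healthy = [name for name in replicas if is_healthy(name)]
    missing = max(0, desired_count - len(healthy))
    replacements = [push_button() for _ in range(missing)]
    return healthy + replacements


if __name__ == "__main__":
    built = []

    def push_button():
        # Stand-in for the real deployment; just hands back a new name.
        name = f"sql-new-{len(built) + 1}"
        built.append(name)
        return name

    fleet = ["sql-a", "sql-b", "sql-c"]  # hypothetical replica names
    fleet = reconcile(3, fleet, lambda name: name != "sql-b", push_button)
    print(fleet)
```

Run on a schedule, a loop like this does nothing when everything is healthy, which is what makes it safe to hand to a monitoring tool.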

That button would be good for a whole lot of bonus stuff.

That button could completely change how we do patching, for example. Rather than patch an existing server, just define (in code) the appropriate patch levels, and hit the button. Let the system stand up a new replica with the right patch levels. When it’s good to go, simply fail over to it, and then delete the old replica. Cattle, not pets – that cow did a good job, but it’s not needed anymore, and we can just make it go away.
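Sketched in code, patch-by-replacement might look something like this. It’s a simplified model under stated assumptions: replicas are just a dict of name to patch level, `provision` is a hypothetical stand-in for the button, the patch labels are made up, and “failing over” is only reassigning which name is the primary.

```python
def patch_by_replacement(replicas, primary, target_patch, provision):
    """Replace every out-of-date replica instead of patching it in place.

    replicas     -- dict of replica name -> current patch level
    primary      -- name of the current primary replica
    target_patch -- patch level defined in code
    provision    -- callable(patch_level) -> name of a new, ready replica

    Returns the updated (replicas, primary).
    """
    replicas = dict(replicas)  # leave the caller's dict alone
    outdated = [name for name, patch in replicas.items() if patch != target_patch]
    for old in outdated:
        new = provision(target_patch)  # hit the button
        replicas[new] = target_patch   # new replica joins at the right level
        if old == primary:
            primary = new              # fail over before retiring the old primary
        del replicas[old]              # that cow did a good job; make it go away
    return replicas, primary


if __name__ == "__main__":
    counter = iter(range(1, 100))
    fleet = {"sql-1": "CU3", "sql-2": "CU3", "sql-3": "CU4"}  # made-up names and levels
    fleet, primary = patch_by_replacement(
        fleet, "sql-1", "CU4", lambda patch: f"sql-{patch}-{next(counter)}"
    )
    print(fleet, primary)
```

Note the order inside the loop: the replacement is added before the old replica is removed, so the group never has fewer members than it started with.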

The button would change how troubleshooting works. Having problems with a janky replica? Can’t figure out why it’s failing or throwing strange errors? Just hit the button, and a new one appears to replace it.

SQL Server DBAs have a hard time with this button.

We’re so attached to our pets that we have a hard time saying, “Aw, screw it, just stand up another SQL Server and kill that troublesome one.”

We need to flip things around:

Ops/cloud admins are all about putting a lot of work into building that button, and then hitting that button as often as necessary.

That button is what kCura’s Mike Malone termed a Faux PaaS: it’s like Azure SQL DB’s Platform-as-a-Service, but something you build and manage yourself. It’s a SQL Server service that stands a much better chance of overall success against the Chaos Monkey. Sure, you could build this kind of thing on-premises, but the cloud’s easier deployment APIs make this more feasible.

I’ve had clients build this kind of thing privately, but I’m excited that for the first time, I’m on a project that I can talk about publicly. Over the coming weeks, I’ll talk about why you might build something like this, design considerations, and common gotchas. Next week, we’ll cover choosing the right instance types, storage, and backup location for your RPO/RTO goals. Along the way, I’ll even share new open source tools with you to make the journey easier.

Continue Reading Part 2:
Choosing and Testing a Cloud Vendor


13 Comments

BrettC
May 9, 2017 9:23 am

Great post! I’m very excited to read the follow-up posts about this project.

rich
May 9, 2017 10:39 am

Great stuff. Very much looking forward to reading how this is done

Greg Besso
May 9, 2017 11:25 am

Thanks for sharing, I’m happy to have learned about the Chaos Monkey that’s being used to intentionally kill instances. It will make for better practices.

Wes Crockett
May 9, 2017 11:56 am

Oh man, this is exciting! Can’t wait for the follow ups!

I did some ‘dev-ops’-ish stuff in a previous position for on-premises systems. It included auto-building of dev code followed by automated regression testing, all with Jenkins. I really loved developing the solutions.

Database Antichrist
May 9, 2017 12:11 pm

Ooooooh…. this is relative to my interests!!!!!!

- Brent Ozar
  May 9, 2017 12:38 pm
  
  Thanks guys! Glad there’s an interest in it. It’s a ton of fun to share projects like this.
  
SQL_dude
May 9, 2017 1:35 pm

Really shows how our consume everything and throw it away for the latest shiny object is being pushed into business practices. We do not care about quality anymore, just how fast to go before crashing it. So how do you store historical data for analysis? Different system, or do not care about old data?

- Brent Ozar
  May 9, 2017 1:36 pm
  
  You may want to read the supporting links. In the e-discovery business, you very, very much do care about old data. 😀
  
Dragos N.
May 9, 2017 4:08 pm

I will go get the popcorn 🙂

David Spearritt
May 9, 2017 5:37 pm

Reminds me of Cato.
https://www.youtube.com/watch?v=uk_2-ib3ENc

terry foster
May 18, 2017 2:37 pm

highly informative as always, please keep these coming.

Ron Dameron
May 22, 2017 6:42 am

Oh, man. This was an excellent post. Before leaving my last job, I was working on an automation project to build database servers. It is hard work getting a magic button to work. We had hundreds of servers. I told the management, “I can provision a server in AWS or Azure in minutes.” That’s our goal. We were still trying to get a cluster install working when I left. The new Google SRE book has some interesting coverage on how they do automation in their data centers also that might be relevant. I’m only 16% through that book and it’s been an eye opener.

- Brent Ozar
  May 22, 2017 7:25 am
  
  Thanks sir!
  