Building a Faux PaaS, Part 1: The SQL Server DevOps Scene in 2017

In the cloud, treat your servers like cattle, not like pets.

Rounding up a herd of servers in Texas

In the cloud, systems administration is very different from the on-premises stuff you’re used to. When you build VMs in the cloud with Infrastructure-as-a-Service (IaaS, meaning AWS EC2, GCE, or Azure VMs), you expect them to die. It’s just a matter of time. If you’re lucky, it’ll be years from now, but if you’re unlucky, it’ll be tomorrow.

This sort of thinking drove Netflix to create the Chaos Monkey. In their 2010 post 5 Lessons We’ve Learned Using AWS, they wrote:

“One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”

That’s right: they have a tool that randomly terminates production instances.
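To make the idea concrete, here’s a toy sketch of that behavior. It is not Netflix’s tool: the function names, the instance names, and the “at least one survivor” rule are all assumptions made up for illustration.

```python
import random


def chaos_monkey(instances, rng):
    """Terminate one randomly chosen instance; return (victim, survivors)."""
    if not instances:
        raise ValueError("no instances left to terminate")
    # Sort first so the choice is reproducible for a seeded rng.
    victim = rng.choice(sorted(instances))
    survivors = set(instances) - {victim}
    return victim, survivors


def service_is_up(survivors, minimum_healthy=1):
    """Hypothetical rule: the service survives if enough instances remain."""
    return len(survivors) >= minimum_healthy


if __name__ == "__main__":
    fleet = {"web-1", "web-2", "web-3"}  # made-up instance names
    victim, fleet = chaos_monkey(fleet, random.Random())
    print(f"terminated {victim}; service up: {service_is_up(fleet)}")
```

The point of the second function is the one that matters: losing any single instance should never flip the answer to “down.”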

Imagine being an admin in an organization that runs the Chaos Monkey. You start to think completely differently: that precious server you’re about to put into production was born to fight a villainous enemy that’s out to kill it. The enemy will win the battle, but not necessarily the war.

You have to be able to lose soldiers – individual servers – at any time, but design your infrastructure in a way that the entire application, whatever service you’re providing, will win the war overall.

That’s why cloud sysadmins have started treating infrastructure as code.

Don’t think of VMs as servers you build.
Think of them as applications you deploy.

In companies like this, from the moment the VM powers on, everything you do to get it ready for production needs to be scripted, repeatable, and eventually, automated to the point where it happens without human involvement. And when you’ve got this much scripting in play, that also means controlling the source code just like you would an application’s code.

You rarely see Chaos Monkeys around database servers.

We database administrators have always treated our servers like the most special of pets. We pick just the right breed, we give them special names, we train them very carefully, we teach them tricks, and we build a personal bond with them. When they die, we’re heartbroken, and we feel like we have to start over from scratch.

Most SQL Server shops big and small simply aren’t prepared to do all of this in an automated fashion:

  • Deploy Windows
  • Install & configure failover clustering so the server can join an AG
  • Install SQL Server and configure Always On High Availability
  • Join the right cluster
  • Install all your utility stored procedures, backup jobs, Agent alerts
  • Restore the relevant databases
  • Join the right Always On Availability Group with the right replication type (sync vs async)
  • Modify the read-only routing lists on all the replicas
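As a rough sketch, the checklist above can be written down as an ordered list of scripted steps that a single function runs from top to bottom. Every step here is a placeholder that only logs what it would do. The names `ReplicaBuild`, `step`, `BUILD_STEPS`, and `build_replica` are hypothetical, and a real version would call whatever deployment tooling your shop uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReplicaBuild:
    """Tracks which steps have run for one new replica."""
    name: str
    log: List[str] = field(default_factory=list)


def step(description):
    """Build a placeholder step that only records what it would do."""
    def run(build):
        build.log.append(description)
    return run


# The order mirrors the checklist; each entry stands in for a real script.
BUILD_STEPS = [
    step("deploy Windows"),
    step("install and configure failover clustering"),
    step("install SQL Server and enable Always On"),
    step("join the right cluster"),
    step("install utility procs, backup jobs, Agent alerts"),
    step("restore the relevant databases"),
    step("join the Availability Group (sync or async)"),
    step("update read-only routing lists on all replicas"),
]


def build_replica(name, steps=BUILD_STEPS):
    """Run every step in order; a real step would raise on failure."""
    build = ReplicaBuild(name)
    for run in steps:
        run(build)
    return build


if __name__ == "__main__":
    for line in build_replica("sql-replica-3").log:
        print(line)
```

Writing it as data (a list of steps) rather than a wiki page is what lets you put it in source control and run it without a human.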

This stuff is hard to do manually, let alone automatically. Therefore, we think of our servers as precious hand-crafted pets rather than cattle we could just lose at any time without heartbreak.

That’s what a Platform-as-a-Service does for you.

Azure SQL DB and Amazon RDS do all this stuff for you at the swipe of a credit card. They take away the plumbing parts of database administration that suck: building, backing up, patching, corruption repair, etc. For any new app builds today, I recommend thinking about PaaS first. (It’s what we use for our own development.)

But for existing apps, PaaS has a few showstoppers. Let’s take kCura Relativity, an app I’ve blogged about before. Here’s why they can’t just switch to PaaS hosting:

  • Missing features – for example, Relativity relies on linked server queries. Amazon RDS has a kind of sketchy implementation, and Azure SQL DB requires you to pre-define individual table structures. Neither of those works with the way Relativity is built today, and the code changes required to support either platform would be fairly expensive. (That’s not the only one, obviously, but I’m keeping the examples simple for this post. I can already hear the armchair architects going, “just tell the stupid developers to rewrite their stupid app.” That ain’t how the real world works.)
  • Capacity/performance limitations – Azure SQL DB maxes out at 4TB per database, and Amazon RDS maxes out at 30 databases per instance. Both of those present problems for Relativity.
  • Exporting raw data – Relativity users want to be able to sync on-premises versions of the data, sometimes migrating from on-prem to the cloud, and sometimes migrating back out. With PaaS, this involves long outages that aren’t acceptable to attorneys in the middle of frantic case review.

So in this case – as I’ve seen with a few other clients – PaaS isn’t quite ready today to handle the challenges of a global ISV with a mature, profitable application. They simply can’t hit the pause button on building new features, and take time out to do an expensive back end rewrite. (Although in the case of kCura, they’re moving some parts of the data out of SQL Server where it makes sense, thereby making the app easier to handle in PaaS-like environments.)

Hmm. If Azure SQL DB and Amazon RDS aren’t a good fit yet, but we want bulletproof reliability and automatic scaling, and we gotta use the boxed product (Microsoft SQL Server), what do we do?

What would SQL Server need to go up against the Chaos Monkey?

What if, at the push of a button, we could deploy a VM with the right Windows config, set up clustering, get SQL Server installed correctly, restore the right databases, and join an Availability Group?

Sure, that would help us defend against the Chaos Monkey because if any one instance disappeared, we’d be able to rapidly stand up its replacement just by hitting the button. The other AG members would cover for it in the meantime.
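One way to picture the button is as a reconcile loop: you state how many replicas you want, and it provisions replacements until the healthy count matches. This is only a sketch under made-up assumptions. `reconcile` and `make_provisioner` are hypothetical names, and `provision()` stands in for the whole scripted build.

```python
import itertools


def make_provisioner(prefix="sql-replica", start=1):
    """Return a stand-in for the button: each call names a new replica."""
    counter = itertools.count(start)

    def provision():
        return f"{prefix}-{next(counter)}"

    return provision


def reconcile(desired_count, healthy_replicas, provision):
    """Provision replacements until we have the desired number of replicas."""
    replicas = list(healthy_replicas)
    while len(replicas) < desired_count:
        replicas.append(provision())
    return replicas


if __name__ == "__main__":
    # sql-replica-2 just died; we want three replicas in the AG.
    survivors = ["sql-replica-1", "sql-replica-3"]
    print(reconcile(3, survivors, make_provisioner(start=4)))
```

If nothing has died, the loop does nothing, so it’s safe to run as often as you like.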

Example: a dog’s butt

Even better, what if we mounted that button in a place that anyone – or anything – could push? What if we enabled our monitoring tools to push that button for us when things were going wrong?
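Here’s a minimal sketch of a monitoring tool pushing the button, assuming a simple rule: replace a replica after a set number of consecutive failed health checks. The class name, the threshold, and the `push_button` callback are all hypothetical. Real monitoring products have their own alerting hooks.

```python
class ReplicaMonitor:
    """Pushes the button after enough consecutive failed health checks."""

    def __init__(self, push_button, failure_threshold=3):
        self.push_button = push_button            # callable taking a replica name
        self.failure_threshold = failure_threshold
        self.failures = {}                        # replica name -> failure streak

    def record_check(self, replica, healthy):
        """Record one check result; return True if the button was pushed."""
        if healthy:
            self.failures[replica] = 0            # any success resets the streak
            return False
        self.failures[replica] = self.failures.get(replica, 0) + 1
        if self.failures[replica] >= self.failure_threshold:
            self.failures[replica] = 0
            self.push_button(replica)
            return True
        return False


if __name__ == "__main__":
    monitor = ReplicaMonitor(lambda name: print(f"replacing {name}"))
    for healthy in (False, False, False):
        monitor.record_check("sql-replica-2", healthy)
```

Requiring a streak rather than a single failure keeps one network blip from costing you a perfectly good replica.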

That button would be good for a whole lot of bonus stuff.

That button could completely change how we do patching, for example. Rather than patch an existing server, just define (in code) the appropriate patch levels, and hit the button. Let the system stand up a new replica with the right patch levels. When it’s good to go, simply fail over to it, and then delete the old replica. Cattle, not pets – that cow did a good job, but it’s not needed anymore, and we can just make it go away.
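The patching flow above might be sketched like this: for each replica below the target patch level, stand up a replacement, fail over to it if the old one was the primary, then delete the old one. Everything here is an assumption for illustration, including the `Replica` type, the `provision` callback, and the patch labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    patch_level: str


def patch_by_replacement(replicas, primary, target, provision):
    """Replace every replica below `target`; return (replicas, primary, events)."""
    events = []
    current = list(replicas)
    for old in replicas:
        if old.patch_level == target:
            continue                      # already at the right level
        new = provision(target)           # the button: build at the target level
        current.append(new)
        events.append(f"stand up {new.name} at {target}")
        if old == primary:
            primary = new                 # fail over before removing the old primary
            events.append(f"fail over {old.name} -> {new.name}")
        current.remove(old)
        events.append(f"delete {old.name}")
    return current, primary, events


if __name__ == "__main__":
    names = iter(["sql-3", "sql-4"])      # made-up replacement names
    old = [Replica("sql-1", "CU3"), Replica("sql-2", "CU3")]
    _, _, events = patch_by_replacement(
        old, old[0], "CU4", lambda level: Replica(next(names), level)
    )
    print("\n".join(events))
```

Note that the new replica joins before the old one leaves, so the replica count never drops below where it started.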

The button would change how troubleshooting works. Having problems with a janky replica? Can’t figure out why it’s failing or throwing strange errors? Just hit the button, and a new one appears to replace it.

SQL Server DBAs have a hard time with this button.

We’re so attached to our pets that we have a hard time saying, “Aw, screw it, just stand up another SQL Server and kill that troublesome one.”

We need to flip things around:

My job: 90% Excel, 10% WordPress

 

Ops/cloud admins are all about putting a lot of work into building that button, and then hitting that button as often as necessary.

That button is what kCura’s Mike Malone termed a Faux PaaS: it’s like Azure SQL DB’s Platform-as-a-Service, but something you build and manage yourself. It’s a SQL Server service that stands a much better chance of overall success against the Chaos Monkey. Sure, you could build this kind of thing on-premises, but the cloud’s easier deployment APIs make this more feasible.

I’ve had clients build this kind of thing privately, but I’m excited that for the first time, I’m on a project that I can talk about publicly. Over the coming weeks, I’ll talk about why you might build something like this, design considerations, and common gotchas. Next week, we’ll cover choosing the right instance types, storage, and backup location for your RPO/RTO goals. Along the way, I’ll even share new open source tools with you to make the journey easier.

Continue Reading Part 2:
Choosing and Testing a Cloud Vendor


13 Comments

  • Great post! I’m very excited to read the follow-up posts about this project.

  • Great stuff. Very much looking forward to reading how this is done

  • Thanks for sharing, I’m happy to have learned about the Chaos Monkey that’s being used to intentionally kill instances. It will make for better practices.

  • Wes Crockett
    May 9, 2017 11:56 am

    Oh man, this is exciting! Can’t wait for the follow ups!

    I did some ‘dev-ops’-ish stuff in a previous position for on-premise systems. It included auto-building of dev code followed by automated regressions testing all with Jenkins. I really loved developing the solutions.

  • Database Antichrist
    May 9, 2017 12:11 pm

    Ooooooh…. this is relative to my interests!!!!!!

  • Really shows how our “consume everything and throw it away for the latest shiny object” habit is being pushed into business practices. We do not care about quality anymore, just how fast we can go before crashing. So how do you store historical data for analysis? A different system, or do you not care about old data?

  • I will go get the popcorn 🙂

  • David Spearritt
    May 9, 2017 5:37 pm
  • terry foster
    May 18, 2017 2:37 pm

    highly informative as always, please keep these coming.

  • Oh, man. This was an excellent post. Before leaving my last job, I was working on an automation project to build database servers. It is hard work getting a magic button to work. We had hundreds of servers. I told the management, “I can provision a server in AWS or Azure in minutes.” That’s our goal. We were still trying to get a cluster install working when I left. The new Google SRE book has some interesting coverage on how they do automation in their data centers also that might be relevant. I’m only 16% through that book and it’s been an eye opener.

