Background: I’m working with kCura to build a Faux PaaS: something akin to Microsoft’s Azure SQL DB, but internally managed. You can catch up with what we’ve discussed so far in Part 1 and Part 2 of the series.
In the last post, I talked about measuring backup and restore throughputs across different instance types, regions, storage configs, and backup locations. It’s a lot of work to answer questions like “How should we configure our new SQL Server VMs?”
And that decision work isn’t ever done.
As you read the news feeds from Amazon Web Services, Google Compute Engine, and Microsoft Azure VMs, you’ll realize that this stuff changes all the dang time. New instance types, new storage devices, new ways of storing your backups – they all pop up at least once a quarter.
In a perfect world, your ops or engineering team needs to:
- Love the scientific method
- Measure things and make sound, data-based decisions
- Prefer building scripts & tools over manual tasks
- Check their work into source control, and test it (which, with database deployment code, is much harder than it sounds)
- Be inspired by things like Google’s Site Reliability Engineering book
- Understand that this work is never done, and treat it as a career-long journey
A good engineering team uses code to connect the dots.
Let’s take Always On Availability Groups as an example. If you’re building an automated self-healing Availability Group, you’re going to face the following failures in production:
- 1 node down, doesn’t respond to pings (could be for 60 seconds, or 5 minutes, and you probably want to react to those differently)
- The node is up, but 1 instance of SQL Server won’t start
- Node up, and SQL Server is running, but unresponsive (like threadpool issues and worker thread starvation)
- Cluster down – like insufficient quorum voters present, or IP address conflicts
- Cluster service stops on one node
- AG listener down, but the AG is OK
- AG down, but cluster is OK
- Witness down
- Replica out of sync with the primary
- Instance failed, and the AG can’t fail over automatically (like if the sync replicas had gotten behind)
- Backups didn’t run
- Extreme performance degradation
- Storage corruption or failed CheckDB
When you only have a few AGs, you can troubleshoot this kind of thing manually. When you have a few dozen, you need to build a runbook so that multiple team members can do root cause analysis exactly the same way, as efficiently as possible. When you have a few hundred or a few thousand, these troubleshooting steps – and even some of the repair efforts – have to be automated.
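To make that concrete, here's a minimal sketch of what "automated runbook" means in practice: a triage function that maps a detected symptom to an explicit next step. The symptom names and action strings are my own illustration, not a real monitoring API, but the shape is the point: every failure mode in the list above gets a deliberate branch, so the response is identical no matter who (or what) runs it.

```python
# Hypothetical sketch: mapping detected AG failure symptoms to runbook steps.
# Symptom names and actions are illustrative, not a real monitoring API.

from dataclasses import dataclass

@dataclass
class Symptom:
    name: str           # e.g. "node_no_ping"
    duration_sec: int   # how long the symptom has persisted

def triage(symptom: Symptom) -> str:
    """Return the runbook step for a detected symptom, so every engineer
    (or scheduler) reacts to the same failure the same way."""
    if symptom.name == "node_no_ping":
        # A 60-second blip and a 5-minute outage deserve different reactions.
        if symptom.duration_sec < 120:
            return "wait_and_recheck"
        return "page_oncall_and_start_node_runbook"
    if symptom.name == "sql_service_down":
        return "attempt_service_restart"
    if symptom.name == "ag_listener_down":
        return "check_cluster_resource_then_restart_listener"
    if symptom.name == "replica_out_of_sync":
        return "measure_send_queue_then_resume_data_movement"
    # Anything unrecognized goes to a human, documented as such.
    return "escalate_to_human"

print(triage(Symptom("node_no_ping", 30)))    # wait_and_recheck
print(triage(Symptom("node_no_ping", 600)))   # page_oncall_and_start_node_runbook
```

In a real system each returned step would itself be a script, and the dispatcher would log every decision, but even this toy version captures the cultural shift: the troubleshooting tree lives in source control, not in one DBA's head.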
Systems & database administrators are used to fixing systems.
Developers are used to fixing code and processes.
Engineering teams fix code and processes, which in turn fix the systems.
It only gets harder with SQL Server 2016’s new Distributed Availability Groups, which are basically an availability group of availability groups. Troubleshooting those will be fiendishly complex, and they’re going to be a lot more common as teams upgrade. Done right, they look like a great tool to facilitate upgrades from SQL 2016 to SQL 2017.
You, dear reader, may be a database administrator wondering, “Is my job safe?” I’ve got great news: if you want an engineering role like this, and you’re a good DBA, then you’re a great fit for jobs like this. Good DBAs are curious problem-solvers who love learning about things outside of their core job duties. That’s exactly the kind of seed that grows into a great engineering mind. (After all, in the team we’re building at the moment, most of the team have been DBAs of some sort or another.)
The engineering role just involves more than databases – especially in the cloud. For example:
- Part 1 of this series talked about being able to hit a button to stand up a new replica, configure it, restore the right databases, and join it to the right AG
- Part 2 talked about doing restores as performance tests, and how this never ends because new instance types and storage options keep coming out
Good engineering team members look at that and say, “Right, so when my button script restores databases onto the new replica, it needs to capture the performance metrics, and log those somewhere.”
Great engineering team members say, “I’ll also build this so I can run it whenever a new instance type shows up in my region. Or maybe I’ll even use the Azure cmdlets in a job to check when new instance types are available, and just do experiments as they show up.” Great engineering teams use code to be proactive because they know the cloud is going to change.
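That "run experiments as new instance types show up" loop is simple to sketch. Here's a hedged, runnable version of the idea: keep a state file of every VM size you've already benchmarked, diff it against what the provider currently offers, and queue a benchmark for anything new. The `fetch_available_sizes()` function is a stub standing in for a real call to the provider's API (the Azure cmdlets or SDK in our case); everything else is the actual logic.

```python
# Hypothetical sketch of the "proactive benchmarking" loop described above.
# fetch_available_sizes() is a stub standing in for a cloud provider API call
# (e.g. querying Azure for VM sizes in your region).

import json
from pathlib import Path

def fetch_available_sizes() -> set[str]:
    # Stub: in real life, ask the provider which VM sizes your region offers.
    return {"Standard_DS13_v2", "Standard_DS14_v2", "Standard_E16s_v3"}

def find_new_sizes(state_file: Path) -> set[str]:
    """Diff the provider's current size list against what we've already
    benchmarked; return only the sizes we haven't tested yet."""
    tested = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    current = fetch_available_sizes()
    new = current - tested
    # Record everything as seen, so the next run only reacts to real changes.
    state_file.write_text(json.dumps(sorted(current)))
    return new

for size in sorted(find_new_sizes(Path("benchmarked_sizes.json"))):
    # In the real pipeline, this kicks off the restore-benchmark script
    # from Part 2 and logs the resulting throughput numbers.
    print(f"queue restore benchmark for {size}")
```

Schedule that as a daily job and the team stops reacting to cloud announcements; the experiments just happen.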
The best engineering team members think like open source developers. They don’t publish conclusions: they publish their methodology, scripts, and resulting data. Even better, they push to publish this data not just internally, but externally, working together with the community to improve everybody’s game. They default to open.
“This sounds crazy. Who’s really doing this?”
Oh sure, there’s Google’s SRE team, Facebook’s engineers, and Etsy’s deployment processes, but I’ll use an example closer to home in the SQL Server community: Stack Overflow. (See, Nick, I got the space right this time.) Granted, Stack doesn’t use the cloud the way we’re discussing in this blog post series, but I’d argue their engineering teams exhibit exactly the kinds of behavior that make good cloud engineering teams successful.
Your first clue that they’re a little different is the Engineering team bio list. The developers are intermingled with the site reliability engineers – because honestly, they have a lot in common. I used to jokingly say that Stack doesn’t have database administrators, but the reality is that a lot of their developers and SREs are better at SQL Server than the full-time DBAs I know. It’s just that the database knowledge is scattered throughout the team.
To monitor SQL Server (and their other core technologies), they built Opserver. They couldn’t find a monitoring tool that met their specific and ambitious needs, so they wrote their own, and open sourced it. This isn’t a good fit for traditional DBAs, though. Take the installation instructions: “Installation should be a snap, just build this project as-is and deploy it as an IIS website.” However, if you’re in the Ops team demographic that we’re talking about here, those instructions make sense.
And to top it all off, they share what they’re doing in their amazing blogs – Nick Craver (who’s an architecture lead these days) in particular:
- Stack Overflow: The Architecture – 2016 Edition
- Stack Overflow: The Hardware – 2016 Edition
- Stack Overflow: How We Do Deployment – 2016 Edition
- How We Upgrade a Live Data Center
- HTTPS on Stack Overflow (if that doesn’t teach you that engineering is development, I don’t know what will)
“Uh, that sounds hard. And expensive.”
It is. These inquisitive, scientific minds aren’t cheap, and building your own tools isn’t cheap either.
Sure, companies like Stack are building a lot of open source tools to help this process, and that will eventually drive costs down. (It’s kinda like how data center operating systems used to be hella expensive, but over time, cutting edge folks put a lot of work into Linux, and now it’s the de facto OS for a lot of data center builds.)
In 2017, though, there’s a relatively limited open source community around SQL Server tooling. For example, at the start of this Faux PaaS project, we looked to dbatools.io for tools to stand up Azure VM instances and join them into an existing AG. Those tools don’t exist publicly today – but I have confidence that sooner or later, I’ll have a project willing to sponsor that kind of development. Then, once it’s open source, it can be leveraged by more folks. Today, it’s expensive in terms of manpower.
For now, though, that means the companies doing this level of automation tend to be large, global ISVs with at least a half-dozen SQL-Server-savvy folks (if not dozens) on their ops/engineering teams.
And the point of this post (and the entire series) isn’t to get you to build your own Faux PaaS: it’s to get you to understand how hard it is, and understand why the alternative is so awesome.
For the rest of us, there’s Real PaaS.
If you have an existing SQL Server application, and you’re considering building something like the Faux PaaS project to scale to thousands of databases, you should first consider Azure SQL DB or Amazon RDS. The vendors have already built this wheel for you. Yes, you’re probably going to have to modify parts of your application to get it to be PaaS-compatible, but in most cases, that’s way less work than building the engineering team, processes, and tools that we’ll be describing in this series.
If you’re building a new application from scratch in 2017, there’s a very good argument to be made that it shouldn’t be in Microsoft SQL Server at all. Google Cloud Spanner, Amazon Aurora, and Microsoft’s Cosmos DB (formerly DocumentDB) are cloud-scale databases that give you a lot of what developers want in a database, but sacrifice the things you never really cared about (like, in some cases, foreign key integrity).
A good engineering team, when building their own solutions to manage things like a Faux PaaS, is open to whatever technology makes the most sense. In the case of the Faux PaaS, we needed to store data about cluster configuration and health. We wanted that data to be available even when entire clusters (or even data centers) were unavailable. Gathering that data would be done by a brand new service coded from scratch, so it had no ties to any particular back end.
And since we’re in Azure, the first choice for that repository was Azure SQL DB.
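To show what that repository holds, here's a minimal sketch of the kind of health record the collection service might write. The table name and columns are my assumptions, and `sqlite3` stands in for Azure SQL DB so the example runs anywhere – the INSERT shape would be the same against the real back end.

```python
# Minimal sketch of the central cluster-health repository. Table name and
# columns are assumptions; sqlite3 stands in for Azure SQL DB so this runs
# anywhere, but the schema and INSERT shape carry over directly.

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ClusterHealth (
        CollectedAtUtc TEXT NOT NULL,
        ClusterName    TEXT NOT NULL,
        NodeName       TEXT NOT NULL,
        AgName         TEXT NOT NULL,
        SyncState      TEXT NOT NULL   -- e.g. SYNCHRONIZED, NOT_HEALTHY
    )""")

def record_health(cluster: str, node: str, ag: str, sync_state: str) -> None:
    """Append one point-in-time observation. The repository lives outside
    the clusters it describes, so it stays readable even when an entire
    cluster (or data center) is down."""
    conn.execute(
        "INSERT INTO ClusterHealth VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), cluster, node, ag, sync_state),
    )

record_health("clus-east-01", "sqlnode1", "ag-shard-07", "SYNCHRONIZED")
rows = conn.execute("SELECT ClusterName, SyncState FROM ClusterHealth").fetchall()
print(rows)  # [('clus-east-01', 'SYNCHRONIZED')]
```

Because the collection service only needs "append a row" and "query recent rows," almost any durable store could back it – which is exactly why the team was free to pick whatever made the most sense.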