Background: I’m working with kCura to build a Faux PaaS: something akin to Microsoft’s Azure SQL DB, but internally managed. In Part 1 of this series, I talked about why we’re building it, and today, I’ll talk about where.
The database team rarely gets to pick the cloud vendor.
The choice between Amazon, Google, and Microsoft is a very big decision involving:
- What services you need – because not every cloud vendor offers every service, and some vendors offer killer advantages in specific services
- Pricing, discounts, licensing, and contracting – classic example being Microsoft’s willingness to wheel and deal to get enterprise customers up into Azure
- Datacenter availability – because sometimes for legal reasons, you need a very specific service in a specific country or location
For this project, for reasons outside of the DBA team’s scope, the SQL Servers had to live in Microsoft Azure. However, I’m going to try to write this series for as many readers as possible (even on-premises and hybrid folks).
So I need you, dear reader, to understand that in this series:
- The general strategies will be relevant to all cloud providers, and even on-premises private cloud hosting, but
- The specific techniques are for Microsoft Azure VMs only, and
- The exact implementation details are specific to one client’s business requirements, at one frozen moment in time.
Design starts with your RPO and RTO.
Recovery Point Objective (RPO) is how much data the business is willing to lose.
Recovery Time Objective (RTO) is how long the business is willing to be down.
These two numbers, multiplied across a few different scenarios, are the very first thing you’ve gotta consider when doing a new SQL Server implementation. You’ve gotta make sure you can restore databases on a timeline that works for the business. As a refresher, you’ll want to read these posts before going further:
- The 9 Letters That Get DBAs Fired
- Why RPO and RTO are Actually Performance Metrics Too
- SQL Server Timeouts During Backups and CHECKDB
But ironically, the business can’t give you RPO/RTO numbers set in stone. During the course of a project, as costs, requirements, and capabilities change, you should expect the RPO/RTO numbers to change repeatedly. (And that’s fine.)
However, two RTO numbers are really important, and the business needs to give me ballpark ranges as early as possible. We take the HA/DR Planning Worksheet from our free First Responder Kit, and we discuss what happens when we have database corruption or oops queries:
In the event of database corruption, if the business wants near-zero downtime, we’re going to need automatic page repair for starters, which means we need database mirroring or Always On Availability Groups. However, there are some kinds of corruption that even automatic page repair can’t fix, and you need to plan your strategy for those ahead of time.
The much harder one is “oops” queries because SQL Server can’t protect you from those.
How bad is a really bad “Oops” query?
Folks always think of a DELETE without a WHERE clause, but that’s by no means the worst case. That’s fairly easy to recover from. Instead, let me tell you a more diabolical story. The times, tables, and names have been changed to protect the innocent.
4:30 PM: Dan the DBA is doing a deployment. He’d tested these scripts before going live, but something seems to be going wrong in production. He needs to run a script on each of the 200+ databases on the server, but some of them are failing, and he’s not sure why yet.
4:45 PM: He realizes some of the tables have data that isn’t supposed to be there – test data that got migrated during a prior deployment. Some of his scripts are violating unique key constraints. He decides to get the test data out of all of the databases.
5:00 PM: He’s still struggling, and the outage window is ending now. He gives managers the green light to let people into the database because he believes he only has to delete old testing data.
5:05 PM: Customers start putting data into the databases again. Meanwhile, our intrepid DBA is automating the data cleanup work by running it with sp_msforeachdb, a stored proc he thinks he’s pretty good with.
5:15 PM: sp_msforeachdb finishes, and Dan notices a bunch of errors. He starts troubleshooting. Customers are still putting live data into the databases.
5:30 PM: Dan realizes his error – his command wasn’t quite right, and he’s effectively destroyed data in all of the live databases, and even worse, he doesn’t know exactly which rows were involved. (He’ll be able to figure it out by going through logs, but it’s going to take hours.)
5:35 PM: Dan alerts management, who then has to decide:
- Should we get everybody out of the app?
- Should we restore the databases?
- If so, what point in time do we restore to?
- When does the RTO clock start? (Some folks will say it’s already started at 5:15 PM, when Dan’s query trashed live customer data, because customers may already be noticing invalid data.)
That’s why restore speed is really important.
If the business’s service level agreements (SLAs) involve service refunds, attorneys, contracts, etc., then these details are important. I can’t help with the business decisions, but I can build a SQL Server infrastructure to match their goals. They’ve just gotta tell me what the goals are.
If the RTO for that scenario is, say, X hours, then get a load of this:
I can’t put more data on a server than I can restore in X hours.
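That rule is just arithmetic, and it’s worth sketching out. Here’s a minimal Python version of the capacity math, using a hypothetical restore throughput number (yours will come from your own restore tests):

```python
# Rough capacity math: the most data we can host on a server is capped by
# what we can restore inside the RTO window. The throughput number below is
# hypothetical - measure your own with real restore tests.
def max_restorable_tb(restore_mb_per_sec: float, rto_hours: float) -> float:
    """Return the most data (in TB) restorable within the RTO window."""
    seconds = rto_hours * 3600
    total_mb = restore_mb_per_sec * seconds
    return total_mb / 1_000_000  # MB -> TB (decimal units, as vendors bill)

# Example: a hypothetical 500 MB/sec restore speed with a 2-hour RTO
print(round(max_restorable_tb(500, 2), 1))  # 3.6 (TB)
```

Notice how brutally the RTO drives capacity: cut the RTO in half, and the amount of data you can responsibly host on that server is cut in half too.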
Relational databases have features besides restores to protect you from corruption, failed service packs, or broken VMs, but if Dan drops a table, we’re screwed. I’m not saying we always have to protect from Dan – and in most projects, the business actually chooses not to. Businesses say things like, “Well, we’ll make sure to be real good and careful around the database server. We’ll be sober when we touch that one.”
In this project, we needed better protection than that.
That means something special for cloud projects. In the cloud:
- Storage throughput is expensive
- Network throughput is expensive
- Often, backup targets are slow
Believe it or not, backup and restore speed is often the very first performance bottleneck we have to overcome. That’s why the techniques in Erik’s white paper, SQL Server Performance Tuning in Google Compute Engine, come in so handy for all of the new-build projects that we do.
In this particular project, the requirements included:
- Microsoft Azure as the IaaS vendor
- A specific list of data centers (which determines the types of instances available for us – Azure doesn’t offer the same instance types in every data center)
- A set of RPO/RTO requirements
And the question was, “How can we meet our RPO/RTO goals and query performance goals in Azure VMs while keeping costs under control?”
Start by testing the cloud vendor’s restore speeds.
We set about the first set of tests to determine:
- What’s the fastest place to write backups?
- What’s the fastest place to host data & log files, in terms of the fastest restore speeds?
- How much data can we restore onto a given instance type within our RTO window?
- Given that capacity, what’s the effective cost per terabyte, per month, of that instance type?
We built this testing process as a repeatable methodology that the DBA/DevOps teams could iterate over. In the cloud, all this stuff changes over time:
- RPO and RTO – after all, the business may suddenly say, “What would it cost to achieve a 30-minute RTO?”
- Data center acceptability for the business (like we suddenly have to add servers in Korea)
- VM instance type availability in a given data center
- VM and storage speeds – you may fall victim to noisy neighbor syndrome, even within your own cloud account due to shared resource utilization, so we’re working on sp_BlitzBackups to monitor RPO/RTO
Yes, you can performance tune restores.
There’s a catch: you have to do it ahead of time, not after the fact. Once the disaster has started, it’s too late to start performance tuning.
Write your backups to a location that helps restores go faster. Traditional DBAs will hear that and think, “Ah, he’s telling me to write to a fast file server.” Think bigger: does the cloud vendor have a file-server-as-a-service with faster performance than an old-school VM file server? Do they have something that can handle automatic file replication across multiple data centers, without you having to manage anything? In Azure’s case, this is Azure Blob Storage.
Tune your backup files for restore speed. The most common backup tuning tip is that striping your backups across multiple files can make backups faster, but it holds true for restores, too. Interestingly, during my testing for this project, we saw different sweet spots for restore speed than backup speed. Striping backups across 20 files meant slower backups than 8 files, but at restore time, the 20-file backups restored faster than the 8-file ones. (The exact file numbers can vary by backup target, number of cores, number of drives, etc.) We also saw different throughput for different numbers of Azure Blob Storage accounts: for example, backing up to 8 files on the same storage account sucked, but backing up to 8 files on 8 different storage accounts was much faster. Like so many infrastructure decisions, it comes down to your willingness to trade implementation complexity for performance. Easy is rarely fast. Along those same lines, here are a ton of great resources on backup performance tuning. (Just know that you have to test the restore speed on those techniques, too.)
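When you’re striping across lots of files and lots of storage accounts for 200+ databases, you’ll want to generate those BACKUP statements rather than hand-write them. Here’s a Python sketch that builds a striped BACKUP ... TO URL statement; the account and container names are hypothetical placeholders, and you’d still need credentials set up for each storage account:

```python
# Sketch: generate a striped BACKUP ... TO URL statement spreading N files
# across multiple Azure Blob Storage accounts. Account/container names here
# are hypothetical - substitute your own, and test restore speed too.
def striped_backup_sql(database: str, accounts: list, files_per_account: int) -> str:
    urls = []
    for account in accounts:
        for i in range(files_per_account):
            urls.append(
                f"URL = 'https://{account}.blob.core.windows.net/"
                f"backups/{database}_{account}_{i + 1}.bak'"
            )
    to_clause = ",\n  ".join(urls)
    return (f"BACKUP DATABASE [{database}]\nTO {to_clause}\n"
            "WITH COMPRESSION, CHECKSUM, FORMAT;")

# 8 files on 8 different accounts beat 8 files on one account in our tests:
print(striped_backup_sql("Sales", [f"bkpacct{n}" for n in range(1, 9)], 1))
```

The point isn’t this particular helper – it’s that once the stripe layout is parameterized, re-testing a different file count or account count is a one-line change instead of an afternoon of copy/paste.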
In an emergency, spin up a replacement instance with phenomenally fast storage. If you’re under the gun to restore fast, you can leverage the flexibility of a cloud by throwing down your credit card and getting the fastest instance money can buy. It’s easier said than done, though: long before disaster strikes, you need to do load testing on the various VM types to know which one is the fastest for your needs. Do you need high network throughput to copy large backups from a UNC path? Or do you need all solid state storage underneath? Is your VM type fast enough to leverage that much storage? Would your app support splitting the databases across multiple servers to get them back online faster, but with different connection strings? The cloud offers a lot of design options, but you have to think them through ahead of time.
Implement tiering: different availability levels. Put the most critical databases on fast storage and make sure they get restored first. Then, after the most critical stuff is online, restore less-critical databases onto slower storage. For more about this technique, read my post Tiering kCura Relativity Databases (Or Any SaaS Product).
Try file snapshot backups. If you host your databases on Azure Blob Storage, you can leverage file snapshots to take nearly-instant backups. Pay close attention to the limitations on that page, though, in particular the frequency of backups, where the backup files are located, and the performance impacts. Plus, check out the limitations on storing data files in Azure blobs, like the storage-account-level security, and the inability to add files to a database in an AG. For a bunch of those reasons, this feature wasn’t a good fit for this particular project, but I really like where it’s going in the future. Just keep an eye on the performance limits and pricing.
As you test, measure and document your results.
For this project, we built a shared spreadsheet with:
- Instance type (L16S, H16mr, etc)
- Spec (cores/CPU speeds/memory)
- Network speed (because this definitely impacts restores)
- Region (because not all instance types are available in all data centers, but we wanted to test in case we saw a dramatic difference and needed to talk to our MS rep about getting a specific instance type added to a region)
- Instance cost per month, without SQL licensing
- Location (ephemeral or persistent)
- # of disks, total capacity, cost per month
- Caching settings (Azure VMs use caching by default, using local ephemeral storage, and on some instance types, you can set the caching to off, read-only, or read-write)
- Synthetic tests (like CrystalDiskMark, a quick sanity check on whether the VM was OK)
- Backup tests, MB/sec – to different destinations (nul, local ephemeral, persistent, blob storage, UNC paths) and with different file quantities (1, 4, 8, etc.)
- Restore tests, MB/sec – with same permutations as above
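Those backup and restore permutations multiply fast, so it helps to enumerate the test matrix up front and log every run the same way. Here’s a small Python sketch; the destination names just mirror the spreadsheet columns and would be adjusted to your environment:

```python
# Sketch: enumerate the backup/restore test matrix so every timed run gets
# recorded consistently. Destination names mirror the spreadsheet columns;
# adjust them to your own environment.
from itertools import product

destinations = ["nul", "local_ephemeral", "persistent", "blob_storage", "unc_path"]
file_counts = [1, 4, 8, 20]
operations = ["backup", "restore"]

test_matrix = [
    {"operation": op, "destination": dest, "files": files}
    for op, dest, files in product(operations, destinations, file_counts)
]

# 2 operations x 5 destinations x 4 file counts = 40 timed runs per instance
print(len(test_matrix))  # 40
```

Forty timed runs per instance type is why this has to be a repeatable, scripted methodology rather than a one-off: every new instance type, region, or RTO conversation means running the matrix again.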
Let’s see some raw numbers.
Here’s an abridged version of the test results. (We actually ran more tests than shown here, but this is a good starting point to show the methodology.)
Make sure to read the notes at the bottom. The Azure costs are rack rates, but the licensing number is just a rough ballpark – it’s not this company’s licensing rates, nor is it probably yours, but you need to fill in yours from your licensing agreement. (To edit the sheet, you’ll need to make your own copy of the file. You’re welcome to do that, by the way.)
Note 7 points you here to this blog post because some of these numbers need more explanations:
Column AF, Max data we can restore in RTO from Cost Drivers, in TB: if we have, say, 2 hours to do a restore, then how much data can we restore on this server in those 2 hours? This is calculated using the restore speeds from column AE, network restores of a 20-file backup. That’s probably not going to be the final backup/restore config – we still have performance tuning to do on backups & restores. We wanted to get the machine type first, then work on tuning for that particular machine type.
Note that as you adjust RTO – say, down to 30 minutes (0.5 hours) or up to 24 hours – this changes the hosting sweet spot. At a 30-minute RTO, the H-series makes more sense, and at a 4-8 hour RTO, the DS-series makes more sense.
Column AG, Lesser of max restore, vs storage capacity: you might have blazing-fast restore speeds, but a really tiny drive. (Hello, H-series.) In that case, we can’t just use column AF as your total hosting capacity on this instance type because the drive space is limited. On the flip side, you might have a huge drive, but dead-slow restore speeds. (We ran into that right away with magnetic storage, for example.)
Column AJ, Cost per TB per month: think of this as restorable cost. The instance might have 10TB of storage, but if you can only restore 1TB of data inside your RTO window, then the cost is calculated with the 1TB number.
Note that in this spreadsheet, I’m only calculating costs for each type of storage individually, not the sum total of possible storage on a given server. For example, in the LS series, you could use both the ephemeral AND premium managed storage, and possibly get higher throughput during restores (assuming you weren’t network-bottlenecked), and get better bang for the buck.
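The math behind those three columns is simple enough to sketch in a few lines. All of the inputs below are hypothetical; you’d plug in your own measured restore speeds and your own pricing:

```python
# Sketch of the spreadsheet math for columns AF, AG, and AJ. All inputs are
# hypothetical - fill in your own measured restore speeds and pricing.
def restorable_cost_per_tb(restore_mb_per_sec: float, rto_hours: float,
                           storage_capacity_tb: float,
                           monthly_cost: float) -> float:
    # Column AF: max data restorable inside the RTO window, in TB
    max_restore_tb = restore_mb_per_sec * rto_hours * 3600 / 1_000_000
    # Column AG: capped by the instance's actual drive capacity
    hostable_tb = min(max_restore_tb, storage_capacity_tb)
    # Column AJ: monthly cost divided by what you can actually restore
    return monthly_cost / hostable_tb

# Fast restores but a small drive (H-series-style): capacity is the cap
print(round(restorable_cost_per_tb(1000, 2, 2.0, 3000), 2))  # 1500.0
# Big drive but slow restores (magnetic-style): restore speed is the cap
print(round(restorable_cost_per_tb(100, 2, 10.0, 2000), 2))
```

Either constraint can be the binding one, which is exactly why you can’t shop on instance spec sheets alone: the “cheap” 10TB instance in the second example is effectively a 0.72TB instance at a 2-hour RTO.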
This kind of raw data helps you answer questions.
We reviewed the full spreadsheet (more tests, more instance types, more backup targets, etc) to figure out stuff like:
- Should we host the databases in ephemeral or persistent storage?
- Given that choice, what’s the right instance type with the best bang for the buck?
- If we change the RTO goal, how does that change the cost-per-terabyte equation?
- Should we back up to local (ephemeral) drives to finish backups faster?
- Should we restore directly from a network share, or copy/pre-stage backups locally first?
“But where are the easy answers?”
I know what you’re thinking, dear reader:
“DAMMIT, BRENT, WHAT ARE THE ANSWERS?
CAN’T YOU GIVE ME A TL;DR FOR THIS?”
No, because remember: the exact implementation details are specific to one client’s business requirements, at one moment in time. I’m teaching you how to fish rather than giving you the fish, especially because the fish will quite possibly be spoiled and smell terrible by the time you get around to reading this post. (Seriously, we even made assumptions based on things Microsoft says they’ll bring to market at various times during the project’s deployment.)
The goal here wasn’t to test Amazon, Google, and Microsoft and then determine a winner.
The business dictates things like the cloud vendor, and then your job is to test the different hosting options inside that vendor to choose the right instance, storage, and backup design for one project’s needs. And now, over 2500 words later, hopefully you’ve got some insight as to how you’d do it for your own business requirements!