This month, to mark the 20th anniversary of BrentOzar.com, I’m stepping back and looking at the big picture rather than blogging about database problems & solutions like I usually do.
When I started working with databases, we had to:
- Provision the right CPUs and memory
- Provision the right storage throughput and size
- Install the database
- Configure the database
- Protect it – which sounds easy, but that also includes designing the right high availability and disaster recovery solution
- Design the right tables, and put indexes on them based on how we wanted to query the data
- Write fast, accurate queries to load & retrieve data
- Troubleshoot the whole thing when it went wrong
Today, decades later…yeah.
We’re still doing all of that stuff. It’s just that we’re cave-people using rocks for tools instead of banging on things by hand.
Vendors aren’t focused on these problems.
Oh sure, they’re marketing that they’ve built something better, but when you start to rub your finger on the product, the wet paint wipes off and you see the conventional problems underneath.
Let’s take something really simple: let’s build a new Azure SQL DB:
You get sliders for vCores and data max size.
But notice the number with the biggest font size on the screen: 150. Your choices for vCores and data size determine log space allocated, and it’s important enough that Microsoft puts it in bold, but you can’t actually control it directly. If you want more log space, you have to increase data space.
So if it’s important enough to put in bold, we must know the right number for it, right?
Nope. In Azure SQL DB, space consumption is something that’s controlled only by the database server, not by administrators. You can’t fix it by backing up more frequently: Microsoft controls your backups. Microsoft is hinting that, “Yo buddy, you better know how large your transactions are and how far behind your AG secondaries get, and if you get this wrong, inserts/updates/deletes will fail.” They’re supposed to have fixed this with Accelerated Database Recovery, which the documentation says is on in Azure SQL DB and can’t even be turned off – but it’s still an issue, or it wouldn’t be the biggest number font on the screen. Or even worse, if it is fixed, why is it being featured so prominently, suggesting to users that they need to build a bigger server to handle a log file size that doesn’t matter anymore?
It’s 2021, and we’re back to 2001.
We’re still screwing around with micromanaging transaction log sizes, trying not to run out of disk space. On premises, we learned the easy fix long ago: just set log file sizes reasonably, and then leave autogrow on. If one database out of fifty happens to have a runaway transaction, they can grow temporarily. We’ll deal with it.
Scaling CPU? Self-re-configuring? Nope.
Forget about storage – what about sudden demands for CPU or memory? Serverless, self-scaling databases like AWS Aurora Serverless and Azure SQL DB Serverless say they solved it with auto-scaling. That’s not solving the problem – it’s just unlimited provisioning. You wake up to a $100,000 cloud bill when some bot scrapes your site or when an app ships with an N+1 bug.
I know first-hand because our own SQL ConstantCare® product is built atop Aurora. Last year, I blogged about how a single database view cost me $50/day for months, and you know what? I’m in that same exact boat again today. Our database costs gradually creeped up because we deployed more and more queries for new features, and some of the new queries weren’t well-tuned. I had an item in my backlog for weeks to go do more database tuning, and our Aurora costs crept up to over $200/day. I had to go bust open AWS Performance Insights, find the queries involved, tune them, check them into Github, have Richie fix the unit tests, and then deploy it.
Self-configuring databases? Not even close: Microsoft SQL Server 2019 and its cloud equivalents, Azure SQL DB, still ship with the same Cost Threshold for Parallelism setting that every freakin’ setup checklist still tells you to fix manually. Every release, it seems like we get thrown a bone for one or two configuration settings just so the marketing team can say the database is becoming more self-configuring. That’s great, but at this rate, it’s going to be 2070 before setup checklists can go away – if ever, because vendors keep adding features and dials that need extensive tweaking.
Self-patching? Well, kinda, but only in the worst way. In Azure, you don’t get a truthful changelog, and you don’t get to test before they do it to your production server, or when. Just this month, in March of 2021, Azure SQL DB finally announced that you can pick what window they’ll use for maintenance to take your server down, and get alerted before they’re gonna do it. You can choose between:
- Every day, 5PM-8AM local time, or
- Mon-Thurs 10PM-6AM
- Fri-Sun 10PM-6AM
That’s it. Those are your only choices. And that feature is only in preview, and only in select regions. If vendors really cared about database administrators, this stuff never would have gone into production like this.
Somebody in the audience is going to say, “Does Azure Arc help?” Well, it purports to, in the sense that it puts both your on-premises SQL Servers and Azure instances into the same Azure control panel, gives you self-provisioning, and has policy-based management. Given Microsoft’s track record with Policy-Based Management and the Utility Control Point, both of which were grandly-announced features that quietly died off without getting fixes or improvements, I’m gonna wait to see if this is another one-and-done turkey.
Vendors are focused on selling,
and database administrators don’t buy databases.
As important as you think you are, you’ve likely never written a check for database licensing or cloud services.
The reason you have a job is that your employer implemented a database, it grew over time, and it grew beyond the self-managing capabilities touted by the brochure. You were hired long after the database choice was made. And no, you’re not really able to influence future purchasing decisions because that ship has sailed: your company picked a persistence layer, and when it comes time to pick additional persistence layers for new products, you’re only going to be one of many people sitting at the table.
Databases are bought by two roles: developers and executives.
When developers make database decisions, in most cases, they’re not experienced enough with multiple database platforms. They’re either familiar with one, and that’s the one they pick, or they’re not familiar with any, and they’re forced to make a selection based on the vendors’ marketing material, what their friends say, what they read on HN, or some meaningless database rankings site.
When executives make database decisions, they either work from feature lists like EnterpriseReady.io, looking for things like auditing, role-based access, Active Directory integration, or else…they work off the vendor’s brochure, and self-managing is just another feature in a feature list.
Developers and executives buy features.
I don’t mean this in a negative way! It’s the same way we buy appliances or cars or laptops. We have a rough idea of the features we need. Products rarely have all of the features we want – at least at the price point we’re willing to pay – and so we have to make compromises. We might accept a half-baked version of one feature because we want another feature so badly.
In 2021, every database’s brochure has “self-managing” slapped on the brochure somewhere, and every vendor will show just enough demos to hoodwink non-DBAs. “Self-managing” doesn’t have to be fully true in order to sell – just like “secure” or “performant” or “scalable.” Everybody uses these keywords. They don’t mean anything.
Even worse, self-managing is just one of the features people are looking for. Every time a database vendor slaps on another feature, that feature has dials and switches to configure it – and when the feature is brand new, those dials and switches have to be configured by hand. There’s never a best practices doc – the feature has to ship first before the real world learns the best practices, and we all write documentation to tell each other how to set the dials and switches.
Want proof? Check out the home page of Couchbase.com. Look at the kinds of features that they tout. They emphasize the power of the product, empowering you to do all kinds of things – but you still have to actually do those things, like size it, monitor & tune the queries, configure the security, and configure all the settings. It simply isn’t self-managing – and it doesn’t matter, because people love it. They’re going public with a $3 billion valuation. The market has spoken, and features are where the money is.
It’s a race between new features
and self-managing those features.
Sure, vendors can say that a brand-new feature is self-managing – but again, rub your finger on it, read the release notes, run an edge case performance test, and you’ll pretty quickly find the undocumented dials and switches that need to be turned just so in order to make this pig fly.
New features will always win the race.
They have to. Vendors are in their own race against other vendors, racing to ship features that nobody else has, or to keep up with features that someone else has already shipped.
This influenced my own career:
I gave up on production DBA work.
Ten years ago, when Microsoft first brought out Always On Availability Groups, I was ecstatic. I was doing a lot of production database administration work at the time – making sure SQL Servers were highly available and well-protected – and I saw Availability Groups as a better way to achieve that goal.
But then the features versus self-managing race started.
The good news was that Microsoft poured resources into AGs, adding all kinds of features to them. More replicas. Cross-database transaction support. Availability Groups that spanned multiple Windows clusters.
The bad news was that not only were these features not self-managing, they were practically unmanageable: the documentation was terrible, the GUI was non-existent, the damn thing broke constantly in crazy unexpected ways. (For fun details, read the past blog posts on TarynPivots.com, Stack Overflow’s DBA.)
By the time Microsoft said they were bringing Availability Groups to Linux, I said okay, I give, no more, time out. I had come to see SQL Server high availability as a Sisyphean task: I could never really be done with that miserable work, and even when I thought I was done, some new surprise knocked me flat out. Even worse, those surprises always came after hours, on weekends, on holidays, when all I really wanted to do was relax.
If you do want to learn Availability Groups, Edwin Sarmiento is about to open a new rotation of his class. He only does a few of these per year, and it’s a hybrid of consulting, coaching, and training. Learn more here.
I focused on development DBA work instead.
Here’s a quick rundown of the difference between developer, development DBA, and production DBA:
I chose to move from Production DBA to Development DBA. These days, my consulting work and my training classes focus on solving performance emergencies. I can at least schedule those engagements in advance because companies know when their performance is slowly getting worse, and we can work together during weekdays, business hours, in order to turn things around. (As opposed to production outages, which always hit at the worst possible times when I’ve just made a fresh gin & tonic.)
Production DBA work is still there, mind you, and every month I get emails from folks asking me to record new training classes on how to do it. I’m just flat out not interested in that work, and I’m not interested in doing training on it anymore.
So with that in mind, I put our Fundamentals of Database Administration class and Senior DBA Class on YouTube, for free, forever. They’re older classes, and they don’t have the best audio & video quality, but I’d rather just let them out there to the public to help the folks that need ’em. Hope that helps – and I also hope that database vendors start taking the self-managing thing more seriously, for the sake of production DBAs that still have to deal with this mess.