This month, to mark the 20th anniversary of BrentOzar.com, I’m stepping back and looking at the big picture rather than blogging about database problems & solutions like I usually do.
I’m going to generalize a little by decade here to keep things simple.
In 1990-2000, databases were mostly tied to specific development languages. When you developed an application with a Microsoft Access back end, you were very likely using Microsoft Access to store the data, too. If you wrote a web page in Microsoft FrontPage, you were likely using Access or Microsoft SQL Server as the back end.
During 1990-2000, one of our biggest database administration problems was server sprawl. Every time we had a new app, we just spun up a new with a new database. Most of those servers had fairly low load in terms of performance, so as licensing costs went up, we went through consolidation projects to merge multiple databases onto less servers to cut costs. (This continued in the 2001-2010 time frame, especially with the advent of server virtualization.)
In 2001-2010, object-relational-mapping tools (ORMs) tried to abstract the database away. In theory, developers worked with objects, and the ORM layer mapped things to the physical database storage. In theory, some ORMS were cross-platform, saying that you could switch your database back end without rewriting your code. In practice, that didn’t work very well. In the 2000s, our app code language choice still pretty much dictated the most popular database we would use for that particular language. C#? SQL Server. Java? Oracle.
During 2001-2010, Microsoft SQL Server rode the wave of .NET popularity, being the easy choice for C# developers. Microsoft’s strategy to grow SQL Server appeared to be making SQL Server a good fit for other purposes, too. It was hard to grow revenue by convincing PHP developers to switch from MySQL to Microsoft SQL Server – the cost difference just doesn’t make sense.
However, it was much easier to grow revenue by convincing ETL developers to switch from Informatica to SQL Server Integration Services. SSIS was free in the box with SQL Server, so if you were using Informatica to push data into SQL Server…why waste your licensing money on Informatica? Same thing with SQL Server Analysis Services, SQL Server Reporting Services, Filestream/Filetable to use SQL Server as a file server, XML in the database, CLR code to use SQL Server as an application server, and so on.
In 2011-2020, 3 things changed how we pick databases.
In no particular order, all of them being independent:
- Developers got tired of SQL’s rigid requirements and started building open source NoSQL persistence layers (MongoDB and Redis are great examples) to solve specific challenges at a very low cost.
- Cloud providers made it really easy to spin up a new production-quality persistence layer, like one you might not have otherwise had the skill or spare hardware to spin up, and only pay for what you use (in theory.)
- Hosted software-as-a-Service (SaaS) apps meant that often business folks didn’t even have to pick a persistence layer at all. Software vendors that used to have to support their apps on-premises on a variety of databases now just took that decision (and the support) away altogether. Instead of having to pick a database their enterprise customers would be comfortable with, they could use whatever they wanted.
As these 3 things continue to spread – and I use that term in the current tense, because they’re still happening – Microsoft SQL Server is less and less the obvious de-facto choice for C# developers. If you’re building a C# app in the cloud, it might make more sense to use:
- DynamoDB if you just need key/value storage, not a full blown relational database
- Redis if you just need a cache
- Azure SQL DB if you just need a little code in Azure Functions, and you need relations
- PostgreSQL if you want relational at a low cost without being tied into a cloud vendor
- Just a plain ol’ file share if you want to save XML or JSON data for later analysis with a variety of tools
- And so on.
Plus, those 3 things above lowered the barrier to using multiple persistence layers, each for their own specialized purpose. There were so many persistence layers available, at such a seemingly low cost, without having to worry about on-premises support for each of ’em – that developers became more emboldened to use multiple persistence layers even inside a single application.
The architect in me loves this because we get to choose the perfect tool for each task at hand.
The business owner in me is horrified because we’re heading for something completely untenable.
2021-2030: Welcome to Sprawl 2.0.
Data lives everywhere, in redundant copies, none of which speak to each other, and all of which cost us by the byte, by the month.
Performance sucks everywhere, and every query costs us by the byte of data read.
None of these persistence layers (SQL Server included) have good cost management built in for size or performance. Most databases are still in their infancy of telling you which queries are burning the most resources, let alone telling you which queries are costing you the most money.
I got a sneak peek when I was talking to the data warehouse manager at one of my clients. He showed me their architecture diagram for their next-generation data warehouse: Azure Data Factory pulling files from sources, loading them into various databases, dumping flat files into a data lake, queried from Spark and Microsoft Big Data Clusters. He explained which parts were pay-as-you-go services, which parts were VMs, and which parts were in Kubernetes.
Then he said, “Yesterday, my costs quadrupled. The only reason we even found out so quickly is because the accounting team had an alert set up. My team has been working on it for four hours, and they’ve narrowed it down to query workloads (not storage or ADF) – but we’re realizing we don’t even have the skills to answer the question of what queries cost us the most money yesterday, or if they’re new queries, or new users, or changed database structures, or what.”
“We’re just letting people run queries and light money on fire.”
New database servers could solve this, but won’t.
When we solved Sprawl 1.0, we were at least only dealing with a handful of database platforms. We could gather all the SQL Server applications together onto less servers.
With Sprawl 2.0, we have:
- Countless database platforms hosting…
- Countless apps built with different languages and connection libraries
- Living in a mix of different cloud vendors and on-premises
In theory, vendors could:
- Build a single database that masquerades as multiple databases
- Let you connect lots of existing apps down to just one platform
- Build good cost management, making it compelling to executives
- Maybe even build in cost reduction with automated query tuning, making it even more compelling to execs
Azure Cosmos DB is the forerunner here, I think: one database back end, but lots of APIs so that it could act as any number of different NoSQL or SQL servers. In theory – we’re talking 2021-2030 here – Microsoft could gradually enhance it so that you could take applications written for different back ends, and then consolidate them so everything lives in Cosmos DB.
But at the same time, Azure Cosmos DB is yet another new server that we have to manage, calling to mind the classic XKCD cartoon:
Similarly, but way less ambitious, Amazon’s working on Babelfish for Aurora PostgreSQL. The idea is that you’ll put both your Postgres and SQL Server databases in the same place, and connect your applications to it. It’ll service T-SQL queries out of PostgreSQL.
Look, I said it was less ambitious – I didn’t say it wasn’t ambitious.
Babelfish has a hell of an uphill battle because just SQL Server alone has an incredible amount of features, data types, and edge case gotchas. Switching your SQL Server application over to Babelfish on Aurora is going to call to mind the ORM lessons of 2001-2010. Existing applications are complex enough that in most cases, it’s just too hard to forklift their data out of one database, move it over to another, and expect everything to work. With both Cosmos DB and Babelfish, there will be case studies funded by vendors that will gloss over the complexity, but…these migrations will be few and far between.
(Just because someone in the audience will ask – no, Azure Arc doesn’t solve this problem of database sprawl. If anything, it enables sprawl because it encourages self-provisioning of databases & servers & services, both in the cloud and on-premises.)
At the same time that vendors are trying to build one-size-fits-all databases, they’re going to be in a race against sprawl. More new specialized databases will pop up. Existing databases will get new capabilities. The centralized, sprawl-reducing databases won’t be able to keep up.
Sprawl 2.0 is the new normal.
As a database administrator, your job isn’t just to manage the configuration, health, and performance of the database servers you have today.
In 2021-2030, your job is to:
- Watch the options for persistence layers out there, reading developer news sites to see what new options are popping up, where they’re succeeding, and where they’re falling short
- Help your team evaluate new persistence layers by discussing their advantages and disadvantages compared to the persistence layers you’re already using
- Encourage the use of existing persistence layers rather than picking new ones – because every new one that comes in, you’re gonna have to learn to manage it, and you’re not exactly sitting around bored, and the company’s not hiring help for you, either, because the sales pitch on all these new persistence layers is that they’re all self-managing – and you know how that goes
You don’t have to develop a level of expertise on all of the options out there – it’s just too hard, and the new players change too quickly. For example, it’s a waste of your time to try to master a new platform like Cosmos DB if your company isn’t actually using it yet. However, you wanna learn about where it’s a good fit and what its weaknesses are – so that when your developers start talking about using it, you can understand its strengths and weaknesses relative to the ones you’ve already got.
As a blogger, I hate telling you to do something without giving you references to actually do it. In this example, I should be a good blogger and say, “Now go here to learn about Cosmos DB’s strengths and weaknesses.” The problem is that I just don’t have a good list of resources for every persistence layer out there, and I don’t have the time to go build it, either. I can’t even recommend vendor documentation sites because they’re so biased in favor of the product.
I’d leave you with one piece of advice: when you want to learn more about a database, Google for:
- (database name) release notes – and then read about the known weaknesses or bugs.
- (database name) limitations – and look for features that are important to you.
- (database name) site:news.ycombinator.com – because Hacker News surfaces a lot of interesting pro and con stories. You have to take these with a grain of salt – but the same thing is true for vendor documentation, too.
- If it’s available, read the database’s Jepsen analysis to learn about edge case failures
- And finally, if it’s open source, find its Github repo, and read the list of issues. Sort them by most commented, and see what problems the community has run into. Pay particular attention to how long those issues have been open, and what kind of progress the maintainers are making.
Because it’s really easy to start building a database or an online service.
It’s much harder to do it well.
And if you thought it was hard to manage just one, wait til you see how hard it is to manage a whole bunch of half-ass persistence layers.