Databases Five Years from Today
Five years from now, in March 2018, what will be different about databases? I’m a Microsoft SQL Server guy, so keep that in mind when reading my thoughts.
Looking Back at 2008-2013
The big story here for me was virtualization. In mid-2008, Microsoft added support for SQL Server running under Hyper-V, and later they added support for VMware as well. Today, almost all of my clients have at least one virtual SQL Server. They typically start with development/QA/test instances, then disaster recovery, then production.
For the remaining physical servers, cheaper memory and solid state storage meant you could get insane performance out of commodity hardware. Microsoft got caught without a good scale-out story, but cheap scale-up prices meant it didn’t matter. I don’t know whether this was luck or strategy, but good on ‘em either way. In 2008, I’d never would have believed that 2013 Brent would have a dozen clients with 2TB of memory per server. Over the last couple of years, I’ve had a few companies think they needed to switch database platforms in order to get the performance they need, but the real answer has always been quick (and relatively cheap) changes to code, configuration, and hardware.
Management tools didn’t really change at all, and that’s a story in and of itself. Microsoft made several attempts to change how DBAs and developers interact with databases – Visual Studio Data Dude, DACPACs, Utility Control Points. When SQL 2008R2 came out, I wrote about why these features wouldn’t really have an impact, and today, I’m not surprised that they’re just not getting any serious adoption. Microsoft threw in the towel on Data Dude and replaced it with SQL Server Data Tools, but didn’t include all of the functionality. I don’t see a lot of developer confidence in Microsoft’s short attention span here, so tooling hasn’t been a big story. (We did get an all-new SQL Server Management Studio under the hood, but Microsoft went to great pains to ensure it looked/worked basically the same as the old one, so…yeah.)
Business Intelligence (BI) got a lot of headlines, but here we’ve got another hyper-distraction story. Microsoft threw so many different tools against the wall that the naming even became a joke – does PowerPivotPointPro use the XVelocitySuperMart in v3.14? I don’t envy the BI pros who have to keep up with this jumbled mess of licenses, features, and names, but I do think Microsoft is heading in the right direction. The combination of Excel, SharePoint, columnar storage, and hella fast laptops means Microsoft is in a good spot to give insight to managers. It just wasn’t a huge revolution in 2008-2013 because the stories and products kept changing.
Looking Forward at 2013-2018
When our servers were physical, they had a built-in expiration date. The hardware support would grind to an end, and we’d be under pressure to migrate them onto more reliable hardware. We often included a SQL Server version upgrade in that same project.
Those days are over. The combination of virtualization and SQL 2005/2008 will leave an interesting legacy challenge for DBAs. Once your SQL Server is virtualized, it’s really easy to get it off old hardware – just VMotion or LiveMigrate it to another host. You can do it even while it’s still powered on. Does that old version need some more horsepower? Shut it down, add a couple of virtual CPUs and more memory, and power it back on. What used to be a big ugly maintenance project is now a matter of just a reboot.
This means you’ll be supporting SQL Server 2005 and 2008 forever.
SQL Server 2000 has thankfully (mostly) already been exterminated from serious production work. Its lack of management tools and mainstream support means it’s painful to troubleshoot, so most of us have already migrated production work to 2005 and 2008. Support for those newer versions doesn’t end for years, so settle in and get comfy. Sure, SQL Server 2005 and 2008 have bugs, and they’re missing cool features like backup compression in Standard Edition, but for the most part, they just work. Businesses will stick with ‘em for most applications because they don’t see enough compelling features in 2012.
In SQL Server 2012 and beyond, we’ve got:
- AlwaysOn Availability Groups – high availability, disaster recovery, and scale-out reads
- Hekaton - in-memory storage with optimized stored procedures and new data formats on disk
- Column store indexes – faster data retrieval for certain kinds of queries
Call me maybe crazy, but I don’t see really widespread adoption for any of these. To do them right, we’ve gotta make changes to application code. The changes won’t pay off for the majority of customers, so it’s risk without much reward. Don’t get me wrong – when you need this kind of speed, then you need it, and the features are fantastic. I do see widespread adoption coming in 2013-2018 for AlwaysOn, but only for high availability and disaster recovery, not the scale-out reads part.
The virtualization/SQL2005-is-good-enough combination also means we’re not going to see massive, widespread migrations from on-premise SQL Servers to cloud services like SQL Azure. (We’re also not going to see people call it by its official product name, Microsoft Windows Azure SQL Database.) Your app would require code changes to make that switch, and code changes are risky.
New development, on the other hand, means you can pick the features and compatibility you want. In those environments…I still don’t see a lot of widespread SQL Azure adoption coming. If I’m a developer building a new app from the ground up, I’m going to pick the cheapest, easiest database possible. These days, that’s probably PostgreSQL. Like SQL Server, it’s a platform that is available in a lot of different delivery mechanisms – on-premise on bare metal, on-premise virtualized, colo boxes, cloud-based services, etc. It’s open source, and it has all kinds of cool features we don’t get in SQL Server. I’m not changing horses mid-stream, but if I was starting again as a developer, I’d go that route.
What Do You See Changing?
When you look at your company’s needs, the job you do, and the tools you use, what do you see coming down the pike? In 2018, what will be the defining thing that changed your work?
The Trouble with Keys
Scaling up is hard: big hardware gets expensive fast. Scaling out is equally difficult; interesting design problems creep in to scale out solutions. One of the more troublesome issues architects face when scaling out is the issue of distributing identity. It’s often advantageous for object identity to be unique and portable across all database instances in an application – we may need to move a user from server A to server B. Once multiple database servers are involved, a centralized sequence generator can be come a single bottleneck or may even run out of values under heavy load. A solution is needed to generate unique values outside of the database while avoiding the performance problems of random, or semi-random, GUIDs.
Sequential Beginnings: Identity and IDENTITY
In the beginning, there was an empty schema. At this point an architect returned from the coffee pot and made a decision about database identities or GUIDs. There many valid reasons to make a decision in either direction – identities or sequences are controlled in the database and can be used to ensure a physical order to data. Who knew that spinning hard disks work better when data is sequential? (By the way, sequential GUIDs may not be sequential.)
You can make a lot of arguments for the right or wrong way to do things from a logical perspective, but DBAs do have a pretty good point when they say that randomness can cause problems for database performance. Generating sequential identifiers in the database may not be the most elegant solution to identity problems, but it does ensure that data is written in an order that makes sense in a world of spinning disk drives.
Database controlled sequential identity has one problem: sequential identities will need to be generated on a single server. Under sufficient load, that server will become a bottleneck for application scalability. To move past a single server as a bottleneck, a more robust and load tolerant solution is needed.
Distributing Identity: The Case for GUIDs
Architects and developers may be thinking that identities are great, but what happens when everything gets further apart?
As applications grow (or even by design), it becomes common to see teams become worried about distributing workload across many servers, using queues, or even splitting the data out into multiple databases or database servers. It’s at this point in the discussion that things get heated and people start throwing around the idea that GUIDs are the only way to solve this problem. Or that you just can’t rely on identity from the database and application generated identity is the only identity that matters.
This is where the war about numbers vs GUIDs gets nasty and someone gets their feelings hurt. Ignoring the size of GUIDs, I can say that I’ve witnessed several GUID collisions in production systems. GUIDs are only theoretically unique – they may even create a problem that you didn’t know you had.
A Better Solution for Distributed Identity
Combining distributed identity and ordered data seems like it’s a hard problem. Random GUIDs can’t be guaranteed to be unique, sequential GUIDs can’t be guaranteed to be non-overlapping, and database generated identities require a persistent connection to a database (or else they require a looser idea of identity than some folks are comfortable with).
Moving away from the database as the central keeper of all knowledge and identity is difficult and many teams seem to equate moving identity out of the database with moving all logic and functionality out of the database. This doesn’t have to be the case. The database can still be used to provide a considerable amount of declarative functionality and logic but identity generation can be moved outside of the database.
Twitter and Boundary have solved the problems of distributed sequencing by moving the work away from the data tier. Both solutions solve the problem by treating a number as if it were an array of information. The first portion of a sufficiently large number is a timestamp; the timestamp is stored as the number of milliseconds since a previous point in time. The next number is a worker identifier – this can be anything that uniquely identifies the device generating the sequence. Finally there’s a sequence itself. The sequence is typically small (between 8 and 16 bits) and it starts counting again from 0 every time the millisecond counter changes.
The machine identifier, usually the MAC address, doesn’t matter specifically, but we do need to be able to reliably generate separate sequences of IDs. This doesn’t have to be a MAC address, it could be any number of bytes that identify a unique source of sequences. By storing milliseconds since epoch as the first portion of the key, we’re able to produce a sequence that’s mostly ordered with some random jitter at the intermediate levels. On the whole, though, our inserts will be ordered.
If you’re on the .NET side of the house, I solved this problem with a library called Rustflakes. The implementation is lifted wholesale from Boundary’s flake and the generated sequence values are a .NET decimal which lines up with SQL Server’s DECIMAL data type – it’s nothing more than an ordered 128-bit number. Which, coincidentally, is the same size as a GUID.
Wrapping it Up
There’s no easy solution to this problem. GUIDs are an easier approach, but they introduce additional load and maintenance overhead on the data storage mechanism. Distributed sequences don’t solve all of the problems of GUIDs, but they provide additional flexibility for database administrators, developers, and application architects alike.
Why I’m at Amazon Architect Training This Week
People bring me in when they’re having data problems. They can’t store data fast enough, they can’t make it reliable enough, they can’t hire people to manage it, etc. When I’m in the conference room, it’s because there’s a fire in the disco datacenter.
You know how it is as a DBA, though – DBA means Default Blame Acceptor. Everybody thinks the fire started in the database, but often it’s the SAN, VMware, crappy code, bad third party apps, or any number of combustible materials. The company gets more and more concerned about the growing smoke, and they ask the DBAs, “Who’s a SQL Server expert you can call to put this fire out?” The DBA thinks about my crazy blog posts and blurts my name out – mostly because he wants to find out if I’ll show up in the Richard Simmons costume. (That costs extra.)
Now put yourself in my shoes: I show up in a conference room or on a WebEx, and there’s a huge problem somewhere in the infrastructure. Everybody involved is pointing fingers at each other, and they’re all armed with volumes of reports proving that it’s not their problem. In a matter of 3-4 days, I need to:
- Find the real root cause of the problem
- Prove it to everyone involved using their own language
- Show a few possible solutions and recommend the right one
- Convince them to implement it as quickly as possible
SQL Server isn’t the only fire extinguisher, and I have to know how to put out data fires with other tools. Amazon’s got a ridiculous list of services that are easy to get started with, including:
- Relational Database Service – point, click, and deploy Microsoft SQL Server, Oracle, and MySQL instances. Amazon manages the backups, patching, and security. The MySQL ones even support readable replicas and replication to multiple datacenters.
- DynamoDB – super-fast NoSQL database hosted on SSDs. You pick how fast you want it to go, and Amazon makes it happen.
- Glacier – store your backups in the cloud for $.01 per gigabyte per month with no cost for incoming data.
- Import/Export – ship them a USB drive, and they’ll hook it up to the cloud. For folks with slow upload links, this is the fastest way to move your data online.
That’s why I’m in Dallas, Texas for a few days attending Amazon Web Services Architect Training. It’s a three-day design session that covers how to design solutions with their services. It’s not going to make me a Certified Master of Cloudiness across their broad range of tools, but that’s not the point. Clients don’t usually want me to do the work myself: they want me to find the right answer fast, get the staff on the right page, and let the staff knock out the work together.
If you’re a data professional, and you’re frustrated when people keep saying it’s a database problem when it’s not, what are you doing to bridge the gap? Are you frustrated that The Other Guy doesn’t know anything about SQL Server? Or are you reaching out to learn The Other Guy’s technology to help him to see where the smoke is coming from?
