You’ve been nursing your SQL Server through day after day of increasing load. Stored procedures are taking longer and longer to run. There are timeouts on the client side. Both customers and management are getting upset. As you’re bemoaning the terrible code that the developers wrote, it hits you: you don’t have a code problem, you have a scalability problem.
The Types of SQL Server Scalability Problems
It seems obvious to an outsider, but hardware has to be replaced on a regular basis. Between replacements, hardware is like code: it requires attention and maintenance to keep it running smoothly. As a DBA, it’s your job to pay as much attention to the database hardware as you do to the wait stats, slow queries, and missing indexes.
This doesn’t mean you can throw a monitoring package in place and walk away. Understanding how SQL Server and the underlying hardware respond under your application’s workload is important. Once you have a baseline on SQL Server and the hardware, you can easily tell if a problem is bad code or you need to call up your friendly hardware salesperson.
Database design is probably not your bailiwick; you might even say you’re not a database architect, you’re a database administrator. That may be, but you’ve also got the keys to the kingdom of database performance.
Through the DMVs you can tell which indexes have lock escalation problems, which files have slow reads, and even narrow these problems down to specific queries and times of day. Even if you can’t tell Boyce-Codd Normal Form from Backus-Naur Form, you have tools to help you identify problems at a physical level.
You probably want an example – if you add several new queries to the application and suddenly there are a lot more lock escalation attempts on a table, you can safely conclude that at least one of those queries would benefit from a new index.
I lied, sometimes the scalability problem is a code problem.
DBAs love to blame those pesky developers for causing problems. There’s some truth in that statement – developers introduce changes into an otherwise stable system. But when developers are adding new features and functionality, they can’t always know which indexes will be the best; after all, you didn’t have the resources to provide them with a full copy of the database, right?
Thankfully, you can track down top resource consuming queries very easily. Once you’ve identified those queries, you can either tune them or work with the developers to educate them about what they can do better the next time. All too often, time is the only issue that prevents things from being done correctly – developers are pressed for time to get a feature out, so they don’t spend as much time tuning code as they’d like. Help them out – find the bad queries and share the knowledge or just tune them yourself.
Outside of bad T-SQL, there are a number of patterns that can cause problems for database performance. So bad, in fact, that they’re going in a new section.
If you’ve thought that something didn’t smell quite right in the database, you were probably right. There are a number of ways to use SQL Server that work… for a while. I like to call these scalability anti-patterns. These anti-patterns work well as long as you can keep throwing more physical resources at the problem.
Anti-patterns can be difficult to identify as an anti-pattern – is it just bad code or are you seeing something worse? Over at ACM Queue there’s a guide 20 Obstacles to Scalability that lists patterns that will prevent you from moving forward. The most common patterns you’ll find are a lack of caching, serial processing, using the database as a queue, and full-text search.
Fixing anti-patterns will take a long time. These are architectural decisions that are baked into the entire application, not just the database. In order to fix these, you’re going to need to work with the developers, research new tools, and figure out how to implement the features
Solving the Problems
Easy Mode: Hardware
Let’s be honest, using money to solve your scalability problems isn’t a solution. Using money to solve scalability problems can often be a quick fix. If storage is slow, you can embark on a query tuning project or buy additional memory (or even SSDs). Although you’ll eventually need to tune those queries, you can buy yourself some breathing room by adding new hardware.
Don’t let new hardware lull you into a false sense of security. If you buy hardware to temporarily solve a problem, make sure you also record the performance problems and get time scheduled in the future to put a fix in place.
Medium Difficulty: Tuning
When you finally decide to tune your application, identify your biggest bottlenecks (if you don’t know, we can teach you how) to make sure that you’re tuning in the right place. You can’t tune T-SQL to make the network faster. Once you know what you’re trying to fix, identify the queries cause the most pain. Is memory a problem? Look for queries with large memory grants and tune them. Think about adding indexes.
Database performance tuning is as much about improving code as it is about making the database do less. Better indexes mean you have to sort less, but so does sorting in the application layer. Maybe you can pre-compute complex queries – even aggregations may be slightly out of date, there may not be a real-time requirement.
The Hard Stuff: Redesign
When hardware and tuning aren’t enough, it’s time to roll up your sleeves and redesign parts of the application. You’ll still need to identify bottlenecks, but you aren’t going to be tuning code.
Got a queue in the database? Figure out how you can move that queue based processing outside of SQL Server or process the queue in large batches rather than line by line.
Reporting against live data? Find ways to report against readable secondary servers or even report against a time delayed copy of the data in a data warehouse.
There are ways to redesign any feature for scalability. Tools and techniques have been documented in books, articles, blog posts, and conference talks. The question is no longer “How could you possibly do that?” The question is “Which technique are we going to use?”
What Should You Do?
For most of you, the answer is simple: find your bottleneck and decide if you can make a simple fix or if you need to devote more time to the problem. If you need more time, buy a bit of hardware to help you through. Don’t jump on the re-write bandwagon too early, but always keep ideas in the back of your head. And, above all else, make sure you’re solving the right problem with the right solution. Adding more processors won’t help if every query is wrapped in a
Picking a database doesn’t have to be as one sided as “Use SQL Server”. Although SQL Server is a great general purpose database, it is a general purpose database. If there are other properties to your database, you might need to start looking at different database platforms to meet your needs.
Don’t believe me? Take a look at the video from “Nobody Got Fired for Buying SQL Server”. In it I cover a list of questions that you can use to help figure out the best database for getting the job done. You might wind up at SQL Server a lot, depending on your needs. That’s not a bad thing. Heck, it’s a good thing because SQL Server is solid.
Let’s face it, AlwaysOn Availability Groups solve the scale out read problem, but they don’t make it possible to do scale out writes in SQL Server. Whether you have a lot of data, need a lot of memory, or just want to push writes faster than a single server can handle – there’s no clear cut SQL Server solution right now.
Popular scale out databases:
Maybe you don’t need the rich semantics of the relational model, but you need something else. Document databases are designed around the idea of a document as a loose collection of properties formatted in a standardized way (XML, JSON, YAML). The idea behind a document database is that you pull back an entire document all at once. Document databases have query languages and indexes, both of which make it possible to provide rich query semantics while moving away from the relational model.
Popular document databases:
Graph databases are designed to track the relationships between objects. If most important part of your application is tracking asymmetrical relationships, then a graph database may be what you need. What’s an asymmetrical relationship? When it’s important to know that Brent and I are friends on Twitter and Brent and I both follow Wolf Blitzer, who only follows me, that’s something you’d want to track in a graph database.
Popular graph databases:
Just as querying functionality defines database choice, so does failure scenario. SQL Server responds well to a host of failure scenarios, depending on your SLAs, but it might not be everything that you’re looking for.
Make sure that you thoroughly understand the possible failure scenarios for your application. Understanding the requirements of how your application responds to failure will make your database choices easier. An application that only needs to survive single server failure within one datacenter has very different requirements than an application that has to survive entire datacenter failure.
Picking a database isn’t an easy choice. There are a lot of factors to consider from administration, development, feature set, and failure conditions. You can’t just pick something because you read about it on Hacker News or heard about it in a conference. (Well, you could, but you’ll probably be in a lot of pain.) Making the right database decision means that you won’t be scrambling to solve a lot of difficult problems long after you’ve made some questionable starting choices.
Download the database decision diagram or sound off in the comments if you’ve got more questions.
Five years from now, in March 2018, what will be different about databases? I’m a Microsoft SQL Server guy, so keep that in mind when reading my thoughts.
Looking Back at 2008-2013
The big story here for me was virtualization. In mid-2008, Microsoft added support for SQL Server running under Hyper-V, and later they added support for VMware as well. Today, almost all of my clients have at least one virtual SQL Server. They typically start with development/QA/test instances, then disaster recovery, then production.
For the remaining physical servers, cheaper memory and solid state storage meant you could get insane performance out of commodity hardware. Microsoft got caught without a good scale-out story, but cheap scale-up prices meant it didn’t matter. I don’t know whether this was luck or strategy, but good on ‘em either way. In 2008, I’d never would have believed that 2013 Brent would have a dozen clients with 2TB of memory per server. Over the last couple of years, I’ve had a few companies think they needed to switch database platforms in order to get the performance they need, but the real answer has always been quick (and relatively cheap) changes to code, configuration, and hardware.
Management tools didn’t really change at all, and that’s a story in and of itself. Microsoft made several attempts to change how DBAs and developers interact with databases – Visual Studio Data Dude, DACPACs, Utility Control Points. When SQL 2008R2 came out, I wrote about why these features wouldn’t really have an impact, and today, I’m not surprised that they’re just not getting any serious adoption. Microsoft threw in the towel on Data Dude and replaced it with SQL Server Data Tools, but didn’t include all of the functionality. I don’t see a lot of developer confidence in Microsoft’s short attention span here, so tooling hasn’t been a big story. (We did get an all-new SQL Server Management Studio under the hood, but Microsoft went to great pains to ensure it looked/worked basically the same as the old one, so…yeah.)
Business Intelligence (BI) got a lot of headlines, but here we’ve got another hyper-distraction story. Microsoft threw so many different tools against the wall that the naming even became a joke – does PowerPivotPointPro use the XVelocitySuperMart in v3.14? I don’t envy the BI pros who have to keep up with this jumbled mess of licenses, features, and names, but I do think Microsoft is heading in the right direction. The combination of Excel, SharePoint, columnar storage, and hella fast laptops means Microsoft is in a good spot to give insight to managers. It just wasn’t a huge revolution in 2008-2013 because the stories and products kept changing.
Looking Forward at 2013-2018
When our servers were physical, they had a built-in expiration date. The hardware support would grind to an end, and we’d be under pressure to migrate them onto more reliable hardware. We often included a SQL Server version upgrade in that same project.
Those days are over. The combination of virtualization and SQL 2005/2008 will leave an interesting legacy challenge for DBAs. Once your SQL Server is virtualized, it’s really easy to get it off old hardware – just VMotion or LiveMigrate it to another host. You can do it even while it’s still powered on. Does that old version need some more horsepower? Shut it down, add a couple of virtual CPUs and more memory, and power it back on. What used to be a big ugly maintenance project is now a matter of just a reboot.
This means you’ll be supporting SQL Server 2005 and 2008 forever.
SQL Server 2000 has thankfully (mostly) already been exterminated from serious production work. Its lack of management tools and mainstream support means it’s painful to troubleshoot, so most of us have already migrated production work to 2005 and 2008. Support for those newer versions doesn’t end for years, so settle in and get comfy. Sure, SQL Server 2005 and 2008 have bugs, and they’re missing cool features like backup compression in Standard Edition, but for the most part, they just work. Businesses will stick with ‘em for most applications because they don’t see enough compelling features in 2012.
In SQL Server 2012 and beyond, we’ve got:
- AlwaysOn Availability Groups – high availability, disaster recovery, and scale-out reads
- Hekaton - in-memory storage with optimized stored procedures and new data formats on disk
- Column store indexes – faster data retrieval for certain kinds of queries
Call me maybe crazy, but I don’t see really widespread adoption for any of these. To do them right, we’ve gotta make changes to application code. The changes won’t pay off for the majority of customers, so it’s risk without much reward. Don’t get me wrong – when you need this kind of speed, then you need it, and the features are fantastic. I do see widespread adoption coming in 2013-2018 for AlwaysOn, but only for high availability and disaster recovery, not the scale-out reads part.
The virtualization/SQL2005-is-good-enough combination also means we’re not going to see massive, widespread migrations from on-premise SQL Servers to cloud services like SQL Azure. (We’re also not going to see people call it by its official product name, Microsoft Windows Azure SQL Database.) Your app would require code changes to make that switch, and code changes are risky.
New development, on the other hand, means you can pick the features and compatibility you want. In those environments…I still don’t see a lot of widespread SQL Azure adoption coming. If I’m a developer building a new app from the ground up, I’m going to pick the cheapest, easiest database possible. These days, that’s probably PostgreSQL. Like SQL Server, it’s a platform that is available in a lot of different delivery mechanisms – on-premise on bare metal, on-premise virtualized, colo boxes, cloud-based services, etc. It’s open source, and it has all kinds of cool features we don’t get in SQL Server. I’m not changing horses mid-stream, but if I was starting again as a developer, I’d go that route.
What Do You See Changing?
When you look at your company’s needs, the job you do, and the tools you use, what do you see coming down the pike? In 2018, what will be the defining thing that changed your work?
Scaling up is hard: big hardware gets expensive fast. Scaling out is equally difficult; interesting design problems creep in to scale out solutions. One of the more troublesome issues architects face when scaling out is the issue of distributing identity. It’s often advantageous for object identity to be unique and portable across all database instances in an application – we may need to move a user from server A to server B. Once multiple database servers are involved, a centralized sequence generator can be come a single bottleneck or may even run out of values under heavy load. A solution is needed to generate unique values outside of the database while avoiding the performance problems of random, or semi-random, GUIDs.
Sequential Beginnings: Identity and IDENTITY
In the beginning, there was an empty schema. At this point an architect returned from the coffee pot and made a decision about database identities or GUIDs. There many valid reasons to make a decision in either direction – identities or sequences are controlled in the database and can be used to ensure a physical order to data. Who knew that spinning hard disks work better when data is sequential? (By the way, sequential GUIDs may not be sequential.)
You can make a lot of arguments for the right or wrong way to do things from a logical perspective, but DBAs do have a pretty good point when they say that randomness can cause problems for database performance. Generating sequential identifiers in the database may not be the most elegant solution to identity problems, but it does ensure that data is written in an order that makes sense in a world of spinning disk drives.
Database controlled sequential identity has one problem: sequential identities will need to be generated on a single server. Under sufficient load, that server will become a bottleneck for application scalability. To move past a single server as a bottleneck, a more robust and load tolerant solution is needed.
Distributing Identity: The Case for GUIDs
Architects and developers may be thinking that identities are great, but what happens when everything gets further apart?
As applications grow (or even by design), it becomes common to see teams become worried about distributing workload across many servers, using queues, or even splitting the data out into multiple databases or database servers. It’s at this point in the discussion that things get heated and people start throwing around the idea that GUIDs are the only way to solve this problem. Or that you just can’t rely on identity from the database and application generated identity is the only identity that matters.
This is where the war about numbers vs GUIDs gets nasty and someone gets their feelings hurt. Ignoring the size of GUIDs, I can say that I’ve witnessed several GUID collisions in production systems. GUIDs are only theoretically unique – they may even create a problem that you didn’t know you had.
A Better Solution for Distributed Identity
Combining distributed identity and ordered data seems like it’s a hard problem. Random GUIDs can’t be guaranteed to be unique, sequential GUIDs can’t be guaranteed to be non-overlapping, and database generated identities require a persistent connection to a database (or else they require a looser idea of identity than some folks are comfortable with).
Moving away from the database as the central keeper of all knowledge and identity is difficult and many teams seem to equate moving identity out of the database with moving all logic and functionality out of the database. This doesn’t have to be the case. The database can still be used to provide a considerable amount of declarative functionality and logic but identity generation can be moved outside of the database.
Twitter and Boundary have solved the problems of distributed sequencing by moving the work away from the data tier. Both solutions solve the problem by treating a number as if it were an array of information. The first portion of a sufficiently large number is a timestamp; the timestamp is stored as the number of milliseconds since a previous point in time. The next number is a worker identifier – this can be anything that uniquely identifies the device generating the sequence. Finally there’s a sequence itself. The sequence is typically small (between 8 and 16 bits) and it starts counting again from 0 every time the millisecond counter changes.
The machine identifier, usually the MAC address, doesn’t matter specifically, but we do need to be able to reliably generate separate sequences of IDs. This doesn’t have to be a MAC address, it could be any number of bytes that identify a unique source of sequences. By storing milliseconds since epoch as the first portion of the key, we’re able to produce a sequence that’s mostly ordered with some random jitter at the intermediate levels. On the whole, though, our inserts will be ordered.
If you’re on the .NET side of the house, I solved this problem with a library called Rustflakes. The implementation is lifted wholesale from Boundary’s flake and the generated sequence values are a .NET
decimal which lines up with SQL Server’s
DECIMAL data type – it’s nothing more than an ordered 128-bit number. Which, coincidentally, is the same size as a GUID.
Wrapping it Up
There’s no easy solution to this problem. GUIDs are an easier approach, but they introduce additional load and maintenance overhead on the data storage mechanism. Distributed sequences don’t solve all of the problems of GUIDs, but they provide additional flexibility for database administrators, developers, and application architects alike.
People bring me in when they’re having data problems. They can’t store data fast enough, they can’t make it reliable enough, they can’t hire people to manage it, etc. When I’m in the conference room, it’s because there’s a fire in the
You know how it is as a DBA, though – DBA means Default Blame Acceptor. Everybody thinks the fire started in the database, but often it’s the SAN, VMware, crappy code, bad third party apps, or any number of combustible materials. The company gets more and more concerned about the growing smoke, and they ask the DBAs, “Who’s a SQL Server expert you can call to put this fire out?” The DBA thinks about my crazy blog posts and blurts my name out – mostly because he wants to find out if I’ll show up in the Richard Simmons costume. (That costs extra.)
Now put yourself in my shoes: I show up in a conference room or on a WebEx, and there’s a huge problem somewhere in the infrastructure. Everybody involved is pointing fingers at each other, and they’re all armed with volumes of reports proving that it’s not their problem. In a matter of 3-4 days, I need to:
- Find the real root cause of the problem
- Prove it to everyone involved using their own language
- Show a few possible solutions and recommend the right one
- Convince them to implement it as quickly as possible
SQL Server isn’t the only fire extinguisher, and I have to know how to put out data fires with other tools. Amazon’s got a ridiculous list of services that are easy to get started with, including:
- Relational Database Service – point, click, and deploy Microsoft SQL Server, Oracle, and MySQL instances. Amazon manages the backups, patching, and security. The MySQL ones even support readable replicas and replication to multiple datacenters.
- DynamoDB – super-fast NoSQL database hosted on SSDs. You pick how fast you want it to go, and Amazon makes it happen.
- Glacier – store your backups in the cloud for $.01 per gigabyte per month with no cost for incoming data.
- Import/Export – ship them a USB drive, and they’ll hook it up to the cloud. For folks with slow upload links, this is the fastest way to move your data online.
That’s why I’m in Dallas, Texas for a few days attending Amazon Web Services Architect Training. It’s a three-day design session that covers how to design solutions with their services. It’s not going to make me a Certified Master of Cloudiness across their broad range of tools, but that’s not the point. Clients don’t usually want me to do the work myself: they want me to find the right answer fast, get the staff on the right page, and let the staff knock out the work together.
If you’re a data professional, and you’re frustrated when people keep saying it’s a database problem when it’s not, what are you doing to bridge the gap? Are you frustrated that The Other Guy doesn’t know anything about SQL Server? Or are you reaching out to learn The Other Guy’s technology to help him to see where the smoke is coming from?