Solving SQL Server Scalability Problems

You’ve been nursing your SQL Server through day after day of increasing load. Stored procedures are taking longer and longer to run. There are timeouts on the client side. Both customers and management are getting upset. As you’re bemoaning the terrible code that the developers wrote, it hits you: you don’t have a code problem, you have a scalability problem.

[Image: A slight scale problem]

The Types of SQL Server Scalability Problems

Hardware

It seems obvious to an outsider, but hardware has to be replaced on a regular basis. Between replacements, hardware is like code: it requires attention and maintenance to keep it running smoothly. As a DBA, it’s your job to pay as much attention to the database hardware as you do to the wait stats, slow queries, and missing indexes.

This doesn’t mean you can throw a monitoring package in place and walk away. Understanding how SQL Server and the underlying hardware respond under your application’s workload is important. Once you have a baseline on SQL Server and the hardware, you can easily tell if a problem is bad code or you need to call up your friendly hardware salesperson.
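As a sketch, baselining can be as simple as snapshotting the wait stats DMV into a table on a schedule (an Agent job works well) and comparing points in time. The DBAdmin database and table names here are made up for illustration:

```sql
-- Snapshot wait stats into a baseline table on a schedule.
-- Assumes a utility database named DBAdmin exists; adjust to taste.
USE DBAdmin;
GO

IF OBJECT_ID('dbo.WaitStatsBaseline') IS NULL
    CREATE TABLE dbo.WaitStatsBaseline
    (
        capture_time        DATETIME2(0) NOT NULL DEFAULT SYSDATETIME(),
        wait_type           NVARCHAR(60) NOT NULL,
        waiting_tasks_count BIGINT       NOT NULL,
        wait_time_ms        BIGINT       NOT NULL,
        signal_wait_time_ms BIGINT       NOT NULL
    );

INSERT dbo.WaitStatsBaseline (wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;  -- skip waits that have never occurred
```

Diffing two captures tells you what the server actually waited on during that window, which is far more useful than a single point-in-time number.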

Design

Database design is probably not your bailiwick; you might even say you’re not a database architect, you’re a database administrator. That may be, but you’ve also got the keys to the kingdom of database performance.

Through the DMVs you can tell which indexes have lock escalation problems, which files have slow reads, and even narrow these problems down to specific queries and times of day. Even if you can’t tell Boyce-Codd Normal Form from Backus-Naur Form, you have tools to help you identify problems at a physical level.

You probably want an example – if you add several new queries to the application and suddenly there are a lot more lock escalation attempts on a table, you can safely conclude that at least one of those queries would benefit from a new index.
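Here's roughly what that check looks like against the index operational stats DMV (run it in the database you're investigating):

```sql
-- Which indexes in the current database see the most lock escalation attempts?
-- These counters reset when the instance restarts.
SELECT
    OBJECT_NAME(ios.object_id)             AS table_name,
    i.name                                 AS index_name,
    ios.index_lock_promotion_attempt_count AS escalation_attempts,
    ios.index_lock_promotion_count         AS escalations
FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS ios
JOIN sys.indexes AS i
    ON i.object_id = ios.object_id
   AND i.index_id  = ios.index_id
ORDER BY ios.index_lock_promotion_attempt_count DESC;
```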

Code

I lied: sometimes the scalability problem is a code problem.

DBAs love to blame those pesky developers for causing problems. There’s some truth in that statement – developers introduce changes into an otherwise stable system. But when developers are adding new features and functionality, they can’t always know which indexes will be the best; after all, you didn’t have the resources to provide them with a full copy of the database, right?

Thankfully, you can track down top resource consuming queries very easily. Once you’ve identified those queries, you can either tune them or work with the developers to educate them about what they can do better the next time. All too often, time is the only issue that prevents things from being done correctly – developers are pressed for time to get a feature out, so they don’t spend as much time tuning code as they’d like. Help them out – find the bad queries and share the knowledge or just tune them yourself.
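One way to find those top consumers is the plan cache. Something like this query against sys.dm_exec_query_stats surfaces the worst CPU offenders; swap the ORDER BY to total_logical_reads or total_elapsed_time to chase a different resource:

```sql
-- Top 10 plan-cache entries by total CPU time.
SELECT TOP (10)
    qs.execution_count,
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.total_logical_reads,
    SUBSTRING(st.text,
              (qs.statement_start_offset / 2) + 1,
              ((CASE qs.statement_end_offset
                    WHEN -1 THEN DATALENGTH(st.text)
                    ELSE qs.statement_end_offset
                END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```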

Outside of bad T-SQL, there are a number of patterns that can cause problems for database performance. So bad, in fact, that they’re going in a new section.

[Image: No matter what you call it, that doesn't work.]

(Anti)Patterns

If you’ve thought that something didn’t smell quite right in the database, you were probably right. There are a number of ways to use SQL Server that work… for a while. I like to call these scalability anti-patterns. These anti-patterns work well as long as you can keep throwing more physical resources at the problem.

An anti-pattern can be difficult to identify – is it just bad code, or are you seeing something worse? Over at ACM Queue there’s a guide, 20 Obstacles to Scalability, that lists patterns that will prevent you from moving forward. The most common ones you’ll find are a lack of caching, serial processing, using the database as a queue, and full-text search.

Fixing anti-patterns takes a long time. These are architectural decisions baked into the entire application, not just the database. To fix them, you’re going to need to work with the developers, research new tools, and figure out how to implement the necessary features.

Solving the Problems

Easy Mode: Hardware

Let’s be honest: throwing money at a scalability problem isn’t a long-term solution, but it can be a quick fix. If storage is slow, you can embark on a query tuning project or you can buy additional memory (or even SSDs). Although you’ll eventually need to tune those queries, new hardware buys you some breathing room.

Don’t let new hardware lull you into a false sense of security. If you buy hardware to temporarily solve a problem, make sure you also record the performance problems and get time scheduled in the future to put a fix in place.

Medium Difficulty: Tuning

When you finally decide to tune your application, identify your biggest bottlenecks (if you don’t know how, we can teach you) to make sure that you’re tuning in the right place. You can’t tune T-SQL to make the network faster. Once you know what you’re trying to fix, identify the queries that cause the most pain. Is memory a problem? Look for queries with large memory grants and tune them, or think about adding indexes.
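For the memory grant case, one place to look is sys.dm_exec_query_memory_grants, which shows queries currently holding or waiting on grants:

```sql
-- Queries currently waiting on (or holding) the largest memory grants.
SELECT
    mg.session_id,
    mg.requested_memory_kb,
    mg.granted_memory_kb,
    mg.wait_time_ms,
    st.text AS query_text
FROM sys.dm_exec_query_memory_grants AS mg
CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) AS st
ORDER BY mg.requested_memory_kb DESC;
```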

Database performance tuning is as much about improving code as it is about making the database do less. Better indexes mean less sorting, but so does sorting in the application layer. Maybe you can pre-compute the results of complex queries – even if the aggregations are slightly out of date, there may not be a real-time requirement.
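A minimal sketch of that pre-computation pattern, using a made-up Sales table and a summary table refreshed from an Agent job every few minutes instead of aggregating on every dashboard hit:

```sql
-- Hypothetical example: table and column names are made up for illustration.
BEGIN TRANSACTION;

DELETE dbo.SalesDailySummary
WHERE sale_date >= DATEADD(DAY, -1, CAST(SYSDATETIME() AS DATE));

INSERT dbo.SalesDailySummary (sale_date, order_count, total_amount)
SELECT CAST(order_date AS DATE), COUNT(*), SUM(amount)
FROM dbo.Sales
WHERE order_date >= DATEADD(DAY, -1, CAST(SYSDATETIME() AS DATE))
GROUP BY CAST(order_date AS DATE);

COMMIT;
```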

The Hard Stuff: Redesign

When hardware and tuning aren’t enough, it’s time to roll up your sleeves and redesign parts of the application. You’ll still need to identify bottlenecks, but you aren’t going to be tuning code.

Got a queue in the database? Figure out how to move that queue-based processing outside of SQL Server, or process the queue in large batches rather than line by line.
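One common batching approach – table and column names here are hypothetical – is to claim a whole batch atomically with DELETE ... OUTPUT, letting READPAST skip rows other workers have already locked:

```sql
DECLARE @batch TABLE (id BIGINT, payload NVARCHAR(MAX));

WITH next_batch AS
(
    SELECT TOP (500) id, payload
    FROM dbo.WorkQueue WITH (ROWLOCK, READPAST)
    ORDER BY id
)
DELETE next_batch
OUTPUT deleted.id, deleted.payload INTO @batch;

-- ...now process everything in @batch as a set, not one row at a time...
SELECT COUNT(*) AS claimed_rows FROM @batch;
```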

Reporting against live data? Find ways to report against readable secondary servers, or even report against a time-delayed copy of the data in a data warehouse.
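If you’re on an Availability Group with read-only routing configured, that can be as simple as a connection string change on the reporting side. The listener and database names below are made up:

```sql
-- Hypothetical report-app connection string (not T-SQL):
--   Server=tcp:MyAgListener,1433;Database=Sales;ApplicationIntent=ReadOnly;
-- From inside the session, you can confirm where the connection landed:
SELECT @@SERVERNAME AS server_name,
       DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS updateability;
```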

There are ways to redesign any feature for scalability. Tools and techniques have been documented in books, articles, blog posts, and conference talks. The question is no longer “How could you possibly do that?” The question is “Which technique are we going to use?”

What Should You Do?

For most of you, the answer is simple: find your bottleneck and decide if you can make a simple fix or if you need to devote more time to the problem. If you need more time, buy a bit of hardware to help you through. Don’t jump on the re-write bandwagon too early, but always keep ideas in the back of your head. And, above all else, make sure you’re solving the right problem with the right solution. Adding more processors won’t help if every query is wrapped in a SERIALIZABLE transaction.
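A quick way to sanity-check that last point is to see which isolation levels your sessions are actually running under:

```sql
-- 4 = SERIALIZABLE; 2 = READ COMMITTED (the default).
SELECT transaction_isolation_level, COUNT(*) AS sessions
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
GROUP BY transaction_isolation_level
ORDER BY transaction_isolation_level;
```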


Comments

  • Being a DBA is synonymous with being a firefighter. We’re typically outnumbered by a 10:1 ratio (developers to DBAs). Developers want to hit releases and milestones, and often forget to coordinate with the DBA as new code is pushed to production. It’s the nature of the beast. In fact, I just read this the other day: http://www.confio.com/logicalread/devops-essentials-for-dba-developer-collaboration.

    At my current position, I’ve gone from four to eight cores on my production servers, doubled the RAM, enabled hyperthreading, and put transactional replication in place to offload read overhead. That was the easy way to address performance issues (or cover up less-than-optimal procs, database design, etc.), but the first place I started was basic performance tuning: create jobs to harvest and analyze performance metrics (DMVs) so they persist over time, and trace for low-hanging fruit (high reads, high CPU, long duration, etc.). It was important to ask the developers (not database developers – coders) to leave performance tuning to me; it’s a much better use of everybody’s time. Communication between the developers and DBA(s) is huge – good communication lets us address performance issues proactively rather than reactively. The last thing I do is, as you alluded to, point out the top X poor-performing procs/statements and work with the developers to find alternative methods to get the data they need.

    DBAs are not only firefighters – what we do is largely an art, and we have to understand all the tools we have at our disposal. If we do our job well, nobody knows us 😉

    Great topic Jeremiah! I think this article sums up the life cycle of a DBA.

  • Hardware is the easiest way to fix issues, and it’s changing in ways I haven’t seen in the last couple of decades. RAID cards have lost the war and will never return due to the latency they add to data access. Medium/large SANs are very expensive and are being replaced by VMware’s vSAN and MS Storage Spaces. vSAN and Storage Spaces will in turn be replaced by PCIe networks of NVMe drives, which have been designed for massive queue depths and parallel access. It is an exciting time for storage, and I think AWS and Azure are driving demand for new technologies to lower costs for their data centers, since their CPUs are mostly idle waiting on slow spinning disks.

    I’m running Storage Spaces on Windows Server 2016 with mirrored Intel 750 1.2TB drives and six very slow 7200rpm 3TB drives, but I have 2x the memory that’s needed. This brings the database entirely into memory and minimizes disk access to the 750s, which are not true enterprise drives but will get me by till they drop in price and increase in speed. This setup runs at 500MB/s on demand with random writes, and I’ve seen it hit 1.5GB/s with sequential writes on a 6-year-old server.

