Hash Partitioning, SQL Server, and Scaling Writes

At first glance, SQL Server’s partitioning seems like it should be an easy way to solve problems inserting data into busy tables. If the writes are spread across many partitions it only makes sense that we can avoid write hot spots in SQL Server, right? The truth is much more complicated than it appears.

Asking the Right Questions

A reader commented on Kendra’s post How To Decide if You Should Use Table Partitioning asking “what are the plusses and minuses to using Hash partitioning on OLTP tables with large numbers of inserts on a daily basis?” The pros are pretty easy to see – SQL Server will spread the writes out across many partitions which, in turn, allows you to spread the writes out within multiple file groups in each partition. It seems obvious that this will solve problems with allocation map contention under write heavy workloads.

If the only problem that you’re trying to solve is eliminating write contention in a filegroup, you can improve SQL Server performance by adding multiple files to that file group, then rebuild indexes. After the index rebuild, data will be distributed evenly between the different files. Since each file has its own set of allocation map pages, you’ll see less contention across the different allocation map pages. The upside of this approach is that you don’t have to make any changes apart from adding extra data files to the filegroup. SQL Server’s proportional fill algorithm should take care of the rest.

Of course things are never as simple as that, right?

Even if you manage to add enough data files to handle our data file write throughput, the transaction log will still be the biggest bottleneck. You can push SQL Server write performance a long way, especially if you follow the advice in these SQLBits videos from Thomas Kejser, but eventually you will hit the limits of a single transaction log. If the log can’t keep up, the number of data files won’t matter: you won’t be able to write any faster than you can write to your transaction log.

This is why it’s so important to ask “What’s the problem you’re trying to solve?”

Write Throughput & Multi-Tenancy

The original poster asked about the pros and cons of using hash partitioning in an OLTP system to deal with a large number of inserts. Hash partitioning uses some kind of hash function to evenly distribute data between a finite number of buckets. While a hash function will solve the problem of distributing data, hashing won’t make it possible to distribute writes to multiple log files – SQL Server only allows for one log file per database. Sure, you can add more disks behind the log file to make it respond faster but there is only so much you can do. Limitations are imposed on storage throughput simply because you can only fit so many HBAs in a server chassis before you run out of PCI slots and you can only buy so much Infiniband before you run out of budget.

When the transaction log becomes the limiting factor in a database’s write performance, it’s time to split the database into multiple databases. This is where things get difficult: where is the decision made to partition data across databases?

In multi-tenant systems it’s relatively easy to partition data across databases – each client’s data moves to its own database and any global metadata can either live in a central database or can be replicated to each client database. Partitioning the data by client ID is the natural way to split out the data. Most, if not all, queries will be limited to a single client and the application can handle any cross client joins that administrative users need to run.

The upside of this approach is that you can spread your data across multiple servers and handle performance problems on each server as needed. The downside is that using a hash function with a fixed number of servers means that adding another server means repartitioning all of the data and shuffling it between servers. Needless to say, this can take a long time.

Write Throughput & Single Tenancy

Let’s pretend that you don’t have a lot of different clients. You have one big client and you’re streaming a lot of sensor readings from multiple sensors into a single database. Things become more difficult when data is being streamed into a single tenant application – say you’re streaming sensor data into a single application database. Sometimes it’s not so easy to partition this data. There are a few ways to handle the load and none of them are great.

One of the most obvious ways to handle massive write concurrency in a single tenant system is to find the a level of the application where writes can be partitioned – in this example write partitioning could happen at the sensor level, sensors could be grouped, or the hash might be found by converting the current time to seconds and reversing the digits to bring more randomness the hash function output. This will help spread the load across multiple servers but it won’t make any easier to query the data. You need to do something other than spread the data randomly across some random number of servers.

Writing Faster Without A Database

Assuming that our goal is to handle more write throughput, the easiest thing to do is remove the database from the equation altogether. After all, if we aren’t writing to a database then we can’t run into issues with log throughput or allocation map contention. We can’t actually get rid of the database, but we can remove it from directly causing performance problems.

The first thing to do is add several cache servers in front of our database server. Ideally you want to have multiple cache servers for redundancy. Since the cache servers are nothing more than boxes loaded full of memory it shouldn’t be expensive to find servers that fit the bill. Before any data is committed to the database it needs to be written into the cache. It might be difficult to modify application patterns to work with the cache, but it can be done. 37signals redesigned how they handled caching page elements in their Basecamp product do use tiers of cache. The same ideas can be applied to your application – write data at the most granular level possible in your cache and cascade calculations throughout your cache.

The next step is to add a queuing mechanism to handle writes into your database. After you write to the cache, throw your writes into a durable queue. RabbitMQ is an example of a queue that can be clustered, is highly available, and can be backed by a durable store. By adding a durable queue alongside the cache it is possible to scale reads (the cache) independently of writes (the queue) without having to worry as much about performance in the underlying database.

Processing writes is a matter of setting up additional processes to read data from the queue, process the instructions, and write data to the database. This could happen by pulling off batches of tens or hundreds or even thousands of instructions at a time, processing them, and then inserting them into the database in one transaction and hopefully improving performance by avoiding a lot of small inserts. You still have to handle the write load into the database and handling scalability issues with the transaction log so this strategy relies on being able to find a performance point where it is advantageous to medium to large size batches instead of many small transactions. At some point, you might even have to start partitioning your data across multiple servers underneath the caches and queues, but your application won’t need to be as aware of that data split taking place. Caching and queuing hide that implementation pain from the middle tiers.

The Answer

There’s no correct answer. Pushing writes into a partitioned table means having to deal with the pain points of SQL Server’s table partitioning. Splitting writes out among multiple databases or servers can make it difficult to query your data. Implementing caches and queues adds operational complexity and requires expertise that some teams may not have.

The more flexibility you require, the more difficult it becomes to handle additional writes. At some point it might even become necessary to find ways to move the write heavy portions of your application out of SQL Server and into another application that does a better job of handling distributed writes.

Always ask yourself what problem you’re trying to solve and make sure that you’re solving it in the best way. Bolting something like consistent hashing on top of SQL Server is going to be painful, at best. But there may be a different way to solve the problem that’s less painful, operationally simpler, or a little bit of both.

Interested in SQL Server Table Partitioning Instead?

Before you start designing a SQL Server table partitioning strategy, or if you’re trying to troubleshoot why your partitioned tables aren’t working as fast as you expect, check out our SQL Server table partitioning resources page.

Previous Post
How to Interview Storage Vendors
Next Post
Breaking News: Changes to Microsoft SQL Server Certifications

18 Comments. Leave new

  • >> Splitting writes out among multiple databases or servers can make it difficult to query your data.

    you can always create federated views to get around this…

  • Todd Everett
    April 11, 2012 8:27 am

    Great Post – a real eye opener. One subtle take away from this is how important it is to really understand the underlying architecture of the DBMS in question when attempting to solve a problem like increasing write throughput. What may be a good solution in one platform may not be so good in another. It is always complicated. Thanks for sharing..

    • Thanks! I’m glad you liked it.

      As I’ve worked with previous employers and clients to scale SQL Server I’ve run into a variety of problems. Some of them have been related to hardware n some of them have been related to data access patterns, and occasionally it’s related to the limitations/architecture of the DBMS itself. The answer, like most things, is to keep learning and build the best solution to the problem at hand. Of course, don’t be afraid to change the way you solve the problem either 😉

  • I really enjoyed this article. As an in the trenches DBA myself I find it all too easy to get caught up perhaps a little too much in the here and now. Posts such as this that discuss architecture, design and adopt more big picture thinking are really insightful and valuable. They take more time to produce than a typical blog post and I appreciate it!

    Good stuff, more of the same please….

  • Richard L. Dawson
    April 15, 2012 7:14 pm

    It was a good article. I have one question though.
    You state that the database can only have one transaction log file. In what way do intend this statement?

    I have always created multiple physical transaction log files on my databases and have ensured round robin access to them with a trace flag.

    Even within a single transaction log file there are multiple virtual log files dependent on file size and amount of activity.

    Can you clear up my confusion on this?


    • I should have said that SQL Server can have one active transaction log file. This Technet article Understanding Logging and Recovery in SQL Server covers a lot of the functionality of SQL Server’s transaction log and how the different pieces and parts fit together.

      Under normal circumstances, SQL Server will write sequentially to one log file until it is full (barring transaction log backups and the like). Once a log file is full, SQL Server will increase the size of the log file until either the drive is full or the file’s growth limit is reached. Virtual log files are used to handle backup, restore, and the crash recovery process – SQL Server doesn’t round robin writes across multiple

      I’m not sure which trace flag you’re referring to, could you provide a link?

    • Kannan Chandrasekaran
      February 16, 2017 5:11 am

      Absolutely Right. SQL Server only allows for one log file per database is a false statement. We can create multiple log file for the single database. 1 primary, multiple secondary, multiple logs. The statement confusing me too.

  • Amazing post. In my (limited) experience every time I see someone throw partitioning at a large workload – as if its going to solve every ones problems – it makes things worse (write latency in specific). Very good point about the transaction log also.

  • yeah i think sharding (partitioning) is a good way to go if you can figure out your sharding keys – if you can partition your data easily.

    I also like the idea of writing through a message queue to bundle writes. I have never done that before but sounds like something that could scale and not be overly complicated.

    thanks for the post

  • Hi

    Thanks for a great post. We brainstormed the same problem yesterday and came up with the same sorts of solutions, which is pleasing. We even coined the same phrases, like hash buckets. LOL. Luckily for us, we had the big daddy of randomness, already built in, a GUID column. Finally I think I’ve found a use for a GUID.

    So I built the Partition Function with 20 Range values and built the associated scheme. I created my test table on this scheme and copied 300k records into it. The distribution between the different partitions was a near perfect split. Awesome!!!

    Thanks again

  • Kannan Chandrasekaran
    February 16, 2017 5:09 am

    SQL Server only allows for one log file per database is a false statement. We can create multiple log file for the single database. 1 primary, multiple secondary, multiple logs.

    • Kannan – no, not one primary, just multiple logs, and they don’t do you much good because you’re still writing to them serially.

      • Kannan Chandrasekaran
        February 16, 2017 7:02 am

        Thanks for the reply, i understand, But in order to avoid contention, Microsoft / other forum recommends to create multiple files mainly for reducing PFS latch contention. when it is placed in different drives. Also for parallel insert and update one thread is fetching record in one file group other is updating record in second file group. Please correct me if i am wrong. Our production environment uses this strategy and we succeed after migrating from single file group on larger volume transactions

        • PFS has nothing to do with log files. That’s data file activity, and usually associated with tempdb — the contention — hence multiple data files to further spread out PFS pages, which occur every 8088 pages (around 64 MB), and using Trace Flags like 1118 to avoid PFS pages when allocating extents for new objects.

          I think you’re confusing terminology a bit here.

          • Kannan Chandrasekaran
            February 16, 2017 11:57 pm

            Dear Erik,

            Thanks for the clarity. Yes i agreed the point that multiple log file will not give any performance improvement, additional log file will be added in other disk if we the previous log file disk is running out of space.

    April 9, 2019 6:15 am

    Nice post!


Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.