Database Sharding

Scaling up is hard, scaling out is even harder. We all know that we can scale reads by adding some kind of replication or read-only copies of databases or using a massive caching layer. What happens when we have to scale writes?

Using a grocery store metaphor: a database that is effectively scaled for reads will have ample parking, the aisles will be wide and well lit, and there will be plenty of high capacity shopping carts available for everyone to use. There are even 20 staffed checkout lines. The problem is that goods coming into the store only enter the store through a single loading dock that has space for one truck. Shoppers can get in or out quickly, but there’s no way for new groceries to get to the shelves.

If we were to redesign the grocery store to be able to handle more shipments, we need to ask ourselves:

How often do we need new shipments? Do we need more shipments or more shipments at the same time?
Where do those shipments need to go in the store?
Do we need to put similar items together on the truck?

Scaling Out to Handle More Produce

To get more data into SQL Server, we need more places to write. At some point, we reach the limits of our hardware budget. Even with a sufficiently large budget, there comes a time when the hardware in a server can’t handle any more writes, no matter how much hardware is thrown at the problem. This happens, in part, because writers issue locks and hang on to those locks until that lock passes out of scope (either as an individual statement or at the transaction level).

We can approach the problem a few ways:

Writing to one server and reading from others
Writing to many servers and reading from many servers
Finding a new job

Since this is an article about solving the problem, we can rule out finding a new job right away. That leaves us with two possible solutions.

Write Primary

This is a tried and true way to solve the problem of scaling writes. Since a mix of reading and writing to a single server can cause a lot of locking and blocking, much of the locking and blocking can be eliminated by moving reads to a separate server.

Moving reads to a new server solves the problem of writing quick enough to our primary server, but it just spreads the problem out to all of the read replicas. Instead of having to troubleshoot locking and blocking problems on a single server, someone will have to troubleshoot locking and blocking problems combined with whatever mechanism we’re using to move data to the read servers. Instead of having one problem, we now have a lot of problems. And, to put a big red cherry on top of this sundae of problems, eventually there will be problems scaling writes on a single write server.

This feature has even been built into SQL Server as SQL Server Always On Availability Groups. Although it’s a helpful feature, it requires a lot of specialist employees to set up, deploy, and maintain. This isn’t for the faint of heart.

Sharing the Load

Instead of routing all writes to one server and scaling up, it’s possible to write to many servers and scale out. We can set up sharding (sometimes called database federation) pretty easily at one of many levels. For example, MySQL can be sharded through a driver, PostgreSQL has the Postgres-XC project, and other databases have their own solutions. Clearly this is a problem that people have approached before.

A major difficulty with sharding is determining where to write data. There are several approaches to determining where to write data, but these approaches can be broken down into three categories: range partitioning, list partitioning, and hash partitioning.

Range partitioning involves splitting data across servers using a range of values. Rows between 1001 and 10000 go to server A, 10001 to 20000 to server B and so on. This approach seems like it could work well, but it has some problems. The largest problem is that it creates write hot spots – all new data will be written to the same range server. Sending all writes to a single server doesn’t help us scale out. Range partitioning also doesn’t guarantee an even distribution of data or activity patterns – we can’t predict which ranges will be active and which will be sparsely populated.

List partitioning is similar to range partitioning. Instead of defining a range of values, we assign a row to a server based on some intrinsic data property. This might be based on department, geographic region, or customer id. This is can be an effective way to shard data. Members of each list grouping are likely to use the same features and have similar data growth patterns. The downsides of this approach are that it may require domain knowledge to create effective lists, lists are unlikely to experience even growth, and in the real world the list taxonomies may become very complex.

The last approach to partitioning is hash partitioning. Instead of partitioning by some property of the data, we assign data to a random server. The randomness comes from applying a hashing function to some property of the data. Hashing makes it easy to take an input of any length and produce an identifiable output of a fixed length – we’re taking randomly sized strings and mapping them to a known size number.

A naïve approach to hashing keys would be to split data into multiple buckets (four buckets in this example) based on the output of the hashing function. If our business grows and we decide to scale out to additional servers, all of the data will need to be re-written.

A more sophisticated approach uses something called consistent hashing. However you get to a the value, consistent hashing distributes keys along a continuum – think of it as a ring. Each of our sharded servers is responsible for a portion of the data. If we want to add another server, we just add it into the ring and it takes over for a portion of the hashed values. With consistent hashing, only a small portion of the data needs to be re-written, unlike our naïve hashing example where all data needs to be redistributed. This has been discussed at length throughout computer science and in multiple papers, including Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web and Dynamo: Amazon’s Highly Available Key-value Store.

Where Should We Split the Load?

Implementing a combination of consistent hashing and database sharding isn’t easy. The decision to route a write has to be made somewhere.

In the application A layer of code is written to determine how a write will be routed. This is often combined with re-writing connection strings on the fly to determine which server a read or write should hit.
On the wire Instead of routing writes in our application, writes are sent to the appropriate server by router that performs packet inspection looking for a hash value.

Both approaches have their merits. Routing in the application can make this process as easy as deploying a new DLL or configuration file. A determined developer could write code to look for new databases and partition writes accordingly. On the flip side, routing writes on the wire makes this routing transparent to any calling applications – no code will need to be changed to add additional databases. Router configuration changes will be required, but the changes can be performed at any time rather than waiting for a regular application deployment.

Where Do I Split The Load?

Designing a scale out strategy isn’t easy. There are many factors to consider, and knowing how to scale out writes is just one thing to think about before designing your overall architecture.