The Basics of Database Sharding

SQL Server

There are many ways to scale out your database; many of these techniques require advanced management and expensive add-ons or editions. Database sharding is a flexible way of scaling out a database. In this presentation, Jeremiah Peschka explains how to scale out using database sharding, covers basic techniques, and shares some of the pitfalls. This talk is for senior DBAs, database architects, and software architects who are interested in scaling out their database.

More resources are available over in our sharding article.

Previous Post
5 Things About Fillfactor
Next Post
Write a Note and Shoot Yourself Today

5 Comments. Leave new

  • You said you avoid linked servers where possible, but…

    Where do distributed partitioned views fit in the sharding discussion (Enterprise Edition also supports updating them*)? If at all.

    They *sound* like a natural fit for this problem, and it could save changes to an application.


    • Jeremiah Peschka
      May 4, 2013 10:07 pm

      In my opinion, they don’t fit into the discussion at all. Part of the purpose of sharding/federation is to split data into smaller, more manageable chunks where we can guarantee performance characteristics. When you bring linked servers and the network into play, you bring an element of unpredictability into the equation: you don’t know if a server, NIC, or router is down or if another server is under going maintenance. Plus, we can’t predict query performance even in perfect conditions with linked servers – it can be difficult to determine if a join is going to be remoted or not.

      Inserts and updates to partitioned views can also end up causing probes – SQL Server will probe each “partition” looking for the partition responsible for the row. This is expensive enough with only one server involved and gets very expensive when that probe occurs across the network.

      If you’re working with OLTP, you need to make sure you’re sharding in a way that keeps queries local to one server. If you can do that, you only need to change your connection mechanism to route people to the correct server.

      If your concern is about reporting, you should consider a dedicated reporting server that is designed to handle these workloads. An enterprise data warehouse will help out in this situation.

      What kind of code changes are you worried about?

      • Thanks for the response. Sorry in advance for the walls-of-text (it’s mostly quotes from BOL)

        > Inserts and updates to partitioned views can also end up causing probes … [this] is expensive enough with only one server involved and gets very expensive when that probe occurs across the network.

        Out of my depth here, but the documentation for distributed partitioned views has some suggestion that SQL Server might avoid this where possible.

        > The range of values in each member table is enforced by a CHECK constraint on the partitioning column, and ranges cannot overlap.

        > …

        > Setting the lazy schema validation option, by using sp_serveroption, for each linked server definition that is used in distributed partitioned views. This optimizes performance by making sure the query processor does not request metadata for any one of the linked tables until data is actually needed from the remote member table.

        > The SQL Server query processor optimizes the performance of distributed partitioned views. The most important aspect of distributed partitioned view performance is minimizing the amount of data transferred between member servers.

        > …

        > The query processor first uses OLE DB to retrieve the CHECK constraint definitions from each member table. This allows the query processor to map the distribution of key values across the member tables.

        > The query processor compares the key ranges specified in an SQL statement WHERE clause to the map that shows how the rows are distributed in the member tables. The query processor then builds a query execution plan that uses distributed queries to retrieve only those remote rows that are required to complete the SQL statement.

        With regards to code changes, it wasn’t so much concern about changing the code, but perhaps a reasonable way to avoid changing the code *at all*. Instead of the application having the smarts and connecting to each server that stores the data, it could connect to one and each sharded table would be represented as a distributed partitioned view.

        • Jeremiah Peschka
          May 8, 2013 7:15 am

          Great points. In theory, practice should be the same as theory. In practice, they’re not the same.

          If you’re routing all reads and writes through the same “master” SQL Server, you’ll need to make sure that you can scale the network pipe on that one SQL Server to handle the load from all of your applications. One of the benefits of sharding is that you can spread all load across multiple servers. Using a master server with a distributed partitioned view still requires that you beef up the central master server and that it may become your primary bottleneck for network throughput and, potentially, tempdb performance.

          Ultimately, I suggest trying this out at a small scale, but I suspect you’ll find that the distributed partitioned view scenario is disappointing.

          • I didn’t consider the implications of network bandwidth for the one “master” SQL Server. You’re right; it would probably lead to disappointment.

            Thanks for the insight.

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.