Scaling Up or Scaling Out? Part Two

In part 1 of this post, I covered how SQL Server handles scaling up.  We talked about how quickly it becomes expensive to add more CPU power, memory, and storage throughput to our database servers.  Today, we’re going to focus on a different way to scale.

When most folks talk about scaling out SQL Server, they’re adding more SQL Servers to the infrastructure and dividing the work between them.  I want you to take a bigger step back, though, and ask the question, “What are the different things we’re asking SQL Server to do?”  The answer isn’t just storing data, either:

  1. We insert data – and not just data, but different kinds of data
  2. We update data
  3. We delete data
  4. We do some processing of that data
  5. We retrieve the data – and we want to include the different ways we retrieve it, and with which tools
  6. We check the data to make sure it’s still correct
  7. We back up the data – or sometimes we don’t, because it’s easier to rebuild from source

By the time our application gets large enough for the word “scale” to get thrown around, we’ve usually got several different kinds of data.  Each of those data groups has different answers to the seven questions above.  Let’s take a common type of data – orders for a web site.  At first glance, a web site might seem like it has just one type of data, but here are several types for a system I tuned recently:

  • Items – product details about what we sell.  This data is only periodically updated, and we’re not the primary source for this data.  Our manufacturers update the data via file feeds they send us.  Reads have to be absurdly fast, and the load will be very, very high.
  • Price rules – no, one price does not fit everybody.  We run sales, referral programs, and discounts for our bulk buyers.  This data may not need frequent updates, but when updates happen, they have to be available everywhere instantaneously.  Otherwise, a pricing error waiting for a rollout might cost us millions.
  • Reviews – end users can add their own reviews about our items.  We’re the primary source for this data, but it doesn’t have to be transactionally consistent with our sales data.  We can afford to lose some of this data.  Reads have to be fairly fast, and the data can be completely out of date.
  • Shopping carts – transient data about people shopping at any given time.  We’re the primary source for this.  It’s insert-focused, and data has to be up-to-the-second.  We could lose all of this data at any time without a serious outage, but while it’s down, sales stop.
  • Orders placed – this is the good stuff.  We’re the primary source for this, and it absolutely, positively has to be completely consistent.  Our credit card records, items ordered, and shipment addresses have to be complete, and they have to be available across multiple datacenters simultaneously.

If we take each of those data profiles and put our seven questions to it, we could get completely different answers.  Of particular interest is question 5 – how we retrieve the data.  At first glance, we give answers like, “When we retrieve orders placed, we need to join to items to see what they bought.”  There’s an important scalability part of that question, though – how old can the related data (items) be when we query the orders placed?  If I’m querying an order from yesterday, I can probably live with yesterday’s copy of Items.

Scaling Back Our Demands on SQL Server

As we dissect our data into different types, start exploring the needs of each type, and stay completely honest with ourselves, we’ll probably discover that a relational database like SQL Server might not be the best option for everything.  I’m a Microsoft SQL Server cheerleader, but if I’m going to get it to scale, I have to be honest about its strengths and weaknesses.  Pushing a technology to its limits isn’t always the right answer – especially if there’s another technology at hand that can do the job with less effort.

Separating this data off the SQL Server and onto other servers is a method of scaling out – throwing more hardware and services at our business problems.  StackOverflow’s first iteration of search used SQL Server’s built-in full-text search, and we ran into scaling challenges.  SQL Server 2008 moved full-text search into the engine, and this introduced some concurrency issues that we weren’t able to solve in a cost-effective way.  I could have recommended that they build out a replication or log shipping infrastructure and start querying slightly stale copies of the database, but that just didn’t make sense for their needs.  They were armed to the teeth with web server hardware and guys who knew how to write code, so why not scale ourselves right out of SQL Server?  They switched to Lucene, and I (as a DBA guy) have been happy ever since.  Are the programmers happy?  Who cares?  Not me – I’m able to help them focus on what SQL Server really does well.

There are going to be parts of our data storage solution that absolutely require a relational database with transaction support.  Our example business needs the orders-placed data to leverage all of the data integrity features built into SQL Server.
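
To make that concrete, here’s a minimal sketch of what leaning on those integrity features looks like.  The table and column names are hypothetical, but the pattern – an explicit transaction wrapping the order header and its line items, with any error rolling the whole thing back – is the part that matters:

    CREATE PROCEDURE dbo.PlaceOrder
        @CustomerID  INT,
        @ShipAddress NVARCHAR(200)
    AS
    BEGIN
        SET NOCOUNT ON;
        SET XACT_ABORT ON;  -- any error rolls the whole transaction back

        BEGIN TRANSACTION;

        INSERT INTO dbo.Orders (CustomerID, ShipAddress, OrderDate)
        VALUES (@CustomerID, @ShipAddress, GETUTCDATE());

        DECLARE @OrderID INT = SCOPE_IDENTITY();

        -- Copy the customer's cart into the order.  A foreign key from
        -- OrderItems to Orders means line items can never outlive their header.
        INSERT INTO dbo.OrderItems (OrderID, ItemID, Quantity, UnitPrice)
        SELECT @OrderID, ItemID, Quantity, UnitPrice
        FROM dbo.ShoppingCartItems
        WHERE CustomerID = @CustomerID;

        COMMIT TRANSACTION;
    END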

Scaling Out SQL Server’s Remaining Work

Once we’ve pared down our list of demands, we can make different architectural decisions about how to add more hardware into the mix.  Some of the most common scale-out infrastructures I’ve seen include:

Using bidirectional or merge replication – this allows two SQL Servers to handle writes to the same database, same tables, at the same time.  Changes are replicated between the servers so that within a few seconds, both servers will have the same records.  Schema designs that rely on identity fields can run into trouble here, and we have to compensate by using identity fields with different seeds – one server uses odd numbers, the other uses even.  I only recommend replication as a scale-out method when the client has a database team on duty around the clock, because when this thing breaks, it breaks hard.  A DBA has to be available to get the replication latency alerts, jump on the problem, and fix it before a performance-killing reinitialization is required.
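
For example, here’s one way to stagger the identity ranges – a sketch with a hypothetical Orders table, where server A hands out odd OrderIDs and server B hands out even ones so replicated inserts never collide:

    -- On server A: odd OrderID values (seed 1, increment 2)
    CREATE TABLE dbo.Orders
    (
        OrderID    INT IDENTITY(1, 2) NOT NULL PRIMARY KEY,
        CustomerID INT NOT NULL,
        OrderDate  DATETIME NOT NULL
    );

    -- On server B: even OrderID values (seed 2, increment 2)
    CREATE TABLE dbo.Orders
    (
        OrderID    INT IDENTITY(2, 2) NOT NULL PRIMARY KEY,
        CustomerID INT NOT NULL,
        OrderDate  DATETIME NOT NULL
    );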

Putting several read-only SQL Servers behind a load balancer – if our primary bottleneck is reads (like reporting queries or a lack of app-tier caching), we can build several SQL Servers that are refreshed via log shipping or SAN snapshots.  End users or app servers don’t access SQL Server for reads directly – they get a server name that points to the load balancer hardware (like an F5 Big-IP), and the load balancer redirects that connection to an available SQL Server automatically.  Every X minutes, we pull a server out of the farm by telling the load balancer it’s no longer open for accepting connections.  After the last remaining query finishes, we refresh the data by applying new transaction logs or mounting a new SAN snapshot.  We then put it back into the load balancer pool, and the load balancer starts sending user connections to it.  This is a lot of moving parts, and while it’s all automated by scripts, scripts can still break.  Depending on the robustness of our scripts and our DBA team, we might be able to get away with this solution without an around-the-clock DBA team, but people still have to be on call.
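
If we go the log shipping route, the refresh step on each read-only server boils down to something like this – a sketch with hypothetical database and file names, using WITH STANDBY so the database stays readable between restores:

    -- Run after the load balancer has drained connections from this server.
    -- WITH STANDBY leaves the database in a read-only, queryable state.
    RESTORE LOG SalesDB
        FROM DISK = N'\\logship\SalesDB\SalesDB_latest.trn'
        WITH STANDBY = N'D:\MSSQL\Undo\SalesDB_undo.dat';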

Using SQL Server Denali (2011) AlwaysOn – this one isn’t actually available yet, but when this new version of SQL Server comes out, it holds the promise of up to 4 read-only replicas.  Apps can declare a read-only intent in the connection string, and the production SQL Server will redirect them to one of the available replicas.  Up to 2 of the replicas can be synchronous, although I don’t think many customers will opt for live reads from a synchronously updated SQL Server – the overhead will slow down production transactions.  You can read more at my post on Denali AlwaysOn.
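
Based on what’s been announced so far (this is pre-release syntax, so the details may change before it ships), configuring a readable replica and pointing an app at it looks roughly like this – the availability group and server names are hypothetical:

    -- Allow read-only connections on a secondary replica.
    ALTER AVAILABILITY GROUP SalesAG
    MODIFY REPLICA ON N'SQLREPLICA1'
    WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));

    -- The application then declares its intent in the connection string, e.g.:
    --   Server=tcp:SalesListener;Database=SalesDB;ApplicationIntent=ReadOnly;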

Using third-party scale-out products – I’m always leery of adding third-party infrastructure to scale out SQL Server because it’s just so darned difficult.  Xkoto Gridscale sounded like the brightest hope in a while, but that was discontinued when Teradata bought Xkoto.  I don’t know of any other reliable technology that I’d trust to pull it off, and maybe more importantly, that I’d trust with my long-term business model.

The first two solutions (replication and load-balanced read-only servers) don’t require anything unsupported, so they’re a safe bet.  Denali might not be seen as a safe bet because it’ll be a version 1 technology when it ships, but if you need to scale out, you can’t afford not to investigate it.  I’d even argue that some of my scale-up customers (like data warehouses) would be well-served by kicking AlwaysOn’s tires – if you could shed 50% of your load by moving your read-only queries away from your >8-CPU server currently running SQL Server Datacenter Edition, you could easily save hundreds of thousands of dollars in licensing.  If you’re currently running a 4-CPU box and you’re worried about headroom, Denali might be your silver bullet to avoid a big expenditure.

12 Comments

  • Okay, this made me feel guilty for being so demanding on the poor thing! It’d be nice to have 4 read-only replicas just lying around…

  • Tripp Southern
    March 9, 2011 10:26 am

    Brent, this is pure gold and exactly why I read your posts daily. This article couldn’t be any more topical. I’ve only read it once so far, but I plan on memorizing it and mapping your “seven questions” to the data profiles of my own interconnected SQL/Oracle application environments.

    Excellent stuff. You, sir, make me a better DBA. I feel like I should be giving you a percentage of my raises. Instead, I’ll give a little more to charity this year in your honor. Thanks for being so prolific and keep it up.

  • Great article. I think for items data, which has a very high read load, the new columnar storage being introduced in Denali might be beneficial, especially if aggregate queries are being run. The shopping cart data might be best suited to a NoSQL solution like Amazon’s Dynamo database. Reviews could also be handled better in a NoSQL solution that sacrifices read consistency for performance, but using replication or SAN snapshots to scale out might be fine too. I do think you could go crazy using the “ideal” tool for each job and end up with a search engine, a couple of NoSQL DBs, and merge replication to maintain for each data type, incurring heavy support costs where SQL Server might be a good enough solution in most cases. This is a great topic, and I hope you continue to discuss the various options and trade-offs, since the root cause of poor db performance is really not picking the right tool for the job.

  • Great post, thanks.
    I think that replication is by far the best option for scaling out today.
    Keep in mind that aside from HADR, we will also get SQL Azure’s FEDERATIONS feature. It will allow unbelievable scale-out/sharding capabilities, up to thousands of nodes. Quite similar to NoSQL, only on SQL…

  • Putting an F5 in front of a pool of SQL Servers that get data pushed out to them via transactional replication/log shipping works really well for large search catalogs being hit by thousands of users. I supported this in practice for three years and it’s a very solid, reliable architecture, which allows servers to be taken out of use for service packs and other maintenance with minimal disruption to the end user. And configuring the F5s to bring servers in or out of the pool is a doddle.

    The only downside I experienced was that if you’ve a catalog that’s huge in size, it can take a while for the replication to catch up when the server’s brought back online, likewise when a new server’s introduced into the pool. No big deal if you’re monitoring search response times and scale out before it’s needed.
    It’s a very sound technical solution to the scale out problem, proven in practice. Nice article Brent, that’s given me some more ideas 🙂

  • Hi Brent,

    I’ve read your article about Scaling Out. What do you think about handling a large number of concurrent Reporting Services users in SQL Server 2012 R2? Is it better to scale out the Reporting Services server instead of scaling up?

    Thanks
    Octoni

    • Octoni – typically you want to scale out with virtual machines.

      • Hi Brent,

        Is there any specific official statement on the Microsoft website that explains in more detail how to scale out SSRS to handle a large number of concurrent users, specifically in version 2012? I’ve heard that version 2012 can handle unlimited concurrent users.

        Thanks

  • Hi Mr Ozar! Any update to this article for the latest version/upgrade?

  • Ademir – no, I tend to not update every blog post every year. 😉 Hope you enjoy the free posts though!

