In part 1 of this post, I covered how SQL Server handles scaling up. We talked about how quickly it becomes expensive to add more CPU power, memory, and storage throughput to our database servers. Today, we’re going to focus on a different way to scale.
When most folks talk about scaling out SQL Server, they mean adding more SQL Servers to the infrastructure and dividing the work between them. I want you to take a bigger step back, though, and ask the question, “What are the different things we’re asking SQL Server to do?” The answer isn’t just storing data, either:
- We insert data – and not just data, but different kinds of data
- We update data
- We delete data
- We do some processing of that data
- We retrieve the data – including the different ways we retrieve it, and with which tools
- We check the data to make sure it’s still correct
- We back up the data – or sometimes we don’t, because it’s easier to rebuild from source
By the time our application gets large enough for the word “scale” to get thrown around, we’ve usually got several different kinds of data. Each of those data groups has different answers to the seven questions above. Let’s take a common type of data – orders for a web site. At first glance, a web site might seem like it has just one type of data, but here are several types from a system I tuned recently:
- Items – product details about what we sell. This data is only periodically updated, and we’re not the primary source for this data. Our manufacturers update the data via file feeds they send us. Reads have to be absurdly fast, and the load will be very, very high.
- Price rules – no, one price does not fit everybody. We run sales, referral programs, and discounts for our bulk buyers. This data may not need frequent updates, but when updates happen, they have to be available everywhere instantaneously. Otherwise, a pricing error waiting for a rollout might cost us millions.
- Reviews – end users can add their own reviews about our items. We’re the primary source for this data, but it doesn’t have to be transactionally consistent with our sales data. We can afford to lose some of this data. Reads have to be fairly fast, and the data can be completely out of date.
- Shopping carts – transient data about people shopping at any given time. We’re the primary source for this. It’s insert-focused, and the data has to be up-to-the-second. We could lose all of this data at any time without a serious outage, but while it’s down, our sales stop.
- Orders placed – This is the good stuff. We’re the primary source for this, and it absolutely, positively has to be completely consistent. Our credit card records, items ordered, and shipment addresses have to be complete, and they have to be available across multiple datacenters simultaneously.
If we take each of those data profiles and put our seven questions to them, we get completely different answers. Of particular interest is question 5 – how we retrieve the data. At first glance, we give answers like, “When we retrieve orders placed, we need to join to items to see what they bought.” There’s an important scalability angle to that question, though – how old can the related data (items) be when we query the orders placed? If I’m querying an order from yesterday, I can probably live with yesterday’s copy of Items.
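To make that concrete, here’s a rough sketch (all database, table, and column names are hypothetical): if yesterday’s copy of Items is good enough, the order-history query can join against a log-shipped, read-only standby copy of the catalog instead of hammering the production Items table:

```sql
-- Hypothetical names. Sales is the production OLTP database;
-- ItemsStandby is a log-shipped copy of the catalog, refreshed nightly.
SELECT o.OrderID, o.OrderDate, i.ItemName, i.ListPrice
FROM Sales.dbo.Orders AS o
JOIN ItemsStandby.dbo.Items AS i
    ON i.ItemID = o.ItemID
WHERE o.OrderDate >= DATEADD(DAY, -1, GETDATE());
```

The point isn’t this particular query – it’s that once we admit the related data can be a day old, the read load on it no longer has to live on the same server as our transactional writes.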
Scaling Back Our Demands on SQL Server
As we dissect our data into different types, start exploring the needs of each type, and stay completely honest with ourselves, we’ll probably discover that a relational database like SQL Server might not be the best option for everything. I’m a Microsoft SQL Server cheerleader, but if I’m going to get it to scale, I have to be honest about its strengths and weaknesses. Pushing a technology to its limits isn’t always the right answer – especially if there’s another technology at hand that can do the job with less effort.
Separating this data off the SQL Server and onto other servers is a method of scaling out – throwing more hardware and services at our business problems. StackOverflow’s first iteration of search used SQL Server’s built-in full-text search, and they ran into scaling challenges. SQL Server 2008 moved full-text search into the database engine, and this introduced some concurrency issues that we weren’t able to solve in a cost-effective way. I could have recommended that they build out a replication or log shipping infrastructure and start querying slightly stale copies of the database, but that just didn’t make sense for their needs. They were armed to the teeth with web server hardware and guys who knew how to write code, so why not scale ourselves right out of SQL Server? They switched to Lucene, and I (as a DBA guy) have been happy ever since. Are the programmers happy? Who cares? Not me – I’m able to help them focus on what SQL Server really does well.
There are going to be parts of our data storage solution that absolutely require a relational database with transaction support. Our example business needs the orders-placed data to leverage all of the data integrity features built into SQL Server.
Scaling Out SQL Server’s Remaining Work
Once we’ve pared down our list of demands, we can make different architectural decisions about how to add more hardware into the mix. Some of the most common scale-out infrastructures I’ve seen include:
Using bidirectional or merge replication – this allows two SQL Servers to handle writes to the same database, same tables, at the same time. Changes are replicated between the servers so that within a few seconds, both servers have the same records. Schema designs that rely on identity fields can run into trouble here, and we have to compensate by seeding the identity fields differently – one server uses odd numbers, the other uses even. I only recommend replication as a scale-out method when the client has an around-the-clock database team, because when this thing breaks, it breaks hard. A DBA has to be available to get the alerts about replication latency, jump on the problem, and fix it before a performance-killing reinitialization is required.
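The odd/even identity trick looks like this on each side – IDENTITY takes a seed and an increment, so the two servers can hand out interleaved, non-colliding keys (table and column names here are made up for illustration):

```sql
-- Server A: identities start at 1 and step by 2, so A hands out odd numbers
CREATE TABLE dbo.Orders (
    OrderID INT IDENTITY(1, 2) NOT NULL PRIMARY KEY,
    OrderDate DATETIME NOT NULL
);

-- Server B: identities start at 2 and step by 2, so B hands out even numbers
CREATE TABLE dbo.Orders (
    OrderID INT IDENTITY(2, 2) NOT NULL PRIMARY KEY,
    OrderDate DATETIME NOT NULL
);
```

Plan the increment for the number of writable servers you’ll ever have – widening it later (say, from 2 to 4 servers) means touching every table’s identity seeding again.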
Putting several read-only SQL Servers behind a load balancer – if our primary bottleneck is reads (like reporting queries or a lack of app-tier caching), we can build several SQL Servers that are refreshed via log shipping or SAN snapshots. End users or app servers don’t access SQL Server for reads directly – they get a server name that points to the load balancer hardware (like an F5 Big-IP), and the load balancer redirects that connection to an available SQL Server automatically. Every X minutes, we pull a server out of the farm by telling the load balancer it’s no longer accepting connections. After the last remaining query finishes, we refresh the data by applying new transaction logs or mounting a new SAN snapshot. We then put it back into the load balancer pool, and the load balancer starts sending user connections to it. This is a lot of moving parts, and while it can all be automated with scripts, scripts can still break. Depending on the robustness of our scripts and our DBA team, we might be able to get away with this solution without an around-the-clock DBA team, but people still have to be on call.
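The refresh step on each read-only server boils down to a log restore like this (database and file paths are hypothetical; the load-balancer drain itself happens outside SQL Server):

```sql
-- Apply the newest transaction log to the standby copy. WITH STANDBY
-- leaves the database readable between restores, which is what lets
-- this server serve queries at all. The restore needs exclusive
-- access - that's why the load balancer drains connections first.
RESTORE LOG Sales
    FROM DISK = N'\\backupshare\Sales_latest.trn'
    WITH STANDBY = N'D:\MSSQL\Sales_undo.dat';
```

Each of these restores is one of the “moving parts” – if a log file is missed or applied out of order, that server has to be reinitialized from a full backup before it can rejoin the pool.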
Using SQL Server Denali (2011) AlwaysOn – this one isn’t actually available yet, but when this new version of SQL Server comes out, it holds the promise of up to 4 read-only replicas. Apps can declare a read-only intent in the connection string, and the production SQL Server will redirect them to one of the available replicas. Up to 2 of the replicas can be synchronous, although I don’t think many customers will opt for live reads from a synchronously updated SQL Server – the overhead will slow down production transactions. You can read more at my post on Denali AlwaysOn.
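Based on the Denali CTP material, the read-only intent is declared right in the connection string – something along these lines (server and database names are made up):

```
Server=tcp:SalesListener,1433;Database=Sales;Integrated Security=SSPI;ApplicationIntent=ReadOnly;
```

The appeal is that the routing decision lives in configuration rather than code – the app asks for “somewhere I can read,” and SQL Server picks the replica.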
Using third-party scale-out products – I’m always leery of adding third-party infrastructure to scale out SQL Server because it’s just so darned difficult. Xkoto Gridscale sounded like the brightest hope in a while, but that was discontinued when Teradata bought Xkoto. I don’t know of any other reliable technology that I’d trust to pull it off, and maybe more importantly, that I’d trust with my long-term business model.
The first two solutions (replication and load balanced read-only servers) don’t require anything unsupported, so they’re a safe bet. Denali might not be seen as a safe bet because it’ll be a version 1 technology when it ships, but if you need to scale out, you can’t afford not to investigate it. I’d even argue that some of my scale-up customers (like data warehouses) would be well-served by kicking AlwaysOn’s tires – if you could shed 50% of your load by moving your read-only queries away from your >8-CPU server currently running SQL Server Datacenter Edition, you could easily save hundreds of thousands of dollars in licensing. If you’re currently running a 4-CPU box and you’re worried about headroom, Denali might be your silver bullet to avoid a big expenditure.