Back in 2009 (wow, seems like only yesterday!), I wrote about designing a recovery strategy for Stack Overflow. Back then, I wrote:
With these answers in mind, Stack Overflow’s decisions not to do transaction log backups, offsite log shipping, database mirroring, and so on make good business sense. Us geeks in the crowd may not like it, and we might demand the latest and greatest in backup & recovery technology, but at the same time we want Stack Overflow to remain free. As their volunteer DBA, I’d love to do 24×7 log shipping or database mirroring to a secondary server at another colo facility – but I wouldn’t be willing to pay out of my own pocket for expenses like that.
Today, the situation is totally different. They’re in the top 50 web networks, 4.4 million users, a job posting network, dozens of crazy smart employees, and I’m not even close to a volunteer DBA for them anymore. (Heck, we’ve even paid to advertise on Stack Exchange.) I hop into the company chat rooms now and then, and I swing by the offices whenever I’m in New York, but these guys don’t need me. I jump in whenever I can for fun, because it really is fun working with engineers this sharp.
That means this time, I’m blogging about designing Stack’s recovery strategy more as an outsider’s perspective. I know you folks like reading real-life case studies, and Stack’s pretty open about their infrastructure, so this will be fun for all.
What Changed? Downtime became more expensive.
If you’ve looked at the Stack Exchange team page, you’ll notice dozens of people with the word “sales” in their job title. Stack now sells ads on the Q&A sites, plus sells job postings to companies on Careers.StackOverflow.com.
There’s real money going through the network now, and downtime starts to cost more money. If the sites are down, people may go back to Google, get their answers from another site <shudder>, and there goes some ad revenue.
This meant that at least a few databases – for ads and careers – we needed to do full recovery mode in SQL Server, and start doing transaction log backups. This didn’t start across-the-board – it started only with the most high-value databases.
As the company grew, the relative cost of standby SQL Servers in a different location started to drop. Downtime seemed more expensive, and interestingly, the actual price of the standby SQL Servers started to drop. As Stack Exchange added more systems administrators, it wasn’t really much extra work for these guys to manage a few extra database servers in other locations. And as long as we’ve got extra database servers somewhere else, kept up to date with the most recent data, we might as well put ’em to use.
What else Changed? We Scaled Out.
Stack Exchange’s API lets the public run queries against the databases in real time (and starting with API v2.1, they can even write). For a demo of how it works (but not using the live database), try out Data.StackExchange.com. For example, here’s the most recent 10 questions from Stack Overflow:
Yes, Virginia, that’s really an Execution Plan tab at the bottom with the actual plan from your query:
Like many of Stack Exchange’s tools, Data Explorer is completely open source, so you can install this same tool in your own environment if you’d like to let internal power users query your databases without having to install SQL Server Management Studio or a reporting tool.
Enter SQL Server 2012’s AlwaysOn Availability Groups
SQL Server 2012’s AlwaysOn Availability Groups allow for multiple replicas to serve 2 purposes at Stack Exchange: easier failover to remote data centers with minimal data loss, and read-only capabilities out of those remote data centers.
I’m a huge fan of AlwaysOn AGs, but they’re like the opposite of “the easy button.” Sure, you can offload read-only queries, backups, and DBCCs to secondary replicas, but you also have to introduce a lot of new dependencies like clustering, Windows Active Directory, heartbeats, and quorums. After Stack and a few of my other clients went live with the early 2012 versions, I started jokingly calling it OftenOn – the high availability functionality ended up being a source of a lot of downtime.
Stack worked with Microsoft for months to troubleshoot sporadic outages, resulting in this Windows 2008R2 hotfix and some other fun discoveries. After a lot of challenges, Stack (and most of my other AG clients) ended up moving to Windows Server 2012 for their SQL clusters because so many of the clustering problems were fixed with rewritten clustering code.
The big gotcha, though, was that if any replica loses its network connectivity to any of the other replicas, all of the involved databases will drop offline. This is not a bug – this is the desired result.
Well, this was Microsoft’s desired result. It sure wasn’t anybody else’s, ha ha ho ho, and thankfully those nice folks at Microsoft decided the behavior would be different in SQL Server 2014. Now, if a node loses connectivity, its AG-involved databases continue to stay online.
Stack Exchange’s Upgrade to SQL Server 2014
As Nick Craver writes in his post about running SQL Server 2014 in production at Stack, this advantage was absolutely killer for Stack’s uptime goals. The Stack developers have done a killer job at coding the sites – if they detect that the SQL Server databases are in read-only mode, all of the involved sites fail into a read-only mode too, along with a nice polite banner informing the reader that they can’t make any changes right now.
The Stack engineers, God bless ’em, are so dedicated to making IT better that they’re not just willing to document their infrastructure to help others, plus build open source tools to help people, but they’re even willing to put their production web site in the hands of preview code. So earlier in November, Nick Craver and Steven Murawski did a rolling upgrade of their production clusters to SQL Server 2014.
Each cluster involves 3 nodes. Take the StackOverflow cluster:
- NY-SQL01 – the primary replica handling all incoming write connections. Sits in Manhattan.
- NY-SQL02 – the asynchronous secondary replica sitting next to it in the same cage. NY-SQL01 copies transaction log data off to this server gradually in the background. In the event of a localized failure on NY-SQL01, the admins can manually fail over to NY-SQL02 with some data loss.
- OR-SQL01 – the asynchronous secondary replica sitting in Oregon. NY-SQL01 also copies data here in the background, but due to the vagaries of wide area networks, it can be farther behind than NY-SQL02. To leverage extra hardware in Oregon, Data Explorer can be hosted in Oregon’s web servers using this read-only capacity.
All three were running SQL Server 2012 on Windows Server 2012. Unfortunately, with Windows clustering, we can’t upgrade any of the nodes in-place to a new version of Windows (2012 R2), so if we wanted to deploy that, we’d have to tear down some of the existing nodes temporarily. That was a lot of work for relatively little gain. There wasn’t a need for new hardware, either – the database servers typically run under 10% CPU, and they can cache everything they need in memory.
Since we could keep the same OS, the SQL Server 2014 upgrade process looked like this:
- Upgrade one of the readable replicas (in our case, NY-SQL02) to SQL 2014. From this point forward, replication stops to the other nodes. They can’t apply data from a newer SQL Server version, so they’ll just kinda freeze in limbo. We could still fail back to those, but we would lose all data changes made from this point forward.
- Test the apps in read-only against the newly upgraded replica.
- Test the apps in writeable mode against the newly upgraded replica.
- Make a go/no-go decision. If no-go, fail back to the original replica (NY-SQL01) and send the upgraded replica to detention. If go, start upgrading the other readable replicas. As they come online with SQL Server 2014, the AlwaysOn AG replication will catch them back up.
- Optionally, fail back to NY-SQL01 as a primary.
- Monitor the servers and the plan cache looking for queries getting unexpectedly poor execution plans. (Remnant of our SQL 2008 upgrade, and subsequent full text search problems.)
The upgrade went really smoothly, barring one technical issue involving a combination of the installer and Windows Core. The installer assumes that a particular feature will be installed (whether you need it or not), and that feature requires the full Windows GUI, so it fails on Windows Core. The “fix” was to tell the installer to skip feature compatibility checks.
I was online purely for comic relief. I think I made the sum total of one technical contribution during the entire call, and I left shortly after the go/no-go decision. Nick and Steven are top notch admins. Having said that, I wouldn’t recommend this DO-IT-LIVE! upgrade approach for most companies.
Should You Follow Suit with SQL 2014?
The Stack Exchange team is very in tune with what’s happening in their SQL Server environment. They know exactly which queries are the top 20 resource users, what those execution plans look like, and when a new query pops up to the top of the list. If SQL Server suddenly produces different query plans for a frequent query, these guys know how to work around it. They have great relationships with the dev teams, instant access to source code, and easy, automatic deployments to production to fix query problems.
I’m thankful that there’s cutting-edge shops out there like Stack Exchange that dedicate so much talent to managing their database stack and give so much back to the community (and to Microsoft). They’re helping make sure SQL 2014 is a solid product when it eventually hits RTM, and hopefully they’ll iron out any surprise bugs before folks like you deploy it in production.
If you believe your shop has what it takes to run SQL 2014 in production, you can join Microsoft’s early-access programs to get access to more frequent builds, extra support contacts, and community experts to help with the deployments. Contact us to get started.
It would be lovely to understand how data moved from the active node ie SQL 01 to SQL 02 that was upgraded first and how a failover was done to the 2014 nodes while the last 2012 node was being upgraded.
Its also nice to know how much down time will be incurred during this process . . .
I have 3 node AAGs and I want to upgrade with no downtime, I hope to fail over to an upgraded node and update the passive nodes to the newest version.
I love what you fellas are often upward way too. This sort of intelligent operate and also coverage! Carry on the actual fantastic performs fellas I’ve additional you folks to help our blogroll.