StackOverflow went through a few infrastructure upgrades this week. Jeff Atwood blogged about the changes, and Kyle Brandt goes into more detail. Here’s a short recap of some of the SQL Server changes:
- Changed physical server models (from IBM/Lenovo to Dells)
- Went from two quad-core Xeons to two six-core Xeons
- Enabled hyperthreading (so it looked like 24 cores)
- Went from SQL Server 2008 to SQL Server 2008 R2
- Went from 6 SATA drives to 8 SATA drives (yes, they’re really doing over 2 million page views a day using a handful of cheap drives, but it’s all in the l33t programming, because they do their own caching)
- Did some index tuning
- Moved from one datacenter to another
That’s a lot of moving parts. When the dust settled, the CPU use on the SQL Server had gone up, not down. So where do we start troubleshooting?
I love good developers because they do unit testing. (Yes, readers, that means if you’re not doing unit testing, you’re not a good developer.) The most reliable way to troubleshoot is to boil things down to the simplest repeatable test possible. They identified two different queries that had dramatically different results between the old server and the new server. I looked at the execution plans produced by these queries on both the old and new servers. Everything was identical – same plans, similar reads, similar results – except for one thing: CPU time. You can measure CPU time for a query by executing SET STATISTICS TIME ON in SQL Server Management Studio. From that point forward, all queries in that session will have a bonus set of metrics in the Messages tab listing the milliseconds of CPU time used.
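In a query window, that looks like this (the query itself is just a stand-in — substitute one of your own):

```sql
-- Turn on per-query CPU and elapsed-time reporting for this session.
SET STATISTICS TIME ON;

-- Any stand-in query will do; use one of your own problem queries here.
SELECT COUNT(*)
FROM sys.objects;

-- The Messages tab now includes lines along these lines:
--   SQL Server Execution Times:
--      CPU time = 15 ms,  elapsed time = 21 ms.
```

Run the same query on both servers and compare the CPU time numbers, not just the elapsed time.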
The same query doing the same amount of work took about twice as many milliseconds of CPU time on the new server as it did on the old server. That meant we had two possibilities: either something under the hood of SQL Server 2008 R2 had gone horribly awry and was doing twice as much CPU work for the exact same execution plan, or the new CPUs were twice as slow. The server manufacturers said the new CPUs ran at 3.33GHz – but did they really? I busted out the trusty free CPU-Z tool and ran it on the new servers:
Compare the “Specification” line to the “Core Speed” line. As we say in the business, wah wah wah waaaaah.
The new CPUs were Intel’s diabolical power-saving Nehalems, which have given me no end of grief. My hat is off to them for saving the planet by using less power, but my pants are down to them for killing performance without informing the users. In the last few weeks, I’ve seen several cases where server upgrades have resulted in worse performance, and one of the key factors has been throttled-down CPUs. In theory, the servers should crank up the juice according to demand, but in reality, that’s rarely the case. Server manufacturers are hiding power-saving settings in the BIOS, and Windows Server ships with a default power-saving option that throttles the CPU down way too often.
I was able to prove the problem by running several simultaneous load-test queries that push SQL Server’s CPU hard. When running just one or two of them, the processors didn’t speed up, but if I ran enough of them, the server said, “Alright, alright, I get it. I’ll speed my cores up. Jeez.”
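These weren't StackOverflow's actual queries, but any CPU-heavy query works for this kind of test. Here's a hypothetical burner along those lines — open it in several SSMS windows at once while watching the Core Speed line in CPU-Z:

```sql
-- Hypothetical CPU burner (not the actual load-test queries).
-- Each copy generates a big cross join that's pure CPU work;
-- open several sessions at once to load multiple cores.
SELECT COUNT_BIG(*)
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b
CROSS JOIN sys.all_objects AS c
OPTION (MAXDOP 1);  -- force a serial plan so each session loads roughly one core
```

With only one or two sessions running, the clock speed stays throttled; add more and watch whether the cores finally ramp up.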
The instant I reduced the number of running queries, the CPUs got lazy again. If you think that's bad, there might be servers in your datacenter right now running at just 800MHz, and they won't even speed up when they're under load.
There are a few morals to the story:
Periodically test your CPUs with CPU-Z. Know their spec speed and their current speed. Saving power by throttling down is fine, but understand that they may not run at full speed until they're under very heavy load. StackOverflow was consistently running 50-60% CPU load, and the processors still weren't speeding up. Since the power-saving speed was 1.6GHz, their CPU-intensive queries were literally running half as fast as they did on the old server.
When you change your infrastructure, minimize variables. If you change half a dozen things at once, troubleshooting performance is much more difficult. I used to work at a company that did periodic outages to fix infrastructure issues like battery backups and air conditioning, and one member of our IT team loved to take advantage of those excused outages to upgrade all his servers. More memory, different network cabling designs, new software versions, you name it, he'd do it. When the power came back on and his servers didn't work, his users kept asking, "But I thought all we were doing was repairing the air conditioning?"
A/B test your solution with the smallest unit test possible. In StackOverflow’s case, we narrowed it down to just two queries, and we set up a separate testing environment where we could try out different solutions. When we narrowed it down to the CPU clock speeds, we were able to instantly say with confidence, “Yep, we found it, and now we can take production down to fix it.”