This month’s Redmond Mag glows about some new features in SQL 2008, and yes, it does have a lot of cool tricks up its sleeve. But before you go upgrading your servers to get those new features, there’s one thing you need to know.
New versions of SQL Server are not always faster for every query.
This may come as a surprise to you, but every versions of SQL Server have areas where they require manual tweaking in order to be as fast as the last version. I worked with Jeff Atwood and the guys at StackOverflow this past weekend to move them onto SQL Server 2008, and we had a nasty surprise. Jeff summed up the issues with the SQL 2008 upgrade on his blog, but I’ll cover it here from a DBA perspective.
The app and SQL 2005 were on the same box (which tells you a lot about the performance of his code, because that site is doing pretty well) and they got a new box running SQL 2008. We restored the databases over to the 2008 box for testing, ran queries, and compared performance.
Full text search results were slower, but we didn’t catch just how slow they were because we focused on the queries that were currently running slow on the 2005 box. Some of the queries we didn’t test had been running great on 2005, but were dead slow on 2008.
Why slower? Different execution plans.
We did catch a second issue: on a particularly slow set of queries, SQL 2008 was building a different execution plan than 2005. This execution plan worked – but not for the kind of load StackOverflow has. I narrowed down the differences and I was able to overcome it with trace flag 2301, which lets SQL spend a little more time building a good execution plan. By spending more time compiling the plan initially, and saving that plan in cache, we got phenomenally better results. Query times went from 190ms on SQL 2005 to 40ms on SQL 2008. Hubba hubba. All systems go.
Denny Cherry, a performance tuning guru with a long history at MySpace.com and Awareness Technologies, asked me why I didn’t manually set up query execution plans for them. If it was my server and I was Jeff’s full-time employee, that’s exactly what I’d do. Problem is, if you don’t have a full-time DBA to watch the server and identify what the right (and wrong) execution plans are, you introduce an element of mystery. I can imagine what would happen three months down the road: performance would go to hell in a handbasket as schemas, queries and indexes changed over time. Jeff wouldn’t know what things were the fault of the engine, versus what things were the fault of the DBA who’d changed these settings a while back. So I had to pick a solution that wouldn’t require StackOverflow to incur a huge new payroll expense.
Went live with 2008, -T2301 killed us.
We went live with SQL 2008, rebuilt the indexes & stats, turned on the site (now hosting IIS on a separate box, mind you) and immediately the server slowed to a crawl. I figured it’d take a few minutes to get a good set of execution plans built, but the server just wasn’t recovering. Doing diagnostics on the server, I discovered that queries using sp_executesql were just knocking the server over. Ahhh-ha! Those were dynamic SQL strings, and those would probably get new execution plans built every time. The trace flag -T2301 failed us there, so we had to rip it back out.
How much would you pay to avoid a scenario like this? $19.95? $29.95? But wait, there’s more!
After ripping out the trace flag, the server stabilized around 20-30% CPU, but those numbers were too high for a weekend night. When they came up to full load during the week, the server fell over, averaging 100% CPU for a few minutes at a time. The problem query was doing a union between three full text searches, but before you scream bloody murder about a union, even running the three searches independently was taking 60-70% of the time they took when unioned together. We were screwed. The guys had to make a change to their application and cache data on the web server’s hard drive in order to sustain their normal load.
Ugh. As a DBA, that’s a failure when the app guys tell me that. This is an application that used to live fine on a single box, and now, even with SQL 2008 on its own hardware, the app guys have to work around a weakness in SQL 2008. Ouch. I take that pretty personally.
The lesson: capture a full trace before you upgrade SQL Server.
The lesson: before you upgrade, capture a full trace of your server’s load and replay it against the new version. Analyze before and after duration times and CPU numbers for both versions, and identify the list of queries that run slower. Examine how often they actually run in production, and think about how that’s going to affect your load. This was my own failure – after working with the guys at StackOverflow and seeing how tight their queries were, it seemed like the slowest queries on SQL 2005 were still in pretty good shape. Unfortunately, hidden below the surface in queries that were running in 50-75ms on SQL 2005, were some queries that ballooned to over 1 second on SQL 2008, and went much higher under load.
Furthermore, a simple trace replay still won’t give you the full picture because traces don’t throw the same amount of load at the replay server in the same time. In a web server scenario, you may have a hundred queries come in simultaneously, and you want to see exactly how the new server will be affected by that – but replaying a trace with the Microsoft native tools won’t give you that answer. For that, you need a benchmarking tool like Quest Benchmark Factory or HP Mercury LoadRunner, etc – something that can capture a load and then replay it with the same velocity and bandwidth.
Do I like SQL Server 2008? Yeah. But do I wish we could have avoided what happened this week with StackOverflow? Hell yeah.