Where Should You Tune Queries: Production, Staging, or Development?


The #1 fastest way to tune queries is in the production database, on the production server.

I know. Put the knife down. You’re not happy about that, but hear me out: I’m specifically talking about the fastest way to tune queries. In production, you can guarantee that you’re looking at the same data, hosted on the same exact build of SQL Server, with the same horsepower, and the same configuration settings at the server & database level. If you’re under duress, this is the fastest way to get it done.

Having said that, tuning queries in production carries obvious risks:

  • Queries might insert/update/delete data
  • You might see data you’re not supposed to see
  • Your tuning activities might adversely impact other users
  • You might forget to highlight that WHERE clause
  • Or you just plain might not have permissions to production (which frankly, is a great thing)

So instead, in descending order of effectiveness, here are the next fastest places to tune queries:

#2: Production data, on a different (but identical) server – because here you should get identical query plans. I still put this at #2, though, because in my experience, even when folks think it’s an identical server…it’s not. We run sp_Blitz on the two servers, and suddenly we discover that they have different sp_configure settings, different trace flags, different storage, etc.
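If you want a quick manual version of that check, here’s a minimal sketch – run it on both servers and diff the output (standard catalog views and DBCC commands only):

    -- Server-level settings (the sp_configure values):
    SELECT name, value_in_use
    FROM sys.configurations
    ORDER BY name;

    -- Globally enabled trace flags:
    DBCC TRACESTATUS(-1);

    -- Exact build and edition:
    SELECT SERVERPROPERTY('ProductVersion') AS build,
           SERVERPROPERTY('Edition') AS edition;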

#3: Production data, on a differently powered/configured server – because at least you’re dealing with the same data size and distribution, so you should get highly similar execution plans. However, they won’t be identical – especially when we start talking about wildly different memory (which produces different grants, therefore different spill behaviors), different SQL Server builds, and different configuration settings.
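To see how the memory difference plays out, compare workspace memory grants across the two environments – a minimal sketch using a standard DMV:

    -- Queries currently holding or waiting on memory grants.
    -- Different max server memory = different grant sizes = different spills.
    SELECT session_id,
           requested_memory_kb,
           granted_memory_kb,
           used_memory_kb,
           query_cost
    FROM sys.dm_exec_query_memory_grants
    ORDER BY requested_memory_kb DESC;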

#4: Non-production data, but an identical server – now we’ve got different execution plans because the data quantity & distribution are different. Your tuning becomes dramatically slower because you have to do a great job of comparing query plans between production & non-production. When they’re different – and they’re gonna be different – you have to understand that your tuning efforts in the non-production environment may not produce the same awesomely positive results (or even ANY positive results) in production. Great recent example: I had a developer who just couldn’t fix a bad query plan issue whose root cause was bad stats in production. Until the developer could see the bad stats, she had no idea there could even be a problem like that.
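To spot that kind of stats problem, compare statistics freshness between the two environments – a minimal sketch using the standard stats DMV, with a placeholder table name:

    -- How stale are the statistics on a given table?
    -- Run in both environments and compare rows vs. modification_counter.
    SELECT s.name AS stats_name,
           sp.last_updated,
           sp.rows,
           sp.rows_sampled,
           sp.modification_counter
    FROM sys.stats AS s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE s.object_id = OBJECT_ID('dbo.YourTable'); -- placeholder name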

#5: Non-production data, on a differently powered/configured server – because here, nothing’s alike. It’s like calling a plumber to fix a broken pipe in your house when they can’t see your house – they’re looking at their own house, trying to explain what you should do. Nothing’s gonna match, and nothing’s gonna be easy. Can you tune queries this way? Sure – but it’s gonna take a long time and a lot of money.


23 Comments

  • I knew that I was maturing as a developer when I went from wanting access to production, to NOT wanting it…

  • where does DBCC CLONEDATABASE fit in that hierarchy?

    • For me, it’s pretty much worthless: it’s only good to show estimated plans, not actual. Usually, the reason I’m tuning queries is that SQL Server made bad decisions to begin with.
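      For reference, the command itself is just this (database names are placeholders) – it copies schema and statistics but no data, which is exactly why you only get estimated plans:

          -- Schema-and-statistics-only copy: no table data, so queries
          -- compile with production-like plans but can't really run.
          DBCC CLONEDATABASE (YourDatabase, YourDatabase_Clone);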

  • I agree with you on the whole. Production at all of the places I’ve worked has always been different on so many levels.

    But I would also first ask, “what do you mean by tuning?”

    At our shop, and despite 7 years of attempting to build bridges and cultivate street cred with developers, 99% of tuning issues actually are just missing (or no longer adequate) indexes. THESE sorts of things can be checked in DEV/QA test environments – see the sketch below.
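    As a starting point for that kind of check, the missing-index DMVs behave the same in any environment – a sketch; treat the output as hints to investigate, not indexes to create blindly:

        -- Missing-index suggestions the optimizer has logged since startup.
        SELECT d.statement AS table_name,
               d.equality_columns,
               d.inequality_columns,
               d.included_columns,
               s.user_seeks,
               s.avg_user_impact
        FROM sys.dm_db_missing_index_details AS d
        JOIN sys.dm_db_missing_index_groups AS g
          ON d.index_handle = g.index_handle
        JOIN sys.dm_db_missing_index_group_stats AS s
          ON g.index_group_handle = s.group_handle
        ORDER BY s.user_seeks * s.avg_user_impact DESC;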

  • Alin Selicean
    January 22, 2020 9:25 am

    Great post, Brent. I often find myself asking the same question, and ending up in situation #1. I very rarely had the luxury of having different environments available for my tuning efforts, and when I did, they were servers with 10x fewer resources. And let’s not forget about the production workload happening while you tune – that’s completely out of the picture when tuning on a different box, which changes the effort by removing the resource contention that’s happening in production.

  • In our environment, we use a lot of virtual machines for SQL. In fact, most of our servers are virtual. In this case, we are able to query on an exact copy of production, without actually touching production – we can clone the system and boot it up on a test or DR network. Same exact settings (hardware, sys config, the whole thing). Maybe this is option 1.2?

    • Wow, you are really lucky. Systems I usually see are monsters (4TB of memory, 72+ CPUs), which the customer’s virtualization cannot afford to clone. So it’s either #1 or #5 for me…

      • Michael Ahearn
        January 23, 2020 9:31 am

        I definitely understand that! We’ve got a few monster servers (1/2–1 TB of memory) as well that are also hardware-based, but nothing of that size. Wow!

  • Simon Holzman
    January 22, 2020 9:40 am

    For most select queries, which are usually the ones that need tuning, the query plan on a test system with production data should be almost identical to Production… so optimize on the test system and then test the optimization on Production. If it is not acceptable there, you may need to do the full optimization on Production but this should be rare.

    The reason I say that most optimizations are of Select statements is that most application Updates touch a single record, so that part shouldn’t need optimizing – use the primary key. If the update is calculated, the calculation can usually be optimized as a select statement.

  • Or the other way around… when a query is running slow in a non-production environment, I first check (if the required tables/data are available) how the query runs in production. Why bother with performance problems in a non-production environment when production runs OK? Within reason, of course.

  • This is the kind of topic where I usually say: it depends… 😉 But I’m more likely to play it as safe as possible: keep the production server up and avoid issues related to human error. Sometimes it is what it is…

  • Marcelo Barbieri
    January 22, 2020 1:04 pm

    Totally agree!!!

  • What about the best time of day to tune queries?

  • Good one, Brent. How about using DBCC OPTIMIZER_WHATIF for resource-related constraints? I know it’s undocumented, but would it be worth using?
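    For anyone curious, the syntax reportedly looks like the sketch below – but it’s undocumented, unsupported, session-scoped, and cached plans won’t recompile on their own, so treat it as a curiosity for a throwaway instance only:

        -- UNSUPPORTED: make the optimizer pretend the hardware is different.
        DBCC OPTIMIZER_WHATIF(CPUs, 4);         -- pretend 4 schedulers
        DBCC OPTIMIZER_WHATIF(MemoryMBs, 8192); -- pretend 8 GB of memory
        -- Setting a property back to 0 reportedly reverts to the real value:
        DBCC OPTIMIZER_WHATIF(CPUs, 0);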

  • “Having said that, tuning queries in production carries obvious risks:”
    I find that most of the time that risk can be mitigated with plan estimation. The “sky is falling, we need a fix quickly” scenarios are typically a query that has decided to nested-loop / table-scan something, and (assuming you can grab the query text and reproduce the issue) you can demonstrate that you (probably) have the fix without actually running anything – just by asking “what’s the plan for this query?” – see the sketch below.

    Where would you put plan estimation only on the production server in terms of risk / accuracy?
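    For context, grabbing an estimated plan without executing anything is a one-liner – a minimal sketch with a placeholder query:

        -- Estimated plan only: the query is compiled but never executed,
        -- so no data is read or modified.
        SET SHOWPLAN_XML ON;
        GO
        SELECT *            -- the suspect query goes here
        FROM dbo.YourTable  -- placeholder name
        WHERE SomeColumn = 42;
        GO
        SET SHOWPLAN_XML OFF;
        GO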

  • But seriously… who “tests” in production, especially in a 24/7 environment? I would only do that in cases where problems cannot be reproduced in DEV/TEST/QA AND it is a critical issue AND there is no other option (mind the AND operator here). I would rather restore a production database to DEV/TEST/QA and try to reproduce the issue there. In my opinion, anyone who dares to show signs of this frightening behaviour should be (in escalating order) screamed at as loud as possible, Chinese water tortured, and banished to Elba.

  • michael.wujcik
    January 23, 2020 7:33 am

    As I’m unable to fully replicate the issues in test (even replicating the data feed and opening a few client sessions doesn’t do it), I look at the stats on live, find the badly-behaving queries, then take them over to test and tune the actual plan against a restored backup.

    For example, right now I’m working on some heavy toilet usage. Blitzing live shows a handful of queries spilling like crazy, and when I re-run those queries in test I get the exact same plans. Some stats change a bit – but when I’m trying to eliminate sort-spills and reduce the raw volume of data read this works really well.
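    For anyone wanting to reproduce that workflow, here’s a rough sketch – the first query assumes the First Responder Kit’s sp_BlitzCache is installed, and the DMV column needs SQL Server 2016 SP2 / 2017 or later:

        -- Sort the plan cache by tempdb spills (First Responder Kit):
        EXEC sp_BlitzCache @SortOrder = 'spills';

        -- Or straight from the DMVs on newer builds:
        SELECT TOP (10) qs.total_spills, qs.execution_count, st.text
        FROM sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
        ORDER BY qs.total_spills DESC;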

  • Thanks, Brent. Really helpful for defining a performance tuning approach for our current development.

