Blog

Analyzing a workload can be difficult. There are a number of tools on the market (both free and commercial). These tools universally reduce workload analysis to totals and averages – details and outliers are smeared together. I’m against using just averages to analyze workloads; averages and totals aren’t good enough, especially with the tools we have today.

Collecting Performance Data with Extended Events

A note to the reader SQL Server Extended Events let us collect detailed information in a light weight way. These examples were written on SQL Server 2012 and SQL Server 2014 CTP2 – they may work on SQL Server 2008 and SQL Server 2008R2, but I have not tested on either version.

We need to set up an Extended Events session to collect data. There are three Extended Events that provide information we want:

  • sqlos.wait_info
  • sqlserver.sp_statement_completed
  • sqlserver.sql_statement_completed

In addition, add the following actions to each event, just in case we need to perform deeper analysis at a later date:

  • sqlserver.client_app_name
  • sqlserver.client_hostname
  • sqlserver.database_id
  • sqlserver.database_name
  • sqlserver.plan_handle
  • sqlserver.query_hash
  • sqlserver.query_plan_hash
  • sqlserver.session_id
  • sqlserver.sql_text

In a production environment you should enable sampling. Sampling is just a predicate that filters out random events, typically by using the modulus of the session identifier (e.g. SPID modulo 5 = 0 will sample 20% of activity). Enabling sampling makes sure that you don’t collect too much data (yes, too much data can be a bad thing). During a routine 5 minute data load and stress test, I generated 260MB of event data – be careful about how often you run this and how long you run it for.

To get started, download all of the scripts mentioned in this post.

Reporting on Extended Events

Go ahead and set up the Extended Events session and run a workload against that SQL Server for a few minutes. If you’re analyzing a production server, copy the files to a development SQL Server or to a SQL Server that has less workload – you’ll thank me later. Once you’ve collected some data in the Extended Events session, stop the session, and run the results processing script. On my test VM, processing the data takes a minute and a half to run.

The data processing script creates three tables materialize XML data to speed up processing; shredding XML takes a long time. Relevant data is extracted from each of the different events that were collected and persisted into the processing tables. If you’re going to be doing multiple analysis runs across the data, you may even want to put some indexes on these tables.

Once the data is loaded, run the analysis query. This doesn’t take as long as the processing script, but it does take some time; it’s worth it. The analysis script collects a number of statistics about query performance and waits. Queries are grouped into 1 minute time blocks and metrics around reads, writes, duration, and CPU time are collected. Specifically, each metric has the following statistics built up:

  • Total
  • Average
  • Standard Deviation
  • Minimum
  • Maximum
  • Percentiles – 50th, 75th, 90th, 95th, and 99th

The same thing happens for each wait as well – each wait is time boxed on the minute and then both signal and resource waits are analyzed with the same statistics as query duration metrics.

In both analyses, analysis is performed on a query by query basis. At the end of the analysis we you get a multi-dimensional view of the data by time and query. It should be easy to perform additional analysis on the data to create broader time windows or to analyze the entire dataset at once.

Want to see what the output looks like? Check out these two screenshots:

Query Performance by the Numbers

Query Performance by the Numbers

Query Waits

Query Waits

Why Produce All of These Metrics?

All of these metrics give us insight into how these queries really run; an average just doesn’t help. Standard deviation alone lets us be aware of the variability of a particular metric – high standard deviation on a particular wait type means that we have a lot of variability in how long we wait on a resource. We also collect percentiles of all of these different metrics to help understand the distribution of data.

With this data at our disposal, we can make a better analysis of a workload. Now we can identify variations of a query that are producing bad plans, taking too long, or just reading an excessive amount of data. Or, better yet, if query performance is constant, we know that your code is just plain awful.

How’s that for a parting thought?

Jeremiah Peschka
When I’m not working with databases, you’ll find me at a food truck in Portland, Oregon, or at conferences such as DevLink, Stir Trek, and OSCON. My sessions have been highly rated and I pride myself on their quality.
Jeremiah Peschka on sabtwitterJeremiah Peschka on sablinkedin
↑ Back to top
  1. Why is it 2014 and there isn’t a better way to do this yet? Really, MS SQL?!

    SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(LTRIM(RTRIM(w.statement)), NCHAR(13), ‘ ‘), NCHAR(10), ‘ ‘), NCHAR(9), ‘ ‘), ‘ ‘,”),’><',''),'’,’ ‘) AS QueryText

    • It’s because computers are terrible and the solution is laughably long in just about any other language. The only difference is that functions don’t suck in anything that isn’t T-SQL.

  2. Thank you Jeremiah its nice to see a actual real world examples. I think the real gold in here is highlighting the importance of understanding how statistics can hide outliers.

    Anyway, question for you. In production, how would propose using this event? So, is it scheduled to run a couple of times a week, adhoc or maybe all the time on a 30% sample rate?
    How about the analysis? Again, adhoc or go through the effort of scripting everything out it automatically generate alerts and reports?

    Basically, I’m really struggling (poor search-engine-fu?) to find decent articles about using extended events over the long term.

    • I’d propose using this in a similar way to using a server side trace. If you were asking me as a production DBA, I would only use this when I had a problem. I’d never leave something like this running all the time because of the potential for high load.

      Because of the load on disk and potential to create a very large set of files, I would only sample this at an acceptable rate for your workload, whatever that might be.

      As far as the analysis, I would run it as frequently as you want, but only run it on any system other than the main production system. As I’m sure you’ve noticed, the XML shredding pushes one CPU to 100% for a long time.

      If you want any kind of monitoring, like it sounds like you do, there are many monitoring packages that can help you find this kind of information – not all of them resort to averages.

      • Wow that was quick! Thank you for that, I didn’t feel right that extended events should be running all the time

        I’m left with one other question, if I may. Every now and again, something beats the hell out one of boxes. Are you aware of any techniques, products, anything that would show what was running/happening on that box leading up to the incident?

        Many thanks.

        • Not a problem.

          Most SQL Server monitoring tools will show you what’s been running at a given moment in time (or at least in a given time slice) – SQL Sentry, Spotlight, Idera DM, and Red Gate SQL Monitor all do it. If you want a hosted solution, Red Gate’s hosted solution and Circonus will show you this kind of information.

  3. Great post Jeremiah.

    Just two thing. In the script, the duration column shows that the units are miliseconds but I think are microseconds, mainly for many reasons:

    – The documentation says “The time (in microseconds) that it took to execute the statement”
    – If it was miliseconds, I would have a big problem in my production servers 😉

    The second thing is a doubt about the query_hash column. Why sometimes it shows 0 (so you have to filter on the query)?

    Thanks in advance.

    • Without seeing the query it’s tough to say. I have seen a zero hash for system queries, extended stored procedures, CLR, in-memory OLTP, and a few other things.

      Where did you find XEvents documentation? I’d love a link since Microsoft don’t seem to publish it.

      • I suspect that this should be microseconds because the data would have no sense in my environment, so I tried to find an article that confirm my suspects… nothing found except on the SSMS itself: the window that shows the properties of the session, on the tab “Event field” there’s a description column with that text.

        It’s not exactly an “official document”, but is better than nothing

        • Ah, got it. I was hoping you’d found something more visible.

          Feel free to edit that code sample through github and send over a pull request.

          • Sorry but I’m newbie in github and have no idea how to do it. Maybe it’s safer that you do this change, I only want a beer in return 🙂

  4. hello Jeremiah,

    I try to measure the sql duration of our applications but i do not understand the differences i see in
    sp_statement_completed,sql_statement_completed and sql_batch_completed. The relation between them seems unstable due to e.g. overlap.
    for example: I kick off a sqlcmd file (in order not to have gui-related delays) that runs sp_helpdb 100 times:

    use adventureworks2012
    go

    DECLARE @cnt INT = 0;

    WHILE @cnt < 100
    BEGIN
    exec sp_helpdb adventureworks2012
    SET @cnt = @cnt + 1
    END

    When kicked off in adventureworks i see the following results (grouped in the gui, SUM(duration) )

    use adventureworks2012:

    name timestamp duration
    name: sql_batch_completed (2) SUM: 1034087
    name: sql_statement_completed (303) SUM: 1027417
    name: sp_statement_completed (7400) SUM: 868087

    so about a second in each category. (time is indeed microseconds)
    however, when i kick off the procedure in a different database,the result is :

    use model:
    name duration
    name: sp_statement_completed (600) SUM: 39349

    Same code, just one line? Number of statements is less – and duration is smaller, with a factor 30? This is only one example of the confusing figures i got back from the extended events. Have you got any idea on how to continue the investigation on getting the correct figures?

  5. Hey guys, can you link to something recent about tying together Extended Events and Data Collection/Management Data Warehouse? I want to take capacity metrics in a lightweight form through SQL Server (e.g. max memory used by a node, max CPU usage over a week, etc.) in a way that I can put it on an easy chart. Seems like Management Data Warehouse would provide exactly that.

    • MDW has great intentions, but I’ve never seen someone flip it on, and get lots of useful information with low overhead. You’re better off investing in a monitoring tool that you can store information from on a totally different server, and already has a dashboard and reporting built in. Check out SQL Sentry, Dell, and Idera.

Leave a Reply

Your email address will not be published. Required fields are marked *

css.php