Analyzing a workload can be difficult. There are a number of tools on the market, both free and commercial, but they universally reduce workload analysis to totals and averages: details and outliers are smeared together. I’m against using just averages to analyze workloads; totals and averages aren’t good enough, especially with the tools we have today.

Collecting Performance Data with Extended Events

A note to the reader: SQL Server Extended Events let us collect detailed information in a lightweight way. These examples were written on SQL Server 2012 and SQL Server 2014 CTP2 – they may work on SQL Server 2008 and SQL Server 2008 R2, but I have not tested either version.

We need to set up an Extended Events session to collect data. There are three Extended Events that provide information we want:

  • sqlos.wait_info
  • sqlserver.sp_statement_completed
  • sqlserver.sql_statement_completed

In addition, add the following actions to each event, just in case we need to perform deeper analysis at a later date:

  • sqlserver.client_app_name
  • sqlserver.client_hostname
  • sqlserver.database_id
  • sqlserver.database_name
  • sqlserver.plan_handle
  • sqlserver.query_hash
  • sqlserver.query_plan_hash
  • sqlserver.session_id
  • sqlserver.sql_text

In a production environment you should enable sampling. Sampling is just a predicate that filters out random events, typically by using the modulus of the session identifier (e.g. SPID modulo 5 = 0 will sample 20% of activity). Enabling sampling makes sure that you don’t collect too much data (yes, too much data can be a bad thing). During a routine 5-minute data load and stress test, I generated 260MB of event data, so be careful about how often you run this and how long you run it for.
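Putting the pieces together, a session definition might look like the sketch below. The session name and file target path are my own placeholders, and the predicate uses `package0.divides_by_uint64` to keep roughly 20% of sessions (the SPID modulo 5 = 0 filter described above); adjust both for your environment.

```sql
-- Sketch of the Extended Events session described above.
-- Session name and file path are placeholders.
CREATE EVENT SESSION [query_performance] ON SERVER
ADD EVENT sqlos.wait_info (
    ACTION (sqlserver.client_app_name, sqlserver.client_hostname,
            sqlserver.database_id, sqlserver.database_name,
            sqlserver.plan_handle, sqlserver.query_hash,
            sqlserver.query_plan_hash, sqlserver.session_id,
            sqlserver.sql_text)
    -- Sampling: keep ~20% of sessions (SPID modulo 5 = 0)
    WHERE (package0.divides_by_uint64(sqlserver.session_id, 5))
),
ADD EVENT sqlserver.sp_statement_completed (
    ACTION (sqlserver.client_app_name, sqlserver.client_hostname,
            sqlserver.database_id, sqlserver.database_name,
            sqlserver.plan_handle, sqlserver.query_hash,
            sqlserver.query_plan_hash, sqlserver.session_id,
            sqlserver.sql_text)
    WHERE (package0.divides_by_uint64(sqlserver.session_id, 5))
),
ADD EVENT sqlserver.sql_statement_completed (
    ACTION (sqlserver.client_app_name, sqlserver.client_hostname,
            sqlserver.database_id, sqlserver.database_name,
            sqlserver.plan_handle, sqlserver.query_hash,
            sqlserver.query_plan_hash, sqlserver.session_id,
            sqlserver.sql_text)
    WHERE (package0.divides_by_uint64(sqlserver.session_id, 5))
)
ADD TARGET package0.event_file (SET filename = N'C:\temp\query_performance.xel');
```

A file target is used here rather than the ring buffer so the collected events can be copied off the production server for analysis.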

To get started, download all of the scripts mentioned in this post.

Reporting on Extended Events

Go ahead and set up the Extended Events session and run a workload against that SQL Server for a few minutes. If you’re analyzing a production server, copy the files to a development SQL Server or to a SQL Server that has less workload – you’ll thank me later. Once you’ve collected some data in the Extended Events session, stop the session, and run the results processing script. On my test VM, processing the data takes a minute and a half to run.

The data processing script creates three tables to materialize the XML data and speed up processing; shredding XML takes a long time. Relevant data is extracted from each of the different events that were collected and persisted into the processing tables. If you’re going to be doing multiple analysis runs across the data, you may even want to put some indexes on these tables.
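If you want to roll your own processing step instead of using the downloadable script, the general shape looks something like this sketch. The file path, temp table names, and column list are hypothetical, not the ones from the script:

```sql
-- Read raw events from the .xel files and materialize the XML once;
-- casting to XML up front avoids re-parsing on every query.
SELECT CAST(event_data AS XML) AS event_xml
INTO #raw_events
FROM sys.fn_xe_file_target_read_file(N'C:\temp\query_performance*.xel',
                                     NULL, NULL, NULL);

-- Shred the statement-completed events into a relational processing table.
SELECT
    event_xml.value('(event/@timestamp)[1]', 'datetime2')                          AS event_time,
    event_xml.value('(event/data[@name="duration"]/value)[1]', 'bigint')           AS duration,
    event_xml.value('(event/data[@name="cpu_time"]/value)[1]', 'bigint')           AS cpu_time,
    event_xml.value('(event/data[@name="logical_reads"]/value)[1]', 'bigint')      AS logical_reads,
    event_xml.value('(event/data[@name="writes"]/value)[1]', 'bigint')             AS writes,
    event_xml.value('(event/action[@name="query_hash"]/value)[1]', 'nvarchar(32)') AS query_hash,
    event_xml.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)')  AS sql_text
INTO #statements
FROM #raw_events
WHERE event_xml.value('(event/@name)[1]', 'nvarchar(60)') LIKE N'%statement_completed';
```

This is also where most of the processing time goes – the `.value()` XQuery calls are the expensive part.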

Once the data is loaded, run the analysis query. This doesn’t take as long as the processing script, but it does take some time; it’s worth it. The analysis script collects a number of statistics about query performance and waits. Queries are grouped into 1 minute time blocks and metrics around reads, writes, duration, and CPU time are collected. Specifically, each metric has the following statistics built up:

  • Total
  • Average
  • Standard Deviation
  • Minimum
  • Maximum
  • Percentiles – 50th, 75th, 90th, 95th, and 99th
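As a sketch of how those statistics can be built up over 1-minute blocks, here is the duration metric only, assuming a hypothetical `#statements` table of shredded events (the real processing tables in the script are named differently). `PERCENTILE_CONT` is a window function in SQL Server 2012, so the aggregates are computed over a partition and deduplicated with `DISTINCT`:

```sql
-- Round each event down to its 1-minute block, then compute the
-- statistics per block and per query.
WITH blocks AS (
    SELECT DATEADD(MINUTE, DATEDIFF(MINUTE, 0, event_time), 0) AS minute_block,
           query_hash,
           duration
    FROM #statements
)
SELECT DISTINCT
    minute_block,
    query_hash,
    SUM(duration)   OVER (PARTITION BY minute_block, query_hash) AS total_duration,
    AVG(duration)   OVER (PARTITION BY minute_block, query_hash) AS avg_duration,
    STDEV(duration) OVER (PARTITION BY minute_block, query_hash) AS stdev_duration,
    MIN(duration)   OVER (PARTITION BY minute_block, query_hash) AS min_duration,
    MAX(duration)   OVER (PARTITION BY minute_block, query_hash) AS max_duration,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY duration)
        OVER (PARTITION BY minute_block, query_hash) AS median_duration,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration)
        OVER (PARTITION BY minute_block, query_hash) AS p95_duration,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration)
        OVER (PARTITION BY minute_block, query_hash) AS p99_duration
FROM blocks;
```

The same pattern repeats for reads, writes, and CPU time; only the metric column changes.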

The same thing happens for each wait as well – each wait is time-boxed to the minute, and then both signal and resource waits are analyzed with the same statistics as the query duration metrics.
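The `sqlos.wait_info` event carries both a total duration and a signal duration, so the resource wait falls out by subtraction. A minimal sketch, again assuming a hypothetical `#waits` table of shredded wait events:

```sql
-- wait_info exposes duration (total) and signal_duration (time spent
-- waiting on the scheduler after being signaled); the difference is
-- time spent waiting on the resource itself.
SELECT
    DATEADD(MINUTE, DATEDIFF(MINUTE, 0, event_time), 0) AS minute_block,
    wait_type,
    SUM(signal_duration)            AS total_signal_wait_ms,
    SUM(duration - signal_duration) AS total_resource_wait_ms,
    MAX(duration)                   AS max_wait_ms
FROM #waits
GROUP BY DATEADD(MINUTE, DATEDIFF(MINUTE, 0, event_time), 0), wait_type;
```

The full set of statistics (standard deviation, percentiles, and so on) applies to both wait columns exactly as it does to query duration.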

In both cases, analysis is performed on a query-by-query basis. At the end of the analysis, you get a multi-dimensional view of the data by time and query. It should be easy to perform additional analysis on the data to create broader time windows or to analyze the entire dataset at once.

Want to see what the output looks like? Check out these two screenshots:

Query Performance by the Numbers

Query Waits

Why Produce All of These Metrics?

All of these metrics give us insight into how these queries really run; an average just doesn’t help. Standard deviation alone tells us how variable a particular metric is – a high standard deviation on a particular wait type means that we have a lot of variability in how long we wait on a resource. We also collect percentiles of all of these different metrics to help understand the distribution of the data.

With this data at our disposal, we can make a better analysis of a workload. Now we can identify variations of a query that are producing bad plans, taking too long, or just reading an excessive amount of data. Or, better yet, if query performance is consistently bad, we know the code is just plain awful.

How’s that for a parting thought?

  1. Why is it 2014 and there isn’t a better way to do this yet? Really, MS SQL?!

    SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(LTRIM(RTRIM(w.statement)), NCHAR(13), ' '), NCHAR(10), ' '), NCHAR(9), ' '), '    ', ' '), '   ', ' '), '  ', ' ') AS QueryText

    • It’s because computers are terrible and the solution is laughably long in just about any other language. The only difference is that functions don’t suck in anything that isn’t T-SQL.

  2. Thank you Jeremiah, it’s nice to see actual real-world examples. I think the real gold in here is highlighting the importance of understanding how statistics can hide outliers.

    Anyway, question for you. In production, how would you propose using this event session? Is it scheduled to run a couple of times a week, ad hoc, or maybe all the time on a 30% sample rate?
    How about the analysis? Again, ad hoc, or go through the effort of scripting everything out so it automatically generates alerts and reports?

    Basically, I’m really struggling (poor search-engine-fu?) to find decent articles about using extended events over the long term.

    • I’d propose using this in a similar way to using a server side trace. If you were asking me as a production DBA, I would only use this when I had a problem. I’d never leave something like this running all the time because of the potential for high load.

      Because of the load on disk and potential to create a very large set of files, I would only sample this at an acceptable rate for your workload, whatever that might be.

      As far as the analysis, I would run it as frequently as you want, but only on a system other than the main production system. As I’m sure you’ve noticed, the XML shredding pushes one CPU to 100% for a long time.

      If you want any kind of monitoring, like it sounds like you do, there are many monitoring packages that can help you find this kind of information – not all of them resort to averages.

      • Wow, that was quick! Thank you for that; it didn’t feel right that Extended Events should be running all the time.

        I’m left with one other question, if I may. Every now and again, something beats the hell out of one of our boxes. Are you aware of any techniques, products, anything that would show what was running/happening on that box leading up to the incident?

        Many thanks.

        • Not a problem.

          Most SQL Server monitoring tools will show you what’s been running at a given moment in time (or at least in a given time slice) – SQL Sentry, Spotlight, Idera DM, and Red Gate SQL Monitor all do it. If you want a hosted solution, Red Gate’s hosted solution and Circonus will show you this kind of information.
