I came across an interesting case recently where queries were failing in a high transaction SQL Server. We knew that queries were occasionally failing with error severity 20 because SQL Server Agent alerts were periodically firing. However, those built in alerts don’t capture exactly what has failed– they just say “Something failed big. Good luck with that.”
This particular SQL Server instance runs thousands of batch requests each second. The errors would occur every hour or so at the most. This made things a bit tricky. Popping open SQL Server Profiler would have been a hot mess for performance.
What to do to find the query in distress?
The instance in question was running SQL Server 2008R2. This meant that Extended Events were an option. I’m open to using Extended Events, but it’s not generally my first choice. Sure, it’s a cool feature, but it tends to be time consuming to design, script, and test Extended Event sessions and then query the results. Let’s face it, Extended Events are a hornets nest full of angry XML.
In this case, however, Extended Events turned out to be pretty handy. The built-in System Health Extended Events trace already captures information for queries that fail with Severity 20 and higher– so the data was just waiting to be queried.
I Was Born to Write XQuery
That’s a lie. I figured out pretty fast that the System Health trace held the answer I wanted, but my search-engine-fu didn’t help me query exactly what I wanted. Getting the query just right was, in typical Extended Events style, a bit of a headache. I figured I’d share the information here, because you probably weren’t born to write XQuery, either. (If you were, you are probably a demon robot.)
Severity 20 Errors and Higher: System Health
If you need to query high severity errors from System Health, this query is a good place to start. The gist of the query is that you’re looking for rows in the System Health of the type error_reported. We shred out columns to display error severity, error number, error message, and (hopefully) the associated sql text, if it was captured.
Test this outside of production first. Monitor its runtime carefully: it can take a bit of time on a busy, large system. I use the temp table here for a reason– it helps SQL Server with estimates for the second query (row and cost estimates get pretty wacky in some cases if you use a CTE or a derived table).
SELECT CAST(target_data as xml) AS targetdata INTO #system_health_data FROM sys.dm_xe_session_targets xet JOIN sys.dm_xe_sessions xe ON xe.address = xet.event_session_address WHERE name = 'system_health' AND xet.target_name = 'ring_buffer'; SELECT DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), CURRENT_TIMESTAMP), xevents.event_data.value('(@timestamp)', 'datetime2')) AS [err timestamp], xevents.event_data.value('(data[@name="severity"]/value)', 'bigint') AS [err severity], xevents.event_data.value('(data[@name="error_number"]/value)', 'bigint') AS [err number], xevents.event_data.value('(data[@name="message"]/value)', 'nvarchar(512)') AS [err message], xevents.event_data.value('(action[@name="sql_text"]/value)', 'nvarchar(max)') AS [query text], xevents.event_data.query('.') as [event details] FROM #system_health_data CROSS APPLY targetdata.nodes('//RingBufferTarget/event') AS xevents (event_data) WHERE xevents.event_data.value('(@name)', 'nvarchar(256)')='error_reported'; DROP TABLE #system_health_data; GO
Catching Other High Severity Errors: Preparation
The query above is great, but not every error is Severity 20 or higher. What if a query is failing with Severity 16? While we’re at it, it might be good to check for those, too.
Since we went ahead and did the hard work on writing the query to display things in a nice way, it’s not a lot of extra work to adapt it. We do have to set up a custom Extended Events trace, though, because System Health filters for error severity 20 and higher.
Before you ever create a new Extended Events session, test it out on a non-production server first. No, really, just do it.
First, make sure to check out what Extended Events Sessions are running on your SQL Server:
SELECT * FROM sys.dm_xe_sessions; GO
If more than the System Health and sp_server_diagnostics sessions are running, find out what the sessions are being used for and what’s up. You don’t want to trace yourself to death.
Set up the Extended Events Session
Here’s our goals: We’d like to catch all errors higher than a severity of 10. We want to stream the errors to a file stored on our SQL Server asynchronously. We can do that with a script like this. The script defines the Extended Events session and then starts it.
You must change directories for where the files are stored
--Create an extended event session CREATE EVENT SESSION what_queries_are_failing ON SERVER ADD EVENT sqlserver.error_reported ( ACTION (sqlserver.sql_text, sqlserver.tsql_stack, sqlserver.database_id, sqlserver.username) WHERE ([severity]> 10) ) ADD TARGET package0.asynchronous_file_target (set filename = 'S:\XEventSessions\what_queries_are_failing.xel' , metadatafile = 'S:\XEventSessions\what_queries_are_failing.xem', max_file_size = 5, max_rollover_files = 5) WITH (MAX_DISPATCH_LATENCY = 5SECONDS) GO -- Start the session ALTER EVENT SESSION what_queries_are_failing ON SERVER STATE = START GO
If you’re testing this outside of production like you’re supposed to, you may want to generate some errors for testing purposes. You can do that with simple queries like this one. (Good news, if you don’t have AdventureWorks2012, it’ll still throw an error. Win/win?)
use AdventureWorks2012; GO select * from Person.Person where ModifiedDate='20aaa'; GO
This results in a level 16 error:
Msg 241, Level 16, State 1, Line 1
Conversion failed when converting date and/or time from character string.
To make sure your trace is working, you may query from the files that are collecting data. You can query this live, while the files are being written to. You can also copy the files off to another location and query them elsewhere.
You must change directories for where the files are stored
;with events_cte as( select DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), CURRENT_TIMESTAMP), xevents.event_data.value('(event/@timestamp)', 'datetime2')) AS [err_timestamp], xevents.event_data.value('(event/data[@name="severity"]/value)', 'bigint') AS [err_severity], xevents.event_data.value('(event/data[@name="error_number"]/value)', 'bigint') AS [err_number], xevents.event_data.value('(event/data[@name="message"]/value)', 'nvarchar(512)') AS [err_message], xevents.event_data.value('(event/action[@name="sql_text"]/value)', 'nvarchar(max)') AS [sql_text], xevents.event_data from sys.fn_xe_file_target_read_file ('S:\XEventSessions\what_queries_are_failing*.xel', 'S:\XEventSessions\what_queries_are_failing*.xem', null, null) cross apply (select CAST(event_data as XML) as event_data) as xevents ) SELECT * from events_cte order by err_timestamp;
Here’s what a bit of the output looks like on my test VM, which is a hotbed of terrible errors:
Clean up Your Extended Events Trace
This isn’t the heaviest of traces, but it’s always important to clean up after yourself– only trace the minimum amount needed to solve a problem, and turn it off when you’re not using it.
To stop and remove your trace, you’ll need to run queries like this:
-- Stop your Extended Events session ALTER EVENT SESSION what_queries_are_failing ON SERVER STATE = STOP; GO -- Clean up your session from the server DROP EVENT SESSION what_queries_are_failing ON SERVER; GO
Does this Replace Application Logging?
Noooooooooo. Oh, no.
This is a good tool to use periodically to check and make sure logging in your application layer is working well. But the application really should be catching failures, retrying when appropriate, and logging in a way that lets staff know that something is going wrong.
This method doesn’t catch everything you might think of as a “failing query”, either. Application timeouts are a little different. In an execution timeout scenario, the application decides that it’s tired of waiting. It sends a notification to the SQL Server that it’s canceling its query and moving on. That doesn’t get recorded as a failing query in this way.
I don’t think that’s a problem– because it makes a LOT more sense for the application to log errors and timeouts and decide what to do in each case.
Do You Always Need Extended Events?
Oh, heck no!
This can come in handy sometimes, so I’m sharing it. But Extended Events aren’t always the fastest, simplest, or easiest way to solve a problem– particularly if you’re not yet on SQL Server 2012.
If the queries in this post help you out– rock on! But don’t dismiss simpler methods if they get you where you need to go, too. The fanciest way to do something isn’t necessarily the best.