The Difficulty with Deadlocks

Deadlocks can kill an application’s performance. Users will complain about the app being slow or broken. Developers will ask the DBA to fix the problem; DBAs will push the problem back on the developers. The next thing you know, the office looks like Lord of the Flies.

What is a Deadlock?

A deadlock occurs when two queries need exclusive access to different tables and each query is waiting for the other to release its locks. Assume that there are two tables, tA and tB. There are also two queries, Q1 and Q2. The first query, Q1, takes an exclusive lock on tA at the same time that the second query, Q2, takes an exclusive lock on tB. So far, there’s nothing out of the ordinary happening. Q1 then requests exclusive access to tB. At this point we have a block: Q1 must wait for Q2 to release its lock before Q1 can finish. Q2 now requests an exclusive lock on tA. And here we have a deadlock.

Q1 won’t release its lock on tA until it can get a lock on tB. Q2 won’t release its lock on tB until it can get a lock on tA. In order for either query to finish, they need access to the other query’s resources. That’s just not going to happen. This is a deadlock.
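
To make this concrete, here’s a minimal sketch you can run in two query windows (tA and tB are hypothetical tables with id and val columns, each already containing the relevant rows):

-- Session 1 (Q1)
BEGIN TRAN;
UPDATE tA SET val = 1 WHERE id = 1;   -- exclusive lock on tA
-- ...pause here until Session 2 runs its first UPDATE...
UPDATE tB SET val = 1 WHERE id = 1;   -- blocked, waiting on Session 2

-- Session 2 (Q2)
BEGIN TRAN;
UPDATE tB SET val = 2 WHERE id = 1;   -- exclusive lock on tB
UPDATE tA SET val = 2 WHERE id = 1;   -- deadlock: each session now waits on the other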

In order for the database to keep responding, one of these queries has to go. The query that’s eliminated is called the deadlock victim.

Finding Deadlocks

What is the first sign of a deadlock? Queries that should be fast start taking a long time to respond. That’s the first sign of a deadlock, but it’s also the first sign of a lot of other problems. Another sign of a deadlock is an error (error 1205, to be precise) with a very helpful error message: Transaction (Process ID %d) was deadlocked on %.*ls resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

SQL Server is telling you exactly how to solve the problem – re-run your transaction. Unfortunately, if the cause of the deadlock is still running, odds are that your transaction will fail. You can enable trace flags to detect deadlocks (trace flags 1204 and 1222), but they write the deadlock information to the SQL Server error log in a format that is difficult to read and analyze.
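
If you do decide to flip those trace flags on, it’s a one-liner; the -1 applies them globally instead of only to your session:

-- Write deadlock details to the SQL Server error log
DBCC TRACEON (1204, 1222, -1);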

Once deadlocks show up, your database administrator might reach for a script to pull deadlocks out of Extended Events. Extended Events are a great source of data for analysis. Although they’re a relatively new feature (they first appeared in SQL Server 2008), Extended Events already provide an incredibly rich set of tools for monitoring SQL Server. Event data can be held in memory (the default) or written out to a file. It’s possible to build a set of monitoring tools that log all deadlocks to a file and then analyze that file after the events happen.
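
As a rough sketch, such a session might look like the following (the session and file names are made up; on SQL Server 2008 and 2008 R2 the file target is named package0.asynchronous_file_target rather than package0.event_file):

-- Capture every deadlock graph to a file, and survive restarts
CREATE EVENT SESSION capture_deadlocks ON SERVER
ADD EVENT sqlserver.xml_deadlock_report
ADD TARGET package0.event_file (SET filename = N'capture_deadlocks.xel')
WITH (STARTUP_STATE = ON);

ALTER EVENT SESSION capture_deadlocks ON SERVER STATE = START;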

Yesterday's deadlocks are tomorrow's news!
Just like newspapers help us find out what happened yesterday, Extended Events provide a great way to investigate deadlocks that have already occurred. If you don’t have any other monitoring tools in place, Extended Events are a great place to start. Once you start seeing deadlocks, you’ll want to start gathering more information about them. It takes some skill to read the XML from a deadlock graph, but it contains a great deal of information about what happened. You can find out which tables and queries were involved in the deadlock, which process was killed off, and which locks caused the deadlock to occur.

The flip side of the coin is that Extended Events will give you very fine-grained information about every deadlock that has already happened. There’s nothing in Extended Events to help you stop deadlocks from happening or even to detect them right as they are happening. Much like a microscope lets you look at a prepared slide in excruciating detail, Extended Events let you look at a single point in time in excruciating detail. You can only find out about things after they happen, not as they happen.

Deadlock Notifications

Wouldn’t it be nice if you received notifications of deadlocks as they were happening? Good news! Deadlocks only happen when data is changed, so it’s possible to wrap your modification statements inside a template that records any deadlocks that happen. This template takes advantage of features in SQL Server that allow the deadlock notifications to be sent asynchronously – the notifications won’t slow down applications while they interact with SQL Server.
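
The template itself isn’t reproduced here, but the core of the idea is a TRY/CATCH wrapper around each modification. This stripped-down sketch logs synchronously to a hypothetical dbo.DeadlockLog table (dbo.Accounts is also made up); the real template would hand the logging off to something like Service Broker so the write happens asynchronously:

BEGIN TRY
    BEGIN TRAN;
    UPDATE dbo.Accounts SET Balance = Balance - 50 WHERE AccountId = 42;  -- the wrapped modification
    COMMIT;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK;
    IF ERROR_NUMBER() = 1205  -- this query was chosen as a deadlock victim
        INSERT dbo.DeadlockLog (LoggedAt, ErrorMessage)
        VALUES (SYSDATETIME(), ERROR_MESSAGE());
    THROW;  -- re-raise to the caller; SQL Server 2012+, use RAISERROR on older versions
END CATCH;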

There is a big problem here: every stored procedure that modifies data needs this wrapper. If the wrapper is missed in one place, no deadlock information will be collected from that query. If the wrapper needs to be changed, it has to be changed everywhere. This can also be a good thing, of course: you can target problem queries for reporting, or different queries can respond in different ways. Like Extended Events, this is a very fine-grained mechanism for dealing with deadlocks. Action is taken at the level of a single execution of a query, not at the level of the entire application. If we’re going to take care of deadlocks, we want to do it once and fix things across the entire application.

Deadlocks by Design

Both the Extended Events and notification solutions are very cunning ways to get information about deadlocks that have already happened. Neither solution helps applications respond to deadlocks as they happen.

Much like a director deciding to fix it in post, monitoring for deadlocks and trying to solve the problem afterward is a reaction to something that should have been done right in the first place. Maybe budget constraints got in the way, maybe the software had to ship by a deadline, maybe the team didn’t have the expertise to look into these problems. For whatever reason, something made it into production that causes deadlocks. It doesn’t matter how it happened; the problem is there, and deadlocks are happening.

Error 1205: Catching Deadlocks with Code

Application developers have a tool they can use to cope with deadlocks. When SQL Server detects a deadlock and kills off a query, an error is thrown. That error makes its way back up to the software that made the database call. .NET developers can catch the exception and check its Number property. (Deadlocks throw error number 1205.)

When a deadlock happens, SQL Server will kill off the cheapest transaction – the one that costs the least to roll back. It’s getting rid of the thing that will be easiest to run a second time around. Instead of having deadlocks cause problems, developers can easily check the errors that come back from the database server and try again. You can also set the deadlock priority: if you don’t have time to fix the code, you can specify which queries should run at a lower priority and be chosen as the victim.
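
Here’s a minimal server-side sketch of the retry pattern (table and column names are made up; the same idea works in a .NET catch block by checking SqlException.Number):

-- Optionally, volunteer a low-value batch to be the victim instead of fixing the code:
-- SET DEADLOCK_PRIORITY LOW;

DECLARE @retries int = 3;
WHILE @retries > 0
BEGIN
    BEGIN TRY
        BEGIN TRAN;
        UPDATE dbo.Orders SET Status = 'Shipped' WHERE OrderId = 7;
        COMMIT;
        BREAK;  -- success, stop retrying
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK;
        IF ERROR_NUMBER() = 1205 AND @retries > 1
            SET @retries -= 1;  -- deadlock victim: loop around and try again
        ELSE
            THROW;  -- out of retries, or some other error entirely (SQL Server 2012+)
    END CATCH;
END;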

This is moving the problem up the chain. The users may not see that there is a deadlock, but the application code still needs to deal with it. Things can still be tricky, though: if there’s a long-running transaction holding locks and causing deadlocks, no reasonable number of retries will solve the deadlocking problem.

Reacting to Deadlocks with Architecture

The easiest way to eliminate deadlocks is to design the database to avoid deadlocks. It sounds facetious, doesn’t it? Of course the easiest way to avoid deadlocks is to design so they don’t happen!

There are a few architectural patterns to use in an application to avoid deadlocks.

Pattern 1: Using NOLOCK to Stop Deadlocks

NOLOCK for YESOUCH
A common way to stop deadlocks is to use the NOLOCK query hint. NOLOCK users advocate this approach because they believe it does what it says – it eliminates locking.

NOLOCK doesn’t get rid of all locks, just the ones that make your queries return the right results. You see, NOLOCK stops locking during read operations. In effect, it throws the hinted table or index into READ UNCOMMITTED and allows dirty reads to occur. Locks are still necessary for data modification; only one process can update a row at a time.
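
For reference, the hint looks like this (dbo.Orders is a hypothetical table; the second statement is the session-wide equivalent):

SELECT OrderId, Status
FROM dbo.Orders WITH (NOLOCK)   -- reads skip shared locks; dirty reads are possible
WHERE CustomerId = 42;

-- Session-wide equivalent of hinting every table:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;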

By using NOLOCK, you’re telling the database that it’s okay to give up read consistency – and because writes still take exclusive locks, deadlocks can still happen anyway.

Pattern 2: Indexing for Concurrency

In some cases, deadlocks are caused by bookmark lookups on the underlying table. A new index can avoid deadlocks by giving SQL Server an alternate path to the data. There’s no need for the select to read from the clustered index so, in theory, it’s possible to avoid a deadlock in this scenario.
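
As a sketch, suppose a read on a hypothetical dbo.Orders table keeps deadlocking against updates because of a bookmark lookup; a covering index gives the read its own path to the data:

-- Covers SELECT OrderDate, Status FROM dbo.Orders WHERE CustomerId = @c
-- so the read never has to touch the clustered index at all
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate, Status);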

Think about the cost of an index:
* Every time we write to the table, we probably end up writing to every index on the table.
* Every time we update an indexed value, there’s a chance that the index will become fragmented.
* More indexes mean more I/O per write.
* More indexes mean more index maintenance.

To top it off, there’s a good chance that the index that prevents a deadlock may only be used by one query. A good index makes a single query faster. A great index makes many queries faster. It’s always important to weigh the performance improvement of a single index against the cost to maintain that index and the storage needed to keep it around.

Pattern 3: Data Update Order

A simple change to the order of data modifications can fix many deadlocks. This is an easy pattern to say you’re going to implement; the problem is that it’s a very manual process. Making sure that all updates occur in the same order requires that developers or DBAs review all code that accesses the database, both when it’s first written and when any changes are made. It’s not an impossible task, but it will certainly slow down development.
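
Using the hypothetical tA and tB tables from earlier, the rule is simply that every transaction touches the tables in the same agreed-upon order:

-- Every procedure that writes to both tables agrees: tA first, then tB.
CREATE PROCEDURE dbo.DoWorkA AS
BEGIN
    BEGIN TRAN;
    UPDATE tA SET val = val + 1 WHERE id = 1;
    UPDATE tB SET val = val - 1 WHERE id = 1;
    COMMIT;
END;
GO

CREATE PROCEDURE dbo.DoWorkB AS
BEGIN
    BEGIN TRAN;
    UPDATE tA SET val = val - 1 WHERE id = 2;  -- still tA before tB
    UPDATE tB SET val = val + 1 WHERE id = 2;
    COMMIT;
END;
GO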

There’s another downside to this approach: in many scenarios, managing update order is simply too complex. Sometimes the correct order isn’t clear. Managing update order is made more difficult because SQL Server’s locking granularity can change from query to query.

In short, carefully controlling update order can work for some queries, but it’s not a wholesale way to fix the problem.

Common Patterns: Common Failures

One of the problems with all three patterns is that they’re reactionary. Just like the two methods for detecting deadlocks, they get implemented after there is a problem. Users are already upset at this point. There has already been some kind of outage or performance problem that caused the users to complain in the first place. Of course, sometimes you inherit a problem and you don’t have the opportunity to get good design in place. Is there hope?

Whether you’re starting off a new design or combating existing problems, there is a way that you can almost entirely prevent deadlocks from occurring.

Using MVCC to Avoid Deadlocks

MVCC is a shorthand way of saying Multi-Version Concurrency Control. This is a fancy way of hinting at a much broader concept that can be summarized simply: by maintaining copies of the data as it is read, you can avoid locking on reads and move to a world where readers never block writers and writers never block readers.

This probably sounds like a big architectural change, right? Well, not really.

SQL Server 2005 introduced READ COMMITTED SNAPSHOT ISOLATION (RCSI). RCSI uses snapshots for reads, but still maintains much of the same behavior as the READ COMMITTED isolation level. With a relatively quick change (and about a 10 second outage), any database can be modified to make use of RCSI.
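
The change itself is a single command (YourDatabase is a placeholder; the WITH ROLLBACK IMMEDIATE clause kicks out in-flight transactions, which is where that brief outage comes from):

ALTER DATABASE YourDatabase
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;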

When Should You Use RCSI?

If you actually want my opinion on the subject: always. If you’re designing a new application, turn on RCSI from the get-go and plan your hardware around living in a world of awesome. TempDB usage will be higher because that’s where SQL Server keeps all of the extra row versions. Many DBAs will be worried about additional TempDB utilization, but there are ways to keep TempDB performing well.

The bigger question, of course, is why should I use RCSI?

Use RCSI to Eliminate Locking, Blocking, Deadlocks, Poor Application Performance, and General Shortness of Breath

RCSI may not cure pleurisy, but it’s going to future-proof your application. Somewhere down the road, if you’re successful, you’ll have to deal with deadlocks. Turning on RCSI eliminates that concern, or makes it so minimal that you’ll be surprised when a deadlock finally happens.

A Snapshot of The Future: Looking Past RCSI

RCSI is probably all that most people think they’re going to need at the start of their architectural thinking. There will be circles and arrows and lines on a whiteboard, and someone will say “We need to make sure that the DBAs don’t screw this up.” What they really mean is “Let’s talk to the data guys in a year about how we can make this greased pig go faster.”

Both of these versions can poop and bark.
During the early stages of an application’s life, a lot of activity consists of getting data into the database. Reporting isn’t a big concern because there isn’t a lot of data to report on and a few tricks can be used to make the database keep up with demands. Sooner or later, though, demand will outstrip supply and there will be problems. Someone might notice that long running reports aren’t as accurate as they should be. Numbers are close enough, but they aren’t adding up completely.

Even when you’re using RCSI, versions aren’t held for the duration of a transaction. The different isolation levels correspond to different phenomena and those phenomena, under a strict two-phase locking model, correspond to how long locks are held. When using one of the two MVCC implementations (RCSI or SNAPSHOT), the isolation levels and their phenomena correspond to how long versions are kept around.

Using RCSI, or even plain READ COMMITTED, locks and versions are only held for a single statement. If a query has to read a table multiple times for a report, there’s a chance that there will be minor (or even major) changes to the underlying data during a single transaction. That’s right: even transactions can’t save you and your precious versions.

SNAPSHOT isolation makes it possible to keep versions for the duration of a transaction – every time a query reads a row, it’s going to get the same copy of that row, no matter if it reads it after 5 seconds, 5 minutes, or 5 hours. There could be multiple updates going on in the background, but the report will still see the same version of the row.
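
Here’s what that looks like in practice (YourDatabase and dbo.Sales are placeholders):

-- SNAPSHOT has to be allowed at the database level first
ALTER DATABASE YourDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Then the report opts in for its transaction
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
SELECT SUM(Amount) FROM dbo.Sales;  -- establishes the transaction's snapshot
-- ...5 seconds, 5 minutes, or 5 hours of other report queries...
SELECT SUM(Amount) FROM dbo.Sales;  -- sees exactly the same rows as the first read
COMMIT;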

Getting Rid of Deadlocks in Practice

There are manual ways to eliminate deadlocks, but they require significant effort to design and implement. In many cases, deadlocks can be eliminated by implementing either READ COMMITTED SNAPSHOT ISOLATION or SNAPSHOT isolation. Making that choice early in an application’s development, preferably during the architectural design, keeps the change easy and painless: concurrency is designed into the application from the start, making deadlocks a thing of the past.


32 Comments

  • @Jeremiah: Excellent article!

    My SQL servers recently acquired a new behavior that looks like some sort of deadlock on tempdb. It causes my application performance to come to a near standstill.

    When looking for deadlocks I see all my exec requests with a wait of ‘page%latch_ex’ or ‘page%latch_sh’ and the resource they’re waiting on is 2:1:103, which according to DBCC PAGE (2,1,103) is a system object in tempdb. Any idea what would cause this or how to look further into it?

    • Ooh, interesting question. If it’s a deadlock, you’ll eventually see SQL Server throw a deadlock error back up to your application code. Are you seeing those errors popping up in the app tier?

      You can use sp_WhoIsActive to examine who is generating locks. There’s even a flag you can provide to sp_WhoIsActive (@get_locks = 1) to show you which locks are being taken out by which process. You’ll have to run sp_WhoIsActive when things are being slow, so catching it can be tricky. You could always capture the output of sp_WhoIsActive to a table in a SQL Agent job and analyze the data there.

  • Hello Jeremiah,

    I have solved some nasty deadlock problems with the same solution at least 3 times: by transforming an UPDATE based on several fields of the table into an UPDATE based on the primary key. For instance, change

    UPDATE CustomerKey
    SET KeyValue = '…'
    WHERE Name = … AND KeyValue = … AND …

    to:

    SELECT @CustomerKeyId = CustomerKeyId FROM CustomerKey WITH (NOLOCK)
    WHERE Name = … AND KeyValue = … AND …

    UPDATE CustomerKey
    SET KeyValue = '…'
    WHERE CustomerKeyId = @CustomerKeyId

    I was wondering why this solves the problem. I believe that in the first case a “wider” lock is acquired, but can you explain the details to me?

    Thanks.

    • In the first case, SQL Server may choose to take out range locks on a secondary index and then issue an update lock on the clustered index. When you update based on just the clustered index, you end up only locking a single row at the clustered index, potentially removing many other places for deadlocks to occur.

  • Kim Tripp and Neal Graves wrote a good paper comparing Oracle with SQL Server (with respect to isolation levels) http://technet.microsoft.com/en-us/library/ms345124(SQL.90).aspx

    They examined the differences between SQL Server’s SNAPSHOT isolation level and Oracle’s SERIALIZABLE isolation level.

    Do you know if there are any differences in behavior between Oracle’s READ COMMITTED isolation level and SQL Server’s READ COMMITTED with row versioning (RCSI)? I’m sure there are implementation differences, but would a DB developer notice anything?

  • Jeremiah,
    Great post. Since you think that RCSI should always be used, I would like to know your opinion on this link: http://www.devx.com/dbzone/Article/32957/0/page/4.
    Thanks in advance.

    • You do need to always test and make sure that RCSI won’t cause problems, especially if your database relies on implicit locking behavior. And you should always, always, always choose the appropriate isolation level for the current transaction. Just because I advocate using RCSI, that doesn’t mean you should only use RCSI.

  • @Jeremiah
    I know it’s an old post, but I’ve just come across an issue where Extended Events are not capturing recent deadlock info even though the trace flag IS capturing the deadlock info in the log.
    I’ve got some code to pull the deadlock info and create a deadlock graph, but the recent deadlocks are not there.

    SELECT
        xed.value('@timestamp', 'datetime') AS Creation_Date,
        xed.query('.') AS Extend_Event
    FROM
    (
        SELECT CAST([target_data] AS XML) AS Target_Data
        FROM sys.dm_xe_session_targets AS xt
        INNER JOIN sys.dm_xe_sessions AS xs
            ON xs.address = xt.event_session_address
        WHERE xs.name = N'system_health'
        AND xt.target_name = N'ring_buffer'
    ) AS XML_Data
    CROSS APPLY Target_Data.nodes('RingBufferTarget/event[@name="xml_deadlock_report"]') AS XEventData(xed)
    ORDER BY Creation_Date DESC

    Are you aware of this problem?

    • Derek – that sounds like a great question for Microsoft, actually. Have you thought about posting reproduction code to Connect.Microsoft.com?

        • I am responding to an old post, but the deadlock and isolation level discussion is evergreen. If RCSI is allowed at the database level and used in transactions, doesn’t that make tempdb grow big? Could this go from deadlocks to a server shutdown eventually if tempdb isn’t watched? Thanks in advance for your reply.

        • You’re absolutely right, RCSI will use tempdb to store the row versions. But there’s a catch – you’re already using row versioning without even knowing it. Triggers, cursors, and a host of other SQL Server features use the version store behind the scenes. If you’re at all worried about it, you can monitor the version store counters on a test installation and enable RCSI to see how the size of tempdb changes. I suspect that your index rebuilds and DBCC checks put more load on tempdb than RCSI will.

          In short – what you’ve asked is theoretically possible, but I’ve never seen or heard of it happening.

          • Dennis Miller
            January 7, 2015 1:10 pm

            tempdb is only a transitory store of the row version data, though, right? As in, for the life of a transaction? So the trick is to figure out the aggregate number of updates and deletes (and some inserts) that may happen at any given moment, and then size tempdb accordingly. By trick, I mean I don’t have the magical formula to figure that out.

          • SQL Server is already tracking that as well, Dennis, via the Version Store perfmon counters. You can learn more about those counters over at How To Tell When tempdb is a Performance Problem.

  • Deadlocks are the bane of my existence at the moment, so I really appreciate all of the info you provided. My company recently implemented Microsoft Dynamics CRM 2011 (long story as to why we implemented an old version) and from the very start of the project deadlocks have been running rampant. I use Quest’s (now Dell) Spotlight for SQL Server to alert for deadlocks and I get buried in alerts. I raised the issue to development, but I just get shrugs in response even though I can provide the transactions in question.

    The end user is unaware of the problem since these are mostly business rule engine activities that are largely transparent. The out of the box code even specifies low deadlock priority, so it seems Microsoft just threw their hands up and shipped it knowing there was a design issue.

    My question is whether you have heard of this issue with Dynamics and if there is a best practice to reduce the number of deadlocks. I suggested RCSI, but when tested it didn’t stem the tide of deadlocks appreciably.

    • Your guess is as good as mine. Dynamics has some pretty specific requirements about how it’s installed and configured on SQL Server, so I’ve found it best to defer to the product team’s direction on how to configure that particular instance of Dynamics.

      Typically, though, when the deadlocks keep happening during business processes despite RCSI being used, the deadlocks are occurring because of DML. At that point I want to look at indexes that will make those processes faster.

      Without knowing more, though, it’s difficult to say.

      • derek garside
        August 6, 2015 8:12 am

        We host CRM 2013 (migrated from 2011) on a SQL Server 2012 cluster. Yesterday I had over 4k alerts relating to locking and deadlocks, all from CRM. We have a ticket raised with Microsoft; so far they have responded with the action plan below, plus some SQL Server best practices.

        Action plan:
        • Change the OLEDBTimeout key in the registry of all front end servers.
          In Registry Editor, locate the following registry subkey: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSCRM
          Double-click the OleDbTimeout value and type 600.
          Note: if the OleDbTimeout value does not exist, create it: right-click MSCRM, point to New, click DWORD Value, type OleDbTimeout, then in the Edit DWORD Value dialog box click Decimal, type 600, and click OK.
        • Change the ExtendedTimeout key in the registry of all front end servers.
          In Registry Editor, locate the following registry subkey: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSCRM
          Double-click the ExtendedTimeout value and type 1000000.
          Note: if the ExtendedTimeout value does not exist, create it: right-click MSCRM, point to New, click DWORD Value, type ExtendedTimeout, then in the Edit DWORD Value dialog box click Decimal, type 1000000, and click OK.
        • Ensure you are taking advantage of ExecuteMultiple when implementing your custom plugin/workflow code.

        • Thanks for sharing the advice you got from Microsoft. For anyone else running into these issues – call Microsoft and open a support ticket before applying Derek’s advice. None of this advice is relevant to fixing deadlocks, but it will help with long running queries and long held locks.

  • OnError,servername,\,procedure x,{6E9CF106-3D3C-45C6-AF8F-40B8E3419427},{1EBB691B-6DBC-4695-AFF9-8D272DC1312F},6/30/2015 4:30:10 PM,6/30/2015 4:30:10 PM,-1073548784,0x,Executing the query “exec Populate_x ?,?” failed with the following error: “Transaction (Process ID 242) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.”. Possible failure reasons: Problems with the query, “ResultSet” property not set correctly, parameters not set correctly, or connection not established correctly.

    How do I resolve this issue? In the post you said to rerun the transaction – how do I rerun my transaction?

  • The issue I’m running into is that I have a SELECT statement executed via an SSIS package querying a replicated table. I’ve been encountering deadlocks, but due to the isolation level and locking levels I’m not sure why they are happening. Any ideas…

  • Hello Jeremiah,
    A reindex/update statistics job is failing every time as the victim of a deadlock (it’s scheduled Monday and Wednesday at 3:45 AM).
    There are a few update-table jobs running every day, every 10 minutes. Is this the reason for the deadlock?
    The customer doesn’t want to stop the update-table jobs.
    Error Message:
    Transaction (Process ID 109) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. [SQLSTATE 40001] (Error 1205). The step failed.

    Could you please suggest a solution for the above.

  • Garrett Devine
    October 26, 2018 2:48 am

    I have written a stored procedure to capture deadlocks and email them to the DBA. Here is how you call it:
    EXEC dbo.sp__DeadLockChecker @Minutes = 10, @Send_email = 'true'
    @Minutes is how far back in the log it checks for errors. So in my case, I run this every 10 minutes and it checks the last 10 minutes of activity. I have trace flags 1204 and 1222 set on at startup.

  • The MAXDOP = 1 setting is causing longer run times and application timeouts. I don’t understand the reason behind it. Could you please advise on how to fix it?

  • Excellent article!

    Sorry to nitpick, but “Good news! Deadlocks only happen when data is changed” is not strictly true. It’s exclusive locks that are the source of deadlocks, so you can still cause deadlocks between two sets of SELECT statements that are using WITH (TABLOCKX) or even WITH (ROWLOCK XLOCK) hints within transactions. i.e.: preparatory SELECT statements can still cause deadlocks before getting to the meat and potatoes of the INSERT or UPDATE statements in a transaction.

    For example:

    -- SSMS window 1...
    begin tran t1
    select top 10 * from dbo.Table1 with (rowlock xlock) where Deleted=0 order by Table1ID;
    waitfor delay '00:00:10.000';
    select top 10 * from dbo.Table2 with (rowlock xlock) where Deleted=0 order by Table2ID;
    rollback tran t1

    -- SSMS window 2...
    begin tran t2
    select top 10 * from dbo.Table2 with (rowlock xlock) where Deleted=0 order by Table2ID;
    waitfor delay '00:00:10.000';
    select top 10 * from dbo.Table1 with (rowlock xlock) where Deleted=0 order by Table1ID;
    rollback tran t2

    • If you’re dumb enough to do a transaction with a select using an exclusive lock hint, you have much, much bigger problems to worry about. Like drinking your coffee without spilling it into the keyboard.

