How to Troubleshoot SQL Server Connection Timeouts

SQL Server
19 Comments

If your application can sometimes connect to SQL Server – and sometimes not – it can be really tough to troubleshoot. If SQL Server doesn’t even hear the call, it can’t log any errors.

Here’s the questions I ask to get to the root cause:

When it happens, does it happen to all applications? For example, do you have monitoring tools pointed at the SQL Server, and are they able to consistently connect to SQL Server even when the problem is happening?

Does it happen to all application servers? If you have several app or web servers, are they all affected? (If you’ve only got one, now is an excellent time to set up another one for troubleshooting, and balance the load between them.)

Are all queries in the application affected, or just some queries? Sometimes I see long-running queries keep right on going, but only new connections are affected.

Sometimes, your SQL Server just goes on vacation.
Sometimes, your SQL Server just goes on vacation.

Are there any errors logged in the SQL Server or application servers? In one case, we saw that all of the application servers lost network connectivity at the same time, on a regular basis. Turns out there was a bad switch involved.

Is there a pattern to the days/times of the timeouts? Start writing them down or documenting when they happen. For example, in one case, we saw that the days/times were exactly correlated to the security team’s regularly scheduled port scans.

During the timeouts, is the app server able to ping the SQL Server? When all else failed with one troubleshooting triage, we put a free network monitoring tool on the app server to ping the SQL Server every 10 seconds. Sure enough, the next time the app had query timeouts, we were able to prove that even pings weren’t working – thereby ruling out a SQL problem.

Ask those questions, and sometimes you don’t even have to troubleshoot SQL Server at all – the answers tell the whole story.

Previous Post
Do Foreign Keys Matter for Insert Speed
Next Post
Announcing Live Query Execution Plans

19 Comments. Leave new

  • > Sure enough, the next time the app had query timeouts, we were able to prove that even pings weren’t working – thereby ruling out a SQL problem.

    Just last week we had a curious issue where an application has suddenly started failing every 3rd night with apparent database connection issues; but we could find no fault on our end (it was a cluster that is used concurrently with lots of other apps with no issues).

    The cause? Well it seems the well known data centre we use had silently turned on a new VM backup process that is completely invisible to us, and is causing machines to lock up long enough for non-resilient apps to fail. My co-workers discovered this only because a couple of our clusters coincidentally started failing over for no apparent reason… 🙂

    I’m not sure what the takeaway is except maybe if you’re out of options it would help to examine the basic assumptions you’ve made (“the data centre couldn’t possibly have changed anything without us knowing, and surely it would show in the logs if they did!”) and rule them back in.

    Reply
  • Brent – As always, thank you. What free network monitoring tool did you use?

    Cody – We had a similar issue. We converted from XenServer to VMWare and started using VEEAM for VM backups. We have one app that is very finicky, and tends to fail if it loses connection for even the slightest amount of time. We have this app installed on multiple servers and noticed that some, but not all, would fail during the VEEAM db server backup period. When we suspended the db server backups, the issue went away. There was nothing in the SQL Error Logs that indicated a problem, so as far as SQL Server was concerned, nothing bad happened. Fortunately, everything is in-house, and after talking with our sysadmin it was easy to figure out.

    Reply
    • Scott – I’m hesitant to plug one here because it’s not really the point of the post. I’d use the one you’re most comfortable with.

      Reply
    • Scott – This sounds like a VMware issue and can be easily reproduced when you take a snapshot, wait for a while(however long the backup takes) and then consolidate the snapshot. If you google ‘vmware freeze vm snapshot’ or something like that, they even have a KB for that!

      Reply
  • Make sure connection pooling is being used (perfmon is helpful), and in my experience some
    orms/dal tools are MUCH more prone to timeouts than others.

    Reply
  • Thierry Van Durme
    June 1, 2015 3:38 am

    I wonder if anyone’s ever experienced timeouts related to transaction log backups. Some time ago I had a long troubleshooting journey as to why applications sometimes timed out connecting to SQL (multi-instance 2-node physical cluster). I posted my findings here and there but the only response I got was that backups shouldn’t cause timeouts. I’m pretty sure that is the cause though because when I changed the backup schedule with a few minutes (still running every 15 minutes but now at xx:07 instedad of xx:05 etc) the timeouts followed the same pattern.
    We still face this issue now and then (tested on 2005 up to 2008r2) and – even though we don’t face any apparent performance problems on these instances – increasing MAXMEM seems to have a positive influence. So if anyone experiences the same I’m interested to hear your thoughts.

    Something else I was wondering about: sys.dm_os_ring_buffers where ring_buffer_type = ‘RING_BUFFER_CONNECTIVITY’. Sure you see bad password errors (18456) but these are in the errorlog as well. So, anyone have some real-life examples on less trivial situations?

    Chrz

    Reply
    • Thierry – that’s typically due to a dramatically underpowered SQL Server, or doing log backups over an underpowered network connection.

      Reply
    • I’m curious if you checked to see if you were experiencing memory pressure (as you said max mem helped). IME SQL Server can get a bit funky if it’s really desperate for memory.

      No we didn’t see that on a busy system even low on memory. But I have seen systems in general that are just low on memory thrashing the disk and causing connection issues.

      Reply
      • Thierry Van Durme
        June 1, 2015 7:09 am

        Thx Brent & Cody for the replies. The strange thing is that we don’t experience performance problems. No complaints (except for the occasional timeout which app engineers don’t even always notice), DMVs not showing apparent issues, Spotlight on SQL not showing anything unusual nor alerts. At first I thought it was the network because when it happens it’s mostly during off-business hours (hence no complaints probably :)) when a lot of other large backups are running concurrently. But it happens during daytime as well and I would also think increasing MAXMEM would have no effect if it was a network issue. But maybe I’m just being mislead by other factors. To be continued… Thx again!

        Reply
  • Hi Brent,

    we face two issues relating to SQL connectivity (1) pre-login handshake errors where there is a load on BizTalk webservers and on a 2min interval around 60-70 messages are suspended because of pre-login handshake errors. No messages in SQL or windows eventlog, ping, tracert etc coming clean. All ports cleared in firewall and rest times, things work fine. What further to check here?

    (2) The BizTalk SQLServers going out of domain frequently; on checking we find lots of connections (i.e >= 1000) taking up ports while connecting to SQL named instance. At this point, we end up rebooting the passive node and failover. What further to check here?

    Thanks,
    Samjong Keps

    Reply
    • Samjong:

      1 – if it happens regularly on a 2 minute interval, I’d watch the SQL Server’s network connection with Wireshark. Get your network admins to help there, and they can track down whether the SQL Server isn’t seeing the connection requests, or if it’s something the SQL Server is waiting on.

      2 – open a call with MS support. It’s $500, and they work the problem with you until it’s fixed. It’s not that this is an un-solvable problem by any means, but just that it could be any one of a bunch of problems, and for just $500, you get help from the company who wrote both BizTalk AND SQL Server, so it’s a deal.

      Reply
    • Thierry Van Durme
      June 8, 2015 3:26 am

      FYI, maybe it rings a bell.
      We experience the issue as well where BizTalk spawns thousands (>6000) of connections to SQL. We inform the BizTalk engineers and they tell us the issue is with a process (I think these are called orchestrations?) that connects to a third party’s web service and it doesn’t respond. I would say this is a bug in the code. So maybe you have something similar running?

      Also this may be interesting: https://msdn.microsoft.com/en-us/library/aa560610(v=bts.10).aspx

      And as for the Biztalk servers getting out of the domain. We had that one as well. That was caused by domain controllers being rebooted periodically as part of maintenance operations. You may find this helpful http://blogs.msdn.com/b/biztalknotes/archive/2013/08/22/biztalk-hosts-fail-when-domain-controller-s-are-rebooted.aspx

      Hope this helps

      Reply
  • Hello Brent,

    Is it possible for cancelled auto growth of transaction logs to cause a client application to lose connectivity to SQL Server DB? The time of the lost connection within the application logs matched the time auto growth was cancelled for the second time. The first log grow attempt was cancelled or timed out after 60014 milliseconds and the second one was cancelled or timed out after 1014 milliseconds. The third try was finally successful, but it took 90419 milliseconds to grow the log file by 100 mb…

    Does the process of auto growing affect DB connectivity in any way? Also, 100 MB is not too big for auto grow, right? It’s the first time I’ve seen this message and the log file is currently already at 38 GB so I think it will grow by a lot further…

    Thank you.

    Reply
    • K – if you’re having log growths taking 60-90 seconds, adjust your log file size growth smaller.

      Reply
      • Thanks for the super fast response!

        Do you think that might have affected client connectivity to the database?

        Reply
        • K – I have no idea. I’ve never seen log file growths take 60 seconds before, and if I did, I sure as hell wouldn’t keep troubleshooting deeper. I’d stop there.

          It’s like saying, “Do you think a case of gonorrhea would reduce my 400 meter dash time?” Take care of the gonorrhea before you worry about the 400 meter dash time.

          Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.