If your application can sometimes connect to SQL Server – and sometimes not – it can be really tough to troubleshoot. If SQL Server doesn’t even hear the call, it can’t log any errors.
Here’s the questions I ask to get to the root cause:
When it happens, does it happen to all applications? For example, do you have monitoring tools pointed at the SQL Server, and are they able to consistently connect to SQL Server even when the problem is happening?
Does it happen to all application servers? If you have several app or web servers, are they all affected? (If you’ve only got one, now is an excellent time to set up another one for troubleshooting, and balance the load between them.)
Are all queries in the application affected, or just some queries? Sometimes I see long-running queries keep right on going, but only new connections are affected.
Are there any errors logged in the SQL Server or application servers? In one case, we saw that all of the application servers lost network connectivity at the same time, on a regular basis. Turns out there was a bad switch involved.
Is there a pattern to the days/times of the timeouts? Start writing them down or documenting when they happen. For example, in one case, we saw that the days/times were exactly correlated to the security team’s regularly scheduled port scans.
During the timeouts, is the app server able to ping the SQL Server? When all else failed with one troubleshooting triage, we put a free network monitoring tool on the app server to ping the SQL Server every 10 seconds. Sure enough, the next time the app had query timeouts, we were able to prove that even pings weren’t working – thereby ruling out a SQL problem.
Ask those questions, and sometimes you don’t even have to troubleshoot SQL Server at all – the answers tell the whole story.