High Availability Doesn’t Fix Poor Performance

Imagine this: you have a database, and it’s on a SQL Server instance, which is on a server. That server may be physical or it may be virtualized. The database, or the instance, or the server, has a high availability solution implemented. It might be failover clustering; it might be synchronous mirroring; it might be VMware HA.

One day, your queries start to run a little slower. But the performance degradation is slow, and gradual. Weeks or months slip by, and bit by bit, millisecond by millisecond, the queries are taking longer. Before you know it, queries are taking out locks, blocking each other, and leading to timeouts.

What You Want To Happen

Your server is overloaded. CPU usage is high. Available memory is low. I/O times are dragging on. You know there is another server sitting in the data center that is waiting to take on this load. It’s just sitting there, idle!

You want your HA solution to kick in and take over. Why doesn’t it realize that performance has tanked and it needs to come to the rescue?

Your SQL Server Isn’t Down, It’s Just Thinking

Chances are that if there was a disaster and your HA solution kicked in right now, you’d experience the same terrible performance on the other server, too – with the added pain of the downtime it takes to fail over.

Why?

High availability solutions are implemented to be there in case of failure. High CPU usage, low memory warnings, or excessive I/O times don’t constitute a failure. As much as you wish the workload could be transferred to the server with lower utilization, it won’t be – you have to tune your workload.

Yes, I’m telling you to roll up your sleeves and start performance tuning.

Do you have monitoring in place so you can check your baselines to see what resources are being used more than normal?
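If you don’t have a monitoring tool in place yet, even a periodic snapshot of wait statistics gives you a baseline to compare against. Here’s a minimal sketch using sys.dm_os_wait_stats – the list of benign waits filtered out is illustrative, not exhaustive:

-- Snapshot the top waits; save the results on a schedule so you can compare against your baseline.
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'LAZYWRITER_SLEEP', N'SLEEP_TASK', N'WAITFOR',
    N'BROKER_TO_FLUSH', N'REQUEST_FOR_DEADLOCK_SEARCH')
ORDER BY wait_time_ms DESC;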

Have you checked the plan cache for the most resource-intensive queries on your server?
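If not, a query against sys.dm_exec_query_stats is a quick way to find them. This is just a sketch – it ranks what’s currently in the plan cache by total CPU, so swap the ORDER BY for total_logical_reads or total_elapsed_time depending on which resource is hurting:

-- Top 10 cached queries by total CPU time since each plan entered the cache.
SELECT TOP (10)
    qs.execution_count,
    qs.total_worker_time AS total_cpu_time_microseconds,
    qs.total_logical_reads,
    qs.total_elapsed_time,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset END
          - qs.statement_start_offset) / 2) + 1) AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;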

Are you performing regular index tuning (using a tool like sp_BlitzIndex®) so you have the right indexes for your workload?
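If you have sp_BlitzIndex® installed, a run against the database in question can be as simple as the following – this sketch assumes the procedure is installed where you can reach it and that @DatabaseName is still the parameter name in your version, so check the documentation:

-- Review index findings for a single database.
EXEC dbo.sp_BlitzIndex @DatabaseName = N'YourDatabaseName';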

Remember, your HA solution is there to pick up the pieces when something fails – not to be a safety net for poor performance.


1 Comment

  • On very rare occasions – maybe once a year – I do fail over to address a “hidden” performance condition. If I find myself spending even one hour running resource-related queries, running profiler traces through ClearTrace, examining horribly written/long-running stored procedure execution plans, etc. – sometimes it’s best to just start fresh and see if the condition rectifies itself. Many DBAs don’t know how to examine TempDB allocations, for example, which can indicate a query run wild and chewing up gigabytes of TempDB. While a failover won’t necessarily rectify the situation at hand, it gives me a fresh slate from which to start my troubleshooting, and oftentimes the issue stands out a bit more.

    I actually used this method just the other day when the active node of my server was on fire – 100% CPU, high I/O wait times, etc. I spent *three days* trying to uncover the issue, which included two failovers, many traces, sp_whoisactive analysis, examining every metric under the sun (I use Spotlight also), etc. Turns out recycling each of our load-balanced web servers did the trick. I’m STILL scratching my head on this one – how can IIS cause such havoc on a database server?

    In the end I’m better off now – one action I did take was to enable hyperthreading on the cluster nodes and my CPU has never been lower 😉 Never know – sometimes it helps, sometimes it doesn’t.

    Thanks for the post Jes!

