When we first launched SQL ConstantCare®, the workflow was:
- You sent in data about your SQL Servers daily
- Our systems sent us (not you) recommendations on what to do
- Once a week, we reviewed those recommendations, made sure they made sense, adapted them when necessary, and then forwarded ’em on to you
Over time, we became more confident in the system’s recommendations, and we started sending the daily emails directly to some of our customers. The response was super-positive: people loved being able to get faster feedback when new urgent problems pop up.
So starting today, everybody gets the daily per-server emails. Because they’re fully automated, they should arrive within a few minutes of your regularly scheduled data collection. (I’m still doing the manual analysis emails, but now I’m holding off until you get all the basic bat-in-the-cave recommendations fixed.)
However, there are some problems you don’t wanna hear about every day. For example, you don’t wanna keep hearing that your SQL Server build level is unsupported, or that you need to disable Priority Boost – those things require a restart to fix. No sense in pestering you about that every day. So for big-picture restart-required stuff, we’re only alerting you about those on Mondays. Gives you a chance to plan your week – to figure out whether you wanna try to schedule outages sometime later.
On Mondays, we’ve also added a new section to the emails that covers what you’ve fixed recently:
My goal was to give you something you could forward up the management chain to say, “Here’s what I’ve been working on, and here’s what it means to our environment as a whole.”
Toward the end of the year, we’ll give you a roundup of your overall environment that you’ll be able to use for your annual review, too. Since we’re tracking the state of your servers over time, it’s easy for me to build a report that says, “Here’s what my environment looked like at the start of the year – warts and all – and here’s what it looks like now.”
Next up: identifying bored servers
A few selected users started seeing a new piece of advice last week: your SQL Server is bored, and you could probably run CHECKDB and/or backups more frequently in order to reduce your risk of data loss and downtime.
When that advice went out, I immediately got a lot of replies asking, “Wait – I thought my server was busy? What do you mean it’s bored?” Talking with ’em, a lot of them followed up with questions like, “So if my server isn’t busy, what does a busy server look like?” That makes sense, given that 64% of SQL ConstantCare users have 3 servers or less. (77% have 5 or less.) They just don’t have perspective across a wide estate of SQL Servers.
To explain a busy vs bored server, I start by pointing folks to the video How to Measure Your SQL Server, which introduces the concept of Wait Time Ratio. Say for every hour on the clock, your server spends one hour waiting on, say, storage. That’s a ratio of 1:1 – meaning, 1 hour of waits in 1 hour of clock time.
The more hours your server spends waiting on stuff, the higher your wait time ratio is. Here’s an example of a server that’s really hustling:
For every hour on the clock, this server spends 12-26 hours waiting on stuff. That axis over on the left hand side is really important.
On the other hand, here’s a bored server:
For every hour on the clock, this thing doesn’t even spend one minute waiting on things. In a case like this, if you’re only doing monthly full backups and daily differentials, thinking you’re making your server’s workload easier, you’re not really helping. The server’s bored all night anyway. You might as well be doing full backups daily (assuming a small enough set of databases and enough backup space) in order to get a better recovery time objective.
Wanna see if your server’s bored? Start with sp_BlitzFirst @SinceStartup = 1, which tells you your overall wait times since startup. It’s not a perfect measure – it’s a sum of wait times since the server started up, which includes weekends, holidays, etc – but it’s a good free start.
I run sp_BlitzFirst hourly via a SQL Server agent task, dumping the results to stats tables:
@OutputDatabaseName = ‘dba’,
@OutputSchemaName = ‘dbo’,
@OutputTableName = ‘BlitzFirst’,
@OutputTableNameFileStats = ‘BlitzFirst_FileStats’,
@OutputTableNamePerfmonStats = ‘BlitzFirst_PerfmonStats’,
@OutputTableNameWaitStats = ‘BlitzFirst_WaitStats’,
@OutputTableNameBlitzCache = ‘BlitzCache’,
@OutputTableRetentionDays = 28;
I believe the following view will give the wait time ratios using the data in the WaitStats table. It uses the Lag function, so you need SQL Server 2012 or newer. I excluded CLR_AUTO_EVENT waits since there were some extra huge ones that seemed invalid.
Cast(ComparisonQ.WaitTimeSecThisPeriod / ComparisonQ.ElapsedSec * 100 AS decimal(9, 2)) AS WaitTimeRatio,
FROM ( SELECT CheckDate,
WaitTimeSec – Lag(StatsQ.WaitTimeSec, 1) OVER ( ORDER BY StatsQ.CheckDate ) AS WaitTimeSecThisPeriod,
DateDiff(SECOND, Lag(StatsQ.CheckDate, 1) OVER ( ORDER BY StatsQ.CheckDate ), CheckDate) AS ElapsedSec
FROM ( SELECT Cast(CheckDate As Datetime) As CheckDate,
Cast(Sum([wait_time_ms] / 1000.0) AS decimal(9, 0)) AS WaitTimeSec
WHERE NOT wait_type = ‘CLR_AUTO_EVENT’
GROUP BY Cast(CheckDate As Datetime)
WHERE ComparisonQ.WaitTimeSecThisPeriod > 0
Matthew – wow, you’re reinventing a lot of work there. Check this out:
Thanks for reminding me about the dashboard. I tried it previously, which is likely where the SQL Server agent job came from. I recall I had some troubles using the GUI but I left the logging in place. I’ll give it another try.