Next up in my series of behind-the-scenes posts about our new SQL ConstantCare®, let’s talk about the kinds of SQL Server diagnostic data we’re gathering, why, and what data we’re steering clear of for now. (Psst – read through to the end for a huge, crazy discount today, too.)
We – you and I – are paranoid.
We’re a data professionals, first and foremost. During my consulting intro calls, I’m really proud to say that we don’t need to get a VPN connection, remote access, or a SQL Server account. You know how it is – DBA, Default Blame Acceptor – and these servers are already in bad shape. The last thing folks want is some consultant going rogue and “fixing” something, so we make it clear from the get-go that won’t happen.
Because we’re so paranoid, my first thought when deciding to collect diagnostic data was, “I don’t wanna end up in the news.” I bet you feel the same way. Therefore, we thought about how we could give users as much valuable insight as possible from as little data as possible.
I drew a big, clear box around queries, query plans, and statistics and said, “We’re not going here in v1.” That data can produce awesome performance recommendations, but it often contains personally identifiable information. I know some people are willing to share that data on a plan-by-plan basis – the referral-marketing success of PasteThePlan proves it. I even bet a lot of people are even willing to share a lot of their plans on a regular basis as long as they’re getting valuable performance advice. However, I think we can still provide a lot of value without having that data, and we have plenty of Lean Startup lessons to learn along the way first. For now, rather than automating that part of the analysis, we simply tell customers in their advice, “If you’re able to tune queries or indexes to fix this, here’s the exact parameters to use with sp_BlitzCache and sp_BlitzIndex to find the root cause of this problem.”
We started with the only data to run the most urgent sp_Blitz checks.
To pick the data we’d gather at first, Richie took the list of sp_Blitz checks from priority 1 to 100 – the most urgent stuff – and came up with the list of data requirements to be able to send emails like this:
To see the queries we run to get that data, look at the queries ConstantCare.exe runs by going to Windows Explorer and opening the folder:
The way it works is that ConstantCare.exe connects to the SQL Server, runs these queries, writes the results to an encrypted zip file, and uploads it to a private bucket in Amazon S3.
At this point, you might be thinking, “Why not just run sp_Blitz on the server and harvest the results?” We wanted to design checks that worked with data over time – for example, after we find auto-shrink enabled, and after you turn it back off, how are your wait stats affected? (And yes, we’ve already found over a dozen servers with auto-shrink enabled, hahaha.)
Exposing the queries directly to customers also has a side benefit: later this year, they’ll be able to save override versions of the queries. This way, if there are some parts of the diagnostic data that you don’t wanna send in, you can override that portion of the query (returning nulls, or masked data, or just not return any rows at all.) This presents a little more design and support work for us since missing rows might trigger a false alarm – for example, if no backup data comes in, we don’t wanna say, “OMG, you aren’t backing up your databases!”
Serverless architecture means per-database checks are harder.
We hit a few challenges early:
- In the past, we’ve used sp_MSforeachdb, but that doesn’t work in Azure SQL DB (and we wanted eventual compatibility with that, although not necessarily in v1)
- People have a crazy number of databases per server (think thousands)
- People who have a crazy number of databases also seem to have a crazy number of indexes per database (think thousands per database, times thousands of servers)
- And they also have a crazy amount of backup job executions (especially if they took my advice about backup frequency)
Moving large files around in the cloud isn’t a big deal, but processing them is, especially with serverless. AWS Lambda functions have a 5-minute timeout, and we use Lambda functions for stuff like:
- Importing the client’s JSON data into the database – in the first version of the intake process, we were processing incoming files line-by-line – and hoowee, did that not work with the crazy-number scenarios. After hitting timeout walls, Richie later changed it to pull the whole JSON file into the database as a set, plus we held the index analysis back for a later version.
- Analyzing the data to build email content – just like your homegrown queries look at SQL Server’s DMVs to figure out what’s broken, we’re using Lambda functions to run database queries to look at your DMV data. It’s just that the data is now stored in a central database on our end. Imagine a database server that holds DMV data from lots of customers – we’re gonna run into performance issues as we gain clients, and retain more database history. If your homegrown query takes over 5 minutes to run, you’re the only one who cares – but if our AWS Lambda functions time out over 5 minutes, you won’t get your report, and I care about that.
Watching function runtimes and timeouts has probably been one of the more interesting challenges so far. We have a company Slack channel for SQL ConstantCare® where status updates go and we talk about what we’re doing about ’em. We laugh a lot in there. Also, swearing and gifs.
The whole mission and design of SQL ConstantCare® means these timeout errors are less of a concern – it’s a mentoring product, not a real time monitoring product, and users don’t expect to get an email instantly after their data is sent in. (Plus, data is sent in daily, but for now, analysis & mentoring emails only go out weekly.) The database size is less of a concern, too – during the initial Minimum Viable Product (MVP) phase of the project, we could even delete and redeploy the database from scratch whenever we want, and the product still worked fine.