Building SQL ConstantCare®: The Vision

We’re starting to roll out a new product, SQL ConstantCare®, and now that we’ve got over 50 customers, it’s time to start sharing some behind-the-scenes details about what we’ve built. Over the next several Mondays, I’ll be writing about the decisions we made along the way – architecture, packaging, pricing, support, and more.

Lemme start by explaining something that I’ve noticed a lot in the field. A lot of companies buy SQL Server monitoring software, and then…don’t have the time to learn how to use it. They install it, and then promptly set up Outlook rules to filter all of the email alerts into a folder that they never read.

I’ve even point-blank asked clients, “So you’ve got a monitoring tool, right? Open it up and show me what metrics you’ve looked at in order to troubleshoot this problem.” They open up the app, click around hesitantly, and then eventually confess that they have no idea what they’re doing or what numbers they’re supposed to look at. Even when they have a pretty good handle on SQL Server metrics, they get overwhelmed when they see all the dials and charts.

Monitoring tools are fantastic for highly trained people with plenty of time on their hands.

But most people out there don’t have the luxury of in-depth training and decades of experience. They’ve got too many servers and not enough time. They just want to cut to the chase and be told what tasks they need to do, in prioritized order.

(Photo caption: “You’re safe in my jazz hands”)

Admins want mentoring, not monitoring.

So I wanted to build something that simply:

  • Checked in with you once a week
  • Told you what specific tasks to do, how to do them, and why
  • Reviewed the homework you were supposed to do last week, and what kind of difference it made

I didn’t wanna show you dials, charts, or any metrics whatsoever other than supporting evidence for your homework tasks, like proving that your change was effective and noticeable by end users.

In the cloud, admins want context and cost.

One of the most common questions I get from clients – especially when I’m there in person and people feel more comfortable asking it – is, “How are we doing compared to other shops? Are we managing our servers like everybody else does? Are we over-powered or under-powered?”

It’s easy for us to give clients a rough guesstimate and grade because we see a lot of servers. However, we wanted to take it to the next level and say things like:

For SharePoint environments with a data size (~1TB) and query workload similar to yours, your server is seriously underpowered, and as a result, you’re seeing slower queries and higher wait times. The sweet spot for ~1TB of SharePoint data seems to be around 8 cores and 64GB RAM. If you switch from an m5.xlarge to an m5.2xlarge, we estimate that the average query duration will drop by 40% without any code or index changes. The VM’s cost will go up by about $616/month.

Or…

This data warehouse is doing very well: your nightly loads are finishing in 90 minutes, you’re doing CHECKDB and full backups daily, and during the day, reports are finishing within 10 seconds. It’s a little over-provisioned. If you wanted to cut your monthly bill, it’d actually be fairly easy since it’s in an Availability Group in an Azure VM. Change the secondary to be an L16 instead of an L32, and during your next maintenance window, fail over to the L16. See how the user experience goes, and if it’s awful, you can always fail back to the L32 secondary replica. If it’s good enough, though, then change the remaining L32 replica down to an L16 too. Between the two replicas, you’ll save about $12K per month.
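
That second recommendation is less scary than it sounds. Here’s a rough sketch of the failover step (assuming a synchronous-commit Availability Group, and using [YourAG] as a placeholder name), run on the resized secondary during the maintenance window:

    -- [YourAG] is a placeholder; swap in your Availability Group's name.
    -- Run this on the resized (L16) secondary, after confirming it shows as SYNCHRONIZED:
    ALTER AVAILABILITY GROUP [YourAG] FAILOVER;

    -- Then confirm which replica is now primary and that synchronization is healthy:
    SELECT ag.name AS ag_name,
           ar.replica_server_name,
           ars.role_desc,
           ars.synchronization_health_desc
    FROM sys.dm_hadr_availability_replica_states AS ars
    JOIN sys.availability_replicas AS ar ON ar.replica_id = ars.replica_id
    JOIN sys.availability_groups  AS ag ON ag.group_id   = ars.group_id;

If the user experience suffers, failing back is the same statement run on the L32, which is now the secondary.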

We also wanted to answer management questions at scale, like:

  • When a new CU comes out, does it backfire? For example, when 2014 SP1 CU6 broke NOLOCK, if we had wait stats data across thousands of servers, it’d be much easier for that to pop out right away – like the very next day after people applied the CU. This is becoming so much more important in these days of fast-paced updates.
  • What are adoption rates of features like In-Memory OLTP or Operational Analytics? Are you really safe investing your limited training time in those features?
  • Are other people using a particular trace flag, like 8048 to prevent CMEMTHREAD, and what before/after performance effects did it have?
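
To make that last question concrete: the per-server collection step is tiny. A sketch of that kind of check (not the actual collector code) looks like this:

    -- Capture which trace flags are enabled globally on this server.
    -- Aggregated across thousands of servers, this answers
    -- "who else is running trace flag 8048, and did it help?"
    DECLARE @flags TABLE (TraceFlag int, Status bit, Global bit, Session bit);
    INSERT INTO @flags
    EXEC ('DBCC TRACESTATUS(-1) WITH NO_INFOMSGS');
    SELECT TraceFlag, Status FROM @flags WHERE Global = 1;

Collect that daily from every server alongside wait stats, and the before/after comparison becomes a straightforward query.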

These are data problems.
We are data professionals.

But at the same time, sound the alarms: the cloud is involved.

When I started designing this in 2015, the term “the cloud” provoked rabid anger from many data professionals, as in, “I ain’t never gonna let none of mah data into the cloud!” At the other extreme, some folks are perfectly willing to paste their execution plans out for the public to see. There’s a wide cloud comfort range out there.

I totally get it, and I knew from the start that this product wouldn’t be for everyone when it was launched. But I wanted to design something for the next 10 years, not the past 10, and over time, the cloud is going to be the new default starting spot for most data. I was totally okay with launching something that only 1-10% of data professionals would use. After all, to use our mailing list as an example, there are roughly 100K data professionals out there. If only 1% of them bought into it, that’d still be one hell of a helpful tool.

I’ll talk much more about the collection, data, security, storage, and analysis in subsequent posts, and I’m excited to share it because as a data professional, I wanna set a good standard as to how data gets handled. (Now I bet my GDPR post suddenly makes more sense to you, dear reader, but I wasn’t quite ready to announce SQL ConstantCare® yet back then.) I’m aiming for GDPR compliance even though we’re not selling to the EU – I’m just not ready to deal with the complexities and legal fees of being one of the first defendants if something goes wrong. These aren’t easy problems – but this is what it means to work with data in the year 2018.

So that was the vision. Over the next several Mondays, I’ll blog about the PaaS database back end, development timeline, minimum viable product, packaging, pricing, security, analysis, and more. Up first next week: why we picked serverless architecture running on AWS Lambda. If you’ve got questions, feel free to ask away – in these behind-the-scenes posts, I try to share the kinds of business decisions I’d find interesting as a reader. I’ll try to answer the questions in comments, and it’ll also help me shape the rest of the posts in this series.

Read on about SQL ConstantCare®’s Serverless Architecture, try it out, or check out the rest of the SQL ConstantCare® series.


Comments

  • So, do you have fancy scripts that analyze the uploaded content and auto-generate these emails? Or DBA_worker_mice who manually analyze the results and type out recommendations? Or somewhere in the middle?

    • Such a good question! Richie started with the rule business logic from sp_Blitz and turned that into functions that run in AWS Lambda and build emails out of those recommendations. In the next month or so, we’ll let those emails go straight to the end users.

      For now, the emails go to us instead, and we’re manually reviewing the output. That way we can make sure the wording lines up with what we’d want to say to someone, and that the priorities make sense. For example, there’s a “many execution plans for one query” check in sp_Blitz that does make sense for performance analysis, but it’s just not important enough to surface in this kind of recommendation. (I’d rather do deeper analysis against the data to figure out whether fixing the query, turning on Forced Parameterization, or simply ignoring it is the right call.)

      We’re also building deeper rules than we could have done with sp_Blitz because now we have information over time. For example, I can query someone’s wait stats over the trailing 7 days, compare them to the 7 days prior, and find out if something went haywire in the server – and then look for what might have triggered it. We’re doing those manually for now (just like you’d do as a DBA: build it manually first, then figure out how to automate it).
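
      Purely as an illustration (the table, columns, and threshold below are invented for this sketch, not SQL ConstantCare®’s actual schema), that week-over-week comparison boils down to something like this:

        -- Hypothetical schema: one row per server, per day, per wait type.
        DECLARE @server_id int = 42;  -- hypothetical server identifier

        WITH windowed AS (
            SELECT wait_type,
                   SUM(CASE WHEN collection_date >= DATEADD(DAY, -7, GETDATE())
                            THEN wait_time_ms ELSE 0 END) AS last_7_days_ms,
                   SUM(CASE WHEN collection_date <  DATEADD(DAY, -7, GETDATE())
                            THEN wait_time_ms ELSE 0 END) AS prior_7_days_ms
            FROM dbo.wait_stats_daily            -- invented table name
            WHERE server_id = @server_id
              AND collection_date >= DATEADD(DAY, -14, GETDATE())
            GROUP BY wait_type
        )
        SELECT wait_type, prior_7_days_ms, last_7_days_ms
        FROM windowed
        WHERE last_7_days_ms > 2 * prior_7_days_ms   -- "went haywire" threshold
        ORDER BY last_7_days_ms DESC;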

