When you wanna build something, how early do you let the public in? Where do you draw the lines with the features you absolutely have to have – versus the stuff you just want? Code is never really done.
The lines are so blurry these days with labels like Alpha, Beta, Private Preview, Public Preview, Early Access, Limited Availability, Regional Availability, General Availability, etc. Companies wanna get something in your hands as quickly as possible so they can start learning from what you like and what you use. What’s the first thing you actually ship?
MVP: Minimum Viable Product
In Eric Ries’ excellent book The Lean Startup, he talks about how your company should rapidly iterate through a feedback loop:
- Release something as quickly as possible
- Measure how customers use it
- Learn lessons from those measurements, and then
- Go back to step 1
The first time you hit step #1, that’s called the Minimum Viable Product (MVP). Your MVP doesn’t even have to be an app or online service – it could be a manual process, or it could even be just a signup form for a service that doesn’t exist yet. In a sense, you could think of sp_Blitz as the MVP: a script that people could run and get advice about their server.
If you were going to build a system to give people advice about their servers, here are a few ways you could do it:
- Totally Manual Process: you build a list of queries, they copy/paste the queries into SSMS, copy/paste the results into Excel, email them to you, you analyze the data manually, and you manually email them a reply.
- Automatic Collection, Manual Analysis: you build an application that runs a bunch of queries and dumps the data into files (say Excel). The user emails you the files, and then you analyze them manually, and tell them what to do.
- Automatic Collection, Automatic Analysis: you build an app that runs queries, sends the data to you, and then robots analyze the data and send recommendations to the end user.
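The collection half of option #2 could be sketched roughly like this – the query list, stubbed database call, and file layout are all made up for illustration, not anything from the real collector:

```python
# Hypothetical sketch of option #2's collection half: run a fixed
# list of diagnostic queries and dump each result set to a CSV the
# user can email in for manual analysis. Queries are illustrative.
import csv

QUERIES = {
    "backups": "SELECT database_name, MAX(backup_finish_date) ...",
    "waits":   "SELECT wait_type, wait_time_ms FROM sys.dm_os_wait_stats",
}

def run_query(sql):
    """Stub standing in for a real database call (e.g., via pyodbc)."""
    return [{"col1": "example", "col2": 42}]

def collect(out_dir="."):
    """Run every diagnostic query and write one CSV per result set."""
    paths = []
    for name, sql in QUERIES.items():
        rows = run_query(sql)
        path = f"{out_dir}/{name}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths

print(collect())  # CSV files ready to email in for manual analysis
```

Option #3 then replaces the "email the files to a human" step with automated analysis on the receiving end – same collection, robots instead of eyeballs.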
Obviously, #3 is a hell of a lot harder than #1, and it ain’t minimum.
You’re probably looking at what SQL ConstantCare® does and thinking, “Wait, he screwed up – he jumped straight to #3.” Well, we’d already done #1 early in our consulting practice, and #2 was our SQL Critical Care® consulting service. (Richie had built a data collection app to make our process faster.)
There’s a lot of gray area in #3, though. We made a lot of brutal decisions about what we would do in order to get the MVP out the door.
What we skipped
We wanted to collect data, put it into a database, and send you emails with advice – but everything about the process was up for debate. If we could put something in your hands faster by making some tough decisions, then we made ’em.
- “Mute” links – I wanna make it as easy as possible for you to permanently mute servers, databases, or specific alerts that you don’t care about. Soon, the emails will have mute links, but for now, we’re having folks just hit reply and tell us what they want muted, and we mute ’em on the back end.
- Self-updating app – I really wanted this for v1 because I figured we’d be iterating fast over the collector & queries, but no go. The components are there, but you have to run ConstantCare.exe yourself manually if you want to get the update. That’s due to permissions gotchas with Squirrel, the updating tool we used.
- Windows Service (rather than scheduled task) – an always-running service would let this thing work better in a data center environment, but I’m also hesitant to deal with the support gotchas involved with an always-on self-updating service.
- Team hierarchy – later, I’d love to designate different people for different groups of servers, or different levels of alerts (like production DBAs vs developers). For now, if you want different teams to get emails for different servers, you’ll need to install different instances of the collector.
- Interactive web site with your data & recommendations – didn’t wanna hassle with logins for now.
- Shareable anonymized reports – I’d love to give you the ability to pass your DMV data on to your software vendor or consultant and say, “Here’s what we’re up against – you take a look and tell me what you think.”
- AG/DR-aware checks – if you have an AG, I’d like to be able to identify backup history across all the nodes and tell you if you have gaps in coverage. (As long as we’re getting backup data across all replicas, this should be doable – just takes more query work.)
- Troubleshooting tools – sure, it’d be nice to have a dashboard showing where all the incoming files are at in processing, but it doesn’t make sense to build something like that when our processes are changing so fast during the MVP.
- Automated wait stats analysis – for example, “Your server is waiting on storage, but it’s not because the storage is bad – you just need to change max memory because it’s set incorrectly. Change it to ___, here’s how, and here’s how safe (or unsafe) it is.” Right now, I’m doing this manually for customers, building a manual process of what I’m thinking as I do it.
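The wait-stats example in that last bullet could eventually become a rule like this sketch – the thresholds, field names, and 90%-of-RAM heuristic are my illustrative assumptions, not the actual logic being built:

```python
# Hypothetical sketch of one automated wait-stats rule: storage waits
# that are really a memory misconfiguration. Thresholds and field
# names are made up for illustration.

def check_storage_waits(stats):
    """If PAGEIOLATCH waits dominate but max server memory is set
    far below physical RAM, blame the memory setting, not the SAN."""
    io_pct = stats["pageiolatch_ms"] / stats["total_wait_ms"] * 100
    if io_pct < 30:
        return None  # storage isn't the top wait - nothing to say
    ram_mb = stats["physical_ram_mb"]
    recommended_mb = int(ram_mb * 0.9)  # leave ~10% of RAM for the OS
    if stats["max_server_memory_mb"] < recommended_mb * 0.5:
        return (f"Your server is waiting on storage, but max server "
                f"memory is only {stats['max_server_memory_mb']}MB "
                f"of {ram_mb}MB RAM. Try raising it to about "
                f"{recommended_mb}MB.")
    return "Your server is waiting on storage; investigate the I/O path."

stats = {"pageiolatch_ms": 500_000, "total_wait_ms": 1_000_000,
         "physical_ram_mb": 65_536, "max_server_memory_mb": 2_048}
print(check_storage_waits(stats))
```

Doing this manually first is exactly how you find out which branches a rule like this needs before you automate it.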
What we shipped
Users install a desktop app that sends us diagnostic data once a day. Here’s a video showing how you install it – warning, it ain’t pretty:
There’s no GUI, there’s no wizard to add servers for you, and there’s no input validation if you screw up your email address or server names. There’s a lot of places where things can go wrong. I’d love to have an entirely graphical setup process that sweeps the network, suggests the SQL Servers that you would want to monitor, and guesses the right time zones for each server.
I would also like a pony. (Not really. Ponies smell bad. Except this pony.)
But you get what I mean – you go to ~~war~~ market with the ~~army~~ installer you have, not the ~~army~~ installer you want. Besides, this is a one-time experience for users – the more important experience was getting the emails that were valuable and actionable.
The emails – that’s where the value comes in. Right now, Lambda functions analyze the data and send us emails that look like this:
They’re the same emails that you’ll eventually get directly – but for now, we’re watching them ourselves, making sure the functions are returning the right recommendations, and then manually sending guided recommendations directly to the customers.
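The Lambda step described above might look roughly like this – the event shape, rule logic, and addresses are my assumptions, and the real system would send via SES rather than print:

```python
# Hypothetical sketch of the Lambda analysis step: analyze one
# server's uploaded diagnostics and report the recommendations.
# Event shape and rule logic are illustrative only.
import json

def run_rules(diagnostics):
    """Stand-in for the real rules engine."""
    advice = []
    if diagnostics.get("days_since_full_backup", 0) > 7:
        advice.append("You have databases without recent full backups.")
    return advice

def handler(event, context=None):
    """Lambda entry point."""
    diagnostics = json.loads(event["body"])
    advice = run_rules(diagnostics)
    if advice:
        # In production this would be an SES send_email call; during
        # the MVP, the mail goes to the team for review, not customers.
        print(f"To: team reviewing {diagnostics['server']}")
        print("\n".join(advice))
    return {"server": diagnostics["server"],
            "recommendations": len(advice)}

event = {"body": json.dumps({"server": "SQL01",
                             "days_since_full_backup": 30})}
print(handler(event))
```

Routing the output to humans first, then flipping it to customers later, is just a one-line change in a design like this – which is what makes the manual-review phase cheap.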
Then for some customers with special situations, or where the automated emails aren’t quite the right fit, we send manual emails like this (sanitized customer email sent to myself):
Is that a lot of work? Sure it is – early adopters are basically getting a screamingly cheap deal on our personalized attention via email. If you’re going to automate something big, though, you’ve gotta start by making sure you can do it manually first.
How the MVP is scaling so far
The line is the number of servers we’re analyzing (from customers willing to share their numbers – more on that in the upcoming GDPR post), and the bars are the terabytes of data on those servers. The jump yesterday was the first day of the marketing launch:
Yes, the terabytes of data have gone down more than once as folks have suddenly realized, “Whoa, someone restored a bunch of databases onto a production server, and we forgot to get rid of them. They’ve just been making our maintenance windows take longer every night.” My favorites are when we’ve found AdventureWorks on production servers.
Those server numbers may not look big – we could have easily processed 156 servers’ worth of diagnostic data with a conventional Windows app running in a VM – but check out these hosting costs:
We spent ~$650 last month on hosting, and this month is projected to come in around $1k. That includes development environments, by the way. This is inexpensive enough that I could afford to absorb it for even just a handful of customers if it hadn’t caught on.
As more customers start to diagnose more servers, serverless really pays off. When new customers join, they seem to wait a day or two to set up collection, then they set up collection for just 1-2 servers, see the results, and then suddenly go, “Whoa, I should add a few other servers in too.” We added 49 new customers yesterday, so I wouldn’t be surprised if we were monitoring a total of around 300-500 servers by the end of the week.
In preparation for that, Richie’s been putting a lot of work into tuning data ingestion and processing the rules quickly, staying ahead of the Lambda function timeouts and the growing data set.
As we go, we’re learning stuff, tweaking the system, and figuring out what things users can fix on their own and which ones they need more help with. It’s even driving the blog posts I write – for example, when I see a problem on several servers in a row, and I have the same discussion with customers about it a few times, that means I need to write a post to link folks to. The results are posts like “But I don’t need to back up that database” and “Why Multiple Plans for One Query are Bad.”
What’s on the roadmap next
Now that we’ve got the MVP out, we’ve been working on:
- Quality checking on the automated emails – we’re pretty close to the point where we’ll start letting the automated emails go straight out; then, once users have fixed the first round of problems (like no backups and obvious server misconfigurations), the human intervention will kick in. Right now, with ~50 active paying customers and ~150 SQL Servers, it’s easy enough that I can still keep an eye on this manually, but we’ll be switching over soon.
- Back-patting rules – tying together a recommendation we made, a customer’s successful application of that recommendation, and the difference it made in health or performance metrics.
- Not breaking the build – seriously, I have a nearly 100% failure rate on my commits. Richie has to be gritting his teeth by now every time he sees one of my pull requests.
- Wait stats trending – right now I’m manually trending wait stats, then emailing customers an analysis. We’ll need something more scalable as we go to 1,000 servers and beyond. When you build an MVP, though, you gotta do things that don’t scale.