Our new SQL ConstantCare® is an online service that analyzes your SQL Server’s diagnostic data, then gives you personalized advice on what you should do to make your database app faster and more reliable. I blogged about the vision last week, and today let’s talk about the things that drove our architecture decisions way back in 2015.
I’ve worked for software vendors who had to deploy and support applications running on customer desktops. It’s an absolute nightmare – you would not believe the crazy things people try to do with your app. So for SQL ConstantCare®, I wanted to run as little as possible on your machines. The plan:
1. We’d collect data from your SQL Server. (This was gonna require a local app.)
2. We’d analyze the data and build a list of tasks for you to do.
3. We’d tell you what those tasks were.
4. We’d monitor your progress on accomplishing those tasks, and the difference it made.
Steps 2-4 weren’t synchronous – they were going to take time. Some problems would be completely obvious (“hey, you have slow file growths due to repeated file shrinks, turn off this specific maintenance job step”) but others might get flagged for human intervention. As we improved the code for steps 2-4, more and more scenarios would be handled automatically – and this is the part of the software we expected to iterate on rapidly. We didn’t wanna deploy this kind of logic on-premises every time we got smarter. Steps 2-4 were gonna be in the cloud.
The cloud also made sense for performance and scale requirements. I didn’t expect a large number of people to enroll right away, but I wanted to plan for the future – especially a future where I could drive the per-user and per-server costs lower. I wanted the processing cost to basically drop to free, thereby enabling me to do fun stuff like a “Free Server Friday” where anyone could send in data.
In a future like that, what kind of loads might I have to deal with? Think users times the number of servers:
- Our mailing list has ~100K people, but let’s say just 1% of them enrolled.
- For server count, in our annual salary survey, the median number of servers managed is 20, but let’s say they send in data about 5 of those servers.
- That means we’d be processing diagnostic data for 1,000 users × 5 servers = 5,000 servers at a time (and if we did freemium or free days, possibly much more)
Scale-out, queue-based processing was a requirement. The scale-out portion might be able to wait until v2, but queue-based was an absolute must.
This seemed like the perfect app for serverless architecture design.
Wait, what’s serverless architecture?
Serverless doesn’t mean there are no servers.
Serverless just means the servers aren’t your problem.
Serverless, also known as function-as-a-service, means that you write small units of code (functions) that get triggered whenever events happen (like when a file lands in a folder). Mike Roberts’ Serverless Architectures article is a good place to start learning more.
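To make that concrete, here’s a minimal sketch of what one of those small units of code looks like, assuming AWS Lambda’s Python handler convention (a function taking an event and a context). The bucket, key, and fake event below are all made up for illustration – they’re not our actual code:

```python
def handler(event, context):
    # AWS invokes this function whenever a matching event fires --
    # here, an S3 "object created" notification. The event carries
    # one or more records describing which files just landed.
    records = event.get("Records", [])
    processed = []
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # A real function would download and parse the file here.
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}

# You can exercise the handler locally with a hand-built event:
fake_event = {
    "Records": [
        {"s3": {"bucket": {"name": "diagnostics"},
                "object": {"key": "server01/2018-05-01.json"}}}
    ]
}
print(handler(fake_event, None))
```

That’s the whole deployment unit – no web server, no service process, no machine to patch. You hand AWS the function, and AWS decides where and when to run it.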
We wouldn’t have to build, troubleshoot, and patch our own servers, nor would we have to worry about performance capacity. As we got more incoming data, AWS would just run our functions on more servers. You pay only for the time your code actually runs. Even if that cost was relatively expensive, it still called to me as a tiny business owner because I flat out couldn’t afford to replicate the support and sysadmin infrastructure that a conventional software-as-a-service would require. If I took a risk on serverless, and spent more up front on development, I might be able to build something that scaled more easily later.
And serverless isn’t expensive, by the way – your first million AWS Lambda requests per month are free, and $0.20 per million requests thereafter. Pricing gets a little tricky as you configure the memory your function needs, but even still, at the kinds of scale we’re talking about, it’s way, way cheaper than buying a single server, let alone hiring a sysadmin.
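Here’s the back-of-the-envelope math at the scale from earlier. The request price (first million free, $0.20 per million after) comes from AWS’s published Lambda pricing; the invocations-per-upload figure is my illustrative assumption, and this ignores the separate per-GB-second compute charge, which also has a sizable free tier:

```python
# Rough monthly Lambda request cost for the 5,000-server estimate.
FREE_REQUESTS = 1_000_000        # free tier: first million requests/month
PRICE_PER_MILLION = 0.20         # dollars per million requests after that

servers = 5_000
invocations_per_upload = 5       # assumption: a few chained functions per daily upload
requests_per_month = servers * 30 * invocations_per_upload

billable = max(0, requests_per_month - FREE_REQUESTS)
request_cost = billable / 1_000_000 * PRICE_PER_MILLION
print(requests_per_month, request_cost)
```

Run the numbers and 5,000 servers sending daily data lands at 750,000 requests a month – still inside the free tier. Even if you assume ten times the invocations, the request bill is pocket change.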
When designing a serverless app, you think about tiny services and queues. In SQL ConstantCare®, that means:
- The client runs the collector app, which pushes data into a cloud file service (Amazon S3). At this point, the synchronous client-facing work is done, and the rest happens asynchronously via functions and queues.
- As files arrive, S3 automatically adds related records to a queue for processing.
- Functions launch for each file, importing it into a database.
- As each function performs operations on the incoming files, like adding them to a database or checking the data against business logic rules, it adds records to the next queue.
- Eventually, a queue entry triggers a function to send an email with the list of tasks for the client.
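The chain above can be sketched as a local simulation – Python’s `queue.Queue` standing in for the cloud queues, dicts standing in for S3 events, and every name and finding invented for illustration. The point is the shape: each stage is a tiny function that consumes one message and emits the next.

```python
import queue

# Each queue decouples one stage from the next, so stages can fail,
# retry, or fall behind independently.
import_queue = queue.Queue()   # fed by "a file arrived" events
rules_queue = queue.Ueue() if False else queue.Queue()  # imported data awaiting rule checks
email_queue = queue.Queue()    # finished task lists awaiting delivery

def on_file_arrived(key):
    # Stage 1: a file lands in storage, which enqueues it for import.
    import_queue.put({"key": key})

def import_file(msg):
    # Stage 2: load the file into the database, then hand off.
    rules_queue.put({"key": msg["key"], "imported": True})

def check_rules(msg):
    # Stage 3: run business-logic rules, producing a task list.
    tasks = ["turn off the file-shrink job step"]  # illustrative finding
    email_queue.put({"key": msg["key"], "tasks": tasks})

def send_email(msg):
    # Stage 4: email the client their task list.
    return f"email for {msg['key']}: {len(msg['tasks'])} task(s)"

# Drive the pipeline for one incoming file:
on_file_arrived("server01/2018-05-01.json")
import_file(import_queue.get())
check_rules(rules_queue.get())
print(send_email(email_queue.get()))
```

In the real thing, the “drive the pipeline” part doesn’t exist – the cloud platform invokes each stage as messages arrive, which is exactly why a backlog of files just means a delay, not an outage.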
If we got overwhelmed with files, fine – they’d chug along asynchronously. If a function broke, fine – we’d troubleshoot it asynchronously without clients seeing errors. Granted, some emails might take longer to process than others – but who cares? This is mentoring, not real-time monitoring.
The combination of what I wanted to build, plus the brand-new serverless thing, just seemed like the absolute perfect match.
But serverless was one hell of a risk.
The process and architecture diagrams were chock-full of icons for brand-new AWS products. To give you some perspective on when the decision was made, and how risky it was:
- 2014-Nov – AWS Lambda announced
- 2015-Nov – I swill the serverless Kool-Aid
- 2016-Feb – Google Cloud Functions goes into private early release
- 2016-March – Richie starts working on PasteThePlan, our first serverless project – using AWS because it was the only game in town that we could access
- 2016-March – Microsoft Azure Functions goes into private early release
- 2016-Aug – Mike Roberts’ fantastic introduction to serverless appears
- 2016-Sept – PasteThePlan goes live, Richie starts working on SQL ConstantCare – at the time, we just called it “the service” because we didn’t have a brand yet. Based on the AWS Lambda experience with PasteThePlan, we were hooked, and continued moving forward with that. We did switch to the Serverless framework, though.
- 2016-Nov – Microsoft Azure Functions goes into general availability
- 2017-March – Google Cloud Functions goes into general availability
To put it another way, I decided to go serverless before there was any competition in the market. If AWS Lambda had died, if Google or Microsoft had unveiled something dramatically better, or even if AWS itself had replaced Lambda with something better, I ran the risk of flushing a lot of money down the toilet.
To reduce risk, hire brilliant people and trust them.
When we hired notorious dataveloper Richie Rump (@Jorriss), I said something to the effect of, “Here’s what I want to build, I want us to use serverless to build it, and here’s a list of reasons why. However, you’re going to be in a Wild West of uncertainty because the tools are so new – so if at any point you think we need to switch back to conventional architectures, I’m fine with that. It’s your call.” I knew development would take longer – much, much longer – but because there wasn’t really anything like this in the market, and I didn’t think there would be soon, I could afford to take some time.
Richie built PasteThePlan with a serverless architecture first as a learning experience, and wrote about that here. It worked out extremely well – the hosting costs were dirt cheap, and support was even cheaper. For example, when the Meltdown/Spectre attacks hit, the AWS response was straightforward: they patched everything automatically without customers lifting a finger.
The platform wasn’t the only risk, either: because I could only afford to hire one developer, I had all my eggs in Richie’s basket. If he got hit by the lotto, got a better job, or just got tired of working for me, the project would be set way the hell back. All I could do was keep giving Richie whatever he needed to make the project successful. That, and take him aside from time to time and tell him how good his basket looked.
In 2018, looking back, using serverless for SQL ConstantCare® feels like it was a smart decision. We got lucky – Lambda caught on, AWS kept investing in it, the tooling got better, and no new competitors emerged in our market. Sure, there’s some survivor bias here: we might have been able to go live earlier and cheaper by building a more conventional web app. However, I think as our marketing ramps up over the summer, we’ll be glad we built it to handle bursty demand. (For perspective, we’re only diagnosing 76 servers a day right now – I’m not marketing it hard yet, just letting people find their way in while we learn as adoption grows.)
In upcoming posts, I’ll talk about what data we chose to collect, where we store it, what we’re learning from it, how we bundled and priced the services, and more. If you’ve got questions, feel free to ask away – in these behind-the-scenes posts, I try to share the kinds of business decisions I’d find interesting as a reader. Next up: the product, packaging, and pricing.