If you’re looking for SQL Server news, just skip this post. I’m going to talk about general business & capacity planning type stuff here today.
Our SQL ConstantCare® customers install an app that polls diagnostic data from their SQL Servers and Azure SQL Databases, upload it to us daily, and then automatically get advice on what they should do next.
We use Amazon Aurora PostgreSQL as the back end, and I wrote about that design decision last year. I followed up later with a discussion of our hosting costs. Throughout 2018 and most of 2019, the production servers were:
- Two r4.xlarges in a cluster (for high availability)
- 2 physical cores, 4 with hyperthreading on
- 30GB RAM
That served us really well, except for the times when Richie needed to vacuum out deleted rows. PostgreSQL keeps deleted rows inside the user database (just like SQL Server 2019’s new Accelerated Database Recovery), and the automatic vacuuming processes can’t always keep up, especially with app designs like ours. We only keep a rolling 30 days of data, and we roll through billions of rows per month in some of the tables, so there’s a lot of deleting going on.
When Richie needs to do big vacuums manually, he spins up the biggest RDS instance available: the r5.24xlarge with 96 cores, 768GB RAM, and 14,000 Mbps storage throughput. (One of the slick advantages of Aurora’s architecture is that since the compute and storage are completely separate, you can amp up to insane performance quickly without waiting for storage to catch up.) At only $14 per hour, that’s worth it to make maintenance tasks fly temporarily.
However, our workloads are growing.
We’ve been adding new features that gather more data and run more queries against it, like our index recommendations and multi-server emails. I found it incredibly hilarious that our most resource-intensive Postgres query by far was the query that analyzed your indexes and told you what to add. And you know what that query needed? Indexes. Of course.
By October, our reserved instances had expired. With Amazon, you can reserve your database server capacity for 1-3 years, even prepay in advance if you want, and get deeper discounts. In the hosting cost discussion post, I talked about how I’d decided to reserve a pair of db.r4.xlarge instances for production for 1 year – but that reservation was up. I didn’t want to make a decision back then, though, because…
We’ve been adding users and servers: the population had been fairly stable through 2018/2019 because I didn’t promote SQL ConstantCare® too hard, and like any SaaS product, we had a churn of subscribers. New people would sign up, and older users would let their subscription lapse once they’d taken care of the basic issues on their servers. However, with the annual Black Friday sale, I knew a big population jump would come – I just didn’t know if it would be big or fast.
So we decided to wait out through November to see how the sales & installations went. Now, in early December, we’re approaching 500 users sending in data on almost 4,000 servers. I expect those numbers to grow over the coming year as more Black Friday sale buyers start to leverage their newly purchased goodies. Over 1,000 folks have active subscriptions that include SQL ConstantCare as a benefit, and we’ve seen that folks tend to monitor just a server or two at first, then gradually add more of their servers over time as they see the value in the service. I wouldn’t be surprised if we were analyzing over 10,000 servers by this time next year.
So what should we use next?
Richie and I talked through our options:
Switch to Azure SQL DB or Azure PostgreSQL: I’m only mentioning this because I know you’re gonna suggest it in the comments, but no, this didn’t make any financial sense. Changing back ends or cloud providers would have been an expense with no appreciable benefit to end users.
Switch to Aurora Serverless: it’s still PostgreSQL-compatible, so no code changes required, but the Serverless version can automatically scale up and down very quickly. This would be really cool for a business like ours where the database server is idle for hours at a time, then gets hammered when lots of users send in data on a timer:
That graph is from our new upsized primary – it used to be banging up against 100% for hours at a time during peak loads – but you get the idea. Serverless seems like a good idea here, but the Aurora Serverless limitations had just enough gotchas that we couldn’t quite migrate production to it yet. (Once a couple of those limitations are gone, though, we might be able to cost-effectively deploy SQL ConstantCare® in the EU. Fingers crossed.)
Stay with on-demand (not reserved) instances: this didn’t make sense because the savings are so large on reserved instances. You can easily save 35%-45% if you’re willing to lock in rates for a while:
Buy reservations for larger instances: instead of the r4.xlarge (4 cores, 30GB RAM), we could commit to larger instance sizes:
- r5.xlarge: 4 cores, 32GB RAM = $5,540/year for 2 reserved instances (current state)
- r5.2xlarge: 8 cores, 64GB RAM = $11,082/year
- r5.4xlarge: 16 cores, 128GB RAM = $22,162/year
- r5.12xlarge: 48 cores, 384GB RAM = $66,486/year
Since there’s nothing in between the 4xlarge and 12xlarge, that made my decision process pretty easy: we went for a pair of 4xlarge reservations for the next year. It lets Richie focus on building features rather than troubleshooting Postgres performance problems that neither of us are particularly good at, and it also let me add more queries into my Power BI dashboard that I use for analyzing client data quickly. That even drove improvements to the way I do my consulting process, and I’ll be updating that product early next year.
The database is still by far, far and away the most expensive thing in our entire infrastructure. It’s not even close: serverless apps are just so, so affordable to host compared to fixed assets like database servers. I love seeing the same problem that my clients see: the decision about when it makes more sense to hire a performance tuner versus just throw hardware at it. SQL ConstantCare®’s asynchronous queue-based serverless architecture definitely helps keep database costs low, but there’s going to come a time in the next couple/few years when we’ll have to bring in a serious Postgres performance tuner to help push our bill back down. (And no, I’m not gonna be the person who does it!)