Our SQL Server monitoring product, SQL ConstantCare®, uses Amazon Aurora on the back end. I blogged about our choice of database back in 2018, and at the time, I said:
I know, I know. The Microsoft geek in you – and me – wants us to use SQL Server on the back end, but here’s the deal: SQL Server is not self-managing or cheap. (After all, you and I make very good livings keeping SQL Server up and running, right?) Whenever possible in this product, I wanted to use something inexpensive that somebody else managed for us.
Well, like anything else in the cloud, the costs creep up on you slowly over time.
It took me a really long time to notice because most of our Amazon hosting costs are due to the lab VMs for our Mastering training classes. Students get pretty beefy VMs to run their labs, and I run a lot of training classes. I really should have had separate AWS accounts for each of our business uses – these days, the best practice is to isolate your environments across a lot of accounts. Instead, I was using tags in AWS to break things up by budget.
One day, it hit me out of nowhere as I was looking at Cost Explorer in the AWS billing console: the AWS Aurora RDS Postgres costs had gotten pretty expensive over time.
We were spending about $60/day, but we hadn’t upsized our instances. What was going on?
A peek into our bill revealed a nasty surprise:
A few lines down, there’s a pretty doggone large number of IOs: we’d done 8.3 biiiiiiillion I/O requests, ringing in at an extra $1,663.68 that month.
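The back-of-the-napkin math checks out if you assume Aurora’s standard-configuration rate of $0.20 per million I/O requests – that rate is my assumption here, and it varies by region and storage configuration, so check your own pricing page:

```sql
-- Hedged sketch: assumes Aurora standard's $0.20 per million I/O
-- requests (rate varies by region and storage configuration).
-- $1,663.68 works backwards to roughly 8.32 billion I/O requests:
SELECT 8318400000 / 1000000.0 * 0.20 AS io_charge_dollars;  -- ≈ 1663.68
```

The scary part is that none of this shows up in instance sizing – you can leave the instances alone and still watch the I/O line item grow.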
We jumped into AWS Performance Insights, a monitoring tool that’s included free with RDS. It gives you a really nice point-and-click GUI to analyze queries by CPU, reads, writes, and more. I expected to find a single ugly query, but lots of them were doing a lot of reads, and it took me an hour or two of digging to realize…
All of the read-intensive queries were calling the same view.
It wasn’t even a large view: it was just a view that fetched the prior collection number & date for a given client’s server. If we were analyzing your metrics, we needed to join to your prior collection so we could see what changed. The tables in the view had fewer than 100K rows – it was just getting a pretty bad execution plan. Well, bad relative to how little data the query actually touched.
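If you’d rather do that digging without the GUI, Postgres itself will rat out the read-heavy queries – a rough sketch, assuming the pg_stat_statements extension is installed (and using the Postgres 13+ column names; earlier versions call it total_time):

```sql
-- Hedged sketch: requires the pg_stat_statements extension.
-- shared_blks_read counts blocks read from storage rather than
-- the buffer cache, which is what drives Aurora's I/O charges.
SELECT query,
       calls,
       shared_blks_read,
       total_exec_time
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 10;
```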
Each time we hit the view, we were incurring a few hundred milliseconds of time, which doesn’t sound like much – but we called that view a lot.
We debated tuning the view and its indexes, but Richie leapt into action and turned it into a table instead. After all, when we process a client’s data, we only need to fetch the prior collection ID & date once – it’s not like they’re adding additional past collections. Time only goes forward.
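The change looked something like this – a minimal sketch with hypothetical table and column names, since the real schema isn’t shown here:

```sql
-- Hypothetical names throughout: the real schema isn't shown in the post.
-- Instead of a view that recomputed the prior collection on every call,
-- persist the answer once. Past collections never change, so the cached
-- rows never go stale.
CREATE TABLE prior_collection AS
SELECT server_id,
       MAX(collection_id) AS prior_collection_id,
       MAX(collected_at)  AS prior_collected_at
FROM collections
GROUP BY server_id;

CREATE INDEX ON prior_collection (server_id);
```

The design win isn’t clever indexing – it’s doing the work once instead of a few hundred milliseconds’ worth of reads on every single call.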
You can see the day when Richie implemented the table:
Our IO costs dropped by fifty bucks a day just with that one change.
Performance tuning in the cloud is about the tipping point.
No, not the tipping point of index seeks vs table scans: the point where increasing ongoing costs mean that you should stop what you’re doing and focus on eliminating some technical debt.
When we’re in the cloud, it’s up to us, the data professionals, to:
- Review our monthly bill from time to time
- Understand what IT operations are generating new costs (like in our case, read IOs)
- Dig into which queries are causing those IT operations (like in our case, an inefficient view)
- Figure out the most cost-effective fix for those operations
- And then maybe most importantly for your career, produce the fix, and take credit for the new lower costs
Good data professionals can pay for themselves in reduced hosting costs. I know because I do it for other folks all the time – I just needed to do it for myself, too. Eagle-eyed readers will note that the dates on these screenshots are February. Yes, this happened a couple months ago – but what prompted me to write the post is that our costs have started creeping back up again, hahaha, so it’s time for me to go through and do another round of checks!