When the business decides it’s time to give up fixing the old server because it’s not fast enough or reliable enough, stop. The worst thing you can do is jump directly to the hardware vendor’s site, check a few boxes, and order a new shiny server that ends up with exactly the same problems as the last server.
1. Get the business’s RPO and RTO goals in writing. These two numbers determine whether you build a cluster, an AlwaysOn Availability Group, a log shipping backup, or a pair of tin cans connected with string. Learn what RPO and RTO are here.
2. Find the right SQL Server feature for your RPO/RTO goals. For example, if you’re not allowed to lose any data, and you need to fail over in less than 1 minute, then your options include failover clustered instances, AlwaysOn AGs with synchronous mirroring, or synchronous database mirroring. Use Page 2 of the HA/DR worksheet linked in the last post to figure that part out, or watch the HA/DR architecture modules of the Senior DBA Class.
3. Review the current server’s wait stats. To help figure out what kind of hardware you need for tomorrow, look at the pains you’re experiencing today. For example, if you’re suffering from PAGEIOLATCH% waits because you can’t cache enough data in RAM, the next server may need significantly more memory (or faster storage, or time spent tuning indexes & queries.) If you’re suffering from LCK% waits because the apps are holding locks on their side, though, you may not see relief no matter what kind of hardware you throw at it.
4. Benchmark your current hardware’s speed. I’ve written about how to check performance on a new SQL Server, but this is also really useful when you’re transitioning away from an old server. For example, take your backup & CHECKDB job runtimes, and the new server should be able to perform at least that quickly (if not more so, depending on what your bottleneck was in the above step, and which kind of bottleneck you’re trying to fix with the new server.)
5. Get the server’s proposed expiration date. Servers are like milk cartons: both need an expiration date printed clearly on the outside. Ask the business users, “How long does this server need to last?” They usually say, “Forever,” and start laughing, but once the laughing stops, here’s how to handle the discussion.
6. Ask the developers, “What features are we adding during that time?” I don’t expect them to have specific line-by-line answers about which features they’re adding over the next 3 years, but I do expect them to have rough architecture ideas. Are we going to add some kind of big data analysis projects? Machine learning in the database? R, Python, or Java running inside it? Storing new kinds of data, like tracking every web site click?
7. Ask the business, “How much will load grow during that time?” Are we expecting to acquire our next several larger competitors, or maybe conquer sales in a new country? Or are we just expecting to tread water and only deal with incremental growth?
For all of the above questions, the more specific the answers are, the more precise your server sizing can be.
In reality, they’re very rarely specific.
And that’s fine!
But it just means that your server design needs to be more flexible: we need to design something that we expect will change rapidly over time as we learn more about the answers above. That’s one of the reasons why I really love building new environments as Availability Groups up in the cloud: it’s so much easier to add in new, more powerful replicas, join them to the cluster, and then get rid of the old ones. Your server design can leverage the flexibility of the cloud to handle new features or workloads, and you can throw cloud at performance issues.