Oh sure, our free First Responder Kit has a SQL Server setup checklist that walks you through a few things you need to do before you install, plus a few things to do immediately afterwards. That’s good. It’s actually pretty darned good.
But it’s not the secret checklist.
Gather round and cover your monitor for a minute because I’m about to share the good stuff.
1. Hardware and Windows prep. Apply the latest bios and firmware to the hardware. Install Windows, then apply all patches. Install and configure your server vendor’s hardware monitoring, and the monitoring tools your sysadmins use. For clusters, run the Validation Wizard, make sure all tests pass, and save a copy of the validation report.
2. Smoke test it. Run CrystalDiskMark against all of the storage (local and SAN) and make sure it meets your expectations. (See my Building the Fastest Servers session for more details.) If it’s a physical server, test it by removing and/or disabling the network, storage, and power cables individually, and make sure that all of the server’s connections are fully redundant. If it’s a virtual server, do those same things to the host. Yes, your VM admins are going to be concerned – but honestly, they haven’t tested any of this stuff in the last year either. They’re just assuming it’s going to work, and you know what they say about assuming: you make an ass out of you and Uma Thurman.
3. Install SQL Server and the latest CUs. Here’s where our setup checklist comes in handy, but you’re not done. Restore your existing databases and time how long that takes. It’s your one chance to know how long a production database restore will take on the new hardware, and that helps you find out if you can meet your RPO/RTO goals. Configure your maintenance jobs like with Ola Hallengren’s scripts, and time how long they take. Try performance tuning the backups by striping them across multiple files – often splitting the backups across 4 files will make them complete 4x faster. Try performance tuning the index rebuilds by doing sorts in TempDB and see if that helps.
4. Smoke test SQL Server. Set up your high availability and disaster recovery features like log shipping or AlwaysOn Availability Groups, and fail the databases over between production and DR. Test a planned failover and document all of the steps involved along with how much time they take. Include copious screen shots. Your goal is to make a planned failover easy enough that any of your company’s admins can do it without knowing SQL Server. (Hey, the more you document, the more time you can take off for vacations.) Then try the same with an unplanned failover – yank the power cables out of the production servers (or hosts) and do a failover. This is usually a more manual process with more steps.
5. White-board out the failure possibilities. Diagram out all of the moving parts in the architecture – shared drives, cluster IPs, service accounts – and think about what would happen if each part fails. What would failure symptoms look like? What troubleshooting steps would help you discover that this was the real root cause? If you’re not sure how you’d recognize a failure of that component, now is your chance to force the failure and plan out your troubleshooting fails.
Only then do you start to go live on the new infrastructure.
This sounds like a lot of work, and it is. I’m not saying this as a consultant who wants to bill you a bunch of money – this isn’t usually the kind of work that I tackle myself. Instead, I work with you to build a list like this, and then you go off and execute the plan. You’re already comfortable with each of these tasks in theory – you just need to do them hands-on. Don’t just assume that the hardware will react a certain way when there’s a failure – go test it, find out for sure, and when disaster strikes, you’ll react with calm confidence.