The past month has been a real challenge, filled with 70- to 80-hour weeks and clawing back from one disaster after another.
In late June, IBM recommended that we upgrade the firmware on our DS4800 SAN controllers to fix some problems. We obliged, and two days later, our Exchange cluster rebooted without warning. We’d been having some other problems with those servers anyway, so we didn’t connect the dots. A few days later, the reboot happened again, and then again.
Meanwhile, in DBAville, my brand new SQL cluster started failing over from the active node to the passive node for no apparent reason. The servers didn’t reboot; they simply failed SQL over from one node to the other, so it appeared that my cluster had different symptoms than the Exchange cluster. I worked the issue on my own, and just as I started tracing it back to disk problems…
One of our VMware LUNs got corrupted, taking down 19 servers at once. We pulled an all-weekender to rebuild and restore all of the servers before the next business day. We started working that issue with VMware, and they came up with a good action plan. We were midway through implementing it when it happened again, albeit with slightly different symptoms. There went another weekend rebuilding and restoring servers.
By this time, we had a pretty good idea that all three problems were related to clustered stuff on the DS4800 SAN controller, but it was too late. I built a new standalone SQL server (without clustering) and hooked it up to a different SAN, the DS4500. It was immediately stable, and I breathed a sigh of relief because I had time to troubleshoot the DS4800 problem.
We brought in IBM, and they recommended that we upgrade the firmware on our IBM SAN switches and another SAN – the DS4500 where I’d just moved my SQL stuff. The switch firmware upgrade went fine, but the DS4500 firmware upgrade went straight to hell – the SAN was unusable. That took down dozens of servers and brought IBM engineers into our office on a weekend to figure out how to fix it. In the mess, I lost my SQL box. Thankfully, it was a quick rebuild and I had good backups, but the SAN was unusable for a day, and we lost a lot of time troubleshooting with IBM and LSI.
Thursday, LSI engineers showed up due to a lucky coincidence. About a month ago, when our problems started, IBM had the foresight to schedule a health check. It almost turned into an autopsy. LSI’s staff fixed a lot of the problems quickly, and they have a pretty good answer on the Exchange reboots. They’re recommending a software tweak, and IBM is replacing the DS4800 controller cards. If these two things don’t work, IBM is giving us another DS4800 and new servers so we can build a new Exchange cluster on a new SAN from scratch and see if that fixes the problem.
Unfortunately, that means my weekends at the office haven’t ended yet. This weekend we’re doing the controller cards and the software changes.
Today, I’m taking my first full day off in weeks. I already felt much better last night, just hanging out and eating dinner with Erika and Ernie.