Ugly month of outages

The past month has been a real challenge, filled with 70-80 hour weeks spent clawing back from one disaster after another.

In late June, IBM recommended that we upgrade the firmware on our DS4800 SAN controllers to fix some problems. We obliged, and two days later, our Exchange cluster rebooted without warning. We’d been having some other problems with those servers anyway, so we didn’t connect the dots. A few days later, the reboot happened again, and then again.

Meanwhile, in DBAville, my brand new SQL cluster started failing over from the active node to the passive node for no apparent reason. The servers didn’t reboot; SQL simply failed over from one node to the other, and the symptoms looked different from what the Exchange cluster was showing. I worked the issue on my own, and just as I started tracing it back to disk problems…

One of our VMware LUNs got corrupted, taking down 19 servers at once. We pulled an all-weekender to rebuild and restore all of the servers before the next business day. We started working that issue with VMware, and they came up with a good action plan. We were midway through implementing it when it happened again, albeit with slightly different symptoms. There went another weekend rebuilding and restoring servers.

By this time, we had a pretty good idea that all three problems were related to clustered stuff on the DS4800 SAN controller, but it was too late. I built a new standalone SQL server (without clustering) and hooked it up to a different SAN, the DS4500. It was immediately stable, and I breathed a sigh of relief because I had time to troubleshoot the DS4800 problem.

We brought in IBM, and they recommended that we upgrade the firmware on our IBM SAN switches and another SAN – the DS4500 where I’d just moved my SQL stuff. The switch firmware upgrade went fine, but the DS4500 firmware upgrade went straight to hell – the SAN was unusable. That took down dozens of servers and brought IBM engineers into our office on a weekend to figure out how to fix it. In the mess, I lost my SQL box. Thankfully, it was a quick rebuild and I had good backups, but the SAN was down for a day, and we lost a lot of time troubleshooting with IBM and LSI.

Thursday, LSI engineers showed up due to a lucky coincidence. About a month ago, when our problems started, IBM had the foresight to schedule a health check. It almost turned into an autopsy. LSI’s staff fixed a lot of the problems quickly, and they have a pretty good answer on the Exchange reboots. They’re recommending a software tweak, and IBM is replacing the 4800 controller cards. If those two things don’t work, IBM is giving us another DS4800 and new servers so we can build a new Exchange cluster on a new SAN from scratch and see if that fixes the problem.

Unfortunately, that means my weekends at the office haven’t ended yet. This weekend we’re doing the controller cards and the software changes.

Today, I’m taking my first full day off in weeks. I already felt much better last night, just hanging out and eating dinner with Erika and Ernie.


9 Comments

  • Your DS4800 woes sound suspiciously like the problems we have been having over the past two weeks on our DS4800 with Solaris 10 servers. What version of the DS4800 firmware are you running? We are at the latest: v06.23.05.00.

    Just today (8/21), the LSI support engineer I’ve been working with admitted that they had identified a problem with this firmware rev that results in serious problems (controller resets, logical disks going offline, etc.) under heavy I/O load.

    Matt

  • Yep, our problems began 2 days after we upgraded to v6.23.05.00! I’d love to hear more about the firmware rev problems. I’ll be pinging our LSI contact, Veronika, about this one.

  • Our LSI support engineer said he’d let me know something by noon ET on Wed, 8/22. I’ll let you know.

    Under heavy I/O load, we’ll have logical disks (LUNs) go offline, which causes the systems to panic. It’s been very frustrating, but I’m optimistic that we are finally making some headway.

  • I heard from our LSI support engineer that they are working with IBM. No further details yet.

  • I’m curious if you ever got resolution with IBM. Our 4800 is having this exact problem, and we’re getting nowhere with IBM.

  • As a matter of fact, yeah. Matt C and I kept in touch, and it turned out that there was indeed a firmware problem in the 4800. They put out a new revision of the firmware just for that problem, and it appeared to fix the issue. Get on the latest version of the 4800 firmware, and that should correct it.

    Of course, in order to get there, you have to flash all the drives and drawers, which means no I/O for a couple/few hours, so it’s a pain in the butt. We haven’t had an issue since, though.

  • No wonder critical shops need to plan and implement a DR solution. I can’t imagine being down for hours. Scary stuff you mention there.

    • Heh – yeah, we had awesome DR solutions, and for the critical servers we were able to fail over to our DR datacenter. Not all applications are mission-critical, though, and just because the business won’t implement a DR plan on those doesn’t mean you can simply abandon them. You still have to rebuild them, plus troubleshoot the original issue simultaneously.

      This particular company had the best DR plans I’d ever seen, complete with multiple role swaps to the DR facility each year, running production out of the DR datacenter for a week at a time. Pretty thorough.

      • Had a lot of problems with IBM storage and controllers over the years. If it were my decision, I would never have stored anything more important than last week’s grocery shopping list on them. It’s just not worth it.
