Last weekend I had a very unpleasant situation: a server reboot made an entire machine unusable. It was a very difficult issue that cost us an unplanned outage. In fact, we had a whole series of unfortunate events that day. I aged a few years, but I also learned that there is always a solution, even in a seemingly hopeless situation. Read the entire story to avoid a similar problem.

The Spectre/Meltdown issue is a sort of great marketing story. Suddenly everyone is asking whether their systems have this vulnerability. Because of that, multiple old servers need to be replaced and all the others will get the necessary patches. For this reason I had to plan and perform a major firmware upgrade on one of our small systems, an S814. The server firmware update wasn’t concurrent, so a lot of activities had to be coordinated, but the server outage was planned and required a cold restart.
I can bet that many companies do not perform firmware upgrades, using the excuse ‘we can’t afford a server outage’. Recent security issues (WannaCry) have shown that the results of such a strategy can be miserable for the entire company. That’s why I’m always a big supporter of keeping things up to date.
So, how could it happen that powering the server off and on isolated the machine?
For historic reasons, my S814 uses internal SAS controllers in a split backplane configuration. In other words, there are two separate disk controllers delivering internal disks to the VIOSes. The VIOSes boot from internal mirrored disks, and all hosts above them run on fully virtualized I/O (NPIV, SEA). Thus, losing the VIOSes results in an unusable server, and that is exactly what happened.
The problem was not caused by the new server firmware or a wrong update procedure; it was caused by a bug in the SAS adapter firmware which may (or may not) trigger during a cold power off/on of the server. It happened on both adapters at the same time.
Moreover, the adapters were completely “bricked”; it was impossible to do anything with them. The only solution is to replace them.
I have the 57DC – PCIe3 x8 cache SAS RAID internal adapter 6 Gb, but the same problem affects 57D7, 57D8, 2CCA, 2CCD, and 2CD2. In the microcode history there is a very innocent sentence that comes with release 15511800: “Possible “false” Over Current detection which may cause an adapter failure on a cold (power off) boot. This may result in an adapter resource status of failed or unknown and/or system may log SRC BA180020, SRC B200308C, SRC B2004158.”
It is marked as HIPER, but I would never have expected that such a thing could completely isolate a server, with no possibility of recovery. In other words, even a multi-million, super-redundant configuration may potentially be impacted if this bug hasn’t been patched in time.
What happened and how was it recognized? After the server IPL, slots C14 and C15, where the internal controllers are located, were reported in the HMC as Unknown. It wasn’t possible to check even the adapter type, nothing. Also, there was not a single error generated by the server, nothing in the HMC Serviceable Events. The only error was reported in ASMI, with ID SRC BA180020.
As I wrote at the beginning of this post, it was a series of unfortunate events, so the story continues. After 3-4 hours on the phone with IBM technical support, and multiple hopeless IPLs with and without hardware discovery, I was told that the problem had been handed over to an IBM CE who would contact me and replace the affected adapters. What a surprise when he contacted me and said that these particular adapters were not available locally, and that it would take another 9 hours to get them on site. That sounded like a real disaster while the production systems were down.
I was thinking about possible solutions and came up with the idea of connecting and zoning external storage. We already run all LPARs from external storage, so the infrastructure was there. I asked my SAN colleagues to provision a single LUN to each VIOS and we started the VIOS restore procedure from the NIM server. After 2-3 hours I had all my VIOS configurations restored, running from the external storage, which allowed us to resume the production LPARs.
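For those who have never had to do this, here is a minimal sketch of a NIM-based VIOS restore onto a freshly zoned LUN. The resource names, paths and client name below are placeholders, not the exact commands from that night, and the last step is still a manual network boot of the VIOS LPAR from SMS:

  # On the NIM master – vios1_mksysb, vios1_spot and vios1 are example object names:
  nim -o define -t mksysb -a server=master -a location=/export/nim/mksysb/vios1.mksysb vios1_mksysb
  nim -o define -t spot -a server=master -a location=/export/nim/spot -a source=vios1_mksysb vios1_spot
  # Prepare the installation; boot_client=no means NIM will not try to reboot the client itself:
  nim -o bos_inst -a source=mksysb -a mksysb=vios1_mksysb -a spot=vios1_spot -a accept_licenses=yes -a boot_client=no vios1
  # Then boot the VIOS LPAR from the HMC into SMS, network-boot it from the NIM master,
  # and point the installer at the new external LUN as the target disk.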
The IBM CE called back after 11 hours to say that he finally had the replacement cards, but by that time we had already resumed production, and the SAS adapter replacement requires a frame shutdown, which we can’t allow any time soon.
I think I will stay with the VIOSes running from external storage forever.
If you are still reading my post:
- go to your VIOS and run lsmcode -A from the root shell. If the SAS microcode is older than 15511800, plan the upgrade today (see the example after this list).
- do regular VIOS backups (you never know when you will need them)
- if you still do not have a NIM server, get one. It is the fastest VIOS recovery method.
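A minimal sketch of those checks and backups, assuming the padmin shell and an NFS mount for the backup files; the adapter name, paths and file names are only examples and may differ on your system:

  # Check the SAS adapter microcode level from the root shell:
  oem_setup_env
  lsmcode -A | grep sissas        # internal SAS adapters typically show up as sissas devices
  exit

  # Regular backups from the padmin shell:
  backupios -file /mnt/nim/vios1.mksysb -mksysb    # bootable mksysb image, usable for a NIM restore
  viosbr -backup -file vios1_config                # virtual device configuration (SEA, NPIV mappings)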
You can see that many things went wrong that day. For the first time in my professional career, I felt like throwing my PC out of the window and just running away… 🙂
This is why I always advise people to be their own parts depot — you can’t trust your business to anyone else’s support contract. If you have to keep running, YOU have to have spare hardware. Got ten identical machines? Buy #11. Got one? Yes, buy two.
Had this exact same problem happen on the same HW BEFORE the new firmware was available… a spare box got us back up and running.
It’s not just about rapid repair, either. It’s also about “What happens if…?” or “How do we…?”. When you have spare hardware, you have the ability to test things on non-live systems.