Long story short: how a regular VIOS update ended in a few weeks of investigation, a serious VIOS bug, and an APAR (IJ25390). The latest VIOS releases contain a serious bug that may affect all VIOS/SEA client LPARs on IBM Power servers.
I always try to keep the “n-1” rule for patches, PTFs, and OS updates: do not install the latest version in your own “backyard”; let others do it and discover the bugs. This time I broke the rule and decided to go to the latest release before upgrading the Virtual I/O Server to release 3.1. This cost me a few weeks of investigation, extra work, and maintaining a test lab for IBM.
I assume many of you will need to go through the VIOS 3.1 upgrade in the coming months (I will cover the procedure in the next post). For now, you can read about the trap you may fall into if you run an ‘unlucky configuration’.
There are multiple ways to get to 3.1. My goal was to take the simplest one, “for VIOS dummies”, where IBM delivers some extra commands (for instance, viosupgrade) to get you through the process the easy way. Unfortunately, the “viosupgrade” command is only available from VIOS version 184.108.40.206 upwards. Therefore, I decided to download the very latest one, 220.127.116.11, and move forward.
First, a few pieces of advice and best practice before a Virtual I/O Server upgrade:
- Run regular VIOS backups and know how to restore them in case of trouble. (If you don’t have a NIM server, it might be complicated.)
- Upgrade the VIOS instances on different days (for example, one on Saturday and the other on Sunday). Do not install/upgrade both instances on the same day. If the VIOS release contains bugs and the upgrade was done on both instances, you are screwed. The only solution might be a VIOS restore, which costs a VIOS outage of 30 minutes to 6 hours (depending on your restore scenario). Remember that in the worst case, when virtual resources can’t be provided by any Virtual I/O Server, the production LPARs can’t operate.
- If a Shared Ethernet Adapter (SEA) is used, always perform a manual SEA failover before the upgrade, in order to verify that all TCP traffic can be properly handled by the other VIOS instance. If it can’t, you can very easily revert and investigate why the second VIOS can’t handle the traffic. Skipping this check can be very dangerous when the SEA runs in a Load Sharing configuration, where different VLANs are handled by different instances. The same applies to a simple Active-Backup SEA configuration (it may happen that your network engineer decided to disable a LAN port because it had not carried any traffic for months). The failover can be done with the command chdev -dev entXX -attr ha_mode=standby
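The manual failover test from the last point can be sketched as a small dry-run helper. This is only a sketch under assumptions: the device name ent8 is hypothetical, the function name is mine, and the script only prints the padmin commands instead of running them. The restore value is assumed to be ha_mode=auto for Active-Backup and ha_mode=sharing for Load Sharing, and the entstat line is just one possible way to look at the SEA state.

```shell
#!/bin/sh
# Dry-run sketch: print the padmin commands for a manual SEA failover test.
# Nothing is executed against the VIOS; review the output, then run by hand.
sea_failover_plan() {
    sea="$1"            # SEA device name, e.g. ent8 (hypothetical)
    restore_mode="$2"   # mode to return to afterwards: auto or sharing
    # 1. Force this VIOS to become the backup for the SEA
    echo "chdev -dev $sea -attr ha_mode=standby"
    # 2. Inspect the SEA state (assumed check; verify client traffic too)
    echo "entstat -all $sea | grep -i state"
    # 3. Return the SEA to its normal mode once traffic is verified
    echo "chdev -dev $sea -attr ha_mode=$restore_mode"
}

sea_failover_plan ent8 auto
```

Printing the plan first makes it easy to paste the commands one by one and to stop after step 1 if the other VIOS does not take over the traffic.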
The issue was detected on IBM i LPARs, but according to the comments all types of LPARs (any OS) are affected. In my scenario all IBM i LPARs stopped responding to commands. The behavior was very unusual: ping from a workstation worked and it was possible to open a new 5250 session, but submitting any IBM i command froze that session. In other words, I executed WRKACTJOB from session A and the session froze, while it was still possible to open a new session B, C, and so on. However, any command executed there resulted in a frozen job as well.
According to IBM, this issue affects SEA traffic with VLAN tagging running on POWER8 or POWER9 (POWER7 does not support “Platform Largesend”). In other words, it is a rather “advanced” SEA configuration, so not all VIOS configurations will see the impact.
So, as you can see, this is a serious issue that can isolate an entire Power server frame. Tech support was involved immediately, and after a few weeks and several TCP dumps sent over, IBM came back with the information that the problem is caused by a new feature, “Platform Largesend”, which was introduced in releases 18.104.22.168/21 and 22.214.171.124/61.
This makes the following releases affected:
If you want to stop reading here, see APAR IJ25390 or IJ25397.
If you experience a similar problem, ask for the fix IJ25397sFa.200604.epkg.Z. The fix was delivered to me directly; I do not see it listed in any of the existing releases.
If you want to know more, read on:
If you have hit the problem, you may try this workaround. A parameter must be changed on IBM i. This is an online operation (your IBM i host is in trouble already anyway, so it can’t get worse). Log in through the console (LAN/HMC) and start SST.
- Enter userid/password
- Option 1. Start a service tool
- Option 4. Display/Alter/Dump
- Option 1. Display/Alter storage
- Option 2. Licensed Internal Code (LIC) data
- Option 14. Advanced analysis
- Enter option 1 on the top blank line and type in setlargesendoffload
- In the options field, enter -rsc CMNxx -flag OFF, where CMNxx is the resource in question, and then vary the LIND off and back on.
This should resume TCP communication, and your IBM i client can resume operation. However, an IPL will reset the setting and the LPAR will be in the same position as before, so this is only a temporary workaround.
The second workaround (and this is probably what the “fix” does) is to disable the parameter on the virtual adapters that are part of the SEA configuration. If the SEA runs in Load Sharing mode, you very likely have multiple virtual adapters; these are listed in the SEA attribute “virt_adapters”: lsdev -dev entXX(SEA) -attr virt_adapters.
The latest VIOS release introduces a new parameter called “platform_lso” (Platform LSO Enable). Again, this is a parameter on the virtual Ethernet device, not on the SEA.
We need to disable this parameter for all virtual adapters configured in the SEA (in our example ent5 and ent7). However, in order to do so, you need to:
- Remove the devices: rmdev -l ent7, rmdev -l ent5
- Modify the parameter: chdev -l ent7 -a platform_lso=no, chdev -l ent5 -a platform_lso=no
- Run the configuration manager: cfgdev
- Set the parameter so that it is also applied after reboot: chdev -l ent5 -a platform_lso=no -P, chdev -l ent7 -a platform_lso=no -P
- Perform a VIOS reboot.
Of course, both commands will affect running TCP traffic (if you have any).
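The steps above can be sketched as a dry-run helper that takes the comma-separated value of the SEA “virt_adapters” attribute and prints the commands in the order listed. This is a minimal sketch: the function name is mine, and the commands are copied in the form used in this post. Note that rmdev -l and chdev -l ... -a are AIX root-shell syntax (reached via oem_setup_env), while cfgdev is a padmin command, so adjust the syntax to the shell you actually work in.

```shell
#!/bin/sh
# Dry-run sketch: print the commands that disable platform_lso on every
# virtual adapter of an SEA. Nothing is executed; review before running.
platform_lso_plan() {
    # $1 is the virt_adapters value, e.g. "ent5,ent7"
    adapters=$(echo "$1" | tr ',' ' ')
    # 1. Remove the devices so their attributes can be changed
    for a in $adapters; do echo "rmdev -l $a"; done
    # 2. Disable Platform LSO on each virtual Ethernet adapter
    for a in $adapters; do echo "chdev -l $a -a platform_lso=no"; done
    # 3. Reconfigure the devices
    echo "cfgdev"
    # 4. Record the setting so it is applied after reboot
    for a in $adapters; do echo "chdev -l $a -a platform_lso=no -P"; done
}

platform_lso_plan "ent5,ent7"
```

Run this only on the VIOS that is currently the backup for the SEA, since removing the virtual adapters interrupts any traffic flowing through them.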
According to tech support, following VIOS filesets levels contain the error:
- devices.vdevice.IBM.l-lan.rte 126.96.36.199
- devices.vdevice.IBM.l-lan.rte 188.8.131.52
- devices.vdevice.IBM.l-lan.rte 184.108.40.206
- devices.vdevice.IBM.l-lan.rte 220.127.116.11
- devices.vdevice.IBM.l-lan.rte 18.104.22.1682
- devices.vdevice.IBM.l-lan.rte 22.214.171.124
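To check a VIOS against this list, you can read the installed level of the fileset (for example with lslpp -L devices.vdevice.IBM.l-lan.rte from the root shell) and compare it with the affected levels. The helper below is only a sketch of that comparison: the function name is mine, and the levels in the example call (9.9.9.1, 9.9.9.2) are made-up placeholders, not real VIOS levels. Pass in the levels you got from tech support.

```shell
#!/bin/sh
# Sketch: return success if the installed fileset level is on the
# affected list. All levels are supplied by the caller.
is_affected_level() {
    installed="$1"   # level reported for devices.vdevice.IBM.l-lan.rte
    shift            # remaining arguments: the affected levels
    for lvl in "$@"; do
        [ "$installed" = "$lvl" ] && return 0
    done
    return 1
}

# Placeholder levels for illustration only
if is_affected_level "9.9.9.2" "9.9.9.1" "9.9.9.2"; then
    echo "affected"
else
    echo "not listed"
fi
```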
As you can see above, the same bug has propagated into the latest 3.1 VIOS releases, which worries me for several reasons:
- Very likely there was no proper testing before the new VIOS version was released.
- I did not find any detailed information in the 126.96.36.199 description that a new parameter was added or new functionality introduced.
- I don’t think the fix I got actually fixes the problem; in my opinion it just disables the parameter. Therefore, in the near future the VIOS team may try again to “improve TCP flow” by enabling this function.
Anyway, I ended up upgrading VIOS to version 188.8.131.52, which does not have the bug and is safe for me. I have run workload on it for more than a year on another server without any problems. For the upgrade itself I also did not use the latest 184.108.40.206x release; I took a safer path: first a jump to 220.127.116.11 and then to 3.1, followed by installation of the 18.104.22.168 service pack. It is more work, but it is certainly safer.
A very useful link, where all serious VIOS software bugs are documented, is here