Bug 255930 - ocs_fc Lost all connected devices after some use.
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: Unspecified
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-scsi (Nobody)
Reported: 2021-05-16 18:30 UTC by Arne Steinkamm
Modified: 2021-05-17 17:13 UTC

Attachments
Message file with all described problems. See Bug reports for time stamps (41.12 KB, application/x-bzip)
2021-05-16 18:30 UTC, Arne Steinkamm

Description Arne Steinkamm 2021-05-16 18:30:53 UTC
Created attachment 225001
Message file with all described problems. See Bug reports for time stamps

I connected an HP ProLiant 380 Gen9 server with Emulex FC HBAs to two simple FC setups and attached a NetApp FlashFiler EF550 unit. To get the most out of ZFS, I assigned all 24 flash modules directly to the ProLiant, without using the EF550 RAID features.

I use geom_multipath to handle the redundant connections to the flash filer and made a ZFS pool with 3 x 7-disk raidz1, one spare, one log disk and one cache disk.
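
The pool layout above adds up to 3 x 7 + 1 + 1 + 1 = 24 devices. A minimal sketch of the corresponding zpool create command follows. It only prints the command and does not run it. The device names multipath/d0 to multipath/d23 and the function name are hypothetical placeholders, not the labels used on the reporter's system.

```shell
#!/bin/sh
# Print (do not execute) a zpool create command for the described layout:
# 3 x 7-disk raidz1, one spare, one log, one cache = 24 devices.
# Device names multipath/d0 .. multipath/d23 are hypothetical placeholders.
build_zpool_cmd() {
    pool=$1
    cmd="zpool create $pool"
    i=0
    for vdev in 0 1 2; do
        cmd="$cmd raidz1"
        n=0
        while [ "$n" -lt 7 ]; do
            cmd="$cmd multipath/d$i"
            i=$((i + 1))
            n=$((n + 1))
        done
    done
    # The remaining three devices become spare, log and cache.
    cmd="$cmd spare multipath/d21 log multipath/d22 cache multipath/d23"
    printf '%s\n' "$cmd"
}

build_zpool_cmd zone
```

The pool name zone is taken from the zpool import step later in this report.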

The read/write speed is good (2.5 GB/s according to zpool iostat), but after some
minutes of heavy use I get messages like
kernel: ocs_fc0: ocs_initiator_io: device LOST 0
and all FC-connected disks are gone.

I found no way to recover from this error situation other than a reboot, a panic (ZFS is not happy about the situation) or a hardware reset.
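
To see which HBA instances report the loss, the "device LOST" lines can be counted per ocs_fc unit. This is a minimal sketch that assumes the log lines have the form quoted above. The function name is hypothetical, and the file to scan (for example the unpacked attachment or /var/log/messages) is passed as an argument.

```shell
#!/bin/sh
# Count "device LOST" messages per ocs_fc instance in a messages file.
# $1 = path to the log file. Prints "<count> <instance>" per instance.
count_device_lost() {
    grep 'ocs_initiator_io: device LOST' "$1" |
        sed -n 's/.*\(ocs_fc[0-9][0-9]*\): ocs_initiator_io: device LOST.*/\1/p' |
        sort | uniq -c
}
```

Usage: count_device_lost /var/log/messages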

Further observations:

- reported topologies and link speeds are correct.

- EF550 replaced with an identical spare unit: no change

- changed FC ports: no effect

- used different Emulex cards (alone, mixed): no effect, the problem happens with any combination of installed Emulex cards

- tried QLogic cards (driver: isp(4)): no problems, works 100% stable but with slightly slower I/O performance.

- tried 12.1-RELEASE, 12.2-RELEASE and 13.0-RELEASE, the last one with the GENERIC kernel without any changes. Every time all FC devices were lost.

- Boot with disabled switch FC ports:
  After a portenable on the Brocades' ports the FC links went up, but the disks were not attached automatically.
  A camcontrol rescan all was not successful; thousands of "device not ready" messages flooded the console.
  The only way to get the flash modules online is to boot the server with a working FC setup.

- Bumping the emulex cards to the newest available firmware had no visible effect.

- Playing with the HBA-related BIOS settings
  "HP Shared Memory Feature", "Brocade FA-PWWN" and "PLOGI Retry Timer" had no visible effect.


More details of the last try with 13.0-RELEASE generic:

uname -a:
FreeBSD vwcnctd00fs003.dev.kpdm01.group.vwg 13.0-RELEASE FreeBSD 13.0-RELEASE #0 releng/13.0-n244733-ea31abc261f: Fri Apr  9 04:24:09 UTC 2021     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

pciconf -lv:

ocs_fc0@pci0:8:0:0:     class=0x0c0400 rev=0x01 hdr=0x00 vendor=0x10df device=0xe300 subvendor=0x1590 subdevice=0x0214
    vendor     = 'Emulex Corporation'
    device     = 'LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter'
    class      = serial bus
    subclass   = Fibre Channel
ocs_fc1@pci0:8:0:1:     class=0x0c0400 rev=0x01 hdr=0x00 vendor=0x10df device=0xe300 subvendor=0x1590 subdevice=0x0214
    vendor     = 'Emulex Corporation'
    device     = 'LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter'
    class      = serial bus
    subclass   = Fibre Channel
ocs_fc2@pci0:129:0:0:   class=0x0c0400 rev=0x30 hdr=0x00 vendor=0x10df device=0xe200 subvendor=0x103c subdevice=0x197f
    vendor     = 'Emulex Corporation'
    device     = 'LPe15000/LPe16000 Series 8Gb/16Gb Fibre Channel Adapter'
    class      = serial bus
    subclass   = Fibre Channel
ocs_fc3@pci0:129:0:1:   class=0x0c0400 rev=0x30 hdr=0x00 vendor=0x10df device=0xe200 subvendor=0x103c subdevice=0x197f
    vendor     = 'Emulex Corporation'
    device     = 'LPe15000/LPe16000 Series 8Gb/16Gb Fibre Channel Adapter'
    class      = serial bus
    subclass   = Fibre Channel

HP device names:
HPE SN1200E 16Gb 2p FC HBA Product Part Number: Q0L14-63001 Assembly Number: 870002-001
HP SN1100E 16Gb 2P FC HBA  Product Part Number: C8R39-60001 Assembly Number: 719212-001

The EF550 has two independent controllers, both connected to all flash module bays. Each controller has two FC ports.
These ports are connected to two independent Brocade FC switches (no interlink fibre).
One port of each Emulex card is connected to one of the FC switches.
The other port of each Emulex card is not in use (connected to an enterprise fabric network independent from my laboratory setup, but the ports are disabled on the switch side).
Using only one of the Emulex cards does not change the effect. I tried all possible permutations.

To get valid data for this bug report I installed 13.0-RELEASE with a minimal setup:


/boot/device.hints:
hint.ocs_fc.0.initiator="1"
hint.ocs_fc.2.initiator="1"
hint.ocs_fc.0.topology="1"
hint.ocs_fc.2.topology="1"
hint.ocs_fc.0.speed="16000"
hint.ocs_fc.2.speed="16000"

/etc/sysctl.conf:
dev.ocs_fc.1.port_state=offline
dev.ocs_fc.3.port_state=offline
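
To repeat this minimal setup for other unit numbers, the hint lines can be generated rather than typed. This is a sketch under stated assumptions: the function name is hypothetical, and the values (initiator 1, topology 1, speed 16000) are simply copied from the hints above, not recommended settings.

```shell
#!/bin/sh
# Emit /boot/device.hints lines for the given ocs_fc unit numbers, using the
# same values as in this report: initiator=1, topology=1, speed=16000.
ocs_fc_hints() {
    for key in initiator:1 topology:1 speed:16000; do
        for unit in "$@"; do
            # ${key%%:*} is the hint name, ${key#*:} is its value.
            printf 'hint.ocs_fc.%s.%s="%s"\n' "$unit" "${key%%:*}" "${key#*:}"
        done
    done
}

ocs_fc_hints 0 2
```

Called with units 0 and 2, it prints the six hint lines in the order listed above.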


In the attached messages file you will find this:

May 15 19:21:43 - 19:29:22
First boot and configuring network connectivity on the shell.

May 15 19:44:24 Enabling FC ports on both brocades

May 15 19:47:21 camcontrol rescan all (all rescans successful according to camcontrol)

May 15 19:59:15 reboot --- Now with enabled FC links. It will find the flash modules

May 15 20:06:36 kldload geom_multipath.ko

geom_multipath finds four preconfigured links to each flash module. This is correct.
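
With 24 flash modules and four paths each, 24 x 4 = 96 da(4) devices should be visible before geom_multipath groups them. A minimal sketch of that check follows. It assumes the usual camcontrol devlist line format, with the peripheral names in parentheses at the end of each line. Both function names are hypothetical, and local da disks on the host would also be counted.

```shell
#!/bin/sh
# Expected number of paths: modules * paths per module.
expected_paths() {
    echo $(( $1 * $2 ))
}

# Read `camcontrol devlist` output on stdin and count the lines that expose
# a da(4) peripheral, e.g. "... (pass3,da0)" or "... (da0,pass3)".
count_da_paths() {
    grep -c -E '[(,]da[0-9]+[,)]'
}
```

Usage: compare the output of camcontrol devlist | count_da_paths with expected_paths 24 4.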

Now I did a zpool import zone and started a couple of test tools.
Output of zpool iostat zone 1:
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zone        14.7T   486G  40.1K      0  2.61G      0
zone        14.7T   486G  39.0K    436  2.58G  1.94M
zone        14.7T   486G  41.6K      0  2.60G      0
zone        14.7T   486G  39.4K      0  2.60G      0
zone        14.7T   486G  39.4K      0  2.62G      0
zone        14.7T   486G  40.7K      0  2.57G      0
zone        14.7T   486G  39.9K    420  2.54G  1.94M
zone        14.7T   486G  39.5K      0  2.58G      0
zone        14.7T   486G  39.6K      0  2.64G      0
zone        14.7T   486G  39.3K      0  2.57G      0
zone        14.7T   486G  39.4K      0  2.62G      0
...
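
The read bandwidth in this output can be averaged with a short awk filter. This is a minimal sketch: the function name is hypothetical, it assumes the seven-column layout shown above (read bandwidth in column 6), and it only handles values with a G suffix.

```shell
#!/bin/sh
# Average the read-bandwidth column of `zpool iostat` output read from stdin.
# $1 = pool name. Only rows for that pool with a value like "2.61G" are used.
avg_read_bw() {
    awk -v pool="$1" '
        $1 == pool && $6 ~ /^[0-9.]+G$/ {
            sub(/G$/, "", $6)   # strip the unit suffix
            sum += $6
            n++
        }
        END { if (n) printf "%.2f\n", sum / n }
    '
}
```

Usage: zpool iostat zone 1 | head -n 14 | avg_read_bw zone. For the eleven samples shown above this gives about 2.59 (in the G units that zpool iostat prints), in line with the roughly 2.5 GB/s mentioned at the start of the report.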


May 15 20:15:15 The problem starts

May 15 20:16:18 Attempted a camcontrol rescan, without success

My short-term solution is to use QLogic cards with the isp driver, which works 100% stable without any changes.