After a network disruption, some of our servers have trouble re-establishing their iSCSI block devices. This shows up in the following iscsictl -L output:

    Target name                                 Target portal    State
    iqn.2010-06.com.purestorage:flasharray.XXX  10.4.1.15:3260   Connected:
    iqn.2010-06.com.purestorage:flasharray.XXX  10.4.1.16:3260   Connected:
    iqn.2010-06.com.purestorage:flasharray.XXX  10.4.1.17:3260   Connected: da3
    iqn.2010-06.com.purestorage:flasharray.XXX  10.4.1.18:3260   Connected: da2

As can be seen, the iSCSI initiator was able to reconnect da3 and da2. The other two devices are missing. It is not possible to log out of these portals with "iscsictl -R -p 10.4.1.15:3260", nor is it possible to add the path again manually.

Note: This is the same block device seen over 4 different paths.

While the host is stuck in this state, we see these messages in the kernel log:

    Dec 13 12:01:24.567 hostname kernel: WARNING: 10.4.1.15:3260 (iqn.2010-06.com.purestorage:flasharray.XXX): no ping reply (NOP-In) after 5 seconds; reconnecting
    Dec 13 12:01:24.567 hostname kernel: WARNING: 10.4.1.15:3260 (iqn.2010-06.com.purestorage:flasharray.XXX): no ping reply (NOP-In) after 5 seconds; reconnecting

Note: These messages appear for only one of the two defective target portals.

We have the following iSCSI sysctls set:

    kern.iscsi.fail_on_shutdown: 1
    kern.iscsi.fail_on_disconnection: 1
    kern.iscsi.maxtags: 255
    kern.iscsi.login_timeout: 60
    kern.iscsi.iscsid_timeout: 60
    kern.iscsi.ping_timeout: 5

We see this behaviour with different iSCSI target vendors (Pure Storage, NetApp eSeries) on multiple FreeBSD 11.3 hosts. The only solution we have found so far is to reboot the affected host.
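The symptom described above (a session that reports "Connected:" but has no da device attached) can be spotted mechanically. The following is a minimal sketch, not part of the original report: the function name stuck_portals is hypothetical, and it assumes the whitespace-separated column layout shown in the iscsictl -L output above (target, portal, state, then device names).

```shell
#!/bin/sh
# Hypothetical helper: read `iscsictl -L` style output on stdin and print
# the portal of every session that is "Connected:" but lists no device.
# Assumes fields are: target name, target portal, state, device name(s).
stuck_portals() {
    # A healthy line has at least 4 fields (the 4th being e.g. da2);
    # a stuck line ends right after "Connected:" and so has exactly 3.
    awk '$3 == "Connected:" && NF == 3 { print $2 }'
}
```

On an affected host this could be run as: iscsictl -L | stuck_portals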
Do you use jumbo frames? This could be caused by a shortage of mbuf_jumbo_9k (vmstat -z should tell you). If so, a workaround is to decrease the MTU until 9K mbufs are no longer used. On my systems that gives an MTU of 4072 bytes. Of course it's just a workaround, as decreasing the MTU increases overhead...
(In reply to Ben RUBSON from comment #1)

No, we do not use jumbo frames:

    [root@xxx:~] # ifconfig | grep mtu
    ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
    enc0: flags=0<> metric 0 mtu 1536
    pfsync0: flags=0<> metric 0 mtu 1500
    pflog0: flags=0<> metric 0 mtu 33160
    lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    lagg0.10: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    lagg0.800: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500

    [root@xxx:~] # vmstat -z | grep mbuf_jumbo_9k
    mbuf_jumbo_9k: 9216, 2420071, 0, 0, 0, 0, 0
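For reference, the mbuf_jumbo_9k check discussed above could be scripted roughly as follows. This is only a sketch under assumptions: the function name zone_failures is hypothetical, and it assumes the vmstat -z column order ITEM, SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP, so that FAIL is the sixth number after the zone name. Verify this against the header line on the host before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: read `vmstat -z` output on stdin and print the FAIL
# counter of the zone named in $1.
# Assumed columns: ITEM: SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP
zone_failures() {
    # Split on ':' and ','; field 1 is the zone name, field 7 is FAIL.
    # Adding 0 coerces the padded field (e.g. " 0") to a plain number.
    awk -F'[:,]' -v zone="$1" '$1 == zone { print $7 + 0 }'
}
```

It could be used as: vmstat -z | zone_failures mbuf_jumbo_9k. A non-zero result would point at failed 9K mbuf allocations; the output pasted above shows zero in every counter, which is consistent with jumbo frames not being in use.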
A commit references this bug:

Author: meta
Date: Wed Jan 15 04:30:48 UTC 2020
New revision: 523079
URL: https://svnweb.freebsd.org/changeset/ports/523079

Log:
  sysutils/getssl: Update to 2.14

  2.14
  * Rebased master onto APIv2 and added Content-Type: application/jose+json

  PR: 242621
  Submitted by: sanpei
  Approved by: maintainer timeout

Changes:
  head/sysutils/getssl/Makefile
  head/sysutils/getssl/distinfo
Can you enable debug messages (sysctl kern.iscsi.debug=10) and paste the dmesg output from when it happens? Also, instead of rebooting, can you try 'iscsictl -M -e off' and then 'iscsictl -M -e on'?
(In reply to Edward Tomasz Napierala from comment #4)

We still have this issue under 12.3-RELEASE-p1. In the meantime I ran into the situation again and now have some more information: when it occurs, 'iscsictl -M -e off' followed by 'iscsictl -M -e on' does not help at all. Even a normal `shutdown -r now` stops making progress at some point in the shutdown sequence, apparently waiting for iSCSI devices, so at that stage we are forced to reset the server.