Bug 242621

Summary: iscsi initiator does not provide devices after reconnection
Product: Base System
Reporter: info
Component: kern
Assignee: freebsd-scsi (Nobody) <scsi>
Status: New
Severity: Affects Only Me
CC: ben.rubson, i.dani, trasz
Priority: ---
Version: 11.3-RELEASE
Hardware: amd64
OS: Any

Description info 2019-12-13 11:08:47 UTC
After a network disruption we see some servers having trouble re-establishing their iSCSI block devices. This manifests itself in the following iscsictl -L output:

Target name                          Target portal    State
iqn.2010-06.com.purestorage:flasharray.XXX 10.4.1.15:3260   Connected: 
iqn.2010-06.com.purestorage:flasharray.XXX 10.4.1.16:3260   Connected: 
iqn.2010-06.com.purestorage:flasharray.XXX 10.4.1.17:3260   Connected: da3 
iqn.2010-06.com.purestorage:flasharray.XXX 10.4.1.18:3260   Connected: da2 

As can be seen, the iSCSI initiator was able to reconnect da3 and da2, while the other two sessions never provide a device again. It is not possible to log out of these portals with "iscsictl -R -p 10.4.1.15:3260", nor is it possible to add the path again manually. Note: this is the same block device seen over four different paths.
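
For illustration, the manual re-add attempt looks roughly like the following (assuming iscsictl -A with the portal and the abbreviated target name from above; in this state neither command has any effect):

# attempt to log out of the stuck portal (hangs / has no effect)
iscsictl -R -p 10.4.1.15:3260
# attempt to re-add the same path manually (also has no effect)
iscsictl -A -p 10.4.1.15:3260 -t iqn.2010-06.com.purestorage:flasharray.XXX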

While stuck in this situation we see the following messages in the kernel log:

Dec 13 12:01:24.567 hostname kernel: WARNING: 10.4.1.15:3260 (iqn.2010-06.com.purestorage:flasharray.XXX): no ping reply (NOP-In) after 5 seconds; reconnecting
Dec 13 12:01:24.567 hostname kernel: WARNING: 10.4.1.15:3260 (iqn.2010-06.com.purestorage:flasharray.XXX): no ping reply (NOP-In) after 5 seconds; reconnecting

Note: these messages only appear for one of the two defective target portals.

We have the following iSCSI sysctls set:

kern.iscsi.fail_on_shutdown: 1
kern.iscsi.fail_on_disconnection: 1
kern.iscsi.maxtags: 255
kern.iscsi.login_timeout: 60
kern.iscsi.iscsid_timeout: 60
kern.iscsi.ping_timeout: 5
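
For completeness, a minimal sketch of how these tunables can be kept across reboots (assuming they are set via /etc/sysctl.conf; kern.iscsi.* are runtime sysctls, so they can also be changed on the fly with sysctl(8)):

# /etc/sysctl.conf (excerpt)
kern.iscsi.fail_on_shutdown=1
kern.iscsi.fail_on_disconnection=1
kern.iscsi.maxtags=255
kern.iscsi.login_timeout=60
kern.iscsi.iscsid_timeout=60
kern.iscsi.ping_timeout=5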


We see this behaviour with different iSCSI target vendors (Pure Storage, NetApp E-Series) on multiple different FreeBSD 11.3 hosts. The only solution we have found so far is to reboot the affected host.
Comment 1 Ben RUBSON 2019-12-13 15:08:05 UTC
Do you use jumbo frames?
It could be a lack of mbuf_jumbo_9k clusters (vmstat -z should tell you).
If so, a workaround is to decrease the MTU until 9k mbufs are no longer used.
On my systems that gives a 4072-byte MTU.
Of course it's just a workaround, as decreasing the MTU increases overhead...
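
A minimal sketch of the check and workaround described above (the interface name ix0 and the 4072 MTU are only examples; adjust them to your setup):

# check whether 9k jumbo clusters are being exhausted (look at the FAIL column)
vmstat -z | grep mbuf_jumbo_9k
# workaround: lower the MTU so that 9k clusters are no longer needed
ifconfig ix0 mtu 4072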
Comment 2 info 2019-12-13 15:17:15 UTC
(In reply to Ben RUBSON from comment #1)

No, we do not use jumbo frames:

[root@xxx:~] # ifconfig | grep mtu
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
enc0: flags=0<> metric 0 mtu 1536
pfsync0: flags=0<> metric 0 mtu 1500
pflog0: flags=0<> metric 0 mtu 33160
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
lagg0.10: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
lagg0.800: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500

[root@xxx:~] # vmstat -z | grep mbuf_jumbo_9k
mbuf_jumbo_9k:         9216, 2420071,       0,       0,       0,   0,   0
Comment 3 commit-hook freebsd_committer freebsd_triage 2020-01-15 04:31:17 UTC
A commit references this bug:

Author: meta
Date: Wed Jan 15 04:30:48 UTC 2020
New revision: 523079
URL: https://svnweb.freebsd.org/changeset/ports/523079

Log:
  sysutils/getssl: Update to 2.14

  2.14
    * Rebased master onto APIv2 and added Content-Type: application/jose+json

  PR:		242621
  Submitted by:	sanpei
  Approved by:	maintainer timeout

Changes:
  head/sysutils/getssl/Makefile
  head/sysutils/getssl/distinfo
Comment 4 Edward Tomasz Napierala freebsd_committer freebsd_triage 2020-01-16 10:23:03 UTC
Can you enable debug messages (sysctl kern.iscsi.debug=10) and paste the dmesg output from when it happens?  Also, instead of rebooting, can you try 'iscsictl -M -e off' and then 'iscsictl -M -e on'?
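
A sketch of those debugging steps (the commands follow the suggestion above; depending on the iscsictl version, -M may additionally require a session ID via -i):

# enable verbose iSCSI initiator debugging in the kernel
sysctl kern.iscsi.debug=10
# reproduce the disruption, then capture the resulting kernel messages
dmesg
# disable and re-enable the initiator instead of rebooting
iscsictl -M -e off
iscsictl -M -e on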
Comment 5 info 2022-03-28 21:53:55 UTC
(In reply to Edward Tomasz Napierala from comment #4)

We still have this issue under 12.3-RELEASE-p1. In the meantime I came across such a situation again and have some more information now:

When this situation occurs, 'iscsictl -M -e off' followed by 'iscsictl -M -e on' does not help at all. Even a normal `shutdown -r now` stops progressing at one point of the shutdown routine, apparently waiting for the iSCSI devices, so we are forced to reset the server at this stage.