Bug 242621 - iscsi initiator does not provide devices after reconnection
Summary: iscsi initiator does not provide devices after reconnection
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-scsi mailing list
Reported: 2019-12-13 11:08 UTC by info
Modified: 2020-01-16 10:23 UTC
Description info 2019-12-13 11:08:47 UTC
After a network disruption, some of our servers have trouble re-establishing their iSCSI block devices. This shows up in the following iscsictl -L output:

Target name                                Target portal    State
iqn.2010-06.com.purestorage:flasharray.XXX 10.4.1.15:3260   Connected:
iqn.2010-06.com.purestorage:flasharray.XXX 10.4.1.16:3260   Connected:
iqn.2010-06.com.purestorage:flasharray.XXX 10.4.1.17:3260   Connected: da3
iqn.2010-06.com.purestorage:flasharray.XXX 10.4.1.18:3260   Connected: da2

As can be seen, the iSCSI initiator was able to reconnect da3 and da2; the other two devices are missing. It is not possible to log out of these portals using the command "iscsictl -R -p 10.4.1.15:3260", nor is it possible to add the path again manually. Note: this is the same block device seen over four different paths.
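
To spot this state on several hosts without reading the listing by eye, a small filter can help. The following is only a sketch: the function name stuck_portals is made up for illustration, and it assumes the three-column layout shown above, where a healthy session has a device name after "Connected:".

```shell
#!/bin/sh
# Print the portal of every session that is reported as connected
# but has no da(4) device listed after "Connected:".
# Reads `iscsictl -L` output on stdin; assumes the columns are
# target name, target portal, state (as in the listing above).
stuck_portals() {
    # A stuck line has exactly three fields and the third is "Connected:".
    # The header line never matches because its third field is "Target".
    awk '$3 == "Connected:" && NF == 3 { print $2 }'
}
```

Typical use would be "iscsictl -L | stuck_portals". Applied to the listing above, it prints 10.4.1.15:3260 and 10.4.1.16:3260.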

While a host is stuck in this state, we see the following messages in the kernel log:

Dec 13 12:01:24.567 hostname kernel: WARNING: 10.4.1.15:3260 (iqn.2010-06.com.purestorage:flasharray.XXX): no ping reply (NOP-In) after 5 seconds; reconnecting
Dec 13 12:01:24.567 hostname kernel: WARNING: 10.4.1.15:3260 (iqn.2010-06.com.purestorage:flasharray.XXX): no ping reply (NOP-In) after 5 seconds; reconnecting

Note: These messages appear for only one of the two defective target portals.

We have the following iSCSI sysctls set:

kern.iscsi.fail_on_shutdown: 1
kern.iscsi.fail_on_disconnection: 1
kern.iscsi.maxtags: 255
kern.iscsi.login_timeout: 60
kern.iscsi.iscsid_timeout: 60
kern.iscsi.ping_timeout: 5


We see this behaviour with iSCSI targets from different vendors (Pure Storage, NetApp eSeries) on multiple FreeBSD 11.3 hosts. The only solution we have found so far is to reboot the affected host.
Comment 1 Ben RUBSON 2019-12-13 15:08:05 UTC
Do you use jumbo frames?
It could be a lack of mbuf_jumbo_9k (vmstat -z should tell you).
If so, a workaround is to decrease the MTU until 9K mbufs are no longer used.
On my systems that gives an MTU of 4072 bytes.
Of course it's just a workaround, as decreasing the MTU increases overhead...
Comment 2 info 2019-12-13 15:17:15 UTC
(In reply to Ben RUBSON from comment #1)

No, we do not use jumbo frames:

[root@xxx:~] # ifconfig | grep mtu
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
enc0: flags=0<> metric 0 mtu 1536
pfsync0: flags=0<> metric 0 mtu 1500
pflog0: flags=0<> metric 0 mtu 33160
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
lagg0.10: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
lagg0.800: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500

[root@xxx:~] # vmstat -z | grep mbuf_jumbo_9k
mbuf_jumbo_9k:         9216, 2420071,       0,       0,       0,   0,   0
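
For anyone repeating this check, the interesting columns can be pulled out of the vmstat -z line with a short helper. This is a sketch, not an official tool: the name zone_stats is hypothetical, and it assumes the FreeBSD 11.x column order ITEM, SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP.

```shell
#!/bin/sh
# Print USED and FAIL for one UMA zone, reading `vmstat -z` output on stdin.
# Assumed column order (FreeBSD 11.x): ITEM, SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP.
zone_stats() {
    # Split on ":" or "," followed by optional spaces, so that
    # $1 is the zone name, $4 is USED and $7 is FAIL.
    awk -F '[:,] *' -v zone="$1" '$1 == zone { print "used=" $4, "fail=" $7 }'
}
```

Typical use would be "vmstat -z | zone_stats mbuf_jumbo_9k". Under the column order assumed here, the line above shows no 9K jumbo mbufs in use and no failed allocations, which fits the 1500-byte MTU.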
Comment 3 commit-hook freebsd_committer 2020-01-15 04:31:17 UTC
A commit references this bug:

Author: meta
Date: Wed Jan 15 04:30:48 UTC 2020
New revision: 523079
URL: https://svnweb.freebsd.org/changeset/ports/523079

Log:
  sysutils/getssl: Update to 2.14

  2.14
    * Rebased master onto APIv2 and added Content-Type: application/jose+json

  PR:		242621
  Submitted by:	sanpei
  Approved by:	maintainer timeout

Changes:
  head/sysutils/getssl/Makefile
  head/sysutils/getssl/distinfo
Comment 4 Edward Tomasz Napierala freebsd_committer 2020-01-16 10:23:03 UTC
Can you enable debug messages (sysctl kern.iscsi.debug=10) and paste dmesg from when it happens? Also, instead of rebooting, can you try 'iscsictl -M -e off' and then 'iscsictl -M -e on'?
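
The requested steps could be collected in a small script along these lines. It is a sketch under assumptions: the function name iscsi_debug_cycle and the RUN variable are made up for illustration, and the -i option is added because, as far as iscsictl(8) describes it, -M operates on a session ID (shown by "iscsictl -L -v"). The session ID passed in is a placeholder to be replaced with the ID of the stuck session.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set RUN="" in the environment to run the commands for real
# (needs root on the affected host).
RUN="${RUN-echo}"

# Raise iSCSI debug verbosity, disable and re-enable one session,
# then dump the kernel message buffer.
iscsi_debug_cycle() {
    sid="$1"   # session ID; placeholder, look it up with `iscsictl -L -v`
    $RUN sysctl kern.iscsi.debug=10
    $RUN iscsictl -M -i "$sid" -e off
    $RUN iscsictl -M -i "$sid" -e on
    $RUN dmesg
}
```

Calling "iscsi_debug_cycle 1" in the default dry-run mode only prints the four commands, which makes it easy to review them before running anything on a production host.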