Hello, I'm facing an issue where iscsictl does not want to remove devices. Here is how I can reproduce this. ### Initiator : # iscsictl -Aa iscsictl then reports the 17 targets as connected, perfect. ### Target : Let's switch down the network interface # ifconfig mlxen1 down ### Initiator : iscsictl reports the 17 targets as disconnected, perfect. ### Target : Let's switch up the network interface # ifconfig mlxen1 up ### Initiator : iscsictl reports the 17 targets as connected, however, for 4 devices, I get the following : 09:59:43 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): timed out waiting for iscsid(8) for 11 seconds; reconnecting 09:59:54 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): timed out waiting for iscsid(8) for 11 seconds; reconnecting 09:59:57 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): handoff on already connected session 07:59:57 srv1 iscsid[1372]: 192.168.2.2 (iqn.2012-06.srv2:lg2): ISCSIDHANDOFF: Device busy 09:59:57 srv1 iscsid[581]: child process 1372 terminated with exit status 1 09:59:57 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): connection error; reconnecting 09:59:57 srv1 kernel: (da21:iscsi8:0:0:0): got CAM status 0x8 09:59:57 srv1 kernel: (da21:iscsi8:0:0:0): fatal error, failed to attach to device 10:00:07 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): no ping reply (NOP-In) after 10 seconds; reconnecting 10:00:08 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): no ping reply (NOP-In) after 10 seconds; reconnecting ### Target : 09:58:50 srv2 kernel: mlxen1: link state changed to DOWN 09:58:50 srv2 kernel: mlx4_en: mlxen1: Link Down 09:58:53 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1): no ping reply (NOP-Out) after 5 seconds; dropping connection 09:58:53 srv2 last message repeated 16 times 09:59:49 srv2 kernel: mlx4_en: mlxen1: Link Up 09:59:49 srv2 kernel: mlxen1: link state changed to UP 09:59:49 srv2 devd: Executing '/etc/rc.d/dhclient quietstart mlxen1' 09:59:59 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1): connection error; dropping connection 09:59:59 srv2 last message repeated 3 times ### Initiator : # iscsictl -Ra # iscsictl -L Target name Target portal State iqn.2012-06.srv2:sW1 192.168.2.2 Connected: da18 iqn.2012-06.srv2:sW2 192.168.2.2 Connected: da23 iqn.2012-06.srv2:rT3 192.168.2.2 Connected: da17 iqn.2012-06.srv2:lg2 192.168.2.2 Connected: da21 As you can see, the 4 problematic devices remain "connected", nodes exist in /dev/, but they are unusable. Each time I "iscsictl -Ra", I get the following on initiator side : 10:09:35 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): connection error; reconnecting 10:09:35 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:sW1): connection error; reconnecting 10:09:35 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:rT3): connection error; reconnecting 10:09:35 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:sW2): connection error; reconnecting No logs however on target side, even if I start ctld with -d. The only workaround I found is to reboot, or to change the target name to properly reconnect... # uname -v FreeBSD 10.3-RELEASE-p7 #0: Thu Aug 11 18:38:15 UTC 2016 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC Thank you for your support, Best regards, Ben
On initiator, netstat also returns the following : tcp4 0 0 192.168.2.1.49715 192.168.2.2.3260 CLOSED tcp4 0 0 192.168.2.1.49708 192.168.2.2.3260 CLOSED tcp4 0 0 192.168.2.1.49707 192.168.2.2.3260 CLOSED tcp4 0 0 192.168.2.1.49706 192.168.2.2.3260 CLOSED
Quite difficult to reproduce, but I managed to do it again. This time the problematic target (only one /17) is seen as disconnected. ### Initiator : # iscsictl -Ra # iscsictl -L Target name Target portal State iqn.2012-06.srv2:ch3 192.168.2.2 Disconnected # netstat -an | grep 3260 // nothing returned ### Initiator logs : 12:38:02 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:ch3): login timed out after 11 seconds; reconnecting 12:38:06 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:ch3): handoff on already connected session 12:38:06 srv1 iscsid[7395]: 192.168.2.2 (iqn.2012-06.srv2:ch3): ISCSIDHANDOFF: Device busy 12:38:06 srv1 iscsid[581]: child process 7395 terminated with exit status 1 12:38:06 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:ch3): connection error; reconnecting 12:38:06 srv1 kernel: da8 at iscsi63 bus 0 scbus1 target 0 lun 0 12:38:06 srv1 kernel: da8: <FREEBSD CTLDISK 0001> s/n G22020PY1EZHAch detached 12:38:16 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:ch3): no ping reply (NOP-In) after 10 seconds; reconnecting ### Target logs : 12:37:41 srv2 kernel: mlxen1: link state changed to DOWN 12:37:41 srv2 kernel: mlx4_en: mlxen1: Link Down 12:37:46 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1): no ping reply (NOP-Out) after 5 seconds; dropping connection 12:37:46 srv2 last message repeated 16 times 12:38:05 srv2 kernel: mlx4_en: mlxen1: Link Up 12:38:05 srv2 kernel: mlxen1: link state changed to UP 12:38:05 srv2 devd: Executing '/etc/rc.d/dhclient quietstart mlxen1' 12:38:09 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1): connection error; dropping connection On initiator side, I also get the same log messages as above for the other devices, but they correctly reconnect.
(In reply to Ben RUBSON from comment #0) >The only workaround I found is to reboot I even have to hard-reboot, because of this never ending : iscsi_terminate_sessions: waiting for sessions to terminate
One strange thing I noticed. (I put all things that could be interesting from my troubleshooting) As soon as I put the network interface down, I get the following message on target side, one per target : 17:01:00 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1): no ping reply (NOP-Out) after 5 seconds; dropping connection Then, on initiator side, I get these messages for each target : Aug 19 17:01:07 srv1 kernel: iscsi_maintenance_thread_reconnect: 192.168.2.2 (iqn.2012-06.srv2:hm4): connection failed, destroying devices Aug 19 17:01:07 srv1 kernel: iscsi_session_cleanup: 192.168.2.2 (iqn.2012-06.srv2:hm4): freezing Aug 19 17:01:07 srv1 kernel: iscsi_session_cleanup: 192.168.2.2 (iqn.2012-06.srv2:hm4): deregistering SIM At this moment, on initiator side, one iscsid process per target appears. 10 seconds later, on initiator side, I get these messages for each target : Aug 19 17:01:18 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:hm4): login timed out after 11 seconds; reconnecting Aug 19 17:01:18 srv1 kernel: iscsi_maintenance_thread_reconnect: 192.168.2.2 (iqn.2012-06.srv2:hm4): connection failed, destroying devices And at the same time, a second iscsid process per target appears, so that I get 2 iscsid processes per target : # ps auxxw | grep iscsid: root 866 0.0 0.0 16632 2144 - I 4:58pm 0:00.00 iscsid: 192.168.2.2 (iqn.2012-06.srv2:hm4) (iscsid) root 881 0.0 0.0 16632 2144 - I 4:58pm 0:00.00 iscsid: 192.168.2.2 (iqn.2012-06.srv2:hm4) (iscsid) (...) However sounds like there is a limit to 30 processes, as for 17 targets I would have expected 34 processes, but I only get 30. If I put the NIC up before the second process is created, I only get one reconnection message per target in target logs. If I put the NIC up after the second process is created, I get a lot more reconnection messages in target logs, between 40 and 50 for 17 targets. Do we expect these additional processes ? I think we would only expect one process / one reconnection message per target ? Seems strange to have all these "duplicated" connection retries. Another related question to the "30" processes found : Is there any limit to 30 targets ? I found a maxproc option in ctl.conf (default to 30) but I don't exactly know what it means (I tested values of 1 to 50 without seeing any change). No option found however on initiator side. I noticed that we can reproduce this bug easier when we "stress" the devices : disconnect network as soon as targets are reconnected, and reconnect it as soon as they are disconnected. Additionally to this, I had 8 kernel crashes, initator or target, each time with the same address / pointer : kernel: Fatal trap 12: page fault while in kernel mode kernel: fault virtual address = 0x1e8 kernel: instruction pointer = 0x20:0xffffffff80936933 I also got a stacktrace, but did not get it's pointer address. http://img4.hostingpics.net/pics/707217211990.png I'm also trying to get a full dump. However I'm not sure this kernel crash issue is related to the reconnection issue, perhaps there are 2 issues. # uname -v FreeBSD 10.3-RELEASE-p7 #0: Thu Aug 11 18:38:15 UTC 2016 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC A lot of info ! I hope we will be able to correct these issues. Many thanks, Ben
I made further troubleshooting. Regarding the number of iscsid processes which increases, I've found timeout settings which permit to always have only one process per target : sysctl kern.iscsi.login_timeout=85 sysctl kern.iscsi.iscsid_timeout=5 sysctl kern.iscsi.ping_timeout=5 These are based on net.inet.tcp.keepinit which is 75000 by default (75 seconds). Regarding the "30" processes limit, it can be tuned with the "-d" option to iscsid (through /etc/rc.d/iscsid for example). And the most important, regarding the bug : when devices do not want to correctly reconnect, I found that it is because iscsi is stuck in the following : cam_sim_free(is->is_sim, TRUE /*free_devq*/); However I don't know why.
(In reply to Ben RUBSON from comment #5) >And the most important, regarding the bug : >when devices do not want to correctly reconnect, I found that it is because >iscsi is stuck in the following : >cam_sim_free(is->is_sim, TRUE /*free_devq*/); >However I don't know why. I put some printf in cam_sim.c and found that iscsi is stuck by : msleep(sim, sim->mtx, PRIBIO, "simfree", 0); cam_sim_release() is never fired, so msleep is blocking iscsi.
I tried 11-RC1 and was not able to reproduce the issue. With 10.3, after about 10 NIC toggles, some devices fail. With 11-RC1, more than one thousand toggles later, all my devices are still reconnecting correctly.
I'm facing this issue in 11-RC2 where I have 2 iscsi devices which are disconnected, don't want to reconnect, and I can't manage to remove them : # iscsictl Target name Target portal State iqn.2012-06.g2:rT3 192.168.2.2 Disconnected iqn.2012-06.g2:ch1 192.168.2.2 Disconnected # iscsictl -Ra # iscsictl -v Session ID: 6 Initiator name: iqn.2012-06.b2g1 Initiator portal: Initiator alias: Target name: iqn.2012-06.g2:rT3 Target portal: 192.168.2.2 Target alias: G22020PY1EZHArT User: Secret: Mutual user: Mutual secret: Session type: Normal Enable: Yes Session state: Disconnected Failure reason: Header digest: None Data digest: None DataSegmentLen: 131072 MaxBurstLen: 262144 FirstBurstLen: 131072 ImmediateData: Yes iSER (RDMA): No Offload driver: None Device nodes: da11 Session ID: 10 Initiator name: iqn.2012-06.b2g1 Initiator portal: Initiator alias: Target name: iqn.2012-06.g2:ch1 Target portal: 192.168.2.2 Target alias: G22000PY1EVVAch User: Secret: Mutual user: Mutual secret: Session type: Normal Enable: Yes Session state: Disconnected Failure reason: Header digest: None Data digest: None DataSegmentLen: 131072 MaxBurstLen: 262144 FirstBurstLen: 131072 ImmediateData: Yes iSER (RDMA): No Offload driver: None Device nodes: da12 # ls -l /dev/da11 /dev/da12 ls: /dev/da11: No such file or directory ls: /dev/da12: No such file or directory Any help please ? Thank you very much, Ben
Mmmmhh these 2 devices were spares in a zpool, but strangely seen as available. I removed these 2 spares from the zpool and then the 2 iscsi devices disappeared properly from the system. OK, not related then to the 10.3 bug above.
If I'm the only one interested in this bug report, it can then be closed, as I now run FreeBSD 11. Feel free then to re-open it if needed.
I got a similar issue on: FreeBSD filer1.prod.lan 10.3-RELEASE-p7 FreeBSD 10.3-RELEASE-p7 #0: Thu Aug 11 18:38:15 UTC 2016 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64 with: root@filer1:/etc/rc.d # sysctl -a|grep -i 'iscsi' kern.iscsi.fail_on_shutdown: 1 kern.iscsi.fail_on_disconnection: 1 kern.iscsi.maxtags: 255 kern.iscsi.login_timeout: 5 kern.iscsi.iscsid_timeout: 5 kern.iscsi.ping_timeout: 5 kern.iscsi.debug: 1 On the initiator side: WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): no ping reply (NOP-Out) after 5 seconds; dropping connection WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection error; dropping connection WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection error; dropping connection On the target side: WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting (da3:iscsi2:0:0:0): READ(10). CDB: 28 00 08 03 02 7a 00 00 01 00 (da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host (da3:(da2:iscsi1:0:0:0): READ(10). CDB: 28 00 02 d7 49 e7 00 00 20 00 iscsi2:0:(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host 0:(da2:0): iscsi1:0:Retrying command 0:0): Retrying command (da3:iscsi2:0:0:0): READ(10). CDB: 28 00 02 d7 4a e7 00 00 20 00 (da2:iscsi1:0:0:0): READ(10). CDB: 28 00 02 d7 4a 07 00 00 20 00 (da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da3:(da2:iscsi2:0:iscsi1:0:0:0:0): 0): Retrying command Retrying command da3 at iscsi2 bus 0 scbus4 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached da2 at iscsi1 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached (da3:iscsi2:0:0:0): Periph destroyed (da2:iscsi1:0:0:0): Periph destroyed da2 at iscsi2 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da2: Serial Number MYSERIAL 1 da2: 150.000MB/s transfers da2: Command Queueing enabled da2: 1840144MB (471076881 4096 byte sectors) da3 at iscsi1 bus 0 scbus4 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da3: Serial Number MYSERIAL 0 da3: 150.000MB/s transfers da3: Command Queueing enabled da3: 1840144MB (471076881 4096 byte sectors) WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting (da3:iscsi1:0:0:0): READ(10). CDB: 28 00 02 d7 4c a7 00 00 20 00 (da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi2:0:0:0): READ(10). CDB: 28 00 02 e0 f3 d4 00 00 04 00 (da3:(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host iscsi1:0:(da2:0:iscsi2:0:0): 0:Retrying command 0): Retrying command (da3:iscsi1:0:0:0): READ(10). CDB: 28 00 02 d7 4d 07 00 00 20 00 (da2:iscsi2:0:0:0): READ(10). CDB: 28 00 02 d7 4b c7 00 00 20 00 (da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host (da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command Retrying command da3 at iscsi1 bus 0 scbus4 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached da2 at iscsi2 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached (da2:iscsi2:0:0:0): Periph destroyed (da3:iscsi1:0:0:0): Periph destroyed da2 at iscsi1 bus 0 scbus4 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da2: Serial Number MYSERIAL 0 da2: 150.000MB/s transfers da2: Command Queueing enabled da2: 1840144MB (471076881 4096 byte sectors) da3 at iscsi2 bus 0 scbus3 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da3: Serial Number MYSERIAL 1 da3: 150.000MB/s transfers da3: Command Queueing enabled da3: 1840144MB (471076881 4096 byte sectors) WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting (da2:iscsi1:0:0:0): READ(6). CDB: 08 00 00 00 20 00 (da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi1:0:0:0): Retrying command (da2:iscsi1:0:0:0): READ(6). CDB: 08 00 00 10 02 00 (da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi1:0:0:0): Retrying command da2 at iscsi1 bus 0 scbus4 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached (da2:iscsi1:0:0:0): Periph destroyed da2 at iscsi1 bus 0 scbus4 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da2: Serial Number MYSERIAL 0 da2: 150.000MB/s transfers da2: Command Queueing enabled da2: 1840144MB (471076881 4096 byte sectors) WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting (da3:iscsi2:0:0:0): READ(6). CDB: 08 00 00 20 20 00 (da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host (da3:iscsi2:0:0:0): Retrying command da3 at iscsi2 bus 0 scbus3 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached (da3:iscsi2:0:0:0): Periph destroyed WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting (da2:iscsi1:0:0:0): READ(6). CDB: 08 00 00 00 20 00 (da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi1:0:0:0): Retrying command da2 at iscsi1 bus 0 scbus4 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached (da2:iscsi1:0:0:0): Periph destroyed da2 at iscsi1 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da2: Serial Number MYSERIAL 0 da2: 150.000MB/s transfers da2: Command Queueing enabled da2: 1840144MB (471076881 4096 byte sectors) WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): login timed out after 6 seconds; reconnecting da3 at iscsi2 bus 0 scbus4 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da3: Serial Number MYSERIAL 1 da3: 150.000MB/s transfers da3: Command Queueing enabled da3: 1840144MB (471076881 4096 byte sectors) WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting da2 at iscsi1 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached (da2:iscsi1:0:0:0): READ(10). CDB: 28 00 1c 14 10 0f 00 00 01 00 (da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi1:0:0:0): Error 5, Periph was invalidated WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting (da3:iscsi2:0:0:0): READ(10). CDB: 28 00 1c 14 10 0f 00 00 01 00 (da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host (da3:iscsi2:0:0:0): Retrying command da3 at iscsi2 bus 0 scbus4 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached (da2:iscsi1:0:0:0): Periph destroyed (da3:iscsi2:0:0:0): Periph destroyed da2 at iscsi1 bus 0 scbus4 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da2: Serial Number MYSERIAL 0 da2: 150.000MB/s transfers da2: Command Queueing enabled da2: 1840144MB (471076881 4096 byte sectors) da3 at iscsi2 bus 0 scbus3 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da3: Serial Number MYSERIAL 1 da3: 150.000MB/s transfers da3: Command Queueing enabled da3: 1840144MB (471076881 4096 byte sectors) WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): handoff on already connected session Oct 16 01:02:04 filer1 iscsid[88600]: 10.20.30.32 (iqn.2016-08.lan.prod:target1): ISCSIDHANDOFF: Device busy WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): connection error; reconnecting da3 at iscsi2 bus 0 scbus3 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached (da3:iscsi2:0:0:0): Periph destroyed da3 at iscsi2 bus 0 scbus3 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da3: Serial Number MYSERIAL 1 da3: 150.000MB/s transfers da3: Command Queueing enabled da3: 1840144MB (471076881 4096 byte sectors) WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting (da2:iscsi1:0:0:0): READ(6). CDB: 08 00 00 20 20 00 (da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi1:0:0:0): Retrying command da2 at iscsi1 bus 0 scbus4 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached (da2:iscsi1:0:0:0): Periph destroyed WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting (da3:iscsi2:0:0:0): READ(6). CDB: 08 00 00 20 20 00 (da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host (da3:iscsi2:0:0:0): Retrying command da3 at iscsi2 bus 0 scbus3 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached (da3:iscsi2:0:0:0): Periph destroyed da2 at iscsi2 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da2: Serial Number MYSERIAL 1 da2: 150.000MB/s transfers da2: Command Queueing enabled da2: 1840144MB (471076881 4096 byte sectors) da3 at iscsi1 bus 0 scbus4 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da3: Serial Number MYSERIAL 0 da3: 150.000MB/s transfers da3: Command Queueing enabled da3: 1840144MB (471076881 4096 byte sectors) WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): connection error; reconnecting WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): connection error; reconnecting da3 at iscsi1 bus 0 scbus4 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached da2 at iscsi2 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached (da3:iscsi1:0:0:0): Periph destroyed (da2:iscsi2:0:0:0): Periph destroyed da2 at iscsi2 bus 0 scbus4 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da2: Serial Number MYSERIAL 1 da2: 150.000MB/s transfers da2: Command Queueing enabled da2: 1840144MB (471076881 4096 byte sectors) da3 at iscsi1 bus 0 scbus3 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da3: Serial Number MYSERIAL 0 da3: 150.000MB/s transfers da3: Command Queueing enabled da3: 1840144MB (471076881 4096 byte sectors)
Some more info: on filer2: root@filer2:/home/jcigar # cat /etc/ctl.conf auth-group ag0 { chap xxx xxx } portal-group pg0 { discovery-auth-group no-authentication listen 10.20.30.32 } target iqn.2016-08.lan.prod:target0 { auth-group ag0 portal-group pg0 lun 0 { blocksize 4096 path /dev/gpt/f2data0 } } target iqn.2016-08.lan.prod:target1 { auth-group ag0 portal-group pg0 lun 0 { blocksize 4096 path /dev/gpt/f2data1 } } root@filer2:/home/jcigar # ifconfig bge1 bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000 options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE> ether d0:bf:9c:3a:b7:25 inet 10.20.30.32 netmask 0xffffff00 broadcast 10.20.30.255 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect (1000baseT <full-duplex>) status: active root@filer2:/home/jcigar # pciconf -lv bge1 bge1@pci0:2:0:1: class=0x020000 card=0x22e8103c chip=0x165f14e4 rev=0x00 hdr=0x00 vendor = 'Broadcom Corporation' device = 'NetXtreme BCM5720 Gigabit Ethernet PCIe' class = network subclass = ethernet root@filer2:/home/jcigar # grep 'bge1' /var/run/dmesg.boot bge1: <Broadcom BCM5720 A0, ASIC rev. 0x5720000> mem 0x92b00000-0x92b0ffff,0x92b10000-0x92b1ffff,0x92b20000-0x92b2ffff at device 0.1 on pci2 bge1: APE FW version: NCSI v1.3.12.0 bge1: CHIP ID 0x05720000; ASIC REV 0x5720; CHIP REV 0x57200; PCI-E miibus1: <MII bus> on bge1 bge1: Using defaults for TSO: 65518/35/2048 bge1: Ethernet address: d0:bf:9c:3a:b7:25 bge1: link state changed to DOWN bge1: link state changed to UP on filer1.prod.lan: root@filer1:/home/jcigar # cat /etc/iscsi.conf t0 { TargetAddress = 10.20.30.32 TargetName = iqn.2016-08.lan.prod:target0 AuthMethod = CHAP chapIName = xxx chapSecret = xxx } t1 { TargetAddress = 10.20.30.32 TargetName = iqn.2016-08.lan.prod:target1 AuthMethod = CHAP chapIName = xxx chapSecret = xxx } root@filer1:/home/jcigar # ifconfig bge1 bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000 options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE> ether d0:bf:9c:3a:c6:e1 inet 10.20.30.31 netmask 0xffffff00 broadcast 10.20.30.255 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect (1000baseT <full-duplex,master>) status: active root@filer1:/home/jcigar # pciconf -lv bge1 bge1@pci0:2:0:1: class=0x020000 card=0x22e8103c chip=0x165f14e4 rev=0x00 hdr=0x00 vendor = 'Broadcom Corporation' device = 'NetXtreme BCM5720 Gigabit Ethernet PCIe' class = network subclass = ethernet root@filer1:/home/jcigar # grep 'bge1' /var/run/dmesg.boot bge1: <Broadcom BCM5720 A0, ASIC rev. 0x5720000> mem 0x92b00000-0x92b0ffff,0x92b10000-0x92b1ffff,0x92b20000-0x92b2ffff at device 0.1 on pci2 bge1: APE FW version: NCSI v1.3.12.0 bge1: CHIP ID 0x05720000; ASIC REV 0x5720; CHIP REV 0x57200; PCI-E miibus1: <MII bus> on bge1 bge1: Using defaults for TSO: 65518/35/2048 bge1: Ethernet address: d0:bf:9c:3a:c6:e1 bge1: link state changed to DOWN bge1: link state changed to UP
The "WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): no ping reply (NOP-Out) after 5 seconds; dropping connection" and "WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): connection error; reconnecting" messages suggest a network problem.
This is also what I thought at first, but there is no error messages related to the network interfaces in the logs, and all Ierrs, Idrop, and Coll counters are zeros. There is a dedicated interface for the iSCSI traffic which I turned on/off quite a lot to test a failover script, and the symptoms are exactly the ones described by Ben Rubson in the post above (altough I only have 4 disks in total). One more thing, I'm managing everything with SaltStack and in some cases ctld on the target and iscsid on the initiator are restarted at the same time, could it be because of that? I forgot to paste my rc.conf filers, but I don't think there is something wrong ..: root@filer1:/home/jcigar # cat /etc/rc.conf dumpdev="AUTO" ifconfig_bge0="DHCP" ip6addrctl_policy="ipv4_prefer" ipv6_network_interfaces="none" keymap="be.iso" mountd_enable="YES" mountd_flags="-r -S -h 192.168.10.15" nfs_server_enable="YES" nfs_server_flags="-t -h 192.168.10.15" nfsuserd_enable="YES" nfsuserd_flags="-domain prod.lan" nfsv4_server_enable="YES" ntpdate_flags="-b -4" ntpdate_hosts="ntp.belnet.be" salt_minion_enable="YES" sendmail_enable="NO" sendmail_msp_queue_enable="NO" sendmail_outbound_enable="NO" sendmail_submit_enable="NO" zfs_enable="YES" hostname="filer1.prod.lan" ifconfig_bge1="inet 10.20.30.31 netmask 255.255.255.0 mtu 9000" ifconfig_bge0_alias0="inet vhid 54 advskew 50 pass xxx alias 192.168.10.15/32" sshd_enable="YES" ntpdate_enable="YES" iscsid_enable="YES" iscsictl_enable="YES" iscsictl_flags="-Aa" root@filer2:/home/jcigar # cat /etc/rc.conf dumpdev="AUTO" ifconfig_bge0="DHCP" ip6addrctl_policy="ipv4_prefer" ipv6_network_interfaces="none" keymap="be.iso" ntpdate_flags="-b -4" ntpdate_hosts="ntp.belnet.be" salt_minion_enable="YES" sendmail_enable="NO" sendmail_msp_queue_enable="NO" sendmail_outbound_enable="NO" sendmail_submit_enable="NO" zfs_enable="YES" hostname="filer2.prod.lan" ifconfig_bge1="inet 10.20.30.32 netmask 255.255.255.0 mtu 9000" ifconfig_bge0_alias0="inet vhid 54 advskew 100 pass xxx alias 192.168.10.15/32" sshd_enable="YES" ntpdate_enable="YES" ctld_enable="YES"
I re-open this bug for Julien.
Thank you Ben. Can you confirm (once again) that you haven't had this problem anymore in 11.0-RELEASE? If so I could just upgrade then .. :)
I switched to 11 at the time of 11-RC1 as I did not manage to reproduce the bug with it. I did not try to explicitly reproduce it with RC2, RC3, 11.0-RELEASE and 11.0-RELEASE-p1, but it did not occur by itself in these versions :) Sounds like this bug only affects 10.3.
Problem appeared again today, after ~15 days of uptime, always on FreeBSD filer1.prod.lan 10.3-RELEASE-p11 FreeBSD 10.3-RELEASE-p11 #0: Mon Oct 24 18:49:24 UTC 2016 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64 WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting (da3:iscsi1:0:0:0): READ(10). CDB: 28 00 01 ef ec 90 00 00 01 00 (da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi2:0:0:0): READ(10). CDB: 28 00 01 ef ec 8e 00 00 01 00 (da3:(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host iscsi1:0:(da2:0:iscsi2:0:0): 0:Retrying command 0): Retrying command da3 at iscsi1 bus 0 scbus4 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached da2 at iscsi2 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached (da2:iscsi2:0:0:0): Periph destroyed (da3:iscsi1:0:0:0): Periph destroyed da2 at iscsi2 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da2: Serial Number MYSERIAL 1 da2: 150.000MB/s transfers da2: Command Queueing enabled da2: 1840144MB (471076881 4096 byte sectors) WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting (da2:iscsi2:0:0:0): READ(10). CDB: 28 00 1c 14 10 0f 00 00 01 00 (da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi2:0:0:0): Retrying command da2 at iscsi2 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached (da2:iscsi2:0:0:0): Periph destroyed da2 at iscsi2 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da2: Serial Number MYSERIAL 1 da2: 150.000MB/s transfers da2: Command Queueing enabled da2: 1840144MB (471076881 4096 byte sectors) WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): login timed out after 6 seconds; reconnecting da3 at iscsi1 bus 0 scbus4 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da3: Serial Number MYSERIAL 0 da3: 150.000MB/s transfers da3: Command Queueing enabled da3: 1840144MB (471076881 4096 byte sectors) WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting (da2:iscsi2:0:0:0): READ(10). CDB: 28 00 1c 14 10 0f 00 00 01 00 (da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi2:0:0:0): Retrying command da2 at iscsi2 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting (da2:iscsi2:0:0:0): Periph destroyed da2 at iscsi2 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da2: Serial Number MYSERIAL 1 da2: 150.000MB/s transfers da2: Command Queueing enabled da2: 1840144MB (471076881 4096 byte sectors) WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): handoff on already connected session WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): connection error; reconnecting da3 at iscsi1 bus 0 scbus4 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached (da3:iscsi1:0:0:0): Periph destroyed da3 at iscsi1 bus 0 scbus4 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da3: Serial Number MYSERIAL 0 da3: 150.000MB/s transfers da3: Command Queueing enabled da3: 1840144MB (471076881 4096 byte sectors) After a zpool online, and with vfs.zfs.scrub_delay = 0 and vfs.zfs.resilver_delay = 0 I issued a zpool scrub and again I had a timeout: WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting (da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 9a 67 00 00 01 00 (da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 98 3c 00 00 01 00 (da3:(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host iscsi1:0:(da2:0:iscsi2:0:0): 0:Retrying command 0): Retrying command (da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 9f e4 00 00 01 00 (da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 a8 cc 00 00 01 00 (da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host (da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command Retrying command (da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 a2 42 00 00 01 00 (da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 95 6e 00 00 20 00 (da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host (da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command Retrying command (da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 96 4e 00 00 20 00 (da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 95 8e 00 00 20 00 (da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host (da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command Retrying command (da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 96 6e 00 00 20 00 (da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 a9 b2 00 00 01 00 (da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host (da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host (da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command Retrying command da3 at iscsi1 bus 0 scbus4 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached da2 at iscsi2 bus 0 scbus3 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached (da3:iscsi1:0:0:0): Periph destroyed (da2:iscsi2:0:0:0): Periph destroyed da2 at iscsi1 bus 0 scbus4 target 0 lun 0 da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da2: Serial Number MYSERIAL 0 da2: 150.000MB/s transfers da2: Command Queueing enabled da2: 1840144MB (471076881 4096 byte sectors) da3 at iscsi2 bus 0 scbus3 target 0 lun 0 da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device da3: Serial Number MYSERIAL 1 da3: 150.000MB/s transfers da3: Command Queueing enabled da3: 1840144MB (471076881 4096 byte sectors) I've raised those timeouts a little bit: kern.iscsi.login_timeout: 30 kern.iscsi.iscsid_timeout: 30 kern.iscsi.ping_timeout: 30 and see if it makes any difference
this is what I have on the target: root@filer2:/home/jcigar # sysctl -a|grep -i 'iscsi' kern.iscsi.fail_on_shutdown: 1 kern.iscsi.fail_on_disconnection: 1 kern.iscsi.maxtags: 255 kern.iscsi.login_timeout: 30 kern.iscsi.iscsid_timeout: 30 kern.iscsi.ping_timeout: 30 kern.iscsi.debug: 1 kern.cam.ctl.iscsi.maxcmdsn_delta: 256 kern.cam.ctl.iscsi.login_timeout: 60 kern.cam.ctl.iscsi.ping_timeout: 5 kern.cam.ctl.iscsi.debug: 1 and the following in the error logs: WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection error; dropping connection WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection error; dropping connection WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection error; dropping connection
Also I noticed some Idrop packets on the iscsi interface of the initiator side: root@filer1:/home/jcigar # netstat -I bge1 Name Mtu Network Address Ipkts Ierrs Idrop Opkts Oerrs Coll bge1 9000 <Link#2> d0:bf:9c:3a:c6:e1 77396222 0 107 45800806 0 0 bge1 - 10.20.30.0 10.20.30.31 77394879 - - 73726273 - - and: root@filer1:/home/jcigar # vmstat -z | grep -i 'mbuf_jumbo_9k:' mbuf_jumbo_9k: 9216, 74075, 256, 161,51922039, 214, 0 could it be related to jumbo frames? (on the target side counters are 0)
I was also using jumbo frames with 10.3, but unfortunately I did not look at dropped packets when I had the issues. Note that with 11, I still use jumbo frames and never encounter this issue again.
Yep it's strange, I will definitively upgrade to 11 when possible .. Other infos: root@filer1:/home/jcigar # netstat -m 1285/6575/7860 mbufs in use (current/cache/total) 1024/3526/4550/500010 mbuf clusters in use (current/cache/total/max) 1024/3178 mbuf+clusters out of packet secondary zone in use (current/cache) 0/267/267/250005 4k (page size) jumbo clusters in use (current/cache/total/max) 256/181/437/74075 9k jumbo clusters in use (current/cache/total/max) 0/0/0/41667 16k jumbo clusters in use (current/cache/total/max) 4673K/11392K/16066K bytes allocated to network (current/cache/total) 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters) 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters) 0/0/0 requests for jumbo clusters delayed (4k/9k/16k) 0/214/0 requests for jumbo clusters denied (4k/9k/16k) 0 requests for sfbufs denied 0 requests for sfbufs delayed 0 requests for I/O initiated by sendfile root@filer1:/home/jcigar # sysctl kern.ipc | fgrep nmb kern.ipc.nmbufs: 3200070 kern.ipc.nmbjumbo16: 166668 kern.ipc.nmbjumbo9: 222225 kern.ipc.nmbjumbop: 250005 kern.ipc.nmbclusters: 500010
Ok.. I definitively think it's an issue with jumbo frames, I got another disconnection and again the "requests for jumbo clusters denied" (9k) counter increased: 0/304/0 requests for jumbo clusters denied (4k/9k/16k) any idea?
Any mention of nmbjumbo9 in /var/log/messages? Specifically do you see: kern.ipc.nmbjumbo9 limit reached If so have you tried bumping kern.ipc.nmbjumbo9?
(In reply to Steven Hartland from comment #24) I don't have any mention of nmbjumbo9 in /var/log/messages, I haven't raised kern.ipc.nmbjumbo9 yet, currently it's: root@filer1:/var/log # sysctl kern.ipc.nmbjumbo9 kern.ipc.nmbjumbo9: 222225 but it does not seem necessary to raise it if I can trust netstat -m: root@filer1:/var/log # netstat -m 1282/6533/7815 mbufs in use (current/cache/total) 1024/3564/4588/500010 mbuf clusters in use (current/cache/total/max) 1024/3431 mbuf+clusters out of packet secondary zone in use (current/cache) 0/161/161/250005 4k (page size) jumbo clusters in use (current/cache/total/max) 256/120/376/74075 9k jumbo clusters in use (current/cache/total/max) 0/0/0/41667 16k jumbo clusters in use (current/cache/total/max) 4672K/10485K/15157K bytes allocated to network (current/cache/total) 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters) 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters) 0/0/0 requests for jumbo clusters delayed (4k/9k/16k) 0/304/0 requests for jumbo clusters denied (4k/9k/16k) 0 requests for sfbufs denied 0 requests for sfbufs delayed 0 requests for I/O initiated by sendfile (the box isn't heavily loaded) what I don't understand if why there are requests for 4k jumbo clusters denied as there are plenty free .. ? could it be related to this change: https://svnweb.freebsd.org/base/head/sys/dev/bge/if_bge.c?r1=276299&r2=276750 ?
I don't know if it's related but I also noticed some incorrect cksum: 00:14:03.380734 IP (tos 0x0, ttl 64, id 19369, offset 0, flags [DF], proto TCP (6), length 100, bad cksum 0 (->9e84)!) 10.20.30.31.24995 > 10.20.30.32.iscsi-target: Flags [P.], cksum 0x50bd (incorrect -> 0xfff9), seq 1205926059:1205926107, ack 1518444144, win 16384, options [nop,nop,TS val 1388742470 ecr 2474468733], length 48 00:14:03.380742 IP (tos 0x0, ttl 64, id 19370, offset 0, flags [DF], proto TCP (6), length 100, bad cksum 0 (->9e83)!) 10.20.30.31.55645 > 10.20.30.32.iscsi-target: Flags [P.], cksum 0x50bd (incorrect -> 0x3912), seq 3104128225:3104128273, ack 1948017382, win 16384, options [nop,nop,TS val 1388742470 ecr 4001288677], length 48 00:14:03.381060 IP (tos 0x0, ttl 64, id 54596, offset 0, flags [DF], proto TCP (6), length 100) 10.20.30.32.iscsi-target > 10.20.30.31.55645: Flags [P.], cksum 0x4c6a (correct), seq 1:49, ack 48, win 16384, options [nop,nop,TS val 4001289988 ecr 1388742470], length 48 00:14:03.381070 IP (tos 0x0, ttl 64, id 54597, offset 0, flags [DF], proto TCP (6), length 100) 10.20.30.32.iscsi-target > 10.20.30.31.24995: Flags [P.], cksum 0x0d62 (correct), seq 1:49, ack 48, win 16384, options [nop,nop,TS val 2474470019 ecr 1388742470], length 48 ... but I guess it's because TSO is enabled on the interfaces ?
checksums a likely just due to hw checksum offloading.
I'm assuming vmstat agrees with netstat -m e.g. vmstat -z | head -n 1; vmstat -z | grep mbuf It could be that the disconnect is the cause of mbuf issue and not the other way round. You'd have to catch it when the issue is occurring to confirm. If you have plenty of ram you could just trying doubling kern.ipc.nmbjumbo9 to see if the problem goes away?
If its dropping in the driver due to failure go get a jumbo mbuf you should see this in the output netstat -i.
Y(In reply to Steven Hartland from comment #29) Yes it seems so: root@filer1:/var/log # vmstat -z | head -n 1; vmstat -z | grep mbuf ITEM SIZE LIMIT USED FREE REQ FAIL SLEEP mbuf_packet: 256, 3200070, 1024, 3431,94881716, 0, 0 mbuf: 256, 3200070, 257, 3103,566359041, 0, 0 mbuf_cluster: 2048, 500010, 4455, 133,15367798, 0, 0 mbuf_jumbo_page: 4096, 250005, 0, 161,102798544, 0, 0 mbuf_jumbo_9k: 9216, 74075, 256, 120,52036055, 304, 0 mbuf_jumbo_16k: 16384, 41667, 0, 0, 0, 0, 0 mbuf_ext_refcnt: 4, 0, 0, 0, 0, 0, 0 > You'd have to catch it when the issue is occurring to confirm not so easy :) but every time I have a WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 30 seconds; reconnecting I have a small increase in the "requests for 9k jumbo clusters denied" counter > If you have plenty of ram you could just trying doubling kern.ipc.nmbjumbo9 to see if the problem goes away? The machine has only 8GB of RAM (and ZFS), I think I'll first test with a 1500 MTU and -tso
(In reply to Steven Hartland from comment #30) Yes I already checked the error counters, but it's 0: root@filer1:/var/log # netstat -I bge1 Name Mtu Network Address Ipkts Ierrs Idrop Opkts Oerrs Coll bge1 9000 <Link#2> d0:bf:9c:3a:c6:e1 77755195 0 152 46072898 0 0 bge1 - 10.20.30.0 10.20.30.31 77753810 - - 74150354 - - ... so I guess it's not a driver issue
an update on this: I turned off jumbo frames and TSO on the replication interface and the issue hasn't been repeated until now...! filer1% uptime 10:15AM up 18 days, 13:35, 1 user, load averages: 0.07, 0.09, 0.08 filer1% vmstat -z|head -1;vmstat -z|grep -i 'mbuf' ITEM SIZE LIMIT USED FREE REQ FAIL SLEEP mbuf_packet: 256, 3200070, 1024, 3403,527205862, 0, 0 mbuf: 256, 3200070, 6, 2752,456285246, 0, 0 mbuf_cluster: 2048, 500010, 4427, 355,18287308, 0, 0 mbuf_jumbo_page: 4096, 250005, 0, 71, 2682330, 0, 0 mbuf_jumbo_9k: 9216, 74075, 0, 0, 0, 0, 0 mbuf_jumbo_16k: 16384, 41667, 0, 0, 0, 0, 0 mbuf_ext_refcnt: 4, 0, 0, 0, 0, 0, 0 filer1% netstat -m 1031/6154/7185 mbufs in use (current/cache/total) 1024/3758/4782/500010 mbuf clusters in use (current/cache/total/max) 1024/3403 mbuf+clusters out of packet secondary zone in use (current/cache) 0/71/71/250005 4k (page size) jumbo clusters in use (current/cache/total/max) 0/0/0/74075 9k jumbo clusters in use (current/cache/total/max) 0/0/0/41667 16k jumbo clusters in use (current/cache/total/max) 2305K/9338K/11644K bytes allocated to network (current/cache/total) 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters) 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters) 0/0/0 requests for jumbo clusters delayed (4k/9k/16k) 0/0/0 requests for jumbo clusters denied (4k/9k/16k) 0 requests for sfbufs denied 0 requests for sfbufs delayed 0 requests for I/O initiated by sendfile filer1% netstat -I bge1 Name Mtu Network Address Ipkts Ierrs Idrop Opkts Oerrs Coll bge1 1500 <Link#2> xx:xx:xx:xx:xx:xx 497577171 0 0 262946639 0 0 bge1 - 10.20.30.0 10.20.30.31 497575590 - - 262942638 - - filer1% ifconfig bge1 bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500 options=c009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,VLAN_HWTSO,LINKSTATE> ether xx:xx:xx:xx:xx:xx inet 10.20.30.31 netmask 0xffffff00 broadcast 10.20.30.255 nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL> media: Ethernet autoselect (1000baseT <full-duplex,master>) status: active (I'll let you know if the problem occurs again)
No problems for more than a month now.. I tend to conclude that the MTU 9000 was the root cause of this problem (?)
Did you try to increase some sysctls with jumbo frames on, as proposed by Steven ?
(In reply to Ben RUBSON from comment #35) yes, .. but only slightly as the machine has only 8GB of RAM
And no improvements ?
(In reply to Ben RUBSON from comment #37) it only delayed the problem, .. but it would be interesting if you could test on your side..!
Strange, as if mbufs were "consumed" but not released. And when devices are disconnected, can you reconnect to them properly ? Or do you need to change their target name or to reboot, as explained in my first message of this bug report ?
I had to reboot..
OK, so same issue as mine (for which I opened this bug report), with 10.3. Which I was not able to reproduce with 11. Any plan to upgrade to 11 to be safe ?
also, looks similar to bug #215727
(In reply to Ben RUBSON from comment #41) as soon as I have a little time .. :)
From what I see, it's a networking problem, not a bug in iSCSI.