Bug 211990

Summary:	iscsi fails to reconnect and does not release devices
Product:	Base System	Reporter:	Ben RUBSON <ben.rubson>
Component:	kern	Assignee:	freebsd-bugs (Nobody) <bugs>
Status:	Closed Works As Intended
Severity:	Affects Some People	CC:	julien, scsi, smh, trasz
Priority:	---
Version:	10.3-RELEASE
Hardware:	Any
OS:	Any

Description Ben RUBSON 2016-08-19 09:41:14 UTC

Hello,

I'm facing an issue where iscsictl does not want to remove devices.
Here is how I can reproduce this.

### Initiator :
# iscsictl -Aa
iscsictl then reports the 17 targets as connected, perfect.

### Target :
Let's switch down the network interface
# ifconfig mlxen1 down

### Initiator :
iscsictl reports the 17 targets as disconnected, perfect.

### Target :
Let's switch up the network interface
# ifconfig mlxen1 up

### Initiator :
iscsictl reports the 17 targets as connected, however, for 4 devices, I get the following :
09:59:43 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): timed out waiting for iscsid(8) for 11 seconds; reconnecting
09:59:54 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): timed out waiting for iscsid(8) for 11 seconds; reconnecting
09:59:57 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): handoff on already connected session
07:59:57 srv1 iscsid[1372]: 192.168.2.2 (iqn.2012-06.srv2:lg2): ISCSIDHANDOFF: Device busy
09:59:57 srv1 iscsid[581]: child process 1372 terminated with exit status 1
09:59:57 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): connection error; reconnecting
09:59:57 srv1 kernel: (da21:iscsi8:0:0:0): got CAM status 0x8
09:59:57 srv1 kernel: (da21:iscsi8:0:0:0): fatal error, failed to attach to device
10:00:07 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): no ping reply (NOP-In) after 10 seconds; reconnecting
10:00:08 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): no ping reply (NOP-In) after 10 seconds; reconnecting

### Target :
09:58:50 srv2 kernel: mlxen1: link state changed to DOWN
09:58:50 srv2 kernel: mlx4_en: mlxen1: Link Down
09:58:53 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1): no ping reply (NOP-Out) after 5 seconds; dropping connection
09:58:53 srv2 last message repeated 16 times
09:59:49 srv2 kernel: mlx4_en: mlxen1: Link Up
09:59:49 srv2 kernel: mlxen1: link state changed to UP
09:59:49 srv2 devd: Executing '/etc/rc.d/dhclient quietstart mlxen1'
09:59:59 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1): connection error; dropping connection
09:59:59 srv2 last message repeated 3 times

### Initiator :
# iscsictl -Ra
# iscsictl -L
Target name            Target portal    State
iqn.2012-06.srv2:sW1   192.168.2.2      Connected: da18 
iqn.2012-06.srv2:sW2   192.168.2.2      Connected: da23 
iqn.2012-06.srv2:rT3   192.168.2.2      Connected: da17 
iqn.2012-06.srv2:lg2   192.168.2.2      Connected: da21 

As you can see, the 4 problematic devices remain "connected", nodes exist in /dev/, but they are unusable.

Each time I "iscsictl -Ra", I get the following on initiator side :
10:09:35 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): connection error; reconnecting
10:09:35 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:sW1): connection error; reconnecting
10:09:35 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:rT3): connection error; reconnecting
10:09:35 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:sW2): connection error; reconnecting

No logs however on target side, even if I start ctld with -d.

The only workaround I found is to reboot, or to change the target name to properly reconnect...

# uname -v
FreeBSD 10.3-RELEASE-p7 #0: Thu Aug 11 18:38:15 UTC 2016 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC 

Thank you for your support,

Best regards,

Ben

Comment 1 Ben RUBSON 2016-08-19 10:04:57 UTC

On initiator, netstat also returns the following :
tcp4       0      0 192.168.2.1.49715      192.168.2.2.3260       CLOSED
tcp4       0      0 192.168.2.1.49708      192.168.2.2.3260       CLOSED
tcp4       0      0 192.168.2.1.49707      192.168.2.2.3260       CLOSED
tcp4       0      0 192.168.2.1.49706      192.168.2.2.3260       CLOSED

Comment 2 Ben RUBSON 2016-08-19 12:16:02 UTC

Quite difficult to reproduce, but I managed to do it again.
This time the problematic target (only one /17) is seen as disconnected.

### Initiator :
# iscsictl -Ra
# iscsictl -L
Target name                          Target portal    State
iqn.2012-06.srv2:ch3                 192.168.2.2      Disconnected
# netstat -an | grep 3260
// nothing returned

### Initiator logs :
12:38:02 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:ch3): login timed out after 11 seconds; reconnecting
12:38:06 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:ch3): handoff on already connected session
12:38:06 srv1 iscsid[7395]: 192.168.2.2 (iqn.2012-06.srv2:ch3): ISCSIDHANDOFF: Device busy
12:38:06 srv1 iscsid[581]: child process 7395 terminated with exit status 1
12:38:06 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:ch3): connection error; reconnecting
12:38:06 srv1 kernel: da8 at iscsi63 bus 0 scbus1 target 0 lun 0
12:38:06 srv1 kernel: da8: <FREEBSD CTLDISK 0001> s/n G22020PY1EZHAch detached
12:38:16 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:ch3): no ping reply (NOP-In) after 10 seconds; reconnecting

### Target logs :
12:37:41 srv2 kernel: mlxen1: link state changed to DOWN
12:37:41 srv2 kernel: mlx4_en: mlxen1: Link Down
12:37:46 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1): no ping reply (NOP-Out) after 5 seconds; dropping connection
12:37:46 srv2 last message repeated 16 times
12:38:05 srv2 kernel: mlx4_en: mlxen1: Link Up
12:38:05 srv2 kernel: mlxen1: link state changed to UP
12:38:05 srv2 devd: Executing '/etc/rc.d/dhclient quietstart mlxen1'
12:38:09 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1): connection error; dropping connection

On initiator side, I also get the same log messages as above for the other devices, but they correctly reconnect.

Comment 3 Ben RUBSON 2016-08-19 14:36:22 UTC

(In reply to Ben RUBSON from comment #0)
>The only workaround I found is to reboot

I even have to hard-reboot, because of this never ending :
iscsi_terminate_sessions: waiting for sessions to terminate

Comment 4 Ben RUBSON 2016-08-19 22:19:35 UTC

One strange thing I noticed.
(I put all things that could be interesting from my troubleshooting)

As soon as I put the network interface down, I get the following message on target side, one per target :
17:01:00 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1): no ping reply (NOP-Out) after 5 seconds; dropping connection

Then, on initiator side, I get these messages for each target :
Aug 19 17:01:07 srv1 kernel: iscsi_maintenance_thread_reconnect: 192.168.2.2 (iqn.2012-06.srv2:hm4): connection failed, destroying devices
Aug 19 17:01:07 srv1 kernel: iscsi_session_cleanup: 192.168.2.2 (iqn.2012-06.srv2:hm4): freezing
Aug 19 17:01:07 srv1 kernel: iscsi_session_cleanup: 192.168.2.2 (iqn.2012-06.srv2:hm4): deregistering SIM

At this moment, on initiator side, one iscsid process per target appears.

10 seconds later, on initiator side, I get these messages for each target :
Aug 19 17:01:18 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:hm4): login timed out after 11 seconds; reconnecting
Aug 19 17:01:18 srv1 kernel: iscsi_maintenance_thread_reconnect: 192.168.2.2 (iqn.2012-06.srv2:hm4): connection failed, destroying devices

And at the same time, a second iscsid process per target appears, so that I get 2 iscsid processes per target :
# ps auxxw | grep iscsid:
root  866    0.0  0.0 16632  2144  -  I     4:58pm   0:00.00 iscsid: 192.168.2.2 (iqn.2012-06.srv2:hm4) (iscsid)
root  881    0.0  0.0 16632  2144  -  I     4:58pm   0:00.00 iscsid: 192.168.2.2 (iqn.2012-06.srv2:hm4) (iscsid)
(...)
However sounds like there is a limit to 30 processes, as for 17 targets I would have expected 34 processes, but I only get 30.

If I put the NIC up before the second process is created, I only get one reconnection message per target in target logs.
If I put the NIC up after the second process is created, I get a lot more reconnection messages in target logs, between 40 and 50 for 17 targets.

Do we expect these additional processes ?
I think we would only expect one process / one reconnection message per target ?
Seems strange to have all these "duplicated" connection retries.

Another related question to the "30" processes found :
Is there any limit to 30 targets ?
I found a maxproc option in ctl.conf (default to 30) but I don't exactly know what it means (I tested values of 1 to 50 without seeing any change).
No option found however on initiator side.

I noticed that we can reproduce this bug easier when we "stress" the devices : disconnect network as soon as targets are reconnected, and reconnect it as soon as they are disconnected.



Additionally to this, I had 8 kernel crashes, initator or target, each time with the same address / pointer :
kernel: Fatal trap 12: page fault while in kernel mode
kernel: fault virtual address	= 0x1e8
kernel: instruction pointer	= 0x20:0xffffffff80936933

I also got a stacktrace, but did not get it's pointer address.
http://img4.hostingpics.net/pics/707217211990.png

I'm also trying to get a full dump.
However I'm not sure this kernel crash issue is related to the reconnection issue, perhaps there are 2 issues.

# uname -v
FreeBSD 10.3-RELEASE-p7 #0: Thu Aug 11 18:38:15 UTC 2016 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC



A lot of info !
I hope we will be able to correct these issues.

Many thanks,

Ben

Comment 5 Ben RUBSON 2016-08-20 17:29:52 UTC

I made further troubleshooting.

Regarding the number of iscsid processes which increases, I've found timeout settings which permit to always have only one process per target :
sysctl kern.iscsi.login_timeout=85
sysctl kern.iscsi.iscsid_timeout=5
sysctl kern.iscsi.ping_timeout=5
These are based on net.inet.tcp.keepinit which is 75000 by default (75 seconds).

Regarding the "30" processes limit, it can be tuned with the "-d" option to iscsid (through /etc/rc.d/iscsid for example).

And the most important, regarding the bug :
when devices do not want to correctly reconnect, I found that it is because iscsi is stuck in the following :
cam_sim_free(is->is_sim, TRUE /*free_devq*/);
However I don't know why.

Comment 6 Ben RUBSON 2016-08-21 10:27:03 UTC

(In reply to Ben RUBSON from comment #5)
>And the most important, regarding the bug :
>when devices do not want to correctly reconnect, I found that it is because
>iscsi is stuck in the following :
>cam_sim_free(is->is_sim, TRUE /*free_devq*/);
>However I don't know why.
I put some printf in cam_sim.c and found that iscsi is stuck by :
msleep(sim, sim->mtx, PRIBIO, "simfree", 0);
cam_sim_release() is never fired, so msleep is blocking iscsi.

Comment 7 Ben RUBSON 2016-08-21 20:14:57 UTC

I tried 11-RC1 and was not able to reproduce the issue.

With 10.3, after about 10 NIC toggles, some devices fail.
With 11-RC1, more than one thousand toggles later, all my devices are still reconnecting correctly.

Comment 8 Ben RUBSON 2016-09-06 08:48:08 UTC

I'm facing this issue in 11-RC2 where I have 2 iscsi devices which are disconnected, don't want to reconnect, and I can't manage to remove them :

# iscsictl 
Target name                          Target portal    State
iqn.2012-06.g2:rT3                   192.168.2.2      Disconnected
iqn.2012-06.g2:ch1                   192.168.2.2      Disconnected

# iscsictl -Ra

# iscsictl -v
Session ID:       6
Initiator name:   iqn.2012-06.b2g1
Initiator portal: 
Initiator alias:  
Target name:      iqn.2012-06.g2:rT3
Target portal:    192.168.2.2
Target alias:     G22020PY1EZHArT
User:             
Secret:           
Mutual user:      
Mutual secret:    
Session type:     Normal
Enable:           Yes
Session state:    Disconnected
Failure reason:   
Header digest:    None
Data digest:      None
DataSegmentLen:   131072
MaxBurstLen:      262144
FirstBurstLen:    131072
ImmediateData:    Yes
iSER (RDMA):      No
Offload driver:   None
Device nodes:     da11 

Session ID:       10
Initiator name:   iqn.2012-06.b2g1
Initiator portal: 
Initiator alias:  
Target name:      iqn.2012-06.g2:ch1
Target portal:    192.168.2.2
Target alias:     G22000PY1EVVAch
User:             
Secret:           
Mutual user:      
Mutual secret:    
Session type:     Normal
Enable:           Yes
Session state:    Disconnected
Failure reason:   
Header digest:    None
Data digest:      None
DataSegmentLen:   131072
MaxBurstLen:      262144
FirstBurstLen:    131072
ImmediateData:    Yes
iSER (RDMA):      No
Offload driver:   None
Device nodes:     da12 

# ls -l /dev/da11 /dev/da12
ls: /dev/da11: No such file or directory
ls: /dev/da12: No such file or directory

Any help please ?

Thank you very much,

Ben

Comment 9 Ben RUBSON 2016-09-06 08:56:59 UTC

Mmmmhh these 2 devices were spares in a zpool, but strangely seen as available.
I removed these 2 spares from the zpool and then the 2 iscsi devices disappeared properly from the system.
OK, not related then to the 10.3 bug above.

Comment 10 Ben RUBSON 2016-09-07 09:55:42 UTC

If I'm the only one interested in this bug report, it can then be closed, as I now run FreeBSD 11.
Feel free then to re-open it if needed.

Comment 11 Julien Cigar 2016-10-17 09:18:12 UTC

I got a similar issue on:

FreeBSD filer1.prod.lan 10.3-RELEASE-p7 FreeBSD 10.3-RELEASE-p7 #0: Thu Aug 11 18:38:15 UTC 2016     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

with:

root@filer1:/etc/rc.d # sysctl -a|grep -i 'iscsi'
kern.iscsi.fail_on_shutdown: 1
kern.iscsi.fail_on_disconnection: 1
kern.iscsi.maxtags: 255
kern.iscsi.login_timeout: 5
kern.iscsi.iscsid_timeout: 5
kern.iscsi.ping_timeout: 5
kern.iscsi.debug: 1

On the initiator side:

WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): no ping reply (NOP-Out) after 5 seconds; dropping connection
WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection error; dropping connection
WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection error; dropping connection

On the target side:

WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting
(da3:iscsi2:0:0:0): READ(10). CDB: 28 00 08 03 02 7a 00 00 01 00 
(da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi1:0:0:0): READ(10). CDB: 28 00 02 d7 49 e7 00 00 20 00 
iscsi2:0:(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
0:(da2:0): iscsi1:0:Retrying command
0:0): Retrying command
(da3:iscsi2:0:0:0): READ(10). CDB: 28 00 02 d7 4a e7 00 00 20 00 
(da2:iscsi1:0:0:0): READ(10). CDB: 28 00 02 d7 4a 07 00 00 20 00 
(da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi2:0:iscsi1:0:0:0:0): 0): Retrying command
Retrying command
da3 at iscsi2 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL   1 detached
da2 at iscsi1 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL   0 detached
(da3:iscsi2:0:0:0): Periph destroyed
(da2:iscsi1:0:0:0): Periph destroyed
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL   1
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL   0
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 02 d7 4c a7 00 00 20 00 
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 02 e0 f3 d4 00 00 04 00 
(da3:(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
iscsi1:0:(da2:0:iscsi2:0:0): 0:Retrying command
0): Retrying command
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 02 d7 4d 07 00 00 20 00 
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 02 d7 4b c7 00 00 20 00 
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command
Retrying command
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL   0 detached
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL   1 detached
(da2:iscsi2:0:0:0): Periph destroyed
(da3:iscsi1:0:0:0): Periph destroyed
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL   0
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL   1
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting
(da2:iscsi1:0:0:0): READ(6). CDB: 08 00 00 00 20 00 
(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi1:0:0:0): Retrying command
(da2:iscsi1:0:0:0): READ(6). CDB: 08 00 00 10 02 00 
(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi1:0:0:0): Retrying command
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL   0 detached
(da2:iscsi1:0:0:0): Periph destroyed
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL   0
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting
(da3:iscsi2:0:0:0): READ(6). CDB: 08 00 00 20 20 00 
(da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:iscsi2:0:0:0): Retrying command
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL   1 detached
(da3:iscsi2:0:0:0): Periph destroyed
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting
(da2:iscsi1:0:0:0): READ(6). CDB: 08 00 00 00 20 00 
(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi1:0:0:0): Retrying command
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL   0 detached
(da2:iscsi1:0:0:0): Periph destroyed
da2 at iscsi1 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL   0
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): login timed out after 6 seconds; reconnecting
da3 at iscsi2 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL   1
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting
da2 at iscsi1 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL   0 detached
(da2:iscsi1:0:0:0): READ(10). CDB: 28 00 1c 14 10 0f 00 00 01 00 
(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi1:0:0:0): Error 5, Periph was invalidated
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting
(da3:iscsi2:0:0:0): READ(10). CDB: 28 00 1c 14 10 0f 00 00 01 00 
(da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:iscsi2:0:0:0): Retrying command
da3 at iscsi2 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL   1 detached
(da2:iscsi1:0:0:0): Periph destroyed
(da3:iscsi2:0:0:0): Periph destroyed
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL   0
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL   1
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): handoff on already connected session
Oct 16 01:02:04 filer1 iscsid[88600]: 10.20.30.32 (iqn.2016-08.lan.prod:target1): ISCSIDHANDOFF: Device busy
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): connection error; reconnecting
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL   1 detached
(da3:iscsi2:0:0:0): Periph destroyed
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL   1
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting
(da2:iscsi1:0:0:0): READ(6). CDB: 08 00 00 20 20 00 
(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi1:0:0:0): Retrying command
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL   0 detached
(da2:iscsi1:0:0:0): Periph destroyed
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting
(da3:iscsi2:0:0:0): READ(6). CDB: 08 00 00 20 20 00 
(da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:iscsi2:0:0:0): Retrying command
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL   1 detached
(da3:iscsi2:0:0:0): Periph destroyed
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL   1
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL   0
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): connection error; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): connection error; reconnecting
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL   0 detached
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL   1 detached
(da3:iscsi1:0:0:0): Periph destroyed
(da2:iscsi2:0:0:0): Periph destroyed
da2 at iscsi2 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL   1
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
da3 at iscsi1 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL   0
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)

Comment 12 Julien Cigar 2016-10-17 11:54:15 UTC

Some more info:

on filer2:

root@filer2:/home/jcigar # cat /etc/ctl.conf 

auth-group ag0 {
    chap xxx xxx
}

portal-group pg0 {
    discovery-auth-group no-authentication
    listen 10.20.30.32
}

target iqn.2016-08.lan.prod:target0 {
    auth-group ag0
    portal-group pg0

    lun 0 {
        blocksize 4096
        path /dev/gpt/f2data0
    }
}

target iqn.2016-08.lan.prod:target1 {
    auth-group ag0
    portal-group pg0

    lun 0 {
        blocksize 4096
        path /dev/gpt/f2data1
    }
}

root@filer2:/home/jcigar # ifconfig bge1
bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether d0:bf:9c:3a:b7:25
        inet 10.20.30.32 netmask 0xffffff00 broadcast 10.20.30.255 
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active

root@filer2:/home/jcigar # pciconf -lv bge1
bge1@pci0:2:0:1:        class=0x020000 card=0x22e8103c chip=0x165f14e4 rev=0x00 hdr=0x00
    vendor     = 'Broadcom Corporation'
    device     = 'NetXtreme BCM5720 Gigabit Ethernet PCIe'
    class      = network
    subclass   = ethernet

root@filer2:/home/jcigar # grep 'bge1' /var/run/dmesg.boot 
bge1: <Broadcom BCM5720 A0, ASIC rev. 0x5720000> mem 0x92b00000-0x92b0ffff,0x92b10000-0x92b1ffff,0x92b20000-0x92b2ffff at device 0.1 on pci2
bge1: APE FW version: NCSI v1.3.12.0
bge1: CHIP ID 0x05720000; ASIC REV 0x5720; CHIP REV 0x57200; PCI-E
miibus1: <MII bus> on bge1
bge1: Using defaults for TSO: 65518/35/2048
bge1: Ethernet address: d0:bf:9c:3a:b7:25
bge1: link state changed to DOWN
bge1: link state changed to UP

on filer1.prod.lan:

root@filer1:/home/jcigar # cat /etc/iscsi.conf 

t0 {
    TargetAddress   = 10.20.30.32
    TargetName      = iqn.2016-08.lan.prod:target0
    AuthMethod      = CHAP
    chapIName       = xxx
    chapSecret      = xxx
}

t1 {
    TargetAddress   = 10.20.30.32
    TargetName      = iqn.2016-08.lan.prod:target1
    AuthMethod      = CHAP
    chapIName       = xxx
    chapSecret      = xxx
}

root@filer1:/home/jcigar # ifconfig bge1
bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether d0:bf:9c:3a:c6:e1
        inet 10.20.30.31 netmask 0xffffff00 broadcast 10.20.30.255 
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (1000baseT <full-duplex,master>)
        status: active

root@filer1:/home/jcigar # pciconf -lv bge1
bge1@pci0:2:0:1:        class=0x020000 card=0x22e8103c chip=0x165f14e4 rev=0x00 hdr=0x00
    vendor     = 'Broadcom Corporation'
    device     = 'NetXtreme BCM5720 Gigabit Ethernet PCIe'
    class      = network
    subclass   = ethernet

root@filer1:/home/jcigar # grep 'bge1' /var/run/dmesg.boot 
bge1: <Broadcom BCM5720 A0, ASIC rev. 0x5720000> mem 0x92b00000-0x92b0ffff,0x92b10000-0x92b1ffff,0x92b20000-0x92b2ffff at device 0.1 on pci2
bge1: APE FW version: NCSI v1.3.12.0
bge1: CHIP ID 0x05720000; ASIC REV 0x5720; CHIP REV 0x57200; PCI-E
miibus1: <MII bus> on bge1
bge1: Using defaults for TSO: 65518/35/2048
bge1: Ethernet address: d0:bf:9c:3a:c6:e1
bge1: link state changed to DOWN
bge1: link state changed to UP

Comment 13 Edward Tomasz Napierala freebsd_committer

2016-10-18 13:57:24 UTC

The "WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): no ping reply (NOP-Out) after 5 seconds; dropping connection" and "WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): connection error; reconnecting" messages suggest a network problem.

Comment 14 Julien Cigar 2016-10-18 14:26:03 UTC

This is also what I thought at first, but there is no error messages related to the network interfaces in the logs, and all Ierrs, Idrop, and Coll counters are zeros.

There is a dedicated interface for the iSCSI traffic which I turned on/off quite a lot to test a failover script, and the symptoms are exactly the ones described by Ben Rubson in the post above (altough I only have 4 disks in total).

One more thing, I'm managing everything with SaltStack and in some cases ctld on the target and iscsid on the initiator are restarted at the same time, could it be because of that?

I forgot to paste my rc.conf filers, but I don't think there is something wrong ..:

root@filer1:/home/jcigar # cat /etc/rc.conf
dumpdev="AUTO"
ifconfig_bge0="DHCP"
ip6addrctl_policy="ipv4_prefer"
ipv6_network_interfaces="none"
keymap="be.iso"
mountd_enable="YES"
mountd_flags="-r -S -h 192.168.10.15"
nfs_server_enable="YES"
nfs_server_flags="-t -h 192.168.10.15"
nfsuserd_enable="YES"
nfsuserd_flags="-domain prod.lan"
nfsv4_server_enable="YES"
ntpdate_flags="-b -4"
ntpdate_hosts="ntp.belnet.be"
salt_minion_enable="YES"
sendmail_enable="NO"
sendmail_msp_queue_enable="NO"
sendmail_outbound_enable="NO"
sendmail_submit_enable="NO"
zfs_enable="YES"

hostname="filer1.prod.lan"

ifconfig_bge1="inet 10.20.30.31 netmask 255.255.255.0 mtu 9000"

ifconfig_bge0_alias0="inet vhid 54 advskew 50 pass xxx alias 192.168.10.15/32"

sshd_enable="YES"

ntpdate_enable="YES"

iscsid_enable="YES"
iscsictl_enable="YES"
iscsictl_flags="-Aa"

root@filer2:/home/jcigar # cat /etc/rc.conf
dumpdev="AUTO"
ifconfig_bge0="DHCP"
ip6addrctl_policy="ipv4_prefer"
ipv6_network_interfaces="none"
keymap="be.iso"
ntpdate_flags="-b -4"
ntpdate_hosts="ntp.belnet.be"
salt_minion_enable="YES"
sendmail_enable="NO"
sendmail_msp_queue_enable="NO"
sendmail_outbound_enable="NO"
sendmail_submit_enable="NO"
zfs_enable="YES"

hostname="filer2.prod.lan"

ifconfig_bge1="inet 10.20.30.32 netmask 255.255.255.0 mtu 9000"

ifconfig_bge0_alias0="inet vhid 54 advskew 100 pass xxx alias 192.168.10.15/32"

sshd_enable="YES"

ntpdate_enable="YES"
ctld_enable="YES"

Comment 15 Ben RUBSON 2016-10-18 14:34:37 UTC

I re-open this bug for Julien.

Comment 16 Julien Cigar 2016-10-18 14:54:30 UTC

Thank you Ben. Can you confirm (once again) that you haven't had this problem anymore in 11.0-RELEASE? If so I could just upgrade then .. :)

Comment 17 Ben RUBSON 2016-10-18 15:02:04 UTC

I switched to 11 at the time of 11-RC1 as I did not manage to reproduce the bug with it.
I did not try to explicitly reproduce it with RC2, RC3, 11.0-RELEASE and 11.0-RELEASE-p1, but it did not occur by itself in these versions :)
Sounds like this bug only affects 10.3.

Comment 18 Julien Cigar 2016-11-10 09:57:56 UTC

Problem appeared again today, after ~15 days of uptime, always on FreeBSD filer1.prod.lan 10.3-RELEASE-p11 FreeBSD 10.3-RELEASE-p11 #0: Mon Oct 24 18:49:24 UTC 2016     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 01 ef ec 90 00 00 01 00 
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 01 ef ec 8e 00 00 01 00 
(da3:(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
iscsi1:0:(da2:0:iscsi2:0:0): 0:Retrying command
0): Retrying command
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL   0 detached
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL   1 detached
(da2:iscsi2:0:0:0): Periph destroyed
(da3:iscsi1:0:0:0): Periph destroyed
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL   1
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 1c 14 10 0f 00 00 01 00 
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): Retrying command
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL   1 detached
(da2:iscsi2:0:0:0): Periph destroyed
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL   1
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): login timed out after 6 seconds; reconnecting
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL   0
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 1c 14 10 0f 00 00 01 00 
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): Retrying command
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL   1 detached
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting
(da2:iscsi2:0:0:0): Periph destroyed
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL   1
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): handoff on already connected session
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): connection error; reconnecting
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL   0 detached
(da3:iscsi1:0:0:0): Periph destroyed
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL   0
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)

After a zpool online, and with vfs.zfs.scrub_delay = 0 and vfs.zfs.resilver_delay = 0 I issued a zpool scrub and again I had a timeout:

WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In) after 5 seconds; reconnecting
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 9a 67 00 00 01 00 
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 98 3c 00 00 01 00 
(da3:(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
iscsi1:0:(da2:0:iscsi2:0:0): 0:Retrying command
0): Retrying command
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 9f e4 00 00 01 00 
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 a8 cc 00 00 01 00 
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command
Retrying command
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 a2 42 00 00 01 00 
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 95 6e 00 00 20 00 
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command
Retrying command
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 96 4e 00 00 20 00 
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 95 8e 00 00 20 00 
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command
Retrying command
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 96 6e 00 00 20 00 
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 a9 b2 00 00 01 00 
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command
Retrying command
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL   0 detached
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL   1 detached
(da3:iscsi1:0:0:0): Periph destroyed
(da2:iscsi2:0:0:0): Periph destroyed
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL   0
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL   1
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)

I've raised those timeouts a little bit:
kern.iscsi.login_timeout: 30
kern.iscsi.iscsid_timeout: 30
kern.iscsi.ping_timeout: 30

and see if it makes any difference

Comment 19 Julien Cigar 2016-11-10 10:05:51 UTC

this is what I have on the target:

root@filer2:/home/jcigar # sysctl -a|grep -i 'iscsi'
kern.iscsi.fail_on_shutdown: 1
kern.iscsi.fail_on_disconnection: 1
kern.iscsi.maxtags: 255
kern.iscsi.login_timeout: 30
kern.iscsi.iscsid_timeout: 30
kern.iscsi.ping_timeout: 30
kern.iscsi.debug: 1
kern.cam.ctl.iscsi.maxcmdsn_delta: 256
kern.cam.ctl.iscsi.login_timeout: 60
kern.cam.ctl.iscsi.ping_timeout: 5
kern.cam.ctl.iscsi.debug: 1

and the following in the error logs:
WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection error; dropping connection
WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection error; dropping connection
WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection error; dropping connection

Comment 20 Julien Cigar 2016-11-10 11:18:24 UTC

Also I noticed some Idrop packets on the iscsi interface of the initiator side:

root@filer1:/home/jcigar # netstat -I bge1
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
bge1   9000 <Link#2>      d0:bf:9c:3a:c6:e1 77396222     0   107 45800806     0     0
bge1      - 10.20.30.0    10.20.30.31       77394879     -     - 73726273     -     -

and:

root@filer1:/home/jcigar # vmstat -z | grep -i 'mbuf_jumbo_9k:'
mbuf_jumbo_9k:         9216,  74075,     256,     161,51922039, 214,   0

could it be related to jumbo frames?

(on the target side counters are 0)

Comment 21 Ben RUBSON 2016-11-10 11:27:48 UTC

I was also using jumbo frames with 10.3, but unfortunately I did not look at dropped packets when I had the issues.
Note that with 11, I still use jumbo frames and never encounter this issue again.

Comment 22 Julien Cigar 2016-11-10 11:47:22 UTC

Yep it's strange, I will definitively upgrade to 11 when possible ..

Other infos:

root@filer1:/home/jcigar # netstat -m
1285/6575/7860 mbufs in use (current/cache/total)
1024/3526/4550/500010 mbuf clusters in use (current/cache/total/max)
1024/3178 mbuf+clusters out of packet secondary zone in use (current/cache)
0/267/267/250005 4k (page size) jumbo clusters in use (current/cache/total/max)
256/181/437/74075 9k jumbo clusters in use (current/cache/total/max)
0/0/0/41667 16k jumbo clusters in use (current/cache/total/max)
4673K/11392K/16066K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/214/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile

root@filer1:/home/jcigar # sysctl kern.ipc | fgrep nmb
kern.ipc.nmbufs: 3200070
kern.ipc.nmbjumbo16: 166668
kern.ipc.nmbjumbo9: 222225
kern.ipc.nmbjumbop: 250005
kern.ipc.nmbclusters: 500010

Comment 23 Julien Cigar 2016-11-10 22:25:33 UTC

Ok.. I definitively think it's an issue with jumbo frames, I got another disconnection and again the "requests for jumbo clusters denied" (9k) counter increased:

0/304/0 requests for jumbo clusters denied (4k/9k/16k)

any idea?

Comment 24 Steven Hartland freebsd_committer

2016-11-10 22:47:23 UTC

Any mention of nmbjumbo9 in /var/log/messages?

Specifically do you see:
kern.ipc.nmbjumbo9 limit reached

If so have you tried bumping kern.ipc.nmbjumbo9?

Comment 25 Julien Cigar 2016-11-10 23:12:13 UTC

(In reply to Steven Hartland from comment #24)

I don't have any mention of nmbjumbo9 in /var/log/messages, I haven't raised kern.ipc.nmbjumbo9 yet, currently it's:

root@filer1:/var/log # sysctl kern.ipc.nmbjumbo9
kern.ipc.nmbjumbo9: 222225

but it does not seem necessary to raise it if I can trust netstat -m:

root@filer1:/var/log # netstat -m
1282/6533/7815 mbufs in use (current/cache/total)
1024/3564/4588/500010 mbuf clusters in use (current/cache/total/max)
1024/3431 mbuf+clusters out of packet secondary zone in use (current/cache)
0/161/161/250005 4k (page size) jumbo clusters in use (current/cache/total/max)
256/120/376/74075 9k jumbo clusters in use (current/cache/total/max)
0/0/0/41667 16k jumbo clusters in use (current/cache/total/max)
4672K/10485K/15157K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/304/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile

(the box isn't heavily loaded)

what I don't understand if why there are requests for 4k jumbo clusters denied as there are plenty free .. ?

could it be related to this change: https://svnweb.freebsd.org/base/head/sys/dev/bge/if_bge.c?r1=276299&r2=276750 ?

Comment 26 Julien Cigar 2016-11-10 23:15:40 UTC

I don't know if it's related but I also noticed some incorrect cksum:

00:14:03.380734 IP (tos 0x0, ttl 64, id 19369, offset 0, flags [DF], proto TCP (6), length 100, bad cksum 0 (->9e84)!)
    10.20.30.31.24995 > 10.20.30.32.iscsi-target: Flags [P.], cksum 0x50bd (incorrect -> 0xfff9), seq 1205926059:1205926107, ack 1518444144, win 16384, options [nop,nop,TS val 1388742470 ecr 2474468733], length 48

00:14:03.380742 IP (tos 0x0, ttl 64, id 19370, offset 0, flags [DF], proto TCP (6), length 100, bad cksum 0 (->9e83)!)
    10.20.30.31.55645 > 10.20.30.32.iscsi-target: Flags [P.], cksum 0x50bd (incorrect -> 0x3912), seq 3104128225:3104128273, ack 1948017382, win 16384, options [nop,nop,TS val 1388742470 ecr 4001288677], length 48

00:14:03.381060 IP (tos 0x0, ttl 64, id 54596, offset 0, flags [DF], proto TCP (6), length 100)
    10.20.30.32.iscsi-target > 10.20.30.31.55645: Flags [P.], cksum 0x4c6a (correct), seq 1:49, ack 48, win 16384, options [nop,nop,TS val 4001289988 ecr 1388742470], length 48

00:14:03.381070 IP (tos 0x0, ttl 64, id 54597, offset 0, flags [DF], proto TCP (6), length 100)
    10.20.30.32.iscsi-target > 10.20.30.31.24995: Flags [P.], cksum 0x0d62 (correct), seq 1:49, ack 48, win 16384, options [nop,nop,TS val 2474470019 ecr 1388742470], length 48

... but I guess it's because TSO is enabled on the interfaces ?

Comment 27 Steven Hartland freebsd_committer

2016-11-10 23:17:50 UTC

checksums a likely just due to hw checksum offloading.

Comment 28 Steven Hartland freebsd_committer

2016-11-10 23:19:43 UTC

checksums a likely just due to hw checksum offloading.

Comment 29 Steven Hartland freebsd_committer

2016-11-10 23:30:47 UTC

I'm assuming vmstat agrees with netstat -m e.g.
vmstat -z | head -n 1; vmstat -z | grep mbuf

It could be that the disconnect is the cause of mbuf issue and not the other way round. You'd have to catch it when the issue is occurring to confirm.

If you have plenty of ram you could just trying doubling kern.ipc.nmbjumbo9 to see if the problem goes away?

Comment 30 Steven Hartland freebsd_committer

2016-11-10 23:42:21 UTC

If its dropping in the driver due to failure go get a jumbo mbuf you should see this in the output netstat -i.

Comment 31 Julien Cigar 2016-11-10 23:47:16 UTC

Y(In reply to Steven Hartland from comment #29)

Yes it seems so:

root@filer1:/var/log # vmstat -z | head -n 1; vmstat -z | grep mbuf
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_packet:            256, 3200070,    1024,    3431,94881716,   0,   0
mbuf:                   256, 3200070,     257,    3103,566359041,   0,   0
mbuf_cluster:          2048, 500010,    4455,     133,15367798,   0,   0
mbuf_jumbo_page:       4096, 250005,       0,     161,102798544,   0,   0
mbuf_jumbo_9k:         9216,  74075,     256,     120,52036055, 304,   0
mbuf_jumbo_16k:       16384,  41667,       0,       0,       0,   0,   0
mbuf_ext_refcnt:          4,      0,       0,       0,       0,   0,   0

> You'd have to catch it when the issue is occurring to confirm

not so easy :) but every time I have a WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 30 seconds; reconnecting

I have a small increase in the "requests for 9k jumbo clusters denied" counter

> If you have plenty of ram you could just trying doubling kern.ipc.nmbjumbo9 to see if the problem goes away?

The machine has only 8GB of RAM (and ZFS), I think I'll first test with a 1500 MTU and -tso

Comment 32 Julien Cigar 2016-11-10 23:49:55 UTC

(In reply to Steven Hartland from comment #30)

Yes I already checked the error counters, but it's 0:

root@filer1:/var/log # netstat -I bge1
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
bge1   9000 <Link#2>      d0:bf:9c:3a:c6:e1 77755195     0   152 46072898     0     0
bge1      - 10.20.30.0    10.20.30.31       77753810     -     - 74150354     -     -

... so I guess it's not a driver issue

Comment 33 Julien Cigar 2016-11-30 09:19:38 UTC

an update on this: I turned off jumbo frames and TSO on the replication interface and the issue hasn't been repeated until now...!

filer1% uptime
10:15AM  up 18 days, 13:35, 1 user, load averages: 0.07, 0.09, 0.08

filer1% vmstat -z|head -1;vmstat -z|grep -i 'mbuf'
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_packet:            256, 3200070,    1024,    3403,527205862,   0,   0
mbuf:                   256, 3200070,       6,    2752,456285246,   0,   0
mbuf_cluster:          2048, 500010,    4427,     355,18287308,   0,   0
mbuf_jumbo_page:       4096, 250005,       0,      71, 2682330,   0,   0
mbuf_jumbo_9k:         9216,  74075,       0,       0,       0,   0,   0
mbuf_jumbo_16k:       16384,  41667,       0,       0,       0,   0,   0
mbuf_ext_refcnt:          4,      0,       0,       0,       0,   0,   0

filer1% netstat -m
1031/6154/7185 mbufs in use (current/cache/total)
1024/3758/4782/500010 mbuf clusters in use (current/cache/total/max)
1024/3403 mbuf+clusters out of packet secondary zone in use (current/cache)
0/71/71/250005 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/74075 9k jumbo clusters in use (current/cache/total/max)
0/0/0/41667 16k jumbo clusters in use (current/cache/total/max)
2305K/9338K/11644K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile

filer1% netstat -I bge1
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
bge1   1500 <Link#2>      xx:xx:xx:xx:xx:xx 497577171     0     0 262946639     0     0
bge1      - 10.20.30.0    10.20.30.31       497575590     -     - 262942638     -     -

filer1% ifconfig bge1
bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,VLAN_HWTSO,LINKSTATE>
        ether xx:xx:xx:xx:xx:xx
        inet 10.20.30.31 netmask 0xffffff00 broadcast 10.20.30.255 
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (1000baseT <full-duplex,master>)
        status: active

(I'll let you know if the problem occurs again)

Comment 34 Julien Cigar 2016-12-15 09:40:54 UTC

No problems for more than a month now.. I tend to conclude that the MTU 9000 was the root cause of this problem (?)

Comment 35 Ben RUBSON 2017-01-04 14:02:40 UTC

Did you try to increase some sysctls with jumbo frames on, as proposed by Steven ?

Comment 36 Julien Cigar 2017-01-05 09:18:23 UTC

(In reply to Ben RUBSON from comment #35)

yes, .. but only slightly as the machine has only 8GB of RAM

Comment 37 Ben RUBSON 2017-01-05 09:24:07 UTC

And no improvements ?

Comment 38 Julien Cigar 2017-01-05 09:35:56 UTC

(In reply to Ben RUBSON from comment #37)

it only delayed the problem, .. but it would be interesting if you could test on your side..!

Comment 39 Ben RUBSON 2017-01-05 14:04:40 UTC

Strange, as if mbufs were "consumed" but not released.
And when devices are disconnected, can you reconnect to them properly ?
Or do you need to change their target name or to reboot, as explained in my first message of this bug report ?

Comment 40 Julien Cigar 2017-01-06 09:57:22 UTC

I had to reboot..

Comment 41 Ben RUBSON 2017-01-06 10:00:01 UTC

OK, so same issue as mine (for which I opened this bug report), with 10.3.
Which I was not able to reproduce with 11.
Any plan to upgrade to 11 to be safe ?

Comment 42 Julien Cigar 2017-01-06 10:01:42 UTC

also, looks similar to bug #215727

Comment 43 Julien Cigar 2017-01-06 10:03:25 UTC

(In reply to Ben RUBSON from comment #41)

as soon as I have a little time .. :)

Comment 44 Edward Tomasz Napierala freebsd_committer

2017-04-11 19:55:34 UTC

From what I see, it's a networking problem, not a bug in iSCSI.