Bug 236659 - kernel page fault panic when chelsio cxgbei iscsi offload engine is enabled in ctld
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Navdeep Parhar
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-20 11:51 UTC by Markus Wild
Modified: 2019-09-22 06:09 UTC
CC List: 3 users

See Also:


Attachments
core.txt (110.13 KB, text/plain)
2019-03-20 11:51 UTC, Markus Wild
this is "where" within kgdb8 (6.29 KB, text/plain)
2019-03-20 11:51 UTC, Markus Wild

Description Markus Wild 2019-03-20 11:51:04 UTC
Created attachment 203000
core.txt

I'm setting up an HP DL380 G7 with a Chelsio T520-CR card as a ZFS iSCSI storage server. The server is connected
via multipath to a vSphere 5.5 test server. After a while with iSCSI traffic, the FreeBSD server panics. The panic
seems to depend on elapsed uptime rather than on the amount of traffic, since I can frequently finish
one run of CrystalMark on a VM on the vSphere server before I get the panic. Also, just keeping the server idle (but
with the vSphere server reconnecting after the crash) can cause the panic after 20-30 min of uptime.

I only see the panic when I enable "offload cxgbei" in the ctl.conf portal-group. Just having toe enabled on the
interfaces does not cause the panic (at least not in the timeframe I monitored the server). The server crashes whether
or not dev.t5nex.0.toe.ddp is enabled (it is currently enabled; originally it was not).

These are configuration settings:
/etc/rc.conf:
ifconfig_cxl0="mtu 9000 toe up"
ifconfig_cxl1="mtu 9000 toe up"
cloned_interfaces="vlan1513 vlan1514"

# scsi-a
ifconfig_vlan1513="inet 10.140.3.2/24 vlan 1513 vlandev cxl0 up"
# iscsi-b
ifconfig_vlan1514="inet 10.140.4.2/24 vlan 1514 vlandev cxl1 up"

ctld_enable="YES"                # CAM Target Layer / iSCSI target daemon.

kld_list="t5fw_cfg if_cxl t4_tom cxgbei"

/etc/ctl.conf:
portal-group pg0 {
        discovery-auth-group no-authentication
        listen 10.140.3.2
        listen 10.140.4.2

        offload cxgbei
}
Comment 1 Markus Wild 2019-03-20 11:51:48 UTC
Created attachment 203001
this is "where" within kgdb8
Comment 2 Markus Wild 2019-03-21 12:50:39 UTC
Since the panic happens in rt_tables_get_rnh_ptr (table=0, fam=2) at /usr/src/sys/net/route.c:193, which uses a VIMAGE wrapper macro, I compiled a custom
kernel using:

include GENERIC

nooptions       VIMAGE


and I have been stress testing the server since, without a single panic. I did get one
complete stall of iSCSI communication, though, logged with messages like:

[...]
Mar 21 11:38:14 zrhcz-nas5 kernel: WARNING: 10.140.4.15 (iqn.1998-01.com.vmware:5c80f357-6a29-a734-fef3-001b21d8e459-24252974): received PDU with CmdSN 1750169998, while expected 4482310
Mar 21 11:38:14 zrhcz-nas5 kernel: WARNING: 10.140.4.15 (iqn.1998-01.com.vmware:5c80f357-6a29-a734-fef3-001b21d8e459-24252974): received Data-Out PDU with DataSN 4, while expected 2; dropping connection
[...]

and vSphere losing the datastore. After setting dev.t5nex.0.toe.ddp=0,
the stall has not recurred so far.
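
To see how often such sequence-number mismatches occur, the warnings can be pulled out of the system log. This is only a minimal sketch: the function name seq_mismatches is made up for illustration, and it assumes each warning occupies a single line in the actual log file and follows the "received ... PDU with CmdSN/DataSN N, while expected M" wording shown in the messages quoted here.

```shell
# Hypothetical helper (not part of ctld or the base system): reads log text
# on stdin and prints "<counter> <received> <expected>" for every
# sequence-number warning in the format quoted in this report.
seq_mismatches() {
    awk '/received .*PDU with (CmdSN|DataSN) [0-9]+, while expected [0-9]+/ {
        for (i = 1; i < NF; i++) {
            if ($i == "CmdSN" || $i == "DataSN") {
                kind = $i
                got = $(i + 1)
                sub(/,$/, "", got)      # strip the trailing comma
            }
            if ($i == "expected") {
                want = $(i + 1)
                sub(/;$/, "", want)     # strip a trailing semicolon, if any
            }
        }
        print kind, got, want
    }'
}
```

Typical use would be `seq_mismatches < /var/log/messages`, assuming kernel messages are logged there.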

My conclusion: the Chelsio driver isn't quite VIMAGE-clean yet.
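
For anyone who wants to repeat the test without VIMAGE, building such a kernel might look roughly like the sketch below. The config name NOVIMAGE and both function names are placeholders chosen for illustration, not taken from this report; the sketch also assumes the 12.0-RELEASE sources are in /usr/src and that the machine is amd64.

```shell
# Sketch only: NOVIMAGE, write_kernconf and build_kernel are hypothetical names.
KERNCONF=NOVIMAGE
CONF_DIR=${CONF_DIR:-/usr/src/sys/amd64/conf}

write_kernconf() {
    # Emit the custom kernel config quoted in this comment.
    printf 'include GENERIC\n\nnooptions       VIMAGE\n'
}

build_kernel() {
    # Write the config, then build and install the kernel (run as root).
    write_kernconf > "$CONF_DIR/$KERNCONF" &&
    (cd /usr/src &&
        make buildkernel KERNCONF="$KERNCONF" &&
        make installkernel KERNCONF="$KERNCONF")
}
```

Calling build_kernel and rebooting would then boot the kernel without VIMAGE; the functions are only defined here, nothing runs until they are called.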