Bug 208267 - if_sfxge unstable causes panic at ifconfig sfxge0 up
Summary: if_sfxge unstable causes panic at ifconfig sfxge0 up
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.3-BETA2
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: Andrew Rybchenko,St.Petersburg Russia
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-24 19:41 UTC by nonesuch
Modified: 2016-05-03 06:56 UTC

See Also:
arybchik: maintainer-feedback+
arybchik: mfc-stable9-
arybchik: mfc-stable10+


Description nonesuch 2016-03-24 19:41:43 UTC
On 10.3-PRERELEASE built from 10-STABLE as of 23 March 2016, the sfxge driver is unstable.

I wanted to bring up a lagg of two sfxge NICs. I loaded the module:

dev03:~ # kldload sfxge

sfxge0: <Solarflare SFC9100 family> port 0x8100-0x81ff mem 0xc9800000-0xc9ffffff,0xca004000-0xca007fff irq 64 at device 0.0 on pci13
sfxge0: Using MSI-X interrupts
sfxge0: Ethernet address: 00:0f:53:35:fa:90
sfxge0: Solarflare Flareon Ultra 7000 Series 10G Adapter
sfxge1: <Solarflare SFC9100 family> port 0x8000-0x80ff mem 0xc9000000-0xc97fffff,0xca000000-0xca003fff irq 68 at device 0.1 on pci13
sfxge1: Using MSI-X interrupts
sfxge1: Ethernet address: 00:0f:53:35:fa:91
sfxge1: Solarflare Flareon Ultra 7000 Series 10G Adapter

dev03:~# ifconfig sfxge0 up
panic: assertion failed at /usr/src/sys/modules/sfxge/../../dev/sfxge/common/hunt_rx.c:751
cpuid = 12
KDB: stack backtrace:
#0 0xffffffff8098e7a0 at kdb_backtrace+0x60
#1 0xffffffff809514f6 at vpanic+0x126
#2 0xffffffff809513c3 at panic+0x43
#3 0xffffffff81e382c7 at ef10_rx_qcreate+0x1f7
#4 0xffffffff81e28a94 at efx_rx_qcreate+0xb4
#5 0xffffffff81e1d1ff at sfxge_rx_start+0x22f
#6 0xffffffff81e189f8 at sfxge_start+0xb8
#7 0xffffffff81e185e1 at sfxge_if_ioctl+0x3c1
#8 0xffffffff80a1771f at ifioctl+0xaef
#9 0xffffffff809a9005 at kern_ioctl+0x255
#10 0xffffffff809a8d00 at sys_ioctl+0x140
#11 0xffffffff80d56def at amd64_syscall+0x40f
#12 0xffffffff80d3c05b at Xfast_syscall+0xfb
Uptime: 1m19s

Dump failed. Partition too small.
Comment 1 nonesuch 2016-03-25 01:43:50 UTC
Here is the backtrace:

savecore: reboot after panic: assertion failed at /usr/src/sys/modules/sfxge/../../dev/sfxge/common/hunt_rx.c:751
savecore: writing core to ./vmcore.0


Loaded symbols for /boot/kernel/fdescfs.ko.symbols
Reading symbols from /boot/kernel/sfxge.ko.symbols...done.
Loaded symbols for /boot/kernel/sfxge.ko.symbols
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
219     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
#1  0xffffffff80951152 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#2  0xffffffff80951535 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:889
#3  0xffffffff809513c3 in panic (fmt=0x0) at /usr/src/sys/kern/kern_shutdown.c:818
#4  0xffffffff81e382c7 in ef10_rx_qcreate (enp=<value optimized out>, index=<value optimized out>, label=<value optimized out>, 
    type=<value optimized out>, esmp=<value optimized out>, n=<value optimized out>, id=584, eep=<value optimized out>, 
    erp=<value optimized out>) at /usr/src/sys/modules/sfxge/../../dev/sfxge/common/hunt_rx.c:751
#5  0xffffffff81e28a94 in efx_rx_qcreate (enp=0xfffff8024de45000, index=32, label=32, type=EFX_RXQ_TYPE_DEFAULT, esmp=0x0, n=0, 
    id=<value optimized out>, eep=<value optimized out>, erpp=0x0)
    at /usr/src/sys/modules/sfxge/../../dev/sfxge/common/efx_rx.c:540
#6  0xffffffff81e1d1ff in sfxge_rx_start (sc=0xfffffe0026084000) at /usr/src/sys/modules/sfxge/../../dev/sfxge/sfxge_rx.c:1037
#7  0xffffffff81e189f8 in sfxge_start (sc=0xfffffe0026084000) at /usr/src/sys/modules/sfxge/../../dev/sfxge/sfxge.c:233
#8  0xffffffff81e185e1 in sfxge_if_ioctl (ifp=0xfffff8013312b800, command=<value optimized out>, data=<value optimized out>)
    at /usr/src/sys/modules/sfxge/../../dev/sfxge/sfxge.c:394
#9  0xffffffff80a1771f in ifioctl (so=<value optimized out>, cmd=<value optimized out>, data=0xfffffe201d7f28e0 "sfxge0", 
    td=<value optimized out>) at /usr/src/sys/net/if.c:2403
#10 0xffffffff809a9005 in kern_ioctl (td=0xfffff8014844a4b0, fd=<value optimized out>, com=0) at file.h:321
#11 0xffffffff809a8d00 in sys_ioctl (td=0xfffff8014844a4b0, uap=0xfffffe201d7f2a40) at /usr/src/sys/kern/sys_generic.c:718
#12 0xffffffff80d56def in amd64_syscall (td=0xfffff8014844a4b0, traced=0) at subr_syscall.c:141
#13 0xffffffff80d3c05b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:396
#14 0x00000008013e9fca in ?? ()
Comment 2 Andrew Rybchenko,St.Petersburg Russia freebsd_committer 2016-03-25 11:25:48 UTC
Looking at the code, I guess the system has more than 32 CPUs.

Could you try loading the driver with a limited number of RSS channels (most likely 32 will work)?
# kenv hw.sfxge.0.max_rss_channels=16
# kenv hw.sfxge.1.max_rss_channels=16
Comment 3 nonesuch 2016-03-29 16:35:57 UTC
All,

Updating the loader to use

hw.sfxge.0.max_rss_channels=16
hw.sfxge.1.max_rss_channels=16

as well as disabling HT in the BIOS fixed the issue.

I propose the following "fix":

1. Update the man page to note:

     hw.sfxge.N.max_rss_channels
             The maximum number of allocated RSS channels for the Nth adapter.
             If set to 0 or unset, the number of channels is determined by the
             number of CPU cores. This does not scale beyond 32 cores. On a
             box with more than 32 cores, you need to manually set
             hw.sfxge.N.max_rss_channels to 32 or less before loading the
             driver.
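
The behaviour described in the proposed man page text can be sketched in C roughly as follows. This is an illustrative model only, not the driver's code: the names sketch_rss_channels, sketch_within_limit and SKETCH_CHANNEL_LIMIT are hypothetical, and the limit of 32 is simply the figure reported in this bug.

```c
/* Hypothetical constant: the 32-channel ceiling reported in this bug. */
#define SKETCH_CHANNEL_LIMIT 32u

/*
 * Illustrative model only, not the sfxge driver code.
 * max_rss_channels == 0 stands for "unset": the count then follows
 * the number of CPUs; otherwise the tunable caps it.
 */
static unsigned int
sketch_rss_channels(unsigned int ncpus, unsigned int max_rss_channels)
{
	unsigned int n = ncpus;

	if (max_rss_channels != 0 && max_rss_channels < n)
		n = max_rss_channels;
	return (n);
}

/* Returns 1 if the chosen channel count stays within the reported limit. */
static int
sketch_within_limit(unsigned int channels)
{
	return (channels <= SKETCH_CHANNEL_LIMIT);
}
```

For a hypothetical 48-CPU box with the tunable unset, the sketch yields 48 channels, which exceeds the limit; with the tunable set to 16 it yields 16, which matches the workaround that was confirmed above.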
Comment 4 Andrew Rybchenko,St.Petersburg Russia freebsd_committer 2016-03-30 07:05:02 UTC
Thanks a lot for the confirmation.

I have a real fix that allows scaling to more than 32 CPUs. It is under review at Solarflare.

I hope to publish it this week.
Comment 5 Andrew Rybchenko,St.Petersburg Russia freebsd_committer 2016-04-27 10:19:15 UTC
Sorry for the delay.

https://reviews.freebsd.org/D6121
Comment 6 commit-hook freebsd_committer 2016-04-28 06:21:40 UTC
A commit references this bug:

Author: arybchik
Date: Thu Apr 28 06:20:43 UTC 2016
New revision: 298735
URL: https://svnweb.freebsd.org/changeset/base/298735

Log:
  sfxge(4): do not use RxQ index as label

  Labels are limitted by 32 on EF10. It is not sufficient on powerful hosts.
  Since only one RxQ is running over each EvQ, zero label may be used.

  Reviewed by:    gnn
  Sponsored by:   Solarflare Communications, Inc.
  MFC after:      2 days
  PR:             208267
  Differential Revision:  https://reviews.freebsd.org/D6121

Changes:
  head/sys/dev/sfxge/sfxge_ev.c
  head/sys/dev/sfxge/sfxge_rx.c
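
The idea in the commit log can be illustrated with a small C model. It is a sketch under assumptions, not the actual change: sketch_rx_qcreate, sketch_start_old, sketch_start_new and SKETCH_MAX_LABELS are hypothetical names, the real efx_rx_qcreate()/ef10_rx_qcreate() signatures are not reproduced, and the failing check is assumed to be equivalent to "label < 32".

```c
/* Hypothetical constant: the EF10 label limit named in the commit log. */
#define SKETCH_MAX_LABELS 32u

/*
 * Model of queue creation: returns 0 on success and -1 in the case
 * where the real driver is assumed to have hit its assertion.
 */
static int
sketch_rx_qcreate(unsigned int index, unsigned int label)
{
	(void)index;	/* the queue index itself is not limited in this model */
	return (label < SKETCH_MAX_LABELS ? 0 : -1);
}

/*
 * Before the fix: the label follows the RxQ index.
 * Returns the first failing index, or -1 if every queue was created.
 */
static int
sketch_start_old(unsigned int nqueues)
{
	unsigned int i;

	for (i = 0; i < nqueues; i++)
		if (sketch_rx_qcreate(i, i) != 0)
			return ((int)i);
	return (-1);
}

/*
 * After the fix: one RxQ per EvQ, so label 0 is used for every queue.
 * Same return convention as sketch_start_old().
 */
static int
sketch_start_new(unsigned int nqueues)
{
	unsigned int i;

	for (i = 0; i < nqueues; i++)
		if (sketch_rx_qcreate(i, 0) != 0)
			return ((int)i);
	return (-1);
}
```

In this model, sketch_start_old(33) fails at index 32, which is consistent with the index=32, label=32 arguments visible in frame #5 of the kgdb backtrace, while sketch_start_new(33) succeeds for every queue.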
Comment 7 commit-hook freebsd_committer 2016-04-30 06:36:13 UTC
A commit references this bug:

Author: arybchik
Date: Sat Apr 30 06:35:20 UTC 2016
New revision: 298836
URL: https://svnweb.freebsd.org/changeset/base/298836

Log:
  MFC r298735

  sfxge(4): do not use RxQ index as label

  Labels are limitted by 32 on EF10. It is not sufficient on powerful hosts.
  Since only one RxQ is running over each EvQ, zero label may be used.

  Reviewed by:    gnn
  Sponsored by:   Solarflare Communications, Inc.
  PR:             208267
  Differential Revision:  https://reviews.freebsd.org/D6121

Changes:
_U  stable/10/
  stable/10/sys/dev/sfxge/sfxge_ev.c
  stable/10/sys/dev/sfxge/sfxge_rx.c
Comment 8 Andrew Rybchenko,St.Petersburg Russia freebsd_committer 2016-04-30 06:59:15 UTC
Fixed in head and stable/10.
Comment 9 nonesuch 2016-05-02 16:44:06 UTC
Confirmed to be working; no issues noted.

sfxge0: Using MSI-X interrupts
sfxge0: Ethernet address: 00:0f:53:35:fa:90
sfxge0: Solarflare Flareon Ultra 7000 Series 10G Adapter
sfxge1: <Solarflare SFC9100 family> port 0x8000-0x80ff mem 0xc9000000-0xc97fffff,0xca000000-0xca003fff irq 68 at device 0.1 on pci131
sfxge1: Using MSI-X interrupts
sfxge1: Ethernet address: 00:0f:53:35:fa:91
sfxge1: Solarflare Flareon Ultra 7000 Series 10G Adapter
sfxge0: link state changed to UP
sfxge0: promiscuous mode enabled
sfxge0: promiscuous mode disabled

dev.sfxge.1.vpd.SN: 7501013053711520471XXXXX
dev.sfxge.1.vpd.EC: PCBR3:CCSA1
dev.sfxge.1.vpd.PN: SFN7x22F
dev.sfxge.1.txq.33.stats.tx_netdown_drops: 0
dev.sfxge.1.txq.33.stats.tx_put_overflow: 0
dev.sfxge.1.txq.33.stats.tx_get_non_tcp_overflow: 0
dev.sfxge.1.txq.33.stats.tx_get_overflow: 0
dev.sfxge.1.txq.33.stats.tx_drops: 0
dev.sfxge.1.txq.33.stats.tx_collapses: 0
dev.sfxge.1.txq.33.stats.tso_pdrop_no_rsrc: 0
dev.sfxge.1.txq.33.stats.tso_pdrop_too_many: 0
dev.sfxge.1.txq.33.stats.tso_long_headers: 0
dev.sfxge.1.txq.33.stats.tso_packets: 0
dev.sfxge.1.txq.33.stats.tso_bursts: 0
dev.sfxge.1.txq.33.dpl.put_hiwat: 0
dev.sfxge.1.txq.33.dpl.get_hiwat: 0
dev.sfxge.1.txq.33.dpl.get_non_tcp_count: 0
dev.sfxge.1.txq.33.dpl.get_count: 0
Comment 10 Andrew Rybchenko,St.Petersburg Russia freebsd_committer 2016-05-03 06:56:13 UTC
Many thanks for the verification and confirmation.