Bug 213257 - Crash in IGB driver with ALTQ
Summary: Crash in IGB driver with ALTQ
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.3-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-net
URL:
Keywords: crash, needs-qa
Depends on:
Blocks:
 
Reported: 2016-10-06 20:28 UTC by dustinwenz
Modified: 2017-03-15 07:43 UTC
CC List: 11 users

See Also:
loos: mfc-stable9-
loos: mfc-stable10+
loos: mfc-stable11-


Attachments
FreeBSD 11-RELEASE-p1 kernel crash ALTQ/IGB (17.15 KB, image/png)
2016-10-08 08:54 UTC, mach
Add ALTQ_IGB kernel config option (1.09 KB, patch)
2017-02-21 20:01 UTC, freebsd

Description dustinwenz 2016-10-06 20:28:36 UTC
I see intermittent panics under moderate network load that began sometime between 10-STABLE r295115 and r303709. I realize that's several months' worth of changes, but regression testing this one is very time-consuming. I suspect it has something to do with r303174 "If ALTQ is defined in the kern conf, switch to Legacy Mode", but I do not know for certain.

The kernel is built with these options (I am using PF with some simple filtering rules, but no ALTQ functionality):

device pf
device pflog
device pfsync
options ALTQ
options ALTQ_CBQ
options ALTQ_RED
options ALTQ_RIO
options ALTQ_HFSC
options ALTQ_CDNR
options ALTQ_PRIQ

If I rebuild without ALTQ, the panics no longer recur.
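
To check which ALTQ options a custom kernel config still enables before rebuilding, something like the following may help. This is only a sketch: the function name list_altq_options is hypothetical, and the config path in the usage line is an assumption, not taken from this report.

```sh
#!/bin/sh
# Hypothetical helper: print the active ALTQ-related "options" entries
# found in a kernel configuration file (commented-out lines are skipped,
# because their first field is not exactly "options").
list_altq_options() {
    conf=$1
    awk '$1 == "options" && ($2 == "ALTQ" || $2 ~ /^ALTQ_/) { print $2 }' "$conf"
}

# Usage (path is an assumption):
#   list_altq_options /usr/src/sys/amd64/conf/MYKERNEL
```

If the function prints nothing, the config no longer enables ALTQ.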
Comment 1 dustinwenz 2016-10-06 20:32:26 UTC
The panic traces are not particularly consistent. Here is one:

FreeBSD 10.3-STABLE #0 r305915: Tue Sep 20 16:41:23 CDT 2016


#0  doadump (textdump=<value optimized out>) at pcpu.h:219
219		__asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
#1  0xffffffff80960b52 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#2  0xffffffff80960f35 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:889
#3  0xffffffff80960dc3 in panic (fmt=0x0) at /usr/src/sys/kern/kern_shutdown.c:818
#4  0xffffffff809ddb83 in sbcut_internal (sb=<value optimized out>, len=<value optimized out>) at /usr/src/sys/kern/uipc_sockbuf.c:886
#5  0xffffffff80b0bb61 in tcp_do_segment (m=<value optimized out>, th=<value optimized out>, so=<value optimized out>, tp=<value optimized out>, drop_hdrlen=<value optimized out>, tlen=0, 
    iptos=<value optimized out>, ti_locked=<value optimized out>) at /usr/src/sys/netinet/tcp_input.c:2820
#6  0xffffffff80b09e76 in tcp_input (m=<value optimized out>, off0=<value optimized out>) at /usr/src/sys/netinet/tcp_input.c:1396
#7  0xffffffff80a98827 in ip_input (m=0xfffff8005d303000) at /usr/src/sys/netinet/ip_input.c:733
#8  0xffffffff80a38212 in netisr_dispatch_src (proto=<value optimized out>, source=<value optimized out>, m=0x0) at /usr/src/sys/net/netisr.c:976
#9  0xffffffff80a2f5d6 in ether_demux (ifp=<value optimized out>, m=0xfffff8005d303000) at /usr/src/sys/net/if_ethersubr.c:851
#10 0xffffffff80a3027e in ether_nh_input (m=<value optimized out>) at /usr/src/sys/net/if_ethersubr.c:646
#11 0xffffffff80a38212 in netisr_dispatch_src (proto=<value optimized out>, source=<value optimized out>, m=0x0) at /usr/src/sys/net/netisr.c:976
#12 0xffffffff80503989 in igb_rxeof (count=98) at /usr/src/sys/dev/e1000/if_igb.c:4790
#13 0xffffffff80503fd3 in igb_msix_que (arg=0xfffff8000b822e08) at /usr/src/sys/dev/e1000/if_igb.c:1597
#14 0xffffffff8092c4ab in intr_event_execute_handlers (p=<value optimized out>, ie=0xfffff8000b7e7500) at /usr/src/sys/kern/kern_intr.c:1264
#15 0xffffffff8092c8f6 in ithread_loop (arg=0xfffff8000b8313a0) at /usr/src/sys/kern/kern_intr.c:1277
#16 0xffffffff80929fda in fork_exit (callout=0xffffffff8092c860 <ithread_loop>, arg=0xfffff8000b8313a0, frame=0xfffffe07c4f40ac0) at /usr/src/sys/kern/kern_fork.c:1030
#17 0xffffffff80d825de in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:613
#18 0x0000000000000000 in ?? ()
Comment 2 dustinwenz 2016-10-06 20:33:24 UTC
Here is another example of a panic:

FreeBSD 10.3-STABLE r305619

#0  doadump (textdump=<value optimized out>) at pcpu.h:219
219		__asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
#1  0xffffffff809608c2 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:486
#2  0xffffffff80960ca5 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:889
#3  0xffffffff80960b33 in panic (fmt=0x0) at /usr/src/sys/kern/kern_shutdown.c:818
#4  0xffffffff80d9c32b in trap_fatal (frame=<value optimized out>, eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:858
#5  0xffffffff80d9c62d in trap_pfault (frame=0xfffffe07c4fa46d0, usermode=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:681
#6  0xffffffff80d9bc7a in trap (frame=0xfffffe07c4fa46d0) at /usr/src/sys/amd64/amd64/trap.c:447
#7  0xffffffff80d81d0c in calltrap () at /usr/src/sys/amd64/amd64/exception.S:238
#8  0xffffffff80d8d49c in pmap_kextract (va=2409610190963724) at /usr/src/sys/amd64/amd64/pmap.c:665
#9  0xffffffff80ebb53b in bounce_bus_dmamap_load_buffer (dmat=0xfffff8000b847200, map=0xffffffff816a5858, buf=<value optimized out>, buflen=<value optimized out>, pmap=0xffffffff8172c2e0, 
    flags=<value optimized out>, segs=<value optimized out>) at /usr/src/sys/x86/x86/busdma_bounce.c:690
#10 0xffffffff80998fc2 in bus_dmamap_load_mbuf_sg (dmat=0xfffff8000b847200, map=0x0, m0=<value optimized out>, segs=0xfffffe07c4fa48a0, nsegs=0xfffffe07c4fa489c, flags=<value optimized out>)
    at /usr/src/sys/kern/subr_bus_dma.c:123
#11 0xffffffff80503c7e in igb_refresh_mbufs (rxr=0xfffff8000b841000, limit=394) at /usr/src/sys/dev/e1000/if_igb.c:4126
#12 0xffffffff80503a77 in igb_rxeof (count=<value optimized out>) at /usr/src/sys/dev/e1000/if_igb.c:5009
#13 0xffffffff80503f63 in igb_msix_que (arg=0xfffff8000b823000) at /usr/src/sys/dev/e1000/if_igb.c:1597
#14 0xffffffff8092c21b in intr_event_execute_handlers (p=<value optimized out>, ie=0xfffff8000b7ead00) at /usr/src/sys/kern/kern_intr.c:1264
#15 0xffffffff8092c666 in ithread_loop (arg=0xfffff8000b852580) at /usr/src/sys/kern/kern_intr.c:1277
#16 0xffffffff80929d4a in fork_exit (callout=0xffffffff8092c5d0 <ithread_loop>, arg=0xfffff8000b852580, frame=0xfffffe07c4fa4ac0) at /usr/src/sys/kern/kern_fork.c:1030
#17 0xffffffff80d8224e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:613
#18 0x0000000000000000 in ?? ()
Comment 3 Cassiano Peixoto 2016-10-07 12:49:56 UTC
It's the same issue that I have; see:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212413

It's not an igb issue but an ALTQ issue. It seems nobody cares about this.
Comment 4 dustinwenz 2016-10-07 15:08:51 UTC
The ALTQ code in stable hasn't changed since the middle of April. Either ALTQ is so underutilized that no one noticed a problem until recently, or a change in some other code has just created a conflict.
Comment 5 dustinwenz 2016-10-07 15:09:56 UTC
Your crash looks like the same issue. Are you aware of any other reports of it?
Comment 6 Cassiano Peixoto 2016-10-07 17:12:57 UTC
(In reply to dustinwenz from comment #5)
No, I'm not aware of any. But I'm quite sure the issue occurs when ALTQ is enabled in the kernel config.
Comment 7 mach 2016-10-08 08:54:01 UTC
Created attachment 175525
FreeBSD 11-RELEASE-p1 kernel crash ALTQ/IGB

The crash happens on all machines with IGB, but not on machines with EM. Disabling ALTQ stops the crashes on the IGB machines.
Comment 8 John W. O'Brien 2016-11-05 17:05:25 UTC
I believe I have been experiencing this too. Didn't happen at r301164. Does happen as of r306933.

Here are examples of two recent traces from 10.3-STABLE r308209 with ALTQ enabled.

#0  doadump (textdump=<value optimized out>) at pcpu.h:219
#1  0xffffffff809939b2 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#2  0xffffffff80993d95 in vpanic (fmt=<value optimized out>,
    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:889
#3  0xffffffff80993c23 in panic (fmt=0x0)
    at /usr/src/sys/kern/kern_shutdown.c:818
#4  0xffffffff80a10cb3 in sbcut_internal (sb=<value optimized out>,
    len=<value optimized out>) at /usr/src/sys/kern/uipc_sockbuf.c:886
#5  0xffffffff80b52a71 in tcp_do_segment (m=<value optimized out>,
    th=<value optimized out>, so=<value optimized out>,
    tp=<value optimized out>, drop_hdrlen=<value optimized out>, tlen=0,
    iptos=<value optimized out>, ti_locked=<value optimized out>)
    at /usr/src/sys/netinet/tcp_input.c:2838
#6  0xffffffff80b50897 in tcp_input (m=<value optimized out>,
    off0=<value optimized out>) at /usr/src/sys/netinet/tcp_input.c:1414
#7  0xffffffff80adee4b in ip_input (m=0xfffff800460cf400)
    at /usr/src/sys/netinet/ip_input.c:733
#8  0xffffffff80a761b2 in netisr_dispatch_src (proto=<value optimized out>,
    source=<value optimized out>, m=0x0) at /usr/src/sys/net/netisr.c:976
#9  0xffffffff80a6c036 in ether_demux (ifp=<value optimized out>,
    m=0xfffff800460cf400) at /usr/src/sys/net/if_ethersubr.c:851
#10 0xffffffff80a6ccde in ether_nh_input (m=<value optimized out>)
    at /usr/src/sys/net/if_ethersubr.c:646
#11 0xffffffff80a761b2 in netisr_dispatch_src (proto=<value optimized out>,
    source=<value optimized out>, m=0x0) at /usr/src/sys/net/netisr.c:976
#12 0xffffffff805101b9 in igb_rxeof (count=98)
    at /usr/src/sys/dev/e1000/if_igb.c:4790
#13 0xffffffff80510803 in igb_msix_que (arg=0xfffff8000996f400)
    at /usr/src/sys/dev/e1000/if_igb.c:1597
#14 0xffffffff8095f30b in intr_event_execute_handlers (
    p=<value optimized out>, ie=0xfffff80009979d00)
    at /usr/src/sys/kern/kern_intr.c:1264
#15 0xffffffff8095f756 in ithread_loop (arg=0xfffff800099982a0)
    at /usr/src/sys/kern/kern_intr.c:1277
#16 0xffffffff8095ce3a in fork_exit (
    callout=0xffffffff8095f6c0 <ithread_loop>, arg=0xfffff800099982a0,
    frame=0xfffffe0000340ac0) at /usr/src/sys/kern/kern_fork.c:1030
#17 0xffffffff80ddda3e in fork_trampoline ()
    at /usr/src/sys/amd64/amd64/exception.S:613
#18 0x0000000000000000 in ?? ()


#0  doadump (textdump=<value optimized out>) at pcpu.h:219
#1  0xffffffff809939b2 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#2  0xffffffff80993d95 in vpanic (fmt=<value optimized out>,
    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:889
#3  0xffffffff80993c23 in panic (fmt=0x0)
    at /usr/src/sys/kern/kern_shutdown.c:818
#4  0xffffffff80df7b7b in trap_fatal (frame=<value optimized out>,
    eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:858
#5  0xffffffff80df77ef in trap (frame=<value optimized out>)
    at /usr/src/sys/amd64/amd64/trap.c:203
#6  0xffffffff80ddd4fc in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:238
#7  0xffffffff80a0cfcf in m_tag_delete_chain (m=0xfffff80059071600,
    t=<value optimized out>) at /usr/src/sys/kern/uipc_mbuf2.c:353
#8  0xffffffff80c6c07e in uma_zfree_arg (zone=0xfffff800073fe000,
    item=0xfffff80059071600, udata=0x0) at /usr/src/sys/vm/uma_core.c:2693
#9  0xffffffff80a08cd3 in m_freem (mb=<value optimized out>) at uma.h:364
#10 0xffffffff80a6c047 in ether_demux (ifp=<value optimized out>,
    m=0xfffff80059071600) at /usr/src/sys/net/if_ethersubr.c:871
#11 0xffffffff80a6ccde in ether_nh_input (m=<value optimized out>)
    at /usr/src/sys/net/if_ethersubr.c:646
#12 0xffffffff80a761b2 in netisr_dispatch_src (proto=<value optimized out>,
    source=<value optimized out>, m=0x206c6562616c2e65)
    at /usr/src/sys/net/netisr.c:976
#13 0xffffffff805101b9 in igb_rxeof (count=98)
    at /usr/src/sys/dev/e1000/if_igb.c:4790
#14 0xffffffff80510803 in igb_msix_que (arg=0xfffff8000996f670)
    at /usr/src/sys/dev/e1000/if_igb.c:1597
#15 0xffffffff8095f30b in intr_event_execute_handlers (
    p=<value optimized out>, ie=0xfffff80009979700)
    at /usr/src/sys/kern/kern_intr.c:1264
#16 0xffffffff8095f756 in ithread_loop (arg=0xfffff800099981e0)
    at /usr/src/sys/kern/kern_intr.c:1277
#17 0xffffffff8095ce3a in fork_exit (
    callout=0xffffffff8095f6c0 <ithread_loop>, arg=0xfffff800099981e0,
    frame=0xfffffe000037cac0) at /usr/src/sys/kern/kern_fork.c:1030
#18 0xffffffff80ddda3e in fork_trampoline ()
    at /usr/src/sys/amd64/amd64/exception.S:613
#19 0x0000000000000000 in ?? ()

Let me know how else I can help isolate the issue.
Comment 9 nicolas 2016-11-06 22:14:47 UTC
I confirm this bug on FreeBSD 11.0-RELEASE. Only igb interfaces are affected; the issue does not occur on hosts using em instead of igb interfaces.

The crash seems to be related to a certain type and amount of packets.
Removing ALTQ support from the kernel fixes the issue.

It seems to be related to r303174. ALTQ seems not to have been supported with igb interfaces, according to https://forums.freebsd.org/threads/48283/ and https://lists.freebsd.org/pipermail/freebsd-pf/2016-August/008217.html

I suggest testing with r303174 removed:
---
--- head/sys/dev/e1000/if_igb.h	2016/03/22 12:40:09	297187
+++ head/sys/dev/e1000/if_igb.h	2016/05/06 15:41:38	299182
@@ -35,6 +35,10 @@
 #ifndef _IF_IGB_H_
 #define _IF_IGB_H_
 
-#ifdef ALTQ
-#define IGB_LEGACY_TX
-#endif
-
 #include <sys/param.h>
 #include <sys/systm.h>
 #ifndef IGB_LEGACY_TX
---

Did you have the same issue with MSI-X disabled?
sysctl -w hw.pci.enable_msix=0

Not sure if this is true, but MSI-X and ALTQ seem to be incompatible.
Comment 10 freebsd 2016-11-07 17:21:57 UTC
ALTQ requires the legacy single-queue mode in the igb(4) driver, but it seems this mode is unstable with high-bandwidth traffic.  However, ALTQ may still be used by adding a queue to limit throughput through igb interfaces.  For example, the following workaround has prevented crashes on my systems for the last six months:

     ## Limit bandwidth on internal interface to avoid igb driver bug
     altq on $int_if cbq bandwidth 404Mb queue { internal }
     queue internal bandwidth 99% priority 1 cbq(default red borrow)
Comment 11 emz 2016-12-05 16:36:02 UTC
I have two machines, both with igb(4) and ALTQ, and both panic after an upgrade to 11.0-RELEASE. kern/213832, reported by me, may be related to this issue. Is anyone looking into this?
Comment 12 ncrogers 2016-12-05 19:42:39 UTC
If you're not actually using ALTQ (i.e., in a loaded PF configuration), I would suggest simply removing ALTQ from your kernel config and building a new kernel.

If you ARE utilizing ALTQ with an igb interface, try setting hw.igb.num_queues=1 in loader.conf. ALTQ isn't able to utilize more than one interface queue anyway.
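
A minimal sketch of setting the tunable persistently, assuming a POSIX sh and /boot/loader.conf as the target file. The helper name set_loader_tunable is hypothetical. It replaces an existing assignment of the tunable or appends a new one; the value only takes effect on the next boot.

```sh
#!/bin/sh
# Hypothetical helper: set NAME="VALUE" in a loader.conf-style file.
# Any previous assignment of NAME is dropped; all other lines are kept.
set_loader_tunable() {
    name=$1
    value=$2
    file=${3:-/boot/loader.conf}

    touch "$file" || return 1
    tmp=$(mktemp) || return 1
    # Keep every line that does not start (after optional blanks) with NAME=.
    awk -v n="$name" '{
        l = $0
        sub(/^[ \t]*/, "", l)
        if (index(l, n "=") != 1) print
    }' "$file" > "$tmp"
    printf '%s="%s"\n' "$name" "$value" >> "$tmp"
    # Write back through the existing file so its owner and mode are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage (as root):
#   set_loader_tunable hw.igb.num_queues 1
```

After a reboot, `sysctl hw.igb.num_queues` should report the value that was set, provided the igb(4) driver in use exposes that tunable.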
Comment 13 ncrogers 2016-12-05 19:45:59 UTC
This issue is also discussed here:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=208409
Comment 14 Kubilay Kocak freebsd_committer 2016-12-06 04:21:09 UTC
For anyone who is experiencing this crash/panic, please include as a single plaintext file attachment (not as a comment):

- Crash backtrace (gdb -> bt)
- Full version (uname -a) of the system
- pciconf -lv output
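
The items above could be gathered into one plaintext file with a small script along these lines. This is a sketch only: the function name collect_report and the default file name are hypothetical, and the backtrace still has to be pasted in by hand from the kgdb session.

```sh
#!/bin/sh
# Hypothetical helper: write the requested details into a single text file
# and print the name of that file.
collect_report() {
    out=${1:-igb-altq-report.txt}
    {
        echo "== uname -a =="
        uname -a
        echo
        echo "== pciconf -lv =="
        # pciconf(8) is FreeBSD-specific; note its absence instead of failing.
        if command -v pciconf >/dev/null 2>&1; then
            pciconf -lv
        else
            echo "(pciconf not found)"
        fi
        echo
        echo "== backtrace =="
        echo "(paste the kgdb 'bt' output here)"
    } > "$out" 2>&1
    echo "$out"
}

# Usage:
#   collect_report /tmp/igb-altq-report.txt
```

The resulting file can then be attached to the bug as a single attachment rather than pasted into a comment.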
Comment 15 Luiz Otavio O Souza,+55 (14) 99772-1255 freebsd_committer 2017-02-06 06:20:45 UTC
We tracked down this bug on pfSense. The bug is quite obvious; the fix is probably less so.  We'll discuss the issue with FreeBSD people.

For now adding "hw.igb.num_queues=1" to /boot/local.conf is enough to prevent the crashes.
Comment 16 freebsd 2017-02-06 06:56:13 UTC
(In reply to Luiz Otavio O Souza,+55 (14) 99772-1255 from comment #15)
Thanks for reporting the simple workaround, but perhaps you meant to specify /boot/loader.conf.local rather than /boot/local.conf?
Comment 17 Kenneth D. Merry freebsd_committer 2017-02-20 19:07:39 UTC
I have run into this as well.  IGB_LEGACY_TX has been broken for quite some time, at least for me.  (Since mid-2015 at least.)  I bought an em(4) card to use with ALTQ, since it was easier and cheaper than trying to figure out why igb(4) was broken.  I'm using igb(4) on the internal ports on my gateway router.

Change 299182 bit me last night when I updated my gateway router from an April 2016 10-stable to a February 2017 10-stable.

So, I've got ALTQ enabled, but I'm not using it on igb(4), just on em(4) ports.

In my opinion, we should not automatically enable IGB_LEGACY_TX until it has been fixed to work reliably at any traffic rate.  Anyone enabling ALTQ with an igb(4) interface in their system may well run into problems.

It only took about 15 minutes for my router to hang, and another 15 minutes to panic after that.

FreeBSD mithlond.kdm.org 10.3-STABLE FreeBSD 10.3-STABLE #20 r313925: Sat Feb 18 17:36:43 EST 2017     ken@mithlond.kdm.org:/usr/obj/usr/home/ken/src/freebsd/stable/10/sys/mithlond  amd64

panic: sbsndptr: sockbuf 0xfffff80158d796f8 and mbuf 0xfffff80196bad600 clashing

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
panic: sbsndptr: sockbuf 0xfffff80158d796f8 and mbuf 0xfffff80196bad600 clashing
cpuid = 2
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff80364dab = db_trace_self_wrapper+0x2b/frame 0xfffffe0000359160
kdb_backtrace() at 0xffffffff809904c9 = kdb_backtrace+0x39/frame 0xfffffe0000359210
vpanic() at 0xffffffff80950e46 = vpanic+0x126/frame 0xfffffe0000359250
panic() at 0xffffffff80950d13 = panic+0x43/frame 0xfffffe00003592b0
sbsndmbuf() at 0xffffffff809d4b90 = sbsndmbuf/frame 0xfffffe00003592c0
tcp_output() at 0xffffffff80b16b9c = tcp_output+0xe6c/frame 0xfffffe00003594c0
tcp_do_segment() at 0xffffffff80b14858 = tcp_do_segment+0x2ff8/frame 0xfffffe00003595d0
tcp_input() at 0xffffffff80b10f06 = tcp_input+0x1036/frame 0xfffffe0000359730
ip_input() at 0xffffffff80a9b217 = ip_input+0x97/frame 0xfffffe0000359780
netisr_dispatch_src() at 0xffffffff80a352e2 = netisr_dispatch_src+0x62/frame 0xfffffe00003597f0
ether_demux() at 0xffffffff80a2bba6 = ether_demux+0x126/frame 0xfffffe0000359820
ether_nh_input() at 0xffffffff80a2c84e = ether_nh_input+0x35e/frame 0xfffffe0000359880
netisr_dispatch_src() at 0xffffffff80a352e2 = netisr_dispatch_src+0x62/frame 0xfffffe00003598f0
igb_rxeof() at 0xffffffff805104d6 = igb_rxeof+0x616/frame 0xfffffe0000359990
igb_msix_que() at 0xffffffff80510b7b = igb_msix_que+0x11b/frame 0xfffffe00003599e0
intr_event_execute_handlers() at 0xffffffff8091eb99 = intr_event_execute_handlers+0xb9/frame 0xfffffe0000359a20
ithread_loop() at 0xffffffff8091f556 = ithread_loop+0x96/frame 0xfffffe0000359a70
fork_exit() at 0xffffffff8091c45a = fork_exit+0x9a/frame 0xfffffe0000359ab0
fork_trampoline() at 0xffffffff80d957ce = fork_trampoline+0xe/frame 0xfffffe0000359ab0
Comment 18 freebsd 2017-02-20 23:26:36 UTC
(In reply to Kenneth D. Merry from comment #17)
Ken, does setting hw.igb.num_queues=1 keep your gateway router from crashing?  I haven't been able to check this myself, as choking my internal igb(4) interface w/ALTQ (see comment 10) has prevented crashes for the past 237 days.

FWIW, my gateway router doesn't have a free slot for an em(4) NIC :-(
Comment 19 Kenneth D. Merry freebsd_committer 2017-02-21 02:47:12 UTC
(In reply to freebsd from comment #18)

It's having IGB_LEGACY_TX turned on that makes it panic.  That also makes it use one queue.  I'm guessing igb(4) may also work with ALTQ with IGB_LEGACY_TX disabled, and hw.igb.num_queues=1 set, but that's not something I need on this particular box.

I've got em(4) on the external interface, and igb(4) on the internal interfaces.  I'm only using ALTQ for the external interface.

I could reconfigure it and try that out, but I'd rather not take the box down if I can help it. :)
Comment 20 freebsd 2017-02-21 16:18:16 UTC
(In reply to Kenneth D. Merry from comment #19)
I hear you about the downtime.  My gateway router only has igb(4) interfaces, so I must use ALTQ on the external one.  Although I don't need ALTQ on the internal interface, choking it with ALTQ keeps the system stable even with IGB_LEGACY_TX turned on.

My original goal when reporting bug #208409 was to resolve the discrepancy between the documented list of interfaces that support ALTQ and the actual OS behavior, which reports that ALTQ is not supported on igb(4) in the default configuration.  Since ALTQ does work with igb(4) as the documentation states, I suggest we expose an additional kernel config option that allows it to be enabled separately from ALTQ as a whole, eliminating the need to modify the kernel source code.  Along with a small note in the man page for altq(4), this would resolve the original problem.

For example, an option like this:

options    ALTQ_IGB    # Enable ALTQ on igb devices

Unlike exposing IGB_LEGACY_TX, this would be implementation-neutral, and eliminate any future confusion over its intent.  And when the bugs in the igb driver are fixed, it could be easily removed again.
Comment 21 freebsd 2017-02-21 20:01:43 UTC
Created attachment 180203
Add ALTQ_IGB kernel config option

A minimal patch to separate enabling ALTQ on igb(4) from ALTQ globally.  This patch is a workaround for bugs in the igb(4) driver that result in system crashes when IGB_LEGACY_TX is set, as ALTQ on igb(4) requires IGB_LEGACY_TX.  This allows easy configuration of ALTQ on systems containing igb(4) interfaces, but which don't actually need to run ALTQ on igb(4).
Comment 22 ncrogers 2017-02-21 20:13:27 UTC
(In reply to freebsd from comment #21)
This makes sense to me, although there is also the same problem for ixgbe(4) + ALTQ, where the IXGBE_LEGACY_TX path is necessary, so it would be nice to have an ALTQ_IXGBE tunable as well.
Comment 23 freebsd 2017-02-21 21:11:36 UTC
Comment on attachment 180203
Add ALTQ_IGB kernel config option

Oops, I forgot to update the man page for altq(4).  Here's the patch for that:

Index: share/man/man4/altq.4
===================================================================
--- share/man/man4/altq.4       (revision 314050)
+++ share/man/man4/altq.4       (working copy)
@@ -103,6 +103,8 @@
 Build the
 .Dq "Fair Queuing"
 discipline.
+.It Dv ALTQ_IGB
+Enable ALTQ on igb(4) devices
 .It Dv ALTQ_NOPCC
 Required if the TSC is unusable.
 .It Dv ALTQ_DEBUG
Comment 24 freebsd 2017-02-21 21:17:08 UTC
(In reply to ncrogers from comment #22)
I don't have an ixgbe(4) interface, so I can't test any changes to that code.  It should be easy to generate an analogous patch, though. Can you try it and submit what works on your system(s)?
Comment 25 freebsd 2017-02-22 19:23:48 UTC
I can confirm that setting hw.igb.num_queues=1 in /boot/loader.conf prevents crashes when running a kernel compiled with IGB_LEGACY_TX.  I removed the queue configuration from my internal igb(4) interface, and netperf's TCP stream test hasn't been able to provoke a crash after ~30 min.

Assuming the system remains stable, this is a much simpler fix than adding a new kernel config option.  Ken, if you have the time could you see if this configuration change eliminates your system instability as well?
Comment 26 commit-hook freebsd_committer 2017-02-25 20:22:35 UTC
A commit references this bug:

Author: loos
Date: Sat Feb 25 20:21:39 UTC 2017
New revision: 314281
URL: https://svnweb.freebsd.org/changeset/base/314281

Log:
  Disable the driver managed queue for igb(4) when the legacy transmit
  interface is used.

  The legacy API (IGB_LEGACY_TX) is enabled when ALTQ is built into kernel.

  As noted in altq(9), it is the responsibility of the caller to protect this
  queue against concurrent access and, in the igb case, the interface send
  queue is protected by tx queue mutex.  This obviously cannot protect the
  driver managed queue against concurrent access from different tx queues
  and leads to numerous and quite strange panic traces (usually shown as
  packets disappearing into thin air).

  Improving the locking to cope with this means serialize all access to this
  (single) queue and produces no gain, it actually affects the performance
  quite noticeably.

  The driver managed queue is already disabled when an ALTQ queue discipline
  is set on interface (in altq_enable()), because the driver managed queue
  can interfere with ALTQ timing (whence the reports that setting an ALTQ
  queue discipline on interface also fixes the issue).

  Disabling this additional queue keeps the ability to use if_start() to
  send packets to individual NIC queues while it simply eliminate the race.

  This is a direct commit to stable/11 as -head driver does not support ALTQ
  anymore.

  PR:		213257
  PR:		212413
  Discussed with:	sbruno
  Tested by:	Konstantin Kormashev <konstantin@netgate.com>
  Obtained from:	pfSense
  Sponsored by:	Rubicon Communications, LLC (Netgate)

Changes:
  stable/11/sys/dev/e1000/if_igb.c
Comment 27 commit-hook freebsd_committer 2017-03-11 07:55:09 UTC
A commit references this bug:

Author: loos
Date: Sat Mar 11 07:54:05 UTC 2017
New revision: 315060
URL: https://svnweb.freebsd.org/changeset/base/315060

Log:
  MFC of r314281:

  Disable the driver managed queue for igb(4) when the legacy transmit
  interface is used.

  The legacy API (IGB_LEGACY_TX) is enabled when ALTQ is built into kernel.

  As noted in altq(9), it is the responsibility of the caller to protect this
  queue against concurrent access and, in the igb case, the interface send
  queue is protected by tx queue mutex.  This obviously cannot protect the
  driver managed queue against concurrent access from different tx queues
  and leads to numerous and quite strange panic traces (usually shown as
  packets disappearing into thin air).

  Improving the locking to cope with this means serialize all access to this
  (single) queue and produces no gain, it actually affects the performance
  quite noticeably.

  The driver managed queue is already disabled when an ALTQ queue discipline
  is set on interface (in altq_enable()), because the driver managed queue
  can interfere with ALTQ timing (whence the reports that setting an ALTQ
  queue discipline on interface also fixes the issue).

  Disabling this additional queue keeps the ability to use if_start() to
  send packets to individual NIC queues while it simply eliminate the race.

  This is a direct commit to stable/11 as -head driver does not support ALTQ
  anymore.

  PR:		213257
  PR:		212413
  Discussed with:	sbruno
  Tested by:	Konstantin Kormashev <konstantin@netgate.com>
  Obtained from:	pfSense
  Sponsored by:	Rubicon Communications, LLC (Netgate)

Changes:
_U  stable/10/
  stable/10/sys/dev/e1000/if_igb.c
Comment 28 Luiz Otavio O Souza,+55 (14) 99772-1255 freebsd_committer 2017-03-15 07:43:50 UTC
Any issues or regressions?

If not, can I close this PR?