I've updated from 10.3-STABLE to 11-RC2. After some time my server just crashes. I have no services running on this server, so there is no load, but it keeps crashing after some time (some minutes or hours). It reminds me of a bug from about a month ago on RELENG11, mentioned by Glen Barber, where the system crashed all the time. It seems to be a problem with the igb driver, but I'm not sure. Let me know if I can provide more info. Thanks.

Below is my debug output:

messages:
Sep 6 10:44:00 B-ras kernel: Fatal trap 12: page fault while in kernel mode
Sep 6 10:44:00 B-ras kernel: cpuid = 1; apic id = 02
Sep 6 10:44:00 B-ras kernel: fault virtual address = 0x78
Sep 6 10:44:00 B-ras kernel: fault code = supervisor read data, page not present
Sep 6 10:44:00 B-ras kernel: instruction pointer = 0x20:0xffffffff80aff74c
Sep 6 10:44:00 B-ras kernel: stack pointer = 0x28:0xfffffe000024d8d0
Sep 6 10:44:00 B-ras kernel: frame pointer = 0x28:0xfffffe000024d930
Sep 6 10:44:00 B-ras kernel: code segment = base rx0, limit 0xfffff, type 0x1b
Sep 6 10:44:00 B-ras kernel: = DPL 0, pres 1, long 1, def32 0, gran 1

kgdb output:
# kgdb kernel.debug /var/crash/vmcore.last
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Unread portion of the kernel message buffer:
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq267: igb2:que 1)
trap number = 12
panic: page fault
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff80a5bc17 at kdb_backtrace+0x67
#1 0xffffffff80a1db22 at vpanic+0x182
#2 0xffffffff80a1d993 at panic+0x43
#3 0xffffffff80eb4581 at trap_fatal+0x351
#4 0xffffffff80eb4773 at trap_pfault+0x1e3
#5 0xffffffff80eb3d1c at trap+0x26c
#6 0xffffffff80e9add1 at calltrap+0x8
#7 0xffffffff80b13e75 at netisr_dispatch_src+0xa5
#8 0xffffffff80aff076 at ether_input+0x26
#9 0xffffffff80531b57 at igb_rxeof+0x797
#10 0xffffffff80532e7f at igb_msix_que+0xdf
#11 0xffffffff809e88f8 at intr_event_execute_handlers+0xe8
#12 0xffffffff809e8bca at ithread_loop+0xba
#13 0xffffffff809e6258 at fork_exit+0xc8
#14 0xffffffff80e9b30e at fork_trampoline+0xe
Uptime: 4h34m20s
Dumping 989 out of 8144 MB: (CTRL-C to abort) ..2%..12%..22%..31%..41%..51%..62%..72%..81%..91%

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/kernel/coretemp.ko...Reading symbols from /usr/lib/debug//boot/kernel/coretemp.ko.debug...done.
done.
Loaded symbols for /boot/kernel/coretemp.ko
Reading symbols from /boot/kernel/ng_socket.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_socket.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ng_socket.ko
Reading symbols from /boot/kernel/netgraph.ko...Reading symbols from /usr/lib/debug//boot/kernel/netgraph.ko.debug...done.
done.
Loaded symbols for /boot/kernel/netgraph.ko
Reading symbols from /boot/kernel/ng_mppc.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_mppc.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ng_mppc.ko
Reading symbols from /boot/kernel/rc4.ko...Reading symbols from /usr/lib/debug//boot/kernel/rc4.ko.debug...done.
done.
Loaded symbols for /boot/kernel/rc4.ko
Reading symbols from /boot/kernel/ng_ether.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_ether.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ng_ether.ko
Reading symbols from /boot/kernel/ng_pppoe.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_pppoe.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ng_pppoe.ko
#0 doadump (textdump=<value optimized out>) at pcpu.h:221
221 pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) list *0xffffffff80aff74c
0xffffffff80aff74c is in ether_nh_input (/usr/src/sys/net/if_ethersubr.c:472).
467 ether_input_internal(struct ifnet *ifp, struct mbuf *m)
468 {
469         struct ether_header *eh;
470         u_short etype;
471
472         if ((ifp->if_flags & IFF_UP) == 0) {
473                 m_freem(m);
474                 return;
475         }
476 #ifdef DIAGNOSTIC
Current language: auto; currently minimal
(kgdb) bt
#0 doadump (textdump=<value optimized out>) at pcpu.h:221
#1 0xffffffff80a1d76a in kern_reboot (howto=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:366
#2 0xffffffff80a1db5b in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3 0xffffffff80a1d993 in panic (fmt=0x0) at /usr/src/sys/kern/kern_shutdown.c:690
#4 0xffffffff80eb4581 in trap_fatal (frame=0xfffffe000024d820, eva=120) at /usr/src/sys/amd64/amd64/trap.c:841
#5 0xffffffff80eb4773 in trap_pfault (frame=0xfffffe000024d820, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:691
#6 0xffffffff80eb3d1c in trap (frame=0xfffffe000024d820) at /usr/src/sys/amd64/amd64/trap.c:442
#7 0xffffffff80e9add1 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:236
#8 0xffffffff80aff74c in ether_nh_input (m=0xfffff8002216e600) at /usr/src/sys/net/if_ethersubr.c:669
#9 0xffffffff80b13e75 in netisr_dispatch_src (proto=5, source=<value optimized out>, m=0xfffff8002216e600) at /usr/src/sys/net/netisr.c:1121
#10 0xffffffff80aff076 in ether_input (ifp=<value optimized out>, m=0xfffff8002feb0800) at /usr/src/sys/net/if_ethersubr.c:759
#11 0xffffffff80531b57 in igb_rxeof (que=<value optimized out>, count=<value optimized out>, done=<value optimized out>) at /usr/src/sys/dev/e1000/if_igb.c:4957
#12 0xffffffff80532e7f in igb_msix_que (arg=0xfffff80006468468) at /usr/src/sys/dev/e1000/if_igb.c:1612
#13 0xffffffff809e88f8 in intr_event_execute_handlers (p=<value optimized out>, ie=<value optimized out>) at /usr/src/sys/kern/kern_intr.c:1262
#14 0xffffffff809e8bca in ithread_loop (arg=<value optimized out>) at /usr/src/sys/kern/kern_intr.c:1275
#15 0xffffffff809e6258 in fork_exit (callout=0xffffffff809e8b10 <ithread_loop>, arg=0xfffff80006463920, frame=0xfffffe000024dc00) at /usr/src/sys/kern/kern_fork.c:1038
#16 0xffffffff80e9b30e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:611
#17 0x0000000000000000 in ?? ()
(kgdb)
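A note on the fault address, offered as one reading of the trace and not something confirmed in this thread: a fault virtual address as small as 0x78 is what a load through a NULL struct pointer tends to look like, because the faulting address is then just the offset of the field being read. The kgdb listing points at the ifp->if_flags test, so a NULL ifp is a plausible candidate. The sketch below uses a hypothetical struct fake_ifnet whose padding is chosen only so that the flags field sits at offset 0x78; it is not the real struct ifnet layout, and field_fault_addr() is an illustrative helper, not a kernel function.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, NOT the real struct ifnet. The padding is picked
 * so that if_flags lands at offset 0x78, mirroring the reported address.
 */
struct fake_ifnet {
	char pad[0x78];
	int  if_flags;
};

/*
 * Address that a read of ifp->if_flags would touch. Computed with
 * integer arithmetic only, so nothing is actually dereferenced.
 */
uintptr_t
field_fault_addr(const struct fake_ifnet *ifp)
{
	return ((uintptr_t)ifp + offsetof(struct fake_ifnet, if_flags));
}
```

Calling field_fault_addr(NULL) on this toy layout yields the field offset itself, which is the pattern to look for when comparing a small fault address against the struct member named on the faulting source line.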
Guys, just an update about this issue. I had to remove ALTQ from my kernel. After that it stopped crashing. So it looks like some conflict with ALTQ.
I confirm this bug on FreeBSD 11.0 RELEASE. Only igb interfaces are affected, as the issue doesn't occur on hosts using em instead of igb interfaces. The crash seems to be related to a certain type and amount of packets. Removing ALTQ support from the kernel fixes the issue.
(In reply to nicolas from comment #2)

FYI, including ALTQ in the kernel config switches igb(4) interfaces to legacy single-queue mode. I have seen crashes like these with high-bandwidth traffic when the driver is in this mode. If you need ALTQ, add a queue to limit throughput through igb interfaces. For example, the following workaround has prevented crashes on my systems for the last six months:

## Limit bandwidth on internal interface to avoid igb driver bug
altq on $int_if cbq bandwidth 404Mb queue { internal }
queue internal bandwidth 99% priority 1 cbq(default red borrow)
This issue is also discussed here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=208409 and here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=213257
A commit references this bug:

Author: loos
Date: Sat Feb 25 20:21:39 UTC 2017
New revision: 314281
URL: https://svnweb.freebsd.org/changeset/base/314281

Log:
  Disable the driver managed queue for igb(4) when the legacy transmit interface is used.

  The legacy API (IGB_LEGACY_TX) is enabled when ALTQ is built into kernel.

  As noted in altq(9), it is responsibility of the caller to protect this queue against concurrent access and, in the igb case, the interface send queue is protected by tx queue mutex. This obviously cannot protect the driver managed queue against concurrent access from different tx queues and leads to numerous and quite strange panic traces (usually shown as packets disappearing into thin air).

  Improving the locking to cope with this means serialize all access to this (single) queue and produces no gain, it actually affects the performance quite noticeabily.

  The driver managed queue is already disabled when an ALTQ queue discipline is set on interface (in altq_enable()), because the driver managed queue can interfere with ALTQ timing (whence the reports that setting an ALTQ queue discipline on interface also fixes the issue).

  Disabling this additional queue keeps the ability to use if_start() to send packets to individual NIC queues while it simply eliminate the race.

  This is a direct commit to stable/11 as -head driver does not support ALTQ anymore.

  PR: 213257
  PR: 212413
  Discussed with: sbruno
  Tested by: Konstantin Kormashev <konstantin@netgate.com>
  Obtained from: pfSense
  Sponsored by: Rubicon Communications, LLC (Netgate)

Changes:
  stable/11/sys/dev/e1000/if_igb.c
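For readers less familiar with the locking problem the commit log describes, here is a userspace toy model; it is only a sketch under stated assumptions and is not the igb(4) code. The log says that per-tx-queue mutexes cannot protect one shared driver managed queue, and that the alternative of serializing every access to that single queue works but costs performance. The model below shows that rejected-but-correct alternative: several producers (standing in for tx queues) all take the same lock before touching one shared queue. The names shared_queue, producer, run_serialized, NTXQ and PER_THREAD are all hypothetical, and the "queue" is reduced to a counter.

```c
#include <pthread.h>
#include <stddef.h>

#define NTXQ       4      /* hypothetical number of tx queues */
#define PER_THREAD 10000  /* enqueue operations per producer */

/* Toy stand-in for the single shared software queue: just a counter. */
struct shared_queue {
	pthread_mutex_t lock;     /* ONE lock taken by every producer */
	long            enqueued;
};

/* One producer per tx queue; each enqueue is done under the shared lock. */
static void *
producer(void *arg)
{
	struct shared_queue *q = arg;

	for (int i = 0; i < PER_THREAD; i++) {
		pthread_mutex_lock(&q->lock);
		q->enqueued++;
		pthread_mutex_unlock(&q->lock);
	}
	return (NULL);
}

/* Runs NTXQ producers against one shared queue; returns the final count. */
long
run_serialized(void)
{
	struct shared_queue q = { .enqueued = 0 };
	pthread_t tid[NTXQ];

	pthread_mutex_init(&q.lock, NULL);
	for (int i = 0; i < NTXQ; i++)
		pthread_create(&tid[i], NULL, producer, &q);
	for (int i = 0; i < NTXQ; i++)
		pthread_join(tid[i], NULL);
	pthread_mutex_destroy(&q.lock);
	return (q.enqueued);
}
```

With the single shared lock, no update is lost and the count comes out to NTXQ * PER_THREAD. If each producer instead held only a private lock of its own, as with per-queue mutexes, nothing would exclude the other producers from the shared state, which is the race the commit removes by disabling the extra queue instead of adding this serialization.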
A commit references this bug:

Author: loos
Date: Sat Mar 11 07:54:05 UTC 2017
New revision: 315060
URL: https://svnweb.freebsd.org/changeset/base/315060

Log:
  MFC of r314281:

  Disable the driver managed queue for igb(4) when the legacy transmit interface is used.

  The legacy API (IGB_LEGACY_TX) is enabled when ALTQ is built into kernel.

  As noted in altq(9), it is responsibility of the caller to protect this queue against concurrent access and, in the igb case, the interface send queue is protected by tx queue mutex. This obviously cannot protect the driver managed queue against concurrent access from different tx queues and leads to numerous and quite strange panic traces (usually shown as packets disappearing into thin air).

  Improving the locking to cope with this means serialize all access to this (single) queue and produces no gain, it actually affects the performance quite noticeabily.

  The driver managed queue is already disabled when an ALTQ queue discipline is set on interface (in altq_enable()), because the driver managed queue can interfere with ALTQ timing (whence the reports that setting an ALTQ queue discipline on interface also fixes the issue).

  Disabling this additional queue keeps the ability to use if_start() to send packets to individual NIC queues while it simply eliminate the race.

  This is a direct commit to stable/11 as -head driver does not support ALTQ anymore.

  PR: 213257
  PR: 212413
  Discussed with: sbruno
  Tested by: Konstantin Kormashev <konstantin@netgate.com>
  Obtained from: pfSense
  Sponsored by: Rubicon Communications, LLC (Netgate)

Changes:
_U  stable/10/
  stable/10/sys/dev/e1000/if_igb.c
Committed back in 2017. ^Triage: assign to committer that resolved.