Bug 212413 - FreeBSD 11-RC2 crashing after some time
Summary: FreeBSD 11-RC2 crashing after some time
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.0-RC1
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-bugs mailing list
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2016-09-06 14:52 UTC by Cassiano Peixoto
Modified: 2017-03-11 07:55 UTC
CC List: 5 users

Description Cassiano Peixoto 2016-09-06 14:52:11 UTC
I updated from 10.3-STABLE to 11-RC2, and after some time the server simply crashes. No services are running on it, so there is no load, but it keeps crashing after anywhere from a few minutes to a few hours.

It reminds me of a bug on RELENG11 that Glen Barber mentioned about a month ago,
where the system crashed all the time.

It seems to be a problem with the igb driver, but I'm not sure.

Let me know if I can provide more info.

Thanks.

Below is my debug output:

messages:

Sep  6 10:44:00 B-ras kernel: Fatal trap 12: page fault while in kernel mode
Sep  6 10:44:00 B-ras kernel: cpuid = 1; apic id = 02
Sep  6 10:44:00 B-ras kernel: fault virtual address	= 0x78
Sep  6 10:44:00 B-ras kernel: fault code		= supervisor read data, page not present
Sep  6 10:44:00 B-ras kernel: instruction pointer	= 0x20:0xffffffff80aff74c
Sep  6 10:44:00 B-ras kernel: stack pointer	        = 0x28:0xfffffe000024d8d0
Sep  6 10:44:00 B-ras kernel: frame pointer	        = 0x28:0xfffffe000024d930
Sep  6 10:44:00 B-ras kernel: code segment		= base rx0, limit 0xfffff, type 0x1b
Sep  6 10:44:00 B-ras kernel: = DPL 0, pres 1, long 1, def32 0, gran 1


kgdb output:

# kgdb kernel.debug /var/crash/vmcore.last
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 12 (irq267: igb2:que 1)
trap number		= 12
panic: page fault
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff80a5bc17 at kdb_backtrace+0x67
#1 0xffffffff80a1db22 at vpanic+0x182
#2 0xffffffff80a1d993 at panic+0x43
#3 0xffffffff80eb4581 at trap_fatal+0x351
#4 0xffffffff80eb4773 at trap_pfault+0x1e3
#5 0xffffffff80eb3d1c at trap+0x26c
#6 0xffffffff80e9add1 at calltrap+0x8
#7 0xffffffff80b13e75 at netisr_dispatch_src+0xa5
#8 0xffffffff80aff076 at ether_input+0x26
#9 0xffffffff80531b57 at igb_rxeof+0x797
#10 0xffffffff80532e7f at igb_msix_que+0xdf
#11 0xffffffff809e88f8 at intr_event_execute_handlers+0xe8
#12 0xffffffff809e8bca at ithread_loop+0xba
#13 0xffffffff809e6258 at fork_exit+0xc8
#14 0xffffffff80e9b30e at fork_trampoline+0xe
Uptime: 4h34m20s
Dumping 989 out of 8144 MB: (CTRL-C to abort) ..2%..12%..22%..31%..41%..51%..62%..72%..81%..91%

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/kernel/coretemp.ko...Reading symbols from /usr/lib/debug//boot/kernel/coretemp.ko.debug...done.
done.
Loaded symbols for /boot/kernel/coretemp.ko
Reading symbols from /boot/kernel/ng_socket.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_socket.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ng_socket.ko
Reading symbols from /boot/kernel/netgraph.ko...Reading symbols from /usr/lib/debug//boot/kernel/netgraph.ko.debug...done.
done.
Loaded symbols for /boot/kernel/netgraph.ko
Reading symbols from /boot/kernel/ng_mppc.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_mppc.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ng_mppc.ko
Reading symbols from /boot/kernel/rc4.ko...Reading symbols from /usr/lib/debug//boot/kernel/rc4.ko.debug...done.
done.
Loaded symbols for /boot/kernel/rc4.ko
Reading symbols from /boot/kernel/ng_ether.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_ether.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ng_ether.ko
Reading symbols from /boot/kernel/ng_pppoe.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_pppoe.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ng_pppoe.ko
#0  doadump (textdump=<value optimized out>) at pcpu.h:221
221	pcpu.h: No such file or directory.
	in pcpu.h
(kgdb) list *0xffffffff80aff74c
0xffffffff80aff74c is in ether_nh_input (/usr/src/sys/net/if_ethersubr.c:472).
467	ether_input_internal(struct ifnet *ifp, struct mbuf *m)
468	{
469		struct ether_header *eh;
470		u_short etype;
471	
472		if ((ifp->if_flags & IFF_UP) == 0) {
473			m_freem(m);
474			return;
475		}
476	#ifdef DIAGNOSTIC
Current language:  auto; currently minimal
(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:221
#1  0xffffffff80a1d76a in kern_reboot (howto=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80a1db5b in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3  0xffffffff80a1d993 in panic (fmt=0x0) at /usr/src/sys/kern/kern_shutdown.c:690
#4  0xffffffff80eb4581 in trap_fatal (frame=0xfffffe000024d820, eva=120) at /usr/src/sys/amd64/amd64/trap.c:841
#5  0xffffffff80eb4773 in trap_pfault (frame=0xfffffe000024d820, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:691
#6  0xffffffff80eb3d1c in trap (frame=0xfffffe000024d820) at /usr/src/sys/amd64/amd64/trap.c:442
#7  0xffffffff80e9add1 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:236
#8  0xffffffff80aff74c in ether_nh_input (m=0xfffff8002216e600) at /usr/src/sys/net/if_ethersubr.c:669
#9  0xffffffff80b13e75 in netisr_dispatch_src (proto=5, source=<value optimized out>, m=0xfffff8002216e600) at /usr/src/sys/net/netisr.c:1121
#10 0xffffffff80aff076 in ether_input (ifp=<value optimized out>, m=0xfffff8002feb0800) at /usr/src/sys/net/if_ethersubr.c:759
#11 0xffffffff80531b57 in igb_rxeof (que=<value optimized out>, count=<value optimized out>, done=<value optimized out>) at /usr/src/sys/dev/e1000/if_igb.c:4957
#12 0xffffffff80532e7f in igb_msix_que (arg=0xfffff80006468468) at /usr/src/sys/dev/e1000/if_igb.c:1612
#13 0xffffffff809e88f8 in intr_event_execute_handlers (p=<value optimized out>, ie=<value optimized out>) at /usr/src/sys/kern/kern_intr.c:1262
#14 0xffffffff809e8bca in ithread_loop (arg=<value optimized out>) at /usr/src/sys/kern/kern_intr.c:1275
#15 0xffffffff809e6258 in fork_exit (callout=0xffffffff809e8b10 <ithread_loop>, arg=0xfffff80006463920, frame=0xfffffe000024dc00) at /usr/src/sys/kern/kern_fork.c:1038
#16 0xffffffff80e9b30e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:611
#17 0x0000000000000000 in ?? ()
(kgdb)
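
The fault address in the trace above (0x78, shown as eva=120 in frame #4) is consistent with reading a field through a NULL structure pointer: the CPU adds the field's offset to a base of zero and faults on that small address. The faulting line, `ifp->if_flags & IFF_UP`, fits that pattern if `ifp` was NULL, although the dump alone does not prove it. A minimal sketch of the arithmetic follows; `struct fake_ifnet` and `field_address()` are hypothetical names, and the layout is invented for illustration rather than copied from the real `struct ifnet`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a large kernel structure. This is NOT the
 * real struct ifnet layout; the padding is chosen only so that the flags
 * field lands at offset 0x78, the fault address seen in the report.
 */
struct fake_ifnet {
	unsigned char	opaque[0x78];	/* fields that precede the flags word */
	int		if_flags;	/* assumed position of the flags word */
};

/* Compile-time check that the illustrative layout matches the text. */
static_assert(offsetof(struct fake_ifnet, if_flags) == 0x78,
    "if_flags must sit at offset 0x78 in this sketch");

/*
 * Address the CPU would try to read for base->if_flags. It is computed
 * arithmetically, so nothing is actually dereferenced.
 */
static uintptr_t
field_address(uintptr_t base)
{
	return (base + offsetof(struct fake_ifnet, if_flags));
}
```

Calling `field_address(0)` yields 0x78, which is why a small non-zero fault address in a kernel page fault often points to a NULL structure pointer rather than a random wild pointer.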
Comment 1 Cassiano Peixoto 2016-09-12 14:00:28 UTC
Guys,

Just an update on this issue: I had to remove ALTQ from my kernel, and after that it stopped crashing. So it looks like some conflict with ALTQ.
Comment 2 nicolas 2016-11-06 21:47:32 UTC
I can confirm this bug on FreeBSD 11.0-RELEASE. Only igb interfaces are affected; the issue doesn't occur on hosts using em interfaces instead of igb.

The crash seems to be related to a certain type and amount of packets.
Removing ALTQ support from the kernel fixes the issue.
Comment 3 freebsd 2016-11-07 17:11:31 UTC
(In reply to nicolas from comment #2)
FYI, including ALTQ in the kernel config switches igb(4) interfaces to legacy single-queue mode.  I have seen crashes like these with high-bandwidth traffic when the driver is in this mode.  If you need ALTQ, add a queue to limit throughput through igb interfaces.  For example, the following workaround has prevented crashes on my systems for the last six months:

     ## Limit bandwidth on internal interface to avoid igb driver bug
     altq on $int_if cbq bandwidth 404Mb queue { internal }
     queue internal bandwidth 99% priority 1 cbq(default red borrow)
Comment 4 ncrogers 2016-12-05 19:46:38 UTC
This issue is also discussed here:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=208409

and here:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=213257
Comment 5 commit-hook freebsd_committer 2017-02-25 20:22:41 UTC
A commit references this bug:

Author: loos
Date: Sat Feb 25 20:21:39 UTC 2017
New revision: 314281
URL: https://svnweb.freebsd.org/changeset/base/314281

Log:
  Disable the driver managed queue for igb(4) when the legacy transmit
  interface is used.

  The legacy API (IGB_LEGACY_TX) is enabled when ALTQ is built into kernel.

  As noted in altq(9), it is responsibility of the caller to protect this
  queue against concurrent access and, in the igb case, the interface send
  queue is protected by tx queue mutex.  This obviously cannot protect the
  driver managed queue against concurrent access from different tx queues
  and leads to numerous and quite strange panic traces (usually shown as
  packets disappearing into thin air).

  Improving the locking to cope with this means serialize all access to this
  (single) queue and produces no gain, it actually affects the performance
  quite noticeably.

  The driver managed queue is already disabled when an ALTQ queue discipline
  is set on interface (in altq_enable()), because the driver managed queue
  can interfere with ALTQ timing (whence the reports that setting an ALTQ
  queue discipline on interface also fixes the issue).

  Disabling this additional queue keeps the ability to use if_start() to
  send packets to individual NIC queues while it simply eliminates the race.

  This is a direct commit to stable/11 as -head driver does not support ALTQ
  anymore.

  PR:		213257
  PR:		212413
  Discussed with:	sbruno
  Tested by:	Konstantin Kormashev <konstantin@netgate.com>
  Obtained from:	pfSense
  Sponsored by:	Rubicon Communications, LLC (Netgate)

Changes:
  stable/11/sys/dev/e1000/if_igb.c
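
The locking rule described in the log above can be illustrated in user space: a queue shared by several producers is only safe if every producer takes the same lock. The sketch below uses POSIX threads, not kernel mutexes, and reduces the queue to a counter. `shared_queue`, `producer()`, `run_producers()`, `NPRODUCERS`, and `NPACKETS` are hypothetical names chosen for this example. It shows the serialized variant only; the commit itself took a different route and disabled the driver managed queue instead, because serializing access cost too much performance.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NPRODUCERS	4	/* hypothetical number of tx queues */
#define NPACKETS	10000	/* packets enqueued by each producer */

/* One shared queue, reduced to a length counter for brevity. */
struct shared_queue {
	pthread_mutex_t	lock;	/* the single lock covering the queue */
	size_t		len;	/* number of queued packets */
};

static struct shared_queue q = { PTHREAD_MUTEX_INITIALIZER, 0 };

static void *
producer(void *arg)
{
	(void)arg;
	for (int i = 0; i < NPACKETS; i++) {
		/*
		 * Every producer takes the SAME lock. If each producer
		 * took its own private lock instead, two of them could
		 * update len at the same time and lose increments.
		 */
		pthread_mutex_lock(&q.lock);
		q.len++;
		pthread_mutex_unlock(&q.lock);
	}
	return (NULL);
}

/* Run all producers to completion and return the final queue length. */
static size_t
run_producers(void)
{
	pthread_t tid[NPRODUCERS];

	q.len = 0;
	for (int i = 0; i < NPRODUCERS; i++) {
		int rc = pthread_create(&tid[i], NULL, producer, NULL);
		assert(rc == 0);	/* sketch only: abort on failure */
		(void)rc;
	}
	for (int i = 0; i < NPRODUCERS; i++)
		pthread_join(tid[i], NULL);
	return (q.len);
}
```

With the shared lock in place, `run_producers()` returns exactly `NPRODUCERS * NPACKETS`. Swapping the shared lock for one lock per producer would mirror the situation in the log, where each tx queue mutex guards its own queue but none of them guards the common one.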
Comment 6 commit-hook freebsd_committer 2017-03-11 07:55:16 UTC
A commit references this bug:

Author: loos
Date: Sat Mar 11 07:54:05 UTC 2017
New revision: 315060
URL: https://svnweb.freebsd.org/changeset/base/315060

Log:
  MFC of r314281:

  Disable the driver managed queue for igb(4) when the legacy transmit
  interface is used.

  The legacy API (IGB_LEGACY_TX) is enabled when ALTQ is built into kernel.

  As noted in altq(9), it is responsibility of the caller to protect this
  queue against concurrent access and, in the igb case, the interface send
  queue is protected by tx queue mutex.  This obviously cannot protect the
  driver managed queue against concurrent access from different tx queues
  and leads to numerous and quite strange panic traces (usually shown as
  packets disappearing into thin air).

  Improving the locking to cope with this means serialize all access to this
  (single) queue and produces no gain, it actually affects the performance
  quite noticeably.

  The driver managed queue is already disabled when an ALTQ queue discipline
  is set on interface (in altq_enable()), because the driver managed queue
  can interfere with ALTQ timing (whence the reports that setting an ALTQ
  queue discipline on interface also fixes the issue).

  Disabling this additional queue keeps the ability to use if_start() to
  send packets to individual NIC queues while it simply eliminates the race.

  This is a direct commit to stable/11 as -head driver does not support ALTQ
  anymore.

  PR:		213257
  PR:		212413
  Discussed with:	sbruno
  Tested by:	Konstantin Kormashev <konstantin@netgate.com>
  Obtained from:	pfSense
  Sponsored by:	Rubicon Communications, LLC (Netgate)

Changes:
_U  stable/10/
  stable/10/sys/dev/e1000/if_igb.c