Created attachment 200136 [details] dmesg.boot

Since updating from 11.2 to 12.0-RELEASE, two of our systems have given this same panic a total of three times in the last 24 hours. Usually the instruction pointer is 0, but not always:

---
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0x0
fault code = supervisor read instruction, page not present
instruction pointer = 0x20:0x0
stack pointer = 0x28:0xfffffe0000470a78
frame pointer = 0x28:0xfffffe0000470b20
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi4: clock (0))
trap number = 12
panic: page fault
cpuid = 1
time = 1544894686
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff80bb5a39 at softclock+0x79
#8 0xffffffff80b5ee17 at ithread_loop+0x1a7
#9 0xffffffff80b5bf33 at fork_exit+0x83
#10 0xffffffff810501be at fork_trampoline+0xe
Uptime: 1d18h48m53s
---
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x158
fault code = supervisor read instruction, page not present
instruction pointer = 0x20:0x158
stack pointer = 0x28:0xfffffe0000470a78
frame pointer = 0x28:0xfffffe0000470b20
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi4: clock (0))
trap number = 12
panic: page fault
cpuid = 0
time = 1544781943
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff80bb5a39 at softclock+0x79
#8 0xffffffff80b5ee17 at ithread_loop+0x1a7
#9 0xffffffff80b5bf33 at fork_exit+0x83
#10 0xffffffff810501be at fork_trampoline+0xe
Uptime: 11h29m48s
---
Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 14
fault virtual address = 0x0
fault code = supervisor read instruction, page not present
instruction pointer = 0x20:0x0
stack pointer = 0x0:0xfffffe0000470a78
frame pointer = 0x0:0xfffffe0000470b20
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi4: clock (0))
trap number = 12
panic: page fault
cpuid = 6
time = 1544887543
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff80bb5a39 at softclock+0x79
#8 0xffffffff80b5ee17 at ithread_loop+0x1a7
#9 0xffffffff80b5bf33 at fork_exit+0x83
#10 0xffffffff810501be at fork_trampoline+0xe
Uptime: 1d5h17m30s
---

Both systems are identical (a bit old, though) Supermicro 5016T-MTFB. dmesg.boot from one of them is attached. I don't have dumps of these, for some reason -- I'm looking into why so I can capture the next one.
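For anyone lining these panics up against other logs, the `time =` values in the three reports can be converted to wall-clock UTC. This is only a minimal sketch: it assumes `time =` is seconds since the Unix epoch, and the helper names `to_utc` and `span_hours` are made up for illustration.

```python
from datetime import datetime, timezone

# `time =` values copied from the three panic messages in this comment.
# Assumption: they are seconds since the Unix epoch.
PANIC_TIMES = [1544894686, 1544781943, 1544887543]


def to_utc(epoch_seconds: int) -> str:
    """Format a Unix timestamp as an ISO 8601 string in UTC."""
    stamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%dT%H:%M:%SZ")


def span_hours(times) -> float:
    """Hours between the earliest and latest timestamp in `times`."""
    return (max(times) - min(times)) / 3600.0


if __name__ == "__main__":
    # Print the panics in chronological order, then the overall span.
    for t in sorted(PANIC_TIMES):
        print(t, to_utc(t))
    print("span (hours):", round(span_hours(PANIC_TIMES), 2))
```

The same conversion is available from a shell on FreeBSD with `date -u -r <seconds>`.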
I have minidumps of these now, and it's happening on multiple generations of hardware (so far, all older pre-UEFI stuff).
# kgdb /boot/kernel/kernel vmcore.0
GNU gdb (GDB) 8.2 [GDB v8.2 for FreeBSD]
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd12.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...(no debugging symbols found)...done.
0xffffffff80bcd0bd in sched_switch ()
(kgdb) bt
#0 0xffffffff80bcd0bd in sched_switch ()
#1 0xffffffff80ba6de1 in mi_switch ()
#2 0xffffffff80bf554c in sleepq_wait ()
#3 0xffffffff80ba6817 in _sleep ()
#4 0xffffffff80bfae71 in taskqueue_thread_loop ()
#5 0xffffffff80b5bf33 in fork_exit ()
#6 <signal handler called>
(kgdb)

I can't get anything useful out of lldb or /usr/libexec/kgdb.
Created attachment 200671 [details] dmesg(8) from successful boot of 11.2-RELEASE-p4 on HP DL380 G7
I have exactly the same panic which prevents booting 12.0-RELEASE on HP DL380 Gen7 server hardware:

---
frame pointer = 0x28:0xfffffe0000577a70
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi4: clock (0))
trap number = 12
panic: page fault
cpuid = 0
time = 3
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff80bb554e at softclock_call_cc+0x12e
#8 0xffffffff80bb5a39 at softclock+0x79
#9 0xffffffff80b5ee17 at ithread_loop+0x1a7
#10 0xffffffff80b5bf33 at fork_exit+0x83
#11 0xffffffff810501be at fork_trampoline+0xe
Uptime: 3s
Automatic reboot in 15 seconds - press a key on the console to abort
---

See attached dmesg from 11.2 on this hardware.
Another, slightly different, but still looks vaguely clock related.

kernel trap 9 with interrupts disabled

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff80bb524f
stack pointer = 0x28:0xfffffe0040332770
frame pointer = 0x28:0xfffffe00403327e0
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 11 (idle: cpu0)
trap number = 9
panic: general protection fault
cpuid = 0
time = 1546921049
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff81073dbd at trap+0x6d
#5 0xffffffff8104f1d5 at calltrap+0x8
#6 0xffffffff811aa358 at handleevents+0x1a8
#7 0xffffffff811aabc1 at timercb+0x2a1
#8 0xffffffff8107a7f9 at hpet_intr_single+0x1b9
#9 0xffffffff8107a89e at hpet_intr+0x8e
#10 0xffffffff80b5ebbd at intr_event_handle+0xbd
#11 0xffffffff811e2928 at intr_execute_handlers+0x58
#12 0xffffffff811e8a34 at lapic_handle_intr+0x44
#13 0xffffffff81050489 at Xapic_isr1+0xd9
#14 0xffffffff80459e47 at acpi_cpu_idle+0x2e7
#15 0xffffffff811df68f at cpu_idle_acpi+0x3f
#16 0xffffffff811df747 at cpu_idle+0xa7
#17 0xffffffff80bcfb25 at sched_idletd+0x515
Uptime: 13d4h15m15s
Dumping 1981 out of 8146 MB: (CTRL-C to abort) ..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
Dump complete
Automatic reboot in 60 seconds - press a key on the console to abort
Rebooting...
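To compare this trace with the earlier softclock ones without eyeballing addresses, the KDB frames can be pulled apart mechanically. The following is a rough sketch, not a kernel-debugging tool: the regular expression assumes the `#N 0xADDR at symbol+0xOFF` layout seen in this thread, and `parse_frames` and `frame_after_trap` are hypothetical helper names.

```python
import re

# Matches one KDB frame such as "#7 0xffffffff80bb5a39 at softclock+0x79".
# Assumption: every frame follows the layout used in the traces in this thread.
FRAME_RE = re.compile(
    r"#(\d+) (0x[0-9a-f]+) at ([A-Za-z_][A-Za-z0-9_.]*)\+(0x[0-9a-f]+)"
)


def parse_frames(text):
    """Return (index, address, symbol, offset) tuples for each frame found."""
    return [
        (int(idx), int(addr, 16), sym, int(off, 16))
        for idx, addr, sym, off in FRAME_RE.findall(text)
    ]


def frame_after_trap(text):
    """Return the symbol of the first frame listed after calltrap, or None."""
    symbols = [frame[2] for frame in parse_frames(text)]
    if "calltrap" not in symbols:
        return None
    pos = symbols.index("calltrap")
    return symbols[pos + 1] if pos + 1 < len(symbols) else None


if __name__ == "__main__":
    # Excerpts copied from the page-fault trace and the trap 9 trace.
    page_fault = (
        "#6 0xffffffff8104f1d5 at calltrap+0x8 "
        "#7 0xffffffff80bb5a39 at softclock+0x79 "
        "#8 0xffffffff80b5ee17 at ithread_loop+0x1a7"
    )
    gp_fault = (
        "#5 0xffffffff8104f1d5 at calltrap+0x8 "
        "#6 0xffffffff811aa358 at handleevents+0x1a8 "
        "#7 0xffffffff811aabc1 at timercb+0x2a1"
    )
    print(frame_after_trap(page_fault))
    print(frame_after_trap(gp_fault))
```

Because it uses `findall`, the parser works on traces pasted one frame per line as well as on traces flattened into a single line.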
This could be related to bug 234296, in which case it is probably a use-after-free of a networking structure. Could you characterize the ARP/Neighbour cache usage on affected systems? Are there many short-lived entries in either cache? Are any of the systems IPv4-only or IPv6-only?
For me the panic occurs at boot time during kernel probing and device enumeration. It's possible that the panic happens right at the hand-off to init, but there's no indication that /etc/rc ever gets started. I think the ARP/Neighbor tables are still empty at that point. The host has both IPv4 and IPv6 enabled, but there are no RAs on its network, so it only has a link-local address.
(In reply to Greg Rivers from comment #7) I suspect that the boot-time panic is unrelated. Would you be willing to file a separate PR and CC me? Could you also include a verbose dmesg (boot -v at the loader prompt) from the system leading up to the crash?
All of my systems are dual-stack v4/v6, and the panics happen after a few days of uptime. Not boot-time. If it was boot-time I wouldn't have finished upgrading the whole rack. :) There shouldn't be any churn in the ARP and ND tables -- it's a server rack and most addresses are statically assigned, and nothing's coming or going. The expire times are at defaults. So the only churn should be re-populating right after a normal expire. Traffic does get kinda heavy at times (lots of HTTPS, lots of NFSv3), though we had a lightly loaded system get it yesterday. All the NICs are em(4) -- we have a few systems with igb(4) but none of those have had the panic yet. No idea if that's relevant or just luck. I put 12.0-p2 on everything overnight last night and one system has panicked since then, so it's not anything that was fixed in that patch. The panics did not happen on 11.x or 10.x -- this is all new with 12.0. I now have vmcore images for many of these, including the 12.0-p2 one from two hours ago. (Initially I didn't have dumps working on geli+gmirrored swap. I do now.)
(In reply to Mike Andrews from comment #9) Would you be willing to share the vmcores with me, along with a copy of the corresponding /boot/kernel and /usr/lib/debug/boot/kernel?
https://www.bit0.com/download/vmcore.0.whitedog.2018-12-30 and https://www.bit0.com/download/kernel.whitedog.2018-12-30 should cover that -- I don't have anything in /usr/lib/debug at all but the kernel is a stock 12.0-p0 r341666 GENERIC one, not custom built from source... so I imagine the /usr/lib/debug from the release ISOs would do. This isn't precisely the same panic as I initially described though. It's just the one that had the least proprietary info in it. :) Let me look at the others and maybe I can put some of them up GPG'ed and send a passphrase separately.
https://www.bit0.com/download/vmcores.champagne.tar.gz.gpg
https://www.bit0.com/download/vmcores.schnapps.tar.gz.gpg
https://www.bit0.com/download/vmcores.whiskey.tar.gz.gpg
https://www.bit0.com/download/vmcores.wine.tar.gz.gpg

emailed passphrase
*** This bug has been marked as a duplicate of bug 234296 ***