Bug 234042 - panics on 12.0-RELEASE: swi4: clock(0), instruction pointer 0x0
Summary: panics on 12.0-RELEASE: swi4: clock(0), instruction pointer 0x0
Status: Closed DUPLICATE of bug 234296
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-12-15 20:12 UTC by Mike Andrews
Modified: 2019-01-20 17:09 UTC
CC List: 3 users

See Also:


Attachments
dmesg.boot (10.30 KB, text/plain)
2018-12-15 20:12 UTC, Mike Andrews
dmesg(8) from successful boot of 11.2-RELEASE-p4 on HP DL380 G7 (16.02 KB, text/plain)
2019-01-01 05:10 UTC, Greg Rivers

Description Mike Andrews 2018-12-15 20:12:54 UTC
Created attachment 200136
dmesg.boot

Since updating from 11.2 to 12.0-RELEASE, two of our systems have hit this same panic a total of three times in the last 24 hours. The instruction pointer is usually 0, but not always:

---

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x0
fault code              = supervisor read instruction, page not present
instruction pointer     = 0x20:0x0
stack pointer           = 0x28:0xfffffe0000470a78
frame pointer           = 0x28:0xfffffe0000470b20
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi4: clock (0))
trap number             = 12
panic: page fault
cpuid = 1
time = 1544894686
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff80bb5a39 at softclock+0x79
#8 0xffffffff80b5ee17 at ithread_loop+0x1a7
#9 0xffffffff80b5bf33 at fork_exit+0x83
#10 0xffffffff810501be at fork_trampoline+0xe
Uptime: 1d18h48m53s

---

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x158
fault code              = supervisor read instruction, page not present
instruction pointer     = 0x20:0x158
stack pointer           = 0x28:0xfffffe0000470a78
frame pointer           = 0x28:0xfffffe0000470b20
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi4: clock (0))
trap number             = 12
panic: page fault
cpuid = 0
time = 1544781943
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff80bb5a39 at softclock+0x79
#8 0xffffffff80b5ee17 at ithread_loop+0x1a7
#9 0xffffffff80b5bf33 at fork_exit+0x83
#10 0xffffffff810501be at fork_trampoline+0xe
Uptime: 11h29m48s

---

Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 14
fault virtual address   = 0x0
fault code              = supervisor read instruction, page not present
instruction pointer     = 0x20:0x0
stack pointer           = 0x0:0xfffffe0000470a78
frame pointer           = 0x0:0xfffffe0000470b20
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi4: clock (0))
trap number             = 12
panic: page fault
cpuid = 6
time = 1544887543
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff80bb5a39 at softclock+0x79
#8 0xffffffff80b5ee17 at ithread_loop+0x1a7
#9 0xffffffff80b5bf33 at fork_exit+0x83
#10 0xffffffff810501be at fork_trampoline+0xe
Uptime: 1d5h17m30s

---

Both systems are identical (a bit old, though) Supermicro 5016T-MTFB.  dmesg.boot from one of them is attached.

I don't have dumps of these, for some reason -- I'm looking into why so I can capture the next one.
Comment 1 Mike Andrews 2018-12-29 16:29:53 UTC
I have minidumps of these now, and it's happening on multiple generations of hardware (so far, all older pre-UEFI stuff).
Comment 2 Mike Andrews 2018-12-30 22:09:07 UTC
# kgdb /boot/kernel/kernel vmcore.0
GNU gdb (GDB) 8.2 [GDB v8.2 for FreeBSD]
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd12.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...(no debugging symbols found)...done.
0xffffffff80bcd0bd in sched_switch ()
(kgdb) bt
#0  0xffffffff80bcd0bd in sched_switch ()
#1  0xffffffff80ba6de1 in mi_switch ()
#2  0xffffffff80bf554c in sleepq_wait ()
#3  0xffffffff80ba6817 in _sleep ()
#4  0xffffffff80bfae71 in taskqueue_thread_loop ()
#5  0xffffffff80b5bf33 in fork_exit ()
#6  <signal handler called>
(kgdb)

I can't get anything useful out of lldb or /usr/libexec/kgdb.
Comment 3 Greg Rivers 2019-01-01 05:10:57 UTC
Created attachment 200671
dmesg(8) from successful boot of 11.2-RELEASE-p4 on HP DL380 G7
Comment 4 Greg Rivers 2019-01-01 05:13:46 UTC
I have exactly the same panic, which prevents 12.0-RELEASE from booting on HP DL380 Gen7 server hardware:
---
frame pointer           = 0x28:0xfffffe0000577a70
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi4: clock (0))
trap number             = 12
panic: page fault
cpuid = 0
time = 3
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff810749c9 at trap_pfault+0x49
#5 0xffffffff81073fee at trap+0x29e
#6 0xffffffff8104f1d5 at calltrap+0x8
#7 0xffffffff80bb554e at softclock_call_cc+0x12e
#8 0xffffffff80bb5a39 at softclock+0x79
#9 0xffffffff80b5ee17 at ithread_loop+0x1a7
#10 0xffffffff80b5bf33 at fork_exit+0x83
#11 0xffffffff810501be at fork_trampoline+0xe
Uptime: 3s
Automatic reboot in 15 seconds - press a key on the console to abort
---
See attached dmesg from 11.2 on this hardware.
Comment 5 Mike Andrews 2019-01-09 20:21:17 UTC
Another panic, slightly different, but it still looks vaguely clock-related:

kernel trap 9 with interrupts disabled


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xffffffff80bb524f
stack pointer           = 0x28:0xfffffe0040332770
frame pointer           = 0x28:0xfffffe00403327e0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 11 (idle: cpu0)
trap number             = 9
panic: general protection fault
cpuid = 0
time = 1546921049
KDB: stack backtrace:
#0 0xffffffff80be7977 at kdb_backtrace+0x67
#1 0xffffffff80b9b563 at vpanic+0x1a3
#2 0xffffffff80b9b3b3 at panic+0x43
#3 0xffffffff8107496f at trap_fatal+0x35f
#4 0xffffffff81073dbd at trap+0x6d
#5 0xffffffff8104f1d5 at calltrap+0x8
#6 0xffffffff811aa358 at handleevents+0x1a8
#7 0xffffffff811aabc1 at timercb+0x2a1
#8 0xffffffff8107a7f9 at hpet_intr_single+0x1b9
#9 0xffffffff8107a89e at hpet_intr+0x8e
#10 0xffffffff80b5ebbd at intr_event_handle+0xbd
#11 0xffffffff811e2928 at intr_execute_handlers+0x58
#12 0xffffffff811e8a34 at lapic_handle_intr+0x44
#13 0xffffffff81050489 at Xapic_isr1+0xd9
#14 0xffffffff80459e47 at acpi_cpu_idle+0x2e7
#15 0xffffffff811df68f at cpu_idle_acpi+0x3f
#16 0xffffffff811df747 at cpu_idle+0xa7
#17 0xffffffff80bcfb25 at sched_idletd+0x515
Uptime: 13d4h15m15s
Dumping 1981 out of 8146 MB: (CTRL-C to abort) ..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
Dump complete
Automatic reboot in 60 seconds - press a key on the console to abort
Rebooting...
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2019-01-10 20:08:17 UTC
This could be related to bug 234296, in which case it is probably a use-after-free of a networking structure.  Could you characterize the ARP/Neighbour cache usage on affected systems?  Are there many short-lived entries in either cache?  Are any of the systems IPv4-only or IPv6-only?
Comment 7 Greg Rivers 2019-01-10 21:40:50 UTC
For me the panic occurs at boot time during kernel probing and device enumeration. It's possible that the panic happens right at the hand-off to init, but there's no indication that /etc/rc ever gets started. I think the ARP/Neighbor tables are still empty at that point.

The host has both IPv4 and IPv6 enabled, but there are no RAs on its network, so it only has a link-local address.
Comment 8 Mark Johnston freebsd_committer freebsd_triage 2019-01-10 21:44:52 UTC
(In reply to Greg Rivers from comment #7)
I suspect that the boot-time panic is unrelated.  Would you be willing to file a separate PR and CC me?  Could you also include a verbose dmesg (boot -v at the loader prompt) from the system leading up to the crash?
Comment 9 Mike Andrews 2019-01-10 22:49:38 UTC
All of my systems are dual-stack v4/v6, and the panics happen after a few days of uptime.  Not boot-time.  If it was boot-time I wouldn't have finished upgrading the whole rack. :)

There shouldn't be any churn in the ARP and ND tables: it's a server rack, most addresses are statically assigned, and nothing's coming or going. The expire times are at defaults, so the only churn should be re-populating right after a normal expire. Traffic does get kinda heavy at times (lots of HTTPS, lots of NFSv3), though we had a lightly loaded system get it yesterday.

All the NICs are em(4) -- we have a few systems with igb but none of those have had the panic yet.  No idea if that's relevant or just luck.

I put 12.0-p2 on everything overnight last night and one system has panicked since then, so it's not anything that was fixed in that patch. The panics did not happen on 11.x or 10.x; this is all new with 12.0.

I now have vmcore images for many of these, including the 12.0-p2 one from two hours ago. (Initially I didn't have dumps working on geli+gmirrored swap. I do now.)
Comment 10 Mark Johnston freebsd_committer freebsd_triage 2019-01-10 23:35:05 UTC
(In reply to Mike Andrews from comment #9)
Would you be willing to share the vmcores with me, along with a copy of the corresponding /boot/kernel and /usr/lib/debug/boot/kernel?
Comment 11 Mike Andrews 2019-01-13 20:08:22 UTC
https://www.bit0.com/download/vmcore.0.whitedog.2018-12-30 and https://www.bit0.com/download/kernel.whitedog.2018-12-30 should cover that. I don't have anything in /usr/lib/debug at all, but the kernel is a stock 12.0-p0 r341666 GENERIC one, not custom-built from source, so I imagine the /usr/lib/debug from the release ISOs would do.

This isn't precisely the same panic as I initially described though.  It's just the one that had the least proprietary info in it.  :)  Let me look at the others and maybe I can put some of them up GPG'ed and send a passphrase separately.
Comment 14 Mark Johnston freebsd_committer freebsd_triage 2019-01-20 17:09:44 UTC

*** This bug has been marked as a duplicate of bug 234296 ***