Bug 241059 - random kernel panics [presumed faulty hardware]
Summary: random kernel panics [presumed faulty hardware]
Status: Closed Not Enough Information
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
Keywords: crash, panic
Depends on:
Reported: 2019-10-04 14:58 UTC by Christos Chatzaras
Modified: 2019-12-27 07:31 UTC
Description Christos Chatzaras 2019-10-04 14:58:43 UTC
Over the last few days I enabled dumpdev to get crash dumps for the crashes that randomly happen on my servers.

I can provide shell access to a developer to check the core. 



panic: general protection fault

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer     = 0x20:0xffffffff81098b76
stack pointer           = 0x28:0xfffffe009c39e7e0
frame pointer           = 0x28:0xfffffe009c39e8c0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 37432 (bash)
trap number             = 9
panic: general protection fault
cpuid = 1
time = 1570197605
KDB: stack backtrace:
#0 0xffffffff80c1bd47 at kdb_backtrace+0x67
#1 0xffffffff80bcf07d at vpanic+0x19d
#2 0xffffffff80bceed3 at panic+0x43
#3 0xffffffff810a6d2c at trap_fatal+0x39c
#4 0xffffffff810a613c at trap+0x6c
#5 0xffffffff8107fdcc at calltrap+0x8
#6 0xffffffff80f0fadc at vmspace_exit+0x9c
#7 0xffffffff80b88fb9 at exit1+0x5d9
#8 0xffffffff80b889dd at sys_sys_exit+0xd
#9 0xffffffff810a78e4 at amd64_syscall+0x364
#10 0xffffffff810806f0 at fast_syscall_common+0x101
Uptime: 8d15h44m0s
Dumping 2383 out of 32589 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu.h:234
234     /usr/src/sys/amd64/include/pcpu.h: No such file or directory.
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu.h:234
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:371
#2  0xffffffff80bcec78 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:451
#3  0xffffffff80bcf0d9 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:877
#4  0xffffffff80bceed3 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:804
#5  0xffffffff810a6d2c in trap_fatal (frame=0xfffffe009c39e720, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:943
#6  0xffffffff810a613c in trap (frame=0xfffffe009c39e720)
    at /usr/src/sys/amd64/amd64/trap.c:221
#7  <signal handler called>
#8  0xffffffff81098b76 in pmap_remove_pages (pmap=<optimized out>)
    at /usr/src/sys/amd64/amd64/pmap.c:6934
#9  0xffffffff80f0fadc in vmspace_exit (td=0xfffff8012651c5e0)
    at /usr/src/sys/vm/vm_map.c:409
#10 0xffffffff80b88fb9 in exit1 (td=0xfffff8012651c5e0, rval=<optimized out>,
    signo=0) at /usr/src/sys/kern/kern_exit.c:396
#11 0xffffffff80b889dd in sys_sys_exit (
    td=0xffffffff81a49240 <pv_list_locks+5120>, uap=<optimized out>)
    at /usr/src/sys/kern/kern_exit.c:175
#12 0xffffffff810a78e4 in syscallenter (td=0xfffff8012651c5e0)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:135
#13 amd64_syscall (td=0xfffff8012651c5e0, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1186
#14 <signal handler called>
#15 0x000000080051255a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffd788
Comment 1 Mark Johnston freebsd_committer 2019-10-04 15:36:23 UTC
Does it appear to be a regression?  That is, did you start seeing the panic after an update?

I would be interested in looking at the core.
Comment 2 Christos Chatzaras 2019-10-04 15:39:25 UTC
I got panics with 12.0 too, and if I remember correctly even on 11.x.

It doesn't always happen on the same servers.

But I didn't have dumpdev enabled before, so I am not sure whether it was always the same issue.
Comment 3 Christos Chatzaras 2019-10-04 15:40:10 UTC
I will privately send you a user/pass to log in to the server.

Thank you.
Comment 4 Mark Johnston freebsd_committer 2019-10-04 16:10:57 UTC
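This particular crash was caused by a bit-flip.  %rcx contains a direct map address with one of the upper bits (bit 59) set to 0, which makes the address non-canonical, so the store through it raises the general protection fault: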

   0xffffffff81098b43 <+1267>:  jmp    0xffffffff81098bc0 <pmap_remove_pages+1392>
   0xffffffff81098b45 <+1269>:  add    $0xffffffffffffffff,%rax
   0xffffffff81098b49 <+1273>:  mov    %rax,0x78(%r8)
   0xffffffff81098b4d <+1277>:  mov    0x48(%rdx,%rbx,8),%rax
   0xffffffff81098b52 <+1282>:  mov    %rsi,%rcx
   0xffffffff81098b55 <+1285>:  add    $0x40,%rcx
   0xffffffff81098b59 <+1289>:  test   %rax,%rax
   0xffffffff81098b5c <+1292>:  lea    0x10(%rax),%rax
   0xffffffff81098b60 <+1296>:  cmove  %rcx,%rax
   0xffffffff81098b64 <+1300>:  mov    0x50(%rdx,%rbx,8),%rcx
   0xffffffff81098b69 <+1305>:  mov    %rcx,(%rax)
   0xffffffff81098b6c <+1308>:  mov    0x48(%rdx,%rbx,8),%rax
   0xffffffff81098b71 <+1313>:  mov    0x50(%rdx,%rbx,8),%rcx
=> 0xffffffff81098b76 <+1318>:  mov    %rax,(%rcx)

(kgdb) info reg
rax            0x0                 0
rbx            0x1e0               480
rcx            0xf7fff802c4df2d20  -576469536503550688
rdx            0xfffff806af211000  -8767385038848
rsi            0xfffff807f0b673f0  -8761989762064
rdi            0xffffffff81a49240  -2119921088
rbp            0xfffffe009c39e8c0  0xfffffe009c39e8c0
rsp            0xfffffe009c39e7e0  0xfffffe009c39e7e0
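
For readers who want to check the bit-flip reading of the %rcx value themselves, here is a minimal sketch. It is not part of the original analysis. It assumes 48-bit virtual addresses (4-level paging), where an address is canonical only if bits 63..47 are all equal, and the helper names is_canonical and single_bit_repairs are made up for this illustration.

```python
# Value of %rcx copied from the "info reg" output in this report.
RCX = 0xf7fff802c4df2d20


def is_canonical(addr: int) -> bool:
    """Return True if bits 63..47 of a 64-bit address are all equal.

    Assumption: 48-bit virtual addressing (4-level paging).
    """
    top = addr >> 47  # the upper 17 bits
    return top == 0 or top == 0x1FFFF


def single_bit_repairs(addr: int):
    """List (bit, repaired_address) for every single flip among
    bits 47..63 that turns addr into a canonical address."""
    repairs = []
    for bit in range(47, 64):
        candidate = addr ^ (1 << bit)
        if is_canonical(candidate):
            repairs.append((bit, candidate))
    return repairs


if __name__ == "__main__":
    print("canonical:", is_canonical(RCX))
    for bit, fixed in single_bit_repairs(RCX):
        print(f"flipping bit {bit} gives {fixed:#x}")
```

Under these assumptions the only single-bit repair is bit 59, which gives 0xfffff802c4df2d20, an address in the same 0xfffff8... range as the %rdx and %rsi values in the register dump.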
Comment 5 Mark Johnston freebsd_committer 2019-10-04 16:35:02 UTC
I'm going to close this for now.  Reportedly there are panics across multiple servers, suggesting a software bug, but bit-flips are typically caused by hardware faults, and this particular system is not using ECC RAM.  Please re-open this bug once you have more crash dumps available for investigation.
Comment 6 Christos Chatzaras 2019-10-05 11:23:51 UTC
You are right. The RAM looks bad.

I tested all my servers with these userland programs:

redis-server --test-memory 4096

memtester 4096 1

Redis test:

I ran the test a few times; sometimes it shows an error and sometimes it reports that everything is OK. But I am not sure how reliable the redis-server test is, or whether it locks the RAM before it runs the tests.


On the server that panicked yesterday, the "Bit Flip" test shows issues. On another server, the tests show "Solid Bits" and "Bit Flip" issues.

As these 2 servers are old, I will replace them with new hardware this week.