Bug 241059

Summary:	random kernel panics [presumed faulty hardware]
Product:	Base System	Reporter:	Christos Chatzaras <chris>
Component:	kern	Assignee:	freebsd-bugs (Nobody) <bugs>
Status:	Closed Not Enough Information
Severity:	Affects Only Me	CC:	chris, emaste, markj
Priority:	---	Keywords:	crash
Version:	12.1-RELEASE
Hardware:	amd64
OS:	Any

Description Christos Chatzaras 2019-10-04 14:58:43 UTC

Over the last few days I enabled dumpdev to get crash dumps for the crashes that happen randomly on my servers.

I can provide shell access to a developer to check the core.

------

12.1-BETA1

panic: general protection fault

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer     = 0x20:0xffffffff81098b76
stack pointer           = 0x28:0xfffffe009c39e7e0
frame pointer           = 0x28:0xfffffe009c39e8c0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 37432 (bash)
trap number             = 9
panic: general protection fault
cpuid = 1
time = 1570197605
KDB: stack backtrace:
#0 0xffffffff80c1bd47 at kdb_backtrace+0x67
#1 0xffffffff80bcf07d at vpanic+0x19d
#2 0xffffffff80bceed3 at panic+0x43
#3 0xffffffff810a6d2c at trap_fatal+0x39c
#4 0xffffffff810a613c at trap+0x6c
#5 0xffffffff8107fdcc at calltrap+0x8
#6 0xffffffff80f0fadc at vmspace_exit+0x9c
#7 0xffffffff80b88fb9 at exit1+0x5d9
#8 0xffffffff80b889dd at sys_sys_exit+0xd
#9 0xffffffff810a78e4 at amd64_syscall+0x364
#10 0xffffffff810806f0 at fast_syscall_common+0x101
Uptime: 8d15h44m0s
Dumping 2383 out of 32589 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu.h:234
234     /usr/src/sys/amd64/include/pcpu.h: No such file or directory.
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu.h:234
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:371
#2  0xffffffff80bcec78 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:451
#3  0xffffffff80bcf0d9 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:877
#4  0xffffffff80bceed3 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:804
#5  0xffffffff810a6d2c in trap_fatal (frame=0xfffffe009c39e720, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:943
#6  0xffffffff810a613c in trap (frame=0xfffffe009c39e720)
    at /usr/src/sys/amd64/amd64/trap.c:221
#7  <signal handler called>
#8  0xffffffff81098b76 in pmap_remove_pages (pmap=<optimized out>)
    at /usr/src/sys/amd64/amd64/pmap.c:6934
#9  0xffffffff80f0fadc in vmspace_exit (td=0xfffff8012651c5e0)
    at /usr/src/sys/vm/vm_map.c:409
#10 0xffffffff80b88fb9 in exit1 (td=0xfffff8012651c5e0, rval=<optimized out>,
    signo=0) at /usr/src/sys/kern/kern_exit.c:396
#11 0xffffffff80b889dd in sys_sys_exit (
    td=0xffffffff81a49240 <pv_list_locks+5120>, uap=<optimized out>)
    at /usr/src/sys/kern/kern_exit.c:175
#12 0xffffffff810a78e4 in syscallenter (td=0xfffff8012651c5e0)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:135
#13 amd64_syscall (td=0xfffff8012651c5e0, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1186
#14 <signal handler called>
#15 0x000000080051255a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffd788
(kgdb)

Comment 1 Mark Johnston freebsd_committer 2019-10-04 15:36:23 UTC

Does it appear to be a regression?  That is, did you start seeing the panic after an update?

I would be interested in looking at the core.

Comment 2 Christos Chatzaras 2019-10-04 15:39:25 UTC

I got panics with 12.0 too, and if I remember correctly even with 11.x.

It doesn't always happen on the same servers.

But I didn't have dumpdev enabled before, so I am not sure whether it was always the same issue.

Comment 3 Christos Chatzaras 2019-10-04 15:40:10 UTC

I will send you the user/pass privately so you can log in to the server.

Thank you.

Comment 4 Mark Johnston freebsd_committer 2019-10-04 16:10:57 UTC

This particular crash was caused by a bit-flip.  %rcx contains a direct map address with one of the upper bits set to 0:

...
   0xffffffff81098b43 <+1267>:  jmp    0xffffffff81098bc0 <pmap_remove_pages+1392>
   0xffffffff81098b45 <+1269>:  add    $0xffffffffffffffff,%rax
   0xffffffff81098b49 <+1273>:  mov    %rax,0x78(%r8)
   0xffffffff81098b4d <+1277>:  mov    0x48(%rdx,%rbx,8),%rax
   0xffffffff81098b52 <+1282>:  mov    %rsi,%rcx
   0xffffffff81098b55 <+1285>:  add    $0x40,%rcx
   0xffffffff81098b59 <+1289>:  test   %rax,%rax
   0xffffffff81098b5c <+1292>:  lea    0x10(%rax),%rax
   0xffffffff81098b60 <+1296>:  cmove  %rcx,%rax
   0xffffffff81098b64 <+1300>:  mov    0x50(%rdx,%rbx,8),%rcx
   0xffffffff81098b69 <+1305>:  mov    %rcx,(%rax)
   0xffffffff81098b6c <+1308>:  mov    0x48(%rdx,%rbx,8),%rax
   0xffffffff81098b71 <+1313>:  mov    0x50(%rdx,%rbx,8),%rcx
=> 0xffffffff81098b76 <+1318>:  mov    %rax,(%rcx)
...

(kgdb) info reg
rax            0x0                 0
rbx            0x1e0               480
rcx            0xf7fff802c4df2d20  -576469536503550688
                  ^
rdx            0xfffff806af211000  -8767385038848
rsi            0xfffff807f0b673f0  -8761989762064
rdi            0xffffffff81a49240  -2119921088
rbp            0xfffffe009c39e8c0  0xfffffe009c39e8c0
rsp            0xfffffe009c39e7e0  0xfffffe009c39e7e0
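
As a rough illustration of the diagnosis above (a sketch, not part of the original analysis): with 48-bit virtual addressing, amd64 requires bits 63..47 of an address to be identical, and a store through a non-canonical address raises a general protection fault. The Python below assumes the intended pointer was a kernel-half address with all of those bits set, like the %rdx and %rsi values in the same dump. The helper names are hypothetical.

```python
# Illustrative sketch only; helper names are hypothetical.
RCX = 0xf7fff802c4df2d20        # faulting %rcx value from the register dump
HIGH_BITS = 0xffff800000000000  # bits 63..47, all set in a kernel-half address


def is_canonical(addr):
    """True when bits 63..47 are all equal (48-bit amd64 virtual addressing)."""
    return (addr >> 47) in (0, 0x1ffff)


def cleared_high_bits(addr):
    """Bit positions in 63..47 that are 0, assuming a kernel address was intended."""
    diff = (addr | HIGH_BITS) ^ addr
    return [bit for bit in range(47, 64) if (diff >> bit) & 1]


print(is_canonical(RCX))        # False
print(cleared_high_bits(RCX))   # [59]
print(hex(RCX | HIGH_BITS))     # 0xfffff802c4df2d20
```

Exactly one bit (bit 59) differs from a canonical kernel address, which is what the caret under the %rcx value marks. A single cleared bit is the signature of a bit-flip rather than of a wild or stale pointer.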

Comment 5 Mark Johnston freebsd_committer 2019-10-04 16:35:02 UTC

I'm going to close this for now.  Reportedly there are panics across multiple servers, suggesting a software bug, but bit-flips are typically caused by hardware faults, and this particular system is not using ECC RAM.  Please re-open this bug once you have more crash dumps available for investigation.

Comment 6 Christos Chatzaras 2019-10-05 11:23:51 UTC

You are right. The RAM looks bad.

I tested all my servers with these userland programs:

redis-server --test-memory 4096

memtester 4096 1


Redis test:

I ran the test a few times; sometimes it shows an error and sometimes it shows that everything is OK. But I am not sure how reliable the redis-server test is, or whether it locks the RAM before it runs the tests.

Memtester:

On the server that panicked yesterday, the "Bit Flip" test shows issues. On another server, the tests show both "Solid Bits" and "Bit Flip" issues.

As these 2 servers are old, I will replace them this week with new hardware.