Over the last few days I enabled dumpdev to get crash dumps for the panics that randomly happen on my servers. I can provide shell access to a developer to examine the core.

------

12.1-BETA1

panic: general protection fault

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer     = 0x20:0xffffffff81098b76
stack pointer           = 0x28:0xfffffe009c39e7e0
frame pointer           = 0x28:0xfffffe009c39e8c0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 37432 (bash)
trap number             = 9
panic: general protection fault
cpuid = 1
time = 1570197605
KDB: stack backtrace:
#0 0xffffffff80c1bd47 at kdb_backtrace+0x67
#1 0xffffffff80bcf07d at vpanic+0x19d
#2 0xffffffff80bceed3 at panic+0x43
#3 0xffffffff810a6d2c at trap_fatal+0x39c
#4 0xffffffff810a613c at trap+0x6c
#5 0xffffffff8107fdcc at calltrap+0x8
#6 0xffffffff80f0fadc at vmspace_exit+0x9c
#7 0xffffffff80b88fb9 at exit1+0x5d9
#8 0xffffffff80b889dd at sys_sys_exit+0xd
#9 0xffffffff810a78e4 at amd64_syscall+0x364
#10 0xffffffff810806f0 at fast_syscall_common+0x101
Uptime: 8d15h44m0s
Dumping 2383 out of 32589 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu.h:234
234     /usr/src/sys/amd64/include/pcpu.h: No such file or directory.
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu.h:234
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:371
#2  0xffffffff80bcec78 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:451
#3  0xffffffff80bcf0d9 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:877
#4  0xffffffff80bceed3 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:804
#5  0xffffffff810a6d2c in trap_fatal (frame=0xfffffe009c39e720, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:943
#6  0xffffffff810a613c in trap (frame=0xfffffe009c39e720)
    at /usr/src/sys/amd64/amd64/trap.c:221
#7  <signal handler called>
#8  0xffffffff81098b76 in pmap_remove_pages (pmap=<optimized out>)
    at /usr/src/sys/amd64/amd64/pmap.c:6934
#9  0xffffffff80f0fadc in vmspace_exit (td=0xfffff8012651c5e0)
    at /usr/src/sys/vm/vm_map.c:409
#10 0xffffffff80b88fb9 in exit1 (td=0xfffff8012651c5e0, rval=<optimized out>, signo=0)
    at /usr/src/sys/kern/kern_exit.c:396
#11 0xffffffff80b889dd in sys_sys_exit (td=0xffffffff81a49240 <pv_list_locks+5120>, uap=<optimized out>)
    at /usr/src/sys/kern/kern_exit.c:175
#12 0xffffffff810a78e4 in syscallenter (td=0xfffff8012651c5e0)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:135
#13 amd64_syscall (td=0xfffff8012651c5e0, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1186
#14 <signal handler called>
#15 0x000000080051255a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffd788
(kgdb)
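For reference, a rough sketch of how the crash dumps were set up here; the paths and the vmcore number are only examples and will differ per machine:

# /etc/rc.conf: let savecore(8) pick a swap-backed dump device at boot
dumpdev="AUTO"

# after a panic, savecore(8) writes the dump under /var/crash;
# the backtrace above came from opening it with kgdb along these lines:
kgdb /boot/kernel/kernel /var/crash/vmcore.0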
Does it appear to be a regression? That is, did you start seeing the panic after an update? I would be interested in looking at the core.
I got panics with 12.0 too, and if I remember correctly even on 11.x. They don't always happen on the same servers. But I didn't have dumpdev enabled before, so I am not sure whether it was always the same issue.
I will send you a username and password privately so you can log in to the server. Thank you.
This particular crash was caused by a bit-flip. %rcx contains a direct map address with one of the upper bits set to 0:

...
   0xffffffff81098b43 <+1267>:  jmp    0xffffffff81098bc0 <pmap_remove_pages+1392>
   0xffffffff81098b45 <+1269>:  add    $0xffffffffffffffff,%rax
   0xffffffff81098b49 <+1273>:  mov    %rax,0x78(%r8)
   0xffffffff81098b4d <+1277>:  mov    0x48(%rdx,%rbx,8),%rax
   0xffffffff81098b52 <+1282>:  mov    %rsi,%rcx
   0xffffffff81098b55 <+1285>:  add    $0x40,%rcx
   0xffffffff81098b59 <+1289>:  test   %rax,%rax
   0xffffffff81098b5c <+1292>:  lea    0x10(%rax),%rax
   0xffffffff81098b60 <+1296>:  cmove  %rcx,%rax
   0xffffffff81098b64 <+1300>:  mov    0x50(%rdx,%rbx,8),%rcx
   0xffffffff81098b69 <+1305>:  mov    %rcx,(%rax)
   0xffffffff81098b6c <+1308>:  mov    0x48(%rdx,%rbx,8),%rax
   0xffffffff81098b71 <+1313>:  mov    0x50(%rdx,%rbx,8),%rcx
=> 0xffffffff81098b76 <+1318>:  mov    %rax,(%rcx)
...

(kgdb) info reg
rax            0x0                 0
rbx            0x1e0               480
rcx            0xf7fff802c4df2d20  -576469536503550688
                  ^
rdx            0xfffff806af211000  -8767385038848
rsi            0xfffff807f0b673f0  -8761989762064
rdi            0xffffffff81a49240  -2119921088
rbp            0xfffffe009c39e8c0  0xfffffe009c39e8c0
rsp            0xfffffe009c39e7e0  0xfffffe009c39e7e0
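To make the flipped bit explicit, the expected value below is reconstructed on the assumption that %rcx should have held the corresponding direct-map address (it is not taken from the dump):

expected (direct map): 0xfffff802c4df2d20
observed (%rcx):       0xf7fff802c4df2d20
XOR:                   0x0800000000000000   <- exactly one bit (bit 59) differs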
I'm going to close this for now. Reportedly there are panics across multiple servers, suggesting a software bug, but bit-flips are typically caused by hardware faults, and this particular system is not using ECC RAM. Please re-open this bug once you have more crash dumps available for investigation.
You are right, the RAM looks bad. I tested all my servers with these userland programs:

redis-server --test-memory 4096
memtester 4096 1

Redis test: I ran the test a few times; sometimes it reports errors and sometimes it says the memory is OK. But I am not sure how reliable the redis-server test is, or whether it locks the RAM before running the tests.

Memtester: on the server that panicked yesterday the "Bit Flip" test shows issues. On another server the tests also show "Solid Bits" and "Bit Flip" issues.

As these two servers are old, I will replace them with new hardware this week.
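For what it's worth, memtester attempts to mlock() the region it tests (and warns if it has to fall back to unlocked memory), so a run covering most of a 32 GB machine could look roughly like this; the size and iteration count are only an example:

# run as root so mlock() succeeds; leave headroom for the kernel and
# other processes
memtester 24G 3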