Summary: | random kernel panics [presumed faulty hardware] | ||
---|---|---|---|
Product: | Base System | Reporter: | Christos Chatzaras <chris> |
Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> |
Status: | Closed Not Enough Information | ||
Severity: | Affects Only Me | CC: | chris, emaste, markj |
Priority: | --- | Keywords: | crash |
Version: | 12.1-RELEASE | ||
Hardware: | amd64 | ||
OS: | Any |
Description
Christos Chatzaras
2019-10-04 14:58:43 UTC
Does it appear to be a regression? That is, did you start seeing the panic after an update? I would be interested in looking at the core.

I got panics with 12.0 too and, if I remember correctly, even with 11.x. It doesn't always happen on the same servers. But I didn't have dumpdev enabled before, so I am not sure whether it was always the same issue. I will send you a user/pass privately so you can log in to the server. Thank you.

This particular crash was caused by a bit-flip. %rcx contains a direct map address with one of the upper bits set to 0:

    ...
       0xffffffff81098b43 <+1267>: jmp    0xffffffff81098bc0 <pmap_remove_pages+1392>
       0xffffffff81098b45 <+1269>: add    $0xffffffffffffffff,%rax
       0xffffffff81098b49 <+1273>: mov    %rax,0x78(%r8)
       0xffffffff81098b4d <+1277>: mov    0x48(%rdx,%rbx,8),%rax
       0xffffffff81098b52 <+1282>: mov    %rsi,%rcx
       0xffffffff81098b55 <+1285>: add    $0x40,%rcx
       0xffffffff81098b59 <+1289>: test   %rax,%rax
       0xffffffff81098b5c <+1292>: lea    0x10(%rax),%rax
       0xffffffff81098b60 <+1296>: cmove  %rcx,%rax
       0xffffffff81098b64 <+1300>: mov    0x50(%rdx,%rbx,8),%rcx
       0xffffffff81098b69 <+1305>: mov    %rcx,(%rax)
       0xffffffff81098b6c <+1308>: mov    0x48(%rdx,%rbx,8),%rax
       0xffffffff81098b71 <+1313>: mov    0x50(%rdx,%rbx,8),%rcx
    => 0xffffffff81098b76 <+1318>: mov    %rax,(%rcx)
    ...
    (kgdb) info reg
    rax            0x0                 0
    rbx            0x1e0               480
    rcx            0xf7fff802c4df2d20  -576469536503550688
                      ^
    rdx            0xfffff806af211000  -8767385038848
    rsi            0xfffff807f0b673f0  -8761989762064
    rdi            0xffffffff81a49240  -2119921088
    rbp            0xfffffe009c39e8c0  0xfffffe009c39e8c0
    rsp            0xfffffe009c39e7e0  0xfffffe009c39e7e0

I'm going to close this for now. Reportedly there are panics across multiple servers, which would suggest a software bug, but bit-flips are typically caused by hardware faults, and this particular system is not using ECC RAM. Please re-open this bug once you have more crash dumps available for investigation.

You are right. The RAM looks bad.
I test all my servers with these userland programs:

    redis-server --test-memory 4096
    memtester 4096 1

Redis test: I ran the test a few times; sometimes it shows an error and sometimes it reports that everything is OK. But I am not sure how reliable the redis-server test is, or whether it locks the RAM before it runs the tests.

Memtester: On the server that panicked yesterday, the "Bit Flip" test shows issues. On another server, the tests show "Solid Bits" and "Bit Flip" issues. As these 2 servers are old, I will replace them this week with new hardware.
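
The bit-flip diagnosis in the kgdb analysis above can be checked mechanically. The following is a minimal sketch, not part of the original investigation: it flips each bit of the faulting %rcx value in turn and reports which single-bit change would turn it back into a plausible direct map address. The bounds `DMAP_MIN` and `DMAP_MAX` are assumptions, chosen to match the other pointers in the register dump (%rdx and %rsi both begin with 0xfffff80); they were not read from this system's kernel. The helper name `single_bit_repairs` is hypothetical.

```python
# Value of %rcx copied from the kgdb register dump in this report.
OBSERVED_RCX = 0xF7FFF802C4DF2D20

# ASSUMPTION: approximate bounds of the amd64 direct map, inferred from the
# other pointers in the dump (0xfffff806..., 0xfffff807...), not taken from
# this machine's kernel configuration.
DMAP_MIN = 0xFFFFF80000000000
DMAP_MAX = 0xFFFFFC0000000000


def single_bit_repairs(value, lo, hi, width=64):
    """Return (bit, repaired_value) pairs for every single-bit flip of
    `value` that lands inside the half-open range [lo, hi)."""
    repairs = []
    for bit in range(width):
        candidate = value ^ (1 << bit)  # toggle exactly one bit
        if lo <= candidate < hi:
            repairs.append((bit, candidate))
    return repairs


if __name__ == "__main__":
    for bit, fixed in single_bit_repairs(OBSERVED_RCX, DMAP_MIN, DMAP_MAX):
        print(f"bit {bit}: {fixed:#018x}")
```

Under these assumptions, the only candidate is bit 59, which yields 0xfffff802c4df2d20. That is consistent with the statement that one of the upper bits of the address had been cleared, and with the caret under the `7` in the register dump.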