Dear list, After some time, I have again experienced a kernel panic on a (physical) server, running FreeBSD 11.2-RELEASE-p7 with a custom/debug kernel and ZFS root. Fatal trap 9: general protection fault while in kernel mode cpuid = 2; apic id = 02 instruction pointer = 0x20:0xffffffff82299013 stack pointer = 0x28:0xfffffe0352893ad0 frame pointer = 0x28:0xfffffe0352893b10 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 9 (dbuf_evict_thread) trap number = 9 panic: general protection fault cpuid = 2 KDB: stack backtrace: #0 0xffffffff80b3d567 at kdb_backtrace+0x67 #1 0xffffffff80af6b07 at vpanic+0x177 #2 0xffffffff80af6983 at panic+0x43 #3 0xffffffff80f77fdf at trap_fatal+0x35f #4 0xffffffff80f7759e at trap+0x5e #5 0xffffffff80f5807c at calltrap+0x8 #6 0xffffffff8229c049 at dbuf_evict_one+0xe9 #7 0xffffffff82297a15 at dbuf_evict_thread+0x1a5 #8 0xffffffff80aba083 at fork_exit+0x83 #9 0xffffffff80f58f9e at fork_trampoline+0xe Uptime: 20d6h13m55s Dumping 2593 out of 12248 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% I used the "crashinfo" utility to generate a text file, which is available at this URL: http://www.ocpea.com/dump/core-3.txt. All advice is deeply appreciated as this is a production server. :) Kind regards, Jurij
Hello, After approximately 1 month, the server (FreeBSD 11.2-RELEASE-p9 #0 r344062) has crashed again. It seems to me, the reason for the crash is most likely the same as the last time: Fatal trap 9: general protection fault while in kernel mode cpuid = 2; apic id = 02 instruction pointer = 0x20:0xffffffff82299013 stack pointer = 0x28:0xfffffe0352893ad0 frame pointer = 0x28:0xfffffe0352893b10 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 9 (dbuf_evict_thread) trap number = 9 panic: general protection fault cpuid = 2 KDB: stack backtrace: #0 0xffffffff80b3d5a7 at kdb_backtrace+0x67 #1 0xffffffff80af6b47 at vpanic+0x177 #2 0xffffffff80af69c3 at panic+0x43 #3 0xffffffff80f77fdf at trap_fatal+0x35f #4 0xffffffff80f7759e at trap+0x5e #5 0xffffffff80f580bc at calltrap+0x8 #6 0xffffffff8229c049 at dbuf_evict_one+0xe9 #7 0xffffffff82297a15 at dbuf_evict_thread+0x1a5 #8 0xffffffff80aba0c3 at fork_exit+0x83 #9 0xffffffff80f58fee at fork_trampoline+0xe Uptime: 31d1h57m16s Uptime: 31d1h57m16s Dumping 2771 out of 12248 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done. done. Loaded symbols for /boot/kernel/zfs.ko Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done. done. Loaded symbols for /boot/kernel/opensolaris.ko Reading symbols from /boot/kernel/ums.ko...Reading symbols from /usr/lib/debug//boot/kernel/ums.ko.debug...done. done. Loaded symbols for /boot/kernel/ums.ko Reading symbols from /boot/kernel/pflog.ko...Reading symbols from /usr/lib/debug//boot/kernel/pflog.ko.debug...done. done. Loaded symbols for /boot/kernel/pflog.ko Reading symbols from /boot/kernel/pf.ko...Reading symbols from /usr/lib/debug//boot/kernel/pf.ko.debug...done. done. 
Loaded symbols for /boot/kernel/pf.ko Reading symbols from /boot/kernel/nullfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/nullfs.ko.debug...done. done. Loaded symbols for /boot/kernel/nullfs.ko Reading symbols from /boot/kernel/blank_saver.ko...Reading symbols from /usr/lib/debug//boot/kernel/blank_saver.ko.debug...done. done. Loaded symbols for /boot/kernel/blank_saver.ko Reading symbols from /boot/kernel/fdescfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/fdescfs.ko.debug...done. done. Loaded symbols for /boot/kernel/fdescfs.ko Reading symbols from /boot/kernel/smbfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/smbfs.ko.debug...done. done. Loaded symbols for /boot/kernel/smbfs.ko Reading symbols from /boot/kernel/libiconv.ko...Reading symbols from /usr/lib/debug//boot/kernel/libiconv.ko.debug...done. done. Loaded symbols for /boot/kernel/libiconv.ko Reading symbols from /boot/kernel/libmchain.ko...Reading symbols from /usr/lib/debug//boot/kernel/libmchain.ko.debug...done. done. Loaded symbols for /boot/kernel/libmchain.ko Reading symbols from /boot/kernel/geom_mirror.ko...Reading symbols from /usr/lib/debug//boot/kernel/geom_mirror.ko.debug...done. done. Loaded symbols for /boot/kernel/geom_mirror.ko #0 doadump (textdump=<value optimized out>) at pcpu.h:229 229 pcpu.h: No such file or directory. 
in pcpu.h (kgdb) #0 doadump (textdump=<value optimized out>) at pcpu.h:229 #1 0xffffffff80af675b in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:383 #2 0xffffffff80af6b81 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:776 #3 0xffffffff80af69c3 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:707 #4 0xffffffff80f77fdf in trap_fatal (frame=0xfffffe0352893a10, eva=0) at /usr/src/sys/amd64/amd64/trap.c:875 #5 0xffffffff80f7759e in trap (frame=0xfffffe0352893a10) at pcpu.h:229 #6 0xffffffff80f580bc in calltrap () at /usr/src/sys/amd64/amd64/exception.S:231 #7 0xffffffff82299013 in dbuf_destroy (db=0xfffff8028de9cae0) at atomic.h:79 #8 0xffffffff8229c049 in dbuf_evict_one () at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:487 #9 0xffffffff82297a15 in dbuf_evict_thread (unused=<value optimized out>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:525 #10 0xffffffff80aba0c3 in fork_exit ( callout=0xffffffff82297870 <dbuf_evict_thread>, arg=0x0, frame=0xfffffe0352893c00) at /usr/src/sys/kern/kern_fork.c:1054 #11 0xffffffff80f58fee in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:959 #12 0x0000000000000000 in ?? () Current language: auto; currently minimal (kgdb) The "crashinfo" text file is available at this URL: http://www.ocpea.com/dump/core-5.txt. The original core dump files are also available if needed. Can someone please take a look at this? I will be more than happy to help with debugging, supplying additional info, etc., if needed. Kind regards, Jurij
It would be useful to see *db in frame 7. It would also be useful to find out what the real line number is there: #7 0xffffffff82299013 in dbuf_destroy (db=0xfffff8028de9cae0) at atomic.h:79 You can disassemble dbuf_destroy() and see what's at address 0xffffffff82299013 and what instructions / calls lead to it.
Hello Andriy, I am pretty new at this - could you please elaborate a bit on how to "disassemble dbuf_destroy() and see what's at address 0xffffffff82299013 and what instructions / calls lead to it". I presume I can get the required information from the crash dump? Regards, Jurij
(In reply to Jurij Kovacic from comment #3) Yes, `disassemble dbuf_destroy` in kgdb.
Hello Andriy, Thank you very much for the explanation. After running: kgdb /boot/kernel/kernel /var/crash/vmcore.last the instruction at "0xffffffff82299013" is: 0xffffffff82299013 <dbuf_destroy+563>: mov (%rax),%rcx Please find the complete disassembly of the dbuf_destroy function below. Kind regards, Jurij Dump of assembler code for function dbuf_destroy: 0xffffffff82298de0 <dbuf_destroy+0>: push %rbp 0xffffffff82298de1 <dbuf_destroy+1>: mov %rsp,%rbp 0xffffffff82298de4 <dbuf_destroy+4>: push %r15 0xffffffff82298de6 <dbuf_destroy+6>: push %r14 0xffffffff82298de8 <dbuf_destroy+8>: push %r13 0xffffffff82298dea <dbuf_destroy+10>: push %r12 0xffffffff82298dec <dbuf_destroy+12>: push %rbx 0xffffffff82298ded <dbuf_destroy+13>: sub $0x18,%rsp 0xffffffff82298df1 <dbuf_destroy+17>: mov %rdi,%r13 0xffffffff82298df4 <dbuf_destroy+20>: mov 0x30(%r13),%r14 0xffffffff82298df8 <dbuf_destroy+24>: mov 0x88(%r13),%rdi 0xffffffff82298dff <dbuf_destroy+31>: test %rdi,%rdi 0xffffffff82298e02 <dbuf_destroy+34>: je 0xffffffff82298e17 <dbuf_destroy+55> 0xffffffff82298e04 <dbuf_destroy+36>: mov %r13,%rsi 0xffffffff82298e07 <dbuf_destroy+39>: callq 0xffffffff8228c220 <arc_buf_destroy> 0xffffffff82298e0c <dbuf_destroy+44>: movq $0x0,0x88(%r13) 0xffffffff82298e17 <dbuf_destroy+55>: cmpq $0xffffffffffffffff,0x40(%r13) 0xffffffff82298e1c <dbuf_destroy+60>: jne 0xffffffff82298e43 <dbuf_destroy+99> 0xffffffff82298e1e <dbuf_destroy+62>: mov 0x18(%r13),%rdi 0xffffffff82298e22 <dbuf_destroy+66>: mov $0x140,%esi 0xffffffff82298e27 <dbuf_destroy+71>: callq 0xffffffff82328d00 <zio_buf_free> 0xffffffff82298e2c <dbuf_destroy+76>: mov $0x140,%edi 0xffffffff82298e31 <dbuf_destroy+81>: mov $0x4,%esi 0xffffffff82298e36 <dbuf_destroy+86>: callq 0xffffffff8228b6c0 <arc_space_return> 0xffffffff82298e3b <dbuf_destroy+91>: movl $0x0,0x78(%r13) 0xffffffff82298e43 <dbuf_destroy+99>: mov 0xd8(%r13),%r15 0xffffffff82298e4a <dbuf_destroy+106>: test %r15,%r15 0xffffffff82298e4d <dbuf_destroy+109>: je 
0xffffffff82298e8a <dbuf_destroy+170> 0xffffffff82298e4f <dbuf_destroy+111>: movq $0x0,0xd8(%r13) 0xffffffff82298e5a <dbuf_destroy+122>: mov 0x30(%r15),%rax 0xffffffff82298e5e <dbuf_destroy+126>: mov 0x38(%r15),%rbx 0xffffffff82298e62 <dbuf_destroy+130>: test %rax,%rax 0xffffffff82298e65 <dbuf_destroy+133>: je 0xffffffff82298e6c <dbuf_destroy+140> 0xffffffff82298e67 <dbuf_destroy+135>: mov %r15,%rdi 0xffffffff82298e6a <dbuf_destroy+138>: callq *%rax 0xffffffff82298e6c <dbuf_destroy+140>: test %rbx,%rbx 0xffffffff82298e6f <dbuf_destroy+143>: je 0xffffffff82298e8a <dbuf_destroy+170> 0xffffffff82298e71 <dbuf_destroy+145>: mov 0xffffffff8240c470,%rdi 0xffffffff82298e79 <dbuf_destroy+153>: mov 0x38(%r15),%rsi 0xffffffff82298e7d <dbuf_destroy+157>: xor %ecx,%ecx 0xffffffff82298e7f <dbuf_destroy+159>: mov %r15,%rdx 0xffffffff82298e82 <dbuf_destroy+162>: mov %r15,%r8 0xffffffff82298e85 <dbuf_destroy+165>: callq 0xffffffff82272960 <taskq_dispatch_ent> 0xffffffff82298e8a <dbuf_destroy+170>: movq $0x0,0x18(%r13) 0xffffffff82298e92 <dbuf_destroy+178>: cmpl $0x2,0x78(%r13) 0xffffffff82298e97 <dbuf_destroy+183>: je 0xffffffff82298ea1 <dbuf_destroy+193> 0xffffffff82298e99 <dbuf_destroy+185>: movl $0x0,0x78(%r13) 0xffffffff82298ea1 <dbuf_destroy+193>: lea 0xc8(%r13),%rdi 0xffffffff82298ea8 <dbuf_destroy+200>: callq 0xffffffff822dbf30 <multilist_link_active> 0xffffffff82298ead <dbuf_destroy+205>: test %eax,%eax 0xffffffff82298eaf <dbuf_destroy+207>: je 0xffffffff82298ed4 <dbuf_destroy+244> 0xffffffff82298eb1 <dbuf_destroy+209>: mov 0xffffffff8240c478,%rdi 0xffffffff82298eb9 <dbuf_destroy+217>: mov %r13,%rsi 0xffffffff82298ebc <dbuf_destroy+220>: callq 0xffffffff822dbbe0 <multilist_remove> 0xffffffff82298ec1 <dbuf_destroy+225>: mov 0x10(%r13),%rsi 0xffffffff82298ec5 <dbuf_destroy+229>: neg %rsi 0xffffffff82298ec8 <dbuf_destroy+232>: mov $0xffffffff8240c480,%rdi 0xffffffff82298ecf <dbuf_destroy+239>: callq 0xffffffff82273960 <atomic_add_64_nv> 0xffffffff82298ed4 <dbuf_destroy+244>: 
movl $0x5,0x78(%r13) 0xffffffff82298edc <dbuf_destroy+252>: movq $0x0,0x48(%r13) 0xffffffff82298ee4 <dbuf_destroy+260>: lea 0x58(%r13),%rdi 0xffffffff82298ee8 <dbuf_destroy+264>: mov $0xffffffff823d4fd1,%rsi 0xffffffff82298eef <dbuf_destroy+271>: mov $0x812,%edx 0xffffffff82298ef4 <dbuf_destroy+276>: callq 0xffffffff80aff910 <_sx_xunlock> 0xffffffff82298ef9 <dbuf_destroy+281>: mov 0x28(%r13),%rdi 0xffffffff82298efd <dbuf_destroy+285>: mov $0xffffffff823d5126,%rsi 0xffffffff82298f04 <dbuf_destroy+292>: callq 0xffffffff82331f70 <zrl_add_impl> 0xffffffff82298f09 <dbuf_destroy+297>: mov 0x28(%r13),%rdi 0xffffffff82298f0d <dbuf_destroy+301>: mov 0x40(%rdi),%r15 0xffffffff82298f11 <dbuf_destroy+305>: mov 0x40(%r15),%rbx 0xffffffff82298f15 <dbuf_destroy+309>: cmpq $0xffffffffffffffff,0x40(%r13) 0xffffffff82298f1a <dbuf_destroy+314>: je 0xffffffff82299059 <dbuf_destroy+633> 0xffffffff82298f20 <dbuf_destroy+320>: mov %rbx,-0x30(%rbp) 0xffffffff82298f24 <dbuf_destroy+324>: mov %r14,-0x38(%rbp) 0xffffffff82298f28 <dbuf_destroy+328>: lea 0x1f8(%r15),%r12 0xffffffff82298f2f <dbuf_destroy+335>: mov 0x210(%r15),%rbx 0xffffffff82298f36 <dbuf_destroy+342>: and $0xfffffffffffffff1,%rbx 0xffffffff82298f3a <dbuf_destroy+346>: mov %gs:0x0,%r14 0xffffffff82298f43 <dbuf_destroy+355>: cmp %r14,%rbx 0xffffffff82298f46 <dbuf_destroy+358>: je 0xffffffff82298f5e <dbuf_destroy+382> 0xffffffff82298f48 <dbuf_destroy+360>: xor %esi,%esi 0xffffffff82298f4a <dbuf_destroy+362>: mov $0xffffffff823d4fd1,%rdx 0xffffffff82298f51 <dbuf_destroy+369>: mov $0x81a,%ecx 0xffffffff82298f56 <dbuf_destroy+374>: mov %r12,%rdi 0xffffffff82298f59 <dbuf_destroy+377>: callq 0xffffffff80aff0d0 <_sx_xlock> 0xffffffff82298f5e <dbuf_destroy+382>: lea 0x218(%r15),%rdi 0xffffffff82298f65 <dbuf_destroy+389>: mov %r13,%rsi 0xffffffff82298f68 <dbuf_destroy+392>: callq 0xffffffff82266e70 <avl_remove> 0xffffffff82298f6d <dbuf_destroy+397>: lea 0xa8(%r15),%rdi 0xffffffff82298f74 <dbuf_destroy+404>: mov $0x1,%esi 
0xffffffff82298f79 <dbuf_destroy+409>: callq 0xffffffff80f56de0 <atomic_subtract_int> 0xffffffff82298f7e <dbuf_destroy+414>: callq 0xffffffff822739b0 <membar_producer> 0xffffffff82298f83 <dbuf_destroy+419>: mov 0x28(%r13),%rdi 0xffffffff82298f87 <dbuf_destroy+423>: callq 0xffffffff82332000 <zrl_remove> 0xffffffff82298f8c <dbuf_destroy+428>: cmp %r14,%rbx 0xffffffff82298f8f <dbuf_destroy+431>: je 0xffffffff82298fa5 <dbuf_destroy+453> 0xffffffff82298f91 <dbuf_destroy+433>: mov $0xffffffff823d4fd1,%rsi 0xffffffff82298f98 <dbuf_destroy+440>: mov $0x820,%edx 0xffffffff82298f9d <dbuf_destroy+445>: mov %r12,%rdi 0xffffffff82298fa0 <dbuf_destroy+448>: callq 0xffffffff80aff910 <_sx_xunlock> 0xffffffff82298fa5 <dbuf_destroy+453>: mov %r15,%rdi 0xffffffff82298fa8 <dbuf_destroy+456>: mov %r13,%rsi 0xffffffff82298fab <dbuf_destroy+459>: callq 0xffffffff822b4dd0 <dnode_rele> 0xffffffff82298fb0 <dbuf_destroy+464>: movq $0x0,0x28(%r13) 0xffffffff82298fb8 <dbuf_destroy+472>: mov 0x0(%r13),%rsi 0xffffffff82298fbc <dbuf_destroy+476>: mov 0x20(%r13),%rdi 0xffffffff82298fc0 <dbuf_destroy+480>: mov 0x40(%r13),%rcx 0xffffffff82298fc4 <dbuf_destroy+484>: movzbl 0x50(%r13),%edx 0xffffffff82298fc9 <dbuf_destroy+489>: callq 0xffffffff82297340 <cityhash4> 0xffffffff82298fce <dbuf_destroy+494>: mov %rax,%rbx 0xffffffff82298fd1 <dbuf_destroy+497>: and 0xffffffff8240a458,%rbx 0xffffffff82298fd9 <dbuf_destroy+505>: movzbl %bl,%eax 0xffffffff82298fdc <dbuf_destroy+508>: shl $0x5,%rax 0xffffffff82298fe0 <dbuf_destroy+512>: lea -0x7dbf5b98(%rax),%r15 0xffffffff82298fe7 <dbuf_destroy+519>: xor %esi,%esi 0xffffffff82298fe9 <dbuf_destroy+521>: mov $0xffffffff823d4fd1,%rdx 0xffffffff82298ff0 <dbuf_destroy+528>: mov $0x129,%ecx 0xffffffff82298ff5 <dbuf_destroy+533>: mov %r15,%rdi 0xffffffff82298ff8 <dbuf_destroy+536>: callq 0xffffffff80aff0d0 <_sx_xlock> 0xffffffff82298ffd <dbuf_destroy+541>: shl $0x3,%rbx 0xffffffff82299001 <dbuf_destroy+545>: add 0xffffffff8240a460,%rbx 0xffffffff82299009 
<dbuf_destroy+553>: mov -0x38(%rbp),%r14 0xffffffff8229900d <dbuf_destroy+557>: nopl (%rax) 0xffffffff82299010 <dbuf_destroy+560>: mov %rbx,%rax 0xffffffff82299013 <dbuf_destroy+563>: mov (%rax),%rcx 0xffffffff82299016 <dbuf_destroy+566>: lea 0x38(%rcx),%rbx 0xffffffff8229901a <dbuf_destroy+570>: cmp %r13,%rcx 0xffffffff8229901d <dbuf_destroy+573>: jne 0xffffffff82299010 <dbuf_destroy+560> 0xffffffff8229901f <dbuf_destroy+575>: mov 0x38(%r13),%rcx 0xffffffff82299023 <dbuf_destroy+579>: mov %rcx,(%rax) 0xffffffff82299026 <dbuf_destroy+582>: movq $0x0,0x38(%r13) 0xffffffff8229902e <dbuf_destroy+590>: mov $0xffffffff823d4fd1,%rsi 0xffffffff82299035 <dbuf_destroy+597>: mov $0x131,%edx 0xffffffff8229903a <dbuf_destroy+602>: mov %r15,%rdi 0xffffffff8229903d <dbuf_destroy+605>: callq 0xffffffff80aff910 <_sx_xunlock> 0xffffffff82299042 <dbuf_destroy+610>: mov $0xffffffff8240c4c8,%rdi 0xffffffff82299049 <dbuf_destroy+617>: mov $0x1,%esi 0xffffffff8229904e <dbuf_destroy+622>: callq 0xffffffff80f56e60 <atomic_subtract_long> 0xffffffff82299053 <dbuf_destroy+627>: mov -0x30(%rbp),%rbx 0xffffffff82299057 <dbuf_destroy+631>: jmp 0xffffffff8229905e <dbuf_destroy+638> 0xffffffff82299059 <dbuf_destroy+633>: callq 0xffffffff82332000 <zrl_remove> 0xffffffff8229905e <dbuf_destroy+638>: movq $0x0,0x30(%r13) 0xffffffff82299066 <dbuf_destroy+646>: mov 0xffffffff8240c468,%rdi 0xffffffff8229906e <dbuf_destroy+654>: mov %r13,%rsi 0xffffffff82299071 <dbuf_destroy+657>: callq 0xffffffff825e83c0 <kmem_cache_free> 0xffffffff82299076 <dbuf_destroy+662>: mov $0xe8,%edi 0xffffffff8229907b <dbuf_destroy+667>: mov $0x4,%esi 0xffffffff82299080 <dbuf_destroy+672>: callq 0xffffffff8228b6c0 <arc_space_return> 0xffffffff82299085 <dbuf_destroy+677>: test %r14,%r14 0xffffffff82299088 <dbuf_destroy+680>: je 0xffffffff822990bc <dbuf_destroy+732> 0xffffffff8229908a <dbuf_destroy+682>: cmp %rbx,%r14 0xffffffff8229908d <dbuf_destroy+685>: je 0xffffffff822990bc <dbuf_destroy+732> 0xffffffff8229908f 
<dbuf_destroy+687>: lea 0x58(%r14),%rdi 0xffffffff82299093 <dbuf_destroy+691>: xor %esi,%esi 0xffffffff82299095 <dbuf_destroy+693>: mov $0xffffffff823d4fd1,%rdx 0xffffffff8229909c <dbuf_destroy+700>: mov $0xaa6,%ecx 0xffffffff822990a1 <dbuf_destroy+705>: callq 0xffffffff80aff0d0 <_sx_xlock> 0xffffffff822990a6 <dbuf_destroy+710>: mov %r14,%rdi 0xffffffff822990a9 <dbuf_destroy+713>: add $0x18,%rsp 0xffffffff822990ad <dbuf_destroy+717>: pop %rbx 0xffffffff822990ae <dbuf_destroy+718>: pop %r12 0xffffffff822990b0 <dbuf_destroy+720>: pop %r13 0xffffffff822990b2 <dbuf_destroy+722>: pop %r14 0xffffffff822990b4 <dbuf_destroy+724>: pop %r15 0xffffffff822990b6 <dbuf_destroy+726>: pop %rbp 0xffffffff822990b7 <dbuf_destroy+727>: jmpq 0xffffffff8229b290 <dbuf_rele_and_unlock> 0xffffffff822990bc <dbuf_destroy+732>: add $0x18,%rsp 0xffffffff822990c0 <dbuf_destroy+736>: pop %rbx 0xffffffff822990c1 <dbuf_destroy+737>: pop %r12 0xffffffff822990c3 <dbuf_destroy+739>: pop %r13 0xffffffff822990c5 <dbuf_destroy+741>: pop %r14 0xffffffff822990c7 <dbuf_destroy+743>: pop %r15 0xffffffff822990c9 <dbuf_destroy+745>: pop %rbp 0xffffffff822990ca <dbuf_destroy+746>: retq End of assembler dump. Current language: auto; currently minimal
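As a quick cross-check of the disassembly, the faulting instruction pointer from the panic message can be mapped to an offset inside dbuf_destroy by plain subtraction. This is only a small sketch; the two addresses are copied from the dump above, and the helper name symbol_offset is made up for illustration.

```python
# Addresses copied from the panic message and the disassembly above.
FUNC_START = 0xffffffff82298de0  # <dbuf_destroy+0>
FAULT_RIP = 0xffffffff82299013   # instruction pointer at the time of the trap


def symbol_offset(rip: int, func_start: int) -> int:
    """Return the byte offset of rip from the start of the function."""
    return rip - func_start


offset = symbol_offset(FAULT_RIP, FUNC_START)
print(f"dbuf_destroy+{offset} (0x{offset:x})")
```

The result matches the `<dbuf_destroy+563>` label that kgdb prints next to the `mov (%rax),%rcx` instruction.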
So, it looks like the crash was in dbuf_hash_remove(), and its cause appears to be that the db in question was not in the hash table. So, could you print *db in frame 7? Commands: frame 7 print *db :-)
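To make that hypothesis concrete, here is a rough Python model of the loop at dbuf_destroy+560 .. +573 in the disassembly. It is a sketch, not the ZFS source: memory is modelled as a dict of 8-byte words, the 0x38 offset for the next pointer is read off the `lea 0x38(%rcx),%rbx` instruction, and the bucket and buffer addresses (0x1000, 0x2000, ...) are invented for the demonstration.

```python
HASH_NEXT_OFF = 0x38  # offset of the "next" pointer, taken from the disassembly


class Fault(Exception):
    """Raised when the walk dereferences an address that is not mapped."""


def hash_remove(mem: dict, bucket_addr: int, db: int) -> None:
    """Walk the chain the way the compiled loop does and unlink db."""
    slot = bucket_addr                  # address holding a pointer to a dbuf
    while True:
        if slot not in mem:
            raise Fault(hex(slot))      # the real kernel traps here
        cur = mem[slot]                 # mov (%rax),%rcx
        if cur == db:                   # cmp %r13,%rcx
            break
        slot = cur + HASH_NEXT_OFF      # lea 0x38(%rcx),%rbx (no NULL check)
    mem[slot] = mem[db + HASH_NEXT_OFF]  # unlink db from the chain
    mem[db + HASH_NEXT_OFF] = 0


# Hypothetical chain: bucket -> 0x2000 -> 0x3000 -> end
mem = {0x1000: 0x2000, 0x2038: 0x3000, 0x3038: 0}
hash_remove(mem, 0x1000, 0x3000)        # present: unlinked cleanly

try:
    hash_remove(mem, 0x1000, 0x4000)    # absent: the walk runs off the end
    walked_off = None
except Fault as exc:
    walked_off = str(exc)
print(walked_off)
```

The point of the model is that the loop has no end-of-chain test, so a buffer that is missing from its chain (or a chain that contains a bad pointer) ends in a load from an invalid address.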
Hi Andriy, Thank you again for the detailed instructions. The following is what I got after entering "frame 7" and "print *db" as instructed: (kgdb) frame 7 #7 0xffffffff82299013 in dbuf_destroy (db=0xfffff8028de9cae0) at atomic.h:79 79 atomic_subtract_32(target, 1); Current language: auto; currently minimal (kgdb) print *db $1 = {db = {db_object = 52289, db_offset = 15868362752, db_size = 131072, db_data = 0x0}, db_objset = 0xfffff8000f30a400, db_dnode_handle = 0x0, db_parent = 0xfffff800168d9000, db_hash_next = 0x0, db_blkid = 121066, db_blkptr = 0x0, db_level = 0 '\0', db_mtx = {lock_object = {lo_name = 0xffffffff823d529f "db->db_mtx", lo_flags = 577830912, lo_data = 0, lo_witness = 0x0}, sx_lock = 1}, db_state = DB_EVICTING, db_holds = {rc_count = 0}, db_buf = 0x0, db_changed = {cv_description = 0xffffffff823d52ab "db->db_changed", cv_waiters = 0}, db_data_pending = 0x0, db_last_dirty = 0x0, db_link = {avl_child = 0xfffff8028de9cb90, avl_pcb = 18446735289386928809}, db_cache_link = {list_next = 0x0, list_prev = 0x0}, db_user = 0x0, db_user_immediate_evict = 0 '\0', db_freed_in_flight = 0 '\0', db_pending_evict = 0 '\0', db_dirtycnt = 0 '\0'} (kgdb) Kind regards, Jurij
Could you please also do 'info reg' in that frame?
Yes, sure - I did "frame 7" and then "info reg" - this is what I got: (kgdb) frame 7 #7 0xffffffff82299013 in dbuf_destroy (db=0xfffff8028de9cae0) at atomic.h:79 79 atomic_subtract_32(target, 1); (kgdb) info reg rax 0xffff78028de9cb18 -149522610533608 rbx 0xffff78028de9cb18 -149522610533608 rcx 0xffff78028de9cae0 -149522610533664 rdx 0xfffff8000ac14001 -8795912585215 rsi 0xffffffff8240b008 -2109689848 rdi 0xffffffff8240b008 -2109689848 rbp 0xfffffe0352893b10 0xfffffe0352893b10 rsp 0xfffffe0352893ad0 0xfffffe0352893ad0 r8 0x9ae16a3b2f90408f -7286425919675154289 r9 0xf66ab0fc2de1ac00 -690544995699938304 r10 0x0 0 r11 0xfffff800d6677c78 -8792495915912 r12 0xfffff8005b1914d0 -8794564651824 r13 0xfffff8028de9cae0 -8785122178336 r14 0xfffff800168d9000 -8795714646016 r15 0xffffffff8240b008 -2109689848 rip 0xffffffff82299013 0xffffffff82299013 <dbuf_destroy+563> eflags 0x10287 66183 cs 0x20 32 ss 0x28 40 ds 0x0 0 es 0x0 0 fs 0x0 0 gs 0x0 0 (kgdb)
(In reply to Jurij Kovacic from comment #9) Okay, compare the value of r13 (and this is the original db pointer) to the value in rcx. They differ by a single bit: (gdb) p/x 0xfffff8028de9cae0 ^ 0xffff78028de9cae0 $1 = 0x800000000000 The values in rbx and rax are derived from the value in rcx. And the crash happens when accessing the address stored in rax. This was a long way to say that most likely you got a single bit flip on the hardware level. I assume that the RAM is not ECC.
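The comparison can be reproduced outside of gdb. The sketch below uses the register values from the `info reg` output; the helper names are made up, and the canonical-address check assumes 48-bit virtual addresses (4-level paging) on amd64, where bits 63..47 must all be equal.

```python
# Register values copied from the "info reg" output in frame 7.
R13 = 0xfffff8028de9cae0  # original db pointer
RCX = 0xffff78028de9cae0  # pointer loaded from the hash chain
RAX = 0xffff78028de9cb18  # address being dereferenced at the fault


def flipped_bits(a: int, b: int) -> list:
    """Return the positions of the bits in which a and b differ."""
    diff = a ^ b
    return [i for i in range(64) if (diff >> i) & 1]


def is_canonical(addr: int) -> bool:
    """True if bits 63..47 are all zeros or all ones (48-bit addressing assumed)."""
    top = addr >> 47
    return top in (0, 0x1ffff)


print(flipped_bits(R13, RCX))   # positions where the two pointers differ
print(hex(RAX - RCX))           # distance between rax and rcx
print(is_canonical(R13), is_canonical(RAX))
```

Under that assumption the corrupted address is non-canonical, which would explain why the trap was a general protection fault rather than a page fault. The 0x38 difference between rax and rcx corresponds to the `lea 0x38(%rcx),%rbx` step in the disassembly.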
(In reply to Andriy Gapon from comment #10) Hi Andriy, From what I can tell the RAM modules indeed are not ECC type, ... Memory Device Array Handle: 0x0025 Error Information Handle: Not Provided Total Width: 64 bits Data Width: 64 bits Size: 4096 MB ... since Total Width and Data Width are the same - you can find the complete dmidecode output below. I take it your recommendation is to replace the RAM modules? Kind regards, Jurij # dmidecode 3.2 Scanning /dev/mem for entry point. SMBIOS 2.5 present. Handle 0x0025, DMI type 16, 15 bytes Physical Memory Array Location: System Board Or Motherboard Use: System Memory Error Correction Type: None Maximum Capacity: 128 GB Error Information Handle: Not Provided Number Of Devices: 4 Handle 0x0027, DMI type 17, 27 bytes Memory Device Array Handle: 0x0025 Error Information Handle: Not Provided Total Width: 64 bits Data Width: 64 bits Size: 2048 MB Form Factor: DIMM Set: None Locator: CPU1_DIMM0 Bank Locator: BANK0 Type: Other Type Detail: Synchronous Speed: 667 MT/s Manufacturer: Apacer Serial Number: 32110102 Asset Tag: AssetTagNum0 Part Number: 78.A1GDE.9K00C Handle 0x0029, DMI type 17, 27 bytes Memory Device Array Handle: 0x0025 Error Information Handle: Not Provided Total Width: 64 bits Data Width: 64 bits Size: 2048 MB Form Factor: DIMM Set: None Locator: CPU1_DIMM1 Bank Locator: BANK1 Type: Other Type Detail: Synchronous Speed: 667 MT/s Manufacturer: Apacer Serial Number: 32110102 Asset Tag: AssetTagNum1 Part Number: 78.A1GDE.9K00C Handle 0x002B, DMI type 17, 27 bytes Memory Device Array Handle: 0x0025 Error Information Handle: Not Provided Total Width: 64 bits Data Width: 64 bits Size: 4096 MB Form Factor: DIMM Set: None Locator: CPU1_DIMM2 Bank Locator: BANK2 Type: Other Type Detail: Synchronous Speed: 667 MT/s Manufacturer: Kingston Serial Number: C40B2C66 Asset Tag: AssetTagNum2 Part Number: 99U5403-034.A00LF Handle 0x002D, DMI type 17, 27 bytes Memory Device Array Handle: 0x0025 Error Information Handle: Not 
Provided Total Width: 64 bits Data Width: 64 bits Size: 4096 MB Form Factor: DIMM Set: None Locator: CPU1_DIMM3 Bank Locator: BANK3 Type: Other Type Detail: Synchronous Speed: 667 MT/s Manufacturer: Kingston Serial Number: 91352C8F Asset Tag: AssetTagNum3 Part Number: 99U5403-034.A00LF
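The two indicators used above (the "Error Correction Type" field and the Total Width versus Data Width comparison) can be checked mechanically. The following is a minimal sketch that assumes the normal one-field-per-line dmidecode layout; the SAMPLE text is a shortened excerpt of the values reported above, and the function names are hypothetical.

```python
SAMPLE = """\
Physical Memory Array
    Error Correction Type: None
Memory Device
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 2048 MB
Memory Device
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 4096 MB
"""


def parse_memory_info(dmi_text: str):
    """Return (error correction type, list of (total width, data width))."""
    correction = None
    widths = []
    total = None
    for line in dmi_text.splitlines():
        key, _, value = line.strip().partition(":")
        value = value.strip()
        if key == "Error Correction Type":
            correction = value
        elif key == "Total Width":
            total = value
        elif key == "Data Width":
            widths.append((total, value))
    return correction, widths


def looks_like_ecc(dmi_text: str) -> bool:
    """Heuristic: ECC is reported, or some module has extra check bits."""
    correction, widths = parse_memory_info(dmi_text)
    reported = correction not in (None, "None", "Unknown")
    extra_bits = any(total != data for total, data in widths)
    return reported or extra_bits


print(looks_like_ecc(SAMPLE))
```

An ECC module typically reports a Total Width larger than its Data Width (for example 72 versus 64 bits), so equal widths together with "Error Correction Type: None" point to non-ECC memory. This is a heuristic on SMBIOS data, which is only as reliable as what the firmware reports.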
It's not necessarily a memory module; it could be a bad contact in a memory slot or in a CPU socket. Or some source of alpha particles in the vicinity :-) But you get the idea.
Thank you very much for your help, Andriy. To be quite honest, I was beginning to lose hope, since I got no reply on the mailing list, whereas you not only helped, but taught me a few things along the way. Now we know we should focus on the hardware side of things. Once again, thank you. Kind regards, Jurij
You are welcome. And thank you for the report and great help with it.