Bug 235783 - Repeated ZFS-related kernel panic
Summary: Repeated ZFS-related kernel panic
Status: Closed Not A Bug
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-fs mailing list
URL:
Keywords: panic
Depends on:
Blocks:
 
Reported: 2019-02-16 17:40 UTC by Jurij Kovacic
Modified: 2019-03-22 09:14 UTC
Description Jurij Kovacic 2019-02-16 17:40:36 UTC
Dear list,

After some time, I have again experienced a kernel panic on a (physical) server running FreeBSD 11.2-RELEASE-p7 with a custom/debug kernel and ZFS root.

Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer    = 0x20:0xffffffff82299013
stack pointer            = 0x28:0xfffffe0352893ad0
frame pointer            = 0x28:0xfffffe0352893b10
code segment        = base rx0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 9 (dbuf_evict_thread)
trap number        = 9
panic: general protection fault
cpuid = 2
KDB: stack backtrace:
#0 0xffffffff80b3d567 at kdb_backtrace+0x67
#1 0xffffffff80af6b07 at vpanic+0x177
#2 0xffffffff80af6983 at panic+0x43
#3 0xffffffff80f77fdf at trap_fatal+0x35f
#4 0xffffffff80f7759e at trap+0x5e
#5 0xffffffff80f5807c at calltrap+0x8
#6 0xffffffff8229c049 at dbuf_evict_one+0xe9
#7 0xffffffff82297a15 at dbuf_evict_thread+0x1a5
#8 0xffffffff80aba083 at fork_exit+0x83
#9 0xffffffff80f58f9e at fork_trampoline+0xe
Uptime: 20d6h13m55s
Dumping 2593 out of 12248 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

I used the "crashinfo" utility to generate a text file, which is available at this URL: http://www.ocpea.com/dump/core-3.txt.

All advice is deeply appreciated as this is a production server. :)

Kind regards,
Jurij
Comment 1 Jurij Kovacic 2019-03-16 12:21:14 UTC
Hello,

After approximately 1 month, the server (FreeBSD 11.2-RELEASE-p9 #0 r344062) has crashed again. 

It seems to me that the reason for the crash is most likely the same as last time:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer     = 0x20:0xffffffff82299013
stack pointer           = 0x28:0xfffffe0352893ad0
frame pointer           = 0x28:0xfffffe0352893b10
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 9 (dbuf_evict_thread)
trap number             = 9
panic: general protection fault
cpuid = 2
KDB: stack backtrace:
#0 0xffffffff80b3d5a7 at kdb_backtrace+0x67
#1 0xffffffff80af6b47 at vpanic+0x177
#2 0xffffffff80af69c3 at panic+0x43
#3 0xffffffff80f77fdf at trap_fatal+0x35f
#4 0xffffffff80f7759e at trap+0x5e
#5 0xffffffff80f580bc at calltrap+0x8
#6 0xffffffff8229c049 at dbuf_evict_one+0xe9
#7 0xffffffff82297a15 at dbuf_evict_thread+0x1a5
#8 0xffffffff80aba0c3 at fork_exit+0x83
#9 0xffffffff80f58fee at fork_trampoline+0xe
Uptime: 31d1h57m16s
Dumping 2771 out of 12248 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/kernel/ums.ko...Reading symbols from /usr/lib/debug//boot/kernel/ums.ko.debug...done.
done.
Loaded symbols for /boot/kernel/ums.ko
Reading symbols from /boot/kernel/pflog.ko...Reading symbols from /usr/lib/debug//boot/kernel/pflog.ko.debug...done.
done.
Loaded symbols for /boot/kernel/pflog.ko
Reading symbols from /boot/kernel/pf.ko...Reading symbols from /usr/lib/debug//boot/kernel/pf.ko.debug...done.
done.
Loaded symbols for /boot/kernel/pf.ko
Reading symbols from /boot/kernel/nullfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/nullfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/nullfs.ko
Reading symbols from /boot/kernel/blank_saver.ko...Reading symbols from /usr/lib/debug//boot/kernel/blank_saver.ko.debug...done.
done.
Loaded symbols for /boot/kernel/blank_saver.ko
Reading symbols from /boot/kernel/fdescfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/fdescfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/fdescfs.ko
Reading symbols from /boot/kernel/smbfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/smbfs.ko.debug...done.
done.
Loaded symbols for /boot/kernel/smbfs.ko
Reading symbols from /boot/kernel/libiconv.ko...Reading symbols from /usr/lib/debug//boot/kernel/libiconv.ko.debug...done.
done.
Loaded symbols for /boot/kernel/libiconv.ko
Reading symbols from /boot/kernel/libmchain.ko...Reading symbols from /usr/lib/debug//boot/kernel/libmchain.ko.debug...done.
done.
Loaded symbols for /boot/kernel/libmchain.ko
Reading symbols from /boot/kernel/geom_mirror.ko...Reading symbols from /usr/lib/debug//boot/kernel/geom_mirror.ko.debug...done.
done.
Loaded symbols for /boot/kernel/geom_mirror.ko
#0  doadump (textdump=<value optimized out>) at pcpu.h:229
229     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) #0  doadump (textdump=<value optimized out>) at pcpu.h:229
#1  0xffffffff80af675b in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:383
#2  0xffffffff80af6b81 in vpanic (fmt=<value optimized out>,
    ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:776
#3  0xffffffff80af69c3 in panic (fmt=<value optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:707
#4  0xffffffff80f77fdf in trap_fatal (frame=0xfffffe0352893a10, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:875
#5  0xffffffff80f7759e in trap (frame=0xfffffe0352893a10) at pcpu.h:229
#6  0xffffffff80f580bc in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:231
#7  0xffffffff82299013 in dbuf_destroy (db=0xfffff8028de9cae0) at atomic.h:79
#8  0xffffffff8229c049 in dbuf_evict_one ()
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:487
#9  0xffffffff82297a15 in dbuf_evict_thread (unused=<value optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:525
#10 0xffffffff80aba0c3 in fork_exit (
    callout=0xffffffff82297870 <dbuf_evict_thread>, arg=0x0,
    frame=0xfffffe0352893c00) at /usr/src/sys/kern/kern_fork.c:1054
#11 0xffffffff80f58fee in fork_trampoline ()
    at /usr/src/sys/amd64/amd64/exception.S:959
#12 0x0000000000000000 in ?? ()
Current language:  auto; currently minimal
(kgdb)

The "crashinfo" text file is available at this URL: http://www.ocpea.com/dump/core-5.txt

The original core dump files are also available if needed.

Can someone, please, take a look at this? I will be more than happy to help with debugging, supplying additional info etc if needed.

Kind regards,
Jurij
Comment 2 Andriy Gapon freebsd_committer 2019-03-16 17:22:37 UTC
It would be useful to see *db in frame 7.

It would also be useful to find out what the real line number is there:
#7  0xffffffff82299013 in dbuf_destroy (db=0xfffff8028de9cae0) at atomic.h:79

You can disassemble dbuf_destroy() and see what's at address 0xffffffff82299013 and what instructions / calls lead to it.
Comment 3 Jurij Kovacic 2019-03-16 17:46:59 UTC
Hello Andriy,

I am pretty new at this. Could you please elaborate a bit on how to "disassemble dbuf_destroy() and see what's at address 0xffffffff82299013 and what instructions / calls lead to it"? I presume I can get the required information from the crash dump?

Regards,
Jurij
Comment 4 Andriy Gapon freebsd_committer 2019-03-18 10:09:59 UTC
(In reply to Jurij Kovacic from comment #3)
Yes, `disassemble dbuf_destroy` in kgdb.
Comment 5 Jurij Kovacic 2019-03-18 21:11:40 UTC
Hello Andriy,

Thank you very much for the explanation. 

After running:
kgdb /boot/kernel/kernel /var/crash/vmcore.last

the instruction at "0xffffffff82299013" is:
0xffffffff82299013 <dbuf_destroy+563>:  mov    (%rax),%rcx
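
As a quick cross-check, the +563 in the symbol offset is simply the faulting address minus the start address of dbuf_destroy (0xffffffff82298de0, the first line of the disassembly). A minimal Python sketch of that arithmetic follows; the variable names are chosen here for illustration only.

```python
# Addresses copied from the kgdb output in this report.
FUNC_START = 0xffffffff82298de0   # dbuf_destroy+0
FAULT_RIP = 0xffffffff82299013    # instruction pointer from the trap frame

# The symbolic offset is the distance from the function start.
offset = FAULT_RIP - FUNC_START
print(f"dbuf_destroy+{offset} (0x{offset:x})")
```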


Please find the complete disassembly of the dbuf_destroy function below.

Kind regards,
Jurij


Dump of assembler code for function dbuf_destroy:
0xffffffff82298de0 <dbuf_destroy+0>:    push   %rbp
0xffffffff82298de1 <dbuf_destroy+1>:    mov    %rsp,%rbp
0xffffffff82298de4 <dbuf_destroy+4>:    push   %r15
0xffffffff82298de6 <dbuf_destroy+6>:    push   %r14
0xffffffff82298de8 <dbuf_destroy+8>:    push   %r13
0xffffffff82298dea <dbuf_destroy+10>:   push   %r12
0xffffffff82298dec <dbuf_destroy+12>:   push   %rbx
0xffffffff82298ded <dbuf_destroy+13>:   sub    $0x18,%rsp
0xffffffff82298df1 <dbuf_destroy+17>:   mov    %rdi,%r13
0xffffffff82298df4 <dbuf_destroy+20>:   mov    0x30(%r13),%r14
0xffffffff82298df8 <dbuf_destroy+24>:   mov    0x88(%r13),%rdi
0xffffffff82298dff <dbuf_destroy+31>:   test   %rdi,%rdi
0xffffffff82298e02 <dbuf_destroy+34>:   je     0xffffffff82298e17 <dbuf_destroy+55>
0xffffffff82298e04 <dbuf_destroy+36>:   mov    %r13,%rsi
0xffffffff82298e07 <dbuf_destroy+39>:   callq  0xffffffff8228c220 <arc_buf_destroy>
0xffffffff82298e0c <dbuf_destroy+44>:   movq   $0x0,0x88(%r13)
0xffffffff82298e17 <dbuf_destroy+55>:   cmpq   $0xffffffffffffffff,0x40(%r13)
0xffffffff82298e1c <dbuf_destroy+60>:   jne    0xffffffff82298e43 <dbuf_destroy+99>
0xffffffff82298e1e <dbuf_destroy+62>:   mov    0x18(%r13),%rdi
0xffffffff82298e22 <dbuf_destroy+66>:   mov    $0x140,%esi
0xffffffff82298e27 <dbuf_destroy+71>:   callq  0xffffffff82328d00 <zio_buf_free>
0xffffffff82298e2c <dbuf_destroy+76>:   mov    $0x140,%edi
0xffffffff82298e31 <dbuf_destroy+81>:   mov    $0x4,%esi
0xffffffff82298e36 <dbuf_destroy+86>:   callq  0xffffffff8228b6c0 <arc_space_return>
0xffffffff82298e3b <dbuf_destroy+91>:   movl   $0x0,0x78(%r13)
0xffffffff82298e43 <dbuf_destroy+99>:   mov    0xd8(%r13),%r15
0xffffffff82298e4a <dbuf_destroy+106>:  test   %r15,%r15
0xffffffff82298e4d <dbuf_destroy+109>:  je     0xffffffff82298e8a <dbuf_destroy+170>
0xffffffff82298e4f <dbuf_destroy+111>:  movq   $0x0,0xd8(%r13)
0xffffffff82298e5a <dbuf_destroy+122>:  mov    0x30(%r15),%rax
0xffffffff82298e5e <dbuf_destroy+126>:  mov    0x38(%r15),%rbx
0xffffffff82298e62 <dbuf_destroy+130>:  test   %rax,%rax
0xffffffff82298e65 <dbuf_destroy+133>:  je     0xffffffff82298e6c <dbuf_destroy+140>
0xffffffff82298e67 <dbuf_destroy+135>:  mov    %r15,%rdi
0xffffffff82298e6a <dbuf_destroy+138>:  callq  *%rax
0xffffffff82298e6c <dbuf_destroy+140>:  test   %rbx,%rbx
0xffffffff82298e6f <dbuf_destroy+143>:  je     0xffffffff82298e8a <dbuf_destroy+170>
0xffffffff82298e71 <dbuf_destroy+145>:  mov    0xffffffff8240c470,%rdi
0xffffffff82298e79 <dbuf_destroy+153>:  mov    0x38(%r15),%rsi
0xffffffff82298e7d <dbuf_destroy+157>:  xor    %ecx,%ecx
0xffffffff82298e7f <dbuf_destroy+159>:  mov    %r15,%rdx
0xffffffff82298e82 <dbuf_destroy+162>:  mov    %r15,%r8
0xffffffff82298e85 <dbuf_destroy+165>:  callq  0xffffffff82272960 <taskq_dispatch_ent>
0xffffffff82298e8a <dbuf_destroy+170>:  movq   $0x0,0x18(%r13)
0xffffffff82298e92 <dbuf_destroy+178>:  cmpl   $0x2,0x78(%r13)
0xffffffff82298e97 <dbuf_destroy+183>:  je     0xffffffff82298ea1 <dbuf_destroy+193>
0xffffffff82298e99 <dbuf_destroy+185>:  movl   $0x0,0x78(%r13)
0xffffffff82298ea1 <dbuf_destroy+193>:  lea    0xc8(%r13),%rdi
0xffffffff82298ea8 <dbuf_destroy+200>:  callq  0xffffffff822dbf30 <multilist_link_active>
0xffffffff82298ead <dbuf_destroy+205>:  test   %eax,%eax
0xffffffff82298eaf <dbuf_destroy+207>:  je     0xffffffff82298ed4 <dbuf_destroy+244>
0xffffffff82298eb1 <dbuf_destroy+209>:  mov    0xffffffff8240c478,%rdi
0xffffffff82298eb9 <dbuf_destroy+217>:  mov    %r13,%rsi
0xffffffff82298ebc <dbuf_destroy+220>:  callq  0xffffffff822dbbe0 <multilist_remove>
0xffffffff82298ec1 <dbuf_destroy+225>:  mov    0x10(%r13),%rsi
0xffffffff82298ec5 <dbuf_destroy+229>:  neg    %rsi
0xffffffff82298ec8 <dbuf_destroy+232>:  mov    $0xffffffff8240c480,%rdi
0xffffffff82298ecf <dbuf_destroy+239>:  callq  0xffffffff82273960 <atomic_add_64_nv>
0xffffffff82298ed4 <dbuf_destroy+244>:  movl   $0x5,0x78(%r13)
0xffffffff82298edc <dbuf_destroy+252>:  movq   $0x0,0x48(%r13)
0xffffffff82298ee4 <dbuf_destroy+260>:  lea    0x58(%r13),%rdi
0xffffffff82298ee8 <dbuf_destroy+264>:  mov    $0xffffffff823d4fd1,%rsi
0xffffffff82298eef <dbuf_destroy+271>:  mov    $0x812,%edx
0xffffffff82298ef4 <dbuf_destroy+276>:  callq  0xffffffff80aff910 <_sx_xunlock>
0xffffffff82298ef9 <dbuf_destroy+281>:  mov    0x28(%r13),%rdi
0xffffffff82298efd <dbuf_destroy+285>:  mov    $0xffffffff823d5126,%rsi
0xffffffff82298f04 <dbuf_destroy+292>:  callq  0xffffffff82331f70 <zrl_add_impl>
0xffffffff82298f09 <dbuf_destroy+297>:  mov    0x28(%r13),%rdi
0xffffffff82298f0d <dbuf_destroy+301>:  mov    0x40(%rdi),%r15
0xffffffff82298f11 <dbuf_destroy+305>:  mov    0x40(%r15),%rbx
0xffffffff82298f15 <dbuf_destroy+309>:  cmpq   $0xffffffffffffffff,0x40(%r13)
0xffffffff82298f1a <dbuf_destroy+314>:  je     0xffffffff82299059 <dbuf_destroy+633>
0xffffffff82298f20 <dbuf_destroy+320>:  mov    %rbx,-0x30(%rbp)
0xffffffff82298f24 <dbuf_destroy+324>:  mov    %r14,-0x38(%rbp)
0xffffffff82298f28 <dbuf_destroy+328>:  lea    0x1f8(%r15),%r12
0xffffffff82298f2f <dbuf_destroy+335>:  mov    0x210(%r15),%rbx
0xffffffff82298f36 <dbuf_destroy+342>:  and    $0xfffffffffffffff1,%rbx
0xffffffff82298f3a <dbuf_destroy+346>:  mov    %gs:0x0,%r14
0xffffffff82298f43 <dbuf_destroy+355>:  cmp    %r14,%rbx
0xffffffff82298f46 <dbuf_destroy+358>:  je     0xffffffff82298f5e <dbuf_destroy+382>
0xffffffff82298f48 <dbuf_destroy+360>:  xor    %esi,%esi
0xffffffff82298f4a <dbuf_destroy+362>:  mov    $0xffffffff823d4fd1,%rdx
0xffffffff82298f51 <dbuf_destroy+369>:  mov    $0x81a,%ecx
0xffffffff82298f56 <dbuf_destroy+374>:  mov    %r12,%rdi
0xffffffff82298f59 <dbuf_destroy+377>:  callq  0xffffffff80aff0d0 <_sx_xlock>
0xffffffff82298f5e <dbuf_destroy+382>:  lea    0x218(%r15),%rdi
0xffffffff82298f65 <dbuf_destroy+389>:  mov    %r13,%rsi
0xffffffff82298f68 <dbuf_destroy+392>:  callq  0xffffffff82266e70 <avl_remove>
0xffffffff82298f6d <dbuf_destroy+397>:  lea    0xa8(%r15),%rdi
0xffffffff82298f74 <dbuf_destroy+404>:  mov    $0x1,%esi
0xffffffff82298f79 <dbuf_destroy+409>:  callq  0xffffffff80f56de0 <atomic_subtract_int>
0xffffffff82298f7e <dbuf_destroy+414>:  callq  0xffffffff822739b0 <membar_producer>
0xffffffff82298f83 <dbuf_destroy+419>:  mov    0x28(%r13),%rdi
0xffffffff82298f87 <dbuf_destroy+423>:  callq  0xffffffff82332000 <zrl_remove>
0xffffffff82298f8c <dbuf_destroy+428>:  cmp    %r14,%rbx
0xffffffff82298f8f <dbuf_destroy+431>:  je     0xffffffff82298fa5 <dbuf_destroy+453>
0xffffffff82298f91 <dbuf_destroy+433>:  mov    $0xffffffff823d4fd1,%rsi
0xffffffff82298f98 <dbuf_destroy+440>:  mov    $0x820,%edx
0xffffffff82298f9d <dbuf_destroy+445>:  mov    %r12,%rdi
0xffffffff82298fa0 <dbuf_destroy+448>:  callq  0xffffffff80aff910 <_sx_xunlock>
0xffffffff82298fa5 <dbuf_destroy+453>:  mov    %r15,%rdi
0xffffffff82298fa8 <dbuf_destroy+456>:  mov    %r13,%rsi
0xffffffff82298fab <dbuf_destroy+459>:  callq  0xffffffff822b4dd0 <dnode_rele>
0xffffffff82298fb0 <dbuf_destroy+464>:  movq   $0x0,0x28(%r13)
0xffffffff82298fb8 <dbuf_destroy+472>:  mov    0x0(%r13),%rsi
0xffffffff82298fbc <dbuf_destroy+476>:  mov    0x20(%r13),%rdi
0xffffffff82298fc0 <dbuf_destroy+480>:  mov    0x40(%r13),%rcx
0xffffffff82298fc4 <dbuf_destroy+484>:  movzbl 0x50(%r13),%edx
0xffffffff82298fc9 <dbuf_destroy+489>:  callq  0xffffffff82297340 <cityhash4>
0xffffffff82298fce <dbuf_destroy+494>:  mov    %rax,%rbx
0xffffffff82298fd1 <dbuf_destroy+497>:  and    0xffffffff8240a458,%rbx
0xffffffff82298fd9 <dbuf_destroy+505>:  movzbl %bl,%eax
0xffffffff82298fdc <dbuf_destroy+508>:  shl    $0x5,%rax
0xffffffff82298fe0 <dbuf_destroy+512>:  lea    -0x7dbf5b98(%rax),%r15
0xffffffff82298fe7 <dbuf_destroy+519>:  xor    %esi,%esi
0xffffffff82298fe9 <dbuf_destroy+521>:  mov    $0xffffffff823d4fd1,%rdx
0xffffffff82298ff0 <dbuf_destroy+528>:  mov    $0x129,%ecx
0xffffffff82298ff5 <dbuf_destroy+533>:  mov    %r15,%rdi
0xffffffff82298ff8 <dbuf_destroy+536>:  callq  0xffffffff80aff0d0 <_sx_xlock>
0xffffffff82298ffd <dbuf_destroy+541>:  shl    $0x3,%rbx
0xffffffff82299001 <dbuf_destroy+545>:  add    0xffffffff8240a460,%rbx
0xffffffff82299009 <dbuf_destroy+553>:  mov    -0x38(%rbp),%r14
0xffffffff8229900d <dbuf_destroy+557>:  nopl   (%rax)
0xffffffff82299010 <dbuf_destroy+560>:  mov    %rbx,%rax
0xffffffff82299013 <dbuf_destroy+563>:  mov    (%rax),%rcx
0xffffffff82299016 <dbuf_destroy+566>:  lea    0x38(%rcx),%rbx
0xffffffff8229901a <dbuf_destroy+570>:  cmp    %r13,%rcx
0xffffffff8229901d <dbuf_destroy+573>:  jne    0xffffffff82299010 <dbuf_destroy+560>
0xffffffff8229901f <dbuf_destroy+575>:  mov    0x38(%r13),%rcx
0xffffffff82299023 <dbuf_destroy+579>:  mov    %rcx,(%rax)
0xffffffff82299026 <dbuf_destroy+582>:  movq   $0x0,0x38(%r13)
0xffffffff8229902e <dbuf_destroy+590>:  mov    $0xffffffff823d4fd1,%rsi
0xffffffff82299035 <dbuf_destroy+597>:  mov    $0x131,%edx
0xffffffff8229903a <dbuf_destroy+602>:  mov    %r15,%rdi
0xffffffff8229903d <dbuf_destroy+605>:  callq  0xffffffff80aff910 <_sx_xunlock>
0xffffffff82299042 <dbuf_destroy+610>:  mov    $0xffffffff8240c4c8,%rdi
0xffffffff82299049 <dbuf_destroy+617>:  mov    $0x1,%esi
0xffffffff8229904e <dbuf_destroy+622>:  callq  0xffffffff80f56e60 <atomic_subtract_long>
0xffffffff82299053 <dbuf_destroy+627>:  mov    -0x30(%rbp),%rbx
0xffffffff82299057 <dbuf_destroy+631>:  jmp    0xffffffff8229905e <dbuf_destroy+638>
0xffffffff82299059 <dbuf_destroy+633>:  callq  0xffffffff82332000 <zrl_remove>
0xffffffff8229905e <dbuf_destroy+638>:  movq   $0x0,0x30(%r13)
0xffffffff82299066 <dbuf_destroy+646>:  mov    0xffffffff8240c468,%rdi
0xffffffff8229906e <dbuf_destroy+654>:  mov    %r13,%rsi
0xffffffff82299071 <dbuf_destroy+657>:  callq  0xffffffff825e83c0 <kmem_cache_free>
0xffffffff82299076 <dbuf_destroy+662>:  mov    $0xe8,%edi
0xffffffff8229907b <dbuf_destroy+667>:  mov    $0x4,%esi
0xffffffff82299080 <dbuf_destroy+672>:  callq  0xffffffff8228b6c0 <arc_space_return>
0xffffffff82299085 <dbuf_destroy+677>:  test   %r14,%r14
0xffffffff82299088 <dbuf_destroy+680>:  je     0xffffffff822990bc <dbuf_destroy+732>
0xffffffff8229908a <dbuf_destroy+682>:  cmp    %rbx,%r14
0xffffffff8229908d <dbuf_destroy+685>:  je     0xffffffff822990bc <dbuf_destroy+732>
0xffffffff8229908f <dbuf_destroy+687>:  lea    0x58(%r14),%rdi
0xffffffff82299093 <dbuf_destroy+691>:  xor    %esi,%esi
0xffffffff82299095 <dbuf_destroy+693>:  mov    $0xffffffff823d4fd1,%rdx
0xffffffff8229909c <dbuf_destroy+700>:  mov    $0xaa6,%ecx
0xffffffff822990a1 <dbuf_destroy+705>:  callq  0xffffffff80aff0d0 <_sx_xlock>
0xffffffff822990a6 <dbuf_destroy+710>:  mov    %r14,%rdi
0xffffffff822990a9 <dbuf_destroy+713>:  add    $0x18,%rsp
0xffffffff822990ad <dbuf_destroy+717>:  pop    %rbx
0xffffffff822990ae <dbuf_destroy+718>:  pop    %r12
0xffffffff822990b0 <dbuf_destroy+720>:  pop    %r13
0xffffffff822990b2 <dbuf_destroy+722>:  pop    %r14
0xffffffff822990b4 <dbuf_destroy+724>:  pop    %r15
0xffffffff822990b6 <dbuf_destroy+726>:  pop    %rbp
0xffffffff822990b7 <dbuf_destroy+727>:  jmpq   0xffffffff8229b290 <dbuf_rele_and_unlock>
0xffffffff822990bc <dbuf_destroy+732>:  add    $0x18,%rsp
0xffffffff822990c0 <dbuf_destroy+736>:  pop    %rbx
0xffffffff822990c1 <dbuf_destroy+737>:  pop    %r12
0xffffffff822990c3 <dbuf_destroy+739>:  pop    %r13
0xffffffff822990c5 <dbuf_destroy+741>:  pop    %r14
0xffffffff822990c7 <dbuf_destroy+743>:  pop    %r15
0xffffffff822990c9 <dbuf_destroy+745>:  pop    %rbp
0xffffffff822990ca <dbuf_destroy+746>:  retq
End of assembler dump.
Current language:  auto; currently minimal
Comment 6 Andriy Gapon freebsd_committer 2019-03-18 23:09:16 UTC
So, it looks like the crash was in dbuf_hash_remove() and its cause appears to be that the db in question was not in the hash table.
So, could you print *db in frame 7 ?

Commands:
frame 7
print *db
:-)
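
For readers following along: the loop at dbuf_destroy+560 through +573 in the disassembly keeps following a chain pointer until it finds db, and shows no end-of-chain check. The sketch below is a toy Python model of that pattern, not the ZFS code; the class and function names are invented for illustration. It shows why a target that cannot be reached on its chain makes the walk fail.

```python
class Node:
    """Toy stand-in for a dbuf; 'next' plays the role of db_hash_next."""
    def __init__(self, name):
        self.name = name
        self.next = None


def chain_remove(buckets, idx, target):
    """Unlink target from bucket idx, assuming it is on that chain."""
    prev = None
    cur = buckets[idx]
    while cur is not target:   # no check for the end of the chain
        prev = cur
        cur = cur.next         # fails here once cur is None
    if prev is None:
        buckets[idx] = target.next
    else:
        prev.next = target.next
    target.next = None


a, b, c = Node("a"), Node("b"), Node("c")
a.next = b
buckets = [a]

chain_remove(buckets, 0, b)        # b is on the chain, so this works
try:
    chain_remove(buckets, 0, c)    # c was never inserted
    outcome = "removed"
except AttributeError:
    outcome = "walked off the end of the chain"
print(outcome)
```

In the kernel, the equivalent of the failing `cur.next` is a load through a bad pointer, which traps instead of raising an exception.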
Comment 7 Jurij Kovacic 2019-03-19 16:12:56 UTC
Hi Andriy,

Thank you again for the detailed instructions.

The following is what I got after entering "frame 7" and "print *db" as instructed: 


(kgdb) frame 7
#7  0xffffffff82299013 in dbuf_destroy (db=0xfffff8028de9cae0) at atomic.h:79
79              atomic_subtract_32(target, 1);
Current language:  auto; currently minimal
(kgdb) print *db
$1 = {db = {db_object = 52289, db_offset = 15868362752, db_size = 131072, db_data = 0x0}, db_objset = 0xfffff8000f30a400, db_dnode_handle = 0x0, db_parent = 0xfffff800168d9000,
  db_hash_next = 0x0, db_blkid = 121066, db_blkptr = 0x0, db_level = 0 '\0', db_mtx = {lock_object = {lo_name = 0xffffffff823d529f "db->db_mtx", lo_flags = 577830912, lo_data = 0,
      lo_witness = 0x0}, sx_lock = 1}, db_state = DB_EVICTING, db_holds = {rc_count = 0}, db_buf = 0x0, db_changed = {cv_description = 0xffffffff823d52ab "db->db_changed", cv_waiters = 0},
  db_data_pending = 0x0, db_last_dirty = 0x0, db_link = {avl_child = 0xfffff8028de9cb90, avl_pcb = 18446735289386928809}, db_cache_link = {list_next = 0x0, list_prev = 0x0}, db_user = 0x0,
  db_user_immediate_evict = 0 '\0', db_freed_in_flight = 0 '\0', db_pending_evict = 0 '\0', db_dirtycnt = 0 '\0'}
(kgdb)

Kind regards,
Jurij
Comment 8 Andriy Gapon freebsd_committer 2019-03-19 17:50:47 UTC
Could you please also do 'info reg' in that frame?
Comment 9 Jurij Kovacic 2019-03-19 19:09:01 UTC
Yes, sure - I did "frame 7" and then "info reg" - this is what I got:

(kgdb) frame 7
#7  0xffffffff82299013 in dbuf_destroy (db=0xfffff8028de9cae0) at atomic.h:79
79              atomic_subtract_32(target, 1);
(kgdb) info reg
rax            0xffff78028de9cb18       -149522610533608
rbx            0xffff78028de9cb18       -149522610533608
rcx            0xffff78028de9cae0       -149522610533664
rdx            0xfffff8000ac14001       -8795912585215
rsi            0xffffffff8240b008       -2109689848
rdi            0xffffffff8240b008       -2109689848
rbp            0xfffffe0352893b10       0xfffffe0352893b10
rsp            0xfffffe0352893ad0       0xfffffe0352893ad0
r8             0x9ae16a3b2f90408f       -7286425919675154289
r9             0xf66ab0fc2de1ac00       -690544995699938304
r10            0x0      0
r11            0xfffff800d6677c78       -8792495915912
r12            0xfffff8005b1914d0       -8794564651824
r13            0xfffff8028de9cae0       -8785122178336
r14            0xfffff800168d9000       -8795714646016
r15            0xffffffff8240b008       -2109689848
rip            0xffffffff82299013       0xffffffff82299013 <dbuf_destroy+563>
eflags         0x10287  66183
cs             0x20     32
ss             0x28     40
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
(kgdb)
Comment 10 Andriy Gapon freebsd_committer 2019-03-21 09:36:30 UTC
(In reply to Jurij Kovacic from comment #9)
Okay, compare the value of r13 (which is the original db pointer) to the value in rcx. They differ by a single bit:
(gdb) p/x 0xfffff8028de9cae0 ^ 0xffff78028de9cae0
$1 = 0x800000000000

The values in rbx and rax are derived from the value in rcx.  And the crash happens when accessing the address stored in rax.

This was a long way to say that most likely you got a single bit flip on the hardware level.  I assume that the RAM is not ECC.
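
The comparison above can be reproduced outside gdb. The following Python sketch uses the register values from comment 9 and checks that r13 and rcx differ in exactly one bit, that rax is rcx plus the 0x38 offset seen in the disassembly, and that the resulting address is non-canonical. The canonical-address check assumes 48-bit virtual addresses, which is an assumption about this machine and not something stated in the report. On amd64, an access to a non-canonical address raises a general protection fault, not a page fault, which would be consistent with the "Fatal trap 9" in the panic message.

```python
# Register values copied from the 'info reg' output in comment 9.
R13 = 0xfffff8028de9cae0   # original db pointer
RCX = 0xffff78028de9cae0   # value loaded from the hash chain
RAX = 0xffff78028de9cb18   # address dereferenced at the faulting instruction

diff = R13 ^ RCX
# A power of two has exactly one bit set.
is_single_bit = diff != 0 and (diff & (diff - 1)) == 0
flipped_bit = diff.bit_length() - 1


def is_canonical(addr, va_bits=48):
    """True if bits va_bits-1..63 are all equal (assumes 48-bit VAs by default)."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


print(f"xor = 0x{diff:x}, single bit: {is_single_bit}, bit index: {flipped_bit}")
print(f"rax - rcx = 0x{RAX - RCX:x}")
print(f"r13 canonical: {is_canonical(R13)}, rax canonical: {is_canonical(RAX)}")
```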
Comment 11 Jurij Kovacic 2019-03-21 17:07:09 UTC
(In reply to Andriy Gapon from comment #10)

Hi Andriy,

From what I can tell the RAM modules indeed are not ECC type,
...
Memory Device
        Array Handle: 0x0025
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 4096 MB
...

since Total Width and Data Width are the same - you can find the complete dmidecode output below. 

I take it your recommendation is to replace the RAM modules?

Kind regards,
Jurij

# dmidecode 3.2
Scanning /dev/mem for entry point.
SMBIOS 2.5 present.

Handle 0x0025, DMI type 16, 15 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: None
        Maximum Capacity: 128 GB
        Error Information Handle: Not Provided
        Number Of Devices: 4

Handle 0x0027, DMI type 17, 27 bytes
Memory Device
        Array Handle: 0x0025
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 2048 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU1_DIMM0
        Bank Locator: BANK0
        Type: Other
        Type Detail: Synchronous
        Speed: 667 MT/s
        Manufacturer: Apacer
        Serial Number: 32110102
        Asset Tag: AssetTagNum0
        Part Number: 78.A1GDE.9K00C

Handle 0x0029, DMI type 17, 27 bytes
Memory Device
        Array Handle: 0x0025
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 2048 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU1_DIMM1
        Bank Locator: BANK1
        Type: Other
        Type Detail: Synchronous
        Speed: 667 MT/s
        Manufacturer: Apacer
        Serial Number: 32110102
        Asset Tag: AssetTagNum1
        Part Number: 78.A1GDE.9K00C

Handle 0x002B, DMI type 17, 27 bytes
Memory Device
        Array Handle: 0x0025
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 4096 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU1_DIMM2
        Bank Locator: BANK2
        Type: Other
        Type Detail: Synchronous
        Speed: 667 MT/s
        Manufacturer: Kingston
        Serial Number: C40B2C66
        Asset Tag: AssetTagNum2
        Part Number: 99U5403-034.A00LF

Handle 0x002D, DMI type 17, 27 bytes
Memory Device
        Array Handle: 0x0025
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 4096 MB
        Form Factor: DIMM
        Set: None
        Locator: CPU1_DIMM3
        Bank Locator: BANK3
        Type: Other
        Type Detail: Synchronous
        Speed: 667 MT/s
        Manufacturer: Kingston
        Serial Number: 91352C8F
        Asset Tag: AssetTagNum3
        Part Number: 99U5403-034.A00LF
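
The width comparison used above to judge whether the modules are ECC can be scripted. The sketch below is a minimal Python example, assuming dmidecode text in the format shown in this comment; the function names are hypothetical. ECC DIMMs commonly report a Total Width larger than the Data Width (for example 72 versus 64 bits), so equal widths suggest non-ECC memory. Pairing the two fields by order of appearance is a simplification that would misalign if an entry reported a width as "Unknown".

```python
import re


def memory_widths(dmidecode_text):
    """Return (total_bits, data_bits) pairs in order of appearance."""
    totals = re.findall(r"Total Width:\s*(\d+)\s*bits", dmidecode_text)
    datas = re.findall(r"Data Width:\s*(\d+)\s*bits", dmidecode_text)
    return [(int(t), int(d)) for t, d in zip(totals, datas)]


def looks_like_ecc(total_bits, data_bits):
    """Heuristic: extra bits beyond the data width suggest ECC."""
    return total_bits > data_bits


# Two entries abbreviated from the dmidecode output in this comment.
sample = """
Memory Device
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 2048 MB
Memory Device
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 4096 MB
"""

widths = memory_widths(sample)
print(widths, any(looks_like_ecc(t, d) for t, d in widths))
```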
Comment 12 Andriy Gapon freebsd_committer 2019-03-21 17:43:00 UTC
It's not necessarily a memory module, it could be a bad contact in a memory slot or in a CPU socket. Or some source of alpha particles in the vicinity :-)
But you get the idea.
Comment 13 Jurij Kovacic 2019-03-21 17:52:17 UTC
Thank you very much for your help, Andriy. To be quite honest, I was beginning to lose hope, since I got no reply on the mailing list, whereas you not only helped, but taught me a few things along the way.

Now we know we should focus on the hardware side of things.

Once again, thank you.

Kind regards,
Jurij
Comment 14 Andriy Gapon freebsd_committer 2019-03-22 09:13:48 UTC
You are welcome.
And thank you for the report and great help with it.