Bug 219224 - [zfs] sync hanging in _cv_wait zil_commit zfs_sync sys_sync
Summary: [zfs] sync hanging in _cv_wait zil_commit zfs_sync sys_sync
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.0-RELEASE
Hardware: amd64 Any
: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-05-11 17:06 UTC by Tino Reinhardt
Modified: 2017-06-20 17:45 UTC (History)
1 user (show)

See Also:


Attachments
"procstat -kk -a" output (346.14 KB, text/plain)
2017-05-12 18:59 UTC, Tino Reinhardt
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Tino Reinhardt 2017-05-11 17:06:23 UTC
I'm observing hanging processes on a zfs system, apparently after umount
tried to sync data before unmounting a jails devfs.

tools-1-b# service jail restart git-1-b
Stopping jails: git-1-b
load: 1.77  cmd: umount 79620 [zilog->zl_cv_batch[0]] 568.08r 0.00u 0.01s 0% 2000k
load: 1.65  cmd: umount 79620 [zilog->zl_cv_batch[0]] 575.16r 0.00u 0.01s 0% 2000k
load: 2.32  cmd: umount 79620 [zilog->zl_cv_batch[0]] 580.44r 0.00u 0.01s 0% 2000k

tools-1-b# procstat -kk 79620
  PID    TID COMM             TDNAME           KSTACK                       
79620 100216 umount           -                mi_switch+0xd2 sleepq_wait+0x3a _cv_wait+0x194 zil_commit+0x85 zfs_sync+0xad sys_sync+0x146 amd64_syscall+0x4ce Xfast_syscall+0xfb 

tools-1-b# kgdb
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...(no debugging symbols found)
...
Attempt to extract a component of a value that is not a structure pointer.
Attempt to extract a component of a value that is not a structure pointer.
#0  0xffffffff80b0cafb in sched_switch ()
(kgdb) tid 100216
[Switching to thread 2772 (Thread 100216)]#0  0xffffffff80b0cafb in sched_switch ()
(kgdb) bt
#0  0xffffffff80b0cafb in sched_switch ()
#1  0xffffffff80ae4ae2 in mi_switch ()
#2  0xffffffff80b3226a in sleepq_wait ()
#3  0xffffffff80a66464 in _cv_wait ()
#4  0xffffffff82252945 in ?? ()
#5  0xfffffe201d5308e0 in ?? ()
#6  0xfffff805d1af7500 in ?? ()
#7  0xfffffe201d530870 in ?? ()
#8  0xffffffff80bae521 in __mnt_vnode_markerfree_all ()
#9  0xffffffff8227aded in ?? ()
#10 0x00000000fffff7ff in ?? ()
#11 0xfffffe201d530968 in ?? ()
#12 0xfffffe201d530990 in ?? ()
#13 0xffffffff80bb48a6 in sys_sync ()
Previous frame identical to this frame (corrupt stack?)
(kgdb) 

Two other processes are hanging too:

tools-1-b# ps ax | awk '$3~"D"&&$5~"^/"'
36031  -  DsJ      0:00.49 /var/cfengine/bin/cf-agent -Dfrom_cfexecd,scheduled_
40395  -  DJ       0:00.01 /var/cfengine/bin/cf-promises -c /var/cfengine/input
79620  0  D+       0:00.01 /sbin/umount /jails/git-1-b/dev

tools-1-b# procstat -kk 36031
  PID    TID COMM             TDNAME           KSTACK                       
36031 103064 cf-agent         -                mi_switch+0xd2 sleepq_wait+0x3a _cv_wait+0x194 zil_commit+0x85 zfs_freebsd_putpages+0x71e VOP_PUTPAGES_APV+0x8d vnode_pager_putpages+0xe3 vm_pageout_flush+0xed vm_object_page_collect_flush+0x21a vm_object_page_clean+0x197 vm_object_terminate+0x99 zfs_freebsd_reclaim+0x1e VOP_RECLAIM_APV+0x8f vgonel+0x2ef vrecycle+0x90 zfs_freebsd_inactive+0xd VOP_INACTIVE_APV+0x8d vinactive+0x17d

tools-1-b# procstat -kk 40395
  PID    TID COMM             TDNAME           KSTACK                       
40395 103070 cf-promises      -                mi_switch+0xd2 sleepq_wait+0x3a _cv_wait+0x194 zil_commit+0x85 zfs_freebsd_putpages+0x71e VOP_PUTPAGES_APV+0x8d vnode_pager_putpages+0xe3 vm_pageout_flush+0xed vm_object_page_collect_flush+0x21a vm_object_page_clean+0x197 vm_object_terminate+0x99 zfs_freebsd_reclaim+0x1e VOP_RECLAIM_APV+0x8f vgonel+0x2ef vrecycle+0x90 zfs_freebsd_inactive+0xd VOP_INACTIVE_APV+0x8d vinactive+0x17d

Everything is on zfs, with both the mothership and the
jail in zroot/jails/cfengine-1-b sharing pool "zroot":

tools-1-b# zpool list 
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data   2.91T  1.43M  2.91T         -     0%     0%  2.14x  ONLINE  -
zroot   744G  19.6G   724G         -    19%     2%  1.60x  ONLINE  -

I observe another process being reported at 100%
cpu usage, that's highly unusual, also inside the cfengine-jail:

tools-1-b# ps w 2988
 PID TT  STAT      TIME COMMAND
2988  -  TsJ  118:40.91 /usr/local/sbin/cf-serverd

tools-1-b# procstat -kk 2988
  PID    TID COMM             TDNAME           KSTACK                       
 2988 100835 cf-serverd       -                mi_switch+0xd2 thread_suspend_check+0x34a ast+0x5be doreti_ast+0x1f 
 2988 100922 cf-serverd       -                mi_switch+0xd2 thread_suspend_check+0x34a ast+0x5be doreti_ast+0x1f 
 2988 100963 cf-serverd       -                dmu_object_info_from_db+0x2d zfs_zget+0xde zfs_get_data+0x40 zil_commit+0x6a1 zfs_freebsd_putpages+0x71e VOP_PUTPAGES_APV+0x8d vnode_pager_putpages+0xe3 vm_pageout_flush+0xed vm_object_page_collect_flush+0x21a vm_object_page_clean+0x197 vm_object_terminate+0x99 zfs_freebsd_reclaim+0x1e VOP_RECLAIM_APV+0x8f vgonel+0x2ef vrecycle+0x90 zfs_freebsd_inactive+0xd VOP_INACTIVE_APV+0x8d vinactive+0x17d 
 2988 100966 cf-serverd       -                mi_switch+0xd2 thread_suspend_check+0x34a ast+0x5be doreti_ast+0x1f 
 2988 100970 cf-serverd       -                mi_switch+0xd2 thread_suspend_check+0x34a ast+0x5be doreti_ast+0x1f 
 2988 100973 cf-serverd       -                mi_switch+0xd2 thread_suspend_check+0x34a ast+0x5be doreti_ast+0x1f

[ ... 1000 more threads with exactly
"mi_switch+0xd2 thread_suspend_check+0x34a ast+0x5be doreti_ast+0x1f"]

This thread 100963 seems to be looping around:

tools-1-b# {repeat 20 procstat -k 2988} | awk '$2=="100963"' | cut -d- -f3- | sort | uniq 
                dbuf_hold_impl dbuf_hold dnode_hold_impl dmu_bonus_hold zfs_zget zfs_get_data zil_commit zfs_freebsd_putpages VOP_PUTPAGES_APV vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush vm_object_page_clean vm_object_terminate zfs_freebsd_reclaim VOP_RECLAIM_APV vgonel vrecycle 
                dbuf_read dmu_bonus_hold zfs_zget zfs_get_data zil_commit zfs_freebsd_putpages VOP_PUTPAGES_APV vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush vm_object_page_clean vm_object_terminate zfs_freebsd_reclaim VOP_RECLAIM_APV vgonel vrecycle zfs_freebsd_inactive VOP_INACTIVE_APV 
                dbuf_read dnode_hold_impl dmu_bonus_hold zfs_zget zfs_get_data zil_commit zfs_freebsd_putpages VOP_PUTPAGES_APV vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush vm_object_page_clean vm_object_terminate zfs_freebsd_reclaim VOP_RECLAIM_APV vgonel vrecycle zfs_freebsd_inactive 
                dmu_bonus_hold zfs_zget zfs_get_data zil_commit zfs_freebsd_putpages VOP_PUTPAGES_APV vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush vm_object_page_clean vm_object_terminate zfs_freebsd_reclaim VOP_RECLAIM_APV vgonel vrecycle zfs_freebsd_inactive VOP_INACTIVE_APV vinactive 
                dmu_object_info_from_db zfs_zget zfs_get_data zil_commit zfs_freebsd_putpages VOP_PUTPAGES_APV vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush vm_object_page_clean vm_object_terminate zfs_freebsd_reclaim VOP_RECLAIM_APV vgonel vrecycle zfs_freebsd_inactive VOP_INACTIVE_APV vinactive 
                dmu_object_info_from_dnode dmu_object_info_from_db zfs_zget zfsget_data zil_commit zfs_freebsd_putpages VOP_PUTPAGES_APV vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush vm_object_page_clean vm_object_terminate zfs_freebsd_reclaim VOP_RECLAIM_APV vgonel vrecycle zfs_freebsd_inactive VOP_INACTIVE_APV 
                dmu_zfetch dbuf_read dnode_hold_impl dmu_bonus_hold zfs_zget zfs_get_data zil_commit zfs_freebsd_putpages VOP_PUTPAGES_APV vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush vm_object_page_clean vm_object_terminate zfs_freebsd_reclaim VOP_RECLAIM_APV vgonel vrecycle 
                dnode_hold_impl dmu_bonus_hold zfs_zget zfs_get_data zil_commit zfs_freebsd_putpages VOP_PUTPAGES_APV vnode_pager_putpages vm_pageout_flush vm_o
bject_page_collect_flush vm_object_page_clean vm_object_terminate zfs_freebsd_reclaim VOP_RECLAIM_APV vgonel vrecycle zfs_freebsd_inactive VOP_INACTIVE_APV 
                dnode_rele dmu_bonus_hold zfs_zget zfs_get_data zil_commit zfs_freebsd_putpages VOP_PUTPAGES_APV vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush vm_object_page_clean vm_object_terminate zfs_freebsd_reclaim VOP_RECLAIM_APV vgonel vrecycle zfs_freebsd_inactive VOP_INACTIVE_APV 
                zfs_zget zfs_get_data zil_commit zfs_freebsd_putpages VOP_PUTPAGES_APV vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush vm_object_page_clean vm_object_terminate zfs_freebsd_reclaim VOP_RECLAIM_APV vgonel vrecycle zfs_freebsd_inactive VOP_INACTIVE_APV vinactive vputx

Only the above listed processes are affected, there
are ~200 other ones running (and writing to zroot pool).
Comment 1 Tino Reinhardt 2017-05-12 10:57:09 UTC
I seem to be able to reproduce this on a similar system,
having cfengine running and run a loop in a shell like
"while :; do sync; done":

tools-2-b# ps ax | grep sync
   27  -  DL       1:27.96 [syncer]
 3667  1  R+    1017:57.50 sync

tools-2-b# procstat -kk 3667
  PID    TID COMM             TDNAME           KSTACK
 3667 102488 sync             -                dbuf_read+0x155 dnode_hold_impl+0x188 dmu_bonus_hold+0x1d zfs_zget+0xc5 zfs_get_data+0x40 zil_commit+0x6a1 zfs_sync+0xad sys_sync+0x146 amd64_syscall+0x4ce Xfast_syscall+0xfb 
tools-2-b# procstat -kk 3667
  PID    TID COMM             TDNAME           KSTACK
 3667 102488 sync             -                dnode_hold_impl+0x409 dmu_bonus_hold+0x1d zfs_zget+0xc5 zfs_get_data+0x40 zil_commit+0x6a1 zfs_sync+0xad sys_sync+0x146 amd64_syscall+0x4ce Xfast_syscall+0xfb 
tools-2-b# procstat -kk 3667
  PID    TID COMM             TDNAME           KSTACK
 3667 102488 sync             -                dmu_zfetch+0x7b dbuf_read+0x155 dnode_hold_impl+0x188 dmu_bonus_hold+0x1d zfs_zget+0xc5 zfs_get_data+0x40 zil_commit+0x6a1 zfs_sync+0xad sys_sync+0x146 amd64_syscall+0x4ce Xfast_syscall+0xfb

tools-2-b# ps ax | awk '$3~"D"' | tail
   22  -  DL       0:00.00 [vmdaemon]
   23  -  DL       0:00.01 [pagezero]
   24  -  DL       0:02.13 [bufspacedaemon]
   25  -  DL       0:02.06 [bufdaemon]
   26  -  DL       0:02.31 [vnlru]
   27  -  DL       1:26.84 [syncer]
  643  -  DL       1:03.87 [pf purge]
 3650  -  DsJ      0:00.05 /var/cfengine/bin/cf-agent -f failsafe.cf
11495  -  DJ       0:00.01 /var/cfengine/bin/cf-agent -f failsafe.cf
96409  -  DL       0:15.85 [ftcleanup]

tools-2-b# procstat -k 3650
  PID    TID COMM             TDNAME           KSTACK                       
 3650 102929 cf-agent         -                mi_switch sleepq_wait _cv_wait zil_commit zfs_freebsd_putpages VOP_PUTPAGES_APV vnode_pager_putpages vm_pageout_flush vm_object_page_collect_flush vm_object_page_clean vm_object_terminate zfs_freebsd_reclaim VOP_RECLAIM_APV vgonel vrecycle zfs_freebsd_inactive VOP_INACTIVE_APV vinactive

The shell loop is running in a screen and this indicates
it took around 5 minutes for a sync(8) to stay hanging around:

tools-2-b# ps ax | grep sync
   27  -  DL       1:27.96 [syncer]
 3667  1  R+    1017:57.50 sync

tools-2-b# ps olstart 3667
STARTED
Thu May 11 19:50:28 2017

This is the hanging cf-agent process:

tools-2-b# ps olstart 3650
STARTED
Thu May 11 19:50:28 2017

From this on, other calls to sync(2) are hanging:

tools-2-b# printf '#include<unistd.h>\n int main(){sync();}' | cc -x c -o sync -
tools-2-b# ./sync &
[1] 67191
tools-2-b# procstat -kk 67191
  PID    TID COMM             TDNAME           KSTACK
67191 103195 sync             -                mi_switch+0xd2 sleepq_wait+0x3a _cv_wait+0x194 zil_commit+0x85 zfs_sync+0xad sys_sync+0x146 amd64_syscall+0x4ce Xfast_syscall+0xfb
Comment 2 Tino Reinhardt 2017-05-12 18:59:39 UTC
Created attachment 182550 [details]
"procstat -kk -a" output

I'm now on
tools-2-b# uname -a
FreeBSD tools-2-b 11.0-RELEASE-p10 FreeBSD 11.0-RELEASE-p10 #0 r318236: Fri May 12 13:52:37 CEST 2017     root@tools-2-b:/usr/obj/usr/src/sys/DEBUG  amd64

This is GENERIC plus:

options         INVARIANTS
options         INVARIANT_SUPPORT
options         WITNESS
options         WITNESS_SKIPSPIN
options         DEBUG_LOCKS
options         DEBUG_VFS_LOCKS
options         DIAGNOSTIC

While running sync(2) in a loop, cf-agent hangs quickly:

tools-2-b# ktrace cf-agent
load: 20.62  cmd: cf-agent 5360 [zilog->zl_cv_batch[0]] 162.96r 0.01u 0.06s 0% 8740k

tools-2-b# kdump -f ktrace.out | tail
  5360 cf-agent CALL  close(0x5)
  5360 cf-agent RET   close 0
  5360 cf-agent CALL  munmap(0x8006b3000,0x10)
  5360 cf-agent RET   munmap 0
  5360 cf-agent CALL  munmap(0x8006b4000,0x10)
  5360 cf-agent RET   munmap 0
  5360 cf-agent CALL  fcntl(0x4,F_SETLK,0xffffd450)
  5360 cf-agent RET   fcntl 0
  5360 cf-agent CALL  unlink(0x7fffffffd080)
  5360 cf-agent NAMI  "/tmp/SEMDMDBrvXw+F@)aP1"

tools-2-b# procstat -kk 5360
  PID    TID COMM             TDNAME           KSTACK
 5360 102508 cf-agent         -                mi_switch+0x167 sleepq_switch+0xe7 sleepq_wait+0x43 _cv_wait+0x1e4 zil_commit+0x85 zfs_freebsd_putpages+0x7d6 VOP_PUTPAGES_APV+0x104 vnode_pager_putpages+0x89 vm_pageout_flush+0x12d vm_object_page_collect_flush+0x23a vm_object_page_clean+0x1be vm_object_terminate+0xa9 vnode_destroy_vobject+0x79 zfs_freebsd_reclaim+0x63 VOP_RECLAIM_APV+0x104 vgonel+0x2bd vrecycle+0x4a zfs_freebsd_inactive+0xd

tools-2-b# ps ax | grep sync
   27  -  DL     0:00.09 [syncer]
15560  0  S+     0:00.00 grep sync
 5177  1  RN     7:16.42 ./syncloop

tools-2-b# kill 5177
tools-2-b# kill -9 5177
tools-2-b# procstat -kk 5177
  PID    TID COMM             TDNAME           KSTACK                       
 5177 102515 syncloop         -                zfs_zget+0x20f zfs_get_data+0x7c zil_commit+0x926 zfs_sync+0xad sys_sync+0xe6 amd64_syscall+0x2f6 Xfast_syscall+0xfb 
tools-2-b# procstat -kk 5177
  PID    TID COMM             TDNAME           KSTACK                       
 5177 102515 syncloop         -                zfs_zget+0x1b4 zfs_get_data+0x7c zil_commit+0x926 zfs_sync+0xad sys_sync+0xe6 amd64_syscall+0x2f6 Xfast_syscall+0xfb 
tools-2-b# procstat -kk 5177
  PID    TID COMM             TDNAME           KSTACK                       
 5177 102515 syncloop         -                _sx_slock+0x73 dnode_verify+0xe2 dnode_hold_impl+0x53e dmu_bonus_hold+0x25 zfs_zget+0xcf zfs_get_data+0x7c zil_commit+0x926 zfs_sync+0xad sys_sync+0xe6 amd64_syscall+0x2f6 Xfast_syscall+0xfb 
tools-2-b# procstat -kk 5177
  PID    TID COMM             TDNAME           KSTACK                       
 5177 102515 syncloop         -                dnode_hold_impl+0x53e dmu_bonus_hold+0x25 zfs_zget+0xcf zfs_get_data+0x7c zil_commit+0x926 zfs_sync+0xad sys_sync+0xe6 amd64_syscall+0x2f6 Xfast_syscall+0xfb 

tools-2-b# procstat -kk -a > procstat-kk-a.txt

tools-2-b# kgdb -q /usr/lib/debug/boot/kernel/kernel.debug /dev/mem
Reading symbols from /usr/lib/debug/boot/kernel/zfs.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/zfs.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/opensolaris.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/opensolaris.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/carp.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/carp.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/if_lagg.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/if_lagg.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/ums.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/ums.ko.debug
Reading symbols from /usr/lib/debug/boot/kernel/pf.ko.debug...done.
Loaded symbols for /usr/lib/debug/boot/kernel/pf.ko.debug
#0  sched_switch (td=0xfffff801b1245500, newtd=0xfffff80115531000, 
    flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
1973                    cpuid = PCPU_GET(cpuid);
(kgdb) tid 102508
[Switching to thread 1672 (Thread 102508)]#0  sched_switch (
    td=0xfffff8055a385000, newtd=0xfffff80115533000, 
    flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
1973                    cpuid = PCPU_GET(cpuid);
Current language:  auto; currently minimal
#0  sched_switch (td=0xfffff8055a385000, newtd=0xfffff80115533000, 
    flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1973
#1  0xffffffff80a497d7 in mi_switch (flags=260, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:455
#2  0xffffffff80a8dac7 in sleepq_switch (wchan=<value optimized out>, pri=0)
    at /usr/src/sys/kern/subr_sleepqueue.c:566
#3  0xffffffff80a8d993 in sleepq_wait (wchan=0xfffff801b112f8f8, pri=0)
    at /usr/src/sys/kern/subr_sleepqueue.c:646
#4  0xffffffff809e11b4 in _cv_wait (cvp=<value optimized out>, 
    lock=<value optimized out>) at /usr/src/sys/kern/kern_condvar.c:144
#5  0xffffffff82238545 in zil_commit (zilog=<value optimized out>, 
    foid=<value optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:1574
#6  0xffffffff8226c746 in zfs_freebsd_putpages (ap=<value optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:4767
#7  0xffffffff81005e74 in VOP_PUTPAGES_APV (vop=<value optimized out>, 
    a=<value optimized out>) at vnode_if.c:2930
#8  0xffffffff80d3a349 in vnode_pager_putpages (object=<value optimized out>, 
    m=0xfffffe201e33a3f0, count=<value optimized out>, flags=0, 
    rtvals=0xfffffe201e33a360) at vnode_if.h:1224
#9  0xffffffff80d2f67d in vm_pageout_flush (mc=<value optimized out>, 
    count=<value optimized out>, flags=1, mreq=0, prunlen=0xfffffe201e33a4ac, 
    eio=0xfffffe201e33a530) at vm_pager.h:126
#10 0xffffffff80d26fca in vm_object_page_collect_flush (
    object=<value optimized out>, p=<value optimized out>, pagerflags=1, 
    flags=0, clearobjflags=<value optimized out>, eio=0xfffffe201e33a530)
    at /usr/src/sys/vm/vm_object.c:979
#11 0xffffffff80d26cbe in vm_object_page_clean (object=<value optimized out>, 
    start=<value optimized out>, end=<value optimized out>, 
    flags=<value optimized out>) at /usr/src/sys/vm/vm_object.c:905
#12 0xffffffff80d26829 in vm_object_terminate (object=<value optimized out>)
    at /usr/src/sys/vm/vm_object.c:739
#13 0xffffffff80d3a869 in vnode_destroy_vobject (vp=0xfffff804e7292000)
    at /usr/src/sys/vm/vnode_pager.c:173
#14 0xffffffff822761e3 in zfs_freebsd_reclaim (ap=<value optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:5367
#15 0xffffffff810047a4 in VOP_RECLAIM_APV (vop=<value optimized out>, 
    a=<value optimized out>) at vnode_if.c:2021
#16 0xffffffff80afdf0d in vgonel (vp=<value optimized out>) at vnode_if.h:830
#17 0xffffffff80afe45a in vrecycle (vp=0xfffff804e7292000)
    at /usr/src/sys/kern/vfs_subr.c:3087
#18 0xffffffff8227616d in zfs_freebsd_inactive (ap=<value optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:5349
#19 0xffffffff810045f4 in VOP_INACTIVE_APV (vop=<value optimized out>, 
    a=<value optimized out>) at vnode_if.c:1955
#20 0xffffffff80afca9b in vinactive (vp=0xfffff804e7292000, 
    td=0xfffff8055a385000) at vnode_if.h:807
#21 0xffffffff80afd2d8 in vputx (vp=<value optimized out>, 
    func=<value optimized out>) at /usr/src/sys/kern/vfs_subr.c:2682
#22 0xffffffff80b05ae2 in kern_unlinkat (td=<value optimized out>, fd=-100, 
    path=0x7fffffffd080 <Error reading address 0x7fffffffd080: Bad address>, 
    pathseg=UIO_USERSPACE, oldinum=<value optimized out>)
    at /usr/src/sys/kern/vfs_syscalls.c:1774
#23 0xffffffff80ea4426 in amd64_syscall (td=0xfffff8055a385000, 
    traced=<value optimized out>) at subr_syscall.c:135
#24 0xffffffff80e8497b in Xfast_syscall ()
    at /usr/src/sys/amd64/amd64/exception.S:396
#25 0x0000000802a72f7a in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) frame 4
#4  0xffffffff809e11b4 in _cv_wait (cvp=<value optimized out>, 
    lock=<value optimized out>) at /usr/src/sys/kern/kern_condvar.c:144
144             sleepq_wait(cvp, 0);
(kgdb) p *cvp
$1 = {cv_description = 0xffffffff81c81848 "I.=\201????", 
  cv_waiters = -411880152}
(kgdb) frame 5
#5  0xffffffff82238545 in zil_commit (zilog=<value optimized out>, 
    foid=<value optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:1574
1574                    cv_wait(&zilog->zl_cv_batch[mybatch & 1], &zilog->zl_lock);
(kgdb) p *zilog
$2 = {zl_lock = {lock_object = {lo_name = 0xffffffff81c81848 "I.=\201????", 
      lo_flags = 3883087144, lo_data = 4294965252, lo_witness = 0x0}, 
    sx_lock = 18446735298679485752}, zl_dmu_pool = 0x0, 
  zl_spa = 0xffffffff81bb7cc8, zl_header = 0x0, zl_os = 0xfffff804e7665e80, 
  zl_get_data = 0, zl_root_zio = 0x0, zl_lr_seq = 0, 
  zl_commit_lr_seq = 18446741874700784480, 
  zl_destroy_txg = 18446735282269118464, zl_replayed_seq = 0xfffff8055a385068, 
  zl_replaying_seq = 18446735284892366080, zl_suspend = 0, zl_cv_writer = {
    cv_description = 0x1906cffffffff <Error reading address 0x1906cffffffff: Bad address>, cv_waiters = 0}, zl_cv_suspend = {cv_description = 0x0, 
    cv_waiters = 0}, zl_suspending = 0 '\0', zl_keep_first = 0 '\0', 
  zl_replay = 0 '\0', zl_stop_sync = 0 '\0', zl_writer = 0 '\0', 
  zl_logbias = 0 '\0', zl_sync = 0 '\0', zl_parse_error = 0, 
  zl_parse_blk_seq = 18446735300605006016, 
  zl_parse_lr_seq = 18446735298679485736, zl_parse_blk_count = 1, 
  zl_parse_lr_count = 17179869439, zl_next_batch = 2, zl_com_batch = 0, 
  zl_cv_batch = 0xfffff8055a3850f8, zl_itxg = 0xfffff8055a385118, 
  zl_itx_commit_list = {list_size = 206226633, list_offset = 2, list_head = {
      list_next = 0x8, list_prev = 0x0}}, zl_itx_list_sz = 15863, 
  zl_cur_used = 63452, zl_lwb_list = {list_size = 79316, list_offset = 0, 
    list_head = {list_next = 0xc4ac4c9, list_prev = 0x0}}, zl_vdev_lock = {
    lock_object = {lo_name = 0x0, lo_flags = 0, lo_data = 0, 
      lo_witness = 0x0}, sx_lock = 270582939648}, zl_vdev_tree = {
    avl_root = 0x0, avl_compar = 0, avl_offset = 4, avl_numnodes = 0, 
    avl_size = 0}, zl_clean_taskq = 0x612d666300000000, zl_bp_tree = {
    avl_root = 0x746e6567, avl_compar = 0, avl_offset = 0, avl_numnodes = 0, 
    avl_size = 0}, zl_replay_time = 0, zl_replay_blks = 0, zl_old_header = {
    zh_claim_txg = 0, zh_replay_seq = 0, zh_log = {
      blk_dva = 0xfffff8055a3852e0, blk_prop = 0, 
      blk_pad = 0xfffff8055a385318, blk_phys_birth = 0, blk_birth = 2, 
      blk_fill = 18446735284884707568, blk_cksum = {
        zc_word = 0xfffff8055a385340}}, zh_claim_blk_seq = 0, zh_flags = 0, 
    zh_claim_lr_seq = 0, zh_pad = 0xfffff8055a385378}, 
  zl_prev_blks = 0xfffff8055a385390, zl_prev_rotor = 0, zl_dirty_link = {
    tn_next = 0xfffff8055a3853d8, 
    tn_member = 0xfffff8055a3853f8 "@֨\200????"}}
(kgdb)
Comment 3 Tino Reinhardt 2017-06-20 17:27:38 UTC
I'm pretty sure I haven't seen this issue on 10.3-RELEASE and now I fail to reproduce it on 11.1-PRERELEASE, etc., all on the same hardware as before.

Working versions:

n-3-b% uname -a
FreeBSD n-3-b 11.1-PRERELEASE FreeBSD 11.1-PRERELEASE #0 r319653: Wed Jun  7 15:17:47 CEST 2017     root@n-3-b:/usr/obj/usr/src/sys/DEBUG  amd64

tools-1-b% uname -a
FreeBSD tools-1-b 11.1-BETA1 FreeBSD 11.1-BETA1 #0 r319852: Mon Jun 12 15:12:38 CEST 2017     root@tools-1-b:/usr/obj/usr/src/sys/DEBUG  amd64

nagios-2-b% uname -a
FreeBSD nagios-2-b 11.1-BETA2 FreeBSD 11.1-BETA2 #0 r320148: Tue Jun 20 18:07:08 CEST 2017     root@tools-2-b:/usr/obj/usr/src/sys/GENERIC  amd64

DEBUG kernel config is GENERIC plus
options         INVARIANTS
options         INVARIANT_SUPPORT
options         WITNESS
options         WITNESS_SKIPSPIN
options         DEBUG_LOCKS
options         DEBUG_VFS_LOCKS

I've had to remove DIAGNOSTIC for performance reasons.

With 11.1-RELEASE upcoming I consider this problem solved, although I fail to spot the changes in svn log so far.

Thank you.