The freebsd.org cluster package builder has been deadlocking regularly with things like this:

root@bob.nyi:/home/peter # ps -axHl -O lwp | grep node
0 92502 92499 0 52 0 771080 197588 kqread I+J 1 0:07.57 npm (node) 92502 101429 1 I+J 0:07.57 npm (node)
0 92502 92499 0 22 0 771080 197588 usem   I+J 1 0:00.11 npm (node) 92502 101858 1 I+J 0:00.11 npm (node)
0 92502 92499 0 22 0 771080 197588 usem   I+J 1 0:00.11 npm (node) 92502 101859 1 I+J 0:00.11 npm (node)
0 92502 92499 0 22 0 771080 197588 usem   I+J 1 0:00.11 npm (node) 92502 101860 1 I+J 0:00.11 npm (node)
0 92502 92499 0 22 0 771080 197588 usem   I+J 1 0:00.10 npm (node) 92502 101861 1 I+J 0:00.10 npm (node)
0 92502 92499 0 28 0 771080 197588 zfs    D+J 1 0:00.89 npm (node) 92502 101862 1 D+J 0:00.89 npm (node)
0 92502 92499 0 28 0 771080 197588 zfs    D+J 1 0:00.90 npm (node) 92502 101863 1 D+J 0:00.90 npm (node)
0 92502 92499 0 28 0 771080 197588 zfs    D+J 1 0:00.89 npm (node) 92502 101864 1 D+J 0:00.89 npm (node)
0 92502 92499 0 28 0 771080 197588 zfs    D+J 1 0:00.91 npm (node) 92502 101865 1 D+J 0:00.91 npm (node)
0 84004 81177 0 20 0  14864   2532 piperd S+  5 0:00.00 grep node  84004 101732 5 S+  0:00.00 grep node

root@bob.nyi:/home/peter # kgdb
(kgdb) tid 101862
[Switching to thread 580 (Thread 101862)]#0  0xffffffff807e9f83 in sched_switch ()
(kgdb) where
#0  0xffffffff807e9f83 in sched_switch ()
#1  0xffffffff807c3f88 in mi_switch ()
#2  0xffffffff8080be0a in sleepq_wait ()
#3  0xffffffff807941d5 in sleeplk ()
#4  0xffffffff8079350f in __lockmgr_args ()
#5  0xffffffff808710bc in vop_stdlock ()
#6  0xffffffff80b9b64f in VOP_LOCK1_APV ()
#7  0xffffffff808953c3 in _vn_lock ()
#8  0xffffffff803cb542 in zfs_lookup ()
#9  0xffffffff803cb86d in zfs_freebsd_lookup ()
#10 0xffffffff80b99810 in VOP_CACHEDLOOKUP_APV ()
#11 0xffffffff8086dc56 in vfs_cache_lookup ()
#12 0xffffffff80b99710 in VOP_LOOKUP_APV ()
#13 0xffffffff80877131 in lookup ()
#14 0xffffffff80876811 in namei ()
#15 0xffffffff80891da1 in kern_renameat ()
#16 0xffffffff80af24a8 in amd64_syscall ()
#17 0xffffffff80ad69bb in Xfast_syscall ()
(kgdb) tid 101863
[Switching to thread 579 (Thread 101863)]#0  0xffffffff807e9f83 in sched_switch ()
(kgdb) where
#0  0xffffffff807e9f83 in sched_switch ()
#1  0xffffffff807c3f88 in mi_switch ()
#2  0xffffffff8080be0a in sleepq_wait ()
#3  0xffffffff807941d5 in sleeplk ()
#4  0xffffffff8079350f in __lockmgr_args ()
#5  0xffffffff808710bc in vop_stdlock ()
#6  0xffffffff80b9b64f in VOP_LOCK1_APV ()
#7  0xffffffff808953c3 in _vn_lock ()
#8  0xffffffff808839b7 in vget ()
#9  0xffffffff8086c3d8 in cache_lookup ()
#10 0xffffffff8086dc2c in vfs_cache_lookup ()
#11 0xffffffff80b99710 in VOP_LOOKUP_APV ()
#12 0xffffffff80877131 in lookup ()
#13 0xffffffff80876811 in namei ()
#14 0xffffffff80891da1 in kern_renameat ()
#15 0xffffffff80af24a8 in amd64_syscall ()
#16 0xffffffff80ad69bb in Xfast_syscall ()
(kgdb) tid 101864
[Switching to thread 578 (Thread 101864)]#0  0xffffffff807e9f83 in sched_switch ()
(kgdb) where
#0  0xffffffff807e9f83 in sched_switch ()
#1  0xffffffff807c3f88 in mi_switch ()
#2  0xffffffff8080be0a in sleepq_wait ()
#3  0xffffffff807941d5 in sleeplk ()
#4  0xffffffff80793d35 in __lockmgr_args ()
#5  0xffffffff808710bc in vop_stdlock ()
#6  0xffffffff80b9b64f in VOP_LOCK1_APV ()
#7  0xffffffff808953c3 in _vn_lock ()
#8  0xffffffff808840d8 in vputx ()
#9  0xffffffff803d2303 in zfs_freebsd_rename ()
#10 0xffffffff80b9adbf in VOP_RENAME_APV ()
#11 0xffffffff80892198 in kern_renameat ()
#12 0xffffffff80af24a8 in amd64_syscall ()
#13 0xffffffff80ad69bb in Xfast_syscall ()
(kgdb) tid 101865
[Switching to thread 577 (Thread 101865)]#0  0xffffffff807e9f83 in sched_switch ()
(kgdb) where
#0  0xffffffff807e9f83 in sched_switch ()
#1  0xffffffff807c3f88 in mi_switch ()
#2  0xffffffff8080be0a in sleepq_wait ()
#3  0xffffffff807941d5 in sleeplk ()
#4  0xffffffff8079350f in __lockmgr_args ()
#5  0xffffffff808710bc in vop_stdlock ()
#6  0xffffffff80b9b64f in VOP_LOCK1_APV ()
#7  0xffffffff808953c3 in _vn_lock ()
#8  0xffffffff808839b7 in vget ()
#9  0xffffffff8086c3d8 in cache_lookup ()
#10 0xffffffff8086dc2c in vfs_cache_lookup ()
#11 0xffffffff80b99710 in VOP_LOOKUP_APV ()
#12 0xffffffff80877131 in lookup ()
#13 0xffffffff80876811 in namei ()
#14 0xffffffff80891da1 in kern_renameat ()
#15 0xffffffff80af24a8 in amd64_syscall ()
#16 0xffffffff80ad69bb in Xfast_syscall ()
(kgdb)

Build info is: FreeBSD bob.nyi.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r296980: Thu Mar 17 07:14:40 UTC 2016

This is a relatively new development. These packages used to build robustly for a long time, but we have been getting lockups reliably for a few weeks.
Same thing is happening for me:

% procstat -kk 2291
  PID    TID COMM             TDNAME           KSTACK
 2291 101307 node             -                mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d kern_kevent+0x401 sys_kevent+0x12a amd64_syscall+0x351 Xfast_syscall+0xfb
 2291 101353 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351 Xfast_syscall+0xfb
 2291 101354 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351 Xfast_syscall+0xfb
 2291 101355 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351 Xfast_syscall+0xfb
 2291 101356 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351 Xfast_syscall+0xfb
 2291 101357 node             -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0x902 vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 zfs_lookup+0x3c2 zfs_freebsd_lookup+0x6d VOP_CACHEDLOOKUP_APV+0xa1 vfs_cache_lookup+0xd6 VOP_LOOKUP_APV+0xa1 lookup+0x59c namei+0x4d4 kern_renameat+0x213 amd64_syscall+0x351 Xfast_syscall+0xfb
 2291 101358 node             -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0x902 vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 zfs_lookup+0x3c2 zfs_freebsd_lookup+0x6d VOP_CACHEDLOOKUP_APV+0xa1 vfs_cache_lookup+0xd6 VOP_LOOKUP_APV+0xa1 lookup+0x59c namei+0x4d4 kern_renameat+0x213 amd64_syscall+0x351 Xfast_syscall+0xfb
 2291 101359 node             -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0xc9e vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 vputx+0x21f zfs_rename_unlock+0x3e zfs_freebsd_rename+0xcdf VOP_RENAME_APV+0xab kern_renameat+0x3ef amd64_syscall+0x351 Xfast_syscall+0xfb
 2291 101360 node             -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0x902 vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 zfs_lookup+0x3c2 zfs_freebsd_lookup+0x6d VOP_CACHEDLOOKUP_APV+0xa1 vfs_cache_lookup+0xd6 VOP_LOOKUP_APV+0xa1 lookup+0x59c namei+0x4d4 kern_renameat+0x213 amd64_syscall+0x351 Xfast_syscall+0xfb

Full thread dump at https://gist.github.com/anonymous/3be664a893c8face928b678956b86302
If it is possible, it might be useful to test with sysctl debug.vfscache=0. Why? There was a similar recent occurrence on freefall, but there the vnode lock was on NFS. The similarity is that several of the threads waiting for the vnode lock were in cache_lookup()->vget()->_vn_lock(), and r285632 changed cache_lookup() from using VI_LOCK() to vhold() before the vget() call. I am wondering if this change somehow broke the code. In any case, disabling name caching would avoid the code in cache_lookup(). Disabling name caching will have a performance hit, but it would be nice to see if this avoids the deadlock.
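For reference, the test itself is only this (run as root; the setting is live and the default is 1, so it can simply be flipped back afterwards):

  sysctl debug.vfscache=0    # disable the VFS name cache
  # ... run the npm workload that normally deadlocks ...
  sysctl debug.vfscache=1    # restore the default

That is just a sketch of the test procedure, not a fix, and the performance hit mentioned above applies while the cache is off.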
(In reply to Rick Macklem from comment #2)
OK, did that on the same machine as before. A new npm install gives me something slightly different:

  PID    TID COMM             TDNAME           KSTACK
21557 100965 node             -                mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d kern_kevent+0x401 sys_kevent+0x12a amd64_syscall+0x351 Xfast_syscall+0xfb
21557 101212 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351 Xfast_syscall+0xfb
21557 101213 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351 Xfast_syscall+0xfb
21557 101214 node             -                mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x329 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x351 Xfast_syscall+0xfb
21557 101215 node             -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0x902 vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 zfs_lookup+0x3c2 zfs_freebsd_lookup+0x6d VOP_CACHEDLOOKUP_APV+0xa1 vfs_cache_lookup+0xd6 VOP_LOOKUP_APV+0xa1 lookup+0x59c namei+0x4d4 vn_open_cred+0x21d kern_openat+0x26f amd64_syscall+0x351 Xfast_syscall+0xfb
21557 101646 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351 Xfast_syscall+0xfb
21557 101651 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4aa __umtx_op_sem_wait+0x73 amd64_syscall+0x351 Xfast_syscall+0xfb
21557 101654 node             -                mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x329 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x351 Xfast_syscall+0xfb
21557 101655 node             -                mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_wait+0x329 __umtx_op_wait_uint_private+0x83 amd64_syscall+0x351 Xfast_syscall+0xfb

The process still hangs, of course.
Oh well. Thanks for trying it. I am still not 100% convinced that recent changes haven't broken cache_lookup(), but your problem isn't that.
(In reply to Rick Macklem from comment #4) Rick, there is certainly a deadlock problem with rename on ZFS (on FreeBSD). It's been there since day one.
I knew there was a ZFS rename issue, but I thought it occurred rarely and typically (always??) when snapshots were being renamed. This problem seems to have become frequent. Of course, just about any change could alter the timing of execution and "tickle" the deadlock bug. There was a case on freefall that looked similar (a deadlock involving vnode locks, mostly through cache_lookup()) and I had "hoped" this was the same problem (since the freefall case occurred only once in several months). It doesn't appear to be the same problem.
(In reply to Rick Macklem from comment #6)
> I knew there was a ZFS rename issue, but I thought it
> occurred rarely and typically (always??) when snapshots
> were being renamed.

It is quite rare indeed. Just in case, I meant rename(2), so I am not sure how snapshot renaming, which is done through a special ioctl, can get into the picture.
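For illustration only (a sketch of the access pattern involved, not a known reproducer; the paths are made up and assume a ZFS dataset): the deadlock needs concurrent rename(2) calls racing against lookups over the same directories, which is roughly the kind of load npm generates, e.g.

  mkdir -p /tank/test/a /tank/test/b && touch /tank/test/a/f
  # one loop bouncing the file between the two directories with rename(2)...
  ( while :; do mv /tank/test/a/f /tank/test/b/f 2>/dev/null; mv /tank/test/b/f /tank/test/a/f 2>/dev/null; done ) &
  # ...while another loop keeps doing lookups in the same directories
  ( while :; do ls -l /tank/test/a /tank/test/b >/dev/null 2>&1; done ) &

run long enough on an affected kernel.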
I"m finding this highly reproducible at the moment. Anything I might do to aid in diagnosis?
(In reply to Doug Luce from comment #8)
I believe that I know the root cause (if it's the same problem that I diagnosed earlier). But my fix is very non-trivial, because it has a domino-like effect on the related ZFS code. So, I am not sure how to proceed from here. I have a dusty and somewhat logically incomplete but working patch. At the very least I can rebase it to the latest head and attach it to this bug. Maybe tomorrow.
(In reply to Andriy Gapon from comment #9) You write that you think that a zfs rename (file, NOT snapshot) deadlock has been around for a long time. Do you think this could be the same issue (on 9.3)? https://lists.freebsd.org/pipermail/freebsd-fs/2015-December/022370.html The trigger there was a massively parallel java program creating (and renaming) a lot of output files. I would be interested in testing patches, but that server is unfortunately still running 9.3. Later in the summer perhaps.
(In reply to Bengt Ahlgren from comment #10) Yes, that looks like the same thing.
I've just created a Phabricator review request with my patch: https://reviews.freebsd.org/D6533
I haven't added any reviewers or subscribers (other than the umbrella ZFS subscriber), because I view this review request more as an invitation to a discussion or a request for comments. But please feel free to add yourself in either capacity. If you would just like to test the change, there is a 'Download Raw Diff' link. The patch is for head as of r300617.
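For anyone who just wants to try it, the rough procedure would be something like the following (a sketch only; the saved file name and the patch level are assumptions, adjust them to whatever the raw diff downloads as):

  cd /usr/src
  svn update -r300617
  patch -p0 < /tmp/D6533.diff    # the file saved via 'Download Raw Diff'
  make -j4 buildkernel && make installkernel
  shutdown -r now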
Running the patch gives me this on bootup:

kernel: lock order reversal:
kernel:  1st 0xfffffe00f766fba0 bufwait (bufwait) @ /usr/src/sys/kern/vfs_bio.c:3512
kernel:  2nd 0xfffff800141e0200 dirhash (dirhash) @ /usr/src/sys/ufs/ufs/ufs_dirhash.c:281
kernel: stack backtrace:
kernel: #0 0xffffffff80a76bc3 at witness_debugger+0x73
kernel: #1 0xffffffff80a76ac3 at witness_checkorder+0xea3
kernel: #2 0xffffffff80a21cc5 at _sx_xlock+0x75
kernel: #3 0xffffffff80cd5ded at ufsdirhash_move+0x3d
kernel: #4 0xffffffff80cd831d at ufs_direnter+0x35d
kernel: #5 0xffffffff80ce08a1 at ufs_mkdir+0x8b1
kernel: #6 0xffffffff80fd2e67 at VOP_MKDIR_APV+0xf7
kernel: #7 0xffffffff80ae3b09 at kern_mkdirat+0x209
kernel: #8 0xffffffff80e7d2ae at amd64_syscall+0x2de
kernel: #9 0xffffffff80e5da1b at Xfast_syscall+0xfb

A bit later on:

kernel: acquiring duplicate lock of same type: "kqueue"
kernel:  1st kqueue @ /usr/src/sys/kern/kern_event.c:2026
kernel:  2nd kqueue @ /usr/src/sys/kern/subr_taskqueue.c:282
kernel: stack backtrace:
kernel: #0 0xffffffff80a76bc3 at witness_debugger+0x73
kernel: #1 0xffffffff80a76ac3 at witness_checkorder+0xea3
kernel: #2 0xffffffff809fb2b8 at __mtx_lock_flags+0xa8
kernel: #3 0xffffffff80a6945f at taskqueue_enqueue+0x3f
kernel: #4 0xffffffff809d04f9 at knote_enqueue+0x119
kernel: #5 0xffffffff809d28c1 at knote+0x1c1
kernel: #6 0xffffffff80ad9843 at vop_rename_post+0x143
kernel: #7 0xffffffff80fd2d3b at VOP_RENAME_APV+0x13b
kernel: #8 0xffffffff80ae3896 at kern_renameat+0x4a6
kernel: #9 0xffffffff80e7d2ae at amd64_syscall+0x2de
kernel: #10 0xffffffff80e5da1b at Xfast_syscall+0xfb

and the machine locks up.
(In reply to Doug Luce from comment #13)
The first LOR is known, a false positive, and not related. The second LOR is more interesting, but it should be innocent. The cause is witness matching locks by name, so it is just formal and is yet another consequence of the taskqueue changes. Still, the attached patch should fix it. What program are you running that needs vnode notifications?
Created attachment 170634 [details] Silence false LOR
Maybe btsync is leading to the second LOR? With the patch, I'm getting a locked machine after a couple minutes. Unfortunately it's an EC2 instance so no console access. Not sure what the best way to proceed would be...
A little more luck, got the panic message at least:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address  = 0x118
fault code             = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff8229551d
stack pointer          = 0x28:0xfffffe01232905e0
frame pointer          = 0x28:0xfffffe0123290600
code segment           = base rx0, limit 0xfffff, type 0x1b
                       = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags       = interrupt enabled, resume, IOPL = 0
current process        = 3233 (emacs-25.0.93)
trap number            = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01232900a0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0123290150
vpanic() at vpanic+0x189/frame 0xfffffe01232901d0
panic() at panic+0x43/frame 0xfffffe0123290230
trap_fatal() at trap_fatal+0x36b/frame 0xfffffe0123290290
trap_pfault() at trap_pfault+0x200/frame 0xfffffe0123290310
trap() at trap+0x4cc/frame 0xfffffe0123290520
calltrap() at calltrap+0x8/frame 0xfffffe0123290520
--- trap 0xc, rip = 0xffffffff8229551d, rsp = 0xfffffe01232905f0, rbp = 0xfffffe0123290600 ---
rrm_exit() at rrm_exit+0xd/frame 0xfffffe0123290600
zfs_freebsd_rename() at zfs_freebsd_rename+0x12e/frame 0xfffffe0123290710
VOP_RENAME_APV() at VOP_RENAME_APV+0xfc/frame 0xfffffe0123290740
kern_renameat() at kern_renameat+0x4a6/frame 0xfffffe01232909a0
amd64_syscall() at amd64_syscall+0x2de/frame 0xfffffe0123290ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0123290ab0
--- syscall (128, FreeBSD ELF64, sys_rename), rip = 0x807c5fbaa, rsp = 0x7fffffffc418, rbp = 0x7fffffffc450 ---
Uptime: 28m4s
Automatic reboot in 15 seconds - press a key on the console to abort
--> Press a key on the console to reboot,
--> or switch off the system now.
(In reply to Doug Luce from comment #17)
As Konstantin said earlier, those reported LORs are unlikely to have anything to do with the lock-up. The panic looks quite unexpected too... Could you please disassemble zfs_freebsd_rename in kgdb (kgdb /path/to/zfs.ko) and attach the output? Thank you for testing!
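Something along these lines should produce it (a sketch, assuming the module is at the usual default location):

  kgdb /boot/kernel/zfs.ko
  (kgdb) disassemble zfs_freebsd_rename
  (kgdb) quit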
Will this do? https://gist.github.com/anonymous/c18e9f08e7a36aef5c4ed2abb047e8f6
(In reply to Doug Luce from comment #19)
Yes, thank you. Now I see one problem in the code; it was present even without my patch, but something in the patch now triggers it. So, I have a fix for that problem, but more investigation might be required.
Created attachment 170652 [details] delta patch
With that patch, the npm install that used to hang no longer does. This panic didn't happen until the machine had been up a few hours:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address  = 0xc0
fault code             = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff82310db4
stack pointer          = 0x28:0xfffffe01236e0600
frame pointer          = 0x28:0xfffffe01236e0710
code segment           = base rx0, limit 0xfffff, type 0x1b
                       = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags       = interrupt enabled, resume, IOPL = 0
current process        = 2533 (emacs-25.0.93)
trap number            = 12
panic: page fault
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01236e00c0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe01236e0170
vpanic() at vpanic+0x189/frame 0xfffffe01236e01f0
panic() at panic+0x43/frame 0xfffffe01236e0250
trap_fatal() at trap_fatal+0x36b/frame 0xfffffe01236e02b0
trap_pfault() at trap_pfault+0x200/frame 0xfffffe01236e0330
trap() at trap+0x4cc/frame 0xfffffe01236e0540
calltrap() at calltrap+0x8/frame 0xfffffe01236e0540
--- trap 0xc, rip = 0xffffffff82310db4, rsp = 0xfffffe01236e0610, rbp = 0xfffffe01236e0710 ---
zfs_freebsd_rename() at zfs_freebsd_rename+0xe4/frame 0xfffffe01236e0710
VOP_RENAME_APV() at VOP_RENAME_APV+0xfc/frame 0xfffffe01236e0740
kern_renameat() at kern_renameat+0x4a6/frame 0xfffffe01236e09a0
amd64_syscall() at amd64_syscall+0x2de/frame 0xfffffe01236e0ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe01236e0ab0
--- syscall (128, FreeBSD ELF64, sys_rename), rip = 0x807c5fbaa, rsp = 0x7fffffffc418, rbp = 0x7fffffffc450 ---
Uptime: 2h36m30s
Dumping 1470 out of 4064 MB: (CTRL-C to abort) ..2%..11%..21%..31%..41%..51%..61%..71%..81%..91%
Dump complete
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
cpu_reset: Restarting BSP
cpu_reset_proxy: Stopped CPU 1
A second panic happened after only a few minutes of uptime after rebooting. Looks the same:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address  = 0xc0
fault code             = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff82310db4
stack pointer          = 0x28:0xfffffe0123585600
frame pointer          = 0x28:0xfffffe0123585710
code segment           = base rx0, limit 0xfffff, type 0x1b
                       = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags       = interrupt enabled, resume, IOPL = 0
current process        = 2734 (emacs-25.0.93)
trap number            = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01235850c0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0123585170
vpanic() at vpanic+0x189/frame 0xfffffe01235851f0
panic() at panic+0x43/frame 0xfffffe0123585250
trap_fatal() at trap_fatal+0x36b/frame 0xfffffe01235852b0
trap_pfault() at trap_pfault+0x200/frame 0xfffffe0123585330
trap() at trap+0x4cc/frame 0xfffffe0123585540
calltrap() at calltrap+0x8/frame 0xfffffe0123585540
--- trap 0xc, rip = 0xffffffff82310db4, rsp = 0xfffffe0123585610, rbp = 0xfffffe0123585710 ---
zfs_freebsd_rename() at zfs_freebsd_rename+0xe4/frame 0xfffffe0123585710
VOP_RENAME_APV() at VOP_RENAME_APV+0xfc/frame 0xfffffe0123585740
kern_renameat() at kern_renameat+0x4a6/frame 0xfffffe01235859a0
amd64_syscall() at amd64_syscall+0x2de/frame 0xfffffe0123585ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0123585ab0
--- syscall (128, FreeBSD ELF64, sys_rename), rip = 0x807c5fbaa, rsp = 0x7fffffffcf88, rbp = 0x7fffffffcfc0 ---
Uptime: 12m25s
Dumping 634 out of 4064 MB: (CTRL-C to abort) ..3%..11%..21%..31%..41%..51%..61%..71%..81%..91%
Dump complete
(In reply to Doug Luce from comment #22)
I am afraid that I really need to be able to poke inside the crash dump to understand what's happening here... Or, before that, could you please compile your kernel with the DEBUG_VFS_LOCKS option if it's not enabled yet?
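For reference, roughly what that involves (the config name MYKERNEL is just a placeholder):

  # /usr/src/sys/amd64/conf/MYKERNEL
  include GENERIC
  ident   MYKERNEL
  options DEBUG_VFS_LOCKS

  # then rebuild and install:
  cd /usr/src
  make -j4 buildkernel KERNCONF=MYKERNEL && make installkernel KERNCONF=MYKERNEL
  shutdown -r now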
What's your SSH public key? I'll set you up with access.

Recompiled with the flag and got a panic and a dump:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address  = 0xc0
fault code             = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff82311194
stack pointer          = 0x28:0xfffffe0123717600
frame pointer          = 0x28:0xfffffe0123717710
code segment           = base rx0, limit 0xfffff, type 0x1b
                       = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags       = interrupt enabled, resume, IOPL = 0
current process        = 2621 (emacs-25.0.93)
trap number            = 12
panic: page fault
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01237170c0
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0123717170
vpanic() at vpanic+0x189/frame 0xfffffe01237171f0
panic() at panic+0x43/frame 0xfffffe0123717250
trap_fatal() at trap_fatal+0x36b/frame 0xfffffe01237172b0
trap_pfault() at trap_pfault+0x200/frame 0xfffffe0123717330
trap() at trap+0x4cc/frame 0xfffffe0123717540
calltrap() at calltrap+0x8/frame 0xfffffe0123717540
--- trap 0xc, rip = 0xffffffff82311194, rsp = 0xfffffe0123717610, rbp = 0xfffffe0123717710 ---
zfs_freebsd_rename() at zfs_freebsd_rename+0xe4/frame 0xfffffe0123717710
VOP_RENAME_APV() at VOP_RENAME_APV+0xfc/frame 0xfffffe0123717740
kern_renameat() at kern_renameat+0x4a6/frame 0xfffffe01237179a0
amd64_syscall() at amd64_syscall+0x2de/frame 0xfffffe0123717ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0123717ab0
--- syscall (128, FreeBSD ELF64, sys_rename), rip = 0x807c5fbaa, rsp = 0x7fffffffcdd8, rbp = 0x7fffffffce10 ---
Uptime: 14m43s
Dumping 565 out of 4064 MB: (CTRL-C to abort) ..3%..12%..23%..32%..43%..51%..63%..71%..83%..91%
Dump complete
I've refreshed the phabricator request with a new version: https://reviews.freebsd.org/D6533 The new version contains quite a few fixes on top of the previous one.
Built and installed the patched kernel this morning. I was able to do the npm install that was failing before, and copies/moves between ZFS and UFS seem to be going just fine. While I don't feel qualified to provide a review on that diff, please let me know if it would be useful. I'll be happy to get myself versed on this stuff.
(In reply to Doug Luce from comment #27) Thank you very much again for testing and reporting! Your testing is a valuable data point for further work on this patch. Probably I'll now invite more ZFS / FS people to discuss and review the patch and we'll see how it goes.
I was wondering what the current state is. Which patch did Doug test? We've had some more explosions with poudriere even without npm and its threaded operations.
The last thing I tried was https://reviews.freebsd.org/D6533 on top of that day's head. Looks like there are some reviewers tagged on there but nobody seems to have gotten around to it yet.
(In reply to Doug Luce from comment #30)
I'm currently testing this patch, and so far it fixes the issue and doesn't cause any adverse effects, but it's only been running for a few hours. More testing would be good, but this looks promising to me.
Commits 301996 and 301997, made on 2016-06-17 to relieve pressure by ZFS on vfs.numvnodes, should help moderate this problem. This is not to say that they will fully resolve it, so the ongoing work should continue.
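For anyone who wants to watch the vnode pressure while testing, these sysctls can be sampled periodically (illustrative only; what counts as "too high" depends on the machine):

  sysctl kern.maxvnodes vfs.numvnodes vfs.freevnodes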
Hi all,

We are seeing the same problem using NPM 6.2.2 within jails with root-on-ZFS. This has been a problem for some time, but it seems to pop up randomly. We have run for months without this error happening; today, however, it happens repeatedly and is relatively reproducible. We have reverted back to Node v4 and NPM v2 and this bug does not seem to happen (at least thus far). Procstat trace below.

# procstat -kk 12590
  PID    TID COMM             TDNAME           KSTACK
12590 101356 node             -                mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d kern_kevent_fp+0x399 kern_kevent+0x9f sys_kevent+0x12a amd64_syscall+0x40f Xfast_syscall+0xfb
12590 101358 node             -                mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f Xfast_syscall+0xfb
12590 101359 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f Xfast_syscall+0xfb
12590 101360 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f Xfast_syscall+0xfb
12590 101361 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f Xfast_syscall+0xfb
12590 101362 node             V8 WorkerThread  mi_switch+0xe1 sleepq_catch_signals+0xab sleepq_wait_sig+0xf _sleep+0x27d umtxq_sleep+0x125 do_sem_wait+0x4db __umtx_op_sem_wait+0x73 amd64_syscall+0x40f Xfast_syscall+0xfb
12590 101363 node             -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0x91a vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 vget+0x73 cache_lookup+0x5d5 vfs_cache_lookup+0xac VOP_LOOKUP_APV+0xa1 lookup+0x5a1 namei+0x4d4 kern_renameat+0x1b7 amd64_syscall+0x40f Xfast_syscall+0xfb
12590 101364 node             -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0xca0 vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 vputx+0x21f
Same with node 6.3.0:

procstat -kk 95115
  PID    TID COMM             TDNAME           KSTACK
95115 101118 node             -                mi_switch+0xe1 thread_suspend_switch+0x170 thread_single+0x4e5 exit1+0xce sigexit+0x925 postsig+0x286 ast+0x427 doreti_ast+0x1f
95115 101294 node             V8 WorkerThread  mi_switch+0xe1 thread_suspend_switch+0x170 cursig+0x613 ast+0x42f doreti_ast+0x1f
95115 101301 node             V8 WorkerThread  mi_switch+0xe1 thread_suspend_switch+0x170 cursig+0x613 ast+0x42f doreti_ast+0x1f
95115 101303 node             V8 WorkerThread  mi_switch+0xe1 thread_suspend_switch+0x170 cursig+0x613 ast+0x42f doreti_ast+0x1f
95115 101305 node             -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0x91a vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 vget+0x73 cache_lookup+0x5d5 vfs_cache_lookup+0xac VOP_LOOKUP_APV+0xa1 lookup+0x5a1 namei+0x4d4 kern_renameat+0x1b7 amd64_syscall+0x40f Xfast_syscall+0xfb
95115 101312 node             -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0x91a vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 vget+0x73 cache_lookup+0x5d5 vfs_cache_lookup+0xac VOP_LOOKUP_APV+0xa1 lookup+0x5a1 namei+0x4d4 kern_renameat+0x1b7 amd64_syscall+0x40f Xfast_syscall+0xfb
95115 101313 node             -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0xca0 vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 vputx+0x21f zfs_rename_unlock+0x3e zfs_freebsd_rename+0xe39 VOP_RENAME_APV+0xab kern_renameat+0x4a6 amd64_syscall+0x40f Xfast_syscall+0xfb
95115 101314 node             -                mi_switch+0xe1 sleepq_wait+0x3a sleeplk+0x15d __lockmgr_args+0x91a vop_stdlock+0x3c VOP_LOCK1_APV+0xab _vn_lock+0x43 zfs_lookup+0x45d zfs_freebsd_lookup+0x6d VOP_CACHEDLOOKUP_APV+0xa1 vfs_cache_lookup+0xd6 VOP_LOOKUP_APV+0xa1 lookup+0x5a1 namei+0x4d4 kern_renameat+0x1b7 amd64_syscall+0x40f Xfast_syscall+0xfb

Nasty node, it really does something very unusual with files. Whenever I have met this issue it was always caused by a node process. This is really funny...
Yes, after some npm modules were updated today, npm install hangs on every run, even after a full system reboot. I have the latest FreeBSD with all patches installed: FreeBSD 10.3-RELEASE-p4.
Can we please get some feedback on the review for this? https://reviews.freebsd.org/D6533
The error in the following PR possibly relates to this PR too: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209580
(In reply to op from comment #37) Most likely, no.
A commit references this bug:

Author: avg
Date: Fri Aug  5 06:23:07 UTC 2016
New revision: 303763
URL: https://svnweb.freebsd.org/changeset/base/303763

Log:
  zfs: honour and make use of vfs vnode locking protocol

  ZFS POSIX Layer is originally written for Solaris VFS which is very
  different from FreeBSD VFS. Most importantly many things that FreeBSD
  VFS manages on behalf of all filesystems are implemented in ZPL in a
  different way. Thus, ZPL contains code that is redundant on FreeBSD
  or duplicates VFS functionality or, in the worst cases, badly
  interacts / interferes with VFS.

  The most prominent problem is a deadlock caused by the lock order
  reversal of vnode locks that may happen with concurrent zfs_rename()
  and lookup(). The deadlock is a result of zfs_rename() not observing
  the vnode locking contract expected by VFS.

  This commit removes all ZPL internal locking that protects
  parent-child relationships of filesystem nodes. These relationships
  are protected by vnode locks and the code is changed to take advantage
  of that fact and to properly interact with VFS.

  Removal of the internal locking allowed all ZPL dmu_tx_assign calls to
  use TXG_WAIT mode.

  Another victim, disputable perhaps, is ZFS support for filesystems
  with mixed case sensitivity. That support is not provided by the OS
  anyway, so in ZFS it was a buch of dead code.

  To do:
  - replace ZFS_ENTER mechanism with VFS managed / visible mechanism
  - replace zfs_zget with zfs_vget[f] as much as possible
  - get rid of not really useful now zfs_freebsd_* adapters
  - more cleanups of unneeded / unused code
  - fix / replace .zfs support

  PR:           209158
  Reported by:  many
  Tested by:    many (thank you all!)
  MFC after:    5 days
  Sponsored by: HybridCluster / ClusterHQ
  Differential Revision: https://reviews.freebsd.org/D6533

Changes:
  head/sys/cddl/compat/opensolaris/sys/vnode.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_dir.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_vfsops.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_sa.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
A commit references this bug:

Author: avg
Date: Thu Aug 11 20:48:04 UTC 2016
New revision: 303970
URL: https://svnweb.freebsd.org/changeset/base/303970

Log:
  MFC r303763,303791,303869: zfs: honour and make use of vfs vnode locking protocol

  ZFS POSIX Layer is originally written for Solaris VFS which is very
  different from FreeBSD VFS. Most importantly many things that FreeBSD
  VFS manages on behalf of all filesystems are implemented in ZPL in a
  different way. Thus, ZPL contains code that is redundant on FreeBSD
  or duplicates VFS functionality or, in the worst cases, badly
  interacts / interferes with VFS.

  The most prominent problem is a deadlock caused by the lock order
  reversal of vnode locks that may happen with concurrent zfs_rename()
  and lookup(). The deadlock is a result of zfs_rename() not observing
  the vnode locking contract expected by VFS.

  This commit removes all ZPL internal locking that protects
  parent-child relationships of filesystem nodes. These relationships
  are protected by vnode locks and the code is changed to take advantage
  of that fact and to properly interact with VFS.

  Removal of the internal locking allowed all ZPL dmu_tx_assign calls to
  use TXG_WAIT mode.

  Another victim, disputable perhaps, is ZFS support for filesystems
  with mixed case sensitivity. That support is not provided by the OS
  anyway, so in ZFS it was a buch of dead code.

  To do:
  - replace ZFS_ENTER mechanism with VFS managed / visible mechanism
  - replace zfs_zget with zfs_vget[f] as much as possible
  - get rid of not really useful now zfs_freebsd_* adapters
  - more cleanups of unneeded / unused code
  - fix / replace .zfs support

  PR:           209158
  Approved by:  re (gjb)

Changes:
  _U  stable/11/
  stable/11/sys/cddl/compat/opensolaris/sys/vnode.h
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_dir.h
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_vfsops.h
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_sa.c
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
  stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
Hi, will this bug be fixed for 10.3 or 10-STABLE?
A commit references this bug:

Author: avg
Date: Tue Aug 23 07:55:06 UTC 2016
New revision: 304671
URL: https://svnweb.freebsd.org/changeset/base/304671

Log:
  MFC r303763,303791,303869: zfs: honour and make use of vfs vnode locking protocol

  PR:           209158

Changes:
  _U  stable/10/
  stable/10/sys/cddl/compat/opensolaris/sys/vnode.h
  stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_dir.h
  stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_vfsops.h
  stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_znode.h
  stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
  stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c
  stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_sa.c
  stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
  stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
  stable/10/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c
No new regressions have been reported in head and stable/11, so I've just merged this to stable/10. I am not planning MFC to stable/9 because of many merge conflicts. But if anyone does the merge and tests the result, then I can commit it.
Hi,

Sorry to post this after the PR was closed, but I'm having an issue that could be related to this problem. My FreeBSD 11-RC2 with ZFS crashes after some hours of running, with the debug output below. Let me know if I can help with more info:

# kgdb kernel.debug /var/crash/vmcore.last
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:

Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 0e
fault virtual address  = 0xffffffffffffffc0
fault code             = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff822cac63
stack pointer          = 0x28:0xfffffe02370d02e0
frame pointer          = 0x28:0xfffffe02370d0350
code segment           = base rx0, limit 0xfffff, type 0x1b
                       = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags       = interrupt enabled, resume, IOPL = 0
current process        = 1174 (syslogd)
trap number            = 12
panic: page fault
cpuid = 5
KDB: stack backtrace:
#0 0xffffffff80b33417 at kdb_backtrace+0x67
#1 0xffffffff80ae8812 at vpanic+0x182
#2 0xffffffff80ae8683 at panic+0x43
#3 0xffffffff81036d31 at trap_fatal+0x351
#4 0xffffffff81036f23 at trap_pfault+0x1e3
#5 0xffffffff810364cc at trap+0x26c
#6 0xffffffff810191d1 at calltrap+0x8
#7 0xffffffff822a360d at dmu_buf_hold_array_by_dnode+0x34d
#8 0xffffffff822a4703 at dmu_write_uio_dnode+0x43
#9 0xffffffff822a46a2 at dmu_write_uio_dbuf+0x42
#10 0xffffffff82342c48 at zfs_freebsd_write+0x7b8
#11 0xffffffff811a1dc5 at VOP_WRITE_APV+0x125
#12 0xffffffff80bcffec at vn_write+0x25c
#13 0xffffffff80bcd5b2 at vn_io_fault1+0x1c2
#14 0xffffffff80bcb5a7 at vn_io_fault+0x197
#15 0xffffffff80b50667 at dofilewrite+0x87
#16 0xffffffff80b50348 at kern_writev+0x68
#17 0xffffffff80b505b6 at sys_writev+0x36
Uptime: 4h30m39s
Dumping 1031 out of 8142 MB: (CTRL-C to abort) ..2%..11%..21%..32%..41%..52%..61%..72%..81%..92%

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...done. done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /usr/lib/debug//boot/kernel/opensolaris.ko.debug...done. done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/kernel/ng_UI.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_UI.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_UI.ko
Reading symbols from /boot/kernel/netgraph.ko...Reading symbols from /usr/lib/debug//boot/kernel/netgraph.ko.debug...done. done.
Loaded symbols for /boot/kernel/netgraph.ko
Reading symbols from /boot/kernel/ng_async.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_async.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_async.ko
Reading symbols from /boot/kernel/ng_bpf.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_bpf.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_bpf.ko
Reading symbols from /boot/kernel/ng_cisco.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_cisco.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_cisco.ko
Reading symbols from /boot/kernel/ng_echo.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_echo.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_echo.ko
Reading symbols from /boot/kernel/ng_ether.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_ether.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_ether.ko
Reading symbols from /boot/kernel/ng_frame_relay.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_frame_relay.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_frame_relay.ko
Reading symbols from /boot/kernel/ng_hole.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_hole.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_hole.ko
Reading symbols from /boot/kernel/ng_iface.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_iface.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_iface.ko
Reading symbols from /boot/kernel/ng_ksocket.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_ksocket.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_ksocket.ko
Reading symbols from /boot/kernel/ng_l2tp.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_l2tp.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_l2tp.ko
Reading symbols from /boot/kernel/ng_lmi.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_lmi.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_lmi.ko
Reading symbols from /boot/kernel/ng_mppc.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_mppc.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_mppc.ko
Reading symbols from /boot/kernel/rc4.ko...Reading symbols from /usr/lib/debug//boot/kernel/rc4.ko.debug...done. done.
Loaded symbols for /boot/kernel/rc4.ko
Reading symbols from /boot/kernel/ng_netflow.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_netflow.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_netflow.ko
Reading symbols from /boot/kernel/ng_ppp.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_ppp.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_ppp.ko
Reading symbols from /boot/kernel/ng_pppoe.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_pppoe.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_pppoe.ko
Reading symbols from /boot/kernel/ng_pptpgre.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_pptpgre.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_pptpgre.ko
Reading symbols from /boot/kernel/ng_rfc1490.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_rfc1490.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_rfc1490.ko
Reading symbols from /boot/kernel/ng_socket.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_socket.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_socket.ko
Reading symbols from /boot/kernel/ng_tee.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_tee.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_tee.ko
Reading symbols from /boot/kernel/ng_tty.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_tty.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_tty.ko
Reading symbols from /boot/kernel/ng_vjc.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_vjc.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_vjc.ko
Reading symbols from /boot/kernel/coretemp.ko...Reading symbols from /usr/lib/debug//boot/kernel/coretemp.ko.debug...done. done.
Loaded symbols for /boot/kernel/coretemp.ko
Reading symbols from /boot/kernel/ng_tcpmss.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_tcpmss.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_tcpmss.ko
Reading symbols from /boot/kernel/ng_nat.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_nat.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_nat.ko
Reading symbols from /boot/kernel/ng_car.ko...Reading symbols from /usr/lib/debug//boot/kernel/ng_car.ko.debug...done. done.
Loaded symbols for /boot/kernel/ng_car.ko
#0  doadump (textdump=<value optimized out>) at pcpu.h:221
221     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) list *0xffffffff822cac63
0xffffffff822cac63 is in dmu_zfetch (/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_zfetch.c:243).
238
239             rw_enter(&zf->zf_rwlock, RW_READER);
240
241             for (zs = list_head(&zf->zf_stream); zs != NULL;
242                 zs = list_next(&zf->zf_stream, zs)) {
243                     if (blkid == zs->zs_blkid) {
244                             mutex_enter(&zs->zs_lock);
245                             /*
246                              * zs_blkid could have changed before we
247                              * acquired zs_lock; re-check them here.
Current language:  auto; currently minimal
(kgdb) bt
#0  doadump (textdump=<value optimized out>) at pcpu.h:221
#1  0xffffffff80ae8299 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80ae884b in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
#3  0xffffffff80ae8683 in panic (fmt=0x0) at /usr/src/sys/kern/kern_shutdown.c:690
#4  0xffffffff81036d31 in trap_fatal (frame=0xfffffe02370d0230, eva=18446744073709551552) at /usr/src/sys/amd64/amd64/trap.c:841
#5  0xffffffff81036f23 in trap_pfault (frame=0xfffffe02370d0230, usermode=0) at /usr/src/sys/amd64/amd64/trap.c:691
#6  0xffffffff810364cc in trap (frame=0xfffffe02370d0230) at /usr/src/sys/amd64/amd64/trap.c:442
#7  0xffffffff810191d1 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:236
#8  0xffffffff822cac63 in dmu_zfetch (zf=<value optimized out>, blkid=<value optimized out>, nblks=<value optimized out>, fetch_data=<value optimized out>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_zfetch.c:241
#9  0xffffffff822a360d in dmu_buf_hold_array_by_dnode (dn=<value optimized out>, offset=<value optimized out>, length=<value optimized out>, read=0, tag=0xffffffff8239f791, numbufsp=<value optimized out>, dbpp=0xffffffffffffffc0, flags=<value optimized out>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:463
#10 0xffffffff822a4703 in dmu_write_uio_dnode (dn=0xfffff8001bc532c8, uio=0xfffff8010f1a1600, size=47, tx=0xfffff8010f16aa00) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:1173
#11 0xffffffff822a46a2 in dmu_write_uio_dbuf (zdb=0xfffff8001bba4438, uio=0xfffff8010f1a1600, size=47, tx=0xfffff8010f16aa00) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:1244
#12 0xffffffff82342c48 in zfs_freebsd_write (ap=<value optimized out>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:1117
#13 0xffffffff811a1dc5 in VOP_WRITE_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:1000
#14 0xffffffff80bcffec in vn_write (fp=<value optimized out>, uio=0xfffff8010f1a1600, active_cred=<value optimized out>, flags=<value optimized out>, td=0x0) at vnode_if.h:413
#15 0xffffffff80bcd5b2 in vn_io_fault1 () at /usr/src/sys/kern/vfs_vnops.c:1060
#16 0xffffffff80bcb5a7 in vn_io_fault (fp=<value optimized out>, uio=<value optimized out>, active_cred=<value optimized out>, flags=<value optimized out>, td=<value optimized out>) at /usr/src/sys/kern/vfs_vnops.c:1165
#17 0xffffffff80b50667 in dofilewrite (td=0xfffff8001ba19500, fd=19, fp=0xfffff8001b91cd70, auio=0xfffff8010f1a1600, offset=<value optimized out>, flags=0) at file.h:311
#18 0xffffffff80b50348 in kern_writev (td=0xfffff8001ba19500, fd=19, auio=0xfffff8010f1a1600) at /usr/src/sys/kern/sys_generic.c:506
#19 0xffffffff80b505b6 in sys_writev (td=0xfffff8001ba19500, uap=0xfffffe02370d0b80) at /usr/src/sys/kern/sys_generic.c:491
#20 0xffffffff8103768e in amd64_syscall (td=<value optimized out>, traced=0) at subr_syscall.c:135
#21 0xffffffff810194bb in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:396
#22 0x0000000800b87f8a in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb)
(In reply to Cassiano Peixoto from comment #44) Could you please open a new bug report for this problem? Also, in frame 10 could you please print *dn and dn->dn_zfetch? Preferably with pretty printing turned on. Thanks!
Hi Andriy, thanks for your reply. Here is the new PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=212609 About your request, can you please send me the exact commands you need? I'm not so familiar with kgdb. Thanks.
Any idea whether an EN is planned for this bug?
I believe I am seeing activity related to this PR. We're finding that npm gets stuck during updates, and if the parent is killed it leaves a process in an uninterruptible state (this has not even been killable during shutdown, and a previous box with the same symptoms needed a hard reset). Child processes get stuck in a DN state; it does not seem to matter whether npm or node was running prior to the update, as npm is called as part of the update process. This also seems to affect separate periodic scripts such as 100.checksuid, which also become stuck. We now have 5 boxes in this state, and I suspect all will need a hard reboot to clear. Systems are FreeBSD 10.3-RELEASE on ZFS.
*** Bug 217267 has been marked as a duplicate of this bug. ***
Is there a chance to get this into 10.3-RELEASE-pXX?
(In reply to Michael Gmelin from comment #50)
+1 on this. Any chance of having the fix merged to 10.3? I can reliably reproduce the problem at any time on 10.3-RELEASE-p11/-p17 with an npm install. It is quite a showstopper, and we're relying on LTS releases (10.3-RELEASE being the only one currently)...