[164936] Uptime: 1d21h48m56s
[164937]
[164937]
[164937] Fatal trap 9: general protection fault while in kernel mode
[164937] cpuid = 8; apic id = 08
[164937] instruction pointer     = 0x20:0xffffffff80b70534
[164937] stack pointer           = 0x28:0xfffffe009f13f950
[164937] frame pointer           = 0x28:0xfffffe009f13f980
[164937] code segment            = base 0x0, limit 0xfffff, type 0x1b
[164937]                         = DPL 0, pres 1, long 1, def32 0, gran 1
[164937] processor eflags        = interrupt enabled, resume, IOPL = 0
[164937] current process         = 0 (dbu_evict)

Further details here: https://reviews.freebsd.org/P175
I have a similar crash on shutdown; screenshot attached to the link above (https://reviews.freebsd.org/P175).

FreeBSD build3.local 12.0-CURRENT FreeBSD 12.0-CURRENT #3 r332467: Fri Apr 13 04:59:35 PDT 2018     root@build3.local:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64
A similar report by Harry Schmalzbauer <freebsd@omnilan.de> on the mailing lists:

...
#8  0xffffffff80ae89fc in calltrap () at /usr/local/share/deploy-tools/HEAD/src/sys/amd64/amd64/exception.S:230
#9  0xffffffff81b13984 in spa_get_dsl () from /boot/kernel/zfs.ko
#10 0xffffffff81b17d1f in txg_verify () from /boot/kernel/zfs.ko
#11 0xffffffff81b1829f in txg_list_member () from /boot/kernel/zfs.ko
#12 0xffffffff81acb969 in dsl_dir_evict_async () from /boot/kernel/zfs.ko
#13 0xffffffff808260cc in taskqueue_run_locked (queue=0xfffff80004058800) at /usr/local/share/deploy-tools/HEAD/src/sys/kern/subr_taskqueue.c:465
#14 0xffffffff80826e98 in taskqueue_thread_loop (arg=<value optimized out>) at /usr/local/share/deploy-tools/HEAD/src/sys/kern/subr_taskqueue.c:757
#15 0xffffffff8078add4 in fork_exit (callout=0xffffffff80826e10 <taskqueue_thread_loop>, arg=0xfffff80003b46e40, frame=0xfffffe0000575ac0) at /usr/local/share/deploy-tools/HEAD/src/sys/kern/kern_fork.c:1039
(In reply to Julian Elischer from comment #1)

Bugzilla supports attachments too. In fact, I wish that at least the stack trace from the original report was here as well.
Created attachment 193474
initial report and stacktrace
I added a copy of the report. I wish that bugzilla and phab were more closely linked :\
(In reply to Eitan Adler from comment #5)

Do you still have the crash dump? If so, could you please provide the full output of 'p *dd'?
Unfortunately I no longer have this specific crash, but I do have two crashes that look similar on shutdown.

---- CrashB ("shutdown"):
#11 _rw_wlock_cookie (c=0xdeadc0dedeadc286, file=0xffffffff81286e05 "/usr/src/sys/net/if.c", line=3646) at /usr/src/sys/kern/kern_rwlock.c:279
        rw = 0xdeadc0dedeadc26e
        v = <optimized out>
        tid = <optimized out>
#12 0xffffffff80c6e48b in if_delmulti_ifma (ifma=0xfffff8001f2cfb00) at /usr/src/sys/net/if.c:3646
        ifp = 0xdeadc0dedeadc0de
        lastref = <optimized out>
#13 0xffffffff80daf699 in in6m_release (inm=<optimized out>) at /usr/src/sys/netinet6/in6_mcast.c:545
        ifma = 0xfffff8001f2cfb00
        ifp = <optimized out>
        saved_vnet = 0x0
#14 in6m_release_task (arg=<optimized out>) at /usr/src/sys/netinet6/in6_mcast.c:617
        inm = <optimized out>
        in6m_free_tmp = {slh_first = 0xfffff800071bfc00}
        tinm = 0x0
#15 0xffffffff80bb0c99 in gtaskqueue_run_locked (queue=0xfffff80003752b00) at /usr/src/sys/kern/subr_gtaskqueue.c:332
        tb_first = <optimized out>
        gtask = 0xffffffff81f9d6d8 <free_gtask>
        tb = <optimized out>
#16 0xffffffff80bb0a18 in gtaskqueue_thread_loop (arg=<optimized out>) at /usr/src/sys/kern/subr_gtaskqueue.c:507
        tqp = <optimized out>
        tq = 0xfffff80003752b00
#17 0xffffffff80b2d074 in fork_exit (callout=0xffffffff80bb0990 <gtaskqueue_thread_loop>, arg=0xfffffe00bbcc6008, frame=0xfffffe00004b9ac0) at /usr/src/sys/kern/kern_fork.c:1039
        td = 0xfffff8000387a000
        p = 0xffffffff81feeaa0 <proc0>
        dtd = <optimized out>
#18 <signal handler called>

------- CrashC ("shutdown-crash-3") crashes gdb, so no further debugging information is available. This is from the minidump (kgdb):
#0  __curthread () at ./machine/pcpu.h:231
#1  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80432eac in db_fncall_generic (addr=<optimized out>, rv=<optimized out>, nargs=<optimized out>, args=<optimized out>) at /usr/src/sys/ddb/db_command.c:609
#3  db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:657
#4  0xffffffff804329dd in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=<optimized out>) at /usr/src/sys/ddb/db_command.c:481
#5  0xffffffff80432774 in db_command_loop () at /usr/src/sys/ddb/db_command.c:534
#6  0xffffffff8043595f in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:250
#7  0xffffffff80b9c8b3 in kdb_trap (type=9, code=0, tf=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:697
#8  0xffffffff81010a8f in trap_fatal (frame=0xfffffe00bdb58890, eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:872
#9  0xffffffff8101006d in trap (frame=0xfffffe00bdb58890) at /usr/src/sys/amd64/amd64/trap.c:205
#10 <signal handler called>
#11 _sx_xlock (sx=0xdeadc0dedeadd47e, opts=0, file=0xffffffff827445a7 "/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/refcount.c", line=162) at /usr/src/sys/kern/kern_sx.c:320
#12 0xffffffff826772da in refcount_remove_many (rc=0xdeadc0dedeadd47e, number=1, holder=0xfffff80045e39400) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/refcount.c:162
#13 0xffffffff8264bd7c in dsl_dir_evict_async (dbu=0xfffff80045e39400) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:158
#14 0xffffffff80baf95c in taskqueue_run_locked (queue=0xfffff80035845300) at /usr/src/sys/kern/subr_taskqueue.c:465
#15 0xffffffff80bb0728 in taskqueue_thread_loop (arg=<optimized out>) at /usr/src/sys/kern/subr_taskqueue.c:757
#16 0xffffffff80b17f04 in fork_exit (callout=0xffffffff80bb06a0 <taskqueue_thread_loop>, arg=0xfffff8000511d0b0, frame=0xfffffe00bdb58ac0) at /usr/src/sys/kern/kern_fork.c:1039
#17 <signal handler called>
CrashC looks interesting. Have you tried the legacy gdb in base and the ports gdb?
The crash is from ports kgdb and is being tracked at bug #228006.

/usr/libexec/kgdb on "shutdown-crash-3" shows:

∴ /usr/libexec/kgdb kernel/kernel vmcore.5
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd".../home/eax/.gdbinit:4: Error in sourced command file:
No symbol table is loaded.  Use the "file" command.
BFD: /home/eax/crashes/shutdown-crash-3/kernel/kernel: invalid relocation type 37
BFD: BFD 2.17.50 [FreeBSD] 2007-07-03 assertion fail /usr/src/gnu/usr.bin/binutils/libbfd/../../../../contrib/binutils/bfd/elf64-x86-64.c:276
BFD: /home/eax/crashes/shutdown-crash-3/kernel/kernel: invalid relocation type 37
BFD: BFD 2.17.50 [FreeBSD] 2007-07-03 assertion fail /usr/src/gnu/usr.bin/binutils/libbfd/../../../../contrib/binutils/bfd/elf64-x86-64.c:276
Cannot access memory at address 0x0
(kgdb) file kernel/kernel.debug
Load new symbol table from "kernel/kernel.debug"? (y or n) y
Reading symbols from kernel/kernel.debug...done.
(kgdb) bt
#0  0x0000000000000000 in ?? ()
Cannot access memory at address 0x0
(In reply to Andriy Gapon from comment #6)
> Do you still have the crash dump?
> If so, could you please provide full output of 'p *dd' ?

I still observe the panic every day, so I have a crash dump:

(kgdb) frame 10
#10 0xffffffff8035f6dc in dsl_dir_evict_async (dbu=0xfffff80006b67400)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:158
158             spa_async_close(dd->dd_pool->dp_spa, dd);
(kgdb) p *dd
$7 = {dd_dbu = {dbu_tqent = {tqent_task = {ta_link = {stqe_next = 0xfffff8000689b400},
        ta_pending = 0, ta_priority = 0, ta_func = 0xffffffff802f5410 <taskq_run_ent>,
        ta_context = 0xfffff80006b67400}, tqent_func = 0xffffffff8035f4e0 <dsl_dir_evict_async>,
      tqent_arg = 0xfffff80006b67400}, dbu_evict_func_sync = 0x0,
    dbu_evict_func_async = 0xffffffff8035f4e0 <dsl_dir_evict_async>,
    dbu_clear_on_evict_dbufp = 0xfffff80006b67458}, dd_object = 12,
  dd_pool = 0xfffff800066f5800, dd_dbuf = 0x0, dd_dirty_link = {tn_next = {0x0, 0x0, 0x0, 0x0},
    tn_member = "\000\000\000"}, dd_parent = 0xfffff80006b66c00, dd_lock = {lock_object = {
      lo_name = 0xffffffff80999c14 "dd->dd_lock", lo_flags = 577830912, lo_data = 0,
      lo_witness = 0x0}, sx_lock = 1}, dd_props = {list_size = 56, list_offset = 0,
    list_head = {list_next = 0xfffff80006b674c0, list_prev = 0xfffff80006b674c0}},
  dd_snap_cmtime = {tv_sec = 1534644915, tv_nsec = 715064905}, dd_origin_txg = 0,
  dd_tempreserved = {0, 0, 0, 0}, dd_space_towrite = {0, 0, 0, 0},
  dd_myname = "$ORIGIN", '\000' <repeats 248 times>}
(kgdb) printf "%X\n", *(int *)dd->dd_pool
DEADC0DE

It looks like the memory referenced by dd->dd_pool has already been freed by the time spa_async_close() is called.
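A side note for readers of this dump: 0xdeadc0de is the pattern FreeBSD's UMA allocator writes over freed items when memory trashing is enabled, so seeing it through dd->dd_pool is strong evidence of a use-after-free. The following self-contained userland sketch reproduces the shape of the failure; the dsl_dir_t/dsl_pool_t definitions and trash_free() here are illustrative stand-ins, not the real ZFS or UMA code:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-ins for the real ZFS structures. */
typedef struct dsl_pool { void *dp_spa; } dsl_pool_t;
typedef struct dsl_dir  { dsl_pool_t *dd_pool; } dsl_dir_t;

/* Mimic UMA trashing: fill "freed" memory with 0xdeadc0de. */
static void
trash_free(void *p, size_t sz)
{
	uint32_t pat = 0xdeadc0de;

	for (size_t i = 0; i + sizeof(pat) <= sz; i += sizeof(pat))
		memcpy((char *)p + i, &pat, sizeof(pat));
}

int
main(void)
{
	dsl_pool_t *dp = malloc(sizeof(*dp));
	dsl_dir_t dd = { .dd_pool = dp };

	/* The pool is torn down first... */
	trash_free(dp, sizeof(*dp));

	/*
	 * ...then the async eviction chases the stale pointer.  In the
	 * kernel, this wild load is what raises the fatal trap 9.
	 */
	printf("dd.dd_pool->dp_spa = %p\n", dd.dd_pool->dp_spa);
	return (0);
}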
I've also seen this issue on FreeBSD 11.2-RELEASE-p2 (not CURRENT) in a somewhat reproducible fashion: under heavy ZFS load reading and writing via NFS, the panic _eventually_ happens without fail. The way I trigger it each time is by running a bittorrent client on another machine that writes to the NFS share. The stack trace is always the same.

Problem machine:
FreeBSD nas 11.2-RELEASE-p2 FreeBSD 11.2-RELEASE-p2 #0: Tue Aug 14 21:45:40 UTC 2018     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

Entry in /var/log/messages on reboot:

Sep  4 23:08:35 nas kernel: Fatal trap 9: general protection fault while in kernel mode
Sep  4 23:08:35 nas kernel: cpuid = 0; apic id = 00
Sep  4 23:08:35 nas kernel: instruction pointer     = 0x20:0xffffffff80f7572e
Sep  4 23:08:35 nas kernel: stack pointer           = 0x28:0xfffffe023273a9b0
Sep  4 23:08:35 nas kernel: frame pointer           = 0x28:0xfffffe023273a9b0
Sep  4 23:08:35 nas kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
Sep  4 23:08:35 nas kernel:                         = DPL 0, pres 1, long 1, def32 0, gran 1
Sep  4 23:08:35 nas kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
Sep  4 23:08:35 nas kernel: current process         = 0 (zio_read_intr_0_4)
Sep  4 23:08:35 nas kernel: trap number             = 9
Sep  4 23:08:35 nas kernel: panic: general protection fault
Sep  4 23:08:35 nas kernel: cpuid = 0
Sep  4 23:08:35 nas kernel: KDB: stack backtrace:
Sep  4 23:08:35 nas kernel: #0 0xffffffff80b3d567 at kdb_backtrace+0x67
Sep  4 23:08:35 nas kernel: #1 0xffffffff80af6b07 at vpanic+0x177
Sep  4 23:08:35 nas kernel: #2 0xffffffff80af6983 at panic+0x43
Sep  4 23:08:35 nas kernel: #3 0xffffffff80f77fcf at trap_fatal+0x35f
Sep  4 23:08:35 nas kernel: #4 0xffffffff80f7758e at trap+0x5e
Sep  4 23:08:35 nas kernel: #5 0xffffffff80f57dac at calltrap+0x8
Sep  4 23:08:35 nas kernel: #6 0xffffffff8228a6b6 at abd_copy_off+0x156
Sep  4 23:08:35 nas kernel: #7 0xffffffff82307ffa at vdev_queue_agg_io_done+0x6a
Sep  4 23:08:35 nas kernel: #8 0xffffffff8232ecfe at zio_done+0x90e
Sep  4 23:08:35 nas kernel: #9 0xffffffff8232a74c at zio_execute+0xac
Sep  4 23:08:35 nas kernel: #10 0xffffffff80b4ed74 at taskqueue_run_locked+0x154
Sep  4 23:08:35 nas kernel: #11 0xffffffff80b4fed8 at taskqueue_thread_loop+0x98
Sep  4 23:08:35 nas kernel: #12 0xffffffff80aba083 at fork_exit+0x83

zpool status:

root@nas:/var/log # zpool status
  pool: zroot
 state: ONLINE
  scan: scrub repaired 24K in 0h7m with 0 errors on Tue Sep  4 23:21:17 2018
config:

	NAME        STATE     READ WRITE CKSUM
	zroot       ONLINE       0     0     0
	  raidz3-0  ONLINE       0     0     0
	    ada0p3  ONLINE       0     0     0
	    ada1p3  ONLINE       0     0     0
	    ada2p3  ONLINE       0     0     0
	    ada3p3  ONLINE       0     0     0
	    ada4p3  ONLINE       0     0     0
	    ada5p3  ONLINE       0     0     0
	    ada6p3  ONLINE       0     0     0
	    ada7p3  ONLINE       0     0     0

errors: No known data errors

I can collect a crash dump if it will be useful.
Is this still occurring on recent 12.0-ALPHA builds?
(In reply to Mark Johnston from comment #12)
> Is this still occurring on recent 12.0-ALPHA builds?

I will try to update my box to the latest CURRENT and leave it powered on overnight; the bug requires several hours of uptime to trigger. ALPHA6 is affected by it.
(In reply to Mark Johnston from comment #12)

12.0-ALPHA9 r339280 still panics.
I took a look at a vmcore provided by wulf@. At the time of the panic, the kernel was waiting for MOS dnode dbuf evictions to finish:

(kgdb) bt
#0  sched_switch (td=0xfffff800035d3000, newtd=0xfffff800035d2580, flags=<optimized out>) at /usr/src/sys/kern/sched_ule.c:2112
#1  0xffffffff806a759f in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:439
#2  0xffffffff806f0d8d in sleepq_switch (wchan=0xfffffe008dffe390, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:613
#3  0xffffffff806f0c33 in sleepq_wait (wchan=0xfffffe008dffe390, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:692
#4  0xffffffff806381f3 in _cv_wait (cvp=0xfffffe008dffe390, lock=<optimized out>) at /usr/src/sys/kern/kern_condvar.c:146
#5  0xffffffff8039d5db in spa_evicting_os_wait (spa=<optimized out>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c:1959
#6  0xffffffff8038ad9b in spa_deactivate (spa=0xfffffe008dffe000) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:1272
#7  0xffffffff80393b88 in spa_evict_all () at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:8350
#8  0xffffffff8039dade in spa_fini () at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c:2141
#9  0xffffffff803e6bdc in zfs__fini () at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:7109
#10 0xffffffff8069bf86 in kern_reboot (howto=16392) at /usr/src/sys/kern/kern_shutdown.c:443
#11 0xffffffff8069bb4a in sys_reboot (td=<optimized out>, uap=0xfffff800035d33c0) at /usr/src/sys/kern/kern_shutdown.c:280

At this point, the spa_unload() call preceding the spa_deactivate() call had already freed the pool. However, dsl_pool_close() calls dmu_buf_user_evict_wait() after kicking off evictions of the top-level directories:

452 	/*
453 	 * Drop our references from dsl_pool_open().
454 	 *
455 	 * Since we held the origin_snap from "syncing" context (which
456 	 * includes pool-opening context), it actually only got a "ref"
457 	 * and not a hold, so just drop that here.
458 	 */
459 	if (dp->dp_origin_snap != NULL)
460 		dsl_dataset_rele(dp->dp_origin_snap, dp);
461 	if (dp->dp_mos_dir != NULL)
462 		dsl_dir_rele(dp->dp_mos_dir, dp);
463 	if (dp->dp_free_dir != NULL)
464 		dsl_dir_rele(dp->dp_free_dir, dp);
465 	if (dp->dp_leak_dir != NULL)
466 		dsl_dir_rele(dp->dp_leak_dir, dp);
467 	if (dp->dp_root_dir != NULL)
468 		dsl_dir_rele(dp->dp_root_dir, dp);
...
496 	dmu_buf_user_evict_wait();

Looking a bit at the dbuf:

(kgdb) frame 12
#12 0xffffffff8036221c in dsl_dir_evict_async (dbu=0xfffff800053da400)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:158
158             spa_async_close(dd->dd_pool->dp_spa, dd);
(kgdb) p dd->dd_myname
$42 = "$ORIGIN", '\000' <repeats 248 times>
(kgdb) p dd->dd_parent->dd_myname
$43 = "u01", '\000' <repeats 252 times>

I'm not sure what $ORIGIN is; I guess it's some ZFS metadata.

I looked at taskq_wait() in FreeBSD vs. illumos. On FreeBSD it will only wait for currently queued tasks to finish; anything enqueued after the drain starts may not be finished by the time we return. On illumos it looks like taskq_wait() will wait until the queue is completely empty. So, if the async evictions queue some additional evictions, on FreeBSD we won't recursively wait, and the taskq_wait() will return early. I can't tell if ZFS is making this assumption, though.
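To make the drain-vs-quiesce distinction above concrete, here is a hedged sketch of the two wait semantics; the struct and helpers are illustrative stand-ins, not the real subr_taskqueue.c or illumos internals:

/* Illustrative queue state, not the real taskqueue(9) structure. */
struct tq_sketch {
	int	queued;		/* tasks waiting to run */
	int	running;	/* tasks currently executing */
	long	completed;	/* tasks finished since creation */
};

/* Placeholder for "block until the queue state changes". */
static void
tq_sleep(struct tq_sketch *tq)
{
	(void)tq;
}

/*
 * Drain semantics (FreeBSD's taskq_wait() at the time of this report):
 * snapshot the work outstanding at entry and wait only for that much.
 * Evictions enqueued *by* those tasks are not covered, so the caller
 * can return while the queue still holds work.
 */
static void
taskq_wait_drain(struct tq_sketch *tq)
{
	long barrier = tq->completed + tq->queued + tq->running;

	while (tq->completed < barrier)
		tq_sleep(tq);
}

/*
 * Quiesce semantics (illumos taskq_wait()): keep waiting until the
 * queue is observed completely idle, which also covers recursively
 * enqueued tasks such as the extra dbuf evictions described above.
 */
static void
taskq_wait_quiesce(struct tq_sketch *tq)
{
	while (tq->queued > 0 || tq->running > 0)
		tq_sleep(tq);
}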
(In reply to Mark Johnston from comment #15)

$ORIGIN is a special hidden dataset that has no data and has a single snapshot, $ORIGIN@$ORIGIN. All datasets that are not clones of any "real" dataset are created as clones of $ORIGIN@$ORIGIN.

Good observation on the taskq_wait() behavior. Maybe that explains the problem.
Created attachment 199035
proposed patch

Would anyone hitting the problem be willing to test the attached patch? I have only compile-tested it; please be correspondingly cautious when testing.
(In reply to Mark Johnston from comment #17)

No panics so far.
(In reply to Vladimir Kondratyev from comment #18)

Thanks. https://reviews.freebsd.org/D17975
A commit references this bug:

Author: markj
Date: Wed Nov 21 17:18:28 UTC 2018
New revision: 340730
URL: https://svnweb.freebsd.org/changeset/base/340730

Log:
  Add a taskqueue_quiesce(9) KPI.

  This is similar to taskqueue_drain_all(9) but will wait for the queue
  to become idle before returning instead of only waiting for
  already-enqueued tasks to finish. This will be used in the opensolaris
  compat layer.

  PR:		227784
  Reviewed by:	cem
  MFC after:	1 week
  Sponsored by:	The FreeBSD Foundation
  Differential Revision:	https://reviews.freebsd.org/D17975

Changes:
  head/share/man/man9/Makefile
  head/share/man/man9/taskqueue.9
  head/sys/kern/subr_taskqueue.c
  head/sys/sys/taskqueue.h
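For reference, a minimal sketch of how a subsystem might use the new KPI on teardown; the queue variable and my_subsystem_fini() are hypothetical names, and only taskqueue_quiesce() and taskqueue_free() are real taskqueue(9) calls:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/taskqueue.h>

static struct taskqueue *my_tq;	/* hypothetical private queue */

static void
my_subsystem_fini(void)
{
	/*
	 * Unlike taskqueue_drain_all(9), taskqueue_quiesce(9) returns
	 * only once the queue has no queued or running tasks at all,
	 * so tasks enqueued by other tasks are also finished before
	 * the backing state is freed.
	 */
	taskqueue_quiesce(my_tq);
	taskqueue_free(my_tq);
}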
A commit references this bug:

Author: markj
Date: Wed Nov 21 17:19:09 UTC 2018
New revision: 340731
URL: https://svnweb.freebsd.org/changeset/base/340731

Log:
  Use taskqueue_quiesce(9) to implement taskq_wait().

  PR:		227784
  Reviewed by:	cem
  MFC after:	1 week
  Sponsored by:	The FreeBSD Foundation
  Differential Revision:	https://reviews.freebsd.org/D17975

Changes:
  head/sys/cddl/compat/opensolaris/kern/opensolaris_taskq.c
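Per the commit message, the compat-layer change amounts to switching taskq_wait() from drain to quiesce semantics. Roughly, as a before/after sketch based on the commit description rather than a verbatim copy of opensolaris_taskq.c:

/* Before: drain semantics; could return while eviction tasks were
 * still enqueueing further evictions. */
void
taskq_wait(taskq_t *tq)
{
	taskqueue_drain_all(tq->tq_queue);
}

/* After: quiesce semantics; waits until the queue is fully idle, so
 * the recursive MOS dbuf evictions from comment 15 are covered. */
void
taskq_wait(taskq_t *tq)
{
	taskqueue_quiesce(tq->tq_queue);
}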
A commit references this bug:

Author: markj
Date: Wed Nov 28 16:52:43 UTC 2018
New revision: 341152
URL: https://svnweb.freebsd.org/changeset/base/341152

Log:
  MFC r340730, r340731:
  Add taskqueue_quiesce(9) and use it to implement taskq_wait().

  PR:	227784

Changes:
  _U  stable/12/
  stable/12/share/man/man9/Makefile
  stable/12/share/man/man9/taskqueue.9
  stable/12/sys/cddl/compat/opensolaris/kern/opensolaris_taskq.c
  stable/12/sys/kern/subr_taskqueue.c
  stable/12/sys/sys/taskqueue.h
A commit references this bug:

Author: markj
Date: Wed Nov 28 17:00:19 UTC 2018
New revision: 341154
URL: https://svnweb.freebsd.org/changeset/base/341154

Log:
  MFC r340730, r340731:
  Add taskqueue_quiesce(9) and use it to implement taskq_wait().

  PR:	227784

Changes:
  _U  stable/11/
  stable/11/share/man/man9/Makefile
  stable/11/share/man/man9/taskqueue.9
  stable/11/sys/kern/subr_taskqueue.c
  stable/11/sys/sys/taskqueue.h