While rsync'ing some files, the system eventually panics with:

> panic: solaris assert: sa.sa_magic == 0x2F505A (0x51786a88 == 0x2f505a), file: /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c, line: 610

The backtrace is:

> #0 0xffffffff804f23c0 at kdb_backtrace+0x60
> #1 0xffffffff804b3824 at vpanic+0x124
> #2 0xffffffff804b36f3 at panic+0x43
> #3 0xffffffff8109122f at assfail3+0x2f
> #4 0xffffffff80e4d417 at zfs_space_delta_cb+0x107
> #5 0xffffffff80dc156f at dmu_objset_userquota_get_ids+0x36f
> #6 0xffffffff80dc952f at dnode_sync+0x9f
> #7 0xffffffff80dc0aeb at dmu_objset_sync_dnodes+0x2b
> #8 0xffffffff80dc0920 at dmu_objset_sync+0x220
> #9 0xffffffff80dd241a at dsl_dataset_sync+0x1ca
> #10 0xffffffff80ddabca at dsl_pool_sync+0x7a
> #11 0xffffffff80dfa77d at spa_sync+0x57d
> #12 0xffffffff80e03f56 at txg_sync_thread+0x3a6
> #13 0xffffffff8048037a at fork_exit+0x9a
> #14 0xffffffff806ccfde at fork_trampoline+0xe

This happens on:

> FreeBSD fred 10.3-STABLE FreeBSD 10.3-STABLE #1 r308740: Sat Nov 19 21:15:27 GMT 2016 root@fred:/usr/obj/usr/src/sys/FRED amd64

as well as on a newer version:

> FreeBSD fred 10.3-STABLE FreeBSD 10.3-STABLE #0 r312924: Sun Jan 29 01:45:50 GMT 2017 root@fred:/usr/obj/usr/src/sys/FRED amd64

The machine has ECC RAM, and a zpool scrub did not report any errors. The pool is a 2-way mirror.

The actual value of sa.sa_magic changes (not every time, though), e.g. 0x51786a88 or 0x5178691e.

The code in question at zfs_vfsops.c, line 610, is:

> static int
> zfs_space_delta_cb(dmu_object_type_t bonustype, void *data,
>     uint64_t *userp, uint64_t *groupp)
> {
> 	/*
> 	 * Is it a valid type of object to track?
> 	 */
> 	if (bonustype != DMU_OT_ZNODE && bonustype != DMU_OT_SA)
> 		return (SET_ERROR(ENOENT));
>
> 	/*
> 	 * If we have a NULL data pointer
> 	 * then assume the id's aren't changing and
> 	 * return EEXIST to the dmu to let it know to
> 	 * use the same ids
> 	 */
> 	if (data == NULL)
> 		return (SET_ERROR(EEXIST));
>
> 	if (bonustype == DMU_OT_ZNODE) {
> 		znode_phys_t *znp = data;
> 		*userp = znp->zp_uid;
> 		*groupp = znp->zp_gid;
> 	} else {
> 		int hdrsize;
> 		sa_hdr_phys_t *sap = data;
> 		sa_hdr_phys_t sa = *sap;
> 		boolean_t swap = B_FALSE;
>
> 		ASSERT(bonustype == DMU_OT_SA);
>
> 		if (sa.sa_magic == 0) {
> 			/*
> 			 * This should only happen for newly created
> 			 * files that haven't had the znode data filled
> 			 * in yet.
> 			 */
> 			*userp = 0;
> 			*groupp = 0;
> 			return (0);
> 		}
> 		if (sa.sa_magic == BSWAP_32(SA_MAGIC)) {
> 			sa.sa_magic = SA_MAGIC;
> 			sa.sa_layout_info = BSWAP_16(sa.sa_layout_info);
> 			swap = B_TRUE;
> 		} else {
> /*610*/			VERIFY3U(sa.sa_magic, ==, SA_MAGIC); // FAILS
> 		}
>
> 		hdrsize = sa_hdrsize(&sa);
> 		VERIFY3U(hdrsize, >=, sizeof (sa_hdr_phys_t));
> 		*userp = *((uint64_t *)((uintptr_t)data + hdrsize +
> 		    SA_UID_OFFSET));
> 		*groupp = *((uint64_t *)((uintptr_t)data + hdrsize +
> 		    SA_GID_OFFSET));
> 		if (swap) {
> 			*userp = BSWAP_64(*userp);
> 			*groupp = BSWAP_64(*groupp);
> 		}
> 	}
> 	return (0);
> }

The stack trace includes dmu_objset_userquota_get_ids.
And the target file system (into which rsync is writing) has quota enabled:

> NAME  PROPERTY  VALUE  SOURCE
> freddata/backup/alexs-imac  type  filesystem  -
> freddata/backup/alexs-imac  creation  Wed Mar 7 18:01 2012  -
> freddata/backup/alexs-imac  used  521G  -
> freddata/backup/alexs-imac  available  29.2G  -
> freddata/backup/alexs-imac  referenced  213G  -
> freddata/backup/alexs-imac  compressratio  1.11x  -
> freddata/backup/alexs-imac  mounted  yes  -
> freddata/backup/alexs-imac  quota  550G  local
> freddata/backup/alexs-imac  reservation  300G  local
> freddata/backup/alexs-imac  recordsize  128K  default
> freddata/backup/alexs-imac  mountpoint  /backup/alexs-imac  inherited from freddata/backup
> freddata/backup/alexs-imac  sharenfs  off  default
> freddata/backup/alexs-imac  checksum  fletcher4  inherited from freddata
> freddata/backup/alexs-imac  compression  lz4  local
> freddata/backup/alexs-imac  atime  off  inherited from freddata/backup
> freddata/backup/alexs-imac  devices  on  default
> freddata/backup/alexs-imac  exec  on  default
> freddata/backup/alexs-imac  setuid  on  default
> freddata/backup/alexs-imac  readonly  off  default
> freddata/backup/alexs-imac  jailed  off  default
> freddata/backup/alexs-imac  snapdir  hidden  default
> freddata/backup/alexs-imac  aclmode  discard  default
> freddata/backup/alexs-imac  aclinherit  restricted  default
> freddata/backup/alexs-imac  canmount  on  default
> freddata/backup/alexs-imac  xattr  off  temporary
> freddata/backup/alexs-imac  copies  1  default
> freddata/backup/alexs-imac  version  5  -
> freddata/backup/alexs-imac  utf8only  off  -
> freddata/backup/alexs-imac  normalization  none  -
> freddata/backup/alexs-imac  casesensitivity  sensitive  -
> freddata/backup/alexs-imac  vscan  off  default
> freddata/backup/alexs-imac  nbmand  off  default
> freddata/backup/alexs-imac  sharesmb  off  default
> freddata/backup/alexs-imac  refquota  none  default
> freddata/backup/alexs-imac  refreservation  none  default
> freddata/backup/alexs-imac  primarycache  all  default
> freddata/backup/alexs-imac  secondarycache  all  default
> freddata/backup/alexs-imac  usedbysnapshots  308G  -
> freddata/backup/alexs-imac  usedbydataset  213G  -
> freddata/backup/alexs-imac  usedbychildren  0  -
> freddata/backup/alexs-imac  usedbyrefreservation  0  -
> freddata/backup/alexs-imac  logbias  latency  default
> freddata/backup/alexs-imac  dedup  off  default
> freddata/backup/alexs-imac  mlslabel  -
> freddata/backup/alexs-imac  sync  standard  default
> freddata/backup/alexs-imac  refcompressratio  1.06x  -
> freddata/backup/alexs-imac  written  145G  -
> freddata/backup/alexs-imac  logicalused  569G  -
> freddata/backup/alexs-imac  logicalreferenced  221G  -
> freddata/backup/alexs-imac  volmode  default  default
> freddata/backup/alexs-imac  filesystem_limit  none  default
> freddata/backup/alexs-imac  snapshot_limit  none  default
> freddata/backup/alexs-imac  filesystem_count  none  default
> freddata/backup/alexs-imac  snapshot_count  none  default
> freddata/backup/alexs-imac  redundant_metadata  all  default
Did you do 'zfs upgrade' any time recently?
Haven't done any zfs/zpool upgrades in a long time. The zfs version is 5 (for this and all other filesystems in the pool). The zpool version is:

> NAME      PROPERTY                         VALUE     SOURCE
> [...snip...]
> freddata  version                          -         default
> freddata  feature@async_destroy            enabled   local
> freddata  feature@empty_bpobj              active    local
> freddata  feature@lz4_compress             active    local
> freddata  feature@multi_vdev_crash_dump    enabled   local
> freddata  feature@spacemap_histogram       active    local
> freddata  feature@enabled_txg              active    local
> freddata  feature@hole_birth               active    local
> freddata  feature@extensible_dataset       enabled   local
> freddata  feature@embedded_data            disabled  local
> freddata  feature@bookmarks                enabled   local
> freddata  feature@filesystem_limits        disabled  local
> freddata  feature@large_blocks             disabled  local
Could it be that you have some old files that were created before your last zfs upgrade (even if that was a long time ago)? Could it be that rsync updates those files?

You might find this issue (closed, but with no associated commit) interesting: https://github.com/zfsonlinux/zfs/issues/2025

I see that in your case the bogus magic values also look like Unix timestamps, both from 25 April 2013. Could you please check whether you still have more files from that era?

P.S. Some other similar reports in the ZoL repository:
https://github.com/zfsonlinux/zfs/issues/1303
https://github.com/zfsonlinux/zfs/issues/3968
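To illustrate why stale old-format data would read as a timestamp: the legacy znode_phys_t bonus begins with zp_atime, so if a buffer still holding znode data is ever interpreted as an SA bonus, the first four bytes of the access time land exactly where sa_magic is expected. Below is a minimal user-space sketch of that aliasing; the two structs are trimmed, assumed stand-ins for just the leading fields (the real definitions live in zfs_znode.h and sa_impl.h), not the actual kernel code:

> #include <stdio.h>
> #include <stdint.h>
> #include <string.h>
> #include <time.h>
>
> /*
>  * Trimmed stand-ins for the leading fields of the two on-disk layouts
>  * (assumed copies for illustration only).
>  */
> typedef struct {
> 	uint64_t zp_atime[2];		/* seconds, nanoseconds */
> 	uint64_t zp_mtime[2];
> 	/* ... rest of the legacy znode_phys_t ... */
> } old_znode_phys_t;
>
> typedef struct {
> 	uint32_t sa_magic;		/* should be SA_MAGIC, 0x2F505A */
> 	uint16_t sa_layout_info;
> 	uint16_t sa_lengths[1];
> } sa_hdr_t;
>
> int
> main(void)
> {
> 	unsigned char bonus[sizeof (old_znode_phys_t)];
> 	old_znode_phys_t zp = { { 0 } };
> 	sa_hdr_t sa;
> 	time_t t;
>
> 	/* Pretend this is an old-format file last accessed in 2013. */
> 	zp.zp_atime[0] = 0x51786a88;	/* seconds since the epoch */
> 	memcpy(bonus, &zp, sizeof (zp));
>
> 	/* Now read the same bytes as if the bonus type were DMU_OT_SA. */
> 	memcpy(&sa, bonus, sizeof (sa));
> 	printf("sa_magic = 0x%x (expected 0x2F505A)\n",
> 	    (unsigned int)sa.sa_magic);
>
> 	t = (time_t)sa.sa_magic;
> 	printf("as a timestamp: %s", ctime(&t));
> 	return (0);
> }

On a little-endian amd64 machine this prints sa_magic = 0x51786a88 and an April 2013 date, i.e. exactly the shape of the values in the panic messages above.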
Thanks Andriy. There are indeed lots of (very) old files there, and I have an earlier snapshot from a few years ago that references all of them.

I deleted all files in the filesystem and rsync'd everything anew: no panic. I also rolled back to the old snapshot (with all the old files present) and tried rsync both with and without --xattrs: it panics in both cases, but the time until the panic varies a lot, so it looks like a race condition.

The question now is: is there bogus data on disk (old-format data that wasn't properly upgraded on the fly) with a correct checksum that the current code trips over, or is the race condition still present when it upgrades old files on the fly?
(In reply to johannes from comment #4)
I believe that it is the race condition; see https://github.com/zfsonlinux/zfs/issues/2025#issuecomment-40459546
I think that what I wrote there still holds, and there hasn't been any fix in this area.
Haven't encountered this problem in a long time. Closing it.
(In reply to johannes from comment #6) I believe that the problem still exists, but it's very specific (post zfs upgrade) and rare.
(In reply to Andriy Gapon from comment #7)
We hit this problem twice last week. One machine is based on 12.0, the second one on 13.0 (before the ZoL migration). I looked at the ZoL code and it seems to me that the code hasn't changed here, so I think it is possible to get such a panic on up-to-date OpenZFS too. I have a core dump from one of the panics; let me know if you are interested in seeing anything from it.

panic: solaris assert: sa.sa_magic == 0x2F505A (0xb656f0e0 == 0x2f505a), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c, line: 609
cpuid = 4
time = 1623060642
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe3fd9321780
vpanic() at vpanic+0x194/frame 0xfffffe3fd93217e0
panic() at panic+0x43/frame 0xfffffe3fd9321840
assfail3() at assfail3+0x2c/frame 0xfffffe3fd9321860
zfs_space_delta_cb() at zfs_space_delta_cb+0x100/frame 0xfffffe3fd93218a0
dmu_objset_userquota_get_ids() at dmu_objset_userquota_get_ids+0x1b7/frame 0xfffffe3fd93218f0
dnode_sync() at dnode_sync+0xa6/frame 0xfffffe3fd9321980
sync_dnodes_task() at sync_dnodes_task+0x92/frame 0xfffffe3fd93219c0
taskq_run() at taskq_run+0x10/frame 0xfffffe3fd93219e0
taskqueue_run_locked() at taskqueue_run_locked+0x147/frame 0xfffffe3fd9321a40
taskqueue_thread_loop() at taskqueue_thread_loop+0xb8/frame 0xfffffe3fd9321a70
fork_exit() at fork_exit+0x86/frame 0xfffffe3fd9321ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe3fd9321ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 14d8h32m53s
Dumping 14912 out of 261996 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at ./machine/pcpu.h:232
232     ./machine/pcpu.h: No such file or directory.
(kgdb) bt
#0  __curthread () at ./machine/pcpu.h:232
#1  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:318
#2  0xffffffff80aa31c3 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:386
#3  0xffffffff80aa36ae in vpanic (fmt=<optimized out>, ap=0xfffffe3fd9321820) at /usr/src/sys/kern/kern_shutdown.c:779
#4  0xffffffff80aa34d3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:710
#5  0xffffffff822df23c in assfail3 (a=<unavailable>, lv=<unavailable>, op=<unavailable>, rv=<unavailable>, f=<unavailable>, l=<optimized out>) at /usr/src/sys/cddl/compat/opensolaris/kern/opensolaris_cmn_err.c:91
#6  0xffffffff8209e9c0 in zfs_space_delta_cb (bonustype=<optimized out>, data=0xfffff803e64b2f40, userp=0xfffff8052a3b2278, groupp=0xfffff8052a3b2280) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:609
#7  0xffffffff8200d917 in dmu_objset_userquota_get_ids (dn=0xfffff8052a3b2000, before=0, tx=<optimized out>) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:1592
#8  0xffffffff82015396 in dnode_sync (dn=0xfffff8052a3b2000, tx=0xfffff803ae49e800) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c:570
#9  0xffffffff8200d192 in dmu_objset_sync_dnodes (list=0xfffff801763bb420, tx=0xfffff803ae49e800) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:1093
#10 sync_dnodes_task (arg=0xfffff82419cdd260) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:1160
#11 0xffffffff81fd3750 in taskq_run (arg=0xfffff801c66d08d0, pending=<unavailable>) at /usr/src/sys/cddl/compat/opensolaris/kern/opensolaris_taskq.c:109
#12 0xffffffff80afc0e7 in taskqueue_run_locked (queue=0xfffff8016b01cd00) at /usr/src/sys/kern/subr_taskqueue.c:463
#13 0xffffffff80afd2c8 in taskqueue_thread_loop (arg=<optimized out>) at /usr/src/sys/kern/subr_taskqueue.c:755
#14 0xffffffff80a65796 in fork_exit (callout=0xffffffff80afd210 <taskqueue_thread_loop>, arg=0xfffff801281c3f20, frame=0xfffffe3fd9321ac0) at /usr/src/sys/kern/kern_fork.c:1038
#15 <signal handler called>
(In reply to Andrey V. Elsukov from comment #8) I think that at this point there is enough information about the problem. So far I could not come up with a good way to fix it, but to be honest I did not try very hard.
The problem has not been fixed, so no reason to close this report.

Also see:
- https://github.com/zfsonlinux/zfs/issues/2025 (closed)
- https://github.com/openzfs/zfs/issues/6332 (open)
- https://github.com/openzfsonosx/openzfs/issues/5 (open)