| Summary: | zfs acl lost after zfs send-receive. Kernel panic |
|---|---|
| Product: | Base System |
| Reporter: | Yuriy Tabolin <danmer> |
| Component: | kern |
| Assignee: | Alexander Motin <mav> |
| Status: | In Progress |
| Severity: | Affects Only Me |
| CC: | acksist, avg, enricocavalli+freebsdbugzilla, fxguidet, jbreitman, mav, nhuff, o.bende, trasz |
| Priority: | --- |
| Keywords: | crash |
| Version: | 10.1-RELEASE |
| Hardware: | Any |
| OS: | Any |
| Attachments: | spill block send patch (attachment 210363) |
Description
Yuriy Tabolin
2015-03-09 16:14:40 UTC

I experienced the same problem in 2014 and tried again today with the same result. Is there any news regarding this? I cannot believe there are only three people in the world having this issue. Links to other bug tickets:

https://bugs.freenas.org/issues/5225
https://forums.freenas.org/index.php?threads/zfs-replication-corrupts-entire-zfs-volume-warning.17566/#post-126243

So far this has only been posted in FreeNAS bug trackers, but it is the same issue here.

---

Probably, this is a similar/same issue: http://markmail.org/message/4nl4dzkmuo7gidlu

---

So, from my reading, it panics because for some reason it cannot gracefully handle an error returned by zfs_acl_node_read(). Why? No idea. Note, however, that this code is not specific to FreeBSD - it might be a good idea to consult the Nexenta or ZoL folks.

---

Thanks for your comment. I think the main problem is not the handling of the error, but that there is an error at all! I can (probably) understand why the system should panic because of such a file-system issue - but why does this issue exist in the first place? In summary: why are the ACLs broken after replication? And another question: why do you think we should contact ZoL? I always thought ZoL and FreeBSD ZFS were two different systems. In fact, I did not choose ZoL because it is not really considered production-ready - but FreeBSD ZFS is.

---

(In reply to o.bende from comment #4)

No idea. I'm guessing it's some kind of corruption, but one that for some reason only affects ACLs on send/receive. My advice about talking to the Nexenta/ZoL folks is that because the code in question is not specific to FreeBSD, it might also affect them. They may have already found and fixed the bug. Otherwise... This is pure speculation, but the user/group identifiers used natively by ZFS are somewhat complicated (ZFS supports various identifier schemes, e.g. Windows SIDs), and there might be some FUID-related problem? Again - a shot in the dark.

---

Hi there, any update on this bug? We have the same issue on FreeBSD 10.1: no kernel panic, but some ACLs are lost. To get access to an affected file again, we need to do a mv and then reconfigure the ACL on the file (a sketch of this workaround follows below).
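The workaround described above amounts to something like the following. This is only a sketch: the path /tank/share/file.txt and the ACL entry are hypothetical placeholders, and the NFSv4 ACL entry should be replaced with whatever the file is actually supposed to carry.

```sh
# Move the affected file aside and back so the ACL can be reapplied cleanly,
# then set the desired NFSv4 ACL entry by hand and verify it.
mv /tank/share/file.txt /tank/share/file.txt.tmp
mv /tank/share/file.txt.tmp /tank/share/file.txt
setfacl -m u:backup:read_set::allow /tank/share/file.txt   # hypothetical entry
getfacl /tank/share/file.txt                               # confirm the ACL is back
```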
---

Hi. I have found this same issue to still be present in 11.0-RELEASE. A replicated pool with several TB of data, several volumes, and some 50 snapshots was sent to a new pool on another system. All the files were verified on both pools in the most recent snapshot; md5 hashes generated with cfv matched. This comparison was run as root, and access to the files caused no problem. Then the new pool was put into production, supplying a Samba volume for Windows backups with robocopy (including ACLs). This was meant to replace the original pool.

The kernel always crashes shortly after the backup starts, with:

panic: solaris assert: 0 == zfs_acl_node_read(dzp, &paclp, B_FALSE), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c, line: 1692
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80b24477 at kdb_backtrace+0x67
#1 0xffffffff80ad97e2 at vpanic+0x182
#2 0xffffffff80ad9653 at panic+0x43
#3 0xffffffff824b520a at assfail+0x1a
#4 0xffffffff82263084 at zfs_acl_ids_create+0x1b4
#5 0xffffffff822689d0 at zfs_make_xattrdir+0x40
#6 0xffffffff82268c95 at zfs_get_xattrdir+0xc5
#7 0xffffffff8227e7e6 at zfs_lookup+0x106
#8 0xffffffff822871d1 at zfs_setextattr+0x181
#9 0xffffffff8110f03f at VOP_SETEXTATTR_APV+0x8f
#10 0xffffffff80b9c404 at extattr_set_vp+0x134
#11 0xffffffff80b9c544 at sys_extattr_set_file+0xf4
#12 0xffffffff80fa26ae at amd64_syscall+0x4ce
#13 0xffffffff80f8488b at Xfast_syscall+0xfb

I have not yet pinned down exactly which files are hit when the crash happens, but the backtrace is always the same. I'm guessing this bug is not found more often because most people do not put the replicas into production, and the data seems to be copied correctly anyway. It's the metadata - the extended attributes - that gets corrupted. So this will mostly hit people who expose and use volumes in the received pool through Samba.

---

A commit references this bug:

Author: mav
Date: Sat Nov 3 03:10:06 UTC 2018
New revision: 340096
URL: https://svnweb.freebsd.org/changeset/base/340096

Log:
9952 Block size change during zfs receive drops spill block

Replication code in receive_object() falsely assumes that if a received object's block size differs from the local one, then it must be a new object, and calls dmu_object_reclaim() to wipe it out. In most cases this is not a problem, since the dnode, bonus buffer, and data block(s) are all immediately rewritten anyway, but the spill block (if used) is not. This means loss of ACLs, extended attributes, etc.

This issue can be triggered in a very simple way:
1. create a 4KB file with 10+ ACL entries;
2. take a snapshot and send it to a different dataset;
3. append another 4KB to the file;
4. take another snapshot and send it incrementally;
5. witness ACL loss on the receive side.

PR: 198457
Discussed with: mahrens
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.

Changes: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c

---

A commit references this bug:

Author: mav
Date: Wed Nov 21 18:18:57 UTC 2018
New revision: 340737
URL: https://svnweb.freebsd.org/changeset/base/340737

Log:
Revert r340096: 9952 Block size change during zfs receive drops spill block

It was reported, and I easily reproduced it, that this change triggers a panic when receiving a replication stream with embedded blocks enabled, when a short file compressing into one embedded block changes its block size. I am not sure that the problem is in this particular patch rather than merely triggered by it, but since investigation and a fix will take some time, I've decided to revert this for now.

PR: 198457, 233277

Changes: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c

---

By the way, here is how the problem was fixed in ZoL: https://github.com/zfsonlinux/zfs/commit/caf9dd209fdcfccabc2f32b3f23c5386ccfb896c
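The five-step trigger sequence from the r340096 commit log translates to roughly the following commands. This is a sketch, not a verified reproducer: the pool name tank and the dataset names src/dst are hypothetical, and numeric UIDs are used in the ACL entries so no local accounts are needed.

```sh
# 1. Create a 4KB file with 10+ NFSv4 ACL entries, enough to push the ACL
#    out of the bonus buffer and into a spill block.
zfs create tank/src
dd if=/dev/urandom of=/tank/src/f bs=4k count=1
for i in $(seq 1 11); do
    setfacl -a 0 u:100${i}:read_set::allow /tank/src/f   # placeholder UIDs
done

# 2. Take a snapshot and send it to a different dataset.
zfs snapshot tank/src@1
zfs send tank/src@1 | zfs receive tank/dst

# 3. Append another 4KB so the file's block size changes.
dd if=/dev/urandom bs=4k count=1 >> /tank/src/f

# 4. Take another snapshot and send it incrementally.
zfs snapshot tank/src@2
zfs send -i tank/src@1 tank/src@2 | zfs receive -F tank/dst

# 5. Witness the ACL loss on the receive side.
getfacl /tank/src/f /tank/dst/f
```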
---

Created attachment 210363 [details]
spill block send patch
I looked at the ZoL patch, and based on it and the discussion in that pull request, I think the attached patch should be the minimal change necessary to fix this issue.
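For anyone testing a fix like this, one way to confirm that a file actually carries a spill block is to inspect its dnode with zdb. A sketch, assuming the hypothetical dataset tank/src and file path from above; zdb's output format varies between releases, but at high verbosity the dnode flags include SPILL_BLKPTR when a spill block is in use.

```sh
# On ZFS, a file's object number is its inode number.
obj=$(stat -f %i /tank/src/f)
# Dump the dnode at high verbosity and look for the spill block pointer flag.
zdb -ddddd tank/src "$obj" | grep -i spill
```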
---

I have recently run into this issue and was wondering when a fix will be deployed. I am running 11.2-RELEASE-p4 on the source server and 12.1-RELEASE-p5 on the destination server. An rsync from the source to the destination server crashes the server with the panic mentioned in this ticket. I am able to resolve the issue by clearing and then reapplying the FACL on the source server using setfacl, deleting the directories that return "No such file or directory" on the destination side, and then running rsync again (sketched below). I do see the "No such file or directory" errors on ZFS replica file systems, but a crash does not occur.
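A minimal sketch of that recovery procedure. All paths (/tank/data, /backup/data, the broken directory) and the ACL entry are hypothetical placeholders; the setfacl entry must be replaced with the ACL the directory is actually supposed to have.

```sh
# On the source server: strip the damaged ACL, then reapply the desired one.
setfacl -b /tank/data/dir                            # clear non-trivial ACL entries
setfacl -m u:backup:read_set::allow /tank/data/dir   # hypothetical replacement entry

# On the destination server: remove directories whose lookups fail with
# "No such file or directory" (their xattr/ACL metadata is damaged).
rm -rf /backup/data/broken-dir                       # hypothetical path

# Re-run the transfer, preserving ACLs and extended attributes.
rsync -a --acls --xattrs /tank/data/ dest:/backup/data/
```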