| Summary: | zfs acl lost after zfs send-receive. Kernel panic |
|---|---|
| Product: | Base System |
| Reporter: | Yuriy Tabolin <danmer> |
| Component: | kern |
| Assignee: | Alexander Motin <mav> |
| Status: | In Progress |
| Severity: | Affects Only Me |
| CC: | acksist, avg, enricocavalli+freebsdbugzilla, fxguidet, jbreitman, mav, nhuff, o.bende, trasz |
| Priority: | --- |
| Keywords: | crash |
| Version: | 10.1-RELEASE |
| Hardware: | Any |
| OS: | Any |
| Attachments: | spill block send patch (attachment 210363) |
Description
Yuriy Tabolin
2015-03-09 16:14:40 UTC

I experienced the same problem in 2014 and tried again today with the same result. Is there any news regarding this? I cannot believe there are only three people in the world having this issue. Links to other bug tickets:

https://bugs.freenas.org/issues/5225
https://forums.freenas.org/index.php?threads/zfs-replication-corrupts-entire-zfs-volume-warning.17566/#post-126243

So far this has only been posted in FreeNAS bug trackers, but it is the same issue here.

---

Probably, this is a similar/same issue: http://markmail.org/message/4nl4dzkmuo7gidlu

---

So, from my reading, it panics because for some reason it cannot gracefully handle an error returned by zfs_acl_node_read(). Why? No idea. Note, however, that this code is not specific to FreeBSD - it might be a good idea to consult the Nexenta or ZoL folks.

---

Thanks for your comment. I think the main problem is not the handling of the error, but that there is an error at all! I can (probably) understand why the system should panic because of such a file-system issue - but why does this issue exist in the first place? In summary: why are the ACLs broken after replication? And another question: why do you think we should contact ZoL? I always thought ZoL and FreeBSD ZFS were two different systems. In fact, I did not choose ZoL because it is not really considered production-ready - but FreeBSD ZFS is.

---

(In reply to o.bende from comment #4)

No idea. I'm guessing it's some kind of corruption, but one that for some reason only affects ACLs on send/receive. My advice about talking to the Nexenta/ZoL folks is that because the code in question is not specific to FreeBSD, it might also affect them. They may have already found and fixed the bug. Otherwise... This is pure speculation, but the user/group identifiers used natively by ZFS are somewhat complicated (ZFS supports various identifier schemes, e.g. Windows SIDs), and there might be some FUID-related problem? Again - a shot in the dark.

---

Hi there, any update on this bug? We have the same issue on FreeBSD 10.1: no kernel panic, but some ACLs are lost. To get access to an affected file again, we need to do a mv and then reconfigure the ACL on the file (a sketch of this workaround follows below).
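The workaround described above amounts to something like the following. This is only a sketch: the path /tank/share/file.txt and the ACL entry are hypothetical placeholders, and the NFSv4 ACL entry should be replaced with whatever the file is actually supposed to carry.

```sh
# Move the affected file aside and back so the ACL can be reapplied cleanly,
# then set the desired NFSv4 ACL entry by hand and verify it.
mv /tank/share/file.txt /tank/share/file.txt.tmp
mv /tank/share/file.txt.tmp /tank/share/file.txt
setfacl -m u:backup:read_set::allow /tank/share/file.txt   # hypothetical entry
getfacl /tank/share/file.txt                               # confirm the ACL is back
```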
---

Hi. I have found this same issue to still be present in 11.0-RELEASE. A replicated pool with several TB of data, several volumes, and some 50 snapshots was sent to a new pool on another system. All the files were verified on both pools in the most recent snapshot; md5 hashes generated with cfv matched. This comparison was run as root, and access to the files caused no problem. Then the new pool was put into production, supplying a Samba volume for Windows backups with robocopy (including ACLs). This was meant to replace the original pool.

The kernel always crashes shortly after the backup starts, with:

panic: solaris assert: 0 == zfs_acl_node_read(dzp, &paclp, B_FALSE), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c, line: 1692
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff80b24477 at kdb_backtrace+0x67
#1 0xffffffff80ad97e2 at vpanic+0x182
#2 0xffffffff80ad9653 at panic+0x43
#3 0xffffffff824b520a at assfail+0x1a
#4 0xffffffff82263084 at zfs_acl_ids_create+0x1b4
#5 0xffffffff822689d0 at zfs_make_xattrdir+0x40
#6 0xffffffff82268c95 at zfs_get_xattrdir+0xc5
#7 0xffffffff8227e7e6 at zfs_lookup+0x106
#8 0xffffffff822871d1 at zfs_setextattr+0x181
#9 0xffffffff8110f03f at VOP_SETEXTATTR_APV+0x8f
#10 0xffffffff80b9c404 at extattr_set_vp+0x134
#11 0xffffffff80b9c544 at sys_extattr_set_file+0xf4
#12 0xffffffff80fa26ae at amd64_syscall+0x4ce
#13 0xffffffff80f8488b at Xfast_syscall+0xfb

I have not yet pinned down exactly which files are hit when the crash happens, but the backtrace is always the same. I'm guessing this bug is not found more often because most people do not put the replicas into production, and the data seems to be copied correctly anyway. It's the metadata - the extended attributes - that gets corrupted. So this will mostly hit people who expose and use volumes in the received pool through Samba.

---

A commit references this bug:

Author: mav
Date: Sat Nov 3 03:10:06 UTC 2018
New revision: 340096
URL: https://svnweb.freebsd.org/changeset/base/340096

Log:
9952 Block size change during zfs receive drops spill block

Replication code in receive_object() falsely assumes that if a received object's block size differs from the local one, then it must be a new object, and calls dmu_object_reclaim() to wipe it out. In most cases this is not a problem, since the dnode, bonus buffer, and data block(s) are all immediately rewritten anyway, but the spill block (if used) is not. This means loss of ACLs, extended attributes, etc.

This issue can be triggered in a very simple way:
1. create a 4KB file with 10+ ACL entries;
2. take a snapshot and send it to a different dataset;
3. append another 4KB to the file;
4. take another snapshot and send it incrementally;
5. witness ACL loss on the receive side.

PR: 198457
Discussed with: mahrens
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.

Changes: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c

---

A commit references this bug:

Author: mav
Date: Wed Nov 21 18:18:57 UTC 2018
New revision: 340737
URL: https://svnweb.freebsd.org/changeset/base/340737

Log:
Revert r340096: 9952 Block size change during zfs receive drops spill block

It was reported, and I easily reproduced it, that this change triggers a panic when receiving a replication stream with embedded blocks enabled, when a short file compressing into one embedded block changes its block size. I am not sure that the problem is in this particular patch rather than merely triggered by it, but since investigation and a fix will take some time, I've decided to revert this for now.

PR: 198457, 233277

Changes: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c

---

By the way, here is how the problem was fixed in ZoL: https://github.com/zfsonlinux/zfs/commit/caf9dd209fdcfccabc2f32b3f23c5386ccfb896c
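The five-step trigger sequence from the r340096 commit log translates to roughly the following commands. This is a sketch, not a verified reproducer: the pool name tank and the dataset names src/dst are hypothetical, and numeric UIDs are used in the ACL entries so no local accounts are needed.

```sh
# 1. Create a 4KB file with 10+ NFSv4 ACL entries, enough to push the ACL
#    out of the bonus buffer and into a spill block.
zfs create tank/src
dd if=/dev/urandom of=/tank/src/f bs=4k count=1
for i in $(seq 1 11); do
    setfacl -a 0 u:100${i}:read_set::allow /tank/src/f   # placeholder UIDs
done

# 2. Take a snapshot and send it to a different dataset.
zfs snapshot tank/src@1
zfs send tank/src@1 | zfs receive tank/dst

# 3. Append another 4KB so the file's block size changes.
dd if=/dev/urandom bs=4k count=1 >> /tank/src/f

# 4. Take another snapshot and send it incrementally.
zfs snapshot tank/src@2
zfs send -i tank/src@1 tank/src@2 | zfs receive -F tank/dst

# 5. Witness the ACL loss on the receive side.
getfacl /tank/src/f /tank/dst/f
```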
---

Created attachment 210363 [details]
spill block send patch
I looked at the ZoL patch, and based on it and the discussion in that pull request, I think the attached patch should be the minimal change necessary to fix this issue.
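For anyone testing a fix like this, one way to confirm that a file actually carries a spill block is to inspect its dnode with zdb. A sketch, assuming the hypothetical dataset tank/src and file path from above; zdb's output format varies between releases, but at high verbosity the dnode flags include SPILL_BLKPTR when a spill block is in use.

```sh
# On ZFS, a file's object number is its inode number.
obj=$(stat -f %i /tank/src/f)
# Dump the dnode at high verbosity and look for the spill block pointer flag.
zdb -ddddd tank/src "$obj" | grep -i spill
```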
---

I have recently run into this issue and was wondering when a fix will be deployed. I am running 11.2-RELEASE-p4 on the source server and 12.1-RELEASE-p5 on the destination server. An rsync from the source to the destination server crashes the server with the panic mentioned in this ticket. I am able to resolve the issue by clearing and then reapplying the FACL on the source server using setfacl, deleting the directories that return "No such file or directory" on the destination side, and then running rsync again (sketched below). I do see the "No such file or directory" errors on ZFS replica file systems, but a crash does not occur.
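A minimal sketch of that recovery procedure. All paths (/tank/data, /backup/data, the broken directory) and the ACL entry are hypothetical placeholders; the setfacl entry must be replaced with the ACL the directory is actually supposed to have.

```sh
# On the source server: strip the damaged ACL, then reapply the desired one.
setfacl -b /tank/data/dir                            # clear non-trivial ACL entries
setfacl -m u:backup:read_set::allow /tank/data/dir   # hypothetical replacement entry

# On the destination server: remove directories whose lookups fail with
# "No such file or directory" (their xattr/ACL metadata is damaged).
rm -rf /backup/data/broken-dir                       # hypothetical path

# Re-run the transfer, preserving ACLs and extended attributes.
rsync -a --acls --xattrs /tank/data/ dest:/backup/data/
```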