|Summary:||Kernel panic when writing extended attributes with soft updates enabled|
|Component:||kern||Assignee:||freebsd-fs (Nobody) <fs>|
|Severity:||Affects Some People||CC:||cem, chris, dewayne, drb, intellisun, joelh, koro, markj, mckusick, pho|
Description 2t8mr7kx9f 2018-08-27 20:54:57 UTC
Created attachment 196617 A screenshot of the panic message. This is a continuation of the following bug report: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230732 I did some digging to try to get to the root cause. It appears to have nothing to do with GELI; I can also reproduce this on a plain UFS volume. The culprit seems to be extended attributes. FreeBSD panics reproducibly with "panic: softdep_deallocate_dependencies: dangling deps" when using rsync with the -X option ("preserve extended attributes") to write to a volume with soft updates enabled. This appears to be some kind of race condition: transferring the files one by one works, and the panic occurs only when transferring the whole directory. Turning off soft updates (tunefs -n disable) or using rsync without -X prevents the panic from occurring. The exact same command worked with 11.1-RELEASE. The attached screenshot shows the panic message. Please let me know if I can be of further assistance!
Comment 1 intellisun 2018-09-19 10:23:32 UTC
I have the same problem with soft updates and ACLs enabled on a UFS volume. FreeBSD version 11.2-STABLE, r335822. If I disable either soft updates or ACLs, the system behaves normally.
Comment 2 Kirk McKusick 2018-09-19 14:43:54 UTC
Do either of you have a test case / example that will trigger the panic? We have not been able to reproduce the problem and so have no way to understand what is causing it.
Comment 3 koro 2018-09-20 07:49:28 UTC
I too am affected by this; it was so bad I had to roll back to 11.1. I've been trying, to no avail, to reproduce it in a VM. It definitely happened on a real machine with a real 8TB drive though. I don't think transferring a whole disk image of this size just for the sake of debugging would be feasible. However, if there were a way to make a sparse image that only contained filesystem metadata (but not file data or unused blocks), either with an existing tool or by coding it myself (with some guidance as to how to obtain said list of blocks), I'd be willing to share (privately) such a disk image.
Comment 4 Kirk McKusick 2018-09-20 15:02:19 UTC
(In reply to koro from comment #3) We think we have a way to reproduce it, but are not yet sure. So hold on to that disk image for now; hopefully we will not need it. Stay tuned.
Comment 5 koro 2018-12-05 04:14:14 UTC
Is there any progress on this? With 11.1 not being supported anymore it's becoming harder and harder to stay on it.
Comment 6 Kirk McKusick 2018-12-05 06:49:29 UTC
(In reply to koro from comment #5) I do have a way to reproduce it, but have gotten side-tracked on other issues so have not had time to dig into it. I'll try to move it up on my priority list.
Comment 7 commit-hook 2018-12-13 06:38:29 UTC
A commit references this bug: Author: pho Date: Thu Dec 13 06:37:36 UTC 2018 New revision: 342027 URL: https://svnweb.freebsd.org/changeset/base/342027 Log: Added a new test scenario for FFS extended attributes. PR: 230962 Changes: user/pho/stress2/misc/extattr2.sh
Comment 8 2t8mr7kx9f 2019-01-20 17:10:23 UTC
I can now confirm that this bug still exists in 12.0-RELEASE-p2. Is there anything we can do to help? The panic currently prevents me from using any of my systems with softupdates and / or journaling enabled, which leads to quite long fsck times in the event of a crash.
Comment 9 Kirk McKusick 2019-01-21 07:43:34 UTC
Work still in progress on a fix for this bug. Hope to have a test patch soon.
Comment 10 Kirk McKusick 2019-01-26 21:46:52 UTC
Created attachment 201424 Proposed patch to fix bug. Here is my proposed patch to fix this bug. Please let me know if it helps.
Comment 11 commit-hook 2019-01-28 21:37:01 UTC
A commit references this bug: Author: mckusick Date: Mon Jan 28 21:36:46 UTC 2019 New revision: 343536 URL: https://svnweb.freebsd.org/changeset/base/343536 Log: This bug was introduced with the change to use softdep_bp_to_mp() in January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp() function failed to include VFIFO as one of the valid cases. Although FIFOs do not allocate blocks in the filesystem, they will allocate blocks if they use extended attributes (such as ACLs). Thus, softdep_bp_to_mp() needs to return a non-NULL mount pointer when presented with a FIFO vnode so that the soft updates write completion will properly process the soft updates structures associated with the extended attribute blocks. It was the failure to process these soft updates structures, thus leaving them hanging off the buffer, which led to the "panic: softdep_deallocate_dependencies: dangling deps" when trying to clean up the buffer after it was written. PR: 230962 Reported by: firstname.lastname@example.org Reviewed by: kib Tested by: Peter Holm MFC after: 1 week Sponsored by: Netflix Changes: head/sys/ufs/ffs/ffs_softdep.c
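Illustration (not part of the commit): the failure mode described in the log above can be sketched as a small user-space model. This is only a sketch under stated assumptions. Every `model_*` / `MODEL_*` name is hypothetical, the structures are stand-ins rather than the kernel's definitions, and the set of vnode types accepted in the "before" version is assumed for illustration. It shows only the shape of the bug: a type switch that returns NULL for a FIFO vnode, so the caller skips the soft updates processing for that buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; NOT the real definitions. */
enum model_vtype {
    MODEL_VREG,
    MODEL_VDIR,
    MODEL_VLNK,
    MODEL_VFIFO
};

struct model_mount {
    int id;
};

struct model_vnode {
    enum model_vtype type;
    struct model_mount *mount;
};

/*
 * Model of the lookup before the fix: FIFO vnodes are not listed, so the
 * function returns NULL and the caller has no mount to process the
 * buffer's dependencies against. The accepted types are an assumption.
 */
static struct model_mount *
model_bp_to_mp_before(const struct model_vnode *vp)
{
    assert(vp != NULL);
    switch (vp->type) {
    case MODEL_VREG:
    case MODEL_VDIR:
    case MODEL_VLNK:
        return vp->mount;
    default:
        return NULL;
    }
}

/*
 * Model of the lookup after the fix: a FIFO can own extended attribute
 * blocks, so it must also yield a non-NULL mount pointer.
 */
static struct model_mount *
model_bp_to_mp_after(const struct model_vnode *vp)
{
    assert(vp != NULL);
    switch (vp->type) {
    case MODEL_VREG:
    case MODEL_VDIR:
    case MODEL_VLNK:
    case MODEL_VFIFO:
        return vp->mount;
    default:
        return NULL;
    }
}
```

In this model, calling `model_bp_to_mp_before()` on a vnode of type `MODEL_VFIFO` yields NULL, while `model_bp_to_mp_after()` yields the vnode's mount. The real function lives in head/sys/ufs/ffs/ffs_softdep.c and should be consulted for the actual logic.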
Comment 12 koro 2019-02-02 03:12:55 UTC
I have built an 11.2 kernel with your fix in a VM, transferred it to the affected machine by replacing /boot/kernel and booted it. Userspace was still on 11.1 though. The system booted fine and all the services started, but as soon as I/O started to pick up, I immediately got the panic again. Given that your patch concerns FIFOs and I don't have any on my filesystems which also make use of POSIX ACLs, I think it might be a different bug.
Comment 13 Conrad Meyer 2019-02-02 04:56:43 UTC
(In reply to commit-hook from comment #7) What is the kern.features.ufs_extattr sysctl? It doesn't seem to exist on CURRENT.
Comment 14 Conrad Meyer 2019-02-02 05:08:11 UTC
With extattr2.sh, disabling the ufs_extattr feature check, I run into an ffs_truncate3 panic.
Comment 15 Conrad Meyer 2019-02-02 05:47:28 UTC
(In reply to Conrad Meyer from comment #14) (On current, not 11.x.)
Comment 16 Peter Holm 2019-02-02 13:31:50 UTC
(In reply to Conrad Meyer from comment #14) I can not repeat the original problem. But I see the ffs_truncate3 panic: https://people.freebsd.org/~pho/stress/log/extattr2.txt The check for "ufs extended attribute" does not belong in the extattr2.sh test.
Comment 17 Kirk McKusick 2019-02-04 21:36:45 UTC
(In reply to koro from comment #12) Peter Holm has managed to trigger the panic even with the fifo fix, so I am continuing to look at this problem.
Comment 18 commit-hook 2019-03-12 19:08:49 UTC
A commit references this bug: Author: mckusick Date: Tue Mar 12 19:08:42 UTC 2019 New revision: 345077 URL: https://svnweb.freebsd.org/changeset/base/345077 Log: This is an additional fix for bug report 230962. When using extended attributes, the kernel can panic with either "ffs_truncate3" or with "softdep_deallocate_dependencies: dangling deps". The problem arises because the flushbuflist() function, which is called to clear out buffers, is passed either the V_NORMAL flag to indicate that it should flush the buffers associated with the contents of the file or the V_ALT flag to indicate that it should flush the buffers associated with the extended attribute data. The buffers containing the extended attribute data are identified by having their BX_ALTDATA flag set in the buffer's b_xflags field. The BX_ALTDATA flag is set on the buffer when the extended attribute block is first allocated or when its contents are read in from the disk. On a busy system, a buffer may be reused for another purpose, but the contents of the block that it contained continue to be held in the main page cache. Each physical page is identified as holding the contents of a logical block within a specified file (identified by a vnode). When a request is made to read a file, the kernel first looks for the block in the existing buffers. If it is not found there, it checks the page cache to see if it is still there. If it is found in the page cache, then it is remapped into a new buffer, thus avoiding the need to read it in from the disk. The bug is that when a buffer request for an extended attribute is fulfilled by reconstituting a buffer from the page cache rather than reading it in from disk, the BX_ALTDATA flag was not being set. Thus the flushbuflist() function would never clear it out and the "ffs_truncate3" panic would occur because the vnode being cleared still had buffers on its clean-buffer list. 
When an extended attribute is updated, it is first read, then updated, and finally written. If the read was fulfilled by reconstituting the buffer from the page cache, the BX_ALTDATA flag was not set and thus the dirty buffer would never be flushed by flushbuflist(). Eventually the buffer would be recycled. Since it was never written, it would have an unfinished dependency, which would trigger the "softdep_deallocate_dependencies: dangling deps" panic. The fix is to ensure that the BX_ALTDATA flag is set when a buffer has been reconstituted from the page cache. PR: 230962 Reported by: email@example.com Reviewed by: kib Tested by: Peter Holm MFC after: 1 week Sponsored by: Netflix Changes: head/sys/kern/vfs_bio.c
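Illustration (not part of the commit): the flag problem in the log above can be modeled in a few lines of user-space C. This is a sketch under stated assumptions, not the code in head/sys/kern/vfs_bio.c. All `model_*` / `MODEL_*` names, the flag value, and the `fixed` switch are hypothetical. The model only shows why a buffer rebuilt from the page cache without the alternate-data flag is skipped by a flush that selects buffers by that flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag value and flush selectors; not the kernel's. */
#define MODEL_BX_ALTDATA 0x1u

enum model_flush {
    MODEL_V_NORMAL,   /* flush buffers holding file contents */
    MODEL_V_ALT       /* flush buffers holding extended attribute data */
};

struct model_buf {
    unsigned int xflags;   /* models b_xflags */
    bool on_list;          /* still attached to the vnode's buffer list */
};

/*
 * Obtain a buffer for an extended attribute block. If the buffer is
 * reconstituted from the page cache and the fix is absent, the
 * alternate-data flag is left clear, which is the bug being modeled.
 */
static void
model_get_ext_buf(struct model_buf *bp, bool from_page_cache, bool fixed)
{
    assert(bp != NULL);
    bp->xflags = 0;
    bp->on_list = true;
    if (!from_page_cache || fixed)
        bp->xflags |= MODEL_BX_ALTDATA;
}

/*
 * Detach every buffer of the requested class and return how many
 * buffers remain attached afterwards.
 */
static size_t
model_flushbuflist(struct model_buf *bufs, size_t n, enum model_flush which)
{
    size_t remaining = 0;

    assert(bufs != NULL);
    for (size_t i = 0; i < n; i++) {
        bool is_alt;

        if (!bufs[i].on_list)
            continue;
        is_alt = (bufs[i].xflags & MODEL_BX_ALTDATA) != 0;
        if ((which == MODEL_V_ALT) == is_alt)
            bufs[i].on_list = false;   /* matches the class: flushed */
        else
            remaining++;               /* wrong class: left behind */
    }
    return remaining;
}
```

In this model, an extended attribute buffer obtained with `from_page_cache` true and `fixed` false survives a `MODEL_V_ALT` flush, which corresponds to the leftover buffer behind the "ffs_truncate3" and "dangling deps" panics. With `fixed` true the same flush leaves nothing attached.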
Comment 19 Kirk McKusick 2019-03-12 19:15:43 UTC
(In reply to koro from comment #12) My latest fix is to the head of the tree (13.0), but should easily apply to 11-stable and 12-stable systems. Please check it out to see if it solves your problem.
Comment 20 koro 2019-03-13 01:54:18 UTC
(In reply to Kirk McKusick from comment #19) I have applied your patch to the 11.2 kernel the same way as the other time and did a full rebuild to make sure. Sadly, same issue. As soon as boot is finished and I/O picks up, panic.
Comment 21 commit-hook 2019-03-20 23:11:38 UTC
A commit references this bug: Author: mckusick Date: Wed Mar 20 23:11:05 UTC 2019 New revision: 345352 URL: https://svnweb.freebsd.org/changeset/base/345352 Log: This is an additional and hopefully final fix for bug report 230962. This bug was introduced with the change to use softdep_bp_to_mp() in January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp() function failed to include VSOCK as one of the valid cases. Although local-domain sockets do not allocate blocks in the filesystem, they will allocate blocks if they use extended attributes (such as ACLs). Thus, softdep_bp_to_mp() needs to return a non-NULL mount pointer when presented with a socket vnode so that the soft updates write completion will properly process the soft updates structures associated with the extended attribute blocks. It was the failure to process these soft updates structures, thus leaving them hanging off the buffer, which led to the "panic: softdep_deallocate_dependencies: dangling deps" when trying to clean up the buffer after it was written. PR: 230962 Reported by: firstname.lastname@example.org Reviewed by: kib Tested by: Peter Holm MFC after: 1 week Sponsored by: Netflix Changes: head/sys/ufs/ffs/ffs_softdep.c
Comment 22 Kirk McKusick 2019-03-29 01:25:11 UTC
The three commits referenced above (comments 11, 18, and 21) collectively solve this problem for all testers. These changes have now been committed to 11-STABLE and 12-STABLE.
Comment 23 koro 2019-05-07 04:19:26 UTC
Thanks for your work. Could this be backported as an errata for 11.2 and 12.0? It does have the potential to crash the machine after all.