Bug 230962 - Kernel panic when writing extended attributes with soft updates enabled
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-fs mailing list
URL:
Keywords: panic
Depends on:
Blocks:
 
Reported: 2018-08-27 20:54 UTC by 2t8mr7kx9f
Modified: 2019-10-04 14:46 UTC

See Also:


Attachments
A screenshot of the panic message. (369.35 KB, image/jpeg)
2018-08-27 20:54 UTC, 2t8mr7kx9f
Proposed patch to fix bug. (446 bytes, patch)
2019-01-26 21:46 UTC, Kirk McKusick

Description 2t8mr7kx9f 2018-08-27 20:54:57 UTC
Created attachment 196617 [details]
A screenshot of the panic message.

This is a continuation of the following bugreport:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230732

I did some digging to try to get to the root cause. It appears to have nothing to do with GELI: I can also reproduce this on a plain UFS volume. The culprit seems to be extended attributes.

FreeBSD panics reproducibly with

"panic: softdep_deallocate_dependencies: dangling deps" 

when using rsync with the -X option ("preserve extended attributes") to a volume with soft updates enabled.

This appears to be some kind of race condition - transferring the files one-by-one works, the panic occurs only when transferring the whole directory.

Turning off soft updates (-n disable) or using rsync without -X prevents the panic from occurring.

The exact same command worked with 11.1-RELEASE.

The attached screenshot shows the panic message. Please let me know if I can be of further assistance!
Comment 1 intellisun 2018-09-19 10:23:32 UTC
I have the same problem with soft updates and ACLs enabled on a UFS volume.
FreeBSD version 11.2-STABLE, r335822. If I disable either soft updates or ACLs, the system behaves normally.
Comment 2 Kirk McKusick freebsd_committer 2018-09-19 14:43:54 UTC
Do either of you have a test case / example that will trigger the panic?
We have not been able to reproduce the problem and so have no way to understand what is causing it.
Comment 3 koro 2018-09-20 07:49:28 UTC
I too am affected by this. It was so bad that I had to roll back to 11.1.

I've been trying, to no avail, to reproduce it in a VM.

It definitely happened on a real machine with a real 8TB drive though.

I don't think transferring a whole disk image of this size just for the sake of debugging would be feasible, however if there was a way to make a sparse image that only contained filesystem metadata (but not file data or unused blocks), either with an existing tool or by coding it myself (with some guidance as to how to obtain said list of blocks), I'd be willing to share (privately) such a disk image.
Comment 4 Kirk McKusick freebsd_committer 2018-09-20 15:02:19 UTC
(In reply to koro from comment #3)
We think we have a way to reproduce it, but are not yet sure. So hold on to that disk image for now; hopefully we will not need it. Stay tuned.
Comment 5 koro 2018-12-05 04:14:14 UTC
Is there any progress on this?

With 11.1 not being supported anymore it's becoming harder and harder to stay on it.
Comment 6 Kirk McKusick freebsd_committer 2018-12-05 06:49:29 UTC
(In reply to koro from comment #5)
I do have a way to reproduce it, but have gotten side-tracked on other issues so have not had time to dig into it. I'll try to move it up on my priority list.
Comment 7 commit-hook freebsd_committer 2018-12-13 06:38:29 UTC
A commit references this bug:

Author: pho
Date: Thu Dec 13 06:37:36 UTC 2018
New revision: 342027
URL: https://svnweb.freebsd.org/changeset/base/342027

Log:
  Added a new test scenario for FFS extended attributes.

  PR:		230962

Changes:
  user/pho/stress2/misc/extattr2.sh
Comment 8 2t8mr7kx9f 2019-01-20 17:10:23 UTC
I can now confirm that this bug still exists in 12.0-RELEASE-p2.

Is there anything we can do to help? The panic currently prevents me from using any of my systems with soft updates and/or journaling enabled, which leads to quite long fsck times in the event of a crash.
Comment 9 Kirk McKusick freebsd_committer 2019-01-21 07:43:34 UTC
Work still in progress on a fix for this bug. Hope to have a test patch soon.
Comment 10 Kirk McKusick freebsd_committer 2019-01-26 21:46:52 UTC
Created attachment 201424 [details]
Proposed patch to fix bug.

Here is my proposed patch to fix this bug. Please let me know if it helps.
Comment 11 commit-hook freebsd_committer 2019-01-28 21:37:01 UTC
A commit references this bug:

Author: mckusick
Date: Mon Jan 28 21:36:46 UTC 2019
New revision: 343536
URL: https://svnweb.freebsd.org/changeset/base/343536

Log:
  This bug was introduced with the change to use softdep_bp_to_mp() in
  January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp()
  function failed to include VFIFO as one of the valid cases.

  Although fifo's do not allocate blocks in the filesystem, they will
  allocate blocks if they use extended attributes (such as ACLs). Thus,
  softdep_bp_to_mp() needs to return a non-NULL mount pointer when
  presented with a fifo vnode so that the soft updates write complete
  will properly process the soft updates structures associated with the
  extended attribute blocks. It was the failure to process these soft
  updates structures, thus leaving them hanging off the buffer, which
  led to the "panic: softdep_deallocate_dependencies: dangling deps"
  when trying to clean up the buffer after it was written.

  PR:           230962
  Reported by:  2t8mr7kx9f@protonmail.com
  Reviewed by:  kib
  Tested by:    Peter Holm
  MFC after:    1 week
  Sponsored by: Netflix

Changes:
  head/sys/ufs/ffs/ffs_softdep.c
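
The failure mode described in the log above can be illustrated with a small userland model. This is only a sketch, not the kernel source: every name below (`model_vtype`, `model_vnode`, `model_bp_to_mp`, the `include_fifo` switch) is a hypothetical stand-in, and the set of vnode types accepted before the fix is an assumption made for illustration. It shows how a lookup keyed on vnode type returns NULL for a FIFO whose extended-attribute block still carries dependencies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified vnode types; the kernel defines more. */
enum model_vtype { M_VREG, M_VDIR, M_VLNK, M_VFIFO };

struct model_mount { int id; };

struct model_vnode {
    enum model_vtype type;
    struct model_mount *mount;
};

struct model_buf {
    struct model_vnode *vp;
    int ndeps;   /* soft updates structures hanging off the buffer */
};

/*
 * Model of a buffer-to-mount lookup. With include_fifo == false it
 * mimics the behaviour described as buggy: a FIFO vnode yields NULL.
 * With include_fifo == true it mimics the behaviour after the fix.
 */
struct model_mount *
model_bp_to_mp(const struct model_buf *bp, bool include_fifo)
{
    assert(bp != NULL && bp->vp != NULL);
    switch (bp->vp->type) {
    case M_VREG:
    case M_VDIR:
    case M_VLNK:
        return bp->vp->mount;
    case M_VFIFO:
        return include_fifo ? bp->vp->mount : NULL;
    }
    return NULL;
}
```

In this model, a FIFO buffer with `ndeps > 0` gets a NULL mount pointer when `include_fifo` is false, so nothing downstream would know which filesystem's soft updates state to consult.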
Comment 12 koro 2019-02-02 03:12:55 UTC
I have built an 11.2 kernel with your fix in a VM, transferred it to the affected machine by replacing /boot/kernel, and booted it. Userspace was still on 11.1 though.

The system booted fine and all the services started, but as soon as I/O started to pick up, I immediately got the panic again. Given that your patch concerns FIFOs and I don't have any on my filesystems which also make use of POSIX ACLs, I think it might be a different bug.
Comment 13 Conrad Meyer freebsd_committer 2019-02-02 04:56:43 UTC
(In reply to commit-hook from comment #7)
What is the kern.features.ufs_extattr sysctl?  It doesn't seem to exist on CURRENT.
Comment 14 Conrad Meyer freebsd_committer 2019-02-02 05:08:11 UTC
With extattr2.sh, disabling the ufs_extattr feature check, I run into an ffs_truncate3 panic.
Comment 15 Conrad Meyer freebsd_committer 2019-02-02 05:47:28 UTC
(In reply to Conrad Meyer from comment #14)
(On current, not 11.x.)
Comment 16 Peter Holm freebsd_committer 2019-02-02 13:31:50 UTC
(In reply to Conrad Meyer from comment #14)
I cannot reproduce the original problem.
But I see the ffs_truncate3 panic:
https://people.freebsd.org/~pho/stress/log/extattr2.txt

The check for "ufs extended attribute" does not belong in the extattr2.sh test.
Comment 17 Kirk McKusick freebsd_committer 2019-02-04 21:36:45 UTC
(In reply to koro from comment #12)
Peter Holm has managed to trigger the panic even with the fifo fix, so I am continuing to look at this problem.
Comment 18 commit-hook freebsd_committer 2019-03-12 19:08:49 UTC
A commit references this bug:

Author: mckusick
Date: Tue Mar 12 19:08:42 UTC 2019
New revision: 345077
URL: https://svnweb.freebsd.org/changeset/base/345077

Log:
  This is an additional fix for bug report 230962. When using
  extended attributes, the kernel can panic with either "ffs_truncate3"
  or with "softdep_deallocate_dependencies: dangling deps".

  The problem arises because the flushbuflist() function which is
  called to clear out buffers is passed either the V_NORMAL flag to
  indicate that it should flush the buffers associated with the contents
  of the file or the V_ALT flag to indicate that it should flush the
  buffers associated with the extended attribute data. The buffers
  containing the extended attribute data are identified by having
  their BX_ALTDATA flag set in the buffer's b_xflags field. The
  BX_ALTDATA flag is set on the buffer when the extended attribute
  block is first allocated or when its contents are read in from the
  disk.

  On a busy system, a buffer may be reused for another purpose, but
  the contents of the block that it contained continue to be held
  in the main page cache. Each physical page is identified as holding
  the contents of a logical block within a specified file (identified
  by a vnode). When a request is made to read a file, the kernel first
  looks for the block in the existing buffers.  If it is not found
  there, it checks the page cache to see if it is still there. If
  it is found in the page cache, then it is remapped into a new
  buffer thus avoiding the need to read it in from the disk.

  The bug is that when a buffer request made for an extended attribute
  is fulfilled by reconstituting a buffer from the page cache rather
  than reading it in from disk, the BX_ALTDATA flag was not being
  set. Thus the flushbuflist() function would never clear it out and
  the "ffs_truncate3" panic would occur because the vnode being cleared
  still had buffers on its clean-buffer list. If the extended attribute
  was being updated, it is first read, then updated, and finally
  written. If the read is fulfilled by reconstituting the buffer
  from the page cache the BX_ALTDATA flag was not set and thus the
  dirty buffer would never be flushed by flushbuflist(). Eventually
  the buffer would be recycled. Since it was never written it would
  have an unfinished dependency which would trigger the
  "softdep_deallocate_dependencies: dangling deps" panic.

  The fix is to ensure that the BX_ALTDATA flag is set when a buffer
  has been reconstituted from the page cache.

  PR:           230962
  Reported by:  2t8mr7kx9f@protonmail.com
  Reviewed by:  kib
  Tested by:    Peter Holm
  MFC after:    1 week
  Sponsored by: Netflix

Changes:
  head/sys/kern/vfs_bio.c
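
The flag mismatch described in the log above can be modelled in a few lines of userland C. This is a sketch under stated assumptions rather than the code in `vfs_bio.c`: the `M_`-prefixed constants, `model_buf`, `model_get_ext_buf()`, and `model_flushbuflist()` are hypothetical names, the flag values are made up, and only the flag matching is modelled (no locking, block numbers, or dirty/clean lists).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values; the kernel's actual constants differ. */
#define M_V_NORMAL    0x1u  /* flush buffers holding file contents */
#define M_V_ALT       0x2u  /* flush buffers holding extended attributes */
#define M_BX_ALTDATA  0x1u  /* buffer holds extended-attribute data */

struct model_buf {
    unsigned xflags;   /* models b_xflags */
    bool on_vnode;     /* still on the vnode's buffer list */
};

/*
 * Obtain a buffer for an extended-attribute block. When the buffer is
 * reconstituted from the page cache and fixed == false, the flag is
 * left unset, which models the bug. fixed == true models the fix.
 */
struct model_buf
model_get_ext_buf(bool from_page_cache, bool fixed)
{
    struct model_buf bp = { 0u, true };
    if (!from_page_cache || fixed)
        bp.xflags |= M_BX_ALTDATA;
    return bp;
}

/* Flush one class of buffers; return how many stay on the vnode. */
size_t
model_flushbuflist(struct model_buf *bufs, size_t n, unsigned flags)
{
    assert(bufs != NULL || n == 0);
    size_t left = 0;
    for (size_t i = 0; i < n; i++) {
        bool alt = (bufs[i].xflags & M_BX_ALTDATA) != 0;
        bool match = ((flags & M_V_ALT) && alt) ||
            ((flags & M_V_NORMAL) && !alt);
        if (bufs[i].on_vnode && match)
            bufs[i].on_vnode = false;   /* flushed */
        if (bufs[i].on_vnode)
            left++;
    }
    return left;
}
```

Calling `model_flushbuflist()` with `M_V_ALT` on a buffer from `model_get_ext_buf(true, false)` leaves that buffer behind, which corresponds to the leftover buffer the log associates with the "ffs_truncate3" panic. With `fixed` set to true the same flush removes it.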
Comment 19 Kirk McKusick freebsd_committer 2019-03-12 19:15:43 UTC
(In reply to koro from comment #12)
My latest fix is to the head of the tree (13.0), but should easily apply to 11-stable and 12-stable systems. Please check it out to see if it solves your problem.
Comment 20 koro 2019-03-13 01:54:18 UTC
(In reply to Kirk McKusick from comment #19)
I have applied your patch to the 11.2 kernel the same way as the other time and did a full rebuild to make sure. Sadly, same issue. As soon as boot is finished and I/O picks up, panic.
Comment 21 commit-hook freebsd_committer 2019-03-20 23:11:38 UTC
A commit references this bug:

Author: mckusick
Date: Wed Mar 20 23:11:05 UTC 2019
New revision: 345352
URL: https://svnweb.freebsd.org/changeset/base/345352

Log:
  This is an additional and hopefully final fix for bug report 230962.
  This bug was introduced with the change to use softdep_bp_to_mp()
  in January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp()
  function failed to include VSOCK as one of the valid cases.

  Although local-domain sockets do not allocate blocks in the filesystem,
  they will allocate blocks if they use extended attributes (such as
  ACLs). Thus, softdep_bp_to_mp() needs to return a non-NULL mount
  pointer when presented with a socket vnode so that the soft updates
  write complete will properly process the soft updates structures
  associated with the extended attribute blocks. It was the failure
  to process these soft updates structures, thus leaving them hanging
  off the buffer, which led to the "panic: softdep_deallocate_dependencies:
  dangling deps" when trying to clean up the buffer after it was written.

  PR:           230962
  Reported by:  2t8mr7kx9f@protonmail.com
  Reviewed by:  kib
  Tested by:    Peter Holm
  MFC after:    1 week
  Sponsored by: Netflix

Changes:
  head/sys/ufs/ffs/ffs_softdep.c
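
To make the consequence of a NULL mount pointer concrete, here is a minimal userland sketch of the sequence the log describes: write completion skips dependency processing, and the later cleanup finds structures still attached. The names `model_write_complete()` and `model_would_panic_on_cleanup()` and the `ndeps` counter are hypothetical simplifications; the kernel tracks dependencies as linked structures, not a count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_mount { int id; };

struct model_buf {
    int ndeps;   /* soft updates structures still attached */
};

/*
 * Write completion: dependencies are processed only when a mount
 * pointer was found for the buffer.
 */
void
model_write_complete(struct model_buf *bp, const struct model_mount *mp)
{
    assert(bp != NULL);
    if (mp == NULL)
        return;          /* models the bug: dependencies stay attached */
    bp->ndeps = 0;       /* models processing every dependency */
}

/* Cleanup check: true models the "dangling deps" condition. */
bool
model_would_panic_on_cleanup(const struct model_buf *bp)
{
    assert(bp != NULL);
    return bp->ndeps > 0;
}
```

Passing NULL for `mp`, as would happen for a socket vnode before this change, leaves `ndeps` nonzero and the cleanup check reports the dangling state. Passing a valid mount clears it.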
Comment 22 Kirk McKusick freebsd_committer 2019-03-29 01:25:11 UTC
The three commits above (r343536, r345077, and r345352) collectively solve this problem for all testers.
These changes have now been committed to 11-STABLE and 12-STABLE.
Comment 23 koro 2019-05-07 04:19:26 UTC
Thanks for your work.

Could this be backported as an errata notice for 11.2 and 12.0?

It does have the potential to crash the machine after all.