Bug 248245 - arm64 kernel panic under heavy UFS writes (getblk -> ffs_balloc_ufs2)
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-07-24 15:05 UTC by Gordon Bergling
Modified: 2020-07-24 19:40 UTC (History)
CC List: 2 users

See Also:


Attachments
stacktrace arm64 UFS panic (252.47 KB, image/jpeg)
2020-07-24 15:11 UTC, Gordon Bergling
second, somewhat different stacktrace on arm64 with heavy UFS writes (309.99 KB, image/jpeg)
2020-07-24 16:14 UTC, Gordon Bergling

Description Gordon Bergling freebsd_committer freebsd_triage 2020-07-24 15:05:32 UTC
On a recent -CURRENT I get a kernel panic on arm64 (RPi4b) a few seconds after starting the installworld target. The stack trace below is still incomplete.

getblk
ffs_balloc_ufs2
ffs_write
VOP_WRITE_APV
vn_write
vn_io_fault_doio
vn_io_fault1
vn_io_fault
dofilewrite
kern_writev
sys_write
do_el0_sync
handle_el0_sync

I was unable to extract a crash dump because there is no swap partition and netdump isn't supported by the genet0 driver.

The GENERIC kernel from the 2nd of July works without problems, and a custom kernel with some extra TCP options from the 14th of July also worked flawlessly. So I assume the bug was introduced somewhere between the 14th and the 24th of July.
Comment 1 Gordon Bergling freebsd_committer freebsd_triage 2020-07-24 15:11:01 UTC
Created attachment 216745
stacktrace arm64 UFS panic
Comment 2 Gordon Bergling freebsd_committer freebsd_triage 2020-07-24 16:10:19 UTC
After some fiddling with the ddb I was able to get the panic message.

getnewbuf_empty: Locked buf 0xffff0000406b1390 on free queue.
Comment 3 Gordon Bergling freebsd_committer freebsd_triage 2020-07-24 16:13:53 UTC
After some fiddling with the ddb I was able to get the panic message.

getnewbuf_empty: Locked buf 0xffff0000406b1390 on free queue.
Comment 4 Gordon Bergling freebsd_committer freebsd_triage 2020-07-24 16:14:41 UTC
Created attachment 216747
second, somewhat different stacktrace on arm64 with heavy UFS writes
Comment 5 Mark Johnston freebsd_committer freebsd_triage 2020-07-24 16:54:38 UTC
hmm, I just hit this as well while running GELI tests on an arm64 platform.  It should be a recent regression.
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2020-07-24 16:56:25 UTC
(In reply to Mark Johnston from comment #5)
This happened on an NFS root, no UFS involved.  So presumably this is a bug in the buffer cache or lockmgr.
Comment 7 Mark Johnston freebsd_committer freebsd_triage 2020-07-24 17:18:03 UTC
I think the problem is in r363415.  It converted some lockmgr code to use atomic_fcmpset instead of atomic_cmpset.  The former can fail spuriously on LL/SC platforms, so a tryxlock operation can fail even when the buf is unlocked.  lockmgr should take care to retry if fcmpset fails but returns the "expected" value.
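A minimal userland sketch of the retry pattern described above. It assumes C11 atomics as a stand-in for the kernel primitives: atomic_compare_exchange_weak plays the role of atomic_fcmpset, since it may also fail spuriously and writes the observed value back into the "expected" variable. The names try_xlock and DEMO_UNLOCKED are made up for illustration and are not the actual lockmgr code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative "nobody holds the lock" value (an assumption, not kernel code). */
#define DEMO_UNLOCKED ((uintptr_t)0)

/*
 * Try to take the lock exclusively without sleeping.
 * Returns true on success and false only if the lock is really held.
 */
static bool
try_xlock(_Atomic uintptr_t *lk, uintptr_t tid)
{
	uintptr_t x = DEMO_UNLOCKED;

	for (;;) {
		/* The weak form may fail spuriously, as fcmpset can on LL/SC CPUs. */
		if (atomic_compare_exchange_weak_explicit(lk, &x, tid,
		    memory_order_acquire, memory_order_relaxed))
			return (true);
		/* On failure, x now holds the value observed in *lk. */
		if (x == DEMO_UNLOCKED)
			continue;	/* spurious failure: still free, retry */
		return (false);		/* somebody else owns the lock */
	}
}
```

Without the `continue`, a spurious failure would be reported as "lock is held" even though the observed value says it is free, which matches the symptom of a try-lock failing on an unlocked buf.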
Comment 8 commit-hook freebsd_committer freebsd_triage 2020-07-24 17:28:31 UTC
A commit references this bug:

Author: mjg
Date: Fri Jul 24 17:28:24 UTC 2020
New revision: 363480
URL: https://svnweb.freebsd.org/changeset/base/363480

Log:
  lockmgr: add missing 'continue' to account for spuriously failed fcmpset

  PR:		248245
  Reported by:	gbe
  Noted by:	markj
  Fixes:	r363415 ("lockmgr: add adaptive spinning")

Changes:
  head/sys/kern/kern_lock.c
Comment 9 Mateusz Guzik freebsd_committer freebsd_triage 2020-07-24 17:29:38 UTC
Please try on r363480.

I think it's high time we get a debug version of the routine for amd64 which fails at random. I'll hack it up later.
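One way such a randomly failing debug routine could look, as a rough userland sketch only. It assumes C11 atomics and rand(); debug_fcmpset and counter_add are hypothetical names and this is not the routine that was later written for the kernel. The wrapper reports failure on roughly one call in four without modifying the target, and a correctly written caller keeps looping, so the injected failures are harmless.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical debug wrapper: a strong compare-and-set, except that about
 * one call in four fails on purpose. As with a real failure, *expected is
 * refreshed with the current value of *dst.
 */
static bool
debug_fcmpset(_Atomic uintptr_t *dst, uintptr_t *expected, uintptr_t desired)
{
	if ((rand() & 3) == 0) {
		*expected = atomic_load(dst);	/* injected spurious failure */
		return (false);
	}
	return (atomic_compare_exchange_strong(dst, expected, desired));
}

/* Correct caller: retries until the update lands. */
static void
counter_add(_Atomic uintptr_t *ctr, uintptr_t n)
{
	uintptr_t old = atomic_load(ctr);

	/* Each failed attempt refreshes old, so old + n is recomputed. */
	while (!debug_fcmpset(ctr, &old, old + n))
		continue;
}
```

A caller that treats a single failure as final would misbehave quickly under this wrapper, which is the point: the bug would show up on amd64 instead of only on LL/SC platforms.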
Comment 10 Gordon Bergling freebsd_committer freebsd_triage 2020-07-24 17:51:49 UTC
(In reply to Mark Johnston from comment #6)

This could indeed be NFS related. I have the following build setup for the RPi4:

A writable NFS share, /tank/nfs_public, is exported from a FreeBSD 12-STABLE VM and mounted on the RPi4. This share has the following subdirectories that are symlinked on the RPi4 for /usr/src.

/tank/nfs_public/tiny/src
/tank/nfs_public/tiny/obj

The obj directory is changed via MAKEOBJDIRPREFIX to the NFS share.

This is mostly done to save disk space and writes on the RPi4 SD card.
Comment 11 Gordon Bergling freebsd_committer freebsd_triage 2020-07-24 17:52:55 UTC
(In reply to commit-hook from comment #8)

I have a build running and will report back on whether this revision solves the issue.
Comment 12 Gordon Bergling freebsd_committer freebsd_triage 2020-07-24 19:37:01 UTC
(In reply to commit-hook from comment #8)

After your last commit I was able to successfully build and install a more than recent kernel and world via NFS on the RPi4b.

Thank You! :)