Bug 220693

Summary: head -r320570 & -r320760 (e.g.): ufs snapshot creation broken & leads to fsck -B related SSD-trim "freeing free block" panics; more
Product: Base System
Reporter: Mark Millard <marklmi26-fbsd>
Component: kern
Assignee: freebsd-fs (Nobody) <fs>
Status: Closed FIXED
Severity: Affects Some People
CC: kib, pho
Priority: ---
Keywords: regression
Version: CURRENT
Hardware: Any
OS: Any

Description Mark Millard 2017-07-12 22:12:25 UTC
See also the exchange of list submittals associated
with:

https://lists.freebsd.org/pipermail/freebsd-current/2017-July/066505.html
and:
https://lists.freebsd.org/pipermail/freebsd-current/2017-July/066508.html

I freely quote material from these without attribution here.


Basic context material . . .

As I remember it, the reporting folks happened to be
using non-debug/non-INVARIANTS kernel builds. Multiple
TARGET_ARCHs were involved: 32-bit and 64-bit,
little-endian and big-endian.

The basic create-snapshot test that fails:

After a short pause with disk activity, the same sorts of errors are
logged when using "mksnap_ffs /.snap2" where .snap2 did not previously
exist.

The type of messages was (e.g.):

g_vfs_done():ada0s3a[READ(offset=6050375794688, length=32768)]error = 5
Jul  7 00:10:24 toshi kernel

Note the huge offset; this is true of the messages in general.
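
To put that offset in perspective, here is a small sh sketch (not part
of the original reports) that converts the byte offset from the example
message into whole GiB. It assumes a shell with 64-bit integer
arithmetic.

```sh
#!/bin/sh
# Byte offset copied from the example g_vfs_done() message.
offset=6050375794688
gib=$((1024 * 1024 * 1024))
# Integer division: whole GiB contained in the offset.
offset_gib=$((offset / gib))
echo "offset is about ${offset_gib} GiB into the device"
```

That works out to roughly 5.5 TiB. The size of ada0s3a is not stated
in this report, so the sketch only shows the scale of the offset, not
a comparison against the actual partition.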

Also, the messages come from the kernel and its nmount-related
snapshot creation activity, not from the user-space program.

The original list-notice was about dump (and its snapshot
creation) but the issue is not specific to dump.


fsck -B related panic material. . .

My original context for this: 32-bit powerpc.

<Prior failed multi-user boot from system problem
leaves root (only) file system not marked clean
so fsck -B will actually do something below>

boot -s (so: single user mode)
# The next 3 lines are the content of a generic, manually-run script.
mount -u /
mount -a -t ufs (but there is no other file system)
swapon -a       (there is a swap partition)
#
fsck -B

That "fsck -B" produced the same kinds of lines
reported by Michael Butler. They appear as fsck
makes the snapshot that its background processing
uses.

After the g_vfs_done lines came text like (typed
in from an example camera picture):

** //.snap/fsck_snapshot
** Last Mount on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
Reclaimed: 0 directories, 1 files, 22680 fragments
780914 files, 4797127 used, 19552199 free (443479 frags, 3288590 blocks, 1.8% fragmentation)

***** FILE SYSTEM MARKED CLEAN *****

But waiting a while always leads to a panic
that looks like the following example.
(Note: context is an SSD with trim enabled.)
(Typed in from a camera picture.)

panic: ffs_blkfree_cq: freeing free block
cpuid = 2 (varies, of course)
time = (varies)
KDB: stack backtrace
(stack addresses can vary: just an example here)
0xd23b17e0: at kdb_backtrace+0x5c
0xd23b1850: at vpanic+0x1e8
0xd23b18c0: at panic+0x54
0xd23b1910: at ffs_blkfree_cq+0x278
0xd23b1980: at ffs_blkfree_trim_task+0x60
0xd23b19b0: at taskqueue_run_locked+0x10
0xd23b1a10: at taskqueue_thread_loop+0x174
0xd23b1a50: at fork_exit+0xf4
0xd23b1a80: at fork_trampoline+0xc
KDB: enter: panic
[ thread pid 0 tid 1000082 ]
Stopped at kdb_enter+0x70: addi r0,r0,0x0


I've tried this on a powerpc64 and it works
the same, complete with the "freeing free
block" issue.

I've also had the problem with a normal multi-user
boot that initiated a fsck -B automatically in a
context where the file system on the SSD had not
been marked clean.

To avoid this and fix such file systems I've been
booting with "boot -s" and using "fsck -F" from
the single-user command prompt.


Unfortunately, two other problems with major
consequences in my context limit the svn range that
I can test. The affected revision ranges are:

-r319722 through -r320651 (fixed by -r320652)
(actually this is why I had originally used
"boot -s"  in what I report above: I could get
to a shell prompt that way instead of crashing
before any login prompt; the crashes left
the file system in need of repair)

-r320509 through -r320561 (fixed by -r320570)

So I was using -r320570 to avoid one of the
two problems, now with a trial patch for what
was later fixed in -r320652.
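
As an illustration of how the two ranges above constrain testing, the
following sh sketch reports which of them cover a given head revision.
The function name unrelated_breakage and the labels A and B are
hypothetical, introduced only for this example.

```sh
#!/bin/sh
# Report which of the two unrelated known-bad ranges contain a revision.
unrelated_breakage() {
    rev=$1
    hits=""
    # Range A: -r319722 through -r320651 (fixed by -r320652).
    if [ "$rev" -ge 319722 ] && [ "$rev" -le 320651 ]; then
        hits="$hits A"
    fi
    # Range B: -r320509 through -r320561 (fixed by -r320570).
    if [ "$rev" -ge 320509 ] && [ "$rev" -le 320561 ]; then
        hits="$hits B"
    fi
    # Print the matching labels, or "none" when neither range applies.
    echo "r$rev:${hits:- none}"
}

unrelated_breakage 320570   # still inside range A, hence the trial patch
unrelated_breakage 320760   # past both fixes
```

Under these ranges, -r320570 is clear of range B but still inside
range A, while -r320760 is past both fixes.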

I do not know if the problem was present back
before -r319722 or before -r320509.

Comment 1 Peter Holm 2017-07-14 12:08:03 UTC
Just running mksnap_ffs(8) on a pristine file system triggers:
g_vfs_done():md5a[READ(offset=34368126976, length=28672)]error = 5
This is on:
FreeBSD t1.osted.lan 12.0-CURRENT FreeBSD 12.0-CURRENT #1 r320982: Fri Jul 14 13:16:44 CEST 2017     pho@t1.osted.lan:/usr/src/sys/amd64/compile/PHO  amd64

Partial script:
mdconfig -a -t swap -s 2g -u $mdstart || exit 1
bsdlabel -w md$mdstart auto
newfs -U md${mdstart}$part > /dev/null
mount /dev/md${mdstart}$part $mntpoint

while ! mksnap_ffs $mntpoint $mntpoint/.snap/stress2; do :; done

tail -5 /var/log/messages | grep "g_vfs_done():md5a" && s=1
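
Because the md device here is created with "-s 2g", the reported read
offset can be compared against the device size directly. The following
sh sketch is illustrative only and assumes 64-bit shell arithmetic; the
helper names gvfs_offset and offset_beyond_device are hypothetical and
are not part of the test script above.

```sh
#!/bin/sh
# Extract the byte offset from a g_vfs_done() message line.
gvfs_offset() {
    printf '%s\n' "$1" | sed -n 's/.*offset=\([0-9][0-9]*\),.*/\1/p'
}

# Succeed if the offset in the message ($1) is at or beyond the end of
# a device of the given size in bytes ($2).
offset_beyond_device() {
    _off=$(gvfs_offset "$1")
    [ -n "$_off" ] && [ "$_off" -ge "$2" ]
}

line='g_vfs_done():md5a[READ(offset=34368126976, length=28672)]error = 5'
size=$((2 * 1024 * 1024 * 1024))   # 2 GiB, matching "mdconfig -s 2g"

if offset_beyond_device "$line" "$size"; then
    echo "offset $(gvfs_offset "$line") is beyond the ${size}-byte device"
fi
```

The offset in the message is a little over 32 GiB, so the read is well
past the end of the 2 GiB memory disk.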

Comment 2 commit-hook 2017-07-16 07:12:18 UTC
A commit references this bug:

Author: kib
Date: Sun Jul 16 07:11:30 UTC 2017
New revision: 321040
URL: https://svnweb.freebsd.org/changeset/base/321040

Log:
  A followup to r320453, correct removal of the blocks from UFS snapshots.

  Tested by:	pho
  PR:    220693
  Sponsored by:	The FreeBSD Foundation

Changes:
  head/sys/ufs/ffs/ffs_alloc.c