Bug 236504 - [gmirror][UFS2] "Superblock check-hash failed" messages seen
Summary: [gmirror][UFS2] "Superblock check-hash failed" messages seen
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords: stress2
Depends on:
Blocks:
 
Reported: 2019-03-13 09:47 UTC by Peter Holm
Modified: 2019-09-01 18:21 UTC (History)
4 users (show)

See Also:
koobs: mfc-stable12-
koobs: mfc-stable11-


Attachments
Proposed patch to fix bug. (629 bytes, patch)
2019-03-15 21:05 UTC, Kirk McKusick
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Peter Holm freebsd_committer freebsd_triage 2019-03-13 09:47:25 UTC
While inserting and removing disks in a gmirror, the "Superblock check-hash failed" messaage is seen. fsck_ffs(8) does not find any issues.

Details @ https://people.freebsd.org/~pho/stress/log/graid1_7-2.txt
Comment 1 Conrad Meyer freebsd_committer freebsd_triage 2019-03-13 16:40:07 UTC
I suspect the taste method, e.g.,

Tracing command geom pid 13 tid 100429 td 0xfffff80c47554000 (CPU 25)
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe00e3bd4f40
vpanic() at vpanic+0x1ba/frame 0xfffffe00e3bd4f90
panic() at panic+0x43/frame 0xfffffe00e3bd4ff0
readsuper() at readsuper+0x1ec/frame 0xfffffe00e3bd5070
ffs_sbget() at ffs_sbget+0xf6/frame 0xfffffe00e3bd5140
g_label_ufs_taste_common() at g_label_ufs_taste_common+0x79/frame 0xfffffe00e3bd5180
g_label_taste() at g_label_taste+0x332/frame 0xfffffe00e3bd5290
g_new_provider_event() at g_new_provider_event+0x1ba/frame 0xfffffe00e3bd52b0
g_run_events() at g_run_events+0x197/frame 0xfffffe00e3bd52f0
fork_exit() at fork_exit+0x84/frame 0xfffffe00e3bd5330
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e3bd5330
Comment 2 Conrad Meyer freebsd_committer freebsd_triage 2019-03-13 16:41:20 UTC
Doh, I see you added a panic that clearly illustrates the stack.

Yep, clearly readsuper / ffs_sbget should not print anything for a taste operation!  :-)
Comment 3 Kirk McKusick freebsd_committer freebsd_triage 2019-03-15 21:05:00 UTC
Created attachment 202888 [details]
Proposed patch to fix bug.

Good analysis. I believe this patch will fix this bug.
Comment 4 Peter Holm freebsd_committer freebsd_triage 2019-03-16 07:43:10 UTC
This patch does not seem to fix the problem seen:

20190316 08:29:28 all: graid1_7.sh
Mar 16 08:35:18 t2 kernel: Superblock check-hash failed: recorded check-hash 0xe4e5b25e != computed check-hash 0x1aaf9179 (Ignored)
FAIL graid1_7.sh exit code 5
20190316 08:35:52 all: graid1_7.sh
^C
$ svn diff -x -p /usr/src/sys
Index: /usr/src/sys/geom/label/g_label_ufs.c
===================================================================
--- /usr/src/sys/geom/label/g_label_ufs.c       (revision 345222)
+++ /usr/src/sys/geom/label/g_label_ufs.c       (working copy)
@@ -76,8 +76,8 @@ g_label_ufs_taste_common(struct g_consumer *cp, ch
        label[0] = '\0';
 
        fs = NULL;
-       if (SBLOCKSIZE % pp->sectorsize != 0 ||
-           ffs_sbget(cp, &fs, STDSB, M_GEOM, g_use_g_read_data) != 0) {
+       if (SBLOCKSIZE % pp->sectorsize != 0 || ffs_sbget(cp, &fs,
+           STDSB_NOHASHFAIL, M_GEOM, g_use_g_read_data) != 0) {
                KASSERT(fs == NULL,
                    ("g_label_ufs_taste_common: non-NULL fs %p\n", fs));
                return;
$ uname -a
FreeBSD t2.osted.lan 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r345222M: Sat Mar 16 08:26:16 CET 2019     pho@t2.osted.lan:/usr/src/sys/amd64/compile/PHO  amd64
$
Comment 5 Peter Holm freebsd_committer freebsd_triage 2019-03-16 12:04:32 UTC
To be fair: The message now has an added "(Ignored)".
Comment 6 Kirk McKusick freebsd_committer freebsd_triage 2019-03-20 23:46:54 UTC
(In reply to Peter Holm from comment #5)
I would define warning but proceeding as correct operation. If you do not agree with that, what do you think it should do?
Comment 7 Conrad Meyer freebsd_committer freebsd_triage 2019-03-21 00:11:39 UTC
Taste should probably be able to use a "silent" ffs_sbget mode to avoid printing anything.  This is like the GEOM equivalent of every newbus driver printing a message on failed probe.
Comment 8 Peter Holm freebsd_committer freebsd_triage 2019-03-21 06:17:34 UTC
I was expecting not to see any message at all.
Comment 9 Mark Johnston freebsd_committer freebsd_triage 2019-03-21 06:32:54 UTC
I don't quite understand why the test is generating these errors in the first place.  If the checksum is failing, it means that we saw the UFS2 magic number in the superblock, so we're probably not looking at random garbage.  Is it possible the taste is racing with synchronization of a new gmirror disk, and seeing a partial write of a copied superblock?
Comment 10 Kirk McKusick freebsd_committer freebsd_triage 2019-03-21 06:51:20 UTC
(In reply to Mark Johnston from comment #9)
It is an out-of-date superblock from a filesystem that did not do superblock check hashes (while the current one does do check-hashes). So the correct thing to do is just reject it silently rather than reporting why it was rejected.
Comment 11 Peter Holm freebsd_committer freebsd_triage 2019-03-21 11:43:01 UTC
The test uses a memory disk.
This is basically what the test does:

for i in `jot 8`; do
        while [ -f $cont ]; do
                for u in md${u1}p2 md${u1}p3; do
                        gmirror forget test
                        gmirror remove test $u
                        gmirror insert test $u
                done 2>/dev/null
        done &
done
Comment 12 Peter Holm freebsd_committer freebsd_triage 2019-03-22 07:14:17 UTC
I have uploaded a self contained version of the test scenario to:
https://people.freebsd.org/~pho/graid1_7s2.sh
I have run this on three different hosts, all with the same result.
Comment 13 commit-hook freebsd_committer freebsd_triage 2019-08-06 18:10:53 UTC
A commit references this bug:

Author: mckusick
Date: Tue Aug  6 18:10:34 UTC 2019
New revision: 350651
URL: https://svnweb.freebsd.org/changeset/base/350651

Log:
  A race condition existed between the time a UFS/FFS superblock check
  hash was computed and the time that the superblock was copied to a
  buffer to be written to disk. The result was a failed superblock
  check hash the next time that the superblock was read.

  The fix is to compute the check hash after the superblock has been
  copied to a buffer to be written.

  PR:           236504
  Reported by:  Peter Holm
  Tested by:    Peter Holm
  Sponsored by: Netflix

Changes:
  head/sys/ufs/ffs/ffs_vfsops.c
Comment 14 Kirk McKusick freebsd_committer freebsd_triage 2019-08-06 18:16:26 UTC
The -r350651 commit resolves the check-hash failures.

No need to MFC as superblock check hashes are not in 12-stable.