Bug 177966 - [zfs] resilver completes but subsequent scrub reports errors
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 9.1-STABLE
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-04-18 19:30 UTC by Nathaniel Filardo
Modified: 2022-06-11 21:14 UTC (History)
1 user

See Also:


Description Nathaniel Filardo 2013-04-18 19:30:00 UTC
I took one disk out of a raidz2 pool and ran the system for a while in a degraded (but still redundant) configuration.  I then replaced the missing disk (with zpool replace rather than zpool online) and let the resilver run to completion.  It succeeded and reported no errors.  Having had bad experiences in the past (http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016627.html), I ran a scrub, which reported 11 checksum errors on the replaced drive, very clearly during the part of the scrub that walks refcount > 1 blocks.  I am currently running another scrub pass, which I hypothesize will succeed without error.
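
For reference, the sequence described above looks roughly like the following. This is a hedged sketch, not a transcript of the actual commands; ada9 stands in for the removed member and is an assumption, and the exact zpool replace form depends on how the disk was detached:

```shell
# Hedged sketch of the reported sequence (ada9 is illustrative).
zpool offline tank0 ada9    # pool now runs degraded (raidz2: still redundant)
# ... run the system degraded for a while ...
zpool replace tank0 ada9    # rebuild onto the disk rather than 'zpool online'
zpool status tank0          # wait for the resilver to complete; no errors reported
zpool scrub tank0           # follow-up scrub: 11 CKSUM errors on the new disk
```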

The pool, under normal circumstances, looks like this:

        NAME        STATE     READ WRITE CKSUM
        tank0       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada9    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
        cache
          ada1a     ONLINE       0     0     0
          ada0b     ONLINE       0     0     0

The pool configuration is mostly default, except that it uses 4K sectors (ashift=12) and the following properties are set:

tank0  checksum              sha256                 received
tank0  compression           gzip                   received
tank0  atime                 off                    received
tank0  dedup                 sha256,verify          received
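
These show as "received" because they arrived via a zfs send/recv stream; an equivalent explicit configuration (a sketch, not the commands actually used) would be:

```shell
# Equivalent explicit settings for the properties listed above.
zfs set checksum=sha256 tank0
zfs set compression=gzip tank0
zfs set atime=off tank0
zfs set dedup=sha256,verify tank0   # verify: byte-compare blocks on checksum match
```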

The deduplication table is pretty sizable:

dedup: DDT entries 16754758, size 981 on disk, 158 in core
bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    13.0M   1.33T   1.24T   1.27T    13.0M   1.33T   1.24T   1.27T
     2    2.35M    198G    165G    172G    5.18M    430G    361G    378G
     4     495K   25.4G   13.4G   16.1G    2.24M    114G   61.0G   73.8G
     8     121K   1.60G    689M   1.48G    1.28M   16.3G   6.78G   15.5G
    16    22.1K    250M    116M    269M     469K   5.04G   2.31G   5.48G
    32    4.11K    157M    138M    159M     195K   8.45G   7.65G   8.59G
    64    1.53K   9.76M   3.99M   14.8M     124K    897M    375M   1.22G
   128      254   6.49M   2.89M   4.60M    41.8K    949M    427M    717M
   256       58    582K    100K    519K    19.6K    181M   34.3M    175M
   512       27    540K     26K    232K    19.0K    482M   20.7M    167M
    1K       12      6K      6K   95.9K    17.9K   8.94M   8.94M    143M
    2K        8    648K   13.5K   71.9K    19.9K   1.42G   34.4M    181M
    4K        3    256K    129K    144K    17.6K   1.38G    764M    851M
    8K       12    644K   8.50K   95.9K     149K   8.97G    110M   1.16G
 Total    16.0M   1.55T   1.42T   1.45T    22.7M   1.90T   1.67T   1.74T
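
A histogram like the one above can be regenerated at any time; assuming standard ZFS tooling, either of these should print it (output details vary by version):

```shell
# Both commands report DDT statistics for the pool.
zpool status -D tank0   # pool status plus the dedup table histogram
zdb -DD tank0           # more detailed per-class DDT statistics
```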

Full DSL scans (scrub, resilver) take about 48 hours each.  The first half of that is spent in a painfully slow scan (currently moving at about 20 IOPS and 1 MB/s) as it works its way through the DDT entries with refcount > 1, after which it ramps up to 35 MB/s as it traverses refcount=1 blocks in disk order.

In any case, the scrub after the resilver was clearly in the first phase of its scan and reported all 11 checksum errors at once (more or less).  No checksum errors were found in the second (refcount=1) phase.

If I had to guess, this is possibly a bug in the code that handles DDT entries changing class while a scrub is in progress.

How-To-Repeat: It appears sufficient to perform I/O against a resilvering pool with deduplication enabled.  I will attempt to repeat the experiment as soon as this scrub pass finishes successfully; if it instead finds errors, I will run scrub again.
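
A hedged repro sketch of the above (device and file names are illustrative assumptions, not from the report):

```shell
# Resilver a dedup pool while writing to it, then scrub and check for errors.
zpool replace tank0 ada9                       # start the resilver (ada9 is illustrative)
dd if=/dev/urandom of=/tank0/io.tmp bs=1m count=1024   # concurrent I/O during resilver
zpool status tank0                             # wait for the resilver to finish
zpool scrub tank0                              # then see whether scrub finds CKSUM errors
zpool status -v tank0
```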
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2013-04-18 20:13:57 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).
Comment 2 Eitan Adler freebsd_committer freebsd_triage 2017-12-31 07:59:38 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped
Comment 3 dgilbert 2022-06-11 20:07:12 UTC
I want to add to this.  In the following, while preparing to add some disks to an array, I dislodged a disk.  I put it back and "onlined" it ... after which it "resilvered" and completed without reporting any errors.

This is still happening.  Here is the scrub (in-progress) output.  The 42 blocks being repaired on gpt/v1-f6a were flagged by the scrub and thus missed by the resilver.

  pool: vr1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Jun 10 21:37:50 2022
	45.7T scanned at 724M/s, 40.0T issued at 633M/s, 84.3T total
	900K repaired, 47.44% done, 20:23:11 to go
config:

	NAME             STATE     READ WRITE CKSUM
	vr1              ONLINE       0     0     0
	  raidz2-0       ONLINE       0     0     0
	    gpt/v1-d0    ONLINE       0     0     0
	    gpt/v1-d1    ONLINE       0     0     0
	    gpt/v1-d2    ONLINE       0     0     0
	    gpt/v1-d3    ONLINE       0     0     0
	    gpt/v1-d4a   ONLINE       0     0     0
	    gpt/v1-d5    ONLINE       0     0     0
	    gpt/v1-d6a   ONLINE       0     0     0
	    gpt/v1-d7a   ONLINE       0     0     0
	  raidz2-2       ONLINE       0     0     0
	    gpt/v1-e0c   ONLINE       0     0     0
	    gpt/v1-e1c   ONLINE       0     0     0
	    gpt/v1-e2b   ONLINE       0     0     0
	    gpt/v1-e3b   ONLINE       0     0     0
	    gpt/v1-e4b   ONLINE       0     0     0
	    gpt/v1-e5a   ONLINE       0     0     0
	    gpt/v1-e6a   ONLINE       0     0     0
	    gpt/v1-e7d   ONLINE       0     0     0
	  raidz2-3       ONLINE       0     0     0
	    gpt/v1-f0    ONLINE       0     0     0
	    gpt/v1-f1    ONLINE       0     0     0
	    gpt/v1-f2    ONLINE       0     0     0
	    gpt/v1-f3    ONLINE       0     0     0
	    gpt/v1-f4    ONLINE       0     0     0
	    gpt/v1-f5    ONLINE       0     0     0
	    gpt/v1-f6a   ONLINE      42     0     0  (repairing)
	    gpt/v1-f7b   ONLINE       0     0     0
	  raidz2-5       ONLINE       0     0     0
	    gpt/v1-g0    ONLINE       0     0     0
	    gpt/v1-g1    ONLINE       0     0     0
	    gpt/v1-g2    ONLINE       0     0     0
	    gpt/v1-g3    ONLINE       0     0     0
	    gpt/v1-g4    ONLINE       0     0     0
	    gpt/v1-g5    ONLINE       0     0     0
	    gpt/v1-g6    ONLINE       0     0     0
	    gpt/v1-g7    ONLINE       0     0     0
	logs	
	  gpt/vr1log     ONLINE       0     0     0
	cache
	  gpt/vr1cache   ONLINE       0     0     0
	  gpt/vr1cache2  ONLINE       0     0     0

errors: No known data errors
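
Once the scrub completes and the device is deemed healthy, the error counters can be reset as the "action:" text in the status output suggests (a sketch; /dev/gpt/new-disk is hypothetical):

```shell
# Reset the error counters on the affected vdev after a clean scrub:
zpool clear vr1 gpt/v1-f6a
# Or, if the disk itself is suspect, replace it instead:
# zpool replace vr1 gpt/v1-f6a /dev/gpt/new-disk
```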
Comment 4 dgilbert 2022-06-11 21:14:17 UTC
Just adding a thought to this.  Maybe it's obvious... but in my case the drive was pulled without any shutdown at the ZFS level.  The error count is typically between a few and, at most, about 100 blocks.  I'm _thinking_ that this is just a matter of one transaction that partially made it onto the disk.

Since I don't have "great" labelling and I can't usually get the drive LEDs to work, I often have to pull a few drives to find the right one... and since I don't know offhand which is which, I can't shut them down in ZFS first.  Errors on a subsequent scrub are reasonably likely... upward of 50%, but not certain.