262421 – zfs checksum errors and panic with invalid abd_t

Status:	New

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern
Version:	13.0-STABLE
Hardware:	amd64 Any

Importance:	--- Affects Only Me
Assignee:	freebsd-bugs (Nobody)

URL:
Keywords:	crash

Depends on:
Blocks:

Reported:	2022-03-08 14:34 UTC by John F. Carr
Modified:	2023-09-13 17:59 UTC
CC List:	1 user

See Also:	268909

Description John F. Carr 2022-03-08 14:34:05 UTC

During a scrub my zfs pool reported a few dozen checksum errors per
disk, about 1 per 200 GB scanned:

$ zpool status -v data
  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Mar  6 19:16:15 2022
	13.6T scanned at 942M/s, 11.7T issued at 202M/s, 18.2T total
	2.42M repaired, 64.64% done, 09:16:24 to go
config:

	NAME        STATE     READ WRITE CKSUM
	data        ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    ada0    ONLINE       0     0    18  (repairing)
	    ada1    ONLINE       0     0    17  (repairing)
	    ada2    ONLINE       0     0    12  (repairing)
	    ada3    ONLINE       0     0    23  (repairing)
	cache
	  ada4p5    ONLINE       0     0     0

errors: No known data errors

This affects all disks so it is not a single bad disk (unless the cache disk
is bad).  More likely it is data corruption in the controller, the data path
from controller to kernel ZFS code, or the ZFS data structures.
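As a rough cross-check of the "about 1 per 200 GB" figure, the sketch below sums the CKSUM column from the status output above and divides the scanned volume by it. It assumes the figure means total errors across the four disks against the 13.6T scanned, and it treats 1T as 1000 GB for simplicity; the helper name cksum_counts is made up for this illustration.

```python
import re

# Device rows copied from the zpool status output in this report.
STATUS_ROWS = """\
ada0    ONLINE       0     0    18  (repairing)
ada1    ONLINE       0     0    17  (repairing)
ada2    ONLINE       0     0    12  (repairing)
ada3    ONLINE       0     0    23  (repairing)
"""

SCANNED_TB = 13.6  # "13.6T scanned" from the scan line (assumed 1T ~ 1000 GB)


def cksum_counts(text):
    """Return {device: CKSUM count} for each ONLINE device row (hypothetical helper)."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\S+)\s+ONLINE\s+(\d+)\s+(\d+)\s+(\d+)", line)
        if m:
            counts[m.group(1)] = int(m.group(4))
    return counts


counts = cksum_counts(STATUS_ROWS)
total_errors = sum(counts.values())
gb_per_error = SCANNED_TB * 1000 / total_errors

print(counts)
print(f"{total_errors} errors, roughly one per {gb_per_error:.0f} GB scanned")
```

Under those assumptions the estimate lands a little under 200 GB per error, consistent with the description.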

After several hours the system crashed with

VERIFY3(abd->abd_size <= SPA_MAXBLOCKSIZE) failed (930062841 <= 16777216)

This indicates a corrupt abd_t structure (see abd.c line 113).
savecore did not generate a stack trace.
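To show how far out of range the reported size is, here is a minimal sketch that mirrors only the comparison printed in the panic message. It is not the OpenZFS code; the function name abd_size_is_sane is invented for illustration, and the limit is taken directly from the message (16777216, which equals 1 << 24).

```python
# Limit as printed in the panic message: 16777216 bytes (16 MiB).
SPA_MAXBLOCKSIZE = 1 << 24


def abd_size_is_sane(abd_size):
    """Same comparison as the failed VERIFY3 (hypothetical helper, not kernel code)."""
    return abd_size <= SPA_MAXBLOCKSIZE


reported = 930062841  # abd_size value from the panic message

print(abd_size_is_sane(reported))      # the check that failed in the kernel
print(reported // SPA_MAXBLOCKSIZE)    # how many times larger than the limit
```

A size dozens of times the maximum block size is not a plausible off-by-a-little error, which fits the reading that the structure itself was corrupted.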

After rebooting, the checksum error counters had reset to zero and the scrub
finished without error.  The cause was probably some transient, irreproducible
corruption in the state of my kernel that one time.

My kernel was up to date on stable/13:

FreeBSD flaviventris 13.1-PRERELEASE FreeBSD 13.1-PRERELEASE #8 stable/13-n249920-d1f3afc4a47: Mon Mar  7 10:10:37 EST 2022     root@flaviventris:/usr/obj/usr/src/amd64.amd64/sys/CALIGATA amd64

Worth noting:

1. I have dedup enabled.

2. I have encryption enabled.

3. Since the previous scrub I did a zfs send | zfs receive of close to
50% of the pool size to enable encryption. The pool was very nearly full
when I had both an encrypted and an unencrypted copy around.  Now it is
half full.

4. In /etc/make.conf I set "CPUTYPE?=amdfam10", appropriate for the
HP MicroServer hardware.

ada0 to ada3 are identical spinning disks, ada4 (cache) is SSD.

ahci0: <Marvell 88SE9230 AHCI SATA controller> port 0xe050-0xe057,0xe040-0xe043,0xe030-0xe037,0xe020-0xe023,0xe000-0xe01f mem 0xfea40000-0xfea407ff at device 0.0 on pci1
ahci0: AHCI v1.20 with 8 6Gbps ports, Port Multiplier not supported
ahci0: quirks=0x1000900<NOBSYRES,ALTSIG,IOMMU_BUSWIDE>
ada3: <ST10000VN0008-2JJ101 SC60> ACS-4 ATA SATA 3.x device
ada3: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 9537536MB (19532873728 512 byte sectors)
ada4: <Samsung SSD 860 EVO 1TB RVT03B6Q> ACS-4 ATA SATA 3.x device
ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada4: Command Queueing enabled
ada4: 953869MB (1953525168 512 byte sectors)