Bug 241476 - zfs checksum errors with FreeBSD 11.3 and mps driver
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-fs (Nobody)
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2019-10-24 16:53 UTC by dentarg
Modified: 2021-01-07 09:56 UTC
CC List: 2 users

See Also:


Attachments
Output from "mpsutil show all" from Jon (1.81 KB, text/plain), 2019-10-24 16:53 UTC, dentarg
Output from "zdb -C db" from Jon (12.53 KB, text/plain), 2019-10-24 16:54 UTC, dentarg
Output from "mpsutil show all" from Samwell (1.65 KB, text/plain), 2019-10-24 16:55 UTC, dentarg
Output from "zdb -C db" from Samwell (12.69 KB, text/plain), 2019-10-24 16:56 UTC, dentarg
zpool status from 2019-09-20 occurrence on Jon (9.96 KB, text/plain), 2019-10-24 16:57 UTC, dentarg
zpool status from 2019-10-12 occurrence on Jon (9.84 KB, text/plain), 2019-10-24 16:57 UTC, dentarg
zpool status from 2019-10-09 occurrence on Samwell (9.88 KB, text/plain), 2019-10-24 16:57 UTC, dentarg
zpool status from 2019-10-24 occurrence on Samwell (6.63 KB, text/plain), 2019-10-24 16:58 UTC, dentarg
messages from Jon (3.23 KB, text/plain), 2021-01-07 09:55 UTC, Twingly
messages from Samwell (3.57 KB, text/plain), 2021-01-07 09:56 UTC, Twingly

Description dentarg 2019-10-24 16:53:56 UTC
Created attachment 208573 [details]
Output from "mpsutil show all" from Jon

We have two servers, Jon and Samwell.

The servers are SuperMicro SuperStorage Server 2027R-E1R24L with the X9DR7-LN4F motherboard, 2x Xeon E5-2620v2 and 256GB ECC RAM. The motherboard has the LSI SAS 2308 controller. Both servers have 22x 960GB Samsung SSD (SV843, SM843T, SM863, SM863a).

Each server has a zpool ("db") with 11 mirrors made up of those SSDs. They boot from another zpool. The servers use the mps driver.
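For reference, the layout of the "db" pool corresponds to a sketch like the following (device names taken from the zpool status output further down; the diskid-labelled vdevs and most of the 11 mirrors are abbreviated here):

$ zpool create db \
      mirror da0 da4 \
      mirror da1 da5 \
      mirror da2 da3 \
      ...
      mirror da6 da7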

The servers are configured to scrub regularly:

$ cat /etc/periodic.conf
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold="7" # days between scrubs
daily_status_zfs_enable="YES"

We upgraded both servers from FreeBSD 11.2 to 11.3 on 2019-09-04.

Every ten minutes a cron job checks that the pools are healthy; if not, we get an alert. Since upgrading to FreeBSD 11.3, both servers have alerted twice:

2019-09-20 04:10 Jon
2019-10-12 08:20 Jon

2019-10-09 05:10 Samwell
2019-10-24 04:10 Samwell

about the error https://illumos.org/msg/ZFS-8000-8A (see the attached zpool status outputs for details).
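
The health check itself is roughly equivalent to the following (a minimal sketch, not our exact script; the alert delivery is a placeholder):

*/10 * * * * root /usr/local/sbin/zpool-health-check

#!/bin/sh
# "zpool status -x" prints "all pools are healthy" when nothing is wrong;
# any other output means a pool is degraded or accumulating errors.
status=$(/sbin/zpool status -x)
if [ "$status" != "all pools are healthy" ]; then
    echo "$status" | mail -s "zpool alert on $(hostname)" root
fi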

So far we have only seen errors reported for what I understand to be metadata, for example:

errors: Permanent errors have been detected in the following files:

        <0x16dc>:<0x498>


Issuing another "zpool scrub" makes "zpool status" report "errors: No known data errors", and a subsequent "zpool clear" resets the checksum error counts.
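
Concretely, the sequence after each alert is:

$ sudo zpool scrub db        # a second scrub finds nothing
$ sudo zpool status -v db    # now reports "errors: No known data errors"
$ sudo zpool clear db        # resets the CKSUM counters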

We see no errors in /var/log/messages, only "ZFS: vdev state changed" lines when a scrub starts.

I've seen and read bug #239801, but opted to open a new bug as we are using a different driver.
Comment 1 dentarg 2019-10-24 16:54:37 UTC
Created attachment 208574 [details]
Output from "zdb -C db" from Jon
Comment 2 dentarg 2019-10-24 16:55:21 UTC
Created attachment 208575 [details]
Output from "mpsutil show all" from Samwell
Comment 3 dentarg 2019-10-24 16:56:00 UTC
Created attachment 208576 [details]
Output from "zdb -C db" from Samwell
Comment 4 dentarg 2019-10-24 16:57:08 UTC
Created attachment 208577 [details]
zpool status from 2019-09-20 occurrence on Jon
Comment 5 dentarg 2019-10-24 16:57:31 UTC
Created attachment 208578 [details]
zpool status from 2019-10-12 occurrence on Jon
Comment 6 dentarg 2019-10-24 16:57:50 UTC
Created attachment 208579 [details]
zpool status from 2019-10-09 occurrence on Samwell
Comment 7 dentarg 2019-10-24 16:58:02 UTC
Created attachment 208580 [details]
zpool status from 2019-10-24 occurrence on Samwell
Comment 8 dentarg 2019-12-02 15:07:12 UTC
Not sure how much it helps, but we have seen the problem two more times on one of the servers in the past month:

2019-11-19 05:10 Jon
2019-11-27 08:50 Jon
Comment 9 dentarg 2020-01-20 08:48:26 UTC
We updated both servers to 11.3-RELEASE-p5 last week. 

During the weekend (2020-01-18 15:20), server Jon experienced "Permanent errors" again:

jon% sudo zpool status -v db
Password:
  pool: db
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0 days 14:40:11 with 0 errors on Sat Jan 18 17:41:38 2020
config:

    NAME                                              STATE     READ WRITE CKSUM
    db                                                ONLINE       0     0    16
      mirror-0                                        ONLINE       0     0    12
        da0                                           ONLINE       0     0    12
        da4                                           ONLINE       0     0    12
      mirror-1                                        ONLINE       0     0    12
        da1                                           ONLINE       0     0    12
        da5                                           ONLINE       0     0    12
      mirror-2                                        ONLINE       0     0     4
        da2                                           ONLINE       0     0     4
        da3                                           ONLINE       0     0     4
      mirror-3                                        ONLINE       0     0     2
        da8                                           ONLINE       0     0     2
        da10                                          ONLINE       0     0     2
      mirror-4                                        ONLINE       0     0     4
        da9                                           ONLINE       0     0     4
        da11                                          ONLINE       0     0     4
      mirror-5                                        ONLINE       0     0     2
        diskid/DISK-S1E4NYAF500425%20%20%20%20%20%20  ONLINE       0     0     2
        diskid/DISK-S1E4NYAF500433%20%20%20%20%20%20  ONLINE       0     0     2
      mirror-6                                        ONLINE       0     0     4
        da14                                          ONLINE       0     0     4
        da15                                          ONLINE       0     0     4
      mirror-7                                        ONLINE       0     0     6
        diskid/DISK-S2HTNX0H613319%20%20%20%20%20%20  ONLINE       0     0     6
        diskid/DISK-S2HTNX0H613311%20%20%20%20%20%20  ONLINE       0     0     6
      mirror-8                                        ONLINE       0     0     4
        diskid/DISK-S186NEADC05812%20%20%20%20%20%20  ONLINE       0     0     4
        diskid/DISK-S186NEADC05804%20%20%20%20%20%20  ONLINE       0     0     4
      mirror-9                                        ONLINE       0     0     4
        diskid/DISK-S3F3NX0J602173%20%20%20%20%20%20  ONLINE       0     0     4
        diskid/DISK-S3F3NX0J603555%20%20%20%20%20%20  ONLINE       0     0     4
      mirror-10                                       ONLINE       0     0    10
        da6                                           ONLINE       0     0    10
        da7                                           ONLINE       0     0    10

errors: Permanent errors have been detected in the following files:

        <0xa44>:<0xfe3>
Comment 10 dentarg 2020-06-17 11:23:48 UTC
We updated server Jon to 12.1-RELEASE-p5 on 2020-06-04. (Server Samwell was updated later, on 2020-06-10, and is running 12.1-RELEASE-p6.)

The problem is still present. Today, around 08:20 (~5h20m after the weekly scrub started), checksum errors appeared again on server Jon:


$ sudo zpool status -v db
  pool: db
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Wed Jun 17 03:01:26 2020
	6.33T scanned at 180M/s, 6.22T issued at 177M/s, 7.68T total
	0 repaired, 81.06% done, 0 days 02:23:51 to go
config:

	NAME                                              STATE     READ WRITE CKSUM
	db                                                ONLINE       0     0    22
	  mirror-0                                        ONLINE       0     0    12
	    da0                                           ONLINE       0     0    12
	    da4                                           ONLINE       0     0    12
	  mirror-1                                        ONLINE       0     0    14
	    da1                                           ONLINE       0     0    14
	    da5                                           ONLINE       0     0    14
	  mirror-2                                        ONLINE       0     0    12
	    da2                                           ONLINE       0     0    12
	    da3                                           ONLINE       0     0    12
	  mirror-3                                        ONLINE       0     0     4
	    da8                                           ONLINE       0     0     4
	    da10                                          ONLINE       0     0     4
	  mirror-4                                        ONLINE       0     0     8
	    da9                                           ONLINE       0     0     8
	    da11                                          ONLINE       0     0     8
	  mirror-5                                        ONLINE       0     0     8
	    diskid/DISK-S1E4NYAF500425%20%20%20%20%20%20  ONLINE       0     0     8
	    diskid/DISK-S1E4NYAF500433%20%20%20%20%20%20  ONLINE       0     0     8
	  mirror-6                                        ONLINE       0     0     8
	    da14                                          ONLINE       0     0     8
	    da15                                          ONLINE       0     0     8
	  mirror-7                                        ONLINE       0     0     4
	    diskid/DISK-S2HTNX0H613319%20%20%20%20%20%20  ONLINE       0     0     4
	    diskid/DISK-S2HTNX0H613311%20%20%20%20%20%20  ONLINE       0     0     4
	  mirror-8                                        ONLINE       0     0     4
	    diskid/DISK-S47PNA0MB06473%20%20%20%20%20%20  ONLINE       0     0     4
	    diskid/DISK-S47PNA0MB06796%20%20%20%20%20%20  ONLINE       0     0     4
	  mirror-9                                        ONLINE       0     0    10
	    diskid/DISK-S3F3NX0J602173%20%20%20%20%20%20  ONLINE       0     0    10
	    diskid/DISK-S3F3NX0J603555%20%20%20%20%20%20  ONLINE       0     0    10
	  mirror-10                                       ONLINE       0     0     4
	    da6                                           ONLINE       0     0     4
	    da7                                           ONLINE       0     0     4

errors: Permanent errors have been detected in the following files:

        <0x1173>:<0x1ae>
Comment 11 Twingly 2021-01-07 09:54:59 UTC
We have now updated to FreeBSD 12.2 and the problem is still present. Curiously enough, this last time we got errors on both machines at the same time.

I noticed that under 12.2 there are related messages written to /var/log/messages; attaching the files from both machines.

I might add that we have hourly snapshots set up, using zfs-auto-snapshot, which we keep on a rolling 24-hour basis. Perhaps something "weird" happens when a snapshot is removed during a scrub?
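
One way to check that hypothesis would be to correlate the snapshot destroys with the times the checksum errors appear, e.g.:

$ zpool history db | grep destroy | tail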
Comment 12 Twingly 2021-01-07 09:55:46 UTC
Created attachment 221345 [details]
messages from Jon
Comment 13 Twingly 2021-01-07 09:56:05 UTC
Created attachment 221346 [details]
messages from Samwell