Bug 241476 - zfs checksum errors with FreeBSD 11.3 and mps driver
Summary: zfs checksum errors with FreeBSD 11.3 and mps driver
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-fs mailing list
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2019-10-24 16:53 UTC by dentarg
Modified: 2020-01-20 08:48 UTC
CC List: 2 users

See Also:


Attachments
Output from "mpsutil show all" from Jon (1.81 KB, text/plain)
2019-10-24 16:53 UTC, dentarg
Output from "zdb -C db" from Jon (12.53 KB, text/plain)
2019-10-24 16:54 UTC, dentarg
Output from "mpsutil show all" from Samwell (1.65 KB, text/plain)
2019-10-24 16:55 UTC, dentarg
Output from "zdb -C db" from Samwell (12.69 KB, text/plain)
2019-10-24 16:56 UTC, dentarg
zpool status from 2019-09-20 occurrence on Jon (9.96 KB, text/plain)
2019-10-24 16:57 UTC, dentarg
zpool status from 2019-10-12 occurrence on Jon (9.84 KB, text/plain)
2019-10-24 16:57 UTC, dentarg
zpool status from 2019-10-09 occurrence on Samwell (9.88 KB, text/plain)
2019-10-24 16:57 UTC, dentarg
zpool status from 2019-10-24 occurrence on Samwell (6.63 KB, text/plain)
2019-10-24 16:58 UTC, dentarg

Description dentarg 2019-10-24 16:53:56 UTC
Created attachment 208573 [details]
Output from "mpsutil show all" from Jon

We have two servers, Jon and Samwell.

The servers are SuperMicro SuperStorage Server 2027R-E1R24L with the X9DR7-LN4F motherboard, 2x Xeon E5-2620v2 and 256GB ECC RAM. The motherboard has the LSI SAS 2308 controller. Both servers have 22x 960GB Samsung SSD (SV843, SM843T, SM863, SM863a).

Each server has a zpool ("db") made up of 11 mirror vdevs built from the SSDs. They boot off a separate zpool. The servers use the mps driver.
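
For reference, the pool layout on each server looks roughly like the result of the command below (the device names are illustrative only; the real vdev configuration is in the attached "zdb -C db" outputs):

$ zpool create db \
    mirror da0 da4   mirror da1 da5   mirror da2 da3 \
    mirror da8 da10  mirror da9 da11  mirror da12 da13 \
    mirror da14 da15 mirror da16 da17 mirror da18 da19 \
    mirror da20 da21 mirror da6 da7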

The servers are configured to scrub regularly:

$ cat /etc/periodic.conf
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold="7" # days between scrubs
daily_status_zfs_enable="YES"

We upgraded both servers from FreeBSD 11.2 to 11.3 on 2019-09-04.

Every ten minutes a cron job checks that the pools are healthy; if they are not, we get an alert (a rough sketch of the check is included below). Since upgrading to FreeBSD 11.3, both servers have alerted twice:

2019-09-20 04:10 Jon
2019-10-12 08:20 Jon

2019-10-09 05:10 Samwell
2019-10-24 04:10 Samwell

The alerts were about the error https://illumos.org/msg/ZFS-8000-8A (see the attached zpool status outputs for details).
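
A minimal sketch of an equivalent health check (the schedule, script path and alert command here are placeholders, not our exact script):

# /etc/crontab entry, runs every 10 minutes
*/10  *  *  *  *  root  /usr/local/sbin/check_zpools.sh

# /usr/local/sbin/check_zpools.sh
#!/bin/sh
# "zpool status -x" prints "all pools are healthy" when nothing is wrong
if ! zpool status -x | grep -q "all pools are healthy"; then
    zpool status -x | mail -s "zpool not healthy on $(hostname)" root
fi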

So far we have only seen errors reported for what I understand to be metadata, for example:

errors: Permanent errors have been detected in the following files:

        <0x16dc>:<0x498>


Issuing another "zpool scrub" makes zpool status report "errors: No known data errors", and a subsequent "zpool clear" resets the checksum error counts.
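
In other words, after an occurrence the following sequence is enough to get back to a clean state (generic commands, not a paste of an actual session):

$ sudo zpool scrub db      # re-scrub; the permanent error entry disappears
$ sudo zpool status -v db  # now reports "errors: No known data errors"
$ sudo zpool clear db      # reset the READ/WRITE/CKSUM counters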

We see no errors in /var/log/messages, only "ZFS: vdev state changed" lines when a scrub starts.

I've seen and read bug #239801, but opted to open a new bug as we are using a different driver.
Comment 1 dentarg 2019-10-24 16:54:37 UTC
Created attachment 208574 [details]
Output from "zdb -C db" from Jon
Comment 2 dentarg 2019-10-24 16:55:21 UTC
Created attachment 208575 [details]
Output from "mpsutil show all" from Samwell
Comment 3 dentarg 2019-10-24 16:56:00 UTC
Created attachment 208576 [details]
Output from "zdb -C db" from Samwell
Comment 4 dentarg 2019-10-24 16:57:08 UTC
Created attachment 208577 [details]
zpool status from 2019-09-20 occurrence on Jon
Comment 5 dentarg 2019-10-24 16:57:31 UTC
Created attachment 208578 [details]
zpool status from 2019-10-12 occurrence on Jon
Comment 6 dentarg 2019-10-24 16:57:50 UTC
Created attachment 208579 [details]
zpool status from 2019-10-09 occurrence on Samwell
Comment 7 dentarg 2019-10-24 16:58:02 UTC
Created attachment 208580 [details]
zpool status from 2019-10-24 occurrence on Samwell
Comment 8 dentarg 2019-12-02 15:07:12 UTC
Not sure how much it helps, but we have seen the problem two more times on one of the servers in the last month:

2019-11-19 05:10 Jon
2019-11-27 08:50 Jon
Comment 9 dentarg 2020-01-20 08:48:26 UTC
We updated both servers to 11.3-RELEASE-p5 last week. 

During the weekend (2020-01-18 15:20), server Jon experienced "Permanent errors" again:

jon% sudo zpool status -v db
Password:
  pool: db
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0 days 14:40:11 with 0 errors on Sat Jan 18 17:41:38 2020
config:

    NAME                                              STATE     READ WRITE CKSUM
    db                                                ONLINE       0     0    16
      mirror-0                                        ONLINE       0     0    12
        da0                                           ONLINE       0     0    12
        da4                                           ONLINE       0     0    12
      mirror-1                                        ONLINE       0     0    12
        da1                                           ONLINE       0     0    12
        da5                                           ONLINE       0     0    12
      mirror-2                                        ONLINE       0     0     4
        da2                                           ONLINE       0     0     4
        da3                                           ONLINE       0     0     4
      mirror-3                                        ONLINE       0     0     2
        da8                                           ONLINE       0     0     2
        da10                                          ONLINE       0     0     2
      mirror-4                                        ONLINE       0     0     4
        da9                                           ONLINE       0     0     4
        da11                                          ONLINE       0     0     4
      mirror-5                                        ONLINE       0     0     2
        diskid/DISK-S1E4NYAF500425%20%20%20%20%20%20  ONLINE       0     0     2
        diskid/DISK-S1E4NYAF500433%20%20%20%20%20%20  ONLINE       0     0     2
      mirror-6                                        ONLINE       0     0     4
        da14                                          ONLINE       0     0     4
        da15                                          ONLINE       0     0     4
      mirror-7                                        ONLINE       0     0     6
        diskid/DISK-S2HTNX0H613319%20%20%20%20%20%20  ONLINE       0     0     6
        diskid/DISK-S2HTNX0H613311%20%20%20%20%20%20  ONLINE       0     0     6
      mirror-8                                        ONLINE       0     0     4
        diskid/DISK-S186NEADC05812%20%20%20%20%20%20  ONLINE       0     0     4
        diskid/DISK-S186NEADC05804%20%20%20%20%20%20  ONLINE       0     0     4
      mirror-9                                        ONLINE       0     0     4
        diskid/DISK-S3F3NX0J602173%20%20%20%20%20%20  ONLINE       0     0     4
        diskid/DISK-S3F3NX0J603555%20%20%20%20%20%20  ONLINE       0     0     4
      mirror-10                                       ONLINE       0     0    10
        da6                                           ONLINE       0     0    10
        da7                                           ONLINE       0     0    10

errors: Permanent errors have been detected in the following files:

        <0xa44>:<0xfe3>