Created attachment 208573 [details]
Output from "mpsutil show all" from Jon

We have two servers, Jon and Samwell. They are SuperMicro SuperStorage Server 2027R-E1R24L machines with the X9DR7-LN4F motherboard, 2x Xeon E5-2620 v2 and 256 GB ECC RAM. The motherboard has the LSI SAS 2308 controller, driven by the mps driver. Both servers have 22x 960 GB Samsung SSDs (SV843, SM843T, SM863, SM863a). Each server has a zpool ("db") consisting of 11 mirrors of these SSDs; they boot off a separate zpool.

The servers are configured to scrub regularly:

$ cat /etc/periodic.conf
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold="7" # days between scrubs
daily_status_zfs_enable="YES"

We upgraded both servers from FreeBSD 11.2 to 11.3 on 2019-09-04. Every 10th minute a cron job checks that the pools are healthy; if not, we get an alert. Since moving to FreeBSD 11.3, both servers have alerted twice:

2019-09-20 04:10 Jon
2019-10-12 08:20 Jon
2019-10-09 05:10 Samwell
2019-10-24 04:10 Samwell

about the error https://illumos.org/msg/ZFS-8000-8A (see attached files for details; outputs from "zpool status").

So far we have only seen errors reported for what I understand is metadata, for example:

errors: Permanent errors have been detected in the following files:

        <0x16dc>:<0x498>

Issuing another "zpool scrub" makes "zpool status" report "errors: No known data errors", and a subsequent "zpool clear" clears the checksum error counts.

We see no errors in /var/log/messages, only "ZFS: vdev state changed" lines when a scrub starts.

I've seen and read bug #239801, but opted to open a new bug as we are using a different driver.
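For reference, the 10-minute health check is essentially the following. This is a minimal sketch, not our exact production script: the script name, alert recipient, and mail wording are placeholders.

```shell
#!/bin/sh
# Hypothetical sketch of the 10-minute pool health check (not the exact
# production script; recipient and message wording are placeholders).

pool_ok() {
    # "zpool status -x" prints exactly this one line when no pool
    # has any known errors
    [ "$1" = "all pools are healthy" ]
}

if command -v zpool >/dev/null 2>&1; then
    summary=$(zpool status -x 2>&1)
    if ! pool_ok "$summary"; then
        # mail(1) is assumed to be configured for outbound alerts
        printf '%s\n' "$summary" | mail -s "zpool alert on $(hostname)" root
    fi
fi
```

Run from root's crontab with a line like: */10 * * * * /usr/local/sbin/zpool-check.sh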
Created attachment 208574 [details] Output from "zdb -C db" from Jon
Created attachment 208575 [details] Output from "mpsutil show all" from Samwell
Created attachment 208576 [details] Output from "zdb -C db" from Samwell
Created attachment 208577 [details] zpool status from 2019-09-20 occurrence on Jon
Created attachment 208578 [details] zpool status from 2019-10-12 occurrence on Jon
Created attachment 208579 [details] zpool status from 2019-10-09 occurrence on Samwell
Created attachment 208580 [details] zpool status from 2019-10-24 occurrence on Samwell
Not sure how much it helps, but we have seen the problem two more times on one of the servers in the last month:

2019-11-19 05:10 Jon
2019-11-27 08:50 Jon
We updated both servers to 11.3-RELEASE-p5 last week. During the weekend (2020-01-18 15:20), server Jon experienced "Permanent errors" again:

jon% sudo zpool status -v db
  pool: db
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0 days 14:40:11 with 0 errors on Sat Jan 18 17:41:38 2020
config:

        NAME                                          STATE     READ WRITE CKSUM
        db                                            ONLINE       0     0    16
          mirror-0                                    ONLINE       0     0    12
            da0                                       ONLINE       0     0    12
            da4                                       ONLINE       0     0    12
          mirror-1                                    ONLINE       0     0    12
            da1                                       ONLINE       0     0    12
            da5                                       ONLINE       0     0    12
          mirror-2                                    ONLINE       0     0     4
            da2                                       ONLINE       0     0     4
            da3                                       ONLINE       0     0     4
          mirror-3                                    ONLINE       0     0     2
            da8                                       ONLINE       0     0     2
            da10                                      ONLINE       0     0     2
          mirror-4                                    ONLINE       0     0     4
            da9                                       ONLINE       0     0     4
            da11                                      ONLINE       0     0     4
          mirror-5                                    ONLINE       0     0     2
            diskid/DISK-S1E4NYAF500425%20%20%20%20%20%20  ONLINE   0     0     2
            diskid/DISK-S1E4NYAF500433%20%20%20%20%20%20  ONLINE   0     0     2
          mirror-6                                    ONLINE       0     0     4
            da14                                      ONLINE       0     0     4
            da15                                      ONLINE       0     0     4
          mirror-7                                    ONLINE       0     0     6
            diskid/DISK-S2HTNX0H613319%20%20%20%20%20%20  ONLINE   0     0     6
            diskid/DISK-S2HTNX0H613311%20%20%20%20%20%20  ONLINE   0     0     6
          mirror-8                                    ONLINE       0     0     4
            diskid/DISK-S186NEADC05812%20%20%20%20%20%20  ONLINE   0     0     4
            diskid/DISK-S186NEADC05804%20%20%20%20%20%20  ONLINE   0     0     4
          mirror-9                                    ONLINE       0     0     4
            diskid/DISK-S3F3NX0J602173%20%20%20%20%20%20  ONLINE   0     0     4
            diskid/DISK-S3F3NX0J603555%20%20%20%20%20%20  ONLINE   0     0     4
          mirror-10                                   ONLINE       0     0    10
            da6                                       ONLINE       0     0    10
            da7                                       ONLINE       0     0    10

errors: Permanent errors have been detected in the following files:

        <0xa44>:<0xfe3>
We updated server Jon to 12.1-RELEASE-p5 on 2020-06-04. (Server Samwell was updated later, on 2020-06-10, and is running 12.1-RELEASE-p6.) The problem is still present. Today, around 08:20 (~5h20m after the weekly scrub started), checksum errors appeared again on server Jon:

$ sudo zpool status -v db
  pool: db
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Wed Jun 17 03:01:26 2020
        6.33T scanned at 180M/s, 6.22T issued at 177M/s, 7.68T total
        0 repaired, 81.06% done, 0 days 02:23:51 to go
config:

        NAME                                          STATE     READ WRITE CKSUM
        db                                            ONLINE       0     0    22
          mirror-0                                    ONLINE       0     0    12
            da0                                       ONLINE       0     0    12
            da4                                       ONLINE       0     0    12
          mirror-1                                    ONLINE       0     0    14
            da1                                       ONLINE       0     0    14
            da5                                       ONLINE       0     0    14
          mirror-2                                    ONLINE       0     0    12
            da2                                       ONLINE       0     0    12
            da3                                       ONLINE       0     0    12
          mirror-3                                    ONLINE       0     0     4
            da8                                       ONLINE       0     0     4
            da10                                      ONLINE       0     0     4
          mirror-4                                    ONLINE       0     0     8
            da9                                       ONLINE       0     0     8
            da11                                      ONLINE       0     0     8
          mirror-5                                    ONLINE       0     0     8
            diskid/DISK-S1E4NYAF500425%20%20%20%20%20%20  ONLINE   0     0     8
            diskid/DISK-S1E4NYAF500433%20%20%20%20%20%20  ONLINE   0     0     8
          mirror-6                                    ONLINE       0     0     8
            da14                                      ONLINE       0     0     8
            da15                                      ONLINE       0     0     8
          mirror-7                                    ONLINE       0     0     4
            diskid/DISK-S2HTNX0H613319%20%20%20%20%20%20  ONLINE   0     0     4
            diskid/DISK-S2HTNX0H613311%20%20%20%20%20%20  ONLINE   0     0     4
          mirror-8                                    ONLINE       0     0     4
            diskid/DISK-S47PNA0MB06473%20%20%20%20%20%20  ONLINE   0     0     4
            diskid/DISK-S47PNA0MB06796%20%20%20%20%20%20  ONLINE   0     0     4
          mirror-9                                    ONLINE       0     0    10
            diskid/DISK-S3F3NX0J602173%20%20%20%20%20%20  ONLINE   0     0    10
            diskid/DISK-S3F3NX0J603555%20%20%20%20%20%20  ONLINE   0     0    10
          mirror-10                                   ONLINE       0     0     4
            da6                                       ONLINE       0     0     4
            da7                                       ONLINE       0     0     4

errors: Permanent errors have been detected in the following files:

        <0x1173>:<0x1ae>
We have now updated to FreeBSD 12.2 and the problem is still present. Curiously enough, the last time it happened we got errors on both machines at the same time. I noticed that under 12.2 there are related messages written to /var/log/messages; attaching files from both machines.

I might add that we have hourly snapshots set up, using zfs-auto-snapshot, which we keep for a rolling 24 hours. Perhaps something "weird" happens when a snapshot is removed during scrubbing?
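To cross-check that theory, the timing of snapshot destroys relative to the scrub window can be pulled out of the pool history. A rough sketch (the pool name "db" is ours; the filter is a plain grep over "zpool history" output):

```shell
#!/bin/sh
# Hypothetical helper: list "zfs destroy" events recorded in the pool
# history, to compare their timestamps against the scrub window.
# "zpool history" lines look like:
#   2020-11-19.04:00:03 zfs destroy db/data@zfs-auto-snap_hourly-...

destroy_events() {
    # keep only the snapshot/dataset destroy entries
    grep ' zfs destroy '
}

if command -v zpool >/dev/null 2>&1; then
    zpool history db | destroy_events
fi
```

Matching those timestamps against the "scrub in progress since ..." line from zpool status would show whether an hourly snapshot was destroyed mid-scrub each time the errors appeared.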
Created attachment 221345 [details] messages from Jon
Created attachment 221346 [details] messages from Samwell