| Summary: | mfi errors causing zfs checksum errors | | |
|---|---|---|---|
| Product: | Base System | Reporter: | Daniel Mafua <mafua> |
| Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> |
| Status: | Closed FIXED | | |
| Severity: | Affects Some People | CC: | hrs, marco, ota, shared+bugs.freebsd.org, spam+bugs.freebsd.org, timur |
| Priority: | --- | Keywords: | regression |
| Version: | 11.3-RELEASE | | |
| Hardware: | amd64 | | |
| OS: | Any | | |
Description
Daniel Mafua, 2019-08-12 13:34:33 UTC:
I had this exact same problem with a large array on a PERC H730/P. I was able to fix it by changing the controller driver from mfi to mrsas via a loader.conf variable; see https://www.freebsd.org/cgi/man.cgi?query=mrsas&apropos=0&sektion=0&manpath=FreeBSD+11.3-RELEASE&arch=default&format=html for details on how to do that. Please make sure you run a "zpool scrub" afterwards so ZFS can heal all checksum errors.

Marco van Tol:
Can you tell me whether you updated the firmware, and if so, whether that was before or after upgrading FreeBSD?

Hi, here is my somewhat more detailed story. We have two identical Dell R730xd systems, each with twelve 6 TB drives running in raidz2. Each also has two SSDs that serve the pool as ZIL and cache devices. Both ran 11.1-RELEASE without problems, and later 11.2-RELEASE, likewise without any problems. In between the upgrades I also ran "zpool upgrade" where available.

Then I upgraded to 11.3-RELEASE and, trusting that everything would be as flawless as in the first two years, gave it little attention beyond keeping an eye on it from the monitoring host. Unfortunately we only monitor zpool status, which reported ONLINE throughout the entire process. The systems ran like that for quite a while until, wanting to show the pool to someone, I found that both systems had a few hundred thousand checksum errors on each of the twelve drives in the pool, but none on the SSDs that make up the ZIL and cache. One system was running 11.3-RELEASE-p1, the other 11.3-RELEASE-p3.

My path to fixing this was:
- Google the issue and learn that mrsas was a potential fix
- Switch the p3 system to mrsas and reboot
- Run a zpool scrub to heal the damaged blocks, which ended up repairing almost 100 GB of data
- Run zpool clear to reset the error counters
- Run another zpool scrub to verify the counters stay at 0, which they did
- Do the same on the other system, which was on p1, bringing it to p3 in the process

Hope this helps,
Marco van Tol

Daniel Mafua:
The firmware has not been upgraded on any of the servers. I was able to switch from mfi to mrsas on one of them, but because they are at remote locations I need to be careful about how aggressively I proceed. I have successfully scrubbed out the errors and everything is clear now. However, that was also the case after every previous reboot, with errors starting to appear again around 3-4 days later. It will probably take me two weeks to verify that it is working okay; I will report back then.

(In reply to Daniel Mafua from comment #4)
Thanks for your report. I have several other systems suffering from the same problem. Here is what I have learned about it so far:

* After upgrading FreeBSD to 11.3 or 12.0, mfi(4) can report I/O errors that are not related to any actual hardware failure.
* The errors do not depend on ZFS, although ZFS is very likely to report them as checksum errors on the zpool. On a system using UFS, the spurious I/O errors can cause a system panic, a boot failure, or something similarly fatal.
* The I/O errors seem to depend on the firmware version; some older firmware versions work fine even with mfi(4) on 11.3 and 12.0.
* If the device is also supported by mrsas(4), switching to it resolves the errors. Note that switching introduces an incompatibility: mfi(4) uses /dev/mfi* device nodes for the attached drives and mfiutil(8) as the userland utility, while mrsas(4) uses /dev/da* nodes and a vendor-supplied utility such as MegaCli instead (see the sketch after this list).
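A minimal sketch of what that difference looks like from a shell under each driver; the controller and volume numbers are illustrative only:

```sh
# Under mfi(4): the controller appears as /dev/mfi0 and its logical
# volumes as /dev/mfid0, /dev/mfid1, ...; management is done with the
# base-system tool mfiutil(8).
ls /dev/mfi*
mfiutil show volumes

# Under mrsas(4): the same volumes attach through CAM as ordinary
# /dev/da* disks, so they are listed with camcontrol(8); management
# requires a vendor tool such as MegaCli instead of mfiutil(8).
ls /dev/da*
camcontrol devlist
```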
I am investigating what caused this regression, but a workaround is to use mrsas(4) instead of mfi(4) by specifying hw.mfi.mrsas_enable="1" at boot time (a consolidated sketch of the workaround appears at the end of this report).

Timur I. Bakeyev:
Bug #230557 is kind of related to this issue.

Andriy Gapon:
(In reply to Timur I. Bakeyev from comment #6)
Do you mean bug 230557?

Timur I. Bakeyev:
(In reply to Andriy Gapon from comment #7)
Yes. Basically, it makes the correct point: for the PERC H730P the mrsas driver is the better (and proper) option, i.e. mfi shouldn't even be picked up by the kernel for it by default. After switching to mrsas we saw roughly double the performance and overall bandwidth, not to mention the disappearance of those weird I/O errors.

I think I can safely say that switching to the mrsas driver has resolved my problem. Thanks to everyone! For reference, I was experiencing the problem on Dell servers of various models using the PERC H330 controller, with firmware versions 25.3.0.0016 and 25.5.2.0001. A PERC H310 Adapter with firmware 20.13.0-0007 was not affected and continued to use the mfi driver.
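A consolidated sketch of the workaround and recovery steps described above, assuming a raidz2 pool named "tank" (a hypothetical name; substitute your own pool):

```sh
# 1. Ask mfi(4) to hand supported controllers over to mrsas(4) on the
#    next boot (the loader tunable named in the comments above), then
#    reboot.
echo 'hw.mfi.mrsas_enable="1"' >> /boot/loader.conf
shutdown -r now

# 2. After the reboot, scrub the pool so ZFS rewrites every block that
#    fails its checksum from the raidz2 redundancy.
zpool scrub tank
zpool status -v tank   # repeat until the scrub reports completion

# 3. Reset the error counters, then scrub once more to confirm the
#    counters stay at zero.
zpool clear tank
zpool scrub tank
zpool status tank
```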