Bug 141685 - [zfs] zfs corruption on adaptec 5805 raid controller
Summary: [zfs] zfs corruption on adaptec 5805 raid controller
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 8.0-RELEASE
Hardware: Any Any
Importance: Normal Affects Only Me
Assignee: Pawel Jakub Dawidek
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-12-16 16:30 UTC by Tom Payne
Modified: 2012-02-28 13:10 UTC

See Also:


Description Tom Payne 2009-12-16 16:30:01 UTC
Short version:

zfs on a new 5.44T Adaptec 5805 hardware RAID5 partition reports lots of zfs checksum errors.  Tests claim that the hardware is working correctly.

Long version:

I have an Adaptec RAID 5805 controller with eight 1TB SAS disks:

# dmesg | grep aac
aac0: <Adaptec RAID 5805> mem 0xfbc00000-0xfbdfffff irq 16 at device 0.0 on pci9
aac0: Enabling 64-bit address support
aac0: Enable Raw I/O
aac0: Enable 64-bit array
aac0: New comm. interface enabled
aac0: [ITHREAD]
aac0: Adaptec 5805, aac driver 2.0.0-1
aacp0: <SCSI Passthrough Bus> on aac0
aacp1: <SCSI Passthrough Bus> on aac0
aacp2: <SCSI Passthrough Bus> on aac0
aacd0: <RAID 5> on aac0
aacd0: 16370MB (33525760 sectors)
aacd1: <RAID 5> on aac0
aacd1: 6657011MB (13633558528 sectors)


It's configured with a small partition (aacd0) for the root filesystem; the rest (aacd1) is a single large zpool:
# zpool create tank aacd1
# zfs list | head -n 2
NAME                                       USED  AVAIL  REFER  MOUNTPOINT
tank                                       792G  5.44T    18K  none
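
Note that with a single vdev and the default setting of copies=1, zfs can detect corruption in this pool but has no redundant copy to repair it from. Purely as a sketch of one possible mitigation (it doubles the space used by newly written data and does not help data that is already on disk):

# zfs set copies=2 tank
# zfs get copies tank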


After a few days of light use (rsync'ing data from older disk servers), zfs reports lots of checksum errors:

# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 1h17m with 49 errors on Mon Dec 14 13:35:50 2009
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0    98
          aacd1     ONLINE       0     0   196

These 49 errors are in various files scattered across the 200+ zfs filesystems on the disk.
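
The names of the affected files can be listed with the verbose form of the status command (output omitted here):

# zpool status -v tank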


/var/log/messages contains, for example:
# grep ZFS /var/log/messages
Dec 14 13:23:50 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=79622307840 size=131072
Dec 14 13:23:50 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=79622307840 size=131072
Dec 14 13:23:50 isdc3202 root: ZFS: zpool I/O failure, zpool=tank error=86
Dec 14 13:27:47 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=77752696832 size=131072
Dec 14 13:27:47 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=77752696832 size=131072
Dec 14 13:27:47 isdc3202 root: ZFS: zpool I/O failure, zpool=tank error=86
Dec 14 13:28:07 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=1409111293952 size=131072
Dec 14 13:28:07 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=1409111293952 size=131072
Dec 14 13:28:07 isdc3202 root: ZFS: zpool I/O failure, zpool=tank error=86


The 49 checksum errors occur at 49 different offsets in three distinct ranges (error counts per range in parentheses):
  70743228416..  84649705472 ( 6)
1406828281856..1441780858880 (14)
2749871030272..2817199702016 (29)
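
A pipeline along these lines (a sketch only, assuming the exact message format shown above; each mismatch appears twice in the log, hence the uniq) reproduces the list of distinct offsets:

# grep 'ZFS: checksum mismatch' /var/log/messages | sed 's/.*offset=\([0-9]*\).*/\1/' | sort -n | uniq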


The Adaptec controller firmware was updated to the latest version (at the time of writing) after the first errors were observed.  Since the firmware was updated, more errors have been observed.
# arcconf getversion
Controllers found: 1
Controller #1
==============
Firmware           : 5.2-0 (17544)
Staged Firmware    : 5.2-0 (17544)
BIOS               : 5.2-0 (17544)
Driver             : 5.2-0 (17544)
Boot Flash         : 5.2-0 (17544)


I ran a verify task on the RAID controller with
# arcconf task start 1 logicaldrive 1 verify noprompt
As far as I can tell, this verify task did not find any errors.  The array status is still reported as "optimal" and there seems to be nothing in the logs.
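
The controller's own logs can also be dumped with arcconf; the exact log type names depend on the arcconf version installed, so treat the following only as a sketch:

# arcconf getlogs 1 device
# arcconf getlogs 1 event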


A 24 hour memory test with memtest86+ version 4.00 did not detect any memory errors.


Previously, problems have been found with zfs on USB drives:
http://lists.freebsd.org/pipermail/freebsd-current/2009-April/005510.html


As I understand it, the situation is:
- zfs has checksum errors
- the hardware RAID believes that the data on disk is consistent
- there are no obvious memory problems


Could this be a FreeBSD bug?

Fix: Unknown
How-To-Repeat: Unknown
Comment 1 Mark Linimon 2009-12-16 17:54:45 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).
Comment 2 Pawel Jakub Dawidek 2010-03-20 00:27:29 UTC
State Changed
From-To: open->feedback

This is unlikely a ZFS bug. It's hard to tell whether the problem is in the driver,
controller, cables, disks, etc.
You could try to configure geli(8) with data integrity verification on top of
this array and see if geli also reports problems.
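
A minimal sketch of such a setup (illustrative only: the key file path, sector size and pool name are arbitrary, and re-initializing the provider means the existing pool would have to be recreated and restored from backup):

# dd if=/dev/random of=/root/aacd1.key bs=64 count=1
# geli init -a HMAC/SHA256 -P -K /root/aacd1.key -s 4096 /dev/aacd1
# geli attach -p -k /root/aacd1.key /dev/aacd1
# zpool create gtest /dev/aacd1.eli

Any data that fails geli's integrity check would then show up as read errors and geli messages in the system log, independently of the zfs checksums.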


Comment 3 Pawel Jakub Dawidek 2010-03-20 00:27:29 UTC
Responsible Changed
From-To: freebsd-fs->pjd

I'll take this one.
Comment 4 tom 2010-04-29 16:07:12 UTC
Since upgrading the firmware, the corruption is no longer manifesting itself.

So, I think Gary's analysis is correct: this was an Adaptec problem,
not a FreeBSD one.

Thank you for your time and please mark this bug as invalid.
-- 
Tom
Comment 5 Ed Maste 2012-02-28 12:52:24 UTC
If you have the information available, could you please add the firmware
version that fixed the problem for you (for the sake of anyone who finds
this bug report in a future search)?

Regards,
Ed
Comment 6 Pawel Jakub Dawidek 2014-06-01 07:12:13 UTC
State Changed
From-To: feedback->closed

Closed per submitter's request.
This was an Adaptec firmware bug, not a ZFS bug.