Created attachment 183182 [details]
dmesg

With a zvol to provide a 1TB iSCSI drive for a Windows 10 initiator, using "Create a system image" causes the pool to be corrupted. Neither drive reports any SMART problems (both have "SMART overall-health self-assessment test result: PASSED"). The backup completes without errors on the Windows side.

The iSCSI volume is set for 64k blocks and that is also the NTFS cluster size. The backup is from two SATA SSDs in a laptop and done over a Gigabit Ethernet link. Bitlocker is enabled for the backup volume on the Windows side, which could possibly alter the disk access patterns...?

Deleting the zvol, clearing the error, and retrying the backup to a new zvol causes the same problem. The first try used "FreeBSD 11.0-STABLE #10 r317513: Thu Apr 27 12:09:34 MST 2017". I updated to "FreeBSD 11.1-PRERELEASE #11 r319488: Thu Jun 1 21:12:33 MST 2017" for the second try.

  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 9h34m with 0 errors on Fri Jun  2 22:36:45 2017
config:

        NAME            STATE     READ WRITE CKSUM
        tank            DEGRADED     0     0    74
          mirror-0      ONLINE       0     0     0
            gpt/tank0   ONLINE       0     0     0
            gpt/tank1   ONLINE       0     0     0
          mirror-1      DEGRADED     0     0   148
            gpt/tank0b  DEGRADED     0     0   148  too many errors
            gpt/tank1b  DEGRADED     0     0   148  too many errors

errors: Permanent errors have been detected in the following files:

        tank/iscsi/mcp2:<0x1>

NAME             PROPERTY              VALUE                  SOURCE
tank/iscsi/mcp2  type                  volume                 -
tank/iscsi/mcp2  creation              Fri Jun  2 12:49 2017  -
tank/iscsi/mcp2  used                  1.00T                  -
tank/iscsi/mcp2  available             3.22T                  -
tank/iscsi/mcp2  referenced            971G                   -
tank/iscsi/mcp2  compressratio         1.00x                  -
tank/iscsi/mcp2  reservation           none                   default
tank/iscsi/mcp2  volsize               1T                     local
tank/iscsi/mcp2  volblocksize          64K                    -
tank/iscsi/mcp2  checksum              skein                  inherited from tank
tank/iscsi/mcp2  compression           off                    local
tank/iscsi/mcp2  readonly              off                    default
tank/iscsi/mcp2  copies                1                      default
tank/iscsi/mcp2  refreservation        1.00T                  local
tank/iscsi/mcp2  primarycache          all                    default
tank/iscsi/mcp2  secondarycache        all                    default
tank/iscsi/mcp2  usedbysnapshots       0                      -
tank/iscsi/mcp2  usedbydataset         971G                   -
tank/iscsi/mcp2  usedbychildren        0                      -
tank/iscsi/mcp2  usedbyrefreservation  57.2G                  -
tank/iscsi/mcp2  logbias               latency                default
tank/iscsi/mcp2  dedup                 off                    default
tank/iscsi/mcp2  mlslabel              -
tank/iscsi/mcp2  sync                  standard               default
tank/iscsi/mcp2  refcompressratio      1.00x                  -
tank/iscsi/mcp2  written               971G                   -
tank/iscsi/mcp2  logicalused           970G                   -
tank/iscsi/mcp2  logicalreferenced     970G                   -
tank/iscsi/mcp2  volmode               default                default
tank/iscsi/mcp2  snapshot_limit        none                   default
tank/iscsi/mcp2  snapshot_count        none                   default
tank/iscsi/mcp2  redundant_metadata    all                    default

Geom name: ada3
modified: false
state: OK
fwheads: 16
fwsectors: 63
last: 11721045127
first: 40
entries: 152
scheme: GPT
Providers:
1. Name: ada3p1
   Mediasize: 6001173463040 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   rawuuid: 7830d285-8079-11e6-b5ec-00e081c81bd6
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: tank0b
   length: 6001173463040
   offset: 1048576
   type: freebsd-zfs
   index: 1
   end: 11721043967
   start: 2048
Consumers:
1. Name: ada3
   Mediasize: 6001175126016 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e3

Geom name: ada4
modified: false
state: OK
fwheads: 16
fwsectors: 63
last: 11721045127
first: 40
entries: 152
scheme: GPT
Providers:
1. Name: ada4p1
   Mediasize: 6001173463040 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   rawuuid: 7b35d907-8079-11e6-b5ec-00e081c81bd6
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: tank1b
   length: 6001173463040
   offset: 1048576
   type: freebsd-zfs
   index: 1
   end: 11721043967
   start: 2048
Consumers:
1. Name: ada4
   Mediasize: 6001175126016 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e3
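For reference, a rough sketch of the kind of zvol and ctld(8) configuration involved here. The target IQN and most ctl.conf options below are illustrative assumptions, not a copy of the exact configuration in use; only the zvol name and 64K block size come from the report above.

# 1 TB zvol with 64K blocks, matching the NTFS cluster size
zfs create -V 1T -o volblocksize=64K tank/iscsi/mcp2

# /etc/ctl.conf -- minimal ctld(8) target exporting the zvol (example values)
portal-group pg0 {
        discovery-auth-group no-authentication
        listen 0.0.0.0
}
target iqn.2017-06.org.example:mcp2 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/zvol/tank/iscsi/mcp2
        }
}

# pick up the configuration change
service ctld reload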
Looking at your zpool... Correct me if I'm wrong, but don't your vdevs overlap somehow? I mean, the first one is on /dev/gpt/tank[01], and the second one on /dev/gpt/tank[01]b? Can you paste the output of "gpart list"?
I mean, the full output, the one that shows what tank0 (without the "b") is.
(In reply to Edward Tomasz Napierala from comment #2)

The tank0 and tank0b drives are completely separate. Had I more sense when adding the two new drives, they would have been named something like "tankb0" and "tankb1".

From "gpart show -l":

=>         40  11721045088  ada3  GPT  (5.5T)
           40         2008        - free -  (1.0M)
         2048  11721041920     1  tank0b  (5.5T)
  11721043968         1160        - free -  (580K)

=>         40  11721045088  ada4  GPT  (5.5T)
           40         2008        - free -  (1.0M)
         2048  11721041920     1  tank1b  (5.5T)
  11721043968         1160        - free -  (580K)

=>        34  5860533101  da0  GPT  (2.7T)
          34        2014       - free -  (1.0M)
        2048  5860529032    1  tank0  (2.7T)
  5860531080        2055       - free -  (1.0M)

=>        34  5860533101  da1  GPT  (2.7T)
          34        2014       - free -  (1.0M)
        2048  5860529032    1  tank1  (2.7T)
  5860531080        2055       - free -  (1.0M)

Here's the gpart list for tank0 and tank1:

Geom name: da0
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 5860533134
first: 34
entries: 128
scheme: GPT
Providers:
1. Name: da0p1
   Mediasize: 3000590864384 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   rawuuid: 618bb5ab-66a1-11e2-8e33-00e081c81bd6
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: tank0
   length: 3000590864384
   offset: 1048576
   type: freebsd-zfs
   index: 1
   end: 5860531079
   start: 2048
Consumers:
1. Name: da0
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e3

Geom name: da1
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 5860533134
first: 34
entries: 128
scheme: GPT
Providers:
1. Name: da1p1
   Mediasize: 3000590864384 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   rawuuid: b0e34fe7-6a9e-11e2-8e33-00e081c81bd6
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: tank1
   length: 3000590864384
   offset: 1048576
   type: freebsd-zfs
   index: 1
   end: 5860531079
   start: 2048
Consumers:
1. Name: da1
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e3
To be honest, I doubt this is caused by iSCSI; I have no idea how it could corrupt a filesystem _below_ it. I think it's something between your disks and ZFS, and iSCSI just happens to generate a load high enough to trigger it. Is there a firmware update for your disks? Also, the errors seem to appear only on ada3 and ada4; can you recreate the zpool without those two disks and see if the problem goes away?
(In reply to Edward Tomasz Napierala from comment #4)

The ada4 and ada3 disks were added because the existing mirror set in the pool was full (as in "getting out of disk space errors", not "uncomfortably high"), so the original mirror pair is not likely to see any significant writes. This is a home NAS setup, and while I have backups in the cloud and some on LTO3 tapes, I don't have anywhere local to store what is on those drives, so I'm really reluctant to pull them.

I was not able to find any firmware updates for these drives. Both long and short SMART tests have been run without finding anything. Writing multiple 1TB files with "dd if=/dev/random" hasn't caused any problems. I'll try doing those "dd"s to zvols next (sketched after the SMART output below). The "write a couple of TB, then scrub" testing takes a while; any thoughts on other non-destructive testing I could do?

Is there some sane way to get a list of what the actual errors are in the pool (offset on the raw disk or such)? When I've done the backup, both drives have always reported the same number of errors. If both drives report exactly the same (bad) data at the same locations, that could perhaps suggest something useful?

I have both the chipset SATA ports and the SAS/SATA ports from the motherboard's LSI SAS controller, so I might be able to move the drives from one controller to the other.

Another device--a Server 2008 R2 box--has been doing similar iSCSI backups to another pool in this system for years. It, however, has much less to write and does so from spinning rust instead of SSD (although I would assume the GigE link is still the bottleneck). The drives for this other pool are on the same controller as ada3 and ada4. I'll see about having it do an extra backup or two to the "tank" pool to see if it causes any grief there.

For ada3:

smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-PRERELEASE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Deskstar NAS
Device Model:     HGST HDN726060ALE610
Serial Number:    NCGTEJ2S
LU WWN Device Id: 5 000cca 24dcb1ba7
Firmware Version: APGNT517
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Jun 10 15:49:55 2017 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

For ada4:

smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-PRERELEASE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Deskstar NAS
Device Model:     HGST HDN726060ALE610
Serial Number:    NCGTES1S
LU WWN Device Id: 5 000cca 24dcb1c7f
Firmware Version: APGNT517
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Jun 10 15:46:34 2017 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
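To make the planned dd-to-zvol test concrete, here is a rough sketch of what I have in mind (the test zvol name and its size are placeholders, not anything that exists on the pool yet):

# throwaway test zvol on the same pool, same 64K block size as the iSCSI volume
zfs create -V 1T -o volblocksize=64K tank/testvol

# fill it with random data
dd if=/dev/random of=/dev/zvol/tank/testvol bs=1m

# re-check checksums and list any objects with permanent errors
zpool scrub tank
zpool status -v tank

# clean up afterwards
zfs destroy tank/testvol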
I observed similar behaviour on one of my SAN systems. In my opinion, iSCSI + zfs is broken somewhere between r310734 and r320056.

Symptoms:
- random fatal trap 12 panics
- random general protection fault panics
- random "Solaris(panic): zfs: allocating allocated segment" panics
- zfs pool corruption that happens ONLY on pools that serve zvols as iSCSI target devices
- zfs pool corruption happening _on the fly_, without the system panicking
- no zfs corruption on the zfs pools not serving devices for the iSCSI targets

I have 7 SAN systems running this setup. None of the systems running r310734 shows this behaviour. The only system running something more recent (at least r320056, and up to 11.1-RELEASE) was affected by this, and became healthy when downgraded to r310734 (r310734 was chosen simply because it's the most recent revision shared by all 7).

At first I had the strong impression that we had a hardware problem. Memtest86+ found no problems. We found multiple SMART ATA errors that were caused by bad cabling, and that seemed to be the root cause for the moment, but after switching to a new cable (and also to a new HBA, a new server and a new enclosure) the problem hadn't vanished. It was solved only by the downgrade to r310734. The SAN system has now been up and running for 48 hours without pool corruption (which usually happened within the first 12 hours of running) and without panics (which usually happened within the first 24 hours).

Unfortunately, I have no crashdumps, because mpr(4) blocks crashdump collection (see the discussion on freebsd-scsi@). I have only the backtraces from serial-over-ethernet IPMI, which I will attach here.

Problem initial description:
https://lists.freebsd.org/pipermail/freebsd-fs/2017-August/025099.html
Created attachment 185276 [details]
backtraces showing the issue

These are panic backtraces collected over one month from the affected system, running multiple affected revisions up to 11.1.
Workaround (provided by mav@): set vfs.zfs.vol.immediate_write_sz to 131072. No zvol corruption observed after this.
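In case it saves someone a lookup, applying the workaround is just a sysctl. Persisting it across reboots is my own addition; if the knob is only a boot-time tunable on your revision, it may need to go into /boot/loader.conf instead:

# apply at runtime
sysctl vfs.zfs.vol.immediate_write_sz=131072

# optionally persist across reboots
echo 'vfs.zfs.vol.immediate_write_sz=131072' >> /etc/sysctl.conf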
Good catch! Do you know _why_ it fixes the problem?