Created attachment 183182 [details]
dmesg

With a zvol to provide a 1TB iSCSI drive for a Windows 10 initiator, using "Create a system image" causes the pool to be corrupted. Neither drive reports any SMART problems (both have "SMART overall-health self-assessment test result: PASSED"). The backup completes without errors on the Windows side.

The iSCSI volume is set for 64k blocks and that is also the NTFS cluster size. The backup is from two SATA SSDs in a laptop and done over a Gigabit Ethernet link. Bitlocker is enabled for the backup volume on the Windows side, which could possibly alter the disk access patterns...?

Deleting the zvol, clearing the error, and retrying the backup to a new zvol causes the same problem. The first try used "FreeBSD 11.0-STABLE #10 r317513: Thu Apr 27 12:09:34 MST 2017". I updated to "FreeBSD 11.1-PRERELEASE #11 r319488: Thu Jun 1 21:12:33 MST 2017" for the second try.

  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 9h34m with 0 errors on Fri Jun  2 22:36:45 2017
config:

        NAME            STATE     READ WRITE CKSUM
        tank            DEGRADED     0     0    74
          mirror-0      ONLINE       0     0     0
            gpt/tank0   ONLINE       0     0     0
            gpt/tank1   ONLINE       0     0     0
          mirror-1      DEGRADED     0     0   148
            gpt/tank0b  DEGRADED     0     0   148  too many errors
            gpt/tank1b  DEGRADED     0     0   148  too many errors

errors: Permanent errors have been detected in the following files:

        tank/iscsi/mcp2:<0x1>

NAME             PROPERTY              VALUE                  SOURCE
tank/iscsi/mcp2  type                  volume                 -
tank/iscsi/mcp2  creation              Fri Jun  2 12:49 2017  -
tank/iscsi/mcp2  used                  1.00T                  -
tank/iscsi/mcp2  available             3.22T                  -
tank/iscsi/mcp2  referenced            971G                   -
tank/iscsi/mcp2  compressratio         1.00x                  -
tank/iscsi/mcp2  reservation           none                   default
tank/iscsi/mcp2  volsize               1T                     local
tank/iscsi/mcp2  volblocksize          64K                    -
tank/iscsi/mcp2  checksum              skein                  inherited from tank
tank/iscsi/mcp2  compression           off                    local
tank/iscsi/mcp2  readonly              off                    default
tank/iscsi/mcp2  copies                1                      default
tank/iscsi/mcp2  refreservation        1.00T                  local
tank/iscsi/mcp2  primarycache          all                    default
tank/iscsi/mcp2  secondarycache        all                    default
tank/iscsi/mcp2  usedbysnapshots       0                      -
tank/iscsi/mcp2  usedbydataset         971G                   -
tank/iscsi/mcp2  usedbychildren        0                      -
tank/iscsi/mcp2  usedbyrefreservation  57.2G                  -
tank/iscsi/mcp2  logbias               latency                default
tank/iscsi/mcp2  dedup                 off                    default
tank/iscsi/mcp2  mlslabel              -
tank/iscsi/mcp2  sync                  standard               default
tank/iscsi/mcp2  refcompressratio      1.00x                  -
tank/iscsi/mcp2  written               971G                   -
tank/iscsi/mcp2  logicalused           970G                   -
tank/iscsi/mcp2  logicalreferenced     970G                   -
tank/iscsi/mcp2  volmode               default                default
tank/iscsi/mcp2  snapshot_limit        none                   default
tank/iscsi/mcp2  snapshot_count        none                   default
tank/iscsi/mcp2  redundant_metadata    all                    default

Geom name: ada3
modified: false
state: OK
fwheads: 16
fwsectors: 63
last: 11721045127
first: 40
entries: 152
scheme: GPT
Providers:
1. Name: ada3p1
   Mediasize: 6001173463040 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   rawuuid: 7830d285-8079-11e6-b5ec-00e081c81bd6
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: tank0b
   length: 6001173463040
   offset: 1048576
   type: freebsd-zfs
   index: 1
   end: 11721043967
   start: 2048
Consumers:
1. Name: ada3
   Mediasize: 6001175126016 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e3

Geom name: ada4
modified: false
state: OK
fwheads: 16
fwsectors: 63
last: 11721045127
first: 40
entries: 152
scheme: GPT
Providers:
1. Name: ada4p1
   Mediasize: 6001173463040 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   rawuuid: 7b35d907-8079-11e6-b5ec-00e081c81bd6
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: tank1b
   length: 6001173463040
   offset: 1048576
   type: freebsd-zfs
   index: 1
   end: 11721043967
   start: 2048
Consumers:
1. Name: ada4
   Mediasize: 6001175126016 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e3
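For reference, a rough sketch of the kind of zvol and ctld(8) configuration involved here. The target IQN and most ctl.conf options below are illustrative assumptions, not a copy of the exact configuration in use; only the zvol name and 64K block size come from the report above.

# 1 TB zvol with 64K blocks, matching the NTFS cluster size
zfs create -V 1T -o volblocksize=64K tank/iscsi/mcp2

# /etc/ctl.conf -- minimal ctld(8) target exporting the zvol (example values)
portal-group pg0 {
        discovery-auth-group no-authentication
        listen 0.0.0.0
}
target iqn.2017-06.org.example:mcp2 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/zvol/tank/iscsi/mcp2
        }
}

# pick up the configuration change
service ctld reload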
Looking at your zpool... Correct me if I'm wrong, but don't your vdevs overlap somehow? I mean, the first one is on /dev/gpt/tank[01], and the second one on /dev/gpt/tank[01]b? Can you paste the output of "gpart list"?
I mean, the full output, the one that shows what tank0 (without the "b") is.
(In reply to Edward Tomasz Napierala from comment #2)

The tank0 and tank0b drives are completely separate. Had I more sense when adding the two new drives, they would have been named something like "tankb0" and "tankb1".

From "gpart show -l":

=>         40  11721045088  ada3  GPT  (5.5T)
           40         2008        - free -  (1.0M)
         2048  11721041920     1  tank0b  (5.5T)
  11721043968         1160        - free -  (580K)

=>         40  11721045088  ada4  GPT  (5.5T)
           40         2008        - free -  (1.0M)
         2048  11721041920     1  tank1b  (5.5T)
  11721043968         1160        - free -  (580K)

=>        34  5860533101  da0  GPT  (2.7T)
          34        2014       - free -  (1.0M)
        2048  5860529032    1  tank0  (2.7T)
  5860531080        2055       - free -  (1.0M)

=>        34  5860533101  da1  GPT  (2.7T)
          34        2014       - free -  (1.0M)
        2048  5860529032    1  tank1  (2.7T)
  5860531080        2055       - free -  (1.0M)

Here's the gpart list for tank0 and tank1:

Geom name: da0
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 5860533134
first: 34
entries: 128
scheme: GPT
Providers:
1. Name: da0p1
   Mediasize: 3000590864384 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   rawuuid: 618bb5ab-66a1-11e2-8e33-00e081c81bd6
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: tank0
   length: 3000590864384
   offset: 1048576
   type: freebsd-zfs
   index: 1
   end: 5860531079
   start: 2048
Consumers:
1. Name: da0
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e3

Geom name: da1
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 5860533134
first: 34
entries: 128
scheme: GPT
Providers:
1. Name: da1p1
   Mediasize: 3000590864384 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   rawuuid: b0e34fe7-6a9e-11e2-8e33-00e081c81bd6
   rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
   label: tank1
   length: 3000590864384
   offset: 1048576
   type: freebsd-zfs
   index: 1
   end: 5860531079
   start: 2048
Consumers:
1. Name: da1
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e3
To be honest, I doubt this is caused by iSCSI; I have no idea how it could corrupt a filesystem _below_ it. I think it's something between your disks and ZFS, and iSCSI just happens to generate a load high enough to trigger it. Is there a firmware update for your disks? Also, the errors seem to appear only on ada3 and ada4; can you recreate the zpool without those two disks and see if the problem goes away?
(In reply to Edward Tomasz Napierala from comment #4)

The ada4 and ada3 disks were added because the existing mirror set in the pool was full (as in "getting out of disk space errors", not "uncomfortably high"), so the original mirror pair is not likely to see any significant writes. This is a home NAS setup, and while I have backups in the cloud and some on LTO3 tapes, I don't have anywhere local to store what is on those drives, so I'm really reluctant to pull them.

I was not able to find any firmware updates for these drives. Both long and short SMART tests have been run without finding anything. Writing multiple 1TB files with "dd if=/dev/random" hasn't caused any problems. I'll try doing those "dd"s to zvols next (sketched after the SMART output below). The "write a couple of TB, then scrub" testing takes a while; any thoughts on other non-destructive testing I could do?

Is there some sane way to get a list of what the actual errors are in the pool (offset on the raw disk or such)? When I've done the backup, both drives have always reported the same number of errors. If both drives report exactly the same (bad) data at the same locations, that could perhaps suggest something useful?

I have both the chipset SATA ports and the SAS/SATA ports from the motherboard's LSI SAS controller, so I might be able to move the drives from one controller to the other.

Another device--a Server 2008 R2 box--has been doing similar iSCSI backups to another pool in this system for years. It, however, has much less to write and does so from spinning rust instead of SSD (although I would assume the GigE link is still the bottleneck). The drives for this other pool are on the same controller as ada3 and ada4. I'll see about having it do an extra backup or two to the "tank" pool to see if it causes any grief there.

For ada3:

smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-PRERELEASE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Deskstar NAS
Device Model:     HGST HDN726060ALE610
Serial Number:    NCGTEJ2S
LU WWN Device Id: 5 000cca 24dcb1ba7
Firmware Version: APGNT517
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Jun 10 15:49:55 2017 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

For ada4:

smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.1-PRERELEASE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Deskstar NAS
Device Model:     HGST HDN726060ALE610
Serial Number:    NCGTES1S
LU WWN Device Id: 5 000cca 24dcb1c7f
Firmware Version: APGNT517
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Jun 10 15:46:34 2017 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
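To make the planned dd-to-zvol test concrete, here is a rough sketch of what I have in mind (the test zvol name and its size are placeholders, not anything that exists on the pool yet):

# throwaway test zvol on the same pool, same 64K block size as the iSCSI volume
zfs create -V 1T -o volblocksize=64K tank/testvol

# fill it with random data
dd if=/dev/random of=/dev/zvol/tank/testvol bs=1m

# re-check checksums and list any objects with permanent errors
zpool scrub tank
zpool status -v tank

# clean up afterwards
zfs destroy tank/testvol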
I observed similar behaviour on one of my SAN systems. In my opinion, iSCSI + zfs is broken somewhere between r310734 and r320056.

Symptoms:
- random fatal trap 12 panics
- random general protection fault panics
- random "Solaris(panic): zfs: allocating allocated segment" panics
- zfs pool corruption that happens ONLY on pools that serve zvols as iSCSI target devices
- zfs pool corruption happening _on the fly_, without the system panicking
- no zfs corruption on the zfs pools not serving devices for the iSCSI targets

I have 7 SAN systems running this setup. None of the systems running r310734 shows this behaviour. The only system running something more recent (at least r320056, and up to 11.1-RELEASE) was affected by this, and became healthy when downgraded to r310734 (r310734 was chosen simply because it's the most recent revision shared by all 7).

At first I had the strong impression that we had a hardware problem. Memtest86+ found no problems. We found multiple SMART ATA errors that were caused by bad cabling, and that seemed to be the root cause for the moment, but after switching to a new cable (and also to a new HBA, a new server and a new enclosure) the problem hadn't vanished. It was solved only by the downgrade to r310734. The SAN system has now been up and running for 48 hours without pool corruption (which usually happened within the first 12 hours of running) and without panics (which usually happened within the first 24 hours).

Unfortunately, I have no crashdumps, because mpr(4) blocks crashdump collection (see the discussion on freebsd-scsi@). I have only the backtraces from serial-over-ethernet IPMI, which I will attach here.

Problem initial description:
https://lists.freebsd.org/pipermail/freebsd-fs/2017-August/025099.html
Created attachment 185276 [details]
backtraces showing the issue

These are panic backtraces collected over one month from the affected system, running multiple affected revisions up to 11.1.
Workaround (provided by mav@): set vfs.zfs.vol.immediate_write_sz to 131072. No zvol corruption observed after this.
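In case it saves someone a lookup, applying the workaround is just a sysctl. Persisting it across reboots is my own addition; if the knob is only a boot-time tunable on your revision, it may need to go into /boot/loader.conf instead:

# apply at runtime
sysctl vfs.zfs.vol.immediate_write_sz=131072

# optionally persist across reboots
echo 'vfs.zfs.vol.immediate_write_sz=131072' >> /etc/sysctl.conf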
Good catch! Do you know _why_ it fixes the problem?