Bug 241980 - panic: I/O to pool appears to be hung on vdev
Summary: panic: I/O to pool appears to be hung on vdev
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.3-STABLE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2019-11-15 08:01 UTC by Eugene Grosbein
Modified: 2022-10-12 00:50 UTC (History)
3 users

See Also:


Attachments
more debug logging (2.91 KB, patch)
2019-11-16 09:59 UTC, Eugene Grosbein
more debug logging (4.14 KB, patch)
2019-11-24 07:04 UTC, Eugene Grosbein

Description Eugene Grosbein freebsd_committer freebsd_triage 2019-11-15 08:01:50 UTC
11.2-STABLE/amd64 r344922 panicked with a message like: panic: I/O to pool appears to be hung on vdev

It panicked the same way two days after the upgrade to 11.3-STABLE r354667:
http://www.grosbein.net/freebsd/zpanic.png

Hardware: Dell PowerEdge R640 with 360G RAM and an mrsas(4)-supported controller, PERC H730/P Mini (LSI MegaRAID SAS-3 3108 [Invader])

There are 7 SSD devices: two of them hold the FreeBSD installation (a separate boot pool) and
the other five are GELI-encrypted and combined into another (RAIDZ1) pool, 'sata', shown in the screenshot.

vfs.zfs.arc_max=160g

# zpool status -v sata
  pool: sata
 state: ONLINE
  scan: resilvered 58,0G in 0 days 00:08:02 with 0 errors on Wed Mar 13 21:34:28 2019
config:

        NAME         STATE     READ WRITE CKSUM
        sata         ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            da2.eli  ONLINE       0     0     0
            da3.eli  ONLINE       0     0     0
            da4.eli  ONLINE       0     0     0
            da5.eli  ONLINE       0     0     0
            da6.eli  ONLINE       0     0     0
        logs
          raid/r0s3  ONLINE       0     0     0
        cache
          raid/r1    ONLINE       0     0     0

errors: No known data errors

SSD devices da0 and da1 are combined into a GEOM_RAID array with 3 volumes. The mirrored boot volume /dev/raid/r0 spans the beginning of both devices, carries MBR partitioning and holds the non-redundant ZFS boot pool "os" on /dev/raid/r0s2, while /dev/raid/r0s1 is swap space and /dev/raid/r0s3 acts as the ZFS log device for the "sata" pool shown above. The tail of da0 (/dev/raid/r1) is used as ZFS cache (L2ARC) for the "sata" pool, and the tail of da1 (/dev/raid/r2) is unused (reserved).

# gpart show
=>       63  448790465  raid/r0  MBR  (214G)
         63        441           - free -  (221K)
        504  134217728        1  freebsd  (64G)
  134218232  104857600        2  freebsd  [active]  (50G)
  239075832  209714696        3  freebsd  (100G)
# graid list
Geom name: Promise
State: OPTIMAL
Metadata: Promise
Providers:
1. Name: raid/r1
   Mediasize: 250256121856 (233G)
   Sectorsize: 512
   Stripesize: 131072
   Stripeoffset: 0
   Mode: r1w1e1
   Subdisks: da0 (ACTIVE)
   Dirty: Yes
   State: OPTIMAL
   Strip: 131072
   Components: 1
   Transformation: CONCAT
   RAIDLevel: RAID0
   Label: r1
   descr: Promise RAID0 volume
2. Name: raid/r0
   Mediasize: 229780750336 (214G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r3w3e6
   Subdisks: da0 (ACTIVE), da1 (ACTIVE)
   Dirty: Yes
   State: OPTIMAL
   Strip: 131072
   Components: 2
   Transformation: RAID1
   RAIDLevel: RAID1
   Label: r0
   descr: Promise RAID1 volume
3. Name: raid/r2
   Mediasize: 250256121856 (233G)
   Sectorsize: 512
   Stripesize: 131072
   Stripeoffset: 0
   Mode: r0w0e0
   Subdisks: da1 (ACTIVE)
   Dirty: No
   State: OPTIMAL
   Strip: 131072
   Components: 1
   Transformation: CONCAT
   RAIDLevel: RAID0
   Label: r2
   descr: Promise RAID0 volume
Consumers:
1. Name: da0
   Mediasize: 480103981056 (447G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   ReadErrors: 0
   Subdisks: r1(r1):0@229780750336, r0(r0):0@0
   State: ACTIVE (ACTIVE, ACTIVE)
2. Name: da1
   Mediasize: 480103981056 (447G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   ReadErrors: 0
   Subdisks: r0(r0):1@0, r2(r2):0@229780750336
   State: ACTIVE (ACTIVE, ACTIVE)

From dmesg.boot:

AVAGO MegaRAID SAS FreeBSD mrsas driver version: 07.709.04.00-fbsd
mrsas0: <AVAGO Invader SAS Controller> port 0x4000-0x40ff mem 0xa0c00000-0xa0c0ffff,0xa0b00000-0xa0bfffff irq 32 at device 0.0 numa-domain 0 on pci5
mrsas0: FW now in Ready state
mrsas0: Using MSI-X with 32 number of vectors
mrsas0: FW supports <96> MSIX vector,Online CPU 32 Current MSIX <32>
mrsas0: max sge: 0x46, max chain frame size: 0x400, max fw cmd: 0x39f
mrsas0: Issuing IOC INIT command to FW.
mrsas0: IOC INIT response received from FW.
mrsas0: System PD created target ID: 0x0
mrsas0: System PD created target ID: 0x1
mrsas0: System PD created target ID: 0x2
mrsas0: System PD created target ID: 0x3
mrsas0: System PD created target ID: 0x4
mrsas0: System PD created target ID: 0x5
mrsas0: System PD created target ID: 0x6
mrsas0: FW supports: UnevenSpanSupport=1
mrsas0: max_fw_cmds: 927  max_scsi_cmds: 911
mrsas0: MSI-x interrupts setup success
mrsas0: mrsas_ocr_thread

da5 at mrsas0 bus 1 scbus17 target 5 lun 0
da2 at mrsas0 bus 1 scbus17 target 2 lun 0
da3 at mrsas0 bus 1 scbus17 target 3 lun 0
da4 at mrsas0 bus 1 scbus17 target 4 lun 0
da4: <ATA Micron_5200_MTFD U404> Fixed Direct Access SPC-4 SCSI device
da4: Serial Number 18531FF58A9B
da4: 150.000MB/s transfers
da4: 3662830MB (7501476528 512 byte sectors)
da2: <ATA Micron_5200_MTFD U404> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number 18532039E29C
da2: 150.000MB/s transfers
da2: 3662830MB (7501476528 512 byte sectors)
da3: <ATA Micron_5200_MTFD U404> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number 18531FF58CF6
da3: 150.000MB/s transfers
da3: 3662830MB (7501476528 512 byte sectors)
da5: <ATA Micron_5200_MTFD U404> Fixed Direct Access SPC-4 SCSI device
da5: Serial Number 18531FF586C2
da5: 150.000MB/s transfers
da5: 3662830MB (7501476528 512 byte sectors)
da6 at mrsas0 bus 1 scbus17 target 6 lun 0
da6: <ATA XA3840LE10063 00ZU> Fixed Direct Access SPC-4 SCSI device
da6: Serial Number HKT01DYR
da6: 150.000MB/s transfers
da6: 3662830MB (7501476528 512 byte sectors)
Comment 1 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-15 08:05:24 UTC
The system routinely creates snapshots for ZVOLs and removes old snapshots, so gstat -adI3s -f 'da[2-6].*' shows that the system constantly floods the SSDs with ATA TRIM requests.

About 1.5 days after the update to 11.3-STABLE I decreased kern.cam.da.[2-6].delete_max from the default 17179607040 to 262144, hoping to reduce the latency of read/write operations, but it panicked again several hours after that.
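For reference, a minimal sketch of that adjustment (device numbers and the value are the ones from this report and may differ elsewhere):

# lower the maximum size of a single BIO_DELETE (TRIM) request per disk
for d in 2 3 4 5 6; do
        sysctl kern.cam.da.${d}.delete_max=262144
done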
Comment 2 Andriy Gapon freebsd_committer freebsd_triage 2019-11-15 08:05:57 UTC
The reported vdev is da2.eli. Maybe a GELI problem or mrsas or hardware.
Comment 3 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-15 08:18:02 UTC
I wonder why the kernel panics instead of disconnecting the device from the RAIDZ1 pool and continuing in a degraded state.
Comment 4 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-15 08:35:56 UTC
spa_misc.c:

/*
 * Expiration time in milliseconds. This value has two meanings. First it is
 * used to determine when the spa_deadman() logic should fire. By default the
 * spa_deadman() will fire if spa_sync() has not completed in 1000 seconds.
 * Secondly, the value determines if an I/O is considered "hung". Any I/O that
 * has not completed in zfs_deadman_synctime_ms is considered "hung" resulting
 * in a system panic.
 */
uint64_t zfs_deadman_synctime_ms = 1000000ULL;

Is it possible that the queue of ATA TRIM requests grows faster than the underlying SSDs can physically erase their cells, so the timeout eventually fires? And da2 is just the first component of the pool.

SMART Extended offline test passed successfully for da2.
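As an aside, on stable/11 the deadman knobs quoted above should also be visible as sysctls (names assumed from the legacy ZFS code), so the effective settings can be checked at run time:

# check the deadman settings; deadman_synctime_ms is the 1000000 ms limit from spa_misc.c
sysctl vfs.zfs.deadman_enabled vfs.zfs.deadman_synctime_ms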
Comment 5 Andriy Gapon freebsd_committer freebsd_triage 2019-11-15 08:40:05 UTC
(In reply to Eugene Grosbein from comment #3)
Because it's not an I/O error. It's an outstanding request with unknown status.
Comment 6 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-15 08:57:53 UTC
Sorry, I did not understand your reply. I did not ask "why". :-)
Comment 7 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-15 09:00:37 UTC
Oh, now I get it: I did ask why it panics.

I still don't understand why this difference matters. It should still disconnect the device, just as if it were an I/O error.
Comment 8 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-15 09:04:32 UTC
Is it possible to somehow see the state/length of the pending ZFS I/O request queues that vdev_deadman() checks?
Comment 9 Andriy Gapon freebsd_committer freebsd_triage 2019-11-15 09:06:29 UTC
(In reply to Eugene Grosbein from comment #7)
You can google for prior discussions on this topic. Short story: unknown is unknown, there can be DMA, etc.

(In reply to Eugene Grosbein from comment #8)
I think that only in a crash dump.
Comment 10 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-15 09:13:47 UTC
It just crashed again; I have not prepared crash dumps yet. It got much worse after the upgrade from 11.2 to 11.3 two days ago, and I suspect some logic error, as all the hardware is new and ran for long periods without panicking under 11.2.
Comment 11 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-15 09:43:39 UTC
No, the latest crash was due to my careless play with a watchdog. Anyway, does GENERIC in 11.3-STABLE have enough options for a usable crash dump, or do I need to add something?
Comment 12 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-15 09:52:17 UTC
ZFS seems to have some means to produce debugging output on the fly, zfs_debug.c:

/*
 * Print these messages by running:
 * echo ::zfs_dbgmsg | mdb -k
 *
 * Monitor these messages by running:
 * dtrace -qn 'zfs-dbgmsg{printf("%s\n", stringof(arg0))}'
 *
 * When used with libzpool, monitor with:
 * dtrace -qn 'zfs$pid::zfs_dbgmsg:probe1{printf("%s\n", copyinstr(arg1))}'
 */
void
zfs_dbgmsg(const char *fmt, ...)

Does it work with FreeBSD?
Comment 13 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-15 10:18:21 UTC
(In reply to Eugene Grosbein from comment #12)

It seems "zdb -G" shows that debugging output. Also, I've discovered loader tunnable vfs.zfs.deadman_enabled. I'm going to switch it off and maybe add some more debugging to ZFS kernel code to monitor this.
Comment 14 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-15 10:27:00 UTC
Also, it makes no sense to me to keep vfs.zfs.deadman_enabled read-only. It should be read-write.
Comment 15 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-16 07:12:52 UTC
It crashed again, this time with the same "pool appears to be hung" message, after 18 hours of uptime. I had enabled crash dumps with "dumpon -v /dev/raid/r2", which is unused and has 233GB. But the system just hung again: no crash dump generation, no reboot.
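A minimal sketch of making the dump device persistent, assuming the same spare volume is used; dumpon(8) alone does not survive a reboot:

# /etc/rc.conf
dumpdev="/dev/raid/r2"    # savecore(8) then collects the dump into /var/crash on the next boot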
Comment 16 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-16 07:31:27 UTC
Here is the full kernel config file:

include GENERIC
ident   COL

options KDB
options KDB_UNATTENDED
options KDB_TRACE
options DDB
options GDB
#EOF
Comment 17 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-16 09:58:47 UTC
Now I'm pretty sure the hardware is fine and there is some bug in the ZFS logic. I suspect this system triggers it by the combination of many ZVOL snapshots created/removed every hour (tens or hundreds) with an SSD-only RAIDZ1 pool (no HDDs), so a huge number of ATA TRIM operations is added to the queues (AVL trees, in fact) every hour.

Because of the kernel hangs and the inability to produce a crash dump, I've patched the ZFS code to change vfs.zfs.deadman_enabled into a read/write sysctl. The patch also disables the panic when vfs.zfs.deadman_enabled=2 and adds additional logging to syslog (not via zfs_dbgmsg) that fires early, when a queued zio hits 1/8th of the standard limit of 1000 seconds.
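Intended usage of the patched knob as described above (hypothetical until the attached patch is applied):

# keep the deadman scan running but log slow I/O instead of panicking
sysctl vfs.zfs.deadman_enabled=2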
Comment 18 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-16 09:59:22 UTC
Created attachment 209178 [details]
more debug logging
Comment 19 Andriy Gapon freebsd_committer freebsd_triage 2019-11-20 08:11:06 UTC
(In reply to Eugene Grosbein from comment #18)
There is no reason why deadman_enabled had to be read-only. But the code needs to be careful to stop and restart the callout when the deadman logic is disabled and enabled. Maybe we could have the callout always running and just check the flag in the callout callback.

Also, I think that there is a bug in the early warning code. I think that the warning timeout is 8 times longer than the regular timeout.

The best way to debug the problem would be to get a crash dump.
Maybe you could try to add a dedicated dump disk and connect it to a different controller, etc.
But if you can't get that to work, then I suggest that you add more useful information to the early warning. You can print a pointer to the zio and a pointer to the vdev. Then you'd be able to use kgdb on a live kernel and you would have a place to start.

Just my 2 cents as they say.
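A rough sketch of that live-kernel inspection, assuming the early warning prints the zio and vdev pointers (ZIO_ADDR and VDEV_ADDR are placeholders for the printed addresses):

# kgdb /boot/kernel/kernel /dev/mem
(kgdb) print *(struct zio *)ZIO_ADDR
(kgdb) print *(struct vdev *)VDEV_ADDR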
Comment 20 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-20 09:31:36 UTC
This is a dedicated server at Hetzner. Well, it has another SATA controller, seen as:

ahci0@pci0:0:17:5:      class=0x010601 card=0x07161028 chip=0xa1d28086 rev=0x09 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'C620 Series Chipset Family SSATA Controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA

But first I'd like to disable the zfs_deadman panic and see whether the separate ZFS boot pool is still alive at such a moment. I already tried booting the patched zfs.ko by means of nextboot.conf and module_path="/boot/nextboot;/boot/kernel;/boot/modules". Additionally, I've enabled the hardware watchdog in the BIOS setup (the server has a separate public IP for console access via virtual KVM) and enabled watchdogd.

Sadly, I made a typo and it booted with the default unpatched zfs.ko, and I missed that, so another identical panic that occurred today gave me little but interesting information: the watchdog works and the dmesg buffer is not cleared between reboots, so I have the KDB trace in /var/log/kernel.log after the reboot (kern.* messages are logged there). I've fixed the typo, will reboot again and wait.

Thank you for your remark about the early warning; now I see you are right. I'll fix it.
Comment 21 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-24 06:54:01 UTC
It took me an unexpectedly long time to get debugging output, as I struggled to force the loader to perform a one-time (nextboot) load of the patched zfs.ko while being able to perform only one reboot per day on this production machine. It turns out we have a documented way to do this for the kernel but not for modules. After several attempts, zfs_name="/boot/nextboot/zfs.ko" in /boot/nextboot.conf finally did it.
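For reference, the resulting one-shot configuration looked roughly like this (nextboot(8) itself adds the nextboot_enable line):

# /boot/nextboot.conf -- consumed by the loader on the next boot only
nextboot_enable="YES"
zfs_name="/boot/nextboot/zfs.ko"
module_path="/boot/nextboot;/boot/kernel;/boot/modules"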

# TZ=UTC sysctl kern.boottime
kern.boottime: { sec = 1574393119, usec = 301171 } Fri Nov 22 03:25:19 2019

Local time is UTC+3. After two days of uptime, I got this in the log:

Nov 24 06:24:36 col02 kernel: sata SLOW IO: zio io_type 3 timestamp 171750435607373ns, delta 1006734932093ns, last io 172757121535759ns
I/O to pool 'sata' appears to be hung on vdev guid 3313589389580178043 at '/dev/da2.eli' active zio 65

I've used this script to convert ZFS timestamps (getnanouptime expressed in nanoseconds) to readable time:

#!/bin/sh
# Usage: gethrtime <timestamp-in-nanoseconds>
# Strip the last 9 digits (ns -> s), add the boot time in epoch seconds from
# kern.boottime, and print the result as a readable date.
date -jr $(sysctl -n kern.boottime | awk -vt=${1%?????????} -F'[, ]' '{print t+$4}')
#EOF

First, there was no hardware hang. The system is alive, both ZFS pools still work, and dd if=/dev/da2.eli works too. Second, io_type 3 is ZIO_TYPE_FREE from /sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h.

# gethrtime 172757121535759
Sun Nov 24 06:24:36 MSK 2019

The 65 active zio's correspond to the L(q) shown by gstat:

# gstat -adI1s -f 'da[2-6].*'
dT: 1.001s  w: 1.000s  filter: da[2-6].*
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
   65    774     37   3924    9.0    540  25707    0.7    193   8532  212.6   84.5| da2
   65    780     39   4144   14.3    544  26007    0.9    193   7724  241.7   91.0| da3
   65    771     33   3345    9.3    542  24448    0.7    192   8100  255.5   90.7| da4
   65    851     38   4004    7.8    554  28840    1.4    255   9679  236.0   97.1| da5
   65    766     36   4100    4.5    534  22730    0.5    192   8911  208.8   75.6| da6
   65    693     23   4144   14.8    474  26007    1.2    193   7724  243.0   95.0| da3.eli
   65    691     20   3345    9.4    476  24448    1.0    192   8100  255.8   95.1| da4.eli
   65    688     22   3924    9.8    470  25707    1.0    193   8532  213.0   89.1| da2.eli
   65    753     22   4004   11.6    473  28840    1.7    255   9679  238.3  100.7| da5.eli
   65    691     20   4100    5.6    476  22730    0.7    192   8911  210.6   80.7| da6.eli
Comment 22 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-24 07:04:13 UTC
Created attachment 209374 [details]
more debug logging

Latest changes to ZFS code used to get debug info.
Comment 23 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-24 07:44:16 UTC
As I suspected, ZFS issues a large number of BIO_DELETE/ZIO_TYPE_FREE requests due to the massive removal of hourly/daily snapshots on this SSD-only pool. Clearing of SSD cells may be slow.

The system has a daily cron job that starts in the early morning at 6:05 local time. It creates many new daily snapshots for ZVOLs and removes old ones. I got the first debugging warning at 6:24:36, when a ZIO_TYPE_FREE zio had spent 1006/8 = 125.75 seconds in the AVL tree of active zio's.
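Spelling out the arithmetic behind the 1006/8 figure (the logged delta is in nanoseconds):

# the logged delta in seconds, and one eighth of it
echo 'scale=2; 1006734932093 / 10^9' | bc        # ~1006.73
echo 'scale=2; 1006734932093 / 10^9 / 8' | bc    # ~125.84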

Then there were more such events (50 in total) with the "delta" increasing up to 1230 seconds at 06:28:26, after which the messages stopped. They were not only for ZIO_TYPE_FREE(3) zio's, but for ZIO_TYPE_WRITE(2) too, and for all components of the pool plus the separate ZIL device /dev/raid/r0s3, which is part of the GEOM_RAID mirror volume based on the other pair of SSDs, da[01].

Here are all messages that appeared in the log for the very first event at 06:24:36.

sata SLOW IO: zio io_type 3 timestamp 171750435607373ns, delta 1006734932093ns, last io 172757121535759ns
I/O to pool 'sata' appears to be hung on vdev guid 3313589389580178043 at '/dev/da2.eli' active zio 65
SLOW IO: sata is child of sata
sata SLOW IO: zio io_type 2 timestamp 171757174663043ns, delta 999995876423ns, last io 172756998535496ns
I/O to pool 'sata' appears to be hung on vdev guid 17157972224957886076 at '/dev/da3.eli' active zio 65
SLOW IO: sata is child of sata
sata SLOW IO: zio io_type 2 timestamp 171757172636758ns, delta 999997902708ns, last io 172757029559490ns
I/O to pool 'sata' appears to be hung on vdev guid 11111943829132819397 at '/dev/da4.eli' active zio 65
SLOW IO: sata is child of sata
sata SLOW IO: zio io_type 2 timestamp 171757172636758ns, delta 999997902708ns, last io 172756997538959ns
I/O to pool 'sata' appears to be hung on vdev guid 13912528878380060504 at '/dev/da5.eli' active zio 65
SLOW IO: sata is child of sata
sata SLOW IO: zio io_type 2 timestamp 171757172636758ns, delta 999997902708ns, last io 172756973537187ns
I/O to pool 'sata' appears to be hung on vdev guid 18428510808262122301 at '/dev/da6.eli' active zio 65
SLOW IO: sata is child of sata
SLOW IO: sata is child of sata
sata SLOW IO: zio io_type 2 timestamp 171985896582124ns, delta 771273957342ns, last io 172756042541811ns
I/O to pool 'sata' appears to be hung on vdev guid 15059311870935027496 at '/dev/raid/r0s3' active zio 2
SLOW IO: sata is child of sata
Comment 24 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-24 07:56:11 UTC
After the continuous stream of debugging messages stopped at 06:28:26, messages appeared again but only three times: once at 06:57:20 (all ZIO_TYPE_FREE), once at 06:57:25 (all ZIO_TYPE_FREE) and last at 06:57:30 (ZIO_TYPE_WRITE for /dev/da2.eli and ZIO_TYPE_FREE for both /dev/da3.eli and /dev/da5.eli).

Then they ceased completely and the system runs fine.
Comment 25 Andriy Gapon freebsd_committer freebsd_triage 2019-11-25 22:09:36 UTC
(In reply to Eugene Grosbein from comment #23)
So, I am not sure what this tells us.
An I/O request taking tens of minutes does not look normal to me, even if it's a TRIM request.
Maybe disable the deadman logic, maybe disable ZFS TRIM.
Or maybe there is something special about the SSDs or something that can be tuned.
But this does not look like a problem with "core" ZFS.

I think that Warner Losh looked deeply into the SSD/TRIM situation, maybe he can give some additional ideas.
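If disabling ZFS TRIM were tried, legacy ZFS on stable/11 has a loader tunable for it (name assumed from the stable/11 ZFS TRIM code):

# /boot/loader.conf
vfs.zfs.trim.enabled="0"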
Comment 26 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-26 06:06:54 UTC
At first this hardware used the mfi(4) driver, which has no TRIM support at all, and write performance was very bad. Switching to mrsas(4) and re-trimming the SSDs one by one made performance normal. Disabling TRIM is not an option.
Comment 27 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-27 08:31:33 UTC
(In reply to Andriy Gapon from comment #25)

I still think the "core ZFS" logic of panicking the system is wrong. For a pool with redundancy it should just disconnect the "non-responding" device instead and continue working with the pool in a degraded state. This is why we add costly redundancy to pools.
Comment 28 Eugene Grosbein freebsd_committer freebsd_triage 2021-01-08 20:34:14 UTC
Closing this as I no longer have that hardware.

For the record: as a workaround against the repeating panics, the ZFS deadman feature was disabled, but that did not solve the slow data access problem, which was fatal for guest OSes inside bhyve virtual machines. So the SSDs with bad TRIM performance were replaced with NVMe drives.