11.2-STABLE/amd64 r344922 panicked with a message like:

panic: I/O to pool appears to be hung on vdev

It panicked the same way 2 days after an upgrade to 11.3-STABLE r354667:
http://www.grosbein.net/freebsd/zpanic.png

Hardware: Dell PowerEdge R640 with 360G RAM, mrsas(4)-supported controller PERC H730/P Mini LSI MegaRAID SAS-3 3108 [Invader], 7 SSD devices; two of them keep the FreeBSD installation (a distinct boot pool) and the five others are GELI-encrypted and combined into another (RAIDZ1) pool 'sata' mentioned on the screenshot.

vfs.zfs.arc_max=160g

# zpool status -v sata
  pool: sata
 state: ONLINE
  scan: resilvered 58,0G in 0 days 00:08:02 with 0 errors on Wed Mar 13 21:34:28 2019
config:

        NAME         STATE     READ WRITE CKSUM
        sata         ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            da2.eli  ONLINE       0     0     0
            da3.eli  ONLINE       0     0     0
            da4.eli  ONLINE       0     0     0
            da5.eli  ONLINE       0     0     0
            da6.eli  ONLINE       0     0     0
        logs
          raid/r0s3  ONLINE       0     0     0
        cache
          raid/r1    ONLINE       0     0     0

errors: No known data errors

SSD devices da0 and da1 are combined into a GEOM_RAID array with 3 volumes: the mirrored boot volume /dev/raid/r0 spans the beginning of both devices, has MBR partitioning and keeps the non-redundant ZFS boot pool "os" as /dev/raid/r0s2, while /dev/raid/r0s1 is swap space and /dev/raid/r0s3 acts as the ZFS log device for the "sata" pool shown above. The tail of the da0 SSD (/dev/raid/r1) is used as ZFS cache (L2ARC) for the "sata" pool and the tail of da1 (/dev/raid/r2) is unused (reserved).

# gpart show
=>       63  448790465  raid/r0  MBR  (214G)
         63        441           - free -  (221K)
        504  134217728        1  freebsd  (64G)
  134218232  104857600        2  freebsd  [active]  (50G)
  239075832  209714696        3  freebsd  (100G)

# graid list
Geom name: Promise
State: OPTIMAL
Metadata: Promise
Providers:
1. Name: raid/r1
   Mediasize: 250256121856 (233G)
   Sectorsize: 512
   Stripesize: 131072
   Stripeoffset: 0
   Mode: r1w1e1
   Subdisks: da0 (ACTIVE)
   Dirty: Yes
   State: OPTIMAL
   Strip: 131072
   Components: 1
   Transformation: CONCAT
   RAIDLevel: RAID0
   Label: r1
   descr: Promise RAID0 volume
2. Name: raid/r0
   Mediasize: 229780750336 (214G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r3w3e6
   Subdisks: da0 (ACTIVE), da1 (ACTIVE)
   Dirty: Yes
   State: OPTIMAL
   Strip: 131072
   Components: 2
   Transformation: RAID1
   RAIDLevel: RAID1
   Label: r0
   descr: Promise RAID1 volume
3. Name: raid/r2
   Mediasize: 250256121856 (233G)
   Sectorsize: 512
   Stripesize: 131072
   Stripeoffset: 0
   Mode: r0w0e0
   Subdisks: da1 (ACTIVE)
   Dirty: No
   State: OPTIMAL
   Strip: 131072
   Components: 1
   Transformation: CONCAT
   RAIDLevel: RAID0
   Label: r2
   descr: Promise RAID0 volume
Consumers:
1. Name: da0
   Mediasize: 480103981056 (447G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   ReadErrors: 0
   Subdisks: r1(r1):0@229780750336, r0(r0):0@0
   State: ACTIVE (ACTIVE, ACTIVE)
2. Name: da1
   Mediasize: 480103981056 (447G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1
   ReadErrors: 0
   Subdisks: r0(r0):1@0, r2(r2):0@229780750336
   State: ACTIVE (ACTIVE, ACTIVE)

From dmesg.boot:

AVAGO MegaRAID SAS FreeBSD mrsas driver version: 07.709.04.00-fbsd
mrsas0: <AVAGO Invader SAS Controller> port 0x4000-0x40ff mem 0xa0c00000-0xa0c0ffff,0xa0b00000-0xa0bfffff irq 32 at device 0.0 numa-domain 0 on pci5
mrsas0: FW now in Ready state
mrsas0: Using MSI-X with 32 number of vectors
mrsas0: FW supports <96> MSIX vector,Online CPU 32 Current MSIX <32>
mrsas0: max sge: 0x46, max chain frame size: 0x400, max fw cmd: 0x39f
mrsas0: Issuing IOC INIT command to FW.
mrsas0: IOC INIT response received from FW.
mrsas0: System PD created target ID: 0x0
mrsas0: System PD created target ID: 0x1
mrsas0: System PD created target ID: 0x2
mrsas0: System PD created target ID: 0x3
mrsas0: System PD created target ID: 0x4
mrsas0: System PD created target ID: 0x5
mrsas0: System PD created target ID: 0x6
mrsas0: FW supports: UnevenSpanSupport=1
mrsas0: max_fw_cmds: 927 max_scsi_cmds: 911
mrsas0: MSI-x interrupts setup success
mrsas0: mrsas_ocr_thread
da5 at mrsas0 bus 1 scbus17 target 5 lun 0
da2 at mrsas0 bus 1 scbus17 target 2 lun 0
da3 at mrsas0 bus 1 scbus17 target 3 lun 0
da4 at mrsas0 bus 1 scbus17 target 4 lun 0
da4: <ATA Micron_5200_MTFD U404> Fixed Direct Access SPC-4 SCSI device
da4: Serial Number 18531FF58A9B
da4: 150.000MB/s transfers
da4: 3662830MB (7501476528 512 byte sectors)
da2: <ATA Micron_5200_MTFD U404> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number 18532039E29C
da2: 150.000MB/s transfers
da2: 3662830MB (7501476528 512 byte sectors)
da3: <ATA Micron_5200_MTFD U404> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number 18531FF58CF6
da3: 150.000MB/s transfers
da3: 3662830MB (7501476528 512 byte sectors)
da5: <ATA Micron_5200_MTFD U404> Fixed Direct Access SPC-4 SCSI device
da5: Serial Number 18531FF586C2
da5: 150.000MB/s transfers
da5: 3662830MB (7501476528 512 byte sectors)
da6 at mrsas0 bus 1 scbus17 target 6 lun 0
da6: <ATA XA3840LE10063 00ZU> Fixed Direct Access SPC-4 SCSI device
da6: Serial Number HKT01DYR
da6: 150.000MB/s transfers
da6: 3662830MB (7501476528 512 byte sectors)
The system routinely creates snapshots for ZVOLs and removes old snapshots, so gstat -adI3s -f 'da[2-6].*' shows that the system constantly floods the SSDs with ATA TRIM requests. I decreased kern.cam.da.[2-6].delete_max from the default 17179607040 to 262144 about 1.5 days after the update to 11.3-STABLE, hoping to decrease the latency of read/write operations, but it panicked again several hours after that.
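For reference, the tuning amounted to something like this (a sketch, not the exact commands used; device numbers are this system's da2..da6):

#!/bin/sh
# sketch of the delete_max tuning mentioned above: cap the size of a single
# BIO_DELETE (ATA TRIM) request per da(4) device; device list is da2..da6
for d in 2 3 4 5 6; do
        # show the current delete method (e.g. ATA_TRIM) and the current limit
        sysctl kern.cam.da.${d}.delete_method kern.cam.da.${d}.delete_max
        # lower the per-request delete size to 256 KiB
        sysctl kern.cam.da.${d}.delete_max=262144
done
#EOF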
The reported vdev is da2.eli. Maybe it is a GELI problem, or mrsas, or hardware.
I wonder why the kernel panics instead of disconnecting the device from the RAIDZ1 pool and continuing in a degraded state.
spa_misc.c:

/*
 * Expiration time in milliseconds. This value has two meanings. First it is
 * used to determine when the spa_deadman() logic should fire. By default the
 * spa_deadman() will fire if spa_sync() has not completed in 1000 seconds.
 * Secondly, the value determines if an I/O is considered "hung". Any I/O that
 * has not completed in zfs_deadman_synctime_ms is considered "hung" resulting
 * in a system panic.
 */
uint64_t zfs_deadman_synctime_ms = 1000000ULL;

Is it possible that the queue of ATA TRIM requests grows faster than the underlying SSDs can physically erase their cells, so the timeout eventually fires? And da2 is just the first component of the pool. A SMART extended offline test passed successfully for da2.
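For reference, a quick way to check the current threshold on a running system; the sysctl name is assumed to map directly to the variable quoted above (verify with sysctl -a | grep deadman):

# assumed mapping of zfs_deadman_synctime_ms to a sysctl/loader tunable
sysctl vfs.zfs.deadman_synctime_ms   # default 1000000 ms = 1000 s before an I/O is declared "hung"
# as a loader tunable, e.g. to give slow TRIM more headroom (value is a guess, not a recommendation):
#   echo 'vfs.zfs.deadman_synctime_ms="2000000"' >> /boot/loader.conf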
(In reply to Eugene Grosbein from comment #3)
Because it's not an I/O error. It's an outstanding request with unknown status.
Sorry, I did not understand your reply. I have not asked "why", have I? :-)
Oh, now I get it: I did ask why it panics. I still don't understand why this distinction matters. It should still disconnect the device, just as if it were an I/O error.
Is it possible to somehow see the state/length of the pending ZFS I/O request queues that vdev_deadman() checks?
(In reply to Eugene Grosbein from comment #7)
You can google for prior discussions on this topic. Short story: unknown is unknown, there can still be DMA in flight, etc.

(In reply to Eugene Grosbein from comment #8)
I think that only in a crash dump.
It just crashed again; I have not prepared crash dumps yet. It got much worse after the upgrade from 11.2 to 11.3 two days ago, and I suspect some logic error, as all the hardware is new and ran for long periods without panicking on 11.2.
No, the latest crash was due to my inaccurate play with a watchdog... Anyway, does GENERIC in 11.3-STABLE have enough options for a usable crash dump, or do I need to add something?
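For reference, the usual crash dump setup looks roughly like this (a sketch; the device here is the spare GEOM_RAID volume mentioned later in this report, and savecore(8) picks the dump up into dumpdir on the next boot):

#!/bin/sh
# standard FreeBSD crash-dump setup sketch; adjust dumpdev to a real spare device
sysrc dumpdev="/dev/raid/r2"   # or dumpdev="AUTO" to use the configured swap device
sysrc dumpdir="/var/crash"
dumpon -v /dev/raid/r2         # activate immediately, without waiting for a reboot
#EOF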
ZFS seems to have some means to produce debugging output on the fly, zfs_debug.c:

/*
 * Print these messages by running:
 *         echo ::zfs_dbgmsg | mdb -k
 *
 * Monitor these messages by running:
 *         dtrace -qn 'zfs-dbgmsg{printf("%s\n", stringof(arg0))}'
 *
 * When used with libzpool, monitor with:
 *         dtrace -qn 'zfs$pid::zfs_dbgmsg:probe1{printf("%s\n", copyinstr(arg1))}'
 */
void
zfs_dbgmsg(const char *fmt, ...)

Does it work with FreeBSD?
(In reply to Eugene Grosbein from comment #12)
It seems "zdb -G" shows that debugging output. Also, I've discovered the loader tunable vfs.zfs.deadman_enabled. I'm going to switch it off and maybe add some more debugging to the ZFS kernel code to monitor this.
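In case it helps anyone else, the boot-time switch would presumably look like this (the tunable name is the one discovered above; read it back after reboot to confirm it took effect):

# disable the deadman panic at boot via the loader tunable
echo 'vfs.zfs.deadman_enabled="0"' >> /boot/loader.conf
# after reboot:
sysctl vfs.zfs.deadman_enabled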
Also, it makes no sense to me to keep vfs.zfs.deadman_enabled read-only. It should be read-write.
It crashed again, this time with the same picture about "pool appears to be hung", after 18 hours of uptime. I had enabled crash dumps with "dumpon -v /dev/raid/r2", which is unused and has 233GB. But the system just hung again: no crash dump generation, no reboot.
Here is the full kernel config file:

include GENERIC
ident   COL

options KDB
options KDB_UNATTENDED
options KDB_TRACE
options DDB
options GDB
#EOF
Now I'm pretty sure the hardware is fine and there is some bug in the ZFS logic. I suspect this system triggers it by a combination of multiple ZVOL snapshots created/removed every hour (tens or hundreds) and an SSD-only RAIDZ1 pool (no HDDs), so a huge number of ATA TRIM operations is added to the queues (AVL trees, in fact) every hour. Because of the kernel hangs and the inability to produce a crash dump, I've patched the ZFS code, changing vfs.zfs.deadman_enabled into a read/write sysctl. The patch also disables the panic if vfs.zfs.deadman_enabled=2 and adds additional logging to syslog (not via zfs_dbgmsg) early, when a queued zio's age hits 1/8th of the standard limit of 1000 seconds.
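With the patch applied (attachment below), the knob can be flipped at runtime; a sketch of the intended usage, assuming the semantics described above (0 = deadman off, 1 = stock behaviour with panic, 2 = log "hung" I/O but do not panic):

# show the current value, then switch to log-only mode provided by the patch
sysctl vfs.zfs.deadman_enabled
sysctl vfs.zfs.deadman_enabled=2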
Created attachment 209178 [details] more debug logging
(In reply to Eugene Grosbein from comment #18)
There is no reason why deadman_enabled had to be read-only. But the code needs to be careful to stop and restart the callout when the deadman logic is disabled and enabled. Maybe we can have the callout always running; then we need to check the flag in the callout callback.

Also, I think that there is a bug in the early warning code. I think that the warning timeout is 8 times longer than the regular timeout.

The best way to debug the problem would be to get a crash dump. Maybe you could try to add a dedicated dump disk and connect it to a different controller, etc. But if you can't get that to work, then I suggest that you add more useful information to the early warning. You can print a pointer to the zio and a pointer to the vdev. Then you'd be able to use kgdb on a live kernel and you would have a place to start. Just my 2 cents, as they say.
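Something along these lines, assuming the standard paths (the zio/vdev addresses would come from your extended warning messages, so the ones below are made-up placeholders):

# attach kgdb to the running kernel
kgdb /boot/kernel/kernel /dev/mem
# then, inside kgdb, follow the pointers printed by the early warning, e.g.:
#   (kgdb) print *(zio_t *)0xfffff80012345678
#   (kgdb) print *(vdev_t *)0xfffff80087654321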
This is a dedicated server in Hetzner. Well, it has another SATA controller seen as:

ahci0@pci0:0:17:5: class=0x010601 card=0x07161028 chip=0xa1d28086 rev=0x09 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'C620 Series Chipset Family SSATA Controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA

But first I'd like to disable the panicking by zfs_deadman and see if the distinct ZFS boot pool is alive at such a moment. I already tried booting the patched zfs.ko by means of nextboot.conf and module_path="/boot/nextboot;/boot/kernel;/boot/modules". Additionally, I've enabled the hardware watchdog in the BIOS setup (the server has a distinct public IP for console access via virtual KVM) and enabled watchdogd. Sadly, I made a misprint and it booted with the default unpatched zfs.ko and I missed that, so another identical panic that occurred today gave me little but interesting information: the watchdog works and the dmesg buffer is not cleared between reboots, so I have a KDB trace in /var/log/kernel.log after the reboot (kern.* messages are logged there). I've fixed the misprint, will reboot again and wait.

Thank you for your remark about the early warning, now I see you are right. I'll fix it.
It took me an unexpectedly long time to get debugging output, as I struggled to force the loader to perform one-time (nextboot) loading of the patched zfs.ko while being able to perform only one reboot per day for this production machine. It turns out we have a documented way to do this for the kernel but not for modules. After several attempts, zfs_name="/boot/nextboot/zfs.ko" in /boot/nextboot.conf finally did it.

# TZ=UTC sysctl kern.boottime
kern.boottime: { sec = 1574393119, usec = 301171 } Fri Nov 22 03:25:19 2019

Local time is UTC+3. After two days of uptime, I got this in the log:

Nov 24 06:24:36 col02 kernel: sata SLOW IO: zio io_type 3 timestamp 171750435607373ns, delta 1006734932093ns, last io 172757121535759ns I/O to pool 'sata' appears to be hung on vdev guid 3313589389580178043 at '/dev/da2.eli' active zio 65

I've used this script to convert ZFS timestamps (getnanouptime expressed in nanoseconds) to readable time:

#!/bin/sh
date -jr $(sysctl -n kern.boottime | awk -vt=${1%?????????} -F'[, ]' '{print t+$4}')
#EOF

First, there was no hardware hang. The system is alive, both ZFS pools still work, and dd if=/dev/da2.eli works too. Second, io_type 3 is ZIO_TYPE_FREE from /sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h.

# gethrtime 172757121535759
Sun Nov 24 06:24:36 MSK 2019

65 active zio's correspond to L(q) shown by gstat:

# gstat -adI1s -f 'da[2-6].*'
dT: 1.001s  w: 1.000s  filter: da[2-6].*
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
   65    774     37   3924    9.0    540  25707    0.7    193   8532  212.6   84.5| da2
   65    780     39   4144   14.3    544  26007    0.9    193   7724  241.7   91.0| da3
   65    771     33   3345    9.3    542  24448    0.7    192   8100  255.5   90.7| da4
   65    851     38   4004    7.8    554  28840    1.4    255   9679  236.0   97.1| da5
   65    766     36   4100    4.5    534  22730    0.5    192   8911  208.8   75.6| da6
   65    693     23   4144   14.8    474  26007    1.2    193   7724  243.0   95.0| da3.eli
   65    691     20   3345    9.4    476  24448    1.0    192   8100  255.8   95.1| da4.eli
   65    688     22   3924    9.8    470  25707    1.0    193   8532  213.0   89.1| da2.eli
   65    753     22   4004   11.6    473  28840    1.7    255   9679  238.3  100.7| da5.eli
   65    691     20   4100    5.6    476  22730    0.7    192   8911  210.6   80.7| da6.eli
Created attachment 209374 [details]
more debug logging

Latest changes to the ZFS code used to get the debug info.
As I suspected, ZFS issues a large number of BIO_DELETE/ZIO_TYPE_FREE requests due to the massive removal of hourly/daily snapshots on this SSD-only pool. Clearing of SSD cells may be slow.

The system has a daily cron job started in the early morning at 6:05 local time. It creates many new daily snapshots for ZVOLs and removes old snapshots. The first debugging warning I got was at 6:24:36, when a ZIO_TYPE_FREE zio had spent 1006/8 = 125.75 seconds in the AVL tree of active zio's. Then there were more such events (50 in total) with "delta" increasing up to 1230 seconds at 06:28:26, after which the messages stopped. They were not only for ZIO_TYPE_FREE(3) zio's but for ZIO_TYPE_WRITE(2) too, and for all components of the pool, plus the distinct ZIL device /dev/raid/r0s3 that is part of a GEOM_RAID mirror volume based on the other pair of SSDs, da[01].

I'd like to show all messages that appeared in the log for the very first event at 06:24:36 (a sketch of the snapshot rotation that generates these FREE zio's follows after the log excerpt):

sata SLOW IO: zio io_type 3 timestamp 171750435607373ns, delta 1006734932093ns, last io 172757121535759ns I/O to pool 'sata' appears to be hung on vdev guid 3313589389580178043 at '/dev/da2.eli' active zio 65
SLOW IO: sata is child of sata
sata SLOW IO: zio io_type 2 timestamp 171757174663043ns, delta 999995876423ns, last io 172756998535496ns I/O to pool 'sata' appears to be hung on vdev guid 17157972224957886076 at '/dev/da3.eli' active zio 65
SLOW IO: sata is child of sata
sata SLOW IO: zio io_type 2 timestamp 171757172636758ns, delta 999997902708ns, last io 172757029559490ns I/O to pool 'sata' appears to be hung on vdev guid 11111943829132819397 at '/dev/da4.eli' active zio 65
SLOW IO: sata is child of sata
sata SLOW IO: zio io_type 2 timestamp 171757172636758ns, delta 999997902708ns, last io 172756997538959ns I/O to pool 'sata' appears to be hung on vdev guid 13912528878380060504 at '/dev/da5.eli' active zio 65
SLOW IO: sata is child of sata
sata SLOW IO: zio io_type 2 timestamp 171757172636758ns, delta 999997902708ns, last io 172756973537187ns I/O to pool 'sata' appears to be hung on vdev guid 18428510808262122301 at '/dev/da6.eli' active zio 65
SLOW IO: sata is child of sata
SLOW IO: sata is child of sata
sata SLOW IO: zio io_type 2 timestamp 171985896582124ns, delta 771273957342ns, last io 172756042541811ns I/O to pool 'sata' appears to be hung on vdev guid 15059311870935027496 at '/dev/raid/r0s3' active zio 2
SLOW IO: sata is child of sata
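For illustration only, the rotation is similar in spirit to the following sketch; aside from the pool name, the snapshot naming, schedule and retention here are made up and not taken from the real cron job:

#!/bin/sh
# hypothetical sketch of an hourly/daily ZVOL snapshot rotation resembling the
# workload described above; retention and naming are made up
POOL=sata
KEEP=48                         # snapshots to keep per ZVOL
stamp=$(date +%Y%m%d-%H%M)

for zvol in $(zfs list -H -o name -t volume -r ${POOL}); do
        # create a new snapshot for this ZVOL
        zfs snapshot "${zvol}@auto-${stamp}"
        # destroy the oldest snapshots beyond the retention limit; each destroy
        # frees blocks and therefore queues ZIO_TYPE_FREE (TRIM) work
        snaps=$(zfs list -H -o name -t snapshot -s creation -r "${zvol}" | grep "@auto-")
        total=$(echo "${snaps}" | grep -c .)
        if [ "${total}" -gt "${KEEP}" ]; then
                echo "${snaps}" | head -n $((total - KEEP)) | xargs -n1 zfs destroy
        fi
done
#EOF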
After the continuous stream of debugging messages stopped at 06:28:26, messages appeared again, but only 3 times: once at 06:57:20 (all ZIO_TYPE_FREE), once at 06:57:25 (all ZIO_TYPE_FREE) and last at 06:57:30 (ZIO_TYPE_WRITE for /dev/da2.eli and ZIO_TYPE_FREE for both /dev/da3.eli and /dev/da5.eli). Then they ceased completely and the system runs fine.
(In reply to Eugene Grosbein from comment #23)
So, I am not sure what this tells us. An I/O request taking tens of minutes does not look normal to me, even if it's a TRIM request. Maybe disable the deadman logic, maybe disable ZFS TRIM. Or maybe there is something special about the SSDs or something that can be tuned. But this does not look like a problem with "core" ZFS. I think that Warner Losh looked deeply into the SSD/TRIM situation; maybe he can give some additional ideas.
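A minimal sketch of the TRIM-related knobs hinted at here; the tunable names assume the legacy FreeBSD 11.x ZFS TRIM implementation and should be verified against sysctl -a on the target system:

# TRIM-related tuning on legacy FreeBSD ZFS (names assumed, verify with
# `sysctl -d vfs.zfs.trim` and `sysctl -d vfs.zfs.vdev` before relying on them)
sysctl vfs.zfs.trim.enabled            # 0 disables ZFS-issued TRIM entirely
sysctl vfs.zfs.vdev.trim_max_active    # cap on concurrent TRIM requests per vdev
# to disable TRIM at boot (not an option for the reporter, but the knob exists):
#   echo 'vfs.zfs.trim.enabled="0"' >> /boot/loader.conf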
At first this hardware used the mfi(4) driver, which has no TRIM support at all, and write performance was very bad. Switching to mrsas(4) and re-trimming the SSDs one by one made performance normal. Disabling TRIM is not an option.
(In reply to Andriy Gapon from comment #25)
I still think the "core ZFS" logic of panicking the system is wrong. It should instead just disconnect the "non-responding" device from a pool that has redundancy, and continue working with the pool in a degraded state. This is why we add costly redundancy to pools.
Closing this as I no longer have that hardware. For the record: as a workaround against the repeating panics, the ZFS deadman feature was disabled, but that did not solve the slow data access problem, which was fatal for guest OSes inside bhyve virtual machines. So the SSDs with bad TRIM performance were replaced with NVMe drives.