Hi! "zpool scrub" may hang in an uninterruptable disk i/o state in case of damaged pool data for 11.2-STABLE/amd64 r335757. This is easily reproduceable using file-backed ZFS pool when files reside on another ("real") pool: cd dir # resides on ZFS size=100 rm -f vdev1 vdev2 truncate -s ${size}m vdev1 vdev2 zpool create ztest $(realpath vdev1) zpool add ztest $(realpath vdev2) # simulate data corruption dd if=/dev/urandom of=vdev2 bs=1m count=${size} zpool scrub ztest The last command "zpool scrub" always hangs here: load: 0.53 cmd: zpool 2130 [tx->tx_sync_done_cv] 34.59r 0.00u 0.00s 0% 3692k "kill -9" cannot kill it.
I am not too surprised. The pool configuration is not redundant and a whole top-level vdev is corrupted. I suspect that the scrub command needs to write something to the pool to record the initial scrub state, and it quite likely has to perform a read-modify-write for that; the read fails and the pool gets suspended. The zpool scrub command is then stuck waiting for confirmation that the scrub has actually started. procstat -kk -a would paint a fuller picture. Maybe there is something reported in dmesg too, but I'm not sure.
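A sketch of the diagnostics suggested above, assuming only base-system tools (the grep pattern is just a guess at which threads matter; in procstat -kk output each thread's kernel stack is a single line):

  procstat -kk -a > procstat-kk.txt
  # the ZFS sync thread(s) and the stuck zpool process are the interesting entries
  grep -E 'zfskern|zpool' procstat-kk.txt
  # look for any pool-suspension messages the kernel may have logged
  dmesg | tail -n 50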
(In reply to Andriy Gapon from comment #1)
Nothing in the dmesg output. The procstat output is huge, so I compressed it; see the attachment.
Created attachment 195052: procstat -kk -a output
Please do not put bugs on stable@, current@, hackers@, etc.
(In reply to Eugene Grosbein from comment #3)
The two relevant threads from the attached procstat output:

     5 101937 zfskern          txg_thread_enter
       mi_switch+0xc5 sleepq_wait+0x2c _cv_wait+0x160 zio_resume_wait+0x4b spa_sync+0xd46 txg_sync_thread+0x25e fork_exit+0x75 fork_trampoline+0xe

  3249 101681 zpool            -
       mi_switch+0xc5 sleepq_wait+0x2c _cv_wait+0x160 txg_wait_synced+0xa5 dsl_sync_task_common+0x219 dsl_sync_task+0x14 dsl_scan+0x9e zfs_ioc_pool_scan+0x5a zfsdev_ioctl+0x6c2 devfs_ioctl_f+0x12d kern_ioctl+0x212 sys_ioctl+0x15c amd64_syscall+0x25c fast_syscall_common+0x101

So, unfortunately, this is how ZFS works now.
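Not confirmed in this thread, but the wait in zio_resume_wait() above is consistent with the pool being suspended under the default failmode=wait pool property, which blocks I/O until "zpool clear" is issued. A sketch using the test setup from the report; whether failmode=continue would actually let the scrub ioctl return in this scenario is an assumption, not something verified here:

  # current behaviour on I/O failure; the default is "wait"
  zpool get failmode ztest
  # hypothetical: recreate the throwaway test pool with a non-default failmode
  zpool create -o failmode=continue ztest $(realpath vdev1)
  zpool add ztest $(realpath vdev2)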
It is reproducible in exactly the same way under 13.2-PRERELEASE/amd64 with stock ZFS.