Bug 229694

Summary: [zfs] unkillable "zpool scrub" in [tx->tx_sync_done_cv] state for damaged data
Product: Base System
Reporter: Eugene Grosbein <eugen>
Component: kern
Assignee: freebsd-fs (Nobody) <fs>
Status: New
Severity: Affects Some People
CC: pi, stable
Priority: ---
Version: 11.2-STABLE
Hardware: Any
OS: Any
Attachments: procstat -kk -a output (flags: none)

Description Eugene Grosbein 2018-07-11 11:10:29 UTC
Hi!

"zpool scrub" may hang in an uninterruptable disk i/o state in case of damaged pool data for 11.2-STABLE/amd64 r335757. This is easily reproduceable using file-backed ZFS pool when files reside on another ("real") pool:

cd dir # resides on ZFS
size=100
rm -f vdev1 vdev2
truncate -s ${size}m vdev1 vdev2
zpool create ztest $(realpath vdev1)
zpool add ztest $(realpath vdev2)
# simulate data corruption
dd if=/dev/urandom of=vdev2 bs=1m count=${size}
zpool scrub ztest

The last command, "zpool scrub", always hangs here:

load: 0.53  cmd: zpool 2130 [tx->tx_sync_done_cv] 34.59r 0.00u 0.00s 0% 3692k

"kill -9" cannot kill it.
Comment 1 Andriy Gapon 2018-07-11 12:08:48 UTC
I am not too surprised.  The pool configuration is not redundant and the whole top-level vdev is corrupted.  I suspect that the scrub command needs to write something to the pool to record the initial scrub state, and that it quite likely needs to perform a read-modify-write.  The read fails and the pool gets suspended, so the zpool scrub command is stuck waiting for confirmation that the scrub has actually started.

procstat -kk -a would paint a fuller picture.
Maybe something is reported in dmesg too, but I am not sure.
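
Whether the pool really ends up suspended, and how it is configured to react to the failed write, can be checked with standard commands. A small sketch, assuming the ztest pool from the report (the failmode property is only a guess at what is involved here, not something confirmed by the traces):

zpool status ztest        # a suspended pool should report it in the state/status lines
zpool get failmode ztest  # default failmode=wait blocks I/O until the pool is cleared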
Comment 2 Eugene Grosbein 2018-07-11 13:58:22 UTC
(In reply to Andriy Gapon from comment #1)

Nothing in the dmesg output. The procstat output is huge, so I compressed it; see the attachment.
Comment 3 Eugene Grosbein 2018-07-11 13:58:55 UTC
Created attachment 195052 [details]
procstat -kk -a output
Comment 4 Rodney W. Grimes 2019-02-13 02:00:51 UTC
Please do not put bugs on the stable@, current@, hackers@, etc. mailing lists.
Comment 5 Andriy Gapon 2019-02-13 10:10:45 UTC
(In reply to Eugene Grosbein from comment #3)
    5 101937 zfskern             txg_thread_enter    mi_switch+0xc5 sleepq_wait+0x2c _cv_wait+0x160 zio_resume_wait+0x4b spa_sync+0xd46 txg_sync_thread+0x25e fork_exit+0x75 fork_trampoline+0xe 

 3249 101681 zpool               -                   mi_switch+0xc5 sleepq_wait+0x2c _cv_wait+0x160 txg_wait_synced+0xa5 dsl_sync_task_common+0x219 dsl_sync_task+0x14 dsl_scan+0x9e zfs_ioc_pool_scan+0x5a zfsdev_ioctl+0x6c2 devfs_ioctl_f+0x12d kern_ioctl+0x212 sys_ioctl+0x15c amd64_syscall+0x25c fast_syscall_common+0x101

So, unfortunately, this is how ZFS works now.
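
For anyone left with a stuck throw-away test pool, the usual escape hatch for a suspended pool is zpool clear. A sketch, assuming the ztest pool from the report; with the whole top-level vdev overwritten, neither command is guaranteed to return, and a reboot may still be the only way to get rid of the stuck process:

zpool clear ztest       # documented way to resume a suspended pool once its devices are usable again
zpool destroy -f ztest  # clean up the test pool; may block for the same reason as the scrub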
Comment 6 Eugene Grosbein 2023-02-13 02:49:52 UTC
It is reproducible in exactly the same way under 13.2-PRERELEASE/amd64 with stock ZFS.