After an installworld and installkernel, the machine panics during the subsequent reboot with "softdep_waitidle: work added after flush". A core dump is available on request. Console output from the shutdown:
Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining... 4 2 2 1 1 0 0 done
Waiting (max 60 seconds) for system thread `bufdaemon' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-0' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-1' to stop... done
All buffers synced.
panic: softdep_waitidle: work added after flush
cpuid = 0
time = 1609629605
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe009db49790
vpanic() at vpanic+0x181/frame 0xfffffe009db497e0
panic() at panic+0x43/frame 0xfffffe009db49840
softdep_flushfiles() at softdep_flushfiles+0x445/frame 0xfffffe009db498d0
ffs_unmount() at ffs_unmount+0x5d/frame 0xfffffe009db49940
dounmount() at dounmount+0x42c/frame 0xfffffe009db499b0
vfs_unmountall() at vfs_unmountall+0x6a/frame 0xfffffe009db499e0
bufshutdown() at bufshutdown+0x2ce/frame 0xfffffe009db49a30
kern_reboot() at kern_reboot+0x213/frame 0xfffffe009db49a70
sys_reboot() at sys_reboot+0x3a4/frame 0xfffffe009db49ac0
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe009db49bf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe009db49bf0
--- syscall (55, FreeBSD ELF64, sys_reboot), rip = 0x2dec6182dd5a, rsp = 0x7fffffb03f58, rbp = 0x7fffffb040d0 ---
I wonder whether unlocking the vnode during sync on reclaim allows more mutations to happen during vflush().
If so, we should just turn this assertion into a loop restart.
(In reply to Konstantin Belousov from comment #1)
I concur with your analysis. The unlocking can allow other operations to happen, which could cause new work to show up. My take would be to just delete the KASSERT and let the function loop again. In theory new work could show up more than SU_WAITIDLE_RETRIES (20) times, but even in that scenario softdep_waitidle will simply return EBUSY, which can be handled at a higher level. Alternatively, we could test for this condition and decrement the loop variable to give it an extra run around the loop, but that seems to be over-thinking the problem.
Do you agree with this assessment?
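To make the proposal concrete, here is a minimal userland sketch of the control flow, not the actual sys/ufs/ffs/ffs_softdep.c code. Only SU_WAITIDLE_RETRIES (20) and the EBUSY result come from the discussion above; struct fake_mount, fake_flush() and waitidle_sketch() are hypothetical names made up for illustration, and locking and sleeping are left out entirely.

```c
#include <errno.h>

#define SU_WAITIDLE_RETRIES 20	/* retry limit quoted in the comment above */

/* Hypothetical stand-in for the per-mount softdep state. */
struct fake_mount {
	int pending;		/* dependencies still outstanding */
	int racing_adds;	/* how many times new work sneaks in after a flush */
};

/*
 * Hypothetical flush: drains the pending work, but while the vnode is
 * unlocked another thread may queue new work behind it.
 */
static void
fake_flush(struct fake_mount *mp)
{
	mp->pending = 0;
	if (mp->racing_adds > 0) {
		mp->racing_adds--;
		mp->pending = 1;
	}
}

/*
 * Sketch of the proposed behaviour: if work shows up after a flush,
 * go around the loop again instead of asserting; give up with EBUSY
 * once the retry limit is exhausted.
 */
static int
waitidle_sketch(struct fake_mount *mp)
{
	int i;

	for (i = 0; i < SU_WAITIDLE_RETRIES; i++) {
		fake_flush(mp);
		if (mp->pending == 0)
			return (0);
		/* Work was added after the flush: retry rather than panic. */
	}
	return (EBUSY);
}
```

With a few racing additions the loop settles and returns 0; if new work keeps arriving on every pass it returns EBUSY after 20 tries, leaving the decision to the caller.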