On several servers I manage, I take backups to an external HD and use mksnap_ffs to create snapshots.
I've never had troubles with this up to 11.3.
Lately I started doing this on a 12.1 server and noticed mksnap_ffs will hang the box for several minutes (services stuck, no login allowed, already established ssh sessions partially work; shutdown not feasible unless reset button is pressed).
N.B. This is an external drive, only mounted when needed and only accessed to make backups, so if *it* got stuck, it should not affect the whole system.
I decided to check this and took the external HD to my desktop (11.3): it worked perfectly.
I then upgraded my desktop to 12.1p2 and it started doing as above:
_ mksnap_ffs will work for several minutes under high I/O (the HD is 6TB); meanwhile, the system is responsive;
_ then mksnap_ffs will drastically reduce its I/O (at least as measured with top), but will keep working for some other minutes: during this phase, I cannot open any new program; ThunderBird gets stuck, already open FireFox windows still works, but I cannot open any new window; audacity keeps playing the current track, but will get stuck when moving on to the next; already open terminal windows might partially work;
_ after several minutes mksnap_ffs will exit and everything will get back to normal.
This is of course unacceptable on a production server.
I built a test machine with 12.1/amd64 with the following kernel options: KDB, KDB_TRACE, DDB, GDB, INVARIANTS, INVARIANT_SUPPORT, WITNESS, WITNESS_SKIPSPIN, DEBUG_VFS_LOCKS, LOCK_PROFILING, KTR, ALQ, KTR_ENTRIES=4096.
Such a kernel paniced immediately after launching mksnap_ffs with LOR #269.
I removed WITNESS, WITNESS_SKIPSPIN, issued a "fsck -y" on the disk and tried again.
This time I got a different panic:
panic: ffs_copyonwrite: bad copy block
cpuid = 0
time = 1581243816
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe001beef0e0
vpanic() at vpanic+0x19d/frame 0xfffffe001beef130
panic() at panic+0x43/frame 0xfffffe001beef190
ffs_copyonwrite() at ffs_copyonwrite+0x74c/frame 0xfffffe001beef230
ffs_geom_strategy() at ffs_geom_strategy+0x8c/frame 0xfffffe001beef260
ufs_strategy() at ufs_strategy+0x83/frame 0xfffffe001beef290
VOP_STRATEGY_APV() at VOP_STRATEGY_APV+0xc9/frame 0xfffffe001beef2c0
bufstrategy() at bufstrategy+0x44/frame 0xfffffe001beef2f0
bufwrite() at bufwrite+0x230/frame 0xfffffe001beef330
ffs_snapshot() at ffs_snapshot+0x8e0/frame 0xfffffe001beef630
ffs_mount() at ffs_mount+0xb3a/frame 0xfffffe001beef7d0
vfs_domount() at vfs_domount+0x8b6/frame 0xfffffe001beef9f0
vfs_donmount() at vfs_donmount+0x7e7/frame 0xfffffe001beefa90
sys_nmount() at sys_nmount+0xf2/frame 0xfffffe001beefac0
amd64_syscall() at amd64_syscall+0x281/frame 0xfffffe001beefbf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe001beefbf0
--- syscall (378, FreeBSD ELF64, sys_nmount), rip = 0x8002d88ba, rsp = 0x7fffffffd288, rbp = 0x7fffffffeae0 ---
KDB: enter: panic
So, I also removed INVARIANTS and INVARIANT_SUPPORT (and run "fsck -y" twice) in order to be able to get snapshots.
I haven't been able to collect other data yet.
This week another box was upgraded from 11.3 to 12.1: like the first one, this hangs as soon as mksnap_ffs is run on an external USB-attached HD and only the reset button can get it out of this situaation.
The external HD is not the same, so we can exclude it's something wroing in that particular filesystem.
As a side note, I tried setting up debugging on a test machine via serial cable, but was unable to get it to work.
_ one one box, running GENERIC instead of custom kernel, still shows the problem;
_ on another box, I installed a 11.3 VM (through bhyve) and connected it to the host via ggate: doing the shapshot on the VM does not halt the system (so this has nothing to do with the data on the disk).
I also backup a filesystem from cron. Sometimes this hangs.
When this happens, the mksnap_ffs is running at 100% of one of the processors. It is also unkillable... The hapless dump(8) sits there waiting for mksnap_ffs to exit...
The filesystem becomes unusable -- the already-started processes keep running, but their attempts to access the filesystem hang.
In our case the filesystem's parameters are:
tunefs: POSIX.1e ACLs: (-a) disabled
tunefs: NFSv4 ACLs: (-N) disabled
tunefs: MAC multilabel: (-l) disabled
tunefs: soft updates: (-n) enabled
tunefs: soft update journaling: (-j) disabled
tunefs: gjournal: (-J) disabled
tunefs: trim: (-t) enabled
tunefs: maximum blocks per file in a cylinder group: (-e) 4096
tunefs: average file size: (-f) 16384
tunefs: average number of files in a directory: (-s) 64
tunefs: minimum percentage of free space: (-m) 8%
tunefs: space to hold for metadata blocks: (-k) 6408
tunefs: optimization preference: (-o) time
I cannot say, if this is a regression as this is a new machine.
Today I tried again after upgrading to 12.2.
Still the same behavious (i.e. hardware reset required).
This is becoming more serious, as 11.x is approaching its EOL, as it prevents me from upgrading several machines.
I finally was able to reproduce this systematically on a virtual machine (with INVARIANTS & co enabled).
I've got a snapshot where the VM is deadlocked and I can connect to it via remote GDB: I'd appreciate any help on diagnosing this.
I've also got a different snapshot where the described panic happens, but I've opened a different bug for that (#252821).