14.3p1, running poudriere, fails to unmount after a build and hangs hard:

  umount: unmount of /pou/data/.m/143-default/21/usr/ports failed: Device busy
  umount: unmount of /pou/data/.m/143-default/21/usr/local failed: Device busy
  umount: unmount of /pou/data/.m/143-default/21/dev failed: Device busy
  umount: unmount of /pou/data/.m/143-default/21/.p failed: Device busy

A reboot will probably hang as well (the last time I tried, it did). What kind of diagnostics can be done?

Host: HPE DL380 Gen10, 2x Intel Xeon Platinum 8168 (2.7-3.7 GHz, 24 cores, 205 W)
(In reply to Kurt Jaeger from comment #0)

I created a ktrace:

  ktrace -f /tmp/ktr3 df
  kdump -f /tmp/ktr3 > df-ktrace

See: https://people.freebsd.org/~pi/logs/hang/df-ktrace

The last sign of life is a series of getfsstat() calls; the trace ends on the final getfsstat(..., MNT_WAIT), which never returns:

  23969 df CALL  munmap(0x2c36300f000,0xcf44)
  23969 df RET   munmap 0
  23969 df CALL  ioctl(0x1,TIOCGETA,0x2c35fccf394)
  23969 df RET   ioctl 0
  23969 df CALL  getfsstat(0,0,MNT_NOWAIT)
  23969 df RET   getfsstat 68/0x44
  23969 df CALL  getfsstat(0xf7dddc4a580,0x4dd40,MNT_NOWAIT)
  23969 df RET   getfsstat 67/0x43
  23969 df CALL  getfsstat(0xf7dddc4a580,0x4dd40,MNT_WAIT)
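A side note while diagnosing: the trace above shows df(1) blocking in its final getfsstat(2) call with MNT_WAIT, which synchronously asks each filesystem for fresh statistics and can therefore sleep on a suspended mount. FreeBSD's df has a flag for exactly this situation (a hedged suggestion for inspecting state during a hang, not a fix for the underlying problem):

```shell
# df -n reports the kernel's cached per-filesystem statistics (the
# MNT_NOWAIT path seen in the trace above), so it should not block on a
# filesystem that is wedged in suspension.  The numbers may be stale.
df -n
```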
(In reply to Kurt Jaeger from comment #1)

  ktrace -f /tmp/ktr4 mount > mount-stdout
  kdump -f /tmp/ktr4 > mount-ktrace2

These show that mount(8) makes almost the same getfsstat() calls but does not hang; note that it only uses MNT_NOWAIT and never issues the MNT_WAIT call on which df blocked:

  24159 mount CALL  getfsstat(0,0,MNT_NOWAIT)
  24159 mount RET   getfsstat 68/0x44
  24159 mount CALL  getfsstat(0x3ae24724a840,0x4dd40,MNT_NOWAIT)
  24159 mount RET   getfsstat 67/0x43
  24159 mount CALL  fstat(0x1,0x1337f26aa7d0)
  24159 mount STRU  struct stat {dev=3136594270542703103, ino=24217, mode=0100664, nlink=1, uid=0, gid=0, rdev=0, atime=1753005689.619056000, mtime=1753005689.619056000, ctime=1753005689.619056000, birthtime=1753005689.619056000, size=0, blksize=131072, blocks=1, flags=0x800 }
  24159 mount RET   fstat 0
Note: this box ran many builds before without that effect.
Can you share fstat output? That should reveal what is holding a file open on the mountpoint. What exactly is hanging? The original comment just shows output from umount failing. Does umount -f work?
(In reply to Mark Johnston from comment #4) Sorry, we had to power-cycle to be able to use the host again. I'll use fstat the next time this happens. umount -f hangs as well.
(In reply to Kurt Jaeger from comment #5) Also important would be output from "procstat -kka".
(In reply to Mark Johnston from comment #6) OK, I will collect both.
(In reply to Kurt Jaeger from comment #7) 2 more hanging incidents yesterday and today. https://people.freebsd.org/~pi/logs/20250817/ and https://people.freebsd.org/~pi/logs/20250818/
In both cases we are getting stuck removing a tmpfs mount:

  umount:
    mi_switch+0xbd
    _sleep+0x1f3
    vfs_write_suspend+0xe2
    vfs_write_suspend_umnt+0x1c
    tmpfs_unmount+0x23
    dounmount+0x789
    kern_unmount+0x2f5
    amd64_syscall+0x117
    fast_syscall_common+0xf8

and there is an ld thread blocked on a vnode lock:

  ld:
    mi_switch+0xbd
    sleeplk+0x10a
    lockmgr_xlock_hard+0x398
    _vn_lock+0x47
    tmpfs_alloc_vp+0x2f3
    tmpfs_alloc_file+0x20c
    tmpfs_create+0x2b
    VOP_CREATE_APV+0x25
    vn_open_cred+0x3dd
    openatfp+0x262
    sys_open+0x28
    amd64_syscall+0x117
    fast_syscall_common+0xf8

So we allocated a new tmpfs node and a vnode to go with it, but something else has the vnode locked. When this deadlock arises again, it'd be good to get a kernel dump so we can see who's holding the lock. I can't see any obvious candidates in the procstat output, e.g., the vnlru thread is idle. If you have a dumpdev configured, it should be enough to just panic the system with `sysctl debug.kdb.panic=1` after the deadlock happens.
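For anyone else preparing to capture such a dump, here is a minimal sketch of the one-time setup; the device name is an example taken from this thread, so adjust it to your system:

```shell
# /etc/rc.conf -- crash dump configuration (sketch, example device name)
dumpdev="AUTO"          # or an explicit swap device, e.g. /dev/mirror/gm0p2
dumpdir="/var/crash"    # where savecore(8) writes vmcore files at boot

# Once the deadlock is live, force the panic so the dump gets written:
#   sysctl debug.kdb.panic=1
```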
I've also seen this on some occasions (5 or 6 times) on both of my build hosts since upgrading to 14.3-RELEASE. I just tried to build the angie-* ports for i386 to reproduce the recent failures for angie-module-auth-spnego; the poudriere build job went completely catatonic, so I aborted via ^C and got a bunch of umount failed / Device busy errors:

  [00:23:10] Cleaning up
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/04/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/05/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/03/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/07/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/06/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/01/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/08/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/12/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/08/root/.ccache failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/08/dev failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/08/.p failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/16/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/19/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/19/root/.ccache failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/19/dev failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/19/.p failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/20/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/21/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/09/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/21/root/.ccache failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/21/.p failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/18/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/18/root/.ccache failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/18/dev failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/18/.p failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/17/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/10/wrkdirs failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/17/root/.ccache failed: Device busy
  umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/17/.p failed: Device busy

Output of fstat | grep poudriere: https://termbin.com/u0dr
Output of procstat -kka: https://termbin.com/dsmg

(This particular build host is my home server and runs a bunch of other jails, so the procstat output might be rather crowded with unrelated stuff...)
Unfortunately, it seems I don't have enough swap configured on this host to hold a full savecore (256 GB RAM vs. 64 GB swap); at least, triggering a panic didn't give me a vmcore.0 dump file in /var/crash... Please let me know if there is any other information I can provide. Thanks, Sebastian
(In reply to Sebastian Oswald from comment #10)

The pastes you sent seem to be truncated. It'd be better to attach them directly here if possible.

> Unfortunately, it seems I don't have large enough swap configured on this host to handle a full 'savecore' (256GB RAM vs 64G SWAP), at least triggering a panic didn't gave me a 'vmcore.0' dumpfile at /var/crash...

You might try adding dumpon_flags="-Z" to rc.conf to enable crash dump compression next time; that will reduce the space requirements for a kernel dump.
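As a concrete sketch of that suggestion, the rc.conf fragment would look like this (see dumpon(8); -Z selects zstd compression of the dump as it is written):

```shell
# /etc/rc.conf -- enable compressed kernel dumps (sketch)
dumpdev="AUTO"        # dump to the configured swap device
dumpon_flags="-Z"     # compress with zstd, so a dump from 256 GB of RAM
                      # has a chance of fitting into 64 GB of swap
```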
I had another one today. I got on the IPMI console and there were a bunch of messages like this:

  swap_pager: out of swap space
  swp_pager_getswapspace(2): failed

I tried to create a crash dump but was told the "partition is too small". (I now have dumpon_flags=-Z.) I do have a LOT of swap:

  zinc 3 # swapinfo
  Device              1K-blocks     Used     Avail Capacity
  /dev/mirror/gm0p2    33554428        0  33554428       0%

And a decent number of cores and a decent amount of RAM:

  zinc 7 # sysctl hw.ncpu
  hw.ncpu: 48
  zinc 9 # egrep ^real /var/run/dmesg.boot
  real memory  = 206158430208 (196608 MB)

This machine was spec'ed to be a build server. What can I capture the next time this happens?
I had another hang today. I tried to force a panic; the system was unreachable for ~10 minutes and then booted without a crash dump... I was able to get the list of processes in D state:

  64116  1  S+  0:00.00 fgrep D
  49043  2  IJ  0:00.04 /usr/bin/make -C /usr/ports/devel/qt6-base -k -DNOC
  49064  2  DJ  0:00.01 /bin/rm -f -r /wrkdirs/usr/ports/devel/qt6-base/wor
  70858  2  DJ  0:00.28 /usr/bin/tar -xf /portdistfiles/KDE/Qt/6.9.2/qtbase
I'm still seeing about one hang a week (5 of them since Sept 4th). This morning's poudriere finished building but three jails were stuck in "clean_pool". I tried debug.kdb.panic and watched it dump "3002 out of 195197 MB" but when the system booted it said "No core dumps found." I guess next time I should try running savecore on the running system? I never saw this behavior with 14.2
I had another hang overnight. I tried "savecore -L" in a window I had open but it just hung. Next time I'll try a hardware reset -> single user -> "savecore -vLZ"
Wow, two hangs in two days. I now see that "savecore -vLZ" is wrong; what I need to do is (if possible) "sysctl debug.kdb.panic=1", THEN boot to single user -> "mount -u /" -> "savecore -vZ"
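For the record, that sequence as a console sketch (FreeBSD-specific, run as root; an outline under the assumptions in this thread, not a script to run blindly). One caveat: -Z is a dumpon(8) flag, not a savecore(8) one, so the final step below uses plain -v:

```shell
# 1. While the deadlock is live, force a panic so the kernel writes a
#    dump to the configured dump device:
sysctl debug.kdb.panic=1

# 2. After the reset, boot to single-user mode and make / writable:
mount -u /

# 3. Extract the dump from the dump device into /var/crash:
savecore -v /var/crash
```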
This week I had another 4 (!) poudriere bulk jobs hanging indefinitely; I was unable to terminate them and couldn't reboot without a power cycle via IPMI... For the last occurrence, I could determine the approximate moment when all builds went catatonic, and that time pretty much lined up with these cron jobs:

  0 22 * * * /usr/local/sbin/zfsnap snapshot -a 3w -r nvme_pool/poudriere/data
  0 22 * * * /usr/local/sbin/zfsnap snapshot -a 4w -r nvme_pool/poudriere/jails
  0 22 * * * /usr/local/sbin/zfsnap snapshot -a 4w -r nvme_pool/poudriere/ports
  0 22 * * * /usr/local/sbin/zfsnap snapshot -a 3m -r nvme_pool/poudriere/poudriere.d

These have been present for as long as this system has existed and have always worked. I suspect it's the first one causing issues, as it's always the data/.m mounts that go titsup. The fact that the snapshot itself seems to trigger all builds to go catatonic and makes the data/.m mounts defunct definitely reeks of some regression in one of the latest zfs updates (of which there are way too many nowadays, IMHO...), so poudriere most likely isn't at fault here.
I've had many more hangs, but today was the first time I was able to get on the (IPMI) console and try to get a crash dump. So far I'm still unsuccessful. I was getting usage messages when I added -Z to savecore, so I left it out. I tried this:

  # savecore -kv
  unable to open bounds file, using 0
  checking for kernel dump on device /dev/mirror/gm0p2
  mediasize = 34359737856 bytes
  sectorsize = 512 bytes
  magic mismatch on last dump header on /dev/mirror/gm0p2
  savecore 43 - - no dumps found

"dumpon -l" showed /dev/null, but savecore was still able to locate mirror/gm0p2 as the dump device (which is what I see when the system is booted multiuser). I also see that I've configured /var/crash as a zfs dataset, and it's not mounted when /etc/rc.d/savecore runs; I guess having a dedicated zfs crash partition isn't really possible. Is something initializing swap when I boot into single user after "sysctl debug.kdb.panic=1"? This is a pretty annoying bug that I only see on this one system, and I'm not sure what to try the next time it occurs.
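One thing that may be worth checking before the next attempt (a hedged suggestion, based on the "dumpon -l" output above): make sure the dump device is actually armed at the moment of the panic, since /dev/null there means no dump will be taken at all:

```shell
# List the currently armed dump device; "/dev/null" means dumps are off.
dumpon -l

# Re-arm it explicitly if needed (example device from this thread);
# -Z enables zstd compression of the dump as it is written.
dumpon -Z /dev/mirror/gm0p2
```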
Today was my third hang in a week, second day in a row. I had several windows open on the box and tried to do a live savecore; that got into D-wait pretty quickly. Soon almost anything I did would hang (including the sysctl panic). Is there a problem with the way I have swap configured on this box that is preventing savecore from working? This server runs zfs on six nvme drives.

  zinc 7 % dumpon -l
  mirror/gm0p2

  zinc 8 % gpart show
  =>        40  937703008  nda0  GPT  (447G)
            40       1024     1  freebsd-boot  (512K)
          1064        984        - free -  (492K)
          2048   67108864     2  freebsd-swap  (32G)
      67110912  870590464     3  freebsd-zfs  (415G)
     937701376       1672        - free -  (836K)

  (nda1 through nda5 are partitioned identically)

  zinc 9 % gmirror list
  Geom name: gm0p2
  State: COMPLETE
  Components: 6
  Balance: load
  Slice: 4096
  Flags: NONE
  GenID: 0
  SyncID: 2
  ID: 859694810
  Type: AUTOMATIC
  Providers:
  1. Name: mirror/gm0p2  Mediasize: 34359737856 (32G)  Sectorsize: 512  Mode: r1w1e0
  Consumers:
  1. Name: nda0p2  Mediasize: 34359738368 (32G)  Sectorsize: 512  Stripesize: 0  Stripeoffset: 1048576  Mode: r1w1e1  State: ACTIVE  Priority: 0  Flags: (null)  GenID: 0  SyncID: 2  ID: 2317411985

  (consumers 2-6 are nda1p2 through nda5p2, identical apart from Priority 1-5 and IDs 1611450630, 2208673777, 807113585, 3628942691, 4233446169)

  zinc 13 % zpool status
    pool: tank
   state: ONLINE
    scan: scrub repaired 0B in 00:04:54 with 0 errors on Wed Dec 10 20:04:54 2025
  config:

          NAME        STATE     READ WRITE CKSUM
          tank        ONLINE       0     0     0
            mirror-0  ONLINE       0     0     0
              nda0p3  ONLINE       0     0     0
              nda1p3  ONLINE       0     0     0
            mirror-1  ONLINE       0     0     0
              nda2p3  ONLINE       0     0     0
              nda3p3  ONLINE       0     0     0
            mirror-2  ONLINE       0     0     0
              nda4p3  ONLINE       0     0     0
              nda5p3  ONLINE       0     0     0

  errors: No known data errors

  zinc 14 % zpool list
  NAME   SIZE  ALLOC  FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
  tank  1.21T  1000G   236G       -         -   64%  80%  1.00x  ONLINE  -