Bug 288345 - poudriere run, hanging umounts, system fails to reboot due to hanging processes
Summary: poudriere run, hanging umounts, system fails to reboot due to hanging processes
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 14.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-07-20 08:00 UTC by Kurt Jaeger
Modified: 2025-12-11 18:17 UTC (History)
8 users (show)

See Also:


Description Kurt Jaeger freebsd_committer freebsd_triage 2025-07-20 08:00:15 UTC
14.3-RELEASE-p1, running poudriere: umount fails after the build and the system hangs hard:

umount: unmount of /pou/data/.m/143-default/21/usr/ports failed: Device busy
umount: unmount of /pou/data/.m/143-default/21/usr/local failed: Device busy
umount: unmount of /pou/data/.m/143-default/21/dev failed: Device busy
umount: unmount of /pou/data/.m/143-default/21/.p failed: Device busy

Reboot will probably hang as well (the last time I tried, it did).

What kind of diagnostics can be done?

host: HPE DL380 Gen10
2x Intel Xeon-Platinum 8168
2.7GHz-3.7GHz/24-core/205W Processor
Comment 1 Kurt Jaeger freebsd_committer freebsd_triage 2025-07-20 10:34:07 UTC
(In reply to Kurt Jaeger from comment #0)
I created a ktrace:
   
ktrace -f /tmp/ktr3 df
kdump -f /tmp/ktr3 > df-ktrace

see:
https://people.freebsd.org/~pi/logs/hang/df-ktrace

It looks like the last signs of life are some getfsstat() calls:

 23969 df       RET   close 0
 23969 df       CALL  munmap(0x2c36300f000,0xcf44)
 23969 df       RET   munmap 0
 23969 df       CALL  ioctl(0x1,TIOCGETA,0x2c35fccf394)
 23969 df       RET   ioctl 0
 23969 df       CALL  getfsstat(0,0,MNT_NOWAIT)
 23969 df       RET   getfsstat 68/0x44
 23969 df       CALL  getfsstat(0xf7dddc4a580,0x4dd40,MNT_NOWAIT)
 23969 df       RET   getfsstat 67/0x43
 23969 df       CALL  getfsstat(0xf7dddc4a580,0x4dd40,MNT_WAIT)
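The final flag is the telling part: with MNT_NOWAIT, getfsstat(2) returns whatever per-mount statistics are already cached, while MNT_WAIT asks every filesystem to refresh its statistics and can therefore block on a wedged mount. As a hedged diagnostic sketch (behavior per df(1); this works around the symptom, not the cause):

```shell
# df normally ends with getfsstat(MNT_WAIT) and hangs on a wedged mount.
# With -n, df prints only the previously obtained (cached) statistics,
# so it should not block even while the mount is stuck:
df -n
```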
Comment 2 Kurt Jaeger freebsd_committer freebsd_triage 2025-07-20 10:43:54 UTC
(In reply to Kurt Jaeger from comment #1)
  ktrace -f /tmp/ktr4 mount > mount-stdout
  kdump -f /tmp/ktr4 > mount-ktrace2

show that almost the same getfsstat() calls are made, but they lead to no hang:

 24159 mount    CALL  getfsstat(0,0,MNT_NOWAIT)
 24159 mount    RET   getfsstat 68/0x44
 24159 mount    CALL  getfsstat(0x3ae24724a840,0x4dd40,MNT_NOWAIT)
 24159 mount    RET   getfsstat 67/0x43
 24159 mount    CALL  fstat(0x1,0x1337f26aa7d0)
 24159 mount    STRU  struct stat {dev=3136594270542703103, ino=24217, mode=0100664, nlink=1, uid=0, gid=0, rdev=0, atime=1753005689.619056000, mtime=1753005689.619056000, ctime=1753005689.619056000, birthtime=1753005689.619056000, size=0, blksize=131072, blocks=1, flags=0x800 }
 24159 mount    RET   fstat 0
Comment 3 Kurt Jaeger freebsd_committer freebsd_triage 2025-07-20 12:48:36 UTC
Note: this box ran many builds before without that effect.
Comment 4 Mark Johnston freebsd_committer freebsd_triage 2025-07-25 16:25:30 UTC
Can you share fstat output?  That should reveal what is holding a file open on the mountpoint.

What exactly is hanging?  The original comment just shows output from umount failing.  Does umount -f work?
Comment 5 Kurt Jaeger freebsd_committer freebsd_triage 2025-07-27 20:06:19 UTC
(In reply to Mark Johnston from comment #4)
Sorry, we had to power-cycle to be able to use the host. I'll use fstat
the next time this happens.

umount -f hangs as well.
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2025-07-28 13:08:22 UTC
(In reply to Kurt Jaeger from comment #5)
Also important would be output from "procstat -kka".
Comment 7 Kurt Jaeger freebsd_committer freebsd_triage 2025-07-28 13:19:46 UTC
(In reply to Mark Johnston from comment #6)
OK, I will collect both next time.
Comment 8 Kurt Jaeger freebsd_committer freebsd_triage 2025-08-18 16:19:50 UTC
(In reply to Kurt Jaeger from comment #7)
2 more hanging incidents yesterday and today.

https://people.freebsd.org/~pi/logs/20250817/
and
https://people.freebsd.org/~pi/logs/20250818/
Comment 9 Mark Johnston freebsd_committer freebsd_triage 2025-08-18 16:36:45 UTC
In both cases we are getting stuck removing a tmpfs mount:

umount              -                   mi_switch+0xbd _sleep+0x1f3 vfs_write_suspend+0xe2 vfs_write_suspend_umnt+0x1c tmpfs_unmount+0x23 dounmount+0x789 kern_unmount+0x2f5 amd64_syscall+0x117 fast_syscall_common+0xf8

and there is an ld thread blocked on a vnode lock:

ld                  -                   mi_switch+0xbd sleeplk+0x10a lockmgr_xlock_hard+0x398 _vn_lock+0x47 tmpfs_alloc_vp+0x2f3 tmpfs_alloc_file+0x20c tmpfs_create+0x2b VOP_CREATE_APV+0x25 vn_open_cred+0x3dd openatfp+0x262 sys_open+0x28 amd64_syscall+0x117 fast_syscall_common+0xf8

So we allocated a new tmpfs node and a vnode to go with it, but something else has the vnode locked.

When this deadlock arises again, it'd be good to get a kernel dump so we can see who's holding the lock.  I can't see any obvious candidates in the procstat output, e.g., the vnlru thread is idle.  If you have a dumpdev configured, it should be enough to just panic the system with `sysctl debug.kdb.panic=1` after the deadlock happens.
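For reference, a minimal sketch of the dump setup described above (the device path is an example; point dumpdev at the local swap partition):

```shell
# /etc/rc.conf fragment: arrange for a panic to be written to a dump device
dumpdev="/dev/gpt/swap0"   # example name; use the local swap partition
dumpon_flags="-Z"          # compress the dump so it fits in smaller swap

# once the deadlock occurs, panic the box from a still-working shell:
sysctl debug.kdb.panic=1
# on the next boot, /etc/rc.d/savecore writes vmcore.* into /var/crash
```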
Comment 10 Sebastian Oswald 2025-08-25 20:18:33 UTC
I've also seen this on some occasions (~5 or 6 times) on both of my buildhosts since upgrading to 14.3-RELEASE.

I just tried to build the angie-* ports for i386 to reproduce the recent failures of angie-module-auth-spnego, and the poudriere build job went completely catatonic, so I aborted via ^C and got a bunch of "umount failed: Device busy" errors:

[00:23:10] Cleaning up
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/04/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/05/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/03/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/07/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/06/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/01/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/08/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/12/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/08/root/.ccache failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/08/dev failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/08/.p failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/16/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/19/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/19/root/.ccache failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/19/dev failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/19/.p failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/20/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/21/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/09/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/21/root/.ccache failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/21/.p failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/18/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/18/root/.ccache failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/18/dev failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/18/.p failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/17/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/10/wrkdirs failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/17/root/.ccache failed: Device busy
umount: unmount of /usr/local/poudriere/data/.m/FreeBSD_14_i386-latest-servers/17/.p failed: Device busy


output of fstat | grep poudriere at termbin: https://termbin.com/u0dr

output of procstat -kka at termbin: https://termbin.com/dsmg

(this particular build host is my home server and running a bunch of other jails, so the procstat output might be rather crowded with unrelated stuff...)

Unfortunately, it seems I don't have enough swap configured on this host to handle a full 'savecore' (256 GB RAM vs. 64 GB swap); at least, triggering a panic didn't give me a 'vmcore.0' dump file in /var/crash...


Please let me know if there is any other information I can provide.

Thanks,
Sebastian
Comment 11 Mark Johnston freebsd_committer freebsd_triage 2025-08-25 20:24:00 UTC
(In reply to Sebastian Oswald from comment #10)
The pastes you sent seem to be truncated.  It'd be better to attach them directly here if possible.

> Unfortunately, it seems I don't have large enough swap configured on this host to handle a full 'savecore' (256GB RAM vs 64G SWAP), at least triggering a panic didn't gave me a 'vmcore.0' dumpfile at /var/crash...

You might try adding dumpon_flags="-Z" to rc.conf to enable crash dump compression next time; that will reduce the space requirements for a kernel dump.
Comment 12 Craig Leres freebsd_committer freebsd_triage 2025-08-28 17:10:49 UTC
I had another one today. I got on the ipmi console and there were a bunch of messages like this:

    swap_pager: out of swap space
    swp_pager_getswapspace(2): failed

I tried to create a crash dump but got "partition is too small". (I now have dumpon_flags=-Z.)

I do have a LOT of swap:

    zinc 3 # swapinfo
    Device          1K-blocks     Used    Avail Capacity
    /dev/mirror/gm0p2  33554428        0 33554428     0%

And a decent amount of cores and ram:

    zinc 7 # sysctl hw.ncpu
    hw.ncpu: 48
    zinc 9 # egrep ^real /var/run/dmesg.boot 
    real memory  = 206158430208 (196608 MB)

This machine was spec'ed to be a build server.

What can I capture next time this happens?
Comment 13 Craig Leres freebsd_committer freebsd_triage 2025-09-04 19:52:25 UTC
I had another hang today. I tried to force a panic; the system was unreachable for ~10 minutes and booted without a crash dump... I was able to get the list of processes in D state:

64116  1  S+        0:00.00 fgrep D
49043  2  IJ        0:00.04 /usr/bin/make -C /usr/ports/devel/qt6-base -k -DNOC
49064  2  DJ        0:00.01 /bin/rm -f -r /wrkdirs/usr/ports/devel/qt6-base/wor
70858  2  DJ        0:00.28 /usr/bin/tar -xf /portdistfiles/KDE/Qt/6.9.2/qtbase
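For the record, the D-state list above can be captured together with wait channels and kernel stacks, which is what's needed to identify the lock holder (a sketch; column keywords per ps(1)):

```shell
# processes stuck in disk wait (state begins with D), with wait channel:
ps -axo pid,state,mwchan,comm | awk 'NR==1 || $2 ~ /^D/'
# kernel stacks of every thread, to spot who is holding the vnode lock:
procstat -kka > /var/tmp/procstat-kka.txt
```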
Comment 14 Craig Leres freebsd_committer freebsd_triage 2025-10-10 17:57:42 UTC
I'm still seeing about one hang a week (5 of them since Sept 4th). This morning's poudriere finished building but three jails were stuck in "clean_pool". I tried debug.kdb.panic and watched it dump "3002 out of 195197 MB" but when the system booted it said "No core dumps found." I guess next time I should try running savecore on the running system?

I never saw this behavior with 14.2
Comment 15 Craig Leres freebsd_committer freebsd_triage 2025-10-24 16:33:34 UTC
I had another hang overnight. I tried "savecore -L" in a window I had open but it just hung.

Next time I'll try a hardware reset -> single user -> "savecore -vLZ"
Comment 16 Craig Leres freebsd_committer freebsd_triage 2025-10-25 17:53:08 UTC
Wow, two hangs in two days. I see now that "savecore -vLZ" is wrong; what I need to do is (if possible) "sysctl debug.kdb.panic=1", THEN -> single user -> "mount -u /" -> "savecore -vZ".
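Spelled out, that recovery sequence is roughly the following (a sketch; the device name is this host's dump device, and note that per the manual pages compression is selected at dump time via dumpon -Z, while savecore's own compression flag is lowercase -z):

```shell
# 1. from a still-working shell, force a panic so the kernel writes a dump:
sysctl debug.kdb.panic=1
# 2. after rebooting to single user, make the root filesystem writable:
mount -u /
# 3. recover the dump from the swap/dump device into /var/crash
#    (-z compresses the saved core file):
savecore -vz /var/crash /dev/mirror/gm0p2
```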
Comment 17 Sebastian Oswald 2025-11-13 10:59:55 UTC
This week I had another 4 (!) poudriere bulk jobs hanging indefinitely; I was unable to terminate them and couldn't reboot without a power cycle via ipmi...

For the last occurrence, I could determine the approximate moment when all builds went catatonic, and that time pretty much lined up with these cron jobs:

0   22  *   *   *   /usr/local/sbin/zfsnap snapshot -a 3w -r nvme_pool/poudriere/data
0   22  *   *   *   /usr/local/sbin/zfsnap snapshot -a 4w -r nvme_pool/poudriere/jails
0   22  *   *   *   /usr/local/sbin/zfsnap snapshot -a 4w -r nvme_pool/poudriere/ports
0   22  *   *   *   /usr/local/sbin/zfsnap snapshot -a 3m -r nvme_pool/poudriere/poudriere.d

These have been present for as long as this system has existed and have always worked. I suspect the first one is causing the issue, as it's always the data/.m mounts that break.
The fact that the snapshot itself seems to trigger all builds to go catatonic and makes the data/.m mounts defunct definitely reeks of a regression in one of the latest zfs updates (of which there are way too many nowadays, IMHO...), so poudriere most likely isn't at fault here.
Comment 18 Craig Leres freebsd_committer freebsd_triage 2025-12-10 18:24:06 UTC
I've had many more hangs but today was the first time I was able to get on the (ipmi) console and try to get a crash dump. So far I'm still unsuccessful.

I was getting usage messages if I added -Z to savecore so I left it out. I tried this:

    # savecore -kv
    unable to open bounds file, using 0
    checking for kernel dump on device /dev/mirror/gm0p2
    mediasize = 34359737856 bytes
    sectorsize = 512 bytes
    magic mismatch on last dump header on /dev/mirror/gm0p2
    savecore 43 - - no dumps found

"dumpon -l" showed /dev/null but savecore was still able to locate mirror/gm0p2 as the dump device (which is what I see when the system is booted multiuser).

I also see that while the system is running, /var/crash is configured as a zfs dataset, but it's not mounted when /etc/rc.d/savecore runs. I guess a dedicated zfs crash partition isn't really possible.

Is something initializing swap when I boot into single user after "sysctl debug.kdb.panic=1"?

This is a pretty annoying bug that I only see on this one system and I'm not sure what to try next time it occurs.
Comment 19 Craig Leres freebsd_committer freebsd_triage 2025-12-11 18:17:03 UTC
Today was my third hang in a week, the second day in a row. I had several windows open on the box and tried to do a live savecore; that got into D-wait pretty quickly. Soon almost anything I did would hang (including the sysctl panic).

Is there a problem with the way I have swap configured on this box that is preventing savecore from working? This server runs zfs on six nvme drives.

zinc 7 % dumpon -l
mirror/gm0p2
zinc 8 % gpart show
=>       40  937703008  nda0  GPT  (447G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048   67108864     2  freebsd-swap  (32G)
   67110912  870590464     3  freebsd-zfs  (415G)
  937701376       1672        - free -  (836K)

=>       40  937703008  nda1  GPT  (447G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048   67108864     2  freebsd-swap  (32G)
   67110912  870590464     3  freebsd-zfs  (415G)
  937701376       1672        - free -  (836K)

=>       40  937703008  nda2  GPT  (447G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048   67108864     2  freebsd-swap  (32G)
   67110912  870590464     3  freebsd-zfs  (415G)
  937701376       1672        - free -  (836K)

=>       40  937703008  nda3  GPT  (447G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048   67108864     2  freebsd-swap  (32G)
   67110912  870590464     3  freebsd-zfs  (415G)
  937701376       1672        - free -  (836K)

=>       40  937703008  nda4  GPT  (447G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048   67108864     2  freebsd-swap  (32G)
   67110912  870590464     3  freebsd-zfs  (415G)
  937701376       1672        - free -  (836K)

=>       40  937703008  nda5  GPT  (447G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048   67108864     2  freebsd-swap  (32G)
   67110912  870590464     3  freebsd-zfs  (415G)
  937701376       1672        - free -  (836K)

zinc 9 % gmirror list
Geom name: gm0p2
State: COMPLETE
Components: 6
Balance: load
Slice: 4096
Flags: NONE
GenID: 0
SyncID: 2
ID: 859694810
Type: AUTOMATIC
Providers:
1. Name: mirror/gm0p2
   Mediasize: 34359737856 (32G)
   Sectorsize: 512
   Mode: r1w1e0
Consumers:
1. Name: nda0p2
   Mediasize: 34359738368 (32G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 1048576
   Mode: r1w1e1
   State: ACTIVE
   Priority: 0
   Flags: (null)
   GenID: 0
   SyncID: 2
   ID: 2317411985
2. Name: nda1p2
   Mediasize: 34359738368 (32G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 1048576
   Mode: r1w1e1
   State: ACTIVE
   Priority: 1
   Flags: (null)
   GenID: 0
   SyncID: 2
   ID: 1611450630
3. Name: nda2p2
   Mediasize: 34359738368 (32G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 1048576
   Mode: r1w1e1
   State: ACTIVE
   Priority: 2
   Flags: (null)
   GenID: 0
   SyncID: 2
   ID: 2208673777
4. Name: nda3p2
   Mediasize: 34359738368 (32G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 1048576
   Mode: r1w1e1
   State: ACTIVE
   Priority: 3
   Flags: (null)
   GenID: 0
   SyncID: 2
   ID: 807113585
5. Name: nda4p2
   Mediasize: 34359738368 (32G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 1048576
   Mode: r1w1e1
   State: ACTIVE
   Priority: 4
   Flags: (null)
   GenID: 0
   SyncID: 2
   ID: 3628942691
6. Name: nda5p2
   Mediasize: 34359738368 (32G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 1048576
   Mode: r1w1e1
   State: ACTIVE
   Priority: 5
   Flags: (null)
   GenID: 0
   SyncID: 2
   ID: 4233446169

zinc 13 % zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 00:04:54 with 0 errors on Wed Dec 10 20:04:54 2025
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nda0p3  ONLINE       0     0     0
            nda1p3  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            nda2p3  ONLINE       0     0     0
            nda3p3  ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            nda4p3  ONLINE       0     0     0
            nda5p3  ONLINE       0     0     0

errors: No known data errors
zinc 14 % zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank  1.21T  1000G   236G        -         -    64%    80%  1.00x    ONLINE  -