Bug 283747 - kernel panic after telegraf service restart
Summary: kernel panic after telegraf service restart
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 14.1-RELEASE
Hardware: amd64 Any
: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2024-12-30 16:43 UTC by Eero Hänninen
Modified: 2025-04-01 14:48 UTC (History)
8 users (show)

See Also:


Attachments
Screenshot of panic (215.64 KB, image/jpeg)
2025-01-07 16:00 UTC, Matthew L. Dailey
no flags Details
Telegraf config (1.85 KB, text/plain)
2025-01-07 16:01 UTC, Matthew L. Dailey
no flags Details
Telegraf config (27.79 KB, text/plain)
2025-01-10 09:05 UTC, Eero Hänninen
no flags Details
crashinfo outpout (484.88 KB, text/plain)
2025-01-13 12:52 UTC, Eero Hänninen
no flags Details
Telegraf ktrace.out before process restart (12.57 KB, application/octet-stream)
2025-01-13 21:09 UTC, Eero Hänninen
no flags Details
Kernel crash info for telegraf ktrace.out (855.02 KB, application/x-bzip2)
2025-01-13 21:12 UTC, Eero Hänninen
no flags Details
telegraf daemon process ktrace.out (3.32 KB, application/octet-stream)
2025-01-22 09:43 UTC, Eero Hänninen
no flags Details
core.txt for telegraf last ktrace.out, compressed with bzip2 (625.71 KB, application/x-bzip)
2025-01-22 09:47 UTC, Eero Hänninen
no flags Details
kernel buffer (1.12 KB, text/plain)
2025-02-11 14:07 UTC, Matthew L. Dailey
no flags Details
backtrace (3.12 KB, text/plain)
2025-02-11 14:08 UTC, Matthew L. Dailey
no flags Details
frame info (3.77 KB, text/plain)
2025-02-11 14:08 UTC, Matthew L. Dailey
no flags Details
283747-long-ref.diff (680 bytes, patch)
2025-03-19 02:59 UTC, Gleb Smirnoff
no flags Details | Diff
Crash data with 14.2-p1 debug kernel (838.13 KB, application/x-bzip2)
2025-03-21 00:40 UTC, Eero Hänninen
no flags Details
ktrace.out.tar.gz (716.62 KB, application/x-gzip)
2025-03-31 22:19 UTC, Eugene Kilimchuk
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Eero Hänninen 2024-12-30 16:43:36 UTC
When server is up about 10 or more days and if I do restart to telegraf service then result is kernel panic. I don't think that this is telegraf itself related but telegraf is a "good" trigger. As far I have tried to reproduce this crash, I have noticed that this happens most of the times on machines with AMD cpu (for example EPYC 7443, EPYC 9354P, EPYC 7282).
I'm not sure, but this panic happens with FreeBSD 14.0 and 14.1 (not tested on 14.2).

Here is crash data:

Dump header from device: /dev/ada9p1
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 25966280704
  Blocksize: 512
  Compression: none
  Dumptime: 2024-12-30 16:54:25 +0200
  Hostname: xx.xx.xx
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 14.1-RELEASE-p5 GENERIC
  Panic String: page fault
  Dump Parity: 1677611570
  Bounds: 0
  Dump Status: good



Unread portion of the kernel message buffer:     
[1748990]                                                                                                                        
[1748990]                                                       
[1748990] Fatal trap 12: page fault while in kernel mode                                                                         
[1748990] cpuid = 0; apic id = 00         
[1748990] fault virtual address = 0x458                        
[1748990] fault code            = supervisor read data, page not present
[1748990] instruction pointer   = 0x20:0xffffffff80b0f0c9                                                                        
[1748990] stack pointer         = 0x28:0xfffffe089f2f3c60
[1748990] frame pointer         = 0x28:0xfffffe089f2f3ce0                                                                        
[1748990] code segment          = base rx0, limit 0xfffff, type 0x1b
[1748990]                       = DPL 0, pres 1, long 1, def32 0, gran 1
[1748990] processor eflags      = interrupt enabled, resume, IOPL = 0
[1748990] current process               = 2 (clock (0))                                                                          
[1748990] rdi: fffff8047720e518 rsi: 0000000000000004 rdx: 0000000000000000
[1748990] rcx: 0000000000000000  r8: 00000000000000bd  r9: fffffe089f2f4000
[1748990] rax: 0000000000000000 rbx: fffff8010813a740 rbp: fffffe089f2f3ce0
[1748990] r10: 0000000000001388 r11: 00000000e836231a r12: 0000000000000000
[1748990] r13: fffff8010813a740 r14: fffffe089f2f3c88 r15: fffff8047720e518
[1748990] trap number           = 12                                                                                             
[1748990] panic: page fault                                     
[1748990] cpuid = 0                                             
[1748990] time = 1735570465                                     
[1748990] KDB: stack backtrace:                                                                                                  
[1748990] #0 0xffffffff80b7fefd at kdb_backtrace+0x5d
[1748990] #1 0xffffffff80b32bd1 at vpanic+0x131                                                                                  
[1748990] #2 0xffffffff80b32a93 at panic+0x43    
[1748990] #3 0xffffffff8100091b at trap_fatal+0x40b                                                                              
[1748990] #4 0xffffffff81000966 at trap_pfault+0x46                                                                              
[1748990] #5 0xffffffff80fd6d48 at calltrap+0x8
[1748990] #6 0xffffffff80b1f6b9 at crfree+0xa9                                                                                   
[1748990] #7 0xffffffff80ceadb4 at in_pcbfree+0x2a4
[1748990] #8 0xffffffff80bd6049 at sorele_locked+0x89                                                                            
[1748990] #9 0xffffffff80d12c20 at tcp_close+0x170        
[1748990] #10 0xffffffff80d1cb09 at tcp_timer_2msl+0xf9                                                                          
[1748990] #11 0xffffffff80d1bb7e at tcp_timer_enter+0xfe
[1748990] #12 0xffffffff80b50b0c at softclock_call_cc+0x12c
[1748990] #13 0xffffffff80b52355 at softclock_thread+0xe5
[1748990] #14 0xffffffff80aecf7f at fork_exit+0x7f
[1748990] #15 0xffffffff80fd7dae at fork_trampoline+0xe
[1748990] Uptime: 20d5h49m50s
[1748990] Dumping 24763 out of 786107 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

Reading symbols from /boot/kernel/mrsas.ko...
Reading symbols from /usr/lib/debug//boot/kernel/mrsas.ko.debug...
Reading symbols from /boot/kernel/amdtemp.ko...
Reading symbols from /usr/lib/debug//boot/kernel/amdtemp.ko.debug...
Reading symbols from /boot/kernel/amdsmn.ko...
Reading symbols from /usr/lib/debug//boot/kernel/amdsmn.ko.debug...
Reading symbols from /boot/kernel/if_igb.ko...
Reading symbols from /usr/lib/debug//boot/kernel/if_em.ko.debug...
Reading symbols from /boot/kernel/accf_dns.ko...
Reading symbols from /usr/lib/debug//boot/kernel/accf_dns.ko.debug...
Reading symbols from /boot/kernel/zfs.ko...
Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...
Reading symbols from /boot/kernel/accf_data.ko...
Reading symbols from /usr/lib/debug//boot/kernel/accf_data.ko.debug...
Reading symbols from /boot/kernel/accf_http.ko...
Reading symbols from /usr/lib/debug//boot/kernel/accf_http.ko.debug...
Reading symbols from /boot/kernel/pflog.ko...
Reading symbols from /usr/lib/debug//boot/kernel/pflog.ko.debug...
Reading symbols from /boot/kernel/pf.ko...
Reading symbols from /usr/lib/debug//boot/kernel/pf.ko.debug...
Reading symbols from /boot/kernel/ipmi.ko...
Reading symbols from /usr/lib/debug//boot/kernel/ipmi.ko.debug...
Reading symbols from /boot/kernel/smbus.ko...
Reading symbols from /usr/lib/debug//boot/kernel/smbus.ko.debug...
Reading symbols from /boot/kernel/intpm.ko...
Reading symbols from /usr/lib/debug//boot/kernel/intpm.ko.debug...
Reading symbols from /boot/kernel/cpuctl.ko...
Reading symbols from /usr/lib/debug//boot/kernel/cpuctl.ko.debug...
Reading symbols from /boot/kernel/if_lagg.ko...
Reading symbols from /usr/lib/debug//boot/kernel/if_lagg.ko.debug...
Reading symbols from /boot/kernel/if_infiniband.ko...
Reading symbols from /usr/lib/debug//boot/kernel/if_infiniband.ko.debug...
Reading symbols from /boot/kernel/uhid.ko...
Reading symbols from /usr/lib/debug//boot/kernel/uhid.ko.debug...
Reading symbols from /boot/kernel/ums.ko...
Reading symbols from /usr/lib/debug//boot/kernel/ums.ko.debug...
Reading symbols from /boot/kernel/usbhid.ko...
Reading symbols from /usr/lib/debug//boot/kernel/usbhid.ko.debug...
Reading symbols from /boot/kernel/hidbus.ko...
Reading symbols from /usr/lib/debug//boot/kernel/hidbus.ko.debug...
Reading symbols from /boot/kernel/if_urndis.ko...
Reading symbols from /usr/lib/debug//boot/kernel/if_urndis.ko.debug...
Reading symbols from /boot/kernel/uether.ko...
Reading symbols from /usr/lib/debug//boot/kernel/uether.ko.debug...
Reading symbols from /boot/kernel/nullfs.ko...
Reading symbols from /usr/lib/debug//boot/kernel/nullfs.ko.debug...
__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
57              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb)
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2024-12-31 03:22:59 UTC
^Triage: apparently this applies to port net-mgmt/telegraf.  Cc: maintainer.
Comment 2 Palle Girgensohn freebsd_committer freebsd_triage 2025-01-06 14:18:04 UTC
Hi!

I use telegraf extensively and have not seen this. What is your telegraf configuration?

Also, are you able to verify that it still exists on 14.2?

Palle
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2025-01-06 23:02:41 UTC
Could you please show the backtrace from core.txt so that we have line numbers?
Comment 4 Matthew L. Dailey 2025-01-07 16:00:22 UTC
I just had this happen this morning on two machines running 14.2-RELEASE. In both cases, I had used pkg to upgrade telegraf from 1.32.3 to 1.33.0. After restarting telegraf, both panicked.

Unfortunately, I didn't have dumps set up properly on either system, but I did manage to catch a screenshot of the panic on one system. It was a "Fatal trap 12: page fault while in kernel mode" I'm attaching the screenshot and our telegraf config.

I was unable to replicate this on a VM that had been up for about the same about of time (~30 days). The only difference I can see if that the VM has no user traffic, while the others have significant nfs and smb traffic.
Comment 5 Matthew L. Dailey 2025-01-07 16:00:52 UTC
Created attachment 256501 [details]
Screenshot of panic
Comment 6 Matthew L. Dailey 2025-01-07 16:01:21 UTC
Created attachment 256502 [details]
Telegraf config
Comment 7 Matthew L. Dailey 2025-01-07 16:10:29 UTC
I should also add that the two machines that panicked for us this morning were both Intel, not AMD - Xeon Gold 5215 and Xeon E5-2650v3
Comment 8 Eero Hänninen 2025-01-10 08:59:11 UTC
Unfortunately there is nothing useful in core.txt.0, just a note about that gdb is not installed.
As this crashed machine went to upgrade 14.2, I moved crash data to another machine, but now I can't debug because error of debug symbols, even if I create clean install 14.1 virtual machine and updated it to 14.1-p6, then I get 

warning: the debug information found in "/usr/lib/debug//boot/kernel/kernel.debug" does not match "/boot/kernel/kernel" (CRC mismatch).
(No debugging symbols found in /boot/kernel/kernel)
'osreldate' has unknown type; cast it to its declared type
(kgdb)

So now I have crash data but can't deal with that and I cant share it in public.

I can confirm that crash happens in 14.2 too. Unfortunately I haven't configured crashdump on that machine (Xeon(R) Silver 4310).
In this week I have upgraded some machines to 14.2, so lets see in a week how its going and hopefully I can get useful crash data.
Comment 9 Eero Hänninen 2025-01-10 09:05:55 UTC
Created attachment 256596 [details]
Telegraf config

Our telegraf configuration - mostly generic and nothing special
Comment 10 Mark Johnston freebsd_committer freebsd_triage 2025-01-10 13:47:59 UTC
(In reply to Eero Hänninen from comment #8)
If you install gdb, you can use crashinfo(8) to get the full core.txt file.

It would be really useful to see the backtrace from gdb - it's hard to see where exactly we're crashing using the info already here.
Comment 11 Gleb Smirnoff freebsd_committer freebsd_triage 2025-01-10 19:20:50 UTC
The panic from Matthew is different to panic from Eero, but let's keep
them in one bug report, cause both are associated with telegraf and
panic is around inp->inp_cred.

The panic addresses are very strange, especially in Eero's case, the
fault address is much bigger than size of cred or offset of cred in
the inpcb.  So it doesn't look like a NULL pointer.

Getting access to vmcore file for me and/or Mark will be very helpful.
Comment 12 Eero Hänninen 2025-01-13 12:51:26 UTC
Finally got this crash info.
Comment 13 Eero Hänninen 2025-01-13 12:52:15 UTC
Created attachment 256664 [details]
crashinfo outpout
Comment 14 Gleb Smirnoff freebsd_committer freebsd_triage 2025-01-13 14:01:11 UTC
Thanks, Eero! This gives a bit more of insight.  So we see that the actual
panic happens in kern_mutex.c:592:

                owner = lv_mtx_owner(v);
                if (TD_IS_RUNNING(owner)) {

The TD_IS_RUNNING() dereferences td->td_state, which is exactly at 0x458 offset
inside struct thread.  So, lv_mtx_owner() gave us NULL on the previous line.
But a few lines above we see that the lock value is not MTX_UNOWNED.  This
means that the value has one of the flags set: (MTX_RECURSED | MTX_CONTESTED |
MTX_DESTROYED).  Apparently it is MTX_DESTROYED, cause other flags are interal
to the mutex subsystem and having them on would mean there is a bug in mutex
itself and we would see all different kinds of a panics.

So, we got valid inpcb pointer with valid cred pointer, but the cred itself has
already went through crfree_final().  We got a step closer to understanding the
problem.  Myself or Mark might come up with an idea, but getting more
information would be very helpful.  If you find a way to reproduce that, please
let us know.
Comment 15 Mark Johnston freebsd_committer freebsd_triage 2025-01-13 14:36:44 UTC
(In reply to Gleb Smirnoff from comment #14)
If possible, it would also be useful to try testing a debug kernel.  Compiling a GENERIC kernel from the main branch would work, or a GENERIC-DEBUG kernel from 14.2 (new in 14.2, not available in 14.1).
Comment 16 Eero Hänninen 2025-01-13 18:47:58 UTC
Thanks for the explanation!
Currently I don't know of any instant way to reproduce this panic. As I wrote before, in my case if telegraf process is older than 7 or 10 days and then I restart process, then this panic happens - sometimes there is about 30 - 60 seconds delay between telegraf restart and kernel panic. 
I'm not sure, but this panic seems to appears from 14.0 release. I'm not sure because our machines do not have crash dump configured by default, but pattern was the same. About 14.2 I can't confirm yet, but by the end of this week some machines will exceed the so-called critical time point, so we'll see if they crash with 14.2 or not.
I will boot known "crash servers" to debug kernel, so I will get more detailed data.
If you need, I have 3 more crash info samples - but to my eye it seems to be same.
Comment 17 Gleb Smirnoff freebsd_committer freebsd_triage 2025-01-13 19:50:58 UTC
My current theory is that telegraf closes active connection somehow different
than other applications, which creates conditions to hit this bug.  Otherwise
we would see this panic with all kinds of software.  The 30-60 period that you
see is the TCP 2MSL timeout, and that matches the stack trace that you have.
It happens in garbage collecting of TCP connections in timewait.

If you have possibility, can you please intentionally panic a machine, but
first run ktrace on the telegraf.  Assuming telegraf PID is 6666, the command
would be 'ktrace -p 6666'.  It will write to ktrace.out that would be located
in the current directory, so make sure it is not /tmp or other directory that
is cleansed on startup.  Since you are going to intentionally panic the machine
anyway, would be also very helpful if you write kernel crashdump that can be
later shared with me and Mark.
Comment 18 Eero Hänninen 2025-01-13 21:09:13 UTC
Created attachment 256673 [details]
Telegraf ktrace.out before process restart

Telegraf pid was 18418 if its important
Comment 19 Eero Hänninen 2025-01-13 21:12:50 UTC
Created attachment 256674 [details]
Kernel crash info for telegraf ktrace.out

Kernel crash info was unfortunately too big, so its compressed with bzip2
Comment 20 Gleb Smirnoff freebsd_committer freebsd_triage 2025-01-13 21:18:05 UTC
On Mon Jan 13 21:09:13  2025 UTC, fax@nohik.ee wrote:
> Created attachment 256673 [details]
> Telegraf ktrace.out before process restart
> Telegraf pid was 18418 if its important

But the ktrace.out has trace for PID 18727 (also telegraf).  How did that
happen?  The ktrace.out doesn't have a single syscall on a socket, neither
has call to exit the program.  Maybe somehow traced a wrong PID?
Comment 21 Eero Hänninen 2025-01-13 21:28:22 UTC
(In reply to Gleb Smirnoff from comment #20)
Sorry I was paste here wrong pid, as telegraf has two pids in process table, so I have done trace for telegraf process not for daemon process which pid I accidentally pasted here.

# ps xauwf | grep telegraf
telegraf      47615    1.7  0.0  5867944   216864  -  S    22:54     0:48.38 /usr/local/bin/telegraf --quiet --config=/usr/local/etc/telegraf.conf
root          47278    0.0  0.0    12828     2268  -  Is   22:54     0:00.00 daemon: /usr/local/bin/telegraf[47615] (daemon)

should I trace daemon process too ?
Comment 22 Gleb Smirnoff freebsd_committer freebsd_triage 2025-01-13 22:05:53 UTC
On Mon Jan 13 21:28:22  2025 UTC, fax@nohik.ee wrote:
> should I trace daemon process too ?

Yes, this definitely is the daemon process that owns the socket, that later
goes into TCP timewait state and references a freed struct cred.
Comment 23 Eero Hänninen 2025-01-14 11:19:58 UTC
Tried 4 times but can't get proper ktrace for daemon process. If crash happens, then after boot I see 0 length ktrace file. Probably ktrace writes to buffer and its not synced on disk (zfs). Should I try with ufs ?
Comment 24 Gleb Smirnoff freebsd_committer freebsd_triage 2025-01-14 21:41:49 UTC
On Tue Jan 14 11:19:58  2025 UTC, fax@nohik.ee wrote:
> then after boot I see 0 length ktrace file.

If the traced process exits (e.g. you no longer observe its PID in the list of
processes), then ktrace(2) definitely writes everything to the file system. Can
we make sure that telegraf indeed exists quickly after restart and doesn't the
old PID remains live until the panic?  If it does exit, then it is ZFS that
needs to be pushed to sync.  I will try to find answer how to do that.
Comment 25 Gleb Smirnoff freebsd_committer freebsd_triage 2025-01-14 21:43:13 UTC
Can you please try just these two commands quickly after telegraf exited:

# sync && zpool sync
Comment 26 Gleb Smirnoff freebsd_committer freebsd_triage 2025-01-15 01:40:58 UTC
Got another advice from Allan to force sync on a ZFS dataset:

# zfs set sync=always $dataset

But this all makes sense after we make sure that the process exits.  Cause
if it does not, then the ktrace system itself may keep tracing in buffers
for some time.

Also, I would second on Mark's suggestion to try GENERIC-DEBUG kernel from
14.2-RELEASE, or to compile kernel with 'options INVARIANTS' on the existing
installation.  Cause we are trying to lock a destroyed mutex it could that
with INVARIANTS we will crash earlier, and that earlier crash will give more
insight on the problem.
Comment 27 Eero Hänninen 2025-01-22 09:43:39 UTC
Created attachment 256894 [details]
telegraf daemon process ktrace.out

Finally got ktrace.out, added sync && zpool sync to telegraf rc script to the end of stop() function.
Comment 28 Eero Hänninen 2025-01-22 09:47:17 UTC
Created attachment 256895 [details]
core.txt for telegraf last ktrace.out, compressed with bzip2
Comment 29 Mark Johnston freebsd_committer freebsd_triage 2025-01-22 20:56:45 UTC
I think we really need to see output from a GENERIC-DEBUG kernel.  All the crash reports I see are from 14.1-RELEASE GENERIC, which is a non-debug kernel.

If you copy sys/amd64/conf/GENERIC-DEBUG and sys/conf/std.debug from a 14.2 tree into your 14.1 tree, you should be able to build a debug kernel with the usual "make buildkernel KERNCONF=GENERIC-DEBUG".
Comment 30 Gleb Smirnoff freebsd_committer freebsd_triage 2025-01-23 05:01:04 UTC
Eero, unfortunately there is nothing special in the ktrace that records the end
of the telegraf process :( It just closes all sockets and exits. It could be a
cred reference underleak outside of inpcb code actually. I second Mark's
suggestion on reproducing the problem on GENERIC-DEBUG. Also, if you can work
on a reproduce recipe in a test environment, that would help a lot.  E.g. start
a fresh virtual machine, install the telegraf package, do this and that until
it crashes.
Comment 31 Matthew L. Dailey 2025-02-11 14:07:09 UTC
I generated a panic on my test machine this morning that had been up for 12 days. It's at 14.2-RELEASE with a DEBUG kernel. Telegraf is at 1.33.0. The panic happened immediately after issuing "service telegraf restart".

I'm attaching the kernel buffer, backtrace and info on the last frame before the panic. I hope this is helpful.

I have not found a way to reproduce this without waiting 10 days or so.
Comment 32 Matthew L. Dailey 2025-02-11 14:07:59 UTC
Created attachment 257411 [details]
kernel buffer
Comment 33 Matthew L. Dailey 2025-02-11 14:08:26 UTC
Created attachment 257412 [details]
backtrace
Comment 34 Matthew L. Dailey 2025-02-11 14:08:45 UTC
Created attachment 257413 [details]
frame info
Comment 35 Gleb Smirnoff freebsd_committer freebsd_triage 2025-02-11 19:15:44 UTC
Matthew, is it possible to share the kernel core with me & Mark?

You can use my PGP key to encrypt it (or a private email that has a
sensitive link).

https://docs.freebsd.org/en/articles/pgpkeys/#_gleb_smirnoff_glebiusfreebsd_org
https://docs.freebsd.org/en/articles/pgpkeys/#_mark_johnston_markjfreebsd_org
Comment 36 Egor 2025-03-06 16:38:14 UTC
Hello, I have the same problem. After I reloaded telegraf service and my freebsd machines, which ere about 10, rebooted with kernel panic. On the machines telegraf uptime was over 10 days.

Freebsd version: 14.1p5
telegraf version: 1.31_1 or 1.32

I got this log and it's the same for the machines.



Mar  6 10:41:23 test-fw01 kernel: Fatal trap 12: page fault while in kernel mode
Mar  6 10:41:23 test-fw01 kernel: cpuid = 8; apic id = 08
Mar  6 10:41:23 test-fw01 kernel: fault virtual address  = 0x5bb5ce
Mar  6 10:41:23 test-fw01 kernel: fault code             = supervisor read data, page not present
Mar  6 10:41:23 test-fw01 kernel: instruction pointer    = 0x20:0xffffffff80af76f0
Mar  6 10:41:23 test-fw01 kernel: stack pointer          = 0x28:0xfffffe01703b5750
Mar  6 10:41:23 test-fw01 kernel: frame pointer          = 0x28:0xfffffe01703b5750
Mar  6 10:41:23 test-fw01 kernel: code segment           = base rx0, limit 0xfffff, type 0x1b
Mar  6 10:41:23 test-fw01 kernel:                        = DPL 0, pres 1, long 1, def32 0, gran 1
Mar  6 10:41:23 test-fw01 kernel: processor eflags       = interrupt enabled, resume, IOPL = 0
Mar  6 10:41:23 test-fw01 kernel: current process                = 86139 (atop)
Mar  6 10:41:23 test-fw01 kernel: rdi: fffff8037ac3e800 rsi: fffff8000410fb00 rdx: 0000000000000000
Mar  6 10:41:23 test-fw01 kernel: rcx: ffffffff818c6b40  r8: 0000000000000002  r9: ffffffffffffffff
Mar  6 10:41:23 test-fw01 kernel: rax: 00000000005bb58e rbx: fffff8037ac3e800 rbp: fffffe01703b5750
Mar  6 10:41:23 test-fw01 kernel: r10: 0000205f6af445a8 r11: fffff802149c9c60 r12: fffffe01703b57e8
Mar  6 10:41:23 test-fw01 kernel: r13: fffff802149c9740 r14: fffff800b3f22000 r15: fffffe01703b5780
Mar  6 10:41:23 test-fw01 kernel: trap number            = 12
Mar  6 10:41:23 test-fw01 kernel: panic: page fault
Mar  6 10:41:23 test-fw01 kernel: cpuid = 8
Mar  6 10:41:23 test-fw01 kernel: time = 1741257554
Mar  6 10:41:23 test-fw01 kernel: KDB: stack backtrace:
Mar  6 10:41:23 test-fw01 kernel: #0 0xffffffff80b7fefd at kdb_backtrace+0x5d
Mar  6 10:41:23 test-fw01 kernel: #1 0xffffffff80b32bd1 at vpanic+0x131
Mar  6 10:41:23 test-fw01 kernel: #2 0xffffffff80b32a93 at panic+0x43
Mar  6 10:41:23 test-fw01 kernel: #3 0xffffffff8100091b at trap_fatal+0x40b
Mar  6 10:41:23 test-fw01 kernel: #4 0xffffffff81000966 at trap_pfault+0x46
Mar  6 10:41:23 test-fw01 kernel: #5 0xffffffff80fd6d48 at calltrap+0x8
Mar  6 10:41:23 test-fw01 kernel: #6 0xffffffff80cec7e6 at cr_canseeinpcb+0x16
Mar  6 10:41:23 test-fw01 kernel: #7 0xffffffff80d16826 at tcp_pcblist+0x1f6
Mar  6 10:41:23 test-fw01 kernel: #8 0xffffffff80b443f1 at sysctl_root_handler_locked+0x91
Mar  6 10:41:23 test-fw01 kernel: #9 0xffffffff80b4380e at sysctl_root+0x23e
Mar  6 10:41:23 test-fw01 kernel: #10 0xffffffff80b43e7e at userland_sysctl+0x15e
Mar  6 10:41:23 test-fw01 kernel: #11 0xffffffff80b4422d at kern___sysctlbyname+0x21d
Mar  6 10:41:23 test-fw01 kernel: #12 0xffffffff80b4426d at sys___sysctlbyname+0x2d
Mar  6 10:41:23 test-fw01 kernel: #13 0xffffffff810011c0 at amd64_syscall+0x100
Mar  6 10:41:23 test-fw01 kernel: #14 0xffffffff80fd765b at fast_syscall_common+0xf8
Mar  6 10:41:23 test-fw01 kernel: Uptime: 148d12h45m59s
Mar  6 10:41:23 test-fw01 kernel: Automatic reboot in 15 seconds - press a key on the console to abort
Mar  6 10:41:23 test-fw01 kernel: Rebooting...
Comment 37 Gleb Smirnoff freebsd_committer freebsd_triage 2025-03-19 02:59:20 UTC
Created attachment 258804 [details]
283747-long-ref.diff

Suggested patch
Comment 38 Gleb Smirnoff freebsd_committer freebsd_triage 2025-03-19 03:02:23 UTC
Thanks for Matthew for providing a core with INVARIANTS.  I have a guess
here.  Not 100% sure this is the root cause, but > 50% sure.

All who are affected, please patch your systems with patch from the above
attachment and let's see how it goes.  Running with INVARIANTS and patch
is also recommended, so that if system panics again, there will be more
information in the core.

Thanks and sorry for being slow on this.
Comment 39 Matthew L. Dailey 2025-03-19 13:32:09 UTC
Thanks, Gleb - this is fabulous! I'll try to get a kernel compiled with the patch and start testing soon. Given what we know so far, it will be at least 10 days before I'll know if this works, and will want to test far longer to be sure.

Given what you've learned about this, is there anything you can think of to accelerate the process or drive the issue to make it happen more quickly?
Comment 40 Gleb Smirnoff freebsd_committer freebsd_triage 2025-03-19 17:18:51 UTC
Matthew, this is just a hypothesis, not a confident fix.  Let's see how it
goes.

The guess is that telegraf would spawn lots of children for its purposes and
sometimes leaks a zombie.  Those zombies accumulate and each carry large
number of credential reference counts.  Given that cr_refs was extended to
long, but credbatch refs was forgotten, and the latter accumulates counts
from several crs, an overflow is possible.  But the overflow will be triggered
only when telegraf is killed and all leaked zombies are reaped.
Comment 41 Matthew L. Dailey 2025-03-19 19:17:21 UTC
Thanks for the explanation, Gleb. It sounds like this is definitely worth fixing, regardless of whether it's the cause of this specific bug. :-)

FWIW, neither of my test servers (uptime ~8 days with telegraf running constantly) showed any zombies:
# ps aux | grep Z | grep -v grep
USER   PID  %CPU %MEM     VSZ    RSS TT  STAT STARTED        TIME COMMAND

I was just now able to panic my hardware system (took just over 8 days) and will send a core the same way I did before. This is obviously from the unpatched kernel. The last frame before the panic is in crunusebatch() so this looks promising.

Unread portion of the kernel message buffer:
[691684] panic: crunusebatch: ref -4294967294 not >= 0 on cred 0xfffff8011b43cb00
[691684] cpuid = 17
[691684] time = 1742410811
[691684] KDB: stack backtrace:
[691684] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00d3bfa870
[691684] vpanic() at vpanic+0x131/frame 0xfffffe00d3bfa9a0
[691684] panic() at panic+0x43/frame 0xfffffe00d3bfaa00
[691684] crunusebatch() at crunusebatch+0xfa/frame 0xfffffe00d3bfaa30
[691684] thread_reap_domain() at thread_reap_domain+0x28d/frame 0xfffffe00d3bfaae0
[691684] proc_reap() at proc_reap+0x660/frame 0xfffffe00d3bfab20
[691684] proc_to_reap() at proc_to_reap+0x3c4/frame 0xfffffe00d3bfab70
[691684] kern_wait6() at kern_wait6+0x1a6/frame 0xfffffe00d3bfac10
[691684] sys_wait4() at sys_wait4+0x6b/frame 0xfffffe00d3bfae00
[691684] amd64_syscall() at amd64_syscall+0x158/frame 0xfffffe00d3bfaf30
[691684] fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00d3bfaf30
[691684] --- syscall (7, FreeBSD ELF64, wait4), rip = 0x2893da, rsp = 0x821093d18, rbp = 0x821093d80 ---
[691684] KDB: enter: panic

Restarted both test systems (hardware and VM) with patched DEBUG kernels and will report back in 10 days or so.
Comment 42 Matthew L. Dailey 2025-03-19 19:17:23 UTC
Thanks for the explanation, Gleb. It sounds like this is definitely worth fixing, regardless of whether it's the cause of this specific bug. :-)

FWIW, neither of my test servers (uptime ~8 days with telegraf running constantly) showed any zombies:
# ps aux | grep Z | grep -v grep
USER   PID  %CPU %MEM     VSZ    RSS TT  STAT STARTED        TIME COMMAND

I was just now able to panic my hardware system (took just over 8 days) and will send a core the same way I did before. This is obviously from the unpatched kernel. The last frame before the panic is in crunusebatch() so this looks promising.

Unread portion of the kernel message buffer:
[691684] panic: crunusebatch: ref -4294967294 not >= 0 on cred 0xfffff8011b43cb00
[691684] cpuid = 17
[691684] time = 1742410811
[691684] KDB: stack backtrace:
[691684] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00d3bfa870
[691684] vpanic() at vpanic+0x131/frame 0xfffffe00d3bfa9a0
[691684] panic() at panic+0x43/frame 0xfffffe00d3bfaa00
[691684] crunusebatch() at crunusebatch+0xfa/frame 0xfffffe00d3bfaa30
[691684] thread_reap_domain() at thread_reap_domain+0x28d/frame 0xfffffe00d3bfaae0
[691684] proc_reap() at proc_reap+0x660/frame 0xfffffe00d3bfab20
[691684] proc_to_reap() at proc_to_reap+0x3c4/frame 0xfffffe00d3bfab70
[691684] kern_wait6() at kern_wait6+0x1a6/frame 0xfffffe00d3bfac10
[691684] sys_wait4() at sys_wait4+0x6b/frame 0xfffffe00d3bfae00
[691684] amd64_syscall() at amd64_syscall+0x158/frame 0xfffffe00d3bfaf30
[691684] fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00d3bfaf30
[691684] --- syscall (7, FreeBSD ELF64, wait4), rip = 0x2893da, rsp = 0x821093d18, rbp = 0x821093d80 ---
[691684] KDB: enter: panic

Restarted both test systems (hardware and VM) with patched DEBUG kernels and will report back in 10 days or so.
Comment 43 Eero Hänninen 2025-03-21 00:40:14 UTC
Created attachment 258859 [details]
Crash data with 14.2-p1 debug kernel

I compiled a debug kernel with the patch (14.2-RELEASE-p1 or p2 - kernel is same) and restarted a heavily loaded machine with it. Within almost an hour, I encountered this crash. I noticed that the crash message changed from 'page fault' to 'panic: tcp_do_segment: sent too much.'"
Comment 44 Eero Hänninen 2025-03-21 00:52:24 UTC
I forgot to mention that this time the crash happened on its own, without any service restart, not even Telegraf.
Comment 45 Gleb Smirnoff freebsd_committer freebsd_triage 2025-03-21 05:10:01 UTC
On Fri Mar 21 00:40:14  2025 UTC, fax@nohik.ee wrote:
> I compiled a debug kernel with the patch (14.2-RELEASE-p1 or p2 - kernel is
> same) and restarted a heavily loaded machine with it. Within almost an hour, I
> encountered this crash. I noticed that the crash message changed from 'page
> fault' to 'panic: tcp_do_segment: sent too much.'"

Eero, this is completely different problem, and good news is that it is finally
is being worked on. See bug 282605. The patch has been committed to CURRENT
yesterday and will get into stable/14 real soon.  Meanwhile you can add the
patch to your build, grab it here:

https://github.com/freebsd/freebsd-src/commit/fbcf3b74e8f2c0c5ba37f1839bfe9395eb2fd0b1.patch

Very sorry that two most difficult bugs of 14.2-RELEASE have hit you.
Comment 46 mike 2025-03-28 13:10:53 UTC
I think I am hitting this bug as well periodically.  Of the 3 times I saw it, once back in November on zoo.freebsd.org while updating telegraf, and then on a VM running stable from Jan 31s and then on a FreeBSD 14.2R box a week or so ago. 

Any ideas how to force the issue ? I had a test box running on an EPYC 7302P, running 12hrs 
while true
do
service telegraf restart
sleep $(jot -r 1 16 95)
done
but nothing.  Reading through this bug thread, it seems maybe the box needs to be up a while ?


prior to the crash this popped up on each of the machines

Mar  7 09:38:18 pf-cam2 kernel: freeing uidinfo: uid = 884, sbsize = 115664


Mar  7 09:42:17 pf-cam2 kernel: Fatal trap 12: page fault while in kernel mode
Mar  7 09:42:17 pf-cam2 kernel: cpuid = 0; apic id = 00
Mar  7 09:42:17 pf-cam2 kernel: fault virtual address      = 0x18
Mar  7 09:42:17 pf-cam2 kernel: fault code         = supervisor write data, page not present
Mar  7 09:42:17 pf-cam2 kernel: instruction pointer        = 0x20:0xffffffff80b36899
Mar  7 09:42:17 pf-cam2 kernel: stack pointer              = 0x28:0xfffffe0116613d18
Mar  7 09:42:17 pf-cam2 kernel: frame pointer              = 0x28:0xfffffe0116613d30
Mar  7 09:42:17 pf-cam2 kernel: code segment               = base rx0, limit 0xfffff, type 0x1b
Mar  7 09:42:17 pf-cam2 kernel:                    = DPL 0, pres 1, long 1, def32 0, gran 1
Mar  7 09:42:17 pf-cam2 kernel: processor eflags   = interrupt enabled, resume, IOPL = 0
Mar  7 09:42:17 pf-cam2 kernel: current process            = 2 (clock (0))
Mar  7 09:42:17 pf-cam2 kernel: rdi: 0000000000000000 rsi: fffff8003faf3aec rdx: 0000000000000000
Mar  7 09:42:17 pf-cam2 kernel: rcx: 7fffffffffffffff  r8: ffffffffffff3cd4  r9: 00000000ffff3cd4
Mar  7 09:42:17 pf-cam2 kernel: rax: 0000000000000000 rbx: fffff8003faf3ad0 rbp: fffffe0116613d30
Mar  7 09:42:17 pf-cam2 kernel: r10: ffffffffffff3cd4 r11: 00000000671a0b2a r12: 000f000000000000
Mar  7 09:42:17 pf-cam2 kernel: r13: fffff806da7b0aa0 r14: fffff8003faf3800 r15: fffff80001c4a740
Mar  7 09:42:17 pf-cam2 kernel: trap number                = 12
Mar  7 09:42:17 pf-cam2 kernel: panic: page fault
Mar  7 09:42:17 pf-cam2 kernel: cpuid = 0
Mar  7 09:42:17 pf-cam2 kernel: time = 1741358358
Mar  7 09:42:17 pf-cam2 kernel: KDB: stack backtrace:
Mar  7 09:42:17 pf-cam2 kernel: #0 0xffffffff80b8b89d at kdb_backtrace+0x5d
Mar  7 09:42:17 pf-cam2 kernel: #1 0xffffffff80b3dc01 at vpanic+0x131
Mar  7 09:42:17 pf-cam2 kernel: #2 0xffffffff80b3dac3 at panic+0x43
Mar  7 09:42:17 pf-cam2 kernel: #3 0xffffffff81025a0b at trap_fatal+0x40b
Mar  7 09:42:17 pf-cam2 kernel: #4 0xffffffff81025a56 at trap_pfault+0x46
Mar  7 09:42:17 pf-cam2 kernel: #5 0xffffffff80ffc398 at calltrap+0x8
Mar  7 09:42:17 pf-cam2 kernel: #6 0xffffffff80be2b66 at sorele_locked+0xa6
Mar  7 09:42:17 pf-cam2 kernel: #7 0xffffffff80d21ae0 at tcp_close+0x170
Mar  7 09:42:17 pf-cam2 kernel: #8 0xffffffff80d2b9b9 at tcp_timer_2msl+0xf9
Mar  7 09:42:17 pf-cam2 kernel: #9 0xffffffff80d2aa2e at tcp_timer_enter+0xfe
Mar  7 09:42:17 pf-cam2 kernel: #10 0xffffffff80b5bc0c at softclock_call_cc+0x12c
Mar  7 09:42:17 pf-cam2 kernel: #11 0xffffffff80b5d455 at softclock_thread+0xe5
Mar  7 09:42:17 pf-cam2 kernel: #12 0xffffffff80af760f at fork_exit+0x7f
Mar  7 09:42:17 pf-cam2 kernel: #13 0xffffffff80ffd3fe at fork_trampoline+0xe
Mar  7 09:42:17 pf-cam2 kernel: Uptime: 94d14h13m41s
Mar  7 09:42:17 pf-cam2 kernel: Dumping 4577 out of 65260 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%---<<BOOT>>---


and just now in a VM a similar sort of pattern.  uid 884 is telegraf and I was mucking about with the configuration.

And on a RELENG_14 box (kernel from Jan 31st)

Mar 26 17:09:43 pf-firewall kernel: freeing uidinfo: uid = 884, sbsize = 99600
Mar 26 17:11:26 pf-firewall syslogd: kernel boot file is /boot/kernel/kernel
Mar 26 17:11:26 pf-firewall kernel: panic: vm_fault_lookup: fault on nofault entry, addr: 0xffffffff80aea000
Mar 26 17:11:26 pf-firewall kernel: cpuid = 0
Mar 26 17:11:26 pf-firewall kernel: time = 1743023443
Mar 26 17:11:26 pf-firewall kernel: KDB: stack backtrace:
Mar 26 17:11:26 pf-firewall kernel: #0 0xffffffff80b8d13d at kdb_backtrace+0x5d
Mar 26 17:11:26 pf-firewall kernel: #1 0xffffffff80b3ef01 at vpanic+0x161
Mar 26 17:11:26 pf-firewall kernel: #2 0xffffffff80b3ed93 at panic+0x43
Mar 26 17:11:26 pf-firewall kernel: #3 0xffffffff80ebcf67 at vm_fault+0x15b7
Mar 26 17:11:26 pf-firewall kernel: #4 0xffffffff80ebb8c1 at vm_fault_trap+0x81
Mar 26 17:11:26 pf-firewall kernel: #5 0xffffffff8102c97a at trap_pfault+0x20a
Mar 26 17:11:26 pf-firewall kernel: #6 0xffffffff81003748 at calltrap+0x8
Mar 26 17:11:26 pf-firewall kernel: #7 0xffffffff80b2d837 at crfree_final+0x17
Mar 26 17:11:26 pf-firewall kernel: #8 0xffffffff80cfc156 at in_pcbfree+0x2a6
Mar 26 17:11:26 pf-firewall kernel: #9 0xffffffff80be4229 at sorele_locked+0x89
Mar 26 17:11:26 pf-firewall kernel: #10 0xffffffff80d242b0 at tcp_close+0x170
Mar 26 17:11:26 pf-firewall kernel: #11 0xffffffff80d2e23b at tcp_timer_2msl+0xfb
Mar 26 17:11:26 pf-firewall kernel: #12 0xffffffff80d2d2b4 at tcp_timer_enter+0x104
Mar 26 17:11:26 pf-firewall kernel: #13 0xffffffff80b5d139 at softclock_call_cc+0x129
Mar 26 17:11:26 pf-firewall kernel: #14 0xffffffff80b5e905 at softclock_thread+0xe5
Mar 26 17:11:26 pf-firewall kernel: #15 0xffffffff80af8471 at fork_exit+0x81
Mar 26 17:11:26 pf-firewall kernel: #16 0xffffffff810047ae at fork_trampoline+0xe
Mar 26 17:11:26 pf-firewall kernel: Uptime: 54d4h48m5s
Mar 26 17:11:26 pf-firewall kernel: Automatic reboot in 15 seconds - press a key on the console to abort
Mar 26 17:11:26 pf-firewall kernel: Rebooting...
Comment 47 Gleb Smirnoff freebsd_committer freebsd_triage 2025-03-28 18:06:03 UTC
Mike, my current hypothesis is that we have a 32-bit overflow in credential
reference counting.  The overflow happens, when we reap a group of processes,
and reference counts of the group summed up together overflow.  AFAIU, telegraf
will fork+exec arbitrary programs, which in their turn can also fork+exec more
programs.  While telegraf itself seems to do proper wait(2)-ing on zombies, but
some external program may leak zombies, and do not exit itself.  Then, when
telegraf is restarted, this pack of zombies is reaped and this is where
overflow could be hit.

This is fixed by attachment 258804 [details].  I am not sure in my hypothesis, that's why
it is not even committed to CURRENT.  However, everyone affected by the bug are
advices to use this patch and let's see what happens.  We still have some time
before 14.3.  I will probably start review process to get it into CURRENT,
anyway.

With this info, you may have some idea on how to reproduce it.  I know, you are
good at chasing bugs, Mike :) Sorry that it hits you, but I'm glad that you
joined the team chasing this bug.
Comment 48 mike 2025-03-28 18:13:25 UTC
(In reply to Gleb Smirnoff from comment #47)

Thank you for the vote of bug confidence Gleb :)  I will have a look at the 3 instances where this bug seems to have happened and look to see if there is anything in common on what I monitor. On my test box, I will try and load it up with a few extra checks, and especially ones that call external programs to see if I can randomly trigger it on my test machine.
Comment 49 Matthew L. Dailey 2025-03-29 22:06:52 UTC
I restarted telegraf on my two test machines (one hardware, one VM), both running with Gleb's proposed patch, and neither one panicked! They had both been up for just over 10 days, running telegraf constantly. I've never been able to reproduce this on the VM, but the hardware one has always consistently panicked after being up running telegraf for >8 days.

It's not quite definitive, but it certainly seems like the patch might have fixed this one.
Comment 50 commit-hook freebsd_committer freebsd_triage 2025-03-29 22:10:50 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=cd46e980134f6fc765b28ee9c8bf41e8fc1b0261

commit cd46e980134f6fc765b28ee9c8bf41e8fc1b0261
Author:     Gleb Smirnoff <glebius@FreeBSD.org>
AuthorDate: 2025-03-29 22:09:15 +0000
Commit:     Gleb Smirnoff <glebius@FreeBSD.org>
CommitDate: 2025-03-29 22:09:15 +0000

    cred: fix struct credbatch to use long for refcount

    This structure collects count from multiple cred structures.  Of course it
    can't use a smaller type.

    PR:                     283747
    Reviewed by:            olce, mjg, markj
    Differential Revision:  https://reviews.freebsd.org/D49562
    Fixes:                  37337709d3334f32650ba3a7c529fa013ed5e1f2

 sys/kern/kern_prot.c | 2 +-
 sys/sys/ucred.h      | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
Comment 51 Eugene Kilimchuk 2025-03-31 22:19:52 UTC
Created attachment 259233 [details]
ktrace.out.tar.gz
Comment 52 Eugene Kilimchuk 2025-03-31 22:21:05 UTC
Hi!

I reproduced the bug in 14.2 with debugging. I wrote similar Go code to reproduce it:
https://github.com/ekilimchuk/junk/blob/main/fork-test/README.md

After waiting for a week, I reproduced the issue (crunuse’s ref overflow).
 
ekilimchuk@test-fw-debug01:~$ cat core/crash/info.0
Dump header from device: /dev/da1
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 304893952
  Blocksize: 512
  Compression: none
  Dumptime: 2025-03-31 10:25:04 +0000
  Hostname: test-fw-debug01
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 14.2-RELEASE-p2 releng/14.2-n269518-ac2cbb46b5f1 GENERIC-DEBUG
  Panic String: crunuse: ref -4294967295 not > 0 on cred 0xfffff8002f045300
  Dump Parity: 4010253368
  Bounds: 0
  Dump Status: good

Stack trace:
(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff80b4c960 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:523
#3  0xffffffff80b4ce5e in vpanic (fmt=0xffffffff811bd95e "%s: ref %ld not > 0 on cred %p",
    ap=ap@entry=0xfffffe00037aba60) at /usr/src/sys/kern/kern_shutdown.c:967
#4  0xffffffff80b4cc03 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:891
#5  0xffffffff80b39a15 in crunuse (td=0xfffff800457a7740) at /usr/src/sys/kern/kern_prot.c:1927
#6  0xffffffff80b39909 in crcowfree (td=<unavailable>, td@entry=0xfffff800457a7740)
    at /usr/src/sys/kern/kern_prot.c:1969
#7  0xffffffff80b657bc in thread_cow_free (td=0xfffff800457a7740)
    at /usr/src/sys/kern/kern_thread.c:880
#8  thread_wait (p=p@entry=0xfffffe00546adae0) at /usr/src/sys/kern/kern_thread.c:1055
#9  0xffffffff80afe430 in proc_reap (td=td@entry=0xfffff800035c0000,
    p=p@entry=0xfffffe00546adae0, status=status@entry=0xfffffe00037abde4, options=<optimized out>)
    at /usr/src/sys/kern/kern_exit.c:1033
#10 0xffffffff80afe874 in proc_to_reap (td=td@entry=0xfffff800035c0000,
    p=p@entry=0xfffffe00546adae0, idtype=idtype@entry=P_ALL, id=id@entry=0,
    status=status@entry=0xfffffe00037abde4, options=options@entry=48, wrusage=0x0, siginfo=0x0,
    check_only=0) at /usr/src/sys/kern/kern_exit.c:1197
#11 0xffffffff80afd9d6 in kern_wait6 (td=td@entry=0xfffff800035c0000, idtype=P_ALL, id=0,
    status=status@entry=0xfffffe00037abde4, options=48, wrusage=0x0, siginfo=0x0)
    at /usr/src/sys/kern/kern_exit.c:1326
#12 0xffffffff80afd5fb in kern_wait (status=0xfffffe00037abde4, options=<unavailable>,
    td=<optimized out>, pid=<optimized out>, rusage=<optimized out>)
    at /usr/src/sys/kern/kern_exit.c:1238
#13 sys_wait4 (td=0xfffff800035c0000, uap=0xfffff800035c0400) at /usr/src/sys/kern/kern_exit.c:864
#14 0xffffffff81069781 in syscallenter (td=0xfffff800035c0000)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#15 amd64_syscall (td=0xfffff800035c0000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1194
#16 <signal handler called>
#17 0x00000000002893da in ?? ()

Also see a ktrace: https://bugs.freebsd.org/bugzilla/attachment.cgi?id=259233

I have a core dump with invariants and am ready to share the dump. I’ll also try to speed up reproducing the issue. I think it’s related to how Golang specifically handles execv.
Comment 53 Eugene Kilimchuk 2025-04-01 14:45:05 UTC
Hi!

I've written a multitask version with a workpool. You can checkout from the multitask branch and run the test.
Comment 54 mike 2025-04-01 14:48:10 UTC
(In reply to Eugene Kilimchuk from comment #53)
Very cool work!  For telegraf, is there a way to tell ahead of time if the box "ready" to panic after a restart ? I have a couple of machines running with INVARIANTS for about 3 days now.  I loaded up telegraf to spawn a number of ping processes as well as external scripts.  Or better to just wait a week or so ?