When server is up about 10 or more days and if I do restart to telegraf service then result is kernel panic. I don't think that this is telegraf itself related but telegraf is a "good" trigger. As far I have tried to reproduce this crash, I have noticed that this happens most of the times on machines with AMD cpu (for example EPYC 7443, EPYC 9354P, EPYC 7282). I'm not sure, but this panic happens with FreeBSD 14.0 and 14.1 (not tested on 14.2). Here is crash data: Dump header from device: /dev/ada9p1 Architecture: amd64 Architecture Version: 2 Dump Length: 25966280704 Blocksize: 512 Compression: none Dumptime: 2024-12-30 16:54:25 +0200 Hostname: xx.xx.xx Magic: FreeBSD Kernel Dump Version String: FreeBSD 14.1-RELEASE-p5 GENERIC Panic String: page fault Dump Parity: 1677611570 Bounds: 0 Dump Status: good Unread portion of the kernel message buffer: [1748990] [1748990] [1748990] Fatal trap 12: page fault while in kernel mode [1748990] cpuid = 0; apic id = 00 [1748990] fault virtual address = 0x458 [1748990] fault code = supervisor read data, page not present [1748990] instruction pointer = 0x20:0xffffffff80b0f0c9 [1748990] stack pointer = 0x28:0xfffffe089f2f3c60 [1748990] frame pointer = 0x28:0xfffffe089f2f3ce0 [1748990] code segment = base rx0, limit 0xfffff, type 0x1b [1748990] = DPL 0, pres 1, long 1, def32 0, gran 1 [1748990] processor eflags = interrupt enabled, resume, IOPL = 0 [1748990] current process = 2 (clock (0)) [1748990] rdi: fffff8047720e518 rsi: 0000000000000004 rdx: 0000000000000000 [1748990] rcx: 0000000000000000 r8: 00000000000000bd r9: fffffe089f2f4000 [1748990] rax: 0000000000000000 rbx: fffff8010813a740 rbp: fffffe089f2f3ce0 [1748990] r10: 0000000000001388 r11: 00000000e836231a r12: 0000000000000000 [1748990] r13: fffff8010813a740 r14: fffffe089f2f3c88 r15: fffff8047720e518 [1748990] trap number = 12 [1748990] panic: page fault [1748990] cpuid = 0 [1748990] time = 1735570465 [1748990] KDB: stack backtrace: [1748990] #0 0xffffffff80b7fefd at kdb_backtrace+0x5d [1748990] #1 0xffffffff80b32bd1 at vpanic+0x131 [1748990] #2 0xffffffff80b32a93 at panic+0x43 [1748990] #3 0xffffffff8100091b at trap_fatal+0x40b [1748990] #4 0xffffffff81000966 at trap_pfault+0x46 [1748990] #5 0xffffffff80fd6d48 at calltrap+0x8 [1748990] #6 0xffffffff80b1f6b9 at crfree+0xa9 [1748990] #7 0xffffffff80ceadb4 at in_pcbfree+0x2a4 [1748990] #8 0xffffffff80bd6049 at sorele_locked+0x89 [1748990] #9 0xffffffff80d12c20 at tcp_close+0x170 [1748990] #10 0xffffffff80d1cb09 at tcp_timer_2msl+0xf9 [1748990] #11 0xffffffff80d1bb7e at tcp_timer_enter+0xfe [1748990] #12 0xffffffff80b50b0c at softclock_call_cc+0x12c [1748990] #13 0xffffffff80b52355 at softclock_thread+0xe5 [1748990] #14 0xffffffff80aecf7f at fork_exit+0x7f [1748990] #15 0xffffffff80fd7dae at fork_trampoline+0xe [1748990] Uptime: 20d5h49m50s [1748990] Dumping 24763 out of 786107 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91% Reading symbols from /boot/kernel/mrsas.ko... Reading symbols from /usr/lib/debug//boot/kernel/mrsas.ko.debug... Reading symbols from /boot/kernel/amdtemp.ko... Reading symbols from /usr/lib/debug//boot/kernel/amdtemp.ko.debug... Reading symbols from /boot/kernel/amdsmn.ko... Reading symbols from /usr/lib/debug//boot/kernel/amdsmn.ko.debug... Reading symbols from /boot/kernel/if_igb.ko... Reading symbols from /usr/lib/debug//boot/kernel/if_em.ko.debug... Reading symbols from /boot/kernel/accf_dns.ko... Reading symbols from /usr/lib/debug//boot/kernel/accf_dns.ko.debug... Reading symbols from /boot/kernel/zfs.ko... Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug... Reading symbols from /boot/kernel/accf_data.ko... Reading symbols from /usr/lib/debug//boot/kernel/accf_data.ko.debug... Reading symbols from /boot/kernel/accf_http.ko... Reading symbols from /usr/lib/debug//boot/kernel/accf_http.ko.debug... Reading symbols from /boot/kernel/pflog.ko... Reading symbols from /usr/lib/debug//boot/kernel/pflog.ko.debug... Reading symbols from /boot/kernel/pf.ko... Reading symbols from /usr/lib/debug//boot/kernel/pf.ko.debug... Reading symbols from /boot/kernel/ipmi.ko... Reading symbols from /usr/lib/debug//boot/kernel/ipmi.ko.debug... Reading symbols from /boot/kernel/smbus.ko... Reading symbols from /usr/lib/debug//boot/kernel/smbus.ko.debug... Reading symbols from /boot/kernel/intpm.ko... Reading symbols from /usr/lib/debug//boot/kernel/intpm.ko.debug... Reading symbols from /boot/kernel/cpuctl.ko... Reading symbols from /usr/lib/debug//boot/kernel/cpuctl.ko.debug... Reading symbols from /boot/kernel/if_lagg.ko... Reading symbols from /usr/lib/debug//boot/kernel/if_lagg.ko.debug... Reading symbols from /boot/kernel/if_infiniband.ko... Reading symbols from /usr/lib/debug//boot/kernel/if_infiniband.ko.debug... Reading symbols from /boot/kernel/uhid.ko... Reading symbols from /usr/lib/debug//boot/kernel/uhid.ko.debug... Reading symbols from /boot/kernel/ums.ko... Reading symbols from /usr/lib/debug//boot/kernel/ums.ko.debug... Reading symbols from /boot/kernel/usbhid.ko... Reading symbols from /usr/lib/debug//boot/kernel/usbhid.ko.debug... Reading symbols from /boot/kernel/hidbus.ko... Reading symbols from /usr/lib/debug//boot/kernel/hidbus.ko.debug... Reading symbols from /boot/kernel/if_urndis.ko... Reading symbols from /usr/lib/debug//boot/kernel/if_urndis.ko.debug... Reading symbols from /boot/kernel/uether.ko... Reading symbols from /usr/lib/debug//boot/kernel/uether.ko.debug... Reading symbols from /boot/kernel/nullfs.ko... Reading symbols from /usr/lib/debug//boot/kernel/nullfs.ko.debug... __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 57 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu, (kgdb)
^Triage: apparently this applies to port net-mgmt/telegraf. Cc: maintainer.
Hi! I use telegraf extensively and have not seen this. What is your telegraf configuration? Also, are you able to verify that it still exists on 14.2? Palle
Could you please show the backtrace from core.txt so that we have line numbers?
I just had this happen this morning on two machines running 14.2-RELEASE. In both cases, I had used pkg to upgrade telegraf from 1.32.3 to 1.33.0. After restarting telegraf, both panicked. Unfortunately, I didn't have dumps set up properly on either system, but I did manage to catch a screenshot of the panic on one system. It was a "Fatal trap 12: page fault while in kernel mode" I'm attaching the screenshot and our telegraf config. I was unable to replicate this on a VM that had been up for about the same about of time (~30 days). The only difference I can see if that the VM has no user traffic, while the others have significant nfs and smb traffic.
Created attachment 256501 [details] Screenshot of panic
Created attachment 256502 [details] Telegraf config
I should also add that the two machines that panicked for us this morning were both Intel, not AMD - Xeon Gold 5215 and Xeon E5-2650v3
Unfortunately there is nothing useful in core.txt.0, just a note about that gdb is not installed. As this crashed machine went to upgrade 14.2, I moved crash data to another machine, but now I can't debug because error of debug symbols, even if I create clean install 14.1 virtual machine and updated it to 14.1-p6, then I get warning: the debug information found in "/usr/lib/debug//boot/kernel/kernel.debug" does not match "/boot/kernel/kernel" (CRC mismatch). (No debugging symbols found in /boot/kernel/kernel) 'osreldate' has unknown type; cast it to its declared type (kgdb) So now I have crash data but can't deal with that and I cant share it in public. I can confirm that crash happens in 14.2 too. Unfortunately I haven't configured crashdump on that machine (Xeon(R) Silver 4310). In this week I have upgraded some machines to 14.2, so lets see in a week how its going and hopefully I can get useful crash data.
Created attachment 256596 [details] Telegraf config Our telegraf configuration - mostly generic and nothing special
(In reply to Eero Hänninen from comment #8) If you install gdb, you can use crashinfo(8) to get the full core.txt file. It would be really useful to see the backtrace from gdb - it's hard to see where exactly we're crashing using the info already here.
The panic from Matthew is different to panic from Eero, but let's keep them in one bug report, cause both are associated with telegraf and panic is around inp->inp_cred. The panic addresses are very strange, especially in Eero's case, the fault address is much bigger than size of cred or offset of cred in the inpcb. So it doesn't look like a NULL pointer. Getting access to vmcore file for me and/or Mark will be very helpful.
Finally got this crash info.
Created attachment 256664 [details] crashinfo outpout
Thanks, Eero! This gives a bit more of insight. So we see that the actual panic happens in kern_mutex.c:592: owner = lv_mtx_owner(v); if (TD_IS_RUNNING(owner)) { The TD_IS_RUNNING() dereferences td->td_state, which is exactly at 0x458 offset inside struct thread. So, lv_mtx_owner() gave us NULL on the previous line. But a few lines above we see that the lock value is not MTX_UNOWNED. This means that the value has one of the flags set: (MTX_RECURSED | MTX_CONTESTED | MTX_DESTROYED). Apparently it is MTX_DESTROYED, cause other flags are interal to the mutex subsystem and having them on would mean there is a bug in mutex itself and we would see all different kinds of a panics. So, we got valid inpcb pointer with valid cred pointer, but the cred itself has already went through crfree_final(). We got a step closer to understanding the problem. Myself or Mark might come up with an idea, but getting more information would be very helpful. If you find a way to reproduce that, please let us know.
(In reply to Gleb Smirnoff from comment #14) If possible, it would also be useful to try testing a debug kernel. Compiling a GENERIC kernel from the main branch would work, or a GENERIC-DEBUG kernel from 14.2 (new in 14.2, not available in 14.1).
Thanks for the explanation! Currently I don't know of any instant way to reproduce this panic. As I wrote before, in my case if telegraf process is older than 7 or 10 days and then I restart process, then this panic happens - sometimes there is about 30 - 60 seconds delay between telegraf restart and kernel panic. I'm not sure, but this panic seems to appears from 14.0 release. I'm not sure because our machines do not have crash dump configured by default, but pattern was the same. About 14.2 I can't confirm yet, but by the end of this week some machines will exceed the so-called critical time point, so we'll see if they crash with 14.2 or not. I will boot known "crash servers" to debug kernel, so I will get more detailed data. If you need, I have 3 more crash info samples - but to my eye it seems to be same.
My current theory is that telegraf closes active connection somehow different than other applications, which creates conditions to hit this bug. Otherwise we would see this panic with all kinds of software. The 30-60 period that you see is the TCP 2MSL timeout, and that matches the stack trace that you have. It happens in garbage collecting of TCP connections in timewait. If you have possibility, can you please intentionally panic a machine, but first run ktrace on the telegraf. Assuming telegraf PID is 6666, the command would be 'ktrace -p 6666'. It will write to ktrace.out that would be located in the current directory, so make sure it is not /tmp or other directory that is cleansed on startup. Since you are going to intentionally panic the machine anyway, would be also very helpful if you write kernel crashdump that can be later shared with me and Mark.
Created attachment 256673 [details] Telegraf ktrace.out before process restart Telegraf pid was 18418 if its important
Created attachment 256674 [details] Kernel crash info for telegraf ktrace.out Kernel crash info was unfortunately too big, so its compressed with bzip2
On Mon Jan 13 21:09:13 2025 UTC, fax@nohik.ee wrote: > Created attachment 256673 [details] > Telegraf ktrace.out before process restart > Telegraf pid was 18418 if its important But the ktrace.out has trace for PID 18727 (also telegraf). How did that happen? The ktrace.out doesn't have a single syscall on a socket, neither has call to exit the program. Maybe somehow traced a wrong PID?
(In reply to Gleb Smirnoff from comment #20) Sorry I was paste here wrong pid, as telegraf has two pids in process table, so I have done trace for telegraf process not for daemon process which pid I accidentally pasted here. # ps xauwf | grep telegraf telegraf 47615 1.7 0.0 5867944 216864 - S 22:54 0:48.38 /usr/local/bin/telegraf --quiet --config=/usr/local/etc/telegraf.conf root 47278 0.0 0.0 12828 2268 - Is 22:54 0:00.00 daemon: /usr/local/bin/telegraf[47615] (daemon) should I trace daemon process too ?
On Mon Jan 13 21:28:22 2025 UTC, fax@nohik.ee wrote: > should I trace daemon process too ? Yes, this definitely is the daemon process that owns the socket, that later goes into TCP timewait state and references a freed struct cred.
Tried 4 times but can't get proper ktrace for daemon process. If crash happens, then after boot I see 0 length ktrace file. Probably ktrace writes to buffer and its not synced on disk (zfs). Should I try with ufs ?
On Tue Jan 14 11:19:58 2025 UTC, fax@nohik.ee wrote: > then after boot I see 0 length ktrace file. If the traced process exits (e.g. you no longer observe its PID in the list of processes), then ktrace(2) definitely writes everything to the file system. Can we make sure that telegraf indeed exists quickly after restart and doesn't the old PID remains live until the panic? If it does exit, then it is ZFS that needs to be pushed to sync. I will try to find answer how to do that.
Can you please try just these two commands quickly after telegraf exited: # sync && zpool sync
Got another advice from Allan to force sync on a ZFS dataset: # zfs set sync=always $dataset But this all makes sense after we make sure that the process exits. Cause if it does not, then the ktrace system itself may keep tracing in buffers for some time. Also, I would second on Mark's suggestion to try GENERIC-DEBUG kernel from 14.2-RELEASE, or to compile kernel with 'options INVARIANTS' on the existing installation. Cause we are trying to lock a destroyed mutex it could that with INVARIANTS we will crash earlier, and that earlier crash will give more insight on the problem.
Created attachment 256894 [details] telegraf daemon process ktrace.out Finally got ktrace.out, added sync && zpool sync to telegraf rc script to the end of stop() function.
Created attachment 256895 [details] core.txt for telegraf last ktrace.out, compressed with bzip2
I think we really need to see output from a GENERIC-DEBUG kernel. All the crash reports I see are from 14.1-RELEASE GENERIC, which is a non-debug kernel. If you copy sys/amd64/conf/GENERIC-DEBUG and sys/conf/std.debug from a 14.2 tree into your 14.1 tree, you should be able to build a debug kernel with the usual "make buildkernel KERNCONF=GENERIC-DEBUG".
Eero, unfortunately there is nothing special in the ktrace that records the end of the telegraf process :( It just closes all sockets and exits. It could be a cred reference underleak outside of inpcb code actually. I second Mark's suggestion on reproducing the problem on GENERIC-DEBUG. Also, if you can work on a reproduce recipe in a test environment, that would help a lot. E.g. start a fresh virtual machine, install the telegraf package, do this and that until it crashes.
I generated a panic on my test machine this morning that had been up for 12 days. It's at 14.2-RELEASE with a DEBUG kernel. Telegraf is at 1.33.0. The panic happened immediately after issuing "service telegraf restart". I'm attaching the kernel buffer, backtrace and info on the last frame before the panic. I hope this is helpful. I have not found a way to reproduce this without waiting 10 days or so.
Created attachment 257411 [details] kernel buffer
Created attachment 257412 [details] backtrace
Created attachment 257413 [details] frame info
Matthew, is it possible to share the kernel core with me & Mark? You can use my PGP key to encrypt it (or a private email that has a sensitive link). https://docs.freebsd.org/en/articles/pgpkeys/#_gleb_smirnoff_glebiusfreebsd_org https://docs.freebsd.org/en/articles/pgpkeys/#_mark_johnston_markjfreebsd_org
Hello, I have the same problem. After I reloaded telegraf service and my freebsd machines, which ere about 10, rebooted with kernel panic. On the machines telegraf uptime was over 10 days. Freebsd version: 14.1p5 telegraf version: 1.31_1 or 1.32 I got this log and it's the same for the machines. Mar 6 10:41:23 test-fw01 kernel: Fatal trap 12: page fault while in kernel mode Mar 6 10:41:23 test-fw01 kernel: cpuid = 8; apic id = 08 Mar 6 10:41:23 test-fw01 kernel: fault virtual address = 0x5bb5ce Mar 6 10:41:23 test-fw01 kernel: fault code = supervisor read data, page not present Mar 6 10:41:23 test-fw01 kernel: instruction pointer = 0x20:0xffffffff80af76f0 Mar 6 10:41:23 test-fw01 kernel: stack pointer = 0x28:0xfffffe01703b5750 Mar 6 10:41:23 test-fw01 kernel: frame pointer = 0x28:0xfffffe01703b5750 Mar 6 10:41:23 test-fw01 kernel: code segment = base rx0, limit 0xfffff, type 0x1b Mar 6 10:41:23 test-fw01 kernel: = DPL 0, pres 1, long 1, def32 0, gran 1 Mar 6 10:41:23 test-fw01 kernel: processor eflags = interrupt enabled, resume, IOPL = 0 Mar 6 10:41:23 test-fw01 kernel: current process = 86139 (atop) Mar 6 10:41:23 test-fw01 kernel: rdi: fffff8037ac3e800 rsi: fffff8000410fb00 rdx: 0000000000000000 Mar 6 10:41:23 test-fw01 kernel: rcx: ffffffff818c6b40 r8: 0000000000000002 r9: ffffffffffffffff Mar 6 10:41:23 test-fw01 kernel: rax: 00000000005bb58e rbx: fffff8037ac3e800 rbp: fffffe01703b5750 Mar 6 10:41:23 test-fw01 kernel: r10: 0000205f6af445a8 r11: fffff802149c9c60 r12: fffffe01703b57e8 Mar 6 10:41:23 test-fw01 kernel: r13: fffff802149c9740 r14: fffff800b3f22000 r15: fffffe01703b5780 Mar 6 10:41:23 test-fw01 kernel: trap number = 12 Mar 6 10:41:23 test-fw01 kernel: panic: page fault Mar 6 10:41:23 test-fw01 kernel: cpuid = 8 Mar 6 10:41:23 test-fw01 kernel: time = 1741257554 Mar 6 10:41:23 test-fw01 kernel: KDB: stack backtrace: Mar 6 10:41:23 test-fw01 kernel: #0 0xffffffff80b7fefd at kdb_backtrace+0x5d Mar 6 10:41:23 test-fw01 kernel: #1 0xffffffff80b32bd1 at vpanic+0x131 Mar 6 10:41:23 test-fw01 kernel: #2 0xffffffff80b32a93 at panic+0x43 Mar 6 10:41:23 test-fw01 kernel: #3 0xffffffff8100091b at trap_fatal+0x40b Mar 6 10:41:23 test-fw01 kernel: #4 0xffffffff81000966 at trap_pfault+0x46 Mar 6 10:41:23 test-fw01 kernel: #5 0xffffffff80fd6d48 at calltrap+0x8 Mar 6 10:41:23 test-fw01 kernel: #6 0xffffffff80cec7e6 at cr_canseeinpcb+0x16 Mar 6 10:41:23 test-fw01 kernel: #7 0xffffffff80d16826 at tcp_pcblist+0x1f6 Mar 6 10:41:23 test-fw01 kernel: #8 0xffffffff80b443f1 at sysctl_root_handler_locked+0x91 Mar 6 10:41:23 test-fw01 kernel: #9 0xffffffff80b4380e at sysctl_root+0x23e Mar 6 10:41:23 test-fw01 kernel: #10 0xffffffff80b43e7e at userland_sysctl+0x15e Mar 6 10:41:23 test-fw01 kernel: #11 0xffffffff80b4422d at kern___sysctlbyname+0x21d Mar 6 10:41:23 test-fw01 kernel: #12 0xffffffff80b4426d at sys___sysctlbyname+0x2d Mar 6 10:41:23 test-fw01 kernel: #13 0xffffffff810011c0 at amd64_syscall+0x100 Mar 6 10:41:23 test-fw01 kernel: #14 0xffffffff80fd765b at fast_syscall_common+0xf8 Mar 6 10:41:23 test-fw01 kernel: Uptime: 148d12h45m59s Mar 6 10:41:23 test-fw01 kernel: Automatic reboot in 15 seconds - press a key on the console to abort Mar 6 10:41:23 test-fw01 kernel: Rebooting...
Created attachment 258804 [details] 283747-long-ref.diff Suggested patch
Thanks for Matthew for providing a core with INVARIANTS. I have a guess here. Not 100% sure this is the root cause, but > 50% sure. All who are affected, please patch your systems with patch from the above attachment and let's see how it goes. Running with INVARIANTS and patch is also recommended, so that if system panics again, there will be more information in the core. Thanks and sorry for being slow on this.
Thanks, Gleb - this is fabulous! I'll try to get a kernel compiled with the patch and start testing soon. Given what we know so far, it will be at least 10 days before I'll know if this works, and will want to test far longer to be sure. Given what you've learned about this, is there anything you can think of to accelerate the process or drive the issue to make it happen more quickly?
Matthew, this is just a hypothesis, not a confident fix. Let's see how it goes. The guess is that telegraf would spawn lots of children for its purposes and sometimes leaks a zombie. Those zombies accumulate and each carry large number of credential reference counts. Given that cr_refs was extended to long, but credbatch refs was forgotten, and the latter accumulates counts from several crs, an overflow is possible. But the overflow will be triggered only when telegraf is killed and all leaked zombies are reaped.
Thanks for the explanation, Gleb. It sounds like this is definitely worth fixing, regardless of whether it's the cause of this specific bug. :-) FWIW, neither of my test servers (uptime ~8 days with telegraf running constantly) showed any zombies: # ps aux | grep Z | grep -v grep USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND I was just now able to panic my hardware system (took just over 8 days) and will send a core the same way I did before. This is obviously from the unpatched kernel. The last frame before the panic is in crunusebatch() so this looks promising. Unread portion of the kernel message buffer: [691684] panic: crunusebatch: ref -4294967294 not >= 0 on cred 0xfffff8011b43cb00 [691684] cpuid = 17 [691684] time = 1742410811 [691684] KDB: stack backtrace: [691684] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00d3bfa870 [691684] vpanic() at vpanic+0x131/frame 0xfffffe00d3bfa9a0 [691684] panic() at panic+0x43/frame 0xfffffe00d3bfaa00 [691684] crunusebatch() at crunusebatch+0xfa/frame 0xfffffe00d3bfaa30 [691684] thread_reap_domain() at thread_reap_domain+0x28d/frame 0xfffffe00d3bfaae0 [691684] proc_reap() at proc_reap+0x660/frame 0xfffffe00d3bfab20 [691684] proc_to_reap() at proc_to_reap+0x3c4/frame 0xfffffe00d3bfab70 [691684] kern_wait6() at kern_wait6+0x1a6/frame 0xfffffe00d3bfac10 [691684] sys_wait4() at sys_wait4+0x6b/frame 0xfffffe00d3bfae00 [691684] amd64_syscall() at amd64_syscall+0x158/frame 0xfffffe00d3bfaf30 [691684] fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00d3bfaf30 [691684] --- syscall (7, FreeBSD ELF64, wait4), rip = 0x2893da, rsp = 0x821093d18, rbp = 0x821093d80 --- [691684] KDB: enter: panic Restarted both test systems (hardware and VM) with patched DEBUG kernels and will report back in 10 days or so.
Created attachment 258859 [details] Crash data with 14.2-p1 debug kernel I compiled a debug kernel with the patch (14.2-RELEASE-p1 or p2 - kernel is same) and restarted a heavily loaded machine with it. Within almost an hour, I encountered this crash. I noticed that the crash message changed from 'page fault' to 'panic: tcp_do_segment: sent too much.'"
I forgot to mention that this time the crash happened on its own, without any service restart, not even Telegraf.
On Fri Mar 21 00:40:14 2025 UTC, fax@nohik.ee wrote: > I compiled a debug kernel with the patch (14.2-RELEASE-p1 or p2 - kernel is > same) and restarted a heavily loaded machine with it. Within almost an hour, I > encountered this crash. I noticed that the crash message changed from 'page > fault' to 'panic: tcp_do_segment: sent too much.'" Eero, this is completely different problem, and good news is that it is finally is being worked on. See bug 282605. The patch has been committed to CURRENT yesterday and will get into stable/14 real soon. Meanwhile you can add the patch to your build, grab it here: https://github.com/freebsd/freebsd-src/commit/fbcf3b74e8f2c0c5ba37f1839bfe9395eb2fd0b1.patch Very sorry that two most difficult bugs of 14.2-RELEASE have hit you.
I think I am hitting this bug as well periodically. Of the 3 times I saw it, once back in November on zoo.freebsd.org while updating telegraf, and then on a VM running stable from Jan 31s and then on a FreeBSD 14.2R box a week or so ago. Any ideas how to force the issue ? I had a test box running on an EPYC 7302P, running 12hrs while true do service telegraf restart sleep $(jot -r 1 16 95) done but nothing. Reading through this bug thread, it seems maybe the box needs to be up a while ? prior to the crash this popped up on each of the machines Mar 7 09:38:18 pf-cam2 kernel: freeing uidinfo: uid = 884, sbsize = 115664 Mar 7 09:42:17 pf-cam2 kernel: Fatal trap 12: page fault while in kernel mode Mar 7 09:42:17 pf-cam2 kernel: cpuid = 0; apic id = 00 Mar 7 09:42:17 pf-cam2 kernel: fault virtual address = 0x18 Mar 7 09:42:17 pf-cam2 kernel: fault code = supervisor write data, page not present Mar 7 09:42:17 pf-cam2 kernel: instruction pointer = 0x20:0xffffffff80b36899 Mar 7 09:42:17 pf-cam2 kernel: stack pointer = 0x28:0xfffffe0116613d18 Mar 7 09:42:17 pf-cam2 kernel: frame pointer = 0x28:0xfffffe0116613d30 Mar 7 09:42:17 pf-cam2 kernel: code segment = base rx0, limit 0xfffff, type 0x1b Mar 7 09:42:17 pf-cam2 kernel: = DPL 0, pres 1, long 1, def32 0, gran 1 Mar 7 09:42:17 pf-cam2 kernel: processor eflags = interrupt enabled, resume, IOPL = 0 Mar 7 09:42:17 pf-cam2 kernel: current process = 2 (clock (0)) Mar 7 09:42:17 pf-cam2 kernel: rdi: 0000000000000000 rsi: fffff8003faf3aec rdx: 0000000000000000 Mar 7 09:42:17 pf-cam2 kernel: rcx: 7fffffffffffffff r8: ffffffffffff3cd4 r9: 00000000ffff3cd4 Mar 7 09:42:17 pf-cam2 kernel: rax: 0000000000000000 rbx: fffff8003faf3ad0 rbp: fffffe0116613d30 Mar 7 09:42:17 pf-cam2 kernel: r10: ffffffffffff3cd4 r11: 00000000671a0b2a r12: 000f000000000000 Mar 7 09:42:17 pf-cam2 kernel: r13: fffff806da7b0aa0 r14: fffff8003faf3800 r15: fffff80001c4a740 Mar 7 09:42:17 pf-cam2 kernel: trap number = 12 Mar 7 09:42:17 pf-cam2 kernel: panic: page fault Mar 7 09:42:17 pf-cam2 kernel: cpuid = 0 Mar 7 09:42:17 pf-cam2 kernel: time = 1741358358 Mar 7 09:42:17 pf-cam2 kernel: KDB: stack backtrace: Mar 7 09:42:17 pf-cam2 kernel: #0 0xffffffff80b8b89d at kdb_backtrace+0x5d Mar 7 09:42:17 pf-cam2 kernel: #1 0xffffffff80b3dc01 at vpanic+0x131 Mar 7 09:42:17 pf-cam2 kernel: #2 0xffffffff80b3dac3 at panic+0x43 Mar 7 09:42:17 pf-cam2 kernel: #3 0xffffffff81025a0b at trap_fatal+0x40b Mar 7 09:42:17 pf-cam2 kernel: #4 0xffffffff81025a56 at trap_pfault+0x46 Mar 7 09:42:17 pf-cam2 kernel: #5 0xffffffff80ffc398 at calltrap+0x8 Mar 7 09:42:17 pf-cam2 kernel: #6 0xffffffff80be2b66 at sorele_locked+0xa6 Mar 7 09:42:17 pf-cam2 kernel: #7 0xffffffff80d21ae0 at tcp_close+0x170 Mar 7 09:42:17 pf-cam2 kernel: #8 0xffffffff80d2b9b9 at tcp_timer_2msl+0xf9 Mar 7 09:42:17 pf-cam2 kernel: #9 0xffffffff80d2aa2e at tcp_timer_enter+0xfe Mar 7 09:42:17 pf-cam2 kernel: #10 0xffffffff80b5bc0c at softclock_call_cc+0x12c Mar 7 09:42:17 pf-cam2 kernel: #11 0xffffffff80b5d455 at softclock_thread+0xe5 Mar 7 09:42:17 pf-cam2 kernel: #12 0xffffffff80af760f at fork_exit+0x7f Mar 7 09:42:17 pf-cam2 kernel: #13 0xffffffff80ffd3fe at fork_trampoline+0xe Mar 7 09:42:17 pf-cam2 kernel: Uptime: 94d14h13m41s Mar 7 09:42:17 pf-cam2 kernel: Dumping 4577 out of 65260 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%---<<BOOT>>--- and just now in a VM a similar sort of pattern. uid 884 is telegraf and I was mucking about with the configuration. And on a RELENG_14 box (kernel from Jan 31st) Mar 26 17:09:43 pf-firewall kernel: freeing uidinfo: uid = 884, sbsize = 99600 Mar 26 17:11:26 pf-firewall syslogd: kernel boot file is /boot/kernel/kernel Mar 26 17:11:26 pf-firewall kernel: panic: vm_fault_lookup: fault on nofault entry, addr: 0xffffffff80aea000 Mar 26 17:11:26 pf-firewall kernel: cpuid = 0 Mar 26 17:11:26 pf-firewall kernel: time = 1743023443 Mar 26 17:11:26 pf-firewall kernel: KDB: stack backtrace: Mar 26 17:11:26 pf-firewall kernel: #0 0xffffffff80b8d13d at kdb_backtrace+0x5d Mar 26 17:11:26 pf-firewall kernel: #1 0xffffffff80b3ef01 at vpanic+0x161 Mar 26 17:11:26 pf-firewall kernel: #2 0xffffffff80b3ed93 at panic+0x43 Mar 26 17:11:26 pf-firewall kernel: #3 0xffffffff80ebcf67 at vm_fault+0x15b7 Mar 26 17:11:26 pf-firewall kernel: #4 0xffffffff80ebb8c1 at vm_fault_trap+0x81 Mar 26 17:11:26 pf-firewall kernel: #5 0xffffffff8102c97a at trap_pfault+0x20a Mar 26 17:11:26 pf-firewall kernel: #6 0xffffffff81003748 at calltrap+0x8 Mar 26 17:11:26 pf-firewall kernel: #7 0xffffffff80b2d837 at crfree_final+0x17 Mar 26 17:11:26 pf-firewall kernel: #8 0xffffffff80cfc156 at in_pcbfree+0x2a6 Mar 26 17:11:26 pf-firewall kernel: #9 0xffffffff80be4229 at sorele_locked+0x89 Mar 26 17:11:26 pf-firewall kernel: #10 0xffffffff80d242b0 at tcp_close+0x170 Mar 26 17:11:26 pf-firewall kernel: #11 0xffffffff80d2e23b at tcp_timer_2msl+0xfb Mar 26 17:11:26 pf-firewall kernel: #12 0xffffffff80d2d2b4 at tcp_timer_enter+0x104 Mar 26 17:11:26 pf-firewall kernel: #13 0xffffffff80b5d139 at softclock_call_cc+0x129 Mar 26 17:11:26 pf-firewall kernel: #14 0xffffffff80b5e905 at softclock_thread+0xe5 Mar 26 17:11:26 pf-firewall kernel: #15 0xffffffff80af8471 at fork_exit+0x81 Mar 26 17:11:26 pf-firewall kernel: #16 0xffffffff810047ae at fork_trampoline+0xe Mar 26 17:11:26 pf-firewall kernel: Uptime: 54d4h48m5s Mar 26 17:11:26 pf-firewall kernel: Automatic reboot in 15 seconds - press a key on the console to abort Mar 26 17:11:26 pf-firewall kernel: Rebooting...
Mike, my current hypothesis is that we have a 32-bit overflow in credential reference counting. The overflow happens, when we reap a group of processes, and reference counts of the group summed up together overflow. AFAIU, telegraf will fork+exec arbitrary programs, which in their turn can also fork+exec more programs. While telegraf itself seems to do proper wait(2)-ing on zombies, but some external program may leak zombies, and do not exit itself. Then, when telegraf is restarted, this pack of zombies is reaped and this is where overflow could be hit. This is fixed by attachment 258804 [details]. I am not sure in my hypothesis, that's why it is not even committed to CURRENT. However, everyone affected by the bug are advices to use this patch and let's see what happens. We still have some time before 14.3. I will probably start review process to get it into CURRENT, anyway. With this info, you may have some idea on how to reproduce it. I know, you are good at chasing bugs, Mike :) Sorry that it hits you, but I'm glad that you joined the team chasing this bug.
(In reply to Gleb Smirnoff from comment #47) Thank you for the vote of bug confidence Gleb :) I will have a look at the 3 instances where this bug seems to have happened and look to see if there is anything in common on what I monitor. On my test box, I will try and load it up with a few extra checks, and especially ones that call external programs to see if I can randomly trigger it on my test machine.
I restarted telegraf on my two test machines (one hardware, one VM), both running with Gleb's proposed patch, and neither one panicked! They had both been up for just over 10 days, running telegraf constantly. I've never been able to reproduce this on the VM, but the hardware one has always consistently panicked after being up running telegraf for >8 days. It's not quite definitive, but it certainly seems like the patch might have fixed this one.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=cd46e980134f6fc765b28ee9c8bf41e8fc1b0261 commit cd46e980134f6fc765b28ee9c8bf41e8fc1b0261 Author: Gleb Smirnoff <glebius@FreeBSD.org> AuthorDate: 2025-03-29 22:09:15 +0000 Commit: Gleb Smirnoff <glebius@FreeBSD.org> CommitDate: 2025-03-29 22:09:15 +0000 cred: fix struct credbatch to use long for refcount This structure collects count from multiple cred structures. Of course it can't use a smaller type. PR: 283747 Reviewed by: olce, mjg, markj Differential Revision: https://reviews.freebsd.org/D49562 Fixes: 37337709d3334f32650ba3a7c529fa013ed5e1f2 sys/kern/kern_prot.c | 2 +- sys/sys/ucred.h | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-)
Created attachment 259233 [details] ktrace.out.tar.gz
Hi! I reproduced the bug in 14.2 with debugging. I wrote similar Go code to reproduce it: https://github.com/ekilimchuk/junk/blob/main/fork-test/README.md After waiting for a week, I reproduced the issue (crunuse’s ref overflow). ekilimchuk@test-fw-debug01:~$ cat core/crash/info.0 Dump header from device: /dev/da1 Architecture: amd64 Architecture Version: 2 Dump Length: 304893952 Blocksize: 512 Compression: none Dumptime: 2025-03-31 10:25:04 +0000 Hostname: test-fw-debug01 Magic: FreeBSD Kernel Dump Version String: FreeBSD 14.2-RELEASE-p2 releng/14.2-n269518-ac2cbb46b5f1 GENERIC-DEBUG Panic String: crunuse: ref -4294967295 not > 0 on cred 0xfffff8002f045300 Dump Parity: 4010253368 Bounds: 0 Dump Status: good Stack trace: (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:405 #2 0xffffffff80b4c960 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:523 #3 0xffffffff80b4ce5e in vpanic (fmt=0xffffffff811bd95e "%s: ref %ld not > 0 on cred %p", ap=ap@entry=0xfffffe00037aba60) at /usr/src/sys/kern/kern_shutdown.c:967 #4 0xffffffff80b4cc03 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:891 #5 0xffffffff80b39a15 in crunuse (td=0xfffff800457a7740) at /usr/src/sys/kern/kern_prot.c:1927 #6 0xffffffff80b39909 in crcowfree (td=<unavailable>, td@entry=0xfffff800457a7740) at /usr/src/sys/kern/kern_prot.c:1969 #7 0xffffffff80b657bc in thread_cow_free (td=0xfffff800457a7740) at /usr/src/sys/kern/kern_thread.c:880 #8 thread_wait (p=p@entry=0xfffffe00546adae0) at /usr/src/sys/kern/kern_thread.c:1055 #9 0xffffffff80afe430 in proc_reap (td=td@entry=0xfffff800035c0000, p=p@entry=0xfffffe00546adae0, status=status@entry=0xfffffe00037abde4, options=<optimized out>) at /usr/src/sys/kern/kern_exit.c:1033 #10 0xffffffff80afe874 in proc_to_reap (td=td@entry=0xfffff800035c0000, p=p@entry=0xfffffe00546adae0, idtype=idtype@entry=P_ALL, id=id@entry=0, status=status@entry=0xfffffe00037abde4, options=options@entry=48, wrusage=0x0, siginfo=0x0, check_only=0) at /usr/src/sys/kern/kern_exit.c:1197 #11 0xffffffff80afd9d6 in kern_wait6 (td=td@entry=0xfffff800035c0000, idtype=P_ALL, id=0, status=status@entry=0xfffffe00037abde4, options=48, wrusage=0x0, siginfo=0x0) at /usr/src/sys/kern/kern_exit.c:1326 #12 0xffffffff80afd5fb in kern_wait (status=0xfffffe00037abde4, options=<unavailable>, td=<optimized out>, pid=<optimized out>, rusage=<optimized out>) at /usr/src/sys/kern/kern_exit.c:1238 #13 sys_wait4 (td=0xfffff800035c0000, uap=0xfffff800035c0400) at /usr/src/sys/kern/kern_exit.c:864 #14 0xffffffff81069781 in syscallenter (td=0xfffff800035c0000) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #15 amd64_syscall (td=0xfffff800035c0000, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1194 #16 <signal handler called> #17 0x00000000002893da in ?? () Also see a ktrace: https://bugs.freebsd.org/bugzilla/attachment.cgi?id=259233 I have a core dump with invariants and am ready to share the dump. I’ll also try to speed up reproducing the issue. I think it’s related to how Golang specifically handles execv.
Hi! I've written a multitask version with a workpool. You can checkout from the multitask branch and run the test.
(In reply to Eugene Kilimchuk from comment #53) Very cool work! For telegraf, is there a way to tell ahead of time if the box "ready" to panic after a restart ? I have a couple of machines running with INVARIANTS for about 3 days now. I loaded up telegraf to spawn a number of ping processes as well as external scripts. Or better to just wait a week or so ?