After base 325aa4dbd10d (bug 286495) bumped __FreeBSD_version to 1500039, poudriere users get an automatic drm-kmod rebuild. After pkg upgrade, running x11-wm/sway on amdgpu prints the following on stderr:
$ dbus-run-session seatd-launch sway
[wlr] [backend/drm/atomic.c:81] connector DP-1: Atomic commit failed: Operation not permitted
Rebuilding {libdrm,mesa,wlroots,sway} after base 325aa4dbd10d doesn't help. Reverting base 325aa4dbd10d helps.
// Before I get questions: the local patch from https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=286311#c14 is unrelated to this.
I didn't read bug 286495 closely enough, sorry. So, to be clear, if the kernel+linuxkpi.ko contain base 325aa4dbd10d, but amdgpu.ko is compiled against a tree without that commit, then you see no problems?
(In reply to Mark Johnston from comment #1)
> if the kernel+linuxkpi.ko contain base 325aa4dbd10d, but amdgpu.ko is compiled against a tree without that commit, then you see no problems?
Exactly, yes.
To narrow things down a bit: Do you get any messages from the kernel? What happens after the error? Is there a hang or anything like that? If not, do you get consistent results after starting sway multiple times?
One more thing that would be useful: please set dev.drm.__drm_debug=0xffffffff before starting sway (assuming it is not already set), reproduce the error, and show the resulting kernel messages here.
(In reply to Mark Johnston from comment #3)
> Do you get any messages from the kernel?
No.
> What happens after the error?
> Is there a hang or anything like that?
Yes, sway hangs and I can't switch vt, but an ssh session works normally and I can recover control over the vt after killing sway via ssh.
> please set dev.drm.__drm_debug=0xffffffff before starting sway
> (assuming it is not already set), reproduce the error, and show
> the resulting kernel messages here.
dev.drm.__drm_debug is 0 by default, otherwise /var/log/ would eat all free space. Thanks. I'll report back with the logs tomorrow; it is too late here to do that now.
Created attachment 260076 [details]
kernel+325aa4dbd log when starting sway
This is the log for kernel+325aa4dbd and drm-kmod built against 325aa4dbd. I'll attach another log for kernel+325aa4dbd and drm-kmod built against 325aa4dbd~ (4fa275a5f357) in a moment.
Created attachment 260077 [details]
kernel+325aa4dbd (but drm-kmod built against 325aa4dbd~) log when starting sway
FYI, I tested both cases with the local drm-kmod patch reverted to avoid unrelated noise, and started sway via:
$ WLR_RENDER_NO_EXPLICIT_SYNC=1 dbus-run-session seatd-launch sway
With the running kernel still at 325aa4dbd, drm-kmod built against 325aa4dbd~ works. Maybe the cause is in some header file changed by base 325aa4dbd?
(In reply to Evgenii Khramtsov from comment #8)
Yes, there is some subtle incompatibility here. If you compile the kernel+drm-kmod with the patch below, is the problem still there?
diff --git a/sys/compat/linuxkpi/common/include/linux/workqueue.h b/sys/compat/linuxkpi/common/include/linux/workqueue.h
index 25ee861d3015..66d3981d4229 100644
--- a/sys/compat/linuxkpi/common/include/linux/workqueue.h
+++ b/sys/compat/linuxkpi/common/include/linux/workqueue.h
@@ -90,7 +90,7 @@ struct delayed_work {
 	struct {
 		struct callout callout;
 		struct mtx mtx;
-		long expires;
+		unsigned long expires;
 	} timer;
 };
(In reply to Mark Johnston from comment #9)
> is the problem still there?
Unfortunately, this doesn't help; no change from 325aa4dbd10d.
(In reply to Evgenii Khramtsov from comment #10) Thanks, I will keep digging. One more thing that would be useful: after the hang occurs, could you ssh in and collect the output of "procstat -kka" as root?
Created attachment 260089 [details]
kka-325aa4dbd+comment9.txt.zst
procstat -kka output taken while sway is hanging is attached. GENERIC-NODEBUG kernel + local netinet6,nat64lsn,vmm patches on top of base 4e3a6fe0134e. If the local changes could bring unrelated noise (no idea what KSTACK offsets are relative to), tell me and I'll try again with a GENERIC base 4e3a6fe0134e kernel.
Using drm-66-kmod's amdgpu, I'm unable to run X after 325aa4dbd10d. The X server simply exits with
[ 60.098] (EE) modeset(0): failed to set mode: No such file or directory
[ 115.161] (EE) Fatal server error:
[ 115.161] (EE) Caught signal 6 (Abort trap). Server aborting
The kernel produces some noise:
May 2 23:25:34 hemlock kernel: [drm ERROR :amdgpu_job_timedout] ring gfx_0.0.0 timeout, signaled seq=4, emitted seq=6
May 2 23:25:34 hemlock kernel: [drm ERROR :amdgpu_job_timedout] Process information: process pid 101244 thread pid 101244
May 2 23:25:34 hemlock kernel: drmn0: GPU reset begin!
May 2 23:25:35 hemlock kernel: drmn0: BACO reset
May 2 23:25:38 hemlock kernel: drmn0: GPU reset succeeded, trying to resume
May 2 23:25:38 hemlock kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
May 2 23:25:38 hemlock kernel: [drm] VRAM is lost due to GPU reset!
And quite a bit more, which I can attach upon request. amdgpu has been built from post-325aa4dbd10d sources. This is on a GENERIC kernel.
Created attachment 260127 [details] proposed patch Could anyone affected by this please try the attached patch? Be sure to rebuild+reboot the kernel as well as the DRM modules.
I tried your patch - it did not fix the problem. Just to make sure, I reverted 325aa4dbd10d04a61a9529e1d76212b5649b3c73, 901256f6ea3cda2e0951b80b9be466e1b596f7fa, and 87e57632bf88b270f3f9a09f579372d3437aeb17, and built a kernel & driver. It works correctly after the reversion.
(In reply to crahman from comment #15) Thanks. If you're able, full output from the kernel as described in comment 4 would be appreciated. With and without the reversion would be even better.
Created attachment 260130 [details] Patched boot log
Created attachment 260131 [details] Reverted boot log
The messages in the reverted log about 'Failed to retrieve enabled ppfeatures!' and 'attempt to set divider for DECFCLK Failed' started when I switched to drm-66-kmod, and likely have nothing to do with the current problem.
One other thing to try, please: with the existing patch applied and nothing reverted, try booting a kernel with the patch below applied. Is the problem still there?
diff --git a/sys/kern/subr_param.c b/sys/kern/subr_param.c
index f4359efec466..f22c5e59eedf 100644
--- a/sys/kern/subr_param.c
+++ b/sys/kern/subr_param.c
@@ -193,11 +193,13 @@ init_param1(void)
 	tick_bt = sbttobt(tick_sbt);
 	tick_seconds_max = INT_MAX / hz;
 
+#if 0
 	/*
 	 * Arrange for ticks to wrap 10 minutes after boot to help catch
 	 * sign problems sooner.
 	 */
 	ticksl = INT_MAX - (hz * 10 * 60);
+#endif
 
 	vn_lock_pair_pause_max = hz / 100;
 	if (vn_lock_pair_pause_max == 0)
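For readers following along, the arithmetic behind the "wrap 10 minutes after boot" comment in the hunk above can be sketched in userland C. This is only an illustration: the helper names are hypothetical, hz is passed in as a parameter rather than read from the kernel, and the only thing taken from the source is the expression INT_MAX - (hz * 10 * 60).

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: the initial counter value, computed the same way
 * as the assignment in the quoted init_param1() hunk.
 */
static long
sketch_initial_ticks(int hz)
{
	assert(hz > 0);
	return (INT_MAX - ((long)hz * 10 * 60));
}

/*
 * Seconds of uptime before a 32-bit signed view of the counter would
 * pass INT_MAX, given hz ticks per second.
 */
static long
sketch_seconds_until_int_wrap(int hz)
{
	return ((INT_MAX - sketch_initial_ticks(hz)) / hz);
}
```

With hz = 1000 the helper returns 600, i.e. the counter is 600000 ticks short of INT_MAX at boot; the result is the same 600 seconds for any positive hz.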
Ok, I've added your second patch to the first, on unreverted sources, and it does not correct the problem.
To eliminate drm-66-kmod as a problem I've also tested this with drm-61-kmod. The results are unchanged.
I haven't studied this enough to know if troublesome values can occur, but in jiffies.h, there are some problematic uses of unsigned and signed variables, e.g. in linux_timer_jiffies() or time_after(). For example, if you want to know if a > b, (long)(b - a) < 0 will not always provide the correct answer when a and b are unsigned.
(In reply to crahman from comment #23) Ok, I've looked into this and the intervals shouldn't be so big as to cause problems, and the techniques used are chosen to avoid problems with wraparounds.
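A small userland sketch of the comparison discussed in comments 23 and 24, for anyone who wants to see both the wraparound-safe case and the failure case. The function name is hypothetical and this is not the jiffies.h code itself; it only reuses the (long)(b - a) < 0 idiom quoted in comment 23, and it assumes the usual two's-complement conversion from unsigned long to long.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

static_assert(sizeof(long) == sizeof(unsigned long),
    "the cast below assumes long and unsigned long have the same width");

/*
 * Hypothetical stand-in for a time_after()-style check: returns true if
 * timestamp a is later than timestamp b.
 */
static bool
sketch_time_after(unsigned long a, unsigned long b)
{
	/*
	 * The unsigned subtraction wraps modulo 2^N, and the cast
	 * reinterprets the difference as a signed distance. This is
	 * correct as long as a and b are less than LONG_MAX apart.
	 */
	return ((long)(b - a) < 0);
}
```

For example, with b = ULONG_MAX - 5 and a = b + 10 (which wraps around to 4), a plain a > b is false, while sketch_time_after(a, b) is true. The idiom only gives the wrong answer once the two values are more than LONG_MAX apart, e.g. sketch_time_after((unsigned long)LONG_MAX + 2, 0) is false.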
Created attachment 260151 [details] proposed patch Another patch to try, based on code inspection. I'm not too hopeful that this will be it, but there's some chance.
Created attachment 260155 [details]
comment25-drm-log.txt.zst
(In reply to Mark Johnston from comment #25)
> Another patch to try, based on code inspection.
Not working yet, but I've seen some negative values that I haven't seen before. Thank you for your effort.
Hi, I'd like to signal that, after upgrading to recent head, I'm also experiencing something like this. When starting lightdm I repeatedly get the following error on the console:
[drm ERROR :amdgpu_ctx_wait_prev_fence] Error (-1) waiting for fence!
[drm ERROR :amdgpu_cs_sync_rings] amdgpu_ctx_wait_prev_fence failed.
and xorg dumps core. Reverting base 325aa4dbd10d and recompiling the kernel and drm makes it work again. I'm sending this "me too" hoping it can help. Thanks all for your efforts!
(In reply to Guido Falsi from comment #27)
Which version of the drm modules are you using?
> [drm ERROR :amdgpu_ctx_wait_prev_fence] Error (-1) waiting for fence!
So dma_fence_wait_timeout() returned -1. For amdgpu, this means that dma_fence_default_wait(MAX_SCHEDULE_TIMEOUT == LONG_MAX) was used. A return value of -1 must have come from schedule_timeout(), but linux_schedule_timeout(MAX_SCHEDULE_TIMEOUT) should always return MAX_SCHEDULE_TIMEOUT. So this error should be impossible, unless I'm looking at the wrong version of the drm-kmod sources...
The affected code lives in dmabuf.ko. Are you sure this is getting recompiled together with amdgpu.ko?
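To make the expectation in comment 28 concrete, here is a minimal model of the return-value contract it describes. This is a hypothetical userland sketch, not the LinuxKPI or drm-kmod code: the names are made up, the elapsed time is passed in instead of being measured, and the handling of bounded timeouts (remaining time, clamped at zero) follows the usual Linux schedule_timeout() convention rather than anything stated in this thread.

```c
#include <assert.h>
#include <limits.h>

/* Stand-in for MAX_SCHEDULE_TIMEOUT, which comment 28 equates with LONG_MAX. */
#define SKETCH_MAX_TIMEOUT LONG_MAX

/*
 * Hypothetical model of what a schedule_timeout()-style function reports
 * after sleeping for 'elapsed' ticks out of a requested 'timeout'.
 */
static long
sketch_timeout_result(long timeout, long elapsed)
{
	assert(timeout >= 0 && elapsed >= 0);
	/* An unbounded wait hands the same sentinel back to the caller. */
	if (timeout == SKETCH_MAX_TIMEOUT)
		return (SKETCH_MAX_TIMEOUT);
	/* A bounded wait reports the time left, clamped at zero. */
	return (elapsed >= timeout ? 0 : timeout - elapsed);
}
```

Under this contract no input produces a negative result, which is why a -1 reaching amdgpu_ctx_wait_prev_fence() points at a bug somewhere along that path.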
(In reply to Mark Johnston from comment #28)
> Which version of the drm modules are you using?
Also which src revision did you compile your kernel from?
(In reply to Mark Johnston from comment #28)
Not madpilot@ but my own amdgpu machine also gets bit by dma_fence_wait_timeout(). While cinnamon and X both crash, only cinnamon is reflected in dmesg:
May 3 09:37:33 plymouth kernel: FreeBSD 15.0-CURRENT #34 main-n276940-8d136fb027ba: Sat May 3 07:06:20 EDT 2025
[snip]
May 3 09:39:10 plymouth kernel: [drm ERROR :amdgpu_ctx_wait_prev_fence] Error (-1) waiting for fence!
May 3 09:39:10 plymouth kernel: [drm ERROR :amdgpu_cs_sync_rings] amdgpu_ctx_wait_prev_fence failed.
May 3 09:39:10 plymouth kernel: pid 79001 (cinnamon), jid 0, uid 1001: exited on signal 6 (core dumped)
May 3 09:39:10 plymouth kernel: [drm ERROR :amdgpu_ctx_wait_prev_fence] Error (-1) waiting for fence!
May 3 09:39:10 plymouth kernel: [drm ERROR :amdgpu_cs_sync_rings] amdgpu_ctx_wait_prev_fence failed.
May 3 09:39:10 plymouth kernel: [drm ERROR :amdgpu_ctx_wait_prev_fence] Error (-1) waiting for fence!
May 3 09:39:10 plymouth kernel: [drm ERROR :amdgpu_cs_sync_rings] amdgpu_ctx_wait_prev_fence failed.
drm-66-kmod was built on this src revision as part of my own mass rebuild after your __FreeBSD_version bump. The cinnamon backtrace leads to mesa, for which I will need to rebuild without stripping so we get some symbols, if you want to see the userspace side of things.
(In reply to Charlie Li from comment #30)
Can you please run the following in a terminal while those amdgpu errors are occurring, and post the output file here? Basically, run this command, start cinnamon, and once the crash has happened, ctrl-C the dtrace command.
# dtrace -n 'fbt::linux_schedule_timeout:entry {printf("%x", args[0]); stack();} fbt::linux_schedule_timeout:return {printf("%x", args[1]);}' -o output.txt
(In reply to Mark Johnston from comment #29)
I updated sources and recompiled src (creating a pkgbase set with poudriere), then, from the same poudriere jail, I forced a rebuild of the drm-66-kmod package. The files amdgpu.ko and dmabuf.ko do have the same timestamps. The version I'm using is base 83507f9e6fedbc02d1acecc9fb5c09eae34b1ae6 from Sun May 4 07:49:28 2025. My sources also include an IPv6 patch I'm working on, but that should have no relation to the drm code.
Created attachment 260182 [details]
dtrace output drm-61-kmod
(In reply to Mark Johnston from comment #31)
Same "waiting for fence!" error here on drm-61-kmod, dtrace output attached.
(In reply to Mark Johnston from comment #28) I missed the bug the first time I read through that code. Can anyone test this patch? https://github.com/freebsd/drm-kmod/pull/351
(In reply to Mark Johnston from comment #34) This patch allows the desktop to continue loading as intended. Comment 33 beat me to it, but my output would probably have been the same.
(In reply to Mark Johnston from comment #34) An unmodified base (325aa4dbd10d not reverted, none of the base patches suggested here applied) with the patched drm-66-kmod works for me as well. Thank you.
Thanks to everyone who tested and provided debug output. I'll get this landed asap.
Emmanuel, could you update the drm-*-kmod ports now that https://github.com/freebsd/drm-kmod/pull/351 landed?
Created attachment 260304 [details] Mark's patch, as a port patch for graphics/drm-6[1,6]-kmod Here's Mark's patch, in a form suitable for insertion in the appropriate port's files directory until the patch makes it into the tree. Thanks for fixing this. I have also tested it and the problem has been solved.
This should be fixed in the ports tree now after commits 8495963ac01c02db3f0d11253732c0c925d52fb6 and 3adb702829143411225480d4aa30bbc51bad4803. Thanks to everyone who helped and provided debug output.