Bug 286512 - linuxkpi: amdgpu regressed after base 325aa4dbd10d
Summary: linuxkpi: amdgpu regressed after base 325aa4dbd10d
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 15.0-CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Mark Johnston
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2025-05-01 20:19 UTC by Evgenii Khramtsov
Modified: 2025-05-12 14:17 UTC

See Also:


Attachments
kernel+325aa4dbd log when starting sway (3.91 KB, application/octet-stream)
2025-05-02 11:32 UTC, Evgenii Khramtsov
kernel+325aa4dbd (but drm-kmod built against 325aa4dbd~) log when starting sway (7.38 KB, application/octet-stream)
2025-05-02 11:34 UTC, Evgenii Khramtsov
kka-325aa4dbd+comment9.txt.zst (4.55 KB, application/octet-stream)
2025-05-02 17:09 UTC, Evgenii Khramtsov
proposed patch (1.33 KB, patch)
2025-05-03 21:22 UTC, Mark Johnston
Patched boot log (48.33 KB, text/plain)
2025-05-03 22:48 UTC, crahman
Reverted boot log (47.45 KB, text/plain)
2025-05-03 22:49 UTC, crahman
proposed patch (4.14 KB, patch)
2025-05-04 16:34 UTC, Mark Johnston
comment25-drm-log.txt.zst (2.42 KB, application/zstd)
2025-05-04 18:41 UTC, Evgenii Khramtsov
dtrace output drm-61-kmod (1.92 KB, text/plain)
2025-05-05 21:26 UTC, Benjamin Jacobs
Mark's patch, as a port patch for graphics/drm-6[1,6]-kmod (407 bytes, patch)
2025-05-10 01:51 UTC, crahman

Description Evgenii Khramtsov 2025-05-01 20:19:23 UTC
After base 325aa4dbd10d, bug 286495 caused a __FreeBSD_version bump to 1500039, so poudriere users now get an automatic drm-kmod rebuild. After pkg upgrade, running x11-wm/sway on amdgpu prints the following to stderr:

$ dbus-run-session seatd-launch sway
[wlr] [backend/drm/atomic.c:81] connector DP-1: Atomic commit failed: Operation not permitted

Rebuilding {libdrm,mesa,wlroots,sway} after base 325aa4dbd10d doesn't help.

Reverting base 325aa4dbd10d helps.

// Before anyone asks: my local change from https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=286311#c14 is unrelated to this.
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2025-05-01 20:35:15 UTC
I didn't read bug 286495 closely enough, sorry.

So, to be clear, if the kernel+linuxkpi.ko contain base 325aa4dbd10d, but amdgpu.ko is compiled against a tree without that commit, then you see no problems?
Comment 2 Evgenii Khramtsov 2025-05-01 20:38:53 UTC
(In reply to Mark Johnston from comment #1)

> if the kernel+linuxkpi.ko contain base 325aa4dbd10d, but amdgpu.ko is compiled against a tree without that commit, then you see no problems?

Exactly, yes.
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2025-05-01 20:49:37 UTC
To narrow things down a bit:

Do you get any messages from the kernel?

What happens after the error?  Is there a hang or anything like that?  If not, do you get consistent results after starting sway multiple times?
Comment 4 Mark Johnston freebsd_committer freebsd_triage 2025-05-01 21:00:37 UTC
One more thing that would be useful: please set dev.drm.__drm_debug=0xffffffff before starting sway (assuming it is not already set), reproduce the error, and show the resulting kernel messages here.
Comment 5 Evgenii Khramtsov 2025-05-01 21:06:45 UTC
(In reply to Mark Johnston from comment #3)

> Do you get any messages from the kernel?

No.

> What happens after the error?
> Is there a hang or anything like that?

Yes, sway hangs and I can't switch vt, but ssh session works normally
and I can recover control over vt after killing sway via ssh.

> please set dev.drm.__drm_debug=0xffffffff before starting sway
> (assuming it is not already set), reproduce the error, and show
> the resulting kernel messages here.

dev.drm.__drm_debug is 0 by default; otherwise /var/log/ would eat all free space.

Thanks. I'll report back with the logs tomorrow; it is too late where I am to do that now.
Comment 6 Evgenii Khramtsov 2025-05-02 11:32:27 UTC
Created attachment 260076 [details]
kernel+325aa4dbd log when starting sway

This is the log for kernel+325aa4dbd and drm-kmod built against 325aa4dbd.

I'll attach another log for kernel+325aa4dbd and drm-kmod built against 325aa4dbd~ (4fa275a5f357) in a moment.
Comment 7 Evgenii Khramtsov 2025-05-02 11:34:29 UTC
Created attachment 260077 [details]
kernel+325aa4dbd (but drm-kmod built against 325aa4dbd~) log when starting sway

FYI, I tested both cases with local drm-kmod patch reverted to avoid unrelated noise and started sway via:

$ WLR_RENDER_NO_EXPLICIT_SYNC=1 dbus-run-session seatd-launch sway
Comment 8 Evgenii Khramtsov 2025-05-02 11:36:53 UTC
With the running kernel still at 325aa4dbd, drm-kmod built against 325aa4dbd~ works.

Maybe the cause is in some header file changed by base 325aa4dbd?
Comment 9 Mark Johnston freebsd_committer freebsd_triage 2025-05-02 16:12:40 UTC
(In reply to Evgenii Khramtsov from comment #8)
Yes, there is some subtle incompatibility here.

If you compile the kernel+drm-kmod with the patch below, is the problem still there?

diff --git a/sys/compat/linuxkpi/common/include/linux/workqueue.h b/sys/compat/linuxkpi/common/include/linux/workqueue.h
index 25ee861d3015..66d3981d4229 100644
--- a/sys/compat/linuxkpi/common/include/linux/workqueue.h
+++ b/sys/compat/linuxkpi/common/include/linux/workqueue.h
@@ -90,7 +90,7 @@ struct delayed_work {
        struct {
                struct callout callout;
                struct mtx mtx;
-               long    expires;
+               unsigned long expires;
        } timer;
 };
Comment 10 Evgenii Khramtsov 2025-05-02 16:42:38 UTC
(In reply to Mark Johnston from comment #9)

> is the problem still there?

Unfortunately, this doesn't help, no change from 325aa4dbd10d.
Comment 11 Mark Johnston freebsd_committer freebsd_triage 2025-05-02 16:47:02 UTC
(In reply to Evgenii Khramtsov from comment #10)
Thanks, I will keep digging.  One more thing that would be useful: after the hang occurs, could you ssh in and collect the output from "procstat -kka" as root?
Comment 12 Evgenii Khramtsov 2025-05-02 17:09:22 UTC
Created attachment 260089 [details]
kka-325aa4dbd+comment9.txt.zst

Attached is the procstat -kka output taken while sway is hanging.

GENERIC-NODEBUG kernel + local netinet6,nat64lsn,vmm patches on top of base 4e3a6fe0134e.

If the local changes could introduce unrelated noise (I have no idea what the KSTACK offsets are relative to), tell me and I'll try again with a GENERIC kernel at base 4e3a6fe0134e.
Comment 13 crahman 2025-05-03 06:29:21 UTC
Using drm-66-kmod's amdgpu, I'm unable to run X after 325aa4dbd10d.

The X server simply exits with

[    60.098] (EE) modeset(0): failed to set mode: No such file or directory
[   115.161] (EE) 
Fatal server error:
[   115.161] (EE) Caught signal 6 (Abort trap). Server aborting

The kernel produces some noise:

May  2 23:25:34 hemlock kernel: [drm ERROR :amdgpu_job_timedout] ring gfx_0.0.0 timeout, signaled seq=4, emitted seq=6
May  2 23:25:34 hemlock kernel: [drm ERROR :amdgpu_job_timedout] Process information: process  pid 101244 thread  pid 101244
May  2 23:25:34 hemlock kernel: drmn0: GPU reset begin!
May  2 23:25:35 hemlock kernel: drmn0: BACO reset
May  2 23:25:38 hemlock kernel: drmn0: GPU reset succeeded, trying to resume
May  2 23:25:38 hemlock kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
May  2 23:25:38 hemlock kernel: [drm] VRAM is lost due to GPU reset!

And quite a bit more, which I can attach upon request.

amdgpu has been built from post-325aa4dbd10d sources.  This is on a GENERIC kernel.
Comment 14 Mark Johnston freebsd_committer freebsd_triage 2025-05-03 21:22:32 UTC
Created attachment 260127 [details]
proposed patch

Could anyone affected by this please try the attached patch?  Be sure to rebuild+reboot the kernel as well as the DRM modules.
Comment 15 crahman 2025-05-03 22:07:22 UTC
I tried your patch - it did not fix the problem.

Just to make sure, I reverted

325aa4dbd10d04a61a9529e1d76212b5649b3c73,
901256f6ea3cda2e0951b80b9be466e1b596f7fa, and
87e57632bf88b270f3f9a09f579372d3437aeb17, and built a kernel & driver.

It works correctly after reversion.
Comment 16 Mark Johnston freebsd_committer freebsd_triage 2025-05-03 22:26:06 UTC
(In reply to crahman from comment #15)
Thanks.  If you're able, full output from the kernel as described in comment 4 would be appreciated.  With and without the reversion would be even better.
Comment 17 crahman 2025-05-03 22:48:37 UTC
Created attachment 260130 [details]
Patched boot log
Comment 18 crahman 2025-05-03 22:49:15 UTC
Created attachment 260131 [details]
Reverted boot log
Comment 19 crahman 2025-05-03 22:52:59 UTC
The messages in the reverted log about 'Failed to retrieve enabled ppfeatures!' and 'attempt to set divider for DECFCLK Failed' started when I switched to drm-66-kmod, and likely have nothing to do with the current problem.
Comment 20 Mark Johnston freebsd_committer freebsd_triage 2025-05-03 23:00:44 UTC
One other thing to try, please: with the existing patch applied and nothing reverted, try booting a kernel with the patch below applied.  Is the problem still there?

diff --git a/sys/kern/subr_param.c b/sys/kern/subr_param.c
index f4359efec466..f22c5e59eedf 100644
--- a/sys/kern/subr_param.c
+++ b/sys/kern/subr_param.c
@@ -193,11 +193,13 @@ init_param1(void)
        tick_bt = sbttobt(tick_sbt);
        tick_seconds_max = INT_MAX / hz;
 
+#if 0
        /*
         * Arrange for ticks to wrap 10 minutes after boot to help catch
         * sign problems sooner.
         */
        ticksl = INT_MAX - (hz * 10 * 60);
+#endif
 
        vn_lock_pair_pause_max = hz / 100;
        if (vn_lock_pair_pause_max == 0)
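
For readers following along, the arithmetic behind the disabled block can be sketched in standalone C. This is only an illustration of the expression INT_MAX - (hz * 10 * 60) from the patch context, not kernel code; the helper names and hz = 1000 are assumptions made for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Start value from the patch context: INT_MAX - (hz * 10 * 60). */
static int64_t
initial_ticks(int hz)
{
	return ((int64_t)INT_MAX - ((int64_t)hz * 10 * 60));
}

/* Seconds of uptime until the counter reaches INT_MAX. */
static int64_t
seconds_until_int_max(int hz)
{
	return (((int64_t)INT_MAX - initial_ticks(hz)) / hz);
}

/*
 * What code that still keeps the counter in a 32-bit int would see.
 * The narrowing is implementation-defined in C; common compilers wrap.
 */
static int32_t
as_int32_ticks(int64_t t)
{
	return ((int32_t)(uint32_t)t);
}

/* Hypothetical self-check; hz = 1000 is assumed only for the arithmetic. */
static void
ticks_wrap_examples(void)
{
	const int hz = 1000;
	int64_t at_max = initial_ticks(hz) + (int64_t)hz * 10 * 60;

	assert(seconds_until_int_max(hz) == 600);
	assert(as_int32_ticks(at_max) == INT_MAX);
	/* One tick later the 32-bit view turns negative. */
	assert(as_int32_ticks(at_max + 1) == INT32_MIN);
}
```

With the block compiled out via #if 0, the counter presumably keeps whatever start value it had before this assignment, so a sign problem in a 32-bit consumer would not show up ten minutes after boot.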
Comment 21 crahman 2025-05-04 00:58:05 UTC
Ok, I've added your second patch to the first, on unreverted sources, and it does not correct the problem.
Comment 22 crahman 2025-05-04 04:26:21 UTC
To eliminate drm-66-kmod as a problem I've also tested this with drm-61-kmod.  The results are unchanged.
Comment 23 crahman 2025-05-04 11:24:24 UTC
I haven't studied this enough to know if troublesome values can occur, but in jiffies.h, there are some problematic uses of unsigned and signed variables, e.g. in linux_timer_jiffies() or time_after().

For example, if you want to know if a > b, (long)(b - a) < 0 will not always provide the correct answer when a and b are unsigned.
Comment 24 crahman 2025-05-04 11:48:25 UTC
(In reply to crahman from comment #23)
Ok, I've looked into this and the intervals shouldn't be so big as to cause problems, and the techniques used are chosen to avoid problems with wraparounds.
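
A small standalone sketch of the comparison discussed in comment 23, for anyone who wants to check the wraparound behaviour. ticks_after() is a hypothetical stand-in written for this example, not the actual macro from jiffies.h, and the cast from unsigned long to long assumes the usual two's-complement wrapping that common compilers provide.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for a time_after()-style helper.
 * Returns true if tick count a is later than tick count b.
 */
static bool
ticks_after(unsigned long a, unsigned long b)
{
	/* Unsigned subtraction wraps modulo 2^N; the cast recovers the sign. */
	return ((long)(b - a) < 0);
}

/* Exercises the cases discussed above. */
static void
ticks_after_examples(void)
{
	/* Ordinary case, no wraparound. */
	assert(ticks_after(200, 100));
	assert(!ticks_after(100, 200));

	/* b sits just before the wrap point, a just after it. */
	assert(ticks_after(5, ULONG_MAX - 5));
	/* A naive comparison gets this case wrong. */
	assert(!(5UL > ULONG_MAX - 5));

	/* The trick fails once the two values are more than LONG_MAX apart. */
	assert(!ticks_after((unsigned long)LONG_MAX + 2, 0));
}
```

The last assertion is the case from comment 23: the answer is only wrong when the interval exceeds LONG_MAX ticks, which matches the conclusion that realistic intervals are safe.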
Comment 25 Mark Johnston freebsd_committer freebsd_triage 2025-05-04 16:34:15 UTC
Created attachment 260151 [details]
proposed patch

Another patch to try, based on code inspection.  I'm not too hopeful that this will be it, but there's some chance.
Comment 26 Evgenii Khramtsov 2025-05-04 18:41:11 UTC
Created attachment 260155 [details]
comment25-drm-log.txt.zst

(In reply to Mark Johnston from comment #25)

> Another patch to try, based on code inspection.

Not working yet, but I've seen some negative values that I haven't seen before.

Thank you for your effort.
Comment 27 Guido Falsi freebsd_committer freebsd_triage 2025-05-04 21:20:40 UTC
Hi,

I'd like to report that, after upgrading to recent head, I'm also experiencing something like this.

When starting lightdm I repeatedly get the following error on the console:

[drm ERROR :amdgpu_ctx_wait_prev_fence] Error (-1) waiting for fence!
[drm ERROR :amdgpu_cs_sync_rings] amdgpu_ctx_wait_prev_fence failed.


and xorg dumps core.


Reverting base 325aa4dbd10d and recompiling kernel and drm makes it work again.


I'm sending this "me too" hoping it can help.


Thanks all for your efforts!
Comment 28 Mark Johnston freebsd_committer freebsd_triage 2025-05-05 15:34:46 UTC
(In reply to Guido Falsi from comment #27)
Which version of the drm modules are you using?

> [drm ERROR :amdgpu_ctx_wait_prev_fence] Error (-1) waiting for fence!

So dma_fence_wait_timeout() returned -1.  For amdgpu, this means that dma_fence_default_wait(MAX_SCHEDULE_TIMEOUT == LONG_MAX) was used.  A return value of -1 must have come from schedule_timeout(), but linux_schedule_timeout(MAX_SCHEDULE_TIMEOUT) should always return MAX_SCHEDULE_TIMEOUT.  So this error should be impossible, unless I'm looking at the wrong version of the drm-kmod sources...

The affected code lives in dmabuf.ko.  Are you sure this is getting recompiled together with amdgpu.ko?
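
As a general C illustration only: one way a LONG_MAX sentinel can turn into -1 is when a long result is stored in an int somewhere along the call chain. Nothing in this thread confirms that this is what happens in dmabuf.ko; the names below are hypothetical and the result assumes a 64-bit long with the wrapping narrowing that common compilers implement.

```c
#include <assert.h>
#include <limits.h>

/* Stand-in for the "wait forever" sentinel named above (LONG_MAX). */
#define SKETCH_MAX_TIMEOUT	LONG_MAX

/* Hypothetical callee that echoes the sentinel back unchanged. */
static long
sketch_wait(long timeout)
{
	return (timeout);
}

/* Hypothetical caller that stores the long result in an int by mistake. */
static int
sketch_narrowing_caller(void)
{
	/* On LP64 the low 32 bits of LONG_MAX are all ones. */
	int ret = (int)sketch_wait(SKETCH_MAX_TIMEOUT);

	return (ret);
}
```

On an LP64 platform sketch_narrowing_caller() yields -1 even though the callee returned the sentinel intact, so an "impossible" -1 is worth checking against the types of every intermediate variable.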
Comment 29 Mark Johnston freebsd_committer freebsd_triage 2025-05-05 16:04:24 UTC
(In reply to Mark Johnston from comment #28)
> Which version of the drm modules are you using?

Also which src revision did you compile your kernel from?
Comment 30 Charlie Li freebsd_committer freebsd_triage 2025-05-05 18:38:01 UTC
(In reply to Mark Johnston from comment #28)
Not madpilot@ but my own amdgpu machine also gets bit by dma_fence_wait_timeout(). While cinnamon and X both crash, only cinnamon is reflected in dmesg:

May  3 09:37:33 plymouth kernel: FreeBSD 15.0-CURRENT #34 main-n276940-8d136fb027ba: Sat May  3 07:06:20 EDT 2025
[snip]
May  3 09:39:10 plymouth kernel: [drm ERROR :amdgpu_ctx_wait_prev_fence] Error (-1) waiting for fence!
May  3 09:39:10 plymouth kernel: [drm ERROR :amdgpu_cs_sync_rings] amdgpu_ctx_wait_prev_fence failed.
May  3 09:39:10 plymouth kernel: pid 79001 (cinnamon), jid 0, uid 1001: exited on signal 6 (core dumped)
May  3 09:39:10 plymouth kernel: [drm ERROR :amdgpu_ctx_wait_prev_fence] Error (-1) waiting for fence!
May  3 09:39:10 plymouth kernel: [drm ERROR :amdgpu_cs_sync_rings] amdgpu_ctx_wait_prev_fence failed.
May  3 09:39:10 plymouth kernel: [drm ERROR :amdgpu_ctx_wait_prev_fence] Error (-1) waiting for fence!
May  3 09:39:10 plymouth kernel: [drm ERROR :amdgpu_cs_sync_rings] amdgpu_ctx_wait_prev_fence failed.

drm-66-kmod was built on this src revision as part of my own mass rebuild after your __FreeBSD_version bump.

The cinnamon backtrace leads to mesa, for which I will need to rebuild without stripping so we get some symbols, if you want to see the userspace side of things.
Comment 31 Mark Johnston freebsd_committer freebsd_triage 2025-05-05 18:49:06 UTC
(In reply to Charlie Li from comment #30)
Can you please run the following in a terminal while those amdgpu errors are occurring, and post the output file here?  Basically, run this command, start cinnamon, and once the crash has happened, ctrl-C the dtrace command.

# dtrace -n 'fbt::linux_schedule_timeout:entry {printf("%x", args[0]); stack();} fbt::linux_schedule_timeout:return {printf("%x", args[1]);}' -o output.txt
Comment 32 Guido Falsi freebsd_committer freebsd_triage 2025-05-05 21:08:35 UTC
(In reply to Mark Johnston from comment #29)

I updated sources and recompiled src (creating a pkgbase set with poudriere), then, from the same poudriere jail I forced rebuilding the drm-66-kmod package.

The files amdgpu.ko and dmabuf.ko do have the same timestamps.


The version I'm using is base 83507f9e6fedbc02d1acecc9fb5c09eae34b1ae6 from Sun May 4 07:49:28 2025


My sources also include an IPv6 patch I'm working on, but that should have no relation with drm code.
Comment 33 Benjamin Jacobs 2025-05-05 21:26:49 UTC
Created attachment 260182 [details]
dtrace output drm-61-kmod

(In reply to Mark Johnston from comment #31)

Same "waiting for fence!" error here on drm-61-kmod; dtrace output attached.
Comment 34 Mark Johnston freebsd_committer freebsd_triage 2025-05-05 22:22:39 UTC
(In reply to Mark Johnston from comment #28)
I missed the bug the first time I read through that code.

Can anyone test this patch? https://github.com/freebsd/drm-kmod/pull/351
Comment 35 Charlie Li freebsd_committer freebsd_triage 2025-05-06 04:26:57 UTC
(In reply to Mark Johnston from comment #34)
This patch allows the desktop to continue loading as intended. Comment 33 beat me to it; my output would probably have been the same.
Comment 36 Evgenii Khramtsov 2025-05-06 09:03:21 UTC
(In reply to Mark Johnston from comment #34)

Base with 325aa4dbd10d left in place and without any of the base patches suggested here, together with the patched drm-66-kmod, works for me as well. Thank you.
Comment 37 Mark Johnston freebsd_committer freebsd_triage 2025-05-06 13:47:55 UTC
Thanks to everyone who tested and provided debug output.  I'll get this landed asap.
Comment 38 Mark Johnston freebsd_committer freebsd_triage 2025-05-08 15:48:18 UTC
Emmanuel, could you update the drm-*-kmod ports now that https://github.com/freebsd/drm-kmod/pull/351 has landed?
Comment 39 crahman 2025-05-10 01:51:08 UTC
Created attachment 260304 [details]
Mark's patch, as a port patch for graphics/drm-6[1,6]-kmod

Here's Mark's patch, in a form suitable for insertion in the appropriate port's files directory until the patch makes it into the tree.

Thanks for fixing this.  I have also tested it and the problem has been solved.
Comment 40 Mark Johnston freebsd_committer freebsd_triage 2025-05-12 14:17:05 UTC
This should be fixed in the ports tree now after commits 8495963ac01c02db3f0d11253732c0c925d52fb6 and 3adb702829143411225480d4aa30bbc51bad4803.  Thanks to everyone who helped and provided debug output.