Bug 237544

Summary: graphics/drm-fbsd12.0-kmod: panic on 12-STABLE with Radeon HD 7450 (but not with drm-fbsd11.2-kmod)
Product: Base System
Reporter: sigsys
Component: kern
Assignee: freebsd-x11 (Nobody) <x11>
Status: Closed Overcome By Events
Severity: Affects Only Me
CC: bsd, danfe, fullermd, jmd, manu, noisetube, thierry, zeising
Priority: ---
Keywords: crash, needs-qa
Version: 12.0-STABLE   
Hardware: amd64   
OS: Any   
See Also: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253461

Description sigsys 2019-04-25 03:09:23 UTC
I have a computer with a Radeon HD 7450 that panics when using the drm-fbsd12.0-kmod driver but works OK with the drm-fbsd11.2-kmod one.  It usually takes a few days to panic (after which I revert it to the older kmod).  Can't tell if it's doing something in particular that triggers it (but that's with browsers and media players always opened).  It did this under 12.0-RELEASE as well.  I tried the newer kmod after upgrading to 12-STABLE to see if that fixed it, but it still panics.

That's with radeonkms. amdgpu doesn't work with it (loading it does not recognize any video card it seems).  I did not have the xf86-video-ati or xf86-video-amdgpu Xorg drivers installed.  I take it they are not needed anymore?

Also oddly enough the screen has a bright gray background after the module loads.  The text is still readable but very difficult to read. It fixes itself after starting Xorg (then switching back to a console VT has a black background).

[66314]
[66314]
[66314] Fatal trap 12: page fault while in kernel mode
[66314] cpuid = 3; apic id = 03
[66314] fault virtual address   = 0x0
[66314] fault code              = supervisor write data  , page not present
[66314] instruction pointer     = 0x20:0xffffffff8334d93c
[66314] stack pointer           = 0x28:0xfffffe004132b610
[66314] frame pointer           = 0x28:0xfffffe004132b640
[66314] code segment            = base rx0, limit 0xfffff, type 0x1b
[66314]                         = DPL 0, pres 1, long 1, def32 0, gran 1
[66314] processor eflags        = interrupt enabled, resume, IOPL = 3
[66314] current process         = 23751 (X:rcs0)
[66314] trap number             = 12
[66314] panic: page fault
[66314] cpuid = 3
[66314] time = 1555971829
[66314] KDB: stack backtrace:
[66314] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe004132b2c0
[66314] vpanic() at vpanic+0x19d/frame 0xfffffe004132b310
[66314] panic() at panic+0x43/frame 0xfffffe004132b370
[66314] trap_fatal() at trap_fatal+0x394/frame 0xfffffe004132b3d0
[66314] trap_pfault() at trap_pfault+0x49/frame 0xfffffe004132b430
[66314] trap() at trap+0x29f/frame 0xfffffe004132b540
[66314] calltrap() at calltrap+0x8/frame 0xfffffe004132b540
[66314] --- trap 0xc, rip = 0xffffffff8334d93c, rsp = 0xfffffe004132b610, rbp = 0xfffffe004132b640 ---
[66314] ____rb_erase_color() at ____rb_erase_color+0x8c/frame 0xfffffe004132b640
[66314] drm_mm_remove_node() at drm_mm_remove_node+0x2bc/frame 0xfffffe004132b680
[66314] ttm_bo_man_put_node() at ttm_bo_man_put_node+0x3c/frame 0xfffffe004132b6a0
[66314] ttm_bo_cleanup_refs_or_queue() at ttm_bo_cleanup_refs_or_queue+0x202/frame 0xfffffe004132b6f0
[66314] ttm_bo_unref() at ttm_bo_unref+0x7e/frame 0xfffffe004132b720
[66314] radeon_bo_unref() at radeon_bo_unref+0x22/frame 0xfffffe004132b740
[66314] radeon_gem_object_free() at radeon_gem_object_free+0x1e/frame 0xfffffe004132b760
[66314] drm_gem_object_release_handle() at drm_gem_object_release_handle+0xd3/frame 0xfffffe004132b790
[66314] drm_gem_handle_delete() at drm_gem_handle_delete+0x8c/frame 0xfffffe004132b7d0
[66314] drm_ioctl_kernel() at drm_ioctl_kernel+0xf5/frame 0xfffffe004132b820
[66314] drm_ioctl() at drm_ioctl+0x27f/frame 0xfffffe004132b910
[66314] linux_file_ioctl() at linux_file_ioctl+0x298/frame 0xfffffe004132b970
[66314] kern_ioctl() at kern_ioctl+0x274/frame 0xfffffe004132b9e0
[66314] sys_ioctl() at sys_ioctl+0x15d/frame 0xfffffe004132bab0
[66314] amd64_syscall() at amd64_syscall+0x364/frame 0xfffffe004132bbf0
[66314] fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe004132bbf0
[66314] --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x800cb560a, rsp = 0x7fffdfffddf8, rbp = 0x7fffdfffde20 ---
[66314] Uptime: 18h25m14s

I'll try again next time the driver gets updated.
Comment 1 sigsys 2019-04-28 09:12:26 UTC
I tried again with drm-fbsd12.0-kmod-4.16.g20190424, it still panics. Took 3 days this time. Now it's mentioning chrome.

[265338]
[265338]
[265338] Fatal trap 12: page fault while in kernel mode
[265338] cpuid = 0; apic id = 00
[265338] fault virtual address  = 0x4000023f8
[265338] fault code             = supervisor read data  , page not present
[265338] instruction pointer    = 0x20:0xffffffff831fd5b4
[265338] stack pointer          = 0x28:0xfffffe007d222180
[265338] frame pointer          = 0x28:0xfffffe007d2221b0
[265338] code segment           = base rx0, limit 0xfffff, type 0x1b
[265338]                        = DPL 0, pres 1, long 1, def32 0, gran 1
[265338] processor eflags       = interrupt enabled, resume, IOPL = 0
[265338] current process                = 69174 (chrome:rcs0)
[265338] trap number            = 12
[265338] panic: page fault
[265338] cpuid = 0
[265338] time = 1556434411
[265338] KDB: stack backtrace:
[265338] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe007d221e30
[265338] vpanic() at vpanic+0x19d/frame 0xfffffe007d221e80
[265338] panic() at panic+0x43/frame 0xfffffe007d221ee0
[265338] trap_fatal() at trap_fatal+0x394/frame 0xfffffe007d221f40
[265338] trap_pfault() at trap_pfault+0x49/frame 0xfffffe007d221fa0
[265338] trap() at trap+0x29f/frame 0xfffffe007d2220b0
[265338] calltrap() at calltrap+0x8/frame 0xfffffe007d2220b0
[265338] --- trap 0xc, rip = 0xffffffff831fd5b4, rsp = 0xfffffe007d222180, rbp = 0xfffffe007d2221b0 ---
[265338] radeon_fence_signaled() at radeon_fence_signaled+0x34/frame 0xfffffe007d2221b0
[265338] radeon_sa_bo_new() at radeon_sa_bo_new+0x251/frame 0xfffffe007d222310
[265338] radeon_ib_get() at radeon_ib_get+0x2f/frame 0xfffffe007d222350
[265338] radeon_cs_ioctl() at radeon_cs_ioctl+0x25d/frame 0xfffffe007d2227d0
[265338] drm_ioctl_kernel() at drm_ioctl_kernel+0xf5/frame 0xfffffe007d222820
[265338] drm_ioctl() at drm_ioctl+0x27f/frame 0xfffffe007d222910
[265338] linux_file_ioctl() at linux_file_ioctl+0x298/frame 0xfffffe007d222970
[265338] kern_ioctl() at kern_ioctl+0x274/frame 0xfffffe007d2229e0
[265338] sys_ioctl() at sys_ioctl+0x15d/frame 0xfffffe007d222ab0
[265338] amd64_syscall() at amd64_syscall+0x364/frame 0xfffffe007d222bf0
[265338] fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe007d222bf0
[265338] --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80db4e60a, rsp = 0x7fffdfffdeb8, rbp = 0x7fffdfffdee0 ---
[265338] KDB: enter: panic

I'll try to get a core dump next time and find out the line numbers.
Comment 2 sigsys 2019-07-07 18:50:19 UTC
Panic with the same card on a different computer running 12.0-STABLE r349799 GENERIC.  This time with drm-fbsd12.0-kmod-4.16.g20190624 and gpu-firmware-kmod-g20190620.  But seems to work well with drm-fbsd11.2-kmod-4.11g20190424 so far.


Fatal trap 12: page fault while in kernel mode
cpuid = 15; apic id = 0f
fault virtual address   = 0x5
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff82d6a559
stack pointer           = 0x28:0xfffffe00cfe091d8
frame pointer           = 0x28:0xfffffe00cfe09210
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 3
current process         = 56502 (X:rcs0)
trap number             = 12
panic: page fault
cpuid = 15
time = 1562524377
KDB: stack backtrace:
#0 0xffffffff80c17707 at kdb_backtrace+0x67
#1 0xffffffff80bcb1ad at vpanic+0x19d
#2 0xffffffff80bcb003 at panic+0x43
#3 0xffffffff8109f4bc at trap_fatal+0x39c
#4 0xffffffff8109f509 at trap_pfault+0x49
#5 0xffffffff8109eaff at trap+0x29f
#6 0xffffffff81079bb5 at calltrap+0x8
#7 0xffffffff82cba6ba at radeon_cs_ioctl+0xa4a
#8 0xffffffff82db5ca6 at drm_ioctl_kernel+0xf6
#9 0xffffffff82db5f41 at drm_ioctl+0x281
#10 0xffffffff82df7d38 at linux_file_ioctl+0x298
#11 0xffffffff80c35867 at kern_ioctl+0x267
#12 0xffffffff80c3558d at sys_ioctl+0x15d
#13 0xffffffff810a0074 at amd64_syscall+0x364
#14 0xffffffff8107a49d at fast_syscall_common+0x101
Uptime: 8h37m13s
Comment 3 Niclas Zeising freebsd_committer freebsd_triage 2019-07-10 08:53:23 UTC
Is this still an issue?
Can you report this on our github, https://github.com/FreeBSDDesktop/kms-drm , and link the issue here?
Thanks!
Comment 4 Ivan 2019-07-11 01:30:10 UTC
(In reply to Niclas Zeising from comment #3)
https://github.com/FreeBSDDesktop/kms-drm/issues/130
Comment 5 Emmanuel Vadot freebsd_committer freebsd_triage 2021-11-19 15:05:03 UTC
No news from reporter, closing.
Comment 6 Bill Paul 2021-12-28 00:05:56 UTC
Unfortunately, this bug is still present, even using FreeBSD 12.3-RELEASE.

Note that I have observed it with several Radeon cards, so it doesn't appear specific to a particular card. The same problem does not happen with the Intel i915kms.ko driver from the same package.

Based on the hint here, I created my own drm-fbsd11.2-kmod port (the original has been deleted from the repo as obsolete) and massaged it a bit to compile on FreeBSD 12.3, and that version works without crashing.

I suspect there is a locking problem somewhere in the drm or ttm modules in the newer code. Sadly, nobody seems to care enough to fix it. (And I don't understand the code well enough to fix it myself. The differences between the 4.11 and 4.16 versions of the driver are substantial.)
Comment 7 Bill Paul 2021-12-28 00:07:51 UTC
Oh, if anyone cares, my patched version of the drm-fbsd11.2-kmod port for FreeBSD 12.3 is here:

https://people.freebsd.org/~wpaul/radeon
Comment 8 Alexey Dokuchaev freebsd_committer freebsd_triage 2021-12-28 12:34:20 UTC
(In reply to Bill Paul from comment #7)
> my patched version of the drm-fbsd11.2-kmod port for FreeBSD 12.3
Thanks Bill, that could be quite helpful.  Ideally graphics/drm-*-kmod ports should have never been tied to a particular version of *FreeBSD*, but tracked that of *Linux* instead, so users could choose between e.g. drm-4.11-kmod, drm-4.16-kmod, or drm-5.4-kmod regardless of whether they're running 12.x or -CURRENT.

My two-core, ten-year-old desktop at $work with Radeon HD 4350/4550 (RV710) is running 2019'ish -CURRENT with drm-legacy-kmod-g20190213 and works flawlessly for many months, rebooting only when there's a power outage.  Both 2D (movies) and 3D (games) are fully and properly accelerated, CPU is mostly idle, I can e.g. build -jX LLVM and watch some TV show fullscreen with no stuttering -- something I cannot do on the four-core 2018 laptop with recent DRM bits.  Situation with graphics on FreeBSD had never been worse, AFAIR. :-(
Comment 9 Emmanuel Vadot freebsd_committer freebsd_triage 2021-12-29 07:38:16 UTC
(In reply to Alexey Dokuchaev from comment #8)

> Ideally graphics/drm-*-kmod ports should have never been tied to a particular version of *FreeBSD*, 
> but tracked  that of *Linux* instead, so users could choose between e.g. drm-4.11-kmod, drm-4.16-kmod, or drm-5.4-kmod regardless of whether they're running 12.x or -CURRENT.

In theory that could have been done, but that would have been a nightmare to manage.
Also I don't see why we should give the possibility to users to install old drivers, instead we should fix bugs in newer version if one have a problem, something that I and others do regularly based on reports. None of us are upstream DRM developers and we don't know the code base much as we only look at some parts when we have some problems. You can't expect us drm-kmod people (mostly wulf@ Greg V and I currently) to magically fix every bug. You can't also expect us to have all supported hardware to test.

> My two-core, ten-year-old desktop at $work with Radeon HD 4350/4550 (RV710) is running 2019'ish -CURRENT with drm-legacy-kmod-g20190213 and works flawlessly for many months, rebooting only when there's a power outage.

 Congratulations on running old code on old hardware. RV710 is 14 years old not ten, and in the GPU world 4 years is a *lot*. It's also using the radeonkms driver which have more or less been abandonware upstream for quite some time and again none of us are AMD/Radeon developers.

> Both 2D (movies) and 3D (games) are fully and properly accelerated, CPU is mostly idle, I can e.g. build -jX LLVM and watch some TV show fullscreen with no stuttering -- 
> something I cannot do on the four-core 2018 laptop with recent DRM bits.  

 And something that *I* can do on all my Intel or AMD recent hardware.
 Did you open a PR here or on the github page for drm-kmod so we can (maybe) help you ? (I don't remember a PR from you)

> Situation with graphics on FreeBSD had never been worse, AFAIR. :-(

This kind of comment is really hurtful, demotivating and simply mean.
Comment 10 Alexey Dokuchaev freebsd_committer freebsd_triage 2021-12-29 10:21:01 UTC
(In reply to Emmanuel Vadot from comment #9)
> Also I don't see why we should give the possibility to users to install
> old drivers
Because with every Linux version bump (4.11 -> 4.16 -> 5.4) come not only fixes and better support, but also various regressions.

> instead we should fix bugs in newer version if one have a problem,
> something that I and others do regularly based on reports.
Ideally yes, but bugs arrive at a higher rate than fixes, unfortunately.

> It's also using the radeonkms driver which have more or less been abandonware
> upstream
I can't care less for its status so long as it delivers and works perfectly.  On a larger scale, I've noticed that often abandonware means lack of problems, while supported means "prepare for a bumpy ride".  Go figure! :)

> Did you open a PR here or on the github page for drm-kmod?
We had discussed that in a different email, I will once I have concrete reproducible results with at least some preliminary analysis (e.g., revision or approximate date of regression).  Vague reports are likely to be closed as overcome by events.

> This kind of comment is really hurtful, demotivating and simply mean.
I'm sorry, but I think you got me wrong.  I appreciate the work you guys put into this shit, esp. given the limited resources we have, but with Linux being our upstream, it should not come as a surprise as Linux is known for its mediocre quality.  I mean, it's most likely not your fault in the first place, but what we're being fed.
Comment 11 Bill Paul 2021-12-30 20:55:13 UTC
So, since I'm off work this week and have not much else to do, I decided to try isolating the actual problem here. Now that I have a known working set of code (drm-fbsd11.2-kmod) I thought I could compare it to the non-working code (drm-fbsd12.0-kmod) and gradually bisect things to narrow down the fault.

After much hair-pulling and gnashing of teeth, I finally isolated things down to the dma-fence module in the linuxkpi code.

Here's what I tried:

- Replaced the contents of the drivers/gpu/drm/radeon directory in drm-fbsd12.0-kmod with the contents from the radeon directory in drm-fbsd11.2-kmod
- Result: no change, panic still occurred

- Replaced the contents of the drivers/gpu/drm/ttm directory in drm-fbsd12.0-kmod with the contents of the ttm directory in drm-fbsd11.2-kmod (as well as the associated header files)
- Result: no change, panic still occurred

- Replaced the contents of the linuxkpi and drivers/gpu/drm/ttm directories in drm-fbsd12.0-kmod with the contents of linuxkpi and ttm directories from drm-fbsd11.2-kmod (as well as the associated header files)
- Result: No panic

- Replaced _just_ the contents of the linuxkpi directory in drm-fbsd12.0-kmod with the contents of the linuxkpi directory in drm-fbsd11.2-kmod (this time taking care to preserve the ttm module; they are somewhat tightly coupled so this took a bit more effort)
- Result: No panic

- Replaced _just_ the dma-fence.h and linux_dmafence.c modules in the linuxkpi directory in drm-fbsd12.0-kmod with the ones from drm-fbsd11.2-kmod, and also tweaked linux_sync_file.c a little (it uses an API from the 12.0 code which isn't in the 11.2 code)
- Result: No panic

I'm still not exactly sure what's wrong here, but there seems to be a problem in the dma-fence module with locking and/or reference counting that causes fence structures to be deleted unexpectedly. This is what leads to the traps on bad pointers.

I created a custom tarball of the drm-fbsd12.0-kmod port which includes patches to the FreeBSDDesktop 4.16 code to revert the dma-fence code as described above. You can download it from here:

http://people.freebsd.org/~wpaul/radeon/drm-fbsd12.0-kmod.tar.gz

The specific things I did are:

1) Replaced dma-fence.h and linux_dmafence.c in the drm-fbsd12.0-kmod port with the versions from drm-fbsd11.2-kmod.

2) Added a compat wrapper function in dma-fence.h for dma_fence_get_rcu_safe() which just calls dma_fence_get_rcu().

3) Added a compat macro in dma-fence.h for dma_fence_is_signaled_locked() which just calls dma_fence_is_signaled()

4) In linux_sync_file.c, changed the sync_fill_fence_info() function back to how it looked in the 11.2 codebase, because it uses dma_fence_get_status() and DMA_FENCE_FLAG_TIMESTAMP_BIT, which were not available in the older 11.2 dma-fence code
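
For readers who don't want to unpack the tarball, here is a rough user-space sketch of what the two compat shims in steps 2 and 3 amount to. It is not the actual patch: the struct layout and the bodies of dma_fence_get_rcu() and dma_fence_is_signaled() are simplified stand-ins invented for illustration (the real ones live in the linuxkpi headers and use kref and RCU annotations). Only the shape of the two shims at the bottom reflects the steps above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel structure; only what this sketch needs. */
struct dma_fence {
	int	refcount;	/* stand-in for the kref in the real struct */
	bool	signaled;	/* stand-in for the signaled flag bit */
};

/*
 * Stand-in for the older API: take a reference unless the fence is
 * already being torn down (refcount has dropped to zero).
 */
static struct dma_fence *
dma_fence_get_rcu(struct dma_fence *fence)
{
	if (fence == NULL || fence->refcount == 0)
		return (NULL);
	fence->refcount++;
	return (fence);
}

/* Stand-in for the older API: report whether the fence has signaled. */
static bool
dma_fence_is_signaled(struct dma_fence *fence)
{
	return (fence->signaled);
}

/*
 * Step 2: compat wrapper. The newer call takes a pointer to the fence
 * pointer; the shim dereferences it and calls dma_fence_get_rcu().
 */
static struct dma_fence *
dma_fence_get_rcu_safe(struct dma_fence **fencep)
{
	return (dma_fence_get_rcu(*fencep));
}

/* Step 3: compat macro mapping the _locked variant onto the plain one. */
#define	dma_fence_is_signaled_locked(f)	dma_fence_is_signaled(f)
```

With these in place, callers written against the newer names compile against the older dma-fence.h without further changes.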

Just unpack the tarball under /usr/ports/graphics in place of the old one and then run make, followed by "make deinstall" and "make reinstall".

It occurred to me that instead of taking the older 11.2 dma-fence module and porting it forward, it might make more sense to take the 13.0 module and port it back. But this assumes that the drm-fbsd13.0-kmod code doesn't have the same stability problem in it as drm-fbsd12.0-kmod, and I don't know if that's true. (So far nobody has said whether or not they're using a Radeon card with 13.0 and whether or not they've encountered the same problems.) I may still try this anyway if I'm still sufficiently bored.

So far I've tested this on two devices:

vgapci0@pci0:1:0:0:     class=0x030000 card=0x21261028 chip=0x68f91002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Cedar [Radeon HD 5000/6000/7350/8350 Series]'
    class      = display
    subclass   = VGA

vgapci0@pci0:0:1:0:     class=0x030000 card=0x168b103c chip=0x96481002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Sumo [Radeon HD 6480G]'
    class      = display
    subclass   = VGA

I'm using the machine with the CEDAR device right now. The laptop with the SUMO device is much more prone to crashing. Usually what I do to provoke it is:

- Boot and load the driver
- Plug in my phone and set up tethering over USB
- Start KDE5
- Start Firefox
- Browse Facebook or Reddit for a while

It usually panics within a few minutes.

Lastly, I have a question: I followed up to this particular PR because it seemed to most closely match the problems I was having, but it's been closed. Should I open a new PR? This bug is still present with 12.3 and I'm clearly not the only one affected by it. (I also still can't explain why it doesn't seem to affect the i915kms driver.)
Comment 12 sigsys 2021-12-31 20:54:37 UTC
(In reply to Emmanuel Vadot from comment #9)
Yeah thanks for your work on this. This is clearly hard, complicated work not a lot of people could do. With how complex GPUs are (and the hardware variety there is), I don't see how else it could be done than by making it as easy as possible to integrate existing drivers with as little modifications as possible. Making new drivers would be an INSANE amount of work...

amdgpu does work a lot better here on an RX560 with pretty much constant use on 13-STABLE. It just works. Pretty much perfectly. The only problem I can notice is rare, temporary screen corruption sometimes (but I saw those on Windows too, though more rarely).

But radeonkms still panics. As you said it's really old hardware at this point but it's what's built-in the motherboard of a lot of older computers. The panics can take days or weeks to happen though and it seems pretty damn random. I can imagine something like this must be hell to debug. And running desktops on old computers is probably not something a lot of people are gonna want to do... It would make more sense to prioritize supporting newer GPUs I suppose. Though those old Radeon cards can be useful for boards/CPUs without integrated graphics (especially since we're in a GPUpocalypse right now).

(In reply to Bill Paul from comment #11)
Oh wow seems like you narrowed it down a lot.

radeonkms panics still happened for me on 14-CURRENT (with the corresponding drm-kmod port). I only noticed them on a test computer that wasn't getting a lot of desktop use (but that was left on running things) so it's hard to say if they were more or less frequent.

When I'm done shuffling hardware I'm gonna get my radeonkms test computer back on my desk and leave it running and gather more infos on crashes if that helps. In my case it wasn't easy to trigger the panics.  It could take a few days. I could try whatever versions you think might be useful.
Comment 13 Bill Paul 2021-12-31 22:01:57 UTC
> But radeonkms still panics. As you said it's really old hardware at this point[...]

I really don't think it's fair to call it really old hardware. It may not be current hardware, but it's still perfectly serviceable, and I suspect it would work fine with the same driver code in Linux. I think the real problem here is a bug in the Linux compatibility code, and the fact that it only happens to trip with Radeon hardware is just dumb luck.

To be fair, the reason I have this hardware in the first place is that I've been salvaging it from the e-waste bin at work. Bear in mind, sometimes stuff just ends up in there because it was used for one project and then forgotten. (I have a 32-core, 32GB system at work because of this.) As a result, I have several Radeon graphics cards, and recently I also ended up with a laptop with built-in R600/SUMO graphics as well. The laptop works great with FreeBSD 12.3 on it, except for the stupid panic in the radeonkms driver. Now that I seem to have worked around that too, I have no complaints.

Anyway, if I understand you correctly, it sounds like the problem is still present in FreeBSD 14-CURRENT. From what I can tell, the major difference between the older drm-fbsd11.2-kmod code and the later code is that the dma-fence code was updated to include support for some new APIs in Linux. In particular, there is a dma_fence_get_status() function which wasn't there before, and support for tracking timestamps. There also seems to be a dma_fence_get_rcu_safe() routine which wasn't there before (there was just dma_fence_get_rcu()). I suspect that the implementation of these routines is not quite correct, but only the Radeon driver seems to use them in a way which causes them to fail.

Note that these routines are used both by the Radeon driver directly and internally by other modules, e.g. linux_sync_file and drm_syncobj. The use patterns may also depend on how the user-space drivers that are part of the X server use these facilities too, and I don't know much about that.

> Oh wow seems like you narrowed it down a lot.

I think I've narrowed it down even more. I realized that the only functional difference in linux_dmafence.c between the 11.2 and 12.0 cases is the addition of the dma_fence_get_status() function, so I created a smaller patch for this module that just #ifdefs this function off and leaves everything else alone. I'm using that code right now, and it seems to be holding up. I updated the tarball with the new patch:

http://people.freebsd.org/~wpaul/radeon/drm-fbsd12.0-kmod.tar.gz

This means that right now, the only major difference is that I'm using the older version of dma-fence.h from the drm-fbsd11.2-kmod code, with only one minor compatibility fixup in linux_sync_file.c.

Unfortunately I can't easily analyze the 14-CURRENT code right now because I don't have a machine with it installed. I might be able to fix that once I get back to the office next week. One thing I've noticed is that the linuxkpi directory in the drm-fbsd-kmod package gets smaller for more recent versions of FreeBSD. I guess this is because the GPL'ed modules in the drm driver are gradually being rewritten and migrated into the FreeBSD kernel sources proper. I don't know if the dma-fence code is part of that. It seems that at least for drm-fbsd13.0-kmod the dma-fence code is still in the driver package, but I haven't checked for -CURRENT yet.

If the dma-fence code is still in the driver package, then it may still have the same bug, but I can't easily kludge up a patch for it just yet.

> When I'm done shuffling hardware I'm gonna get my radeonkms test computer back on my desk and leave it running and gather more infos on crashes if that helps[...]

It would be interesting to see if the stack traces are similar. I would expect to see the same drm_ioctl()->drm_ioctl_kernel()->radeon_cs_ioctl() path as that seems to be the most common.

What would really help is if whoever put together the dma-fence support in the linuxkpi module would step forward, review it a bit and maybe offer some guidance. I'm only vaguely familiar with what this facility is even for; it would be nice if someone with some more insight would comment.

Also, my other question remains: should I open a new PR for this?
Comment 14 Bill Paul 2022-01-01 20:00:52 UTC
I think I figured it out.

The problem seems to be in dma_fence_signal_locked_sub():

static inline void
dma_fence_signal_locked_sub(struct dma_fence *fence)
{
        struct dma_fence_cb *cur;

        while ((cur = list_first_entry_or_null(&fence->cb_list,
                    struct dma_fence_cb, node)) != NULL) {
                list_del_init(&cur->node);
                spin_unlock(fence->lock);
                cur->func(fence, cur);
                spin_lock(fence->lock);
        }
}

This function is shared by dma_fence_signal() and dma_fence_signal_locked(). It looks like the problem is the spin_unlock()/spin_lock() calls used to drop the fence lock while calling the signal callbacks. The drm-fbsd11.2-kmod code did not do this, and for that matter it looks like the most recent Linux code doesn't do it either. As far as I can tell, dropping this lock here is what causes the race condition. The rest of the code is not expecting this to happen when dma_fence_signal() is called; it's only dma_fence_signal_locked() that should work this way.

If I patch the drm-fbsd12.0-kmod code to remove the spin_unlock()/spin_lock() calls, I also don't get any crashes.
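
To make the difference concrete, here is a small user-space mock-up of the two loop shapes. It is only an illustration, not the kernel code: mock_lock, mock_fence, mock_cb and the helper names are all made up for this sketch, and the "lock" merely records whether it is held. It shows that with the unlock/lock pair every callback runs in a window where the fence lock is not held, and with the pair removed no such window exists. It does not reproduce the race itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock spinlock: records only whether it is currently held. */
struct mock_lock { bool held; };

static void
mock_lock_acquire(struct mock_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void
mock_lock_release(struct mock_lock *l)
{
	assert(l->held);
	l->held = false;
}

struct mock_fence;

struct mock_cb {
	struct mock_cb *next;
	void (*func)(struct mock_fence *, struct mock_cb *);
};

struct mock_fence {
	struct mock_lock lock;
	struct mock_cb *cb_list;
	int calls_with_lock_held;
	int calls_with_lock_dropped;
};

/* Callback that records whether the fence lock was held when it ran. */
static void
count_cb(struct mock_fence *fence, struct mock_cb *cb)
{
	(void)cb;
	if (fence->lock.held)
		fence->calls_with_lock_held++;
	else
		fence->calls_with_lock_dropped++;
}

/* Same shape as the loop quoted above: lock dropped around each callback. */
static void
signal_sub_dropping_lock(struct mock_fence *fence)
{
	struct mock_cb *cur;

	while ((cur = fence->cb_list) != NULL) {
		fence->cb_list = cur->next;
		cur->next = NULL;
		mock_lock_release(&fence->lock);
		cur->func(fence, cur);
		mock_lock_acquire(&fence->lock);
	}
}

/* Patched shape: unlock/lock pair removed, lock stays held throughout. */
static void
signal_sub_holding_lock(struct mock_fence *fence)
{
	struct mock_cb *cur;

	while ((cur = fence->cb_list) != NULL) {
		fence->cb_list = cur->next;
		cur->next = NULL;
		cur->func(fence, cur);
	}
}

/*
 * Runs one variant over two callbacks with the lock held on entry and
 * returns how many callbacks ran while the lock was not held.
 */
static int
run_variant(void (*signal_sub)(struct mock_fence *))
{
	struct mock_cb b = { NULL, count_cb };
	struct mock_cb a = { &b, count_cb };
	struct mock_fence fence = { { false }, &a, 0, 0 };

	mock_lock_acquire(&fence.lock);
	signal_sub(&fence);
	mock_lock_release(&fence.lock);
	return (fence.calls_with_lock_dropped);
}
```

Calling run_variant(signal_sub_dropping_lock) counts both callbacks as having run unlocked, while run_variant(signal_sub_holding_lock) counts none. In the real driver, that unlocked window is where another thread could touch or free the fence.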

I created a new tarball with a single patch that has just this fix:

http://people.freebsd.org/~wpaul/radeon/drm-fbsd12.0-kmod.tar.gz

I've been running with this patch for the last day or so and haven't had any panics. I would appreciate it if anyone else who has been experiencing this same crash (i.e. similar to the panics in this PR) could test this patch and see if it fixes it for you.

It would also be nice if someone could also review the code and confirm if my findings make sense.

Oh, one last thing: from a cursory inspection of the FreeBSD 13 code, I don't see this same problem, so if you claim that you're experiencing "the same crash" with FreeBSD 13 or later, please back up your claim by showing me the panic stack trace. If it doesn't match the examples in this PR, then your problem may be something entirely different. I'm sorry if your system is also unstable, but it's important to be sure, because I don't want to waste a lot of time on something that turns out to be unrelated.