Bug 253461 - LinuxKPI: [AMD/ATI] RV730 PRO [Radeon HD 4650] crashes kernel
Summary: LinuxKPI: [AMD/ATI] RV730 PRO [Radeon HD 4650] crashes kernel
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.2-RELEASE
Hardware: amd64 Any
: --- Affects Many People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash, needs-qa
Depends on: 261501
Blocks:
  Show dependency treegraph
 
Reported: 2021-02-12 15:35 UTC by Patrick Mackinlay
Modified: 2022-08-30 16:41 UTC (History)
4 users (show)

See Also:
koobs: maintainer-feedback? (wulf)
koobs: maintainer-feedback? (manu)
koobs: maintainer-feedback? (x11)
koobs: mfc-stable13?
koobs: mfc-stable12?


Attachments
Patch for dma-fence.h (419 bytes, patch)
2022-01-04 22:53 UTC, Bill Paul
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Patrick Mackinlay 2021-02-12 15:35:30 UTC
I am running X11 on FreeBSD 12.2-RELEASE-p2 with a Radeon 4650. I am using the radeonkms kernel module from drm-fbsd12.0-kmod-4.16.g20201016_1 / gpu-firmware-kmod-g20200920 with the xf86-video-ati-19.1.0_3,1 driver. The machine has 2 monitors running at resolution 1920x1200.
With this setup the machine randomly panics (can be after days). This seems to happen most often when the screen saver kicks in and the monitors are put to sleep.
The last time I go a kernel dump which reports the following:

[694796] 
[694796] 
[694796] Fatal trap 12: page fault while in kernel mode
[694796] cpuid = 7; apic id = 07
[694796] fault virtual address  = 0x69
[694796] fault code             = supervisor write data, page not present
[694796] instruction pointer    = 0x20:0xffffffff83122a60
[694796] stack pointer          = 0x28:0xfffffe012a967640
[694796] frame pointer          = 0x28:0xfffffe012a967660
[694796] code segment           = base rx0, limit 0xfffff, type 0x1b
[694796]                        = DPL 0, pres 1, long 1, def32 0, gran 1
[694796] processor eflags       = interrupt enabled, resume, IOPL = 0
[694796] current process                = 4954 (Xorg)
[694796] trap number            = 12
[694796] panic: page fault
[694796] cpuid = 7
[694796] time = 1613135656
[694796] KDB: stack backtrace:
[694796] #0 0xffffffff80ba5605 at kdb_backtrace+0x65
[694796] #1 0xffffffff80b598bb at vpanic+0x17b
[694796] #2 0xffffffff80b59733 at panic+0x43
[694796] #3 0xffffffff81032911 at trap_fatal+0x391
[694796] #4 0xffffffff8103296f at trap_pfault+0x4f
[694796] #5 0xffffffff81031fb6 at trap+0x286
[694796] #6 0xffffffff8100a828 at calltrap+0x8
[694796] #7 0xffffffff83124538 at ttm_unmap_and_unpopulate_pages+0xe8
[694796] #8 0xffffffff8311b320 at ttm_tt_destroy+0x40
[694796] #9 0xffffffff83120da7 at ttm_bo_cleanup_refs_or_queue+0x1c7
[694796] #10 0xffffffff8311d571 at ttm_bo_unref+0x81
[694796] #11 0xffffffff82fd2312 at radeon_bo_unref+0x22
[694796] #12 0xffffffff82fc4d7e at radeon_gem_object_free+0x1e
[694796] #13 0xffffffff830adc8b at drm_gem_object_release_handle+0xcb
[694796] #14 0xffffffff830adb5e at drm_gem_handle_delete+0x8e
[694796] #15 0xffffffff830b12e1 at drm_ioctl_kernel+0xf1
[694796] #16 0xffffffff830b1589 at drm_ioctl+0x289
[694796] #17 0xffffffff830f2e58 at linux_file_ioctl+0x318
[694796] Uptime: 8d0h59m56s
[694796] Dumping 8874 out of 32590 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
Comment 1 Patrick Mackinlay 2021-02-12 23:18:49 UTC
Another crash

[32719] 
[32719] 
[32719] Fatal trap 12: page fault while in kernel mode
[32719] cpuid = 2; apic id = 02
[32719] fault virtual address   = 0xffffffff83066a58
[32719] fault code              = supervisor read instruction, protection violation
[32719] instruction pointer     = 0x20:0xffffffff83066a58
[32719] stack pointer           = 0x28:0xfffffe00a5c3b338
[32719] frame pointer           = 0x28:0xfffffe00a5c3b370
[32719] code segment            = base rx0, limit 0xfffff, type 0x1b
[32719]                         = DPL 0, pres 1, long 1, def32 0, gran 1
[32719] processor eflags        = interrupt enabled, resume, IOPL = 0
[32719] current process         = 2849 (Xorg)
[32719] trap number             = 12
[32719] panic: page fault
[32719] cpuid = 2
[32719] time = 1613168457
[32719] KDB: stack backtrace:
[32719] #0 0xffffffff80ba5605 at kdb_backtrace+0x65
[32719] #1 0xffffffff80b598bb at vpanic+0x17b
[32719] #2 0xffffffff80b59733 at panic+0x43
[32719] #3 0xffffffff81032911 at trap_fatal+0x391
[32719] #4 0xffffffff8103296f at trap_pfault+0x4f
[32719] #5 0xffffffff81031fb6 at trap+0x286
[32719] #6 0xffffffff8100a828 at calltrap+0x8
[32719] #7 0xffffffff82fb5921 at radeon_cs_ioctl+0xa41
[32719] #8 0xffffffff830b12e1 at drm_ioctl_kernel+0xf1
[32719] #9 0xffffffff830b1589 at drm_ioctl+0x289
[32719] #10 0xffffffff830f2e58 at linux_file_ioctl+0x318
[32719] #11 0xffffffff80bc33b7 at kern_ioctl+0x2b7
[32719] #12 0xffffffff80bc305a at sys_ioctl+0xfa
[32719] #13 0xffffffff810334c7 at amd64_syscall+0x387
[32719] #14 0xffffffff8100b14e at fast_syscall_common+0xf8
Comment 2 Patrick Mackinlay 2021-02-15 09:46:38 UTC
[204299] 
[204299] 
[204299] Fatal trap 12: page fault while in kernel mode
[204299] cpuid = 2; apic id = 02
[204299] fault virtual address  = 0xc0
[204299] fault code             = supervisor read data, page not present
[204299] instruction pointer    = 0x20:0xffffffff830b6137
[204299] stack pointer          = 0x28:0xfffffe00aa30f660
[204299] frame pointer          = 0x28:0xfffffe00aa30f690
[204299] code segment           = base rx0, limit 0xfffff, type 0x1b
[204299]                        = DPL 0, pres 1, long 1, def32 0, gran 1
[204299] processor eflags       = interrupt enabled, resume, IOPL = 0
[204299] current process                = 2968 (Xorg)
[204299] trap number            = 12
[204299] panic: page fault
[204299] cpuid = 2
[204299] time = 1613377107
[204299] KDB: stack backtrace:
[204299] #0 0xffffffff80ba5605 at kdb_backtrace+0x65
[204299] #1 0xffffffff80b598bb at vpanic+0x17b
[204299] #2 0xffffffff80b59733 at panic+0x43
[204299] #3 0xffffffff81032911 at trap_fatal+0x391
[204299] #4 0xffffffff8103296f at trap_pfault+0x4f
[204299] #5 0xffffffff81031fb6 at trap+0x286
[204299] #6 0xffffffff8100a828 at calltrap+0x8
[204299] #7 0xffffffff83121cac at ttm_bo_man_put_node+0x3c
[204299] #8 0xffffffff83120ddc at ttm_bo_cleanup_refs_or_queue+0x1fc
[204299] #9 0xffffffff8311d571 at ttm_bo_unref+0x81
[204299] #10 0xffffffff82fd2312 at radeon_bo_unref+0x22
[204299] #11 0xffffffff82fc4d7e at radeon_gem_object_free+0x1e
[204299] #12 0xffffffff830adc8b at drm_gem_object_release_handle+0xcb
[204299] #13 0xffffffff830adb5e at drm_gem_handle_delete+0x8e
[204299] #14 0xffffffff830b12e1 at drm_ioctl_kernel+0xf1
[204299] #15 0xffffffff830b1589 at drm_ioctl+0x289
[204299] #16 0xffffffff830f2e58 at linux_file_ioctl+0x318
[204299] #17 0xffffffff80bc33b7 at kern_ioctl+0x2b7
Comment 3 Bill Paul 2022-01-04 22:52:02 UTC
I believe I have a fix for this bug. It is a problem with the linuxkpi code in the FreeBSDDesktop-kms-drm-4.16.g20201016-8843e1fc5_GH0.tar.gz distribution.

Notes:

- This problem has been there for some time. I've had it happen in FreeBSD 12.2-RELEASE and FreeBSD 12.3-RELEASE.

- It's not confined to a single Radeon card. I've observed the problem with the following hardware on different machines:

vgapci0@pci0:1:0:0:     class=0x030000 card=0x21261028 chip=0x68f91002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Cedar [Radeon HD 5000/6000/7350/8350 Series]'
    class      = display
    subclass   = VGA

vgapci0@pci0:0:1:0: class=0x030000 card=0x168b103c chip=0x96481002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Sumo [Radeon HD 6480G]'
    class      = display
    subclass   = VGA

vgapci1@pci0:131:0:0:   class=0x030000 card=0x90b8103c chip=0x67711002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Caicos XTX [Radeon HD 8490 / R5 235X OEM]'
    class      = display
    subclass   = VGA

(Note that the Sumo device is built into a laptop, an HP ProBook 4535S.)

- This problem has been reported by others. PR 237544 is a duplicate. The panics I experienced had the same stack traces as shown in both PRs.

- PR 237544 provides an important hint that this crash did _not_ happen with the drm-fbsd11.2-kmod port/package. Although it has been deprecated, I was able to build and install the drm-fbsd11.2-kmod code on my FreeBSD 12.3-RELEASE system (the laptop) and the crashes went away.

- In my case, the panics were more likely to occur when the system was under load. The laptop seemed to trigger it more frequently (which actually made it easier to track it down).

I tried to track the problem down by comparing the the drm-fbsd11.2-kmod and drm-fbsd12.0-kmod code and swapping bits of the 11.2 code into the 12.0 tree to see what effect that would have. Eventually I traced the problem to the linuxkpi code, and then to the dma-fence code, and then finally, to this function in linuxkpi/gplv2/include/linux/dma-fence.h:

static inline void
dma_fence_signal_locked_sub(struct dma_fence *fence)
{
        struct dma_fence_cb *cur;

        while ((cur = list_first_entry_or_null(&fence->cb_list,
                    struct dma_fence_cb, node)) != NULL) {
                list_del_init(&cur->node);
                spin_unlock(fence->lock);   /* <-- No! */
                cur->func(fence, cur);
                spin_lock(fence->lock);     /* <-- No! */
        }
} 

Note the two lines highlited above.

The dma_fence_signal_locked_sub() routine is shared by both dma_fence_signal() and dma_fence_signal_locked(). The latter function is intended to be used when the caller is already holding the fence spinlock. The former takes the spinlock itself.

The problem is that the above code causes the spinlock to be dropped in the case where dma_fence_signal() is called. This is not the same behavior as the older 11.2 code: in that case, the lock is held while the callouts are invoked. (I *think* this is also the case in the later code in FreeBSD 13 too.) I believe that dropping the lock before calling the callouts opens a race condition window and this is what leads to the crash. It's difficult to ascertain that this is the what's happening from the crash stack traces, but in my analysis I found that at least sometimes the problem was that something was trying to dereference a NULL DMA fence pointer.

I patched my copy of the code to remove the spin_unlock() and spin_lock() calls shown above, and that seemed to fix the problem. The laptop has not crashed since I did this. I also made the same change to the 12.2-RELEASE system with the "Cedar" card and exercised it a bit, and that one seemed to run ok too. I have just patched the "Caicos" machine today and so far it's running stable as well (this is my work machine and this is my first day back at the office for the new year).

I created a version of the drm-fbsd12.0-kmod port with this change included as a patch, which can be downloaded from here:

http://people.freebsd.org/~wpaul/radeon/drm-fbsd12.0-kmod.tar.gz

I will also attach the patch to this PR.

Can someone please test this to see if it fixes the problem for them too?

Note: I happen to have about 3 or 4 extra Radeon cards as spares (I rescued these from the e-waste bin) and would be happen to send one to a developer if that would help (assuming they have a machine with a slot that can accommodate it).
Comment 4 Bill Paul 2022-01-04 22:53:27 UTC
Created attachment 230717 [details]
Patch for dma-fence.h

Proposed fix for crash with Radeon KMS driver.
Comment 5 Kubilay Kocak freebsd_committer freebsd_triage 2022-01-05 00:54:18 UTC
Thank you for the report, analysis and patch Patrick
Comment 6 Bill Paul 2022-01-05 03:31:02 UTC
(In reply to Kubilay Kocak from comment #5)
It's not Patrick, it's Bill. :)
Comment 7 Vladimir Kondratyev freebsd_committer freebsd_triage 2022-01-09 22:39:45 UTC
(In reply to Bill Paul from comment #3)

The spin_unlock/spin_lock pair is FreeBSD addition introduced in this commit: https://github.com/FreeBSDDesktop/kms-drm/commit/81fee07842909530e5ef02d43076a92b22180b48

It is accidentally mostly reverted before 5.4 in commit 20a3e51a54f45df76f82bc381d6375e72021f1a1
Comment 8 Vladimir Kondratyev freebsd_committer freebsd_triage 2022-01-09 22:52:07 UTC
(In reply to Vladimir Kondratyev from comment #7)

And previous change is refactoring of following commit:

https://github.com/FreeBSDDesktop/kms-drm/commit/c34ed6fa2a559aacb21a195cf138d122f0fc263c
Comment 9 Bill Paul 2022-01-09 23:30:52 UTC
(In reply to Vladimir Kondratyev from comment #7)

It looked to me that originally Linux had the dma_fence_signal() API and later a new API dma_fence_signal_locked() was added. According to what I've read, the idea is that dma_fence_signal_locked() can be used if the caller is already holding the DMA fence object spinlock, while the older dma_fence_signal() function takes the lock for you.

The question here is: when you signal a dma fence object, and you invoke its attached callout routines, do you hold the spinlock or do you drop it?

The older linuxkpi code in drm-fbsd11.2-kmod was based on Linux 4.11 and only had the dma_fence_signal() API, and that code always held the fence spinlock when invoking the callouts.

In drm-fbsd12.0-kmod, based on Linux 4.16, both dma_fence_signal() and dma_fence_signal_locked() are present. HOWEVER, the logic is now such that both functions drop the dma fence spinlock when calling the callouts.

This changes the behavior of dma_fence_signal(), and I think the change was wrong (though likely unintentional). Now, dma_fence_signal() drops the spinlock when invoking the callouts. This does not seem to harm the Intel i915kms.ko driver, but it seems to cause the radeonkms.ko driver driver to panic when the system is under load. I must assume that dropping the lock leads to a race condition when two different threads try to access the same dma fence object.

If you browse the most recent Linux kernel code, you can also see that this behavior is inconsistent with the native Linux implementations of dma_fence_signal() and dma_fence_signal_locked():

https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-fence.c#L376

The dma_fence_signal_timestamp_locked() function shown here is used by both dma_fence_signal() and dma_fence_signal_locked(). dma_fence_signal() takes the fence spinlock before calling it. Note that the fence spinlock is _not_ released when invoking the callbacks.

From this I am forced to conclude:

- When calling dma_fence_signal(), the fence spinlock is supposed to be held until the function returns, including when the callbacks are called.
- When calling dma_fence_signal_locked(), the same is true, except it is the caller that's expected to take the fence spinlock.
- The current behavior in drm-fbsd12.0-kmod where the lock is dropped when invoking the callouts is therefore wrong on two counts: it deviates from the Linux behavior, which breaks synchronization in the Radeon driver.

I think my fix preserves the expected behavior of both routines, because dma_fence_signal_unlocked() does not call dma_fence_signal_unlocked_sub() with the spinlock held, while dma_fence_signal() does.

My office machine with the CAICOS chipset has been running with this fix for a week now and has been stable. I've also been using the same fix on my laptop with the SUMO chipset with the same fix for a bit longer and it also hasn't crashed. Before the laptop would not last more than 5 minutes before it would panic.

-Bill
Comment 10 Vladimir Kondratyev freebsd_committer freebsd_triage 2022-01-10 15:44:47 UTC
(In reply to Bill Paul from comment #9)
The analysis and patch LGTM. Thank you.

This unlock/lock pair exists only in FreeBSD version of drm. It was introduced in 4.12 by johalun@ to workaround unknown bug, probably taking sleepable lock with non-slepable lock held and effectively reverted in 4.19 with comment "Fixes amdgpu".

I'll try to test it on i915 after getting some spare time to ensure that the "unknown bug" does not resurrect.
Comment 11 Vladimir Kondratyev freebsd_committer freebsd_triage 2022-01-23 19:48:55 UTC
(In reply to Bill Paul from comment #9)
Finally I installed 12.3 and tested your patch on i915.

It resulted in lots of witness warning, but cherry-picking of https://github.com/freebsd/drm-kmod/commit/b1530ea5ff4a81e1122e80f1b8c7ce6e31929215 fixed them.
Comment 12 Vladimir Kondratyev freebsd_committer freebsd_triage 2022-04-01 20:33:07 UTC
20220126 version of drm-fbsd12.0-kmod now includes the fix

Thank you for investigation!
Comment 13 Bjoern A. Zeeb freebsd_committer freebsd_triage 2022-06-05 21:04:42 UTC
Bill / Wulf, can this be closed?
Comment 14 Vladimir Kondratyev freebsd_committer freebsd_triage 2022-06-05 22:00:42 UTC
(In reply to Bjoern A. Zeeb from comment #13)
> Bill / Wulf, can this be closed?

It is up to Bill. I have no means to test drm-fbsd12.0-kmod on Radeon
Comment 15 Vladimir Kondratyev freebsd_committer freebsd_triage 2022-06-05 22:06:39 UTC
Proposed patch is included in to drm-fbsd12.0-kmod starting from 982315681dd2
Comment 16 Bill Paul 2022-06-05 22:39:34 UTC
Yes, you can close this bug. I installed the updated drm-fbsd12.0-kmod package on my work machine a few weeks ago and it worked fine, no panics. Thanks for integrating the fix.