| Summary: | LinuxKPI: [AMD/ATI] RV730 PRO [Radeon HD 4650] crashes kernel | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Base System | Reporter: | Patrick Mackinlay <freebsd.68fba> | ||||
| Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> | ||||
| Status: | Closed FIXED | ||||||
| Severity: | Affects Many People | CC: | manu, noisetube, wulf, x11 | ||||
| Priority: | --- | Keywords: | crash, needs-qa | ||||
| Version: | 12.2-RELEASE | Flags: | koobs:
maintainer-feedback?
(wulf) koobs: maintainer-feedback? (manu) koobs: maintainer-feedback? (x11) koobs: mfc-stable13? koobs: mfc-stable12? |
||||
| Hardware: | amd64 | ||||||
| OS: | Any | ||||||
| See Also: | https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237544 | ||||||
| Bug Depends on: | 261501 | ||||||
| Bug Blocks: | |||||||
| Attachments: |
|
||||||
|
Description
Patrick Mackinlay
2021-02-12 15:35:30 UTC
Another crash [32719] [32719] [32719] Fatal trap 12: page fault while in kernel mode [32719] cpuid = 2; apic id = 02 [32719] fault virtual address = 0xffffffff83066a58 [32719] fault code = supervisor read instruction, protection violation [32719] instruction pointer = 0x20:0xffffffff83066a58 [32719] stack pointer = 0x28:0xfffffe00a5c3b338 [32719] frame pointer = 0x28:0xfffffe00a5c3b370 [32719] code segment = base rx0, limit 0xfffff, type 0x1b [32719] = DPL 0, pres 1, long 1, def32 0, gran 1 [32719] processor eflags = interrupt enabled, resume, IOPL = 0 [32719] current process = 2849 (Xorg) [32719] trap number = 12 [32719] panic: page fault [32719] cpuid = 2 [32719] time = 1613168457 [32719] KDB: stack backtrace: [32719] #0 0xffffffff80ba5605 at kdb_backtrace+0x65 [32719] #1 0xffffffff80b598bb at vpanic+0x17b [32719] #2 0xffffffff80b59733 at panic+0x43 [32719] #3 0xffffffff81032911 at trap_fatal+0x391 [32719] #4 0xffffffff8103296f at trap_pfault+0x4f [32719] #5 0xffffffff81031fb6 at trap+0x286 [32719] #6 0xffffffff8100a828 at calltrap+0x8 [32719] #7 0xffffffff82fb5921 at radeon_cs_ioctl+0xa41 [32719] #8 0xffffffff830b12e1 at drm_ioctl_kernel+0xf1 [32719] #9 0xffffffff830b1589 at drm_ioctl+0x289 [32719] #10 0xffffffff830f2e58 at linux_file_ioctl+0x318 [32719] #11 0xffffffff80bc33b7 at kern_ioctl+0x2b7 [32719] #12 0xffffffff80bc305a at sys_ioctl+0xfa [32719] #13 0xffffffff810334c7 at amd64_syscall+0x387 [32719] #14 0xffffffff8100b14e at fast_syscall_common+0xf8 [204299] [204299] [204299] Fatal trap 12: page fault while in kernel mode [204299] cpuid = 2; apic id = 02 [204299] fault virtual address = 0xc0 [204299] fault code = supervisor read data, page not present [204299] instruction pointer = 0x20:0xffffffff830b6137 [204299] stack pointer = 0x28:0xfffffe00aa30f660 [204299] frame pointer = 0x28:0xfffffe00aa30f690 [204299] code segment = base rx0, limit 0xfffff, type 0x1b [204299] = DPL 0, pres 1, long 1, def32 0, gran 1 [204299] processor eflags = interrupt enabled, resume, IOPL = 0 [204299] current process = 2968 (Xorg) [204299] trap number = 12 [204299] panic: page fault [204299] cpuid = 2 [204299] time = 1613377107 [204299] KDB: stack backtrace: [204299] #0 0xffffffff80ba5605 at kdb_backtrace+0x65 [204299] #1 0xffffffff80b598bb at vpanic+0x17b [204299] #2 0xffffffff80b59733 at panic+0x43 [204299] #3 0xffffffff81032911 at trap_fatal+0x391 [204299] #4 0xffffffff8103296f at trap_pfault+0x4f [204299] #5 0xffffffff81031fb6 at trap+0x286 [204299] #6 0xffffffff8100a828 at calltrap+0x8 [204299] #7 0xffffffff83121cac at ttm_bo_man_put_node+0x3c [204299] #8 0xffffffff83120ddc at ttm_bo_cleanup_refs_or_queue+0x1fc [204299] #9 0xffffffff8311d571 at ttm_bo_unref+0x81 [204299] #10 0xffffffff82fd2312 at radeon_bo_unref+0x22 [204299] #11 0xffffffff82fc4d7e at radeon_gem_object_free+0x1e [204299] #12 0xffffffff830adc8b at drm_gem_object_release_handle+0xcb [204299] #13 0xffffffff830adb5e at drm_gem_handle_delete+0x8e [204299] #14 0xffffffff830b12e1 at drm_ioctl_kernel+0xf1 [204299] #15 0xffffffff830b1589 at drm_ioctl+0x289 [204299] #16 0xffffffff830f2e58 at linux_file_ioctl+0x318 [204299] #17 0xffffffff80bc33b7 at kern_ioctl+0x2b7 I believe I have a fix for this bug. It is a problem with the linuxkpi code in the FreeBSDDesktop-kms-drm-4.16.g20201016-8843e1fc5_GH0.tar.gz distribution.
Notes:
- This problem has been there for some time. I've had it happen in FreeBSD 12.2-RELEASE and FreeBSD 12.3-RELEASE.
- It's not confined to a single Radeon card. I've observed the problem with the following hardware on different machines:
vgapci0@pci0:1:0:0: class=0x030000 card=0x21261028 chip=0x68f91002 rev=0x00 hdr=0x00
vendor = 'Advanced Micro Devices, Inc. [AMD/ATI]'
device = 'Cedar [Radeon HD 5000/6000/7350/8350 Series]'
class = display
subclass = VGA
vgapci0@pci0:0:1:0: class=0x030000 card=0x168b103c chip=0x96481002 rev=0x00 hdr=0x00
vendor = 'Advanced Micro Devices, Inc. [AMD/ATI]'
device = 'Sumo [Radeon HD 6480G]'
class = display
subclass = VGA
vgapci1@pci0:131:0:0: class=0x030000 card=0x90b8103c chip=0x67711002 rev=0x00 hdr=0x00
vendor = 'Advanced Micro Devices, Inc. [AMD/ATI]'
device = 'Caicos XTX [Radeon HD 8490 / R5 235X OEM]'
class = display
subclass = VGA
(Note that the Sumo device is built into a laptop, an HP ProBook 4535S.)
- This problem has been reported by others. PR 237544 is a duplicate. The panics I experienced had the same stack traces as shown in both PRs.
- PR 237544 provides an important hint that this crash did _not_ happen with the drm-fbsd11.2-kmod port/package. Although it has been deprecated, I was able to build and install the drm-fbsd11.2-kmod code on my FreeBSD 12.3-RELEASE system (the laptop) and the crashes went away.
- In my case, the panics were more likely to occur when the system was under load. The laptop seemed to trigger it more frequently (which actually made it easier to track it down).
I tried to track the problem down by comparing the the drm-fbsd11.2-kmod and drm-fbsd12.0-kmod code and swapping bits of the 11.2 code into the 12.0 tree to see what effect that would have. Eventually I traced the problem to the linuxkpi code, and then to the dma-fence code, and then finally, to this function in linuxkpi/gplv2/include/linux/dma-fence.h:
static inline void
dma_fence_signal_locked_sub(struct dma_fence *fence)
{
struct dma_fence_cb *cur;
while ((cur = list_first_entry_or_null(&fence->cb_list,
struct dma_fence_cb, node)) != NULL) {
list_del_init(&cur->node);
spin_unlock(fence->lock); /* <-- No! */
cur->func(fence, cur);
spin_lock(fence->lock); /* <-- No! */
}
}
Note the two lines highlited above.
The dma_fence_signal_locked_sub() routine is shared by both dma_fence_signal() and dma_fence_signal_locked(). The latter function is intended to be used when the caller is already holding the fence spinlock. The former takes the spinlock itself.
The problem is that the above code causes the spinlock to be dropped in the case where dma_fence_signal() is called. This is not the same behavior as the older 11.2 code: in that case, the lock is held while the callouts are invoked. (I *think* this is also the case in the later code in FreeBSD 13 too.) I believe that dropping the lock before calling the callouts opens a race condition window and this is what leads to the crash. It's difficult to ascertain that this is the what's happening from the crash stack traces, but in my analysis I found that at least sometimes the problem was that something was trying to dereference a NULL DMA fence pointer.
I patched my copy of the code to remove the spin_unlock() and spin_lock() calls shown above, and that seemed to fix the problem. The laptop has not crashed since I did this. I also made the same change to the 12.2-RELEASE system with the "Cedar" card and exercised it a bit, and that one seemed to run ok too. I have just patched the "Caicos" machine today and so far it's running stable as well (this is my work machine and this is my first day back at the office for the new year).
I created a version of the drm-fbsd12.0-kmod port with this change included as a patch, which can be downloaded from here:
http://people.freebsd.org/~wpaul/radeon/drm-fbsd12.0-kmod.tar.gz
I will also attach the patch to this PR.
Can someone please test this to see if it fixes the problem for them too?
Note: I happen to have about 3 or 4 extra Radeon cards as spares (I rescued these from the e-waste bin) and would be happen to send one to a developer if that would help (assuming they have a machine with a slot that can accommodate it).
Created attachment 230717 [details]
Patch for dma-fence.h
Proposed fix for crash with Radeon KMS driver.
Thank you for the report, analysis and patch Patrick (In reply to Kubilay Kocak from comment #5) It's not Patrick, it's Bill. :) (In reply to Bill Paul from comment #3) The spin_unlock/spin_lock pair is FreeBSD addition introduced in this commit: https://github.com/FreeBSDDesktop/kms-drm/commit/81fee07842909530e5ef02d43076a92b22180b48 It is accidentally mostly reverted before 5.4 in commit 20a3e51a54f45df76f82bc381d6375e72021f1a1 (In reply to Vladimir Kondratyev from comment #7) And previous change is refactoring of following commit: https://github.com/FreeBSDDesktop/kms-drm/commit/c34ed6fa2a559aacb21a195cf138d122f0fc263c (In reply to Vladimir Kondratyev from comment #7) It looked to me that originally Linux had the dma_fence_signal() API and later a new API dma_fence_signal_locked() was added. According to what I've read, the idea is that dma_fence_signal_locked() can be used if the caller is already holding the DMA fence object spinlock, while the older dma_fence_signal() function takes the lock for you. The question here is: when you signal a dma fence object, and you invoke its attached callout routines, do you hold the spinlock or do you drop it? The older linuxkpi code in drm-fbsd11.2-kmod was based on Linux 4.11 and only had the dma_fence_signal() API, and that code always held the fence spinlock when invoking the callouts. In drm-fbsd12.0-kmod, based on Linux 4.16, both dma_fence_signal() and dma_fence_signal_locked() are present. HOWEVER, the logic is now such that both functions drop the dma fence spinlock when calling the callouts. This changes the behavior of dma_fence_signal(), and I think the change was wrong (though likely unintentional). Now, dma_fence_signal() drops the spinlock when invoking the callouts. This does not seem to harm the Intel i915kms.ko driver, but it seems to cause the radeonkms.ko driver driver to panic when the system is under load. I must assume that dropping the lock leads to a race condition when two different threads try to access the same dma fence object. If you browse the most recent Linux kernel code, you can also see that this behavior is inconsistent with the native Linux implementations of dma_fence_signal() and dma_fence_signal_locked(): https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-fence.c#L376 The dma_fence_signal_timestamp_locked() function shown here is used by both dma_fence_signal() and dma_fence_signal_locked(). dma_fence_signal() takes the fence spinlock before calling it. Note that the fence spinlock is _not_ released when invoking the callbacks. From this I am forced to conclude: - When calling dma_fence_signal(), the fence spinlock is supposed to be held until the function returns, including when the callbacks are called. - When calling dma_fence_signal_locked(), the same is true, except it is the caller that's expected to take the fence spinlock. - The current behavior in drm-fbsd12.0-kmod where the lock is dropped when invoking the callouts is therefore wrong on two counts: it deviates from the Linux behavior, which breaks synchronization in the Radeon driver. I think my fix preserves the expected behavior of both routines, because dma_fence_signal_unlocked() does not call dma_fence_signal_unlocked_sub() with the spinlock held, while dma_fence_signal() does. My office machine with the CAICOS chipset has been running with this fix for a week now and has been stable. I've also been using the same fix on my laptop with the SUMO chipset with the same fix for a bit longer and it also hasn't crashed. Before the laptop would not last more than 5 minutes before it would panic. -Bill (In reply to Bill Paul from comment #9) The analysis and patch LGTM. Thank you. This unlock/lock pair exists only in FreeBSD version of drm. It was introduced in 4.12 by johalun@ to workaround unknown bug, probably taking sleepable lock with non-slepable lock held and effectively reverted in 4.19 with comment "Fixes amdgpu". I'll try to test it on i915 after getting some spare time to ensure that the "unknown bug" does not resurrect. (In reply to Bill Paul from comment #9) Finally I installed 12.3 and tested your patch on i915. It resulted in lots of witness warning, but cherry-picking of https://github.com/freebsd/drm-kmod/commit/b1530ea5ff4a81e1122e80f1b8c7ce6e31929215 fixed them. 20220126 version of drm-fbsd12.0-kmod now includes the fix Thank you for investigation! Bill / Wulf, can this be closed? (In reply to Bjoern A. Zeeb from comment #13) > Bill / Wulf, can this be closed? It is up to Bill. I have no means to test drm-fbsd12.0-kmod on Radeon Proposed patch is included in to drm-fbsd12.0-kmod starting from 982315681dd2 Yes, you can close this bug. I installed the updated drm-fbsd12.0-kmod package on my work machine a few weeks ago and it worked fine, no panics. Thanks for integrating the fix. |