Bug 271333 - kernel panic 13.2-RELEASE with drmn0: [drm] GPU HANG
Summary: kernel panic 13.2-RELEASE with drmn0: [drm] GPU HANG
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.2-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Mark Johnston
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2023-05-09 18:36 UTC by Greg Balfour
Modified: 2024-01-09 18:01 UTC
CC List: 2 users

Description Greg Balfour 2023-05-09 18:36:18 UTC
Only seen once.  Was visiting duckduckgo with the tor-browser from pkg at the time.

FreeBSD desktop.example.com 13.2-RELEASE FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC amd64

May  8 12:38:43 desktop syslogd: kernel boot file is /boot/kernel/kernel
May  8 12:38:43 desktop kernel: drmn0: [drm] GPU HANG: ecode 6:1:39393938, in Isolated Web Conten [109930]
May  8 12:38:43 desktop kernel: drmn0: [drm] Resetting chip for stopped heartbeat on rcs0
May  8 12:38:43 desktop kernel: drmn0: [drm] Isolated Web Conten[109930] context reset due to GPU hang
May  8 12:38:43 desktop kernel: drmn0: [drm] GPU HANG: ecode 6:1:39393938, in Isolated Web Conten [109930]
May  8 12:38:43 desktop kernel: 
May  8 12:38:43 desktop syslogd: last message repeated 1 times
May  8 12:38:43 desktop kernel: Fatal trap 12: page fault while in kernel mode
May  8 12:38:43 desktop kernel: cpuid = 6; apic id = 06
May  8 12:38:43 desktop kernel: fault virtual address   = 0x61
May  8 12:38:43 desktop kernel: fault code              = supervisor read data, page not present
May  8 12:38:43 desktop kernel: instruction pointer     = 0x20:0xffffffff80f55587
May  8 12:38:43 desktop kernel: stack pointer           = 0x28:0xfffffe001b7c9b60
May  8 12:38:43 desktop kernel: frame pointer           = 0x28:0xfffffe001b7c9ba0
May  8 12:38:43 desktop kernel: code segment            = base rx0, limit 0xfffff, type 0x1b
May  8 12:38:43 desktop kernel:                         = DPL 0, pres 1, long 1, def32 0, gran 1
May  8 12:38:43 desktop kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
May  8 12:38:43 desktop kernel: current process         = 0 (linuxkpi_short_wq_0)
May  8 12:38:43 desktop kernel: trap number             = 12
May  8 12:38:43 desktop kernel: panic: page fault
May  8 12:38:43 desktop kernel: cpuid = 6
May  8 12:38:43 desktop kernel: time = 1683567443
May  8 12:38:43 desktop kernel: KDB: stack backtrace:
May  8 12:38:43 desktop kernel: #0 0xffffffff80c53dc5 at kdb_backtrace+0x65
May  8 12:38:43 desktop kernel: #1 0xffffffff80c06741 at vpanic+0x151
May  8 12:38:43 desktop kernel: #2 0xffffffff80c065e3 at panic+0x43
May  8 12:38:43 desktop kernel: #3 0xffffffff810b1fa7 at trap_fatal+0x387
May  8 12:38:43 desktop kernel: #4 0xffffffff810b1fff at trap_pfault+0x4f
May  8 12:38:43 desktop kernel: #5 0xffffffff81088e78 at calltrap+0x8
May  8 12:38:43 desktop kernel: #6 0xffffffff80f5567d at kmem_free+0x2d
May  8 12:38:43 desktop kernel: #7 0xffffffff8271e81d at __i915_gpu_coredump_free+0x12d
May  8 12:38:43 desktop kernel: #8 0xffffffff826efdd9 at intel_gt_handle_error+0xa9
May  8 12:38:43 desktop kernel: #9 0xffffffff826db131 at heartbeat+0x2a1
May  8 12:38:43 desktop kernel: #10 0xffffffff80e63653 at linux_work_fn+0xe3
May  8 12:38:43 desktop kernel: #11 0xffffffff80c68961 at taskqueue_run_locked+0x191
May  8 12:38:43 desktop kernel: #12 0xffffffff80c69c23 at taskqueue_thread_loop+0xc3
May  8 12:38:43 desktop kernel: #13 0xffffffff80bc2fce at fork_exit+0x7e
May  8 12:38:43 desktop kernel: #14 0xffffffff81089eee at fork_trampoline+0xe
May  8 12:38:43 desktop kernel: Uptime: 15d21h24m51s
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2023-05-09 19:05:02 UTC
Looks like a LinuxKPI bug:
- i915_vma_coredump contains an array of pages, freed in __i915_gpu_coredump_free() -> cleanup_gt() -> i915_vma_coredump_free() with

    for (page = 0; page < vma->page_count; page++)
        free_page((unsigned long)vma->pages[page]);

- free_page() just calls FreeBSD's kmem_free().  That is, it expects to receive a page mapped into the kernel map.

- Looks like those pages are allocated by pool_alloc() in i915_gpu_error.c.  It uses alloc_page() in the LinuxKPI, which just allocates and returns an unmapped page.  pool_alloc() extracts the direct map address.

So, i915kms is passing a direct map address to free_page(), which doesn't handle that.  Probably free_page() should handle direct-mapped addresses by resolving them to a page and freeing that with linux_free_pages(page, 0).
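
To make the proposed handling concrete, here is a rough userland model of the dispatch, not the actual LinuxKPI code. Everything prefixed with model_ or MODEL_ is hypothetical, including the direct map bounds, and the two counters only stand in for linux_free_pages(page, 0) and kmem_free(). It shows the check only: a direct-mapped address is resolved back to its physical page and released on the page path, and anything else goes to the kernel-map path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout for the model; real bounds are platform constants. */
#define MODEL_PAGE_SIZE UINT64_C(4096)
#define MODEL_DMAP_MIN  UINT64_C(0xfffff80000000000)
#define MODEL_DMAP_MAX  UINT64_C(0xfffffc0000000000)

/* Which release path a model_free_page() call took. */
enum free_path { FREE_VIA_PAGE, FREE_VIA_KMEM };

/* Counters standing in for the two real release routines. */
struct free_stats {
	unsigned page_frees;	/* stand-in for linux_free_pages(page, 0) */
	unsigned kmem_frees;	/* stand-in for kmem_free() */
	uint64_t last_phys;	/* physical page resolved on the page path */
};

/* True if va lies inside the modelled direct map window. */
bool
model_is_direct_mapped(uint64_t va)
{
	return (va >= MODEL_DMAP_MIN && va < MODEL_DMAP_MAX);
}

/* The direct map is a fixed offset from physical memory in this model. */
uint64_t
model_phys_to_dmap(uint64_t pa)
{
	return (pa + MODEL_DMAP_MIN);
}

uint64_t
model_dmap_to_phys(uint64_t va)
{
	assert(model_is_direct_mapped(va));
	return (va - MODEL_DMAP_MIN);
}

/*
 * Modelled free_page(): pick the release path from the address range
 * instead of assuming every address came from the kernel map.
 */
enum free_path
model_free_page(struct free_stats *st, uint64_t va)
{
	if (model_is_direct_mapped(va)) {
		/* Resolve the address to its page and free the page. */
		st->last_phys = model_dmap_to_phys(va) & ~(MODEL_PAGE_SIZE - 1);
		st->page_frees++;
		return (FREE_VIA_PAGE);
	}
	/* Not direct-mapped: treat it as a kernel map allocation. */
	st->kmem_frees++;
	return (FREE_VIA_KMEM);
}
```

In this model, model_free_page(&st, model_phys_to_dmap(0x1234000)) takes the page path and records physical address 0x1234000. An address outside the window, such as 0xfffffe0000001000, takes the kmem path. Without the range check, both would go to the kmem path, which is the situation described above.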
Comment 2 Mark Johnston freebsd_committer freebsd_triage 2023-05-09 19:22:28 UTC
https://reviews.freebsd.org/D40028
Comment 3 Greg Balfour 2023-05-10 03:34:39 UTC
Thanks for looking at this.  Just curious if this crash was related
to the GPU HANG.  Ever since 12.3-RELEASE I've seen the occasional
GPU HANG (but never a crash) and was wondering if this fix might
take care of those.  See
https://lists.freebsd.org/archives/freebsd-stable/2022-January/000501.html
Comment 4 Mark Johnston freebsd_committer freebsd_triage 2023-05-10 13:22:04 UTC
(In reply to Greg Balfour from comment #3)
Hmm, I don't *think* the patch is likely to help with that.  If the bug fixed by the patch is triggered, I'd expect to see a kernel panic.

I'm not sure how to begin tracking down the cause of GPU driver hangs.
Comment 5 Mark Linimon freebsd_committer freebsd_triage 2023-08-30 20:59:41 UTC
^Triage: clarify Summary.
Comment 6 commit-hook freebsd_committer freebsd_triage 2023-10-17 15:56:27 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=6223d0b67af923f53d962a9bf594dc37004dffe8

commit 6223d0b67af923f53d962a9bf594dc37004dffe8
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2023-10-17 14:26:18 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-17 15:19:06 +0000

    linuxkpi: Handle direct-mapped addresses in linux_free_kmem()

    See the analysis in PR 271333.  It is possible for driver code to
    allocate a page, store its address as returned by page_address(), then
    call free_page() on that address.  On most systems that'll result in the
    LinuxKPI calling kmem_free() with a direct-mapped address, which is not
    legal.

    Fix the problem by making linux_free_kmem() check the address to see
    whether it's direct-mapped or not, and handling it appropriately.

    PR:             271333, 274515
    Reviewed by:    hselasky, bz
    Tested by:      trasz
    MFC after:      1 week
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D40028

 sys/compat/linuxkpi/common/src/linux_page.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)
Comment 7 commit-hook freebsd_committer freebsd_triage 2023-10-24 13:39:23 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4862eb8604d503b52e7c3aa7ff32155b75a1ff93

commit 4862eb8604d503b52e7c3aa7ff32155b75a1ff93
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2023-10-17 14:26:18 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-24 13:20:01 +0000

    linuxkpi: Handle direct-mapped addresses in linux_free_kmem()

    See the analysis in PR 271333.  It is possible for driver code to
    allocate a page, store its address as returned by page_address(), then
    call free_page() on that address.  On most systems that'll result in the
    LinuxKPI calling kmem_free() with a direct-mapped address, which is not
    legal.

    Fix the problem by making linux_free_kmem() check the address to see
    whether it's direct-mapped or not, and handling it appropriately.

    PR:             271333, 274515
    Reviewed by:    hselasky, bz
    Tested by:      trasz
    MFC after:      1 week
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D40028

    (cherry picked from commit 6223d0b67af923f53d962a9bf594dc37004dffe8)

 sys/compat/linuxkpi/common/src/linux_page.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)
Comment 8 commit-hook freebsd_committer freebsd_triage 2023-10-25 16:57:05 UTC
A commit in branch releng/14.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=87dbb943df73022dd98487c123aeb125da11c4af

commit 87dbb943df73022dd98487c123aeb125da11c4af
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2023-10-17 14:26:18 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-25 16:53:01 +0000

    linuxkpi: Handle direct-mapped addresses in linux_free_kmem()

    See the analysis in PR 271333.  It is possible for driver code to
    allocate a page, store its address as returned by page_address(), then
    call free_page() on that address.  On most systems that'll result in the
    LinuxKPI calling kmem_free() with a direct-mapped address, which is not
    legal.

    Fix the problem by making linux_free_kmem() check the address to see
    whether it's direct-mapped or not, and handling it appropriately.

    Approved by:    re (gjb)
    PR:             271333, 274515
    Reviewed by:    hselasky, bz
    Tested by:      trasz
    MFC after:      1 week
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D40028

    (cherry picked from commit 6223d0b67af923f53d962a9bf594dc37004dffe8)
    (cherry picked from commit 4862eb8604d503b52e7c3aa7ff32155b75a1ff93)

 sys/compat/linuxkpi/common/src/linux_page.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)
Comment 9 commit-hook freebsd_committer freebsd_triage 2024-01-09 18:01:04 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=6cba7aec21bcd957478a987f9391fd33a4babdac

commit 6cba7aec21bcd957478a987f9391fd33a4babdac
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2023-10-17 14:26:18 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2024-01-09 17:59:49 +0000

    linuxkpi: Handle direct-mapped addresses in linux_free_kmem()

    See the analysis in PR 271333.  It is possible for driver code to
    allocate a page, store its address as returned by page_address(), then
    call free_page() on that address.  On most systems that'll result in the
    LinuxKPI calling kmem_free() with a direct-mapped address, which is not
    legal.

    Fix the problem by making linux_free_kmem() check the address to see
    whether it's direct-mapped or not, and handling it appropriately.

    PR:             271333, 274515
    Reviewed by:    hselasky, bz
    Tested by:      trasz
    MFC after:      1 week
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D40028

    (cherry picked from commit 6223d0b67af923f53d962a9bf594dc37004dffe8)

 sys/compat/linuxkpi/common/src/linux_page.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)