276057 – crash in xa_destroy() while initializing the i915kms module

Bug 276057 - crash in xa_destroy() while initializing the i915kms module

Summary: crash in xa_destroy() while initializing the i915kms module

Status:	New

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern (show other bugs)
Version:	14.0-RELEASE
Hardware:	amd64 Any

Importance:	--- Affects Some People
Assignee:	freebsd-bugs (Nobody)

URL:
Keywords:	crash

Depends on:
Blocks:

Reported:	2024-01-01 16:14 UTC by Donn Seeley
Modified:	2024-01-18 23:30 UTC (History)
CC List:	5 users (show)

See Also:	270888

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Donn Seeley 2024-01-01 16:14:45 UTC

While trying to get Xorg graphics to work on my Alder Lake P laptop with FreeBSD 14 RELEASE, I hit a problem where loading the i915kms module appeared to cause the system to hang.  I searched for reports about this problem and found a discussion at https://github.com/freebsd/drm-kmod/issues/252.  I did eventually get graphics to work, but it required a small kernel change; I made a comment (as @hirairaka5) in the github thread to let people know what I had to do.

I'm reporting the kernel issue here.  There's a fairly obvious bug in the xarray (re-)implementation in linuxkpi that causes a kernel segfault in i195kms (from drm-515-kmod).  There are a number of ways to fix it; I picked a simple one.  Here is my local git commit for the fix, with the details:

commit 31fc171e7ff77dfa01df39a0598b167fbd6a24bf (HEAD -> xarray-2)
Author: Donn Seeley <donn@xmission.com>
Date:   Mon Jan 1 08:56:08 2024 -0700

    xa_destroy: leave the xarray re-usable, like Linux does
    
    When trying to get graphics to work on an Alder Lake P laptop, I got a
    crash.  I copied down the information from ddb:
    
      --- trap 0xc, rip=0xffffffff80b302ec, rsp=0xfffffe013acf44e0 ...
      __mtx_lock_sleep() at __mtx_lock_sleep+0xbc
      xa_load() at xa_load+0x73
      guc_lrc_desc_pin() at guc_lrc_desc_pin+0x49
      intel_guc_submission_enable() at intel_guc_submission_enable+0x7d
      __uc_init_hw() at __uc_init_hw+0x46f
      intel_gt_init_hw() at intel_gt_init_hw+0x481
      intel_gt_resume() at intel_gt_resume+0x5cc
      intel_gt_init() at intel_gt_init+0x213
      i915_gem_init() at i915_gem_init+0x92
      i915_driver_probe() at i915_driver_probe+0x40
      i915_pci_probe() at i915_pci_probe+0x40
      linux_pci_attach_device() at linux_pci_attach_device+0x420
      device_attach() at device_attach+0x3be
      bus_generic_driver_added() at bus_generic_driver_added+0xa1
      [etc.]
    
    We got a segfault in __mtx_lock_sleep() when trying to dereference a
    null thread pointer.  The sequence of events looks like this:
    
      intel_guc_submission_enable(guc)
        guc_init_lrc_mapping(guc)
          xa_destroy(&guc->context_lookup)
            mtx_destroy(&xa->mtx)
              _mtx_destroy(&(m)->mtx_lock)
                m->mtx_lock = MTX_DESTROYED
          guc_kernel_context_pin(guc, ce)
            guc_lrc_desc_pin(guc, loop=true)
              lrc_desc_registered(guc, id)
                __get_context(guc, id)
                  xa_load(xa=&guc->context_lookup, index=id)
                    xa_lock(xa)
                      mtx_lock(&(xa)->mtx)
                        mtx_lock_flags((m), 0)
                          _mtx_lock_flags((m), (opts), (file), (line))
                            __mtx_lock_flags(&(m)->mtx_lock, o, f, l)
                              _mtx_obtain_lock_fetch(m, &v, tid)
                                atomic_fcmpset_acq_ptr(&(mp)->mtx_lock, vp, (tid))
                              _mtx_lock_sleep(m, v, opts, file, line)
                                __mtx_lock_sleep(&(m)->mtx_lock, v)
                                  owner = lv_mtx_owner(v)
                                    ((struct thread *)((v) & ~MTX_FLAGMASK))
                                  TD_IS_RUNNING(owner)
                                    TD_GET_STATE(td) == TDS_RUNNING
                                      (td)->td_state
    
    We crash because v == MTX_DESTROYED, which means that td, the thread
    pointer part of the owner cookie, is NULL and we can't dereference it.
    
    The Linux version of xa_destroy() is careful to fix up the xarray so
    that it can be used again.  The FreeBSD version is simpler: we just need
    to maintain the xarray flags when we reinitialize the xarray structure.

diff --git a/sys/compat/linuxkpi/common/src/linux_xarray.c b/sys/compat/linuxkpi/common/src/linux_xarray.c
index 44900666242f..856a67ba7e3c 100644
--- a/sys/compat/linuxkpi/common/src/linux_xarray.c
+++ b/sys/compat/linuxkpi/common/src/linux_xarray.c
@@ -352,6 +352,9 @@ xa_destroy(struct xarray *xa)
        radix_tree_for_each_slot(ppslot, &xa->root, &iter, 0)
                radix_tree_iter_delete(&xa->root, &iter, ppslot);
        mtx_destroy(&xa->mtx);
+
+       /* Linux leaves the xarray re-usable after xa_destroy().  */
+       xa_init_flags(xa, xa->flags);
 }
 
 /*

Comment 1 rkoberman 2024-01-01 17:42:50 UTC

The original ticket on this is 270888, but it appears that it may be a different problem, though it might be related. My issue was a total deadlock. System doe not respond to pings and nothing logged. I'm hoping that this problem might tie into mine. It is the issue in #252.

Comment 2 Konstantin Belousov freebsd_committer

2024-01-01 20:08:05 UTC

We generally cannot re-init mutex that is destroyed.  It is probably fine for
the non-debugging kernels, but for kernels with WITNESS it causes some data
recorded for WITNESS book tracking.  Then unloading the module causes dandling
pointers left, and panic occurs later.

Comment 3 Donn Seeley 2024-01-01 22:01:52 UTC

(In reply to Konstantin Belousov from comment #2)
In that case, I think we can just drop the mtx_destroy() call in xa_destroy().  I'll test that change and let you know.

Comment 4 Konstantin Belousov freebsd_committer

2024-01-01 22:09:07 UTC

(In reply to Donn Seeley from comment #3)
That would leave us with the same issue. Without mtx_destroy(), there are dandling
pointers in witness machinery left.

Comment 5 Kostas Oikonomou 2024-01-01 23:46:17 UTC

(In reply to rkoberman from comment #1)
As you know, I have the same "freezing" problem (with i5-13600 and UHD 770 Alderlake-S GPU), so I'm hoping some progress will be made on this.
What makes you think the problem discussed here may be different?

Comment 6 Donn Seeley 2024-01-01 23:49:51 UTC

(In reply to Konstantin Belousov from comment #4)
Of course I realized that there would still be a problem when omitting mtx_destroy() almost immediately after I posted my reply.  D'oh.

I've been staring at the Linux xarray code and the i915kms driver, trying to figure out a good way to handle this issue.  I could change the i915kms driver to avoid calling xa_destroy() in guc_init_lrc_mapping() on FreeBSD, but then I would need to be sure that we _do_ call xa_destroy() when the driver finally does free the xarray.  I'll look at this in my spare time and I'll see if there's a reasonable solution.

Comment 7 Konstantin Belousov freebsd_committer

2024-01-05 05:42:50 UTC

(In reply to Donn Seeley from comment #6)
Does unloading i915 work?  If not, the call to xa_destroy could be simply
removed.

If unloading does work, perhaps module unload is the right moment to do it.

Comment 8 Donn Seeley 2024-01-13 18:44:06 UTC

(In reply to Konstantin Belousov from comment #2)
I had some time to look at this issue this morning.  Some questions:

  If you can't reinitialize a mutex, then what is the point of MTX_NEW?
  The mutex(9) man page says that MTX_NEW prevents an assertion from
  firing if you call mtx_init() on a mutex without calling mtx_destroy()
  first.  That language implies that you *can* safely call mtx_init() on
  a mutex that has been destroyed.

  If there's some reason why the witness code would have a problem with
  reinitializing a mutex, could we not just initialize the mutex with
  MTX_NOWITNESS?  I noticed that the FreeBSD emulation of Linux
  spinlocks uses MTX_NOWITNESS, and the xarray code in Linux uses a
  spinlock for protection, so there's some precedent here.

Comments, suggestions?

Comment 9 Konstantin Belousov freebsd_committer

2024-01-13 19:10:59 UTC

(In reply to Donn Seeley from comment #8)
MTX_NEW suppresses the check that the mutex memory is zeroed (in reality it is
less smart, but lets pretend).

You can do mtx_init()/mtx_destroy()/mtx_init() as much as you want.
You should not do mtx_init() and then unload the module, or mtx_init()/mtx_init().

Comment 10 Donn Seeley 2024-01-13 22:04:35 UTC

(In reply to Konstantin Belousov from comment #9)
Got it, thanks.  So it sounds to me like the only risk here is unloading
the module when witnessing is enabled.  In that case, I think the best
approach is to keep the code that reinitializes the mutex, and add
MTX_NOWITNESS to the mtx_init() flags in xa_init_flags().  Does that
seem reasonable to you?

Comment 11 Konstantin Belousov freebsd_committer

2024-01-18 23:30:37 UTC

(In reply to Donn Seeley from comment #10)
Loosing witness support is very undesirable.

Could you track the state of the xarray and explicitly reinit it when a call
detects that xarray was finalized?