While trying to get Xorg graphics to work on my Alder Lake P laptop with FreeBSD 14 RELEASE, I hit a problem where loading the i915kms module appeared to cause the system to hang. I searched for reports about this problem and found a discussion at https://github.com/freebsd/drm-kmod/issues/252. I did eventually get graphics to work, but it required a small kernel change; I made a comment (as @hirairaka5) in the github thread to let people know what I had to do. I'm reporting the kernel issue here. There's a fairly obvious bug in the xarray (re-)implementation in linuxkpi that causes a kernel segfault in i195kms (from drm-515-kmod). There are a number of ways to fix it; I picked a simple one. Here is my local git commit for the fix, with the details: commit 31fc171e7ff77dfa01df39a0598b167fbd6a24bf (HEAD -> xarray-2) Author: Donn Seeley <donn@xmission.com> Date: Mon Jan 1 08:56:08 2024 -0700 xa_destroy: leave the xarray re-usable, like Linux does When trying to get graphics to work on an Alder Lake P laptop, I got a crash. I copied down the information from ddb: --- trap 0xc, rip=0xffffffff80b302ec, rsp=0xfffffe013acf44e0 ... __mtx_lock_sleep() at __mtx_lock_sleep+0xbc xa_load() at xa_load+0x73 guc_lrc_desc_pin() at guc_lrc_desc_pin+0x49 intel_guc_submission_enable() at intel_guc_submission_enable+0x7d __uc_init_hw() at __uc_init_hw+0x46f intel_gt_init_hw() at intel_gt_init_hw+0x481 intel_gt_resume() at intel_gt_resume+0x5cc intel_gt_init() at intel_gt_init+0x213 i915_gem_init() at i915_gem_init+0x92 i915_driver_probe() at i915_driver_probe+0x40 i915_pci_probe() at i915_pci_probe+0x40 linux_pci_attach_device() at linux_pci_attach_device+0x420 device_attach() at device_attach+0x3be bus_generic_driver_added() at bus_generic_driver_added+0xa1 [etc.] We got a segfault in __mtx_lock_sleep() when trying to dereference a null thread pointer. The sequence of events looks like this: intel_guc_submission_enable(guc) guc_init_lrc_mapping(guc) xa_destroy(&guc->context_lookup) mtx_destroy(&xa->mtx) _mtx_destroy(&(m)->mtx_lock) m->mtx_lock = MTX_DESTROYED guc_kernel_context_pin(guc, ce) guc_lrc_desc_pin(guc, loop=true) lrc_desc_registered(guc, id) __get_context(guc, id) xa_load(xa=&guc->context_lookup, index=id) xa_lock(xa) mtx_lock(&(xa)->mtx) mtx_lock_flags((m), 0) _mtx_lock_flags((m), (opts), (file), (line)) __mtx_lock_flags(&(m)->mtx_lock, o, f, l) _mtx_obtain_lock_fetch(m, &v, tid) atomic_fcmpset_acq_ptr(&(mp)->mtx_lock, vp, (tid)) _mtx_lock_sleep(m, v, opts, file, line) __mtx_lock_sleep(&(m)->mtx_lock, v) owner = lv_mtx_owner(v) ((struct thread *)((v) & ~MTX_FLAGMASK)) TD_IS_RUNNING(owner) TD_GET_STATE(td) == TDS_RUNNING (td)->td_state We crash because v == MTX_DESTROYED, which means that td, the thread pointer part of the owner cookie, is NULL and we can't dereference it. The Linux version of xa_destroy() is careful to fix up the xarray so that it can be used again. The FreeBSD version is simpler: we just need to maintain the xarray flags when we reinitialize the xarray structure. diff --git a/sys/compat/linuxkpi/common/src/linux_xarray.c b/sys/compat/linuxkpi/common/src/linux_xarray.c index 44900666242f..856a67ba7e3c 100644 --- a/sys/compat/linuxkpi/common/src/linux_xarray.c +++ b/sys/compat/linuxkpi/common/src/linux_xarray.c @@ -352,6 +352,9 @@ xa_destroy(struct xarray *xa) radix_tree_for_each_slot(ppslot, &xa->root, &iter, 0) radix_tree_iter_delete(&xa->root, &iter, ppslot); mtx_destroy(&xa->mtx); + + /* Linux leaves the xarray re-usable after xa_destroy(). */ + xa_init_flags(xa, xa->flags); } /*
The original ticket on this is 270888, but it appears that it may be a different problem, though it might be related. My issue was a total deadlock. System doe not respond to pings and nothing logged. I'm hoping that this problem might tie into mine. It is the issue in #252.
We generally cannot re-init mutex that is destroyed. It is probably fine for the non-debugging kernels, but for kernels with WITNESS it causes some data recorded for WITNESS book tracking. Then unloading the module causes dandling pointers left, and panic occurs later.
(In reply to Konstantin Belousov from comment #2) In that case, I think we can just drop the mtx_destroy() call in xa_destroy(). I'll test that change and let you know.
(In reply to Donn Seeley from comment #3) That would leave us with the same issue. Without mtx_destroy(), there are dandling pointers in witness machinery left.
(In reply to rkoberman from comment #1) As you know, I have the same "freezing" problem (with i5-13600 and UHD 770 Alderlake-S GPU), so I'm hoping some progress will be made on this. What makes you think the problem discussed here may be different?
(In reply to Konstantin Belousov from comment #4) Of course I realized that there would still be a problem when omitting mtx_destroy() almost immediately after I posted my reply. D'oh. I've been staring at the Linux xarray code and the i915kms driver, trying to figure out a good way to handle this issue. I could change the i915kms driver to avoid calling xa_destroy() in guc_init_lrc_mapping() on FreeBSD, but then I would need to be sure that we _do_ call xa_destroy() when the driver finally does free the xarray. I'll look at this in my spare time and I'll see if there's a reasonable solution.
(In reply to Donn Seeley from comment #6) Does unloading i915 work? If not, the call to xa_destroy could be simply removed. If unloading does work, perhaps module unload is the right moment to do it.
(In reply to Konstantin Belousov from comment #2) I had some time to look at this issue this morning. Some questions: If you can't reinitialize a mutex, then what is the point of MTX_NEW? The mutex(9) man page says that MTX_NEW prevents an assertion from firing if you call mtx_init() on a mutex without calling mtx_destroy() first. That language implies that you *can* safely call mtx_init() on a mutex that has been destroyed. If there's some reason why the witness code would have a problem with reinitializing a mutex, could we not just initialize the mutex with MTX_NOWITNESS? I noticed that the FreeBSD emulation of Linux spinlocks uses MTX_NOWITNESS, and the xarray code in Linux uses a spinlock for protection, so there's some precedent here. Comments, suggestions?
(In reply to Donn Seeley from comment #8) MTX_NEW suppresses the check that the mutex memory is zeroed (in reality it is less smart, but lets pretend). You can do mtx_init()/mtx_destroy()/mtx_init() as much as you want. You should not do mtx_init() and then unload the module, or mtx_init()/mtx_init().
(In reply to Konstantin Belousov from comment #9) Got it, thanks. So it sounds to me like the only risk here is unloading the module when witnessing is enabled. In that case, I think the best approach is to keep the code that reinitializes the mutex, and add MTX_NOWITNESS to the mtx_init() flags in xa_init_flags(). Does that seem reasonable to you?
(In reply to Donn Seeley from comment #10) Loosing witness support is very undesirable. Could you track the state of the xarray and explicitly reinit it when a call detects that xarray was finalized?