Bug 280546 - iwlwifi panic in BT COEX code
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: wireless
Version: 15.0-CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Bjoern A. Zeeb
URL:
Keywords:
Duplicates: 283399
Depends on:
Blocks: iwlwifi
Reported: 2024-08-01 02:07 UTC by rkoberman
Modified: 2025-02-05 06:40 UTC (History)

Attachments
core.txt from panic (181.43 KB, application/x-troff-man)
2024-08-01 02:07 UTC, rkoberman
Crash dumps from 14.2-RELEASE (46.05 KB, application/gzip)
2025-01-26 20:15 UTC, Robin Haberkorn
Proposed fix (1.64 KB, patch)
2025-01-30 01:37 UTC, Robin Haberkorn

Description rkoberman 2024-08-01 02:07:23 UTC
Created attachment 252415 [details]
core.txt from panic

System is a T16-Gen1 with the Alder Lake wifi. When starting the network, it panics with:
Fatal trap 12: page fault while in kernel mode
cpuid = 5; apic id = 12
fault virtual address   = 0xc
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff8359afd3
stack pointer           = 0x28:0xfffffe00f1341c80
frame pointer           = 0x28:0xfffffe00f1341d00
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (linuxkpi_short_wq_1)
rdi: fffffe016695c4f8 rsi: fffffe00f1341c48 rdx: ffffffff8118971b
rcx: 0000000000000000  r8: 0000000000000001  r9: ffffffffffffffff
rax: 0000000000000000 rbx: fffffe0167886e80 rbp: fffffe00f1341d00
r10: 0000000000010000 r11: 0000000000000001 r12: fffffe0167887478
r13: 0000000000000000 r14: fffffe016695c4c8 r15: fffff80003bec540

core.txt is attached. The backtrace shows the following items:
iwl_mvm_bt_notif_iterator() at iwl_mvm_bt_notif_iterator+0xf3/frame 0xfffffe00f1341d00
linuxkpi_ieee80211_iterate_interfaces() at linuxkpi_ieee80211_iterate_interfaces+0x84/frame 0xfffffe00f1341d40
iwl_mvm_bt_coex_notif_handle() at iwl_mvm_bt_coex_notif_handle+0x7c/frame 0xfffffe00f1341da0
iwl_mvm_async_handlers_wk() at iwl_mvm_async_handlers_wk+0x110/frame 0xfffffe00f1341df0
Comment 1 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-08-03 18:51:41 UTC
For some analysis see:
https://lists.freebsd.org/archives/freebsd-wireless/2024-July/002305.html
Comment 2 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-11-10 01:38:07 UTC
Has this been reproducible for you?
Comment 3 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-12-18 17:49:51 UTC
*** Bug 283399 has been marked as a duplicate of this bug. ***
Comment 4 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-12-18 17:53:27 UTC
There's another report in 

https://bugs.freebsd.org/bugzilla/attachment.cgi?id=255934

attached to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=280430 which should belong here.
Comment 5 mmatalka 2024-12-18 21:40:00 UTC
Thank you.  Anything I can do?

A few extra notes:

1. I believe this started after the most recent Intel wifi update.  I have noticed an increase in hard freezes since then.  They seem to happen at random times, and not too frequently: maybe after 24+ hours of running.

2. I cannot confirm that this is the cause of (1), but it seems likely.  I think I just got lucky that I happened to hit the crash this early.
Comment 6 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-12-18 21:55:41 UTC
(In reply to mmatalka from comment #5)

Are you using BT with that machine or in the vicinity?

It sounds like it is reproducible for you (even if at random)?  Would you be willing/able to debug further, or to run with a debug patch to get more info into the crashinfo?
Comment 7 mmatalka 2024-12-19 13:26:50 UTC
Bluetooth drivers get loaded but do not work on this laptop (AX2000 or whatever the chipset is).

The crash happens to me once every day or 2 so I'm happy to run anything you want.  I am currently not exporting crash dumps anywhere so please let me know what I need to do.  I compile 15.0-CURRENT from source already so I can apply and install a patch easily enough.
Comment 8 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-12-20 14:42:51 UTC
(In reply to mmatalka from comment #7)

The question is not so much if the iwmbt or later bluetooth driver gets loaded but if any bluetooth is used around you (could be headphones to a phone, a keyboard to a desktop, or otherwise).


I'll try to come up with some diagnostic patch based on a previous debugging session where Andrea had already provided some good insights and we thought it was due to a mentioned race in the iwlwifi code.  Back then it was the only substantial report but it seems we have at least 3 now.
Comment 9 mmatalka 2024-12-22 06:51:44 UTC
(In reply to Bjoern A. Zeeb from comment #8)
Yes, there is definitely bluetooth going on in my general vicinity.  Thank you.
Comment 10 Robin Haberkorn 2025-01-26 20:10:42 UTC
I just had the same crash apparently and will upload my logs as well. It happened on 14.2-RELEASE. And yes, I do use Bluetooth and there are quite a few Bluetooth devices around.

kgdb quickly reveals a NULL-pointer dereference (chanctx_conf->def.chan == NULL). But I will spare you the details since bz@freebsd.org already guessed it from the previous report. I will happily do some more debugging with my vmcore on your request. I am not really willing to share the vmcore file, though.

I don't understand this code well enough to see whether it is valid for this pointer to be NULL. At least it is checked in other functions like iwl_mvm_bt_coex_notif_handle(). So I would suggest simply adding another NULL-pointer check to the existing if-statement in iwl_mvm_bt_notif_per_link(). And I would add the same to iwl_get_coex_type() as well.
On the other hand, this pointer becoming NULL might have a deeper meaning and point to another problem. But even then, it could make sense to add an assertion after fixing the underlying problem.
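
The extra check suggested above could look roughly like the following. This is only a minimal, self-contained sketch: the sketch_* types and names are hypothetical stand-ins for the driver structures, the condition is simplified, and it is not the attached patch or the actual code in coex.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver types; only the fields used here. */
enum sketch_band { SKETCH_BAND_2GHZ, SKETCH_BAND_5GHZ };

struct sketch_channel { enum sketch_band band; };
struct sketch_chan_def { struct sketch_channel *chan; };
struct sketch_chanctx_conf { struct sketch_chan_def def; };

/*
 * Returns true when BT coex handling should be skipped for a link:
 * no channel context, no channel assigned, or a non-2.4 GHz channel.
 * The middle test is the additional NULL-pointer check; without it,
 * the band comparison dereferences def.chan unconditionally.
 */
static bool
sketch_coex_should_skip(const struct sketch_chanctx_conf *conf)
{
	return (conf == NULL ||
	    conf->def.chan == NULL ||
	    conf->def.chan->band != SKETCH_BAND_2GHZ);
}
```

Because || short-circuits, the band is only read once both pointers are known to be non-NULL. A check like this guards against NULL only; it does not help if the pointer holds some other invalid value.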
Comment 11 Robin Haberkorn 2025-01-26 20:15:34 UTC
Created attachment 257016 [details]
Crash dumps from 14.2-RELEASE

Crash dumps without vmcore. Created on
FreeBSD thinkpad-x270 14.2-RELEASE FreeBSD 14.2-RELEASE releng/14.2-n269506-c8918d6c7412 GENERIC amd64
Comment 12 Robin Haberkorn 2025-01-26 20:18:39 UTC
Well actually, this might already give you some vital information, so here's the dump of chanctx_conf:

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff80b3d797 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:523
#3  0xffffffff80b3dc6e in vpanic (fmt=0xffffffff8117704f "%s", ap=ap@entry=0xfffffe00c2937ae0) at /usr/src/sys/kern/kern_shutdown.c:967
#4  0xffffffff80b3dac3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:891
#5  0xffffffff81025a0b in trap_fatal (frame=0xfffffe00c2937bc0, eva=12) at /usr/src/sys/amd64/amd64/trap.c:952
#6  0xffffffff81025a56 in trap_pfault (frame=<unavailable>, usermode=false, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:760
#7  <signal handler called>
#8  0xffffffff83c68076 in iwl_mvm_bt_notif_per_link (mvm=0xfffffe00f17eb508, vif=0xfffffe00f4c24e80, data=0xfffffe00c2937d40, link_id=0) at /usr/src/sys/contrib/dev/iwlwifi/mvm/coex.c:292
#9  iwl_mvm_bt_notif_iterator (_data=_data@entry=0xfffffe00c2937d40, mac=<optimized out>, vif=vif@entry=0xfffffe00f4c24e80) at /usr/src/sys/contrib/dev/iwlwifi/mvm/coex.c:442
#10 0xffffffff80da3234 in linuxkpi_ieee80211_iterate_interfaces (hw=<optimized out>, flags=flags@entry=(IEEE80211_IFACE_ITER_NORMAL | IEEE80211_IFACE_ITER_ACTIVE | IEEE80211_IFACE_ITER__ATOMIC),
    iterfunc=0xffffffff83c67fd0 <iwl_mvm_bt_notif_iterator>, arg=arg@entry=0xfffffe00c2937d40) at /usr/src/sys/compat/linuxkpi/common/src/linux_80211.c:4865
#11 0xffffffff83c6789c in ieee80211_iterate_active_interfaces_atomic (hw=0xfffffe00c2937d40, flags=(IEEE80211_IFACE_ITER_NORMAL | IEEE80211_IFACE_ITER_ACTIVE | IEEE80211_IFACE_ITER__ATOMIC),
    arg=0xfffffe00c2937d40, iterfunc=<optimized out>) at /usr/src/sys/compat/linuxkpi/common/include/net/mac80211.h:1468
#12 iwl_mvm_bt_coex_notif_handle (mvm=0xfffffe00f17eb508) at /usr/src/sys/contrib/dev/iwlwifi/mvm/coex.c:459
#13 0xffffffff83c7eb7b in iwl_mvm_async_handlers_wk (wk=<optimized out>) at /usr/src/sys/contrib/dev/iwlwifi/mvm/ops.c:1563
#14 0xffffffff80dc3064 in linux_work_fn (context=0xfffffe00f17eb588, pending=<optimized out>) at /usr/src/sys/compat/linuxkpi/common/src/linux_work.c:301
#15 0xffffffff80ba0d82 in taskqueue_run_locked (queue=queue@entry=0xfffff800048f3500) at /usr/src/sys/kern/subr_taskqueue.c:518
#16 0xffffffff80ba2002 in taskqueue_thread_loop (arg=arg@entry=0xfffff800048322c0) at /usr/src/sys/kern/subr_taskqueue.c:830
#17 0xffffffff80af760f in fork_exit (callout=0xffffffff80ba1f40 <taskqueue_thread_loop>, arg=0xfffff800048322c0, frame=0xfffffe00c2937f40) at /usr/src/sys/kern/kern_fork.c:1164
#18 <signal handler called>
(kgdb) f 8
#8  0xffffffff83c68076 in iwl_mvm_bt_notif_per_link (mvm=0xfffffe00f17eb508, vif=0xfffffe00f4c24e80, data=0xfffffe00c2937d40, link_id=0) at /usr/src/sys/contrib/dev/iwlwifi/mvm/coex.c:292
292                  chanctx_conf->def.chan->band != NL80211_BAND_2GHZ)) {
(kgdb) p chanctx_conf
$1 = (struct ieee80211_chanctx_conf *) 0xfffff8004aeebd40
(kgdb) p *chanctx_conf
$2 = {def = {chan = 0x0, width = NL80211_CHAN_WIDTH_20_NOHT, center_freq1 = 0, center_freq2 = 0, punctured = 0}, min_def = {chan = 0xfffff80048feb538, width = 82386608, center_freq1 = 4294965248,
    center_freq2 = 0, punctured = 0}, ap = {chan = 0x0, width = NL80211_CHAN_WIDTH_20_NOHT, center_freq1 = 0, center_freq2 = 0, punctured = 0}, rx_chains_dynamic = 1 '\001', rx_chains_static = 1 '\001',
  radar_enabled = false, drv_priv = 0xfffff8004aeebdc0 ""}
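
For what it's worth, the eva=12 in frame #5 (the same as the fault virtual address 0xc in the original report) is what one would expect from reading a member located 12 bytes into a structure through a NULL pointer. The sketch below only illustrates that arithmetic. The struct layout is an assumption made up for the example (three 4-byte members ahead of band), not a copy of struct linuxkpi_ieee80211_channel.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * HYPOTHETICAL layout: three 4-byte members ahead of `band`, chosen so
 * that `band` lands at offset 12. The real structure may differ.
 */
struct sketch_channel {
	uint32_t hw_value;
	uint32_t center_freq;
	uint32_t flags;
	uint32_t band;
};

/*
 * Address the CPU would try to read for chan->band, computed without
 * dereferencing anything.
 */
static uintptr_t
sketch_band_fault_address(uintptr_t chan)
{
	return (chan + offsetof(struct sketch_channel, band));
}
```

With chan == 0 this yields 0xc under the assumed layout, so a small fault address like this one points at a NULL structure pointer rather than at a wild pointer.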
Comment 13 Robin Haberkorn 2025-01-28 23:31:24 UTC
(In reply to Robin Haberkorn from comment #10)

That crash just repeated - I have the same chanctx_conf->def.chan == NULL.
Interestingly, this happened the moment I power-cycled the wifi router I was connected to.
I will test whether this crash is reproducible.
Comment 14 Robin Haberkorn 2025-01-29 10:54:01 UTC
(In reply to Robin Haberkorn from comment #13)

I can reliably reproduce these crashes by power-cycling my router while there is some traffic on the network (e.g. a YouTube video playing). Having a bluetooth device paired is not necessary to reproduce it.

This is of course intolerable. I only recently switched from if_iwm to if_iwlwifi because of lags in delivery of bluetooth messages to userspace when having both wifi and bluetooth traffic at the same time. Perhaps I will try to patch my iwlwifi driver in the aforementioned way and see if it helps.
Comment 15 Bjoern A. Zeeb freebsd_committer freebsd_triage 2025-01-29 15:17:49 UTC
(In reply to Robin Haberkorn from comment #14)

Hi Robin,

thanks a lot for going the full way of finding a reproducer.
That should also make it easier for me to look into it, provided it reproduces equally well in my setup.

I've got this on the radar, but there are only so many fish I can fry in a day ;-)  If you find that patching your iwlwifi with extra NULL pointer checks helps, that will be good.  It might still be that it just makes it harder to hit a race against something else, but in the end everything will help at the moment.

If you can panic this reliably and have the opportunity to debug further, it would be interesting to know what happens on the other CPUs at that time, so we could identify the case we are racing against.
Comment 16 Robin Haberkorn 2025-01-30 01:37:17 UTC
Created attachment 257106 [details]
Proposed fix

This patch just adds the NULL-pointer checks I discussed earlier. It does fix the crashes that I could otherwise reproduce reliably.
Comment 17 Robin Haberkorn 2025-02-04 18:02:15 UTC
Well, except that it doesn't, because it's obviously a memory corruption issue:

(kgdb) f 8
#8  iwl_mvm_bt_notif_per_link (mvm=0xfffffe00f1bb2508, vif=0xfffffe00f33cde80, data=0xfffffe00c294bd40, link_id=0) at /usr/src/sys/contrib/dev/iwlwifi/mvm/coex.c:292
292                  chanctx_conf->def.chan->band != NL80211_BAND_2GHZ)) {
(kgdb) p chanctx_conf->def.chan
$1 = (struct linuxkpi_ieee80211_channel *) 0x8086b85b
(kgdb) p *chanctx_conf->def.chan
Cannot access memory at address 0x8086b85b

It only sometimes results in unexpected NULL pointers.
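
To illustrate why a NULL check cannot catch a value like 0x8086b85b: it is non-zero, yet it is nowhere near the kernel addresses seen elsewhere in these dumps (0xfffff8..., 0xfffffe...). The sketch below assumes amd64 with 48-bit virtual addresses, where kernel addresses lie in the upper canonical half. The constant and the sketch_* helpers are hypothetical and written for this example, not taken from the FreeBSD headers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * ASSUMPTION: amd64 with 48-bit virtual addresses; the upper canonical
 * half starts here. Illustrative constant, not a FreeBSD definition.
 */
#define SKETCH_UPPER_HALF_START UINT64_C(0xffff800000000000)

/* What a plain NULL-pointer check tests. */
static bool
sketch_passes_null_check(uint64_t ptr)
{
	return (ptr != 0);
}

/* Whether the value could be a kernel address at all under the assumption. */
static bool
sketch_in_upper_half(uint64_t ptr)
{
	return (ptr >= SKETCH_UPPER_HALF_START);
}
```

Here 0x8086b85b passes the NULL check but fails the range test, while chanctx_conf itself (0xfffff8004aeebd40 in the earlier dump) passes both. A range test like this is only a debugging aid for spotting garbage values, not a fix for whatever overwrites the pointer.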

bz@FreeBSD.org, perhaps it would be worth pulling in upstream patches? The last update was in May of 2023. Actually, considering that this driver is supposed to be synced with the Linux kernel version, shouldn't that be the first thing we try?
Comment 18 Bjoern A. Zeeb freebsd_committer freebsd_triage 2025-02-05 00:12:39 UTC
(In reply to Robin Haberkorn from comment #17)

I assume that means you are on stable/14?  main has newer versions than that, and they, along with a lot of other bits, are on my MFC roadmap for the next week.

I believe there is another problem with this looking at the code -- at least in main -- and that may be related to LinuxKPI bits not playing 100% by Linux rules.  I've recently started to fix this for some other bits and will look into it for this one as well, but I first need to flush a good part of the stack I have piled up.  Bear with me for a few days.
Comment 19 Robin Haberkorn 2025-02-05 06:40:37 UTC
(In reply to Bjoern A. Zeeb from comment #18)

I was on releng/14.2, which should be closest to the v14.2 release. But compared to stable/14, it's only missing the "script to extract firmware details".
Compared to main, there are quite a few things missing, but as far as I can see, there still was no sync with the upstream sources.