Bug 263613 - iwlwifi0 on framework laptop crashes (kernel panic) when setting IP
Summary: iwlwifi0 on framework laptop crashes (kernel panic) when setting IP
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Bjoern A. Zeeb
URL:
Keywords: crash, needs-qa
Depends on:
Blocks: frameworklaptop iwlwifi
 
Reported: 2022-04-27 18:24 UTC by Jonathan Vasquez
Modified: 2024-02-19 16:42 UTC
CC List: 9 users

See Also:
bz: mfc-stable14+
bz: mfc-stable13+


Attachments
dmesg (78.02 KB, text/plain)
2022-04-27 18:24 UTC, Jonathan Vasquez
ifconfig (849 bytes, text/plain)
2022-04-27 18:25 UTC, Jonathan Vasquez
rc.conf (208 bytes, text/plain)
2022-04-27 18:25 UTC, Jonathan Vasquez
uname -a (170 bytes, text/plain)
2022-04-27 18:25 UTC, Jonathan Vasquez
wpa_supplicant.conf (74 bytes, text/plain)
2022-04-27 18:26 UTC, Jonathan Vasquez
camera photo of crash (934.98 KB, image/jpeg)
2022-04-27 18:29 UTC, Jonathan Vasquez
rtwn0 external usb 2.0 wifi adapter info (1.43 KB, text/plain)
2022-04-27 18:41 UTC, Jonathan Vasquez
stable-13-n252783-ef2aa775301-20221018 (3.56 KB, text/plain)
2022-10-18 18:27 UTC, Jonathan Vasquez
wpa_supplicant and dhcp static lease (5.79 KB, text/plain)
2022-10-21 00:51 UTC, Jonathan Vasquez
wpa_supplicant and dhcp static lease Part 2 (18.56 KB, text/plain)
2022-10-21 01:22 UTC, Jonathan Vasquez
dmesg (DHCPOFFER on loop) (13.88 KB, text/plain)
2022-10-25 02:26 UTC, Jonathan Vasquez
DHCPOFFER on loop - Picture (799.25 KB, image/jpeg)
2022-10-25 02:28 UTC, Jonathan Vasquez
core.txt.1 (setting ip) (248.59 KB, application/x-troff-man)
2022-11-04 19:11 UTC, Jonathan Vasquez

Description Jonathan Vasquez 2022-04-27 18:24:58 UTC
Created attachment 233540 [details]
dmesg

Hey all,

I've spent a few days reading _a lot_ of documentation and experimenting with getting FreeBSD running on my Framework Laptop on 13.0-RELEASE, 13.1-RC4, and 14-CURRENT (just to see how far along each branch is; originally I wanted to just run 13.1 since that's the next RELEASE, but I was getting too many issues and crashes with it. I've been avoiding jumping into STABLE/CURRENT, but c'est la vie haha). I've definitely noticed a bunch of issues that either already got fixed or still remain (but I have yet to gather exact repro scenarios). However, for this particular issue (and I'm not sure if it's 100% related but I think it might be), the Intel AX210 card on the Framework Laptop seems to associate with my AP properly, but no DHCPOFFERs are received. I think there was only one time a few days ago, on an older build of CURRENT, that I was able to get an IP, but I haven't been able to replicate that working state.

Also, this is on the sources for 14-CURRENT (for both world and kernel) as of today, 2022-04-27. I'm motivated to help any of the devs test any code needed to get not only this issue fixed, but really anything regarding the Framework Laptop working as an actual desktop productivity machine. I'm not a FreeBSD pro, but I have been using it on my home server for the past ~2 years. So apologies if there are any particular concepts or things I don't yet know; feel free to point me in the right direction though :).

I've attached various parts of my system config and dmesg output to help debug this.

I've noticed that I sometimes get the following errors as well (I've also looked at what the Linux community said about those errors and it seems they were fixed in 5.13):

iwlwifi0: iwl_trans_send_cmd bad_state = 0
iwlwifi0: Failed to remove MAC context: -5

repro commands:

Given the above situation where the system started up the wlan0 interface (parent: iwlwifi0) and was associated to the AP but did not receive any DHCP offers:

1. ifconfig wlan0 down
2. ifconfig wlan0 192.168.1.105/24 netmask 255.255.255.0

> The command will trigger, and a few seconds later it will hard crash. I've attached a camera shot of this. But something relevant may be the following:

iwlwifi0: Failed to send binding (action:1): -5
iwlwifi0: PHY ctxt cmd error. ret=-5
iwlwifi0: lkpi_iv_newstate: error -5 during state transition 1 (SCAN) -> 2 (AUTH)
iwlwifi0: No queue was found. Dropping TX
iwlwifi0: Failed to trigger RX queues sync (-5)

...

panic: lkpi_sta_auth_to_scan: lsta 0xfffff8012f0c5800 state not NONE: 0, nstate 1 arg 1

---

Overall, this laptop seems to have slowly gotten better support so that's definitely encouraging and I also read the recent news about FreeBSD and the work going on to improve Framework Laptop support. The wifi issues caused hard reboots on 13.1-RC4 current (It would just shut off, I believe the "firmware dying in a fire" that Kyle Evans mentioned here: https://lists.freebsd.org/archives/freebsd-wireless/2021-October/000099.html might have been what he meant by that). This may or may not be the same bug that I'm experiencing here, but maybe since I'm running -CURRENT now with the default debug configuration, "hard crash reboots" turn into "hard crash - do not reboot" situations? 

Anyways, let me know if there is anything I can do to further help debug.
Comment 1 Jonathan Vasquez 2022-04-27 18:25:23 UTC
Created attachment 233541 [details]
ifconfig
Comment 2 Jonathan Vasquez 2022-04-27 18:25:42 UTC
Created attachment 233542 [details]
rc.conf
Comment 3 Jonathan Vasquez 2022-04-27 18:25:55 UTC
Created attachment 233543 [details]
uname -a
Comment 4 Jonathan Vasquez 2022-04-27 18:26:11 UTC
Created attachment 233544 [details]
wpa_supplicant.conf
Comment 5 Jonathan Vasquez 2022-04-27 18:29:04 UTC
Created attachment 233545 [details]
camera photo of crash
Comment 6 Jonathan Vasquez 2022-04-27 18:41:54 UTC
Created attachment 233548 [details]
rtwn0 external usb 2.0 wifi adapter info

I just tested an external USB 2.0 Wifi adapter that I had and it is detected by FreeBSD, lights up, and is able to associate, but didn't receive an IP. This could indicate that maybe I've just misconfigured something somewhere (I did follow all of the Advanced Wireless Configuration steps in the handbook) when it comes to the IP assignment part. However, the hard crash that I detected is still a bug. Attaching output for the 'rtwn0' wifi adapter.
Comment 7 Chris Hutchinson 2022-04-28 00:46:15 UTC
(In reply to Jonathan Vasquez from comment #2)
Not sure if it matters. But in my case. I need to blacklist
if_iwm(4):
devmatch_blacklist="if_iwm.ko"
then I add if_iwlwifi to my kld_list.
So
wlans_iwlwifi0="wlan0"
ifconfig_wlan0="WPA SYNCDHCP"
devmatch_blacklist="if_iwm.ko"
kld_list="if_iwlwifi"

Maybe this might help?
I don't crash. :-)
Comment 8 Jonathan Vasquez 2022-04-28 01:44:10 UTC
Hey Chris,

Thanks for that :). So I've been playing around with the system for the whole day today trying to understand the behavior of this machine in conjunction with FreeBSD and its drivers. There are just too many variables to explain everything, but the good news is that with Chris' suggestion, DHCP now works on the iwlwifi0 adapter. DHCP doesn't work on the rtwn0 device though (I'm testing both to further test the interaction of the drivers and the hardware). All the devices were working on Linux the last time I tested them. Static IP assignment still crashes the iwlwifi0 device, and I was able to get static IP working on the rtwn0 device, but it doesn't consistently work across -immediate- reboots (I haven't tested colder scenarios since, as I said, I was playing around with it for the whole day lol). Sometimes it takes 15 seconds for it to work, sometimes it takes 45+ seconds, and sometimes it just doesn't work.

There were also some other settings I was missing during static ip assignment, specifically setting the: defaultrouter="<ip>" field. I was able to test this through /etc/rc.conf + service netif restart, and I also tried to manually construct the device and see if it worked. It basically yielded the above results.

Example:

ifconfig wlan0 create wlandev rtwn0
ifconfig wlan0 up
ifconfig wlan0 scan (Shows the AP)

wpa_supplicant -i wlan0 -c /etc/wpa_supplicant.conf

----

On another terminal (tmux):

ifconfig wlan0 inet 192.168.1.101 netmask 255.255.255.0

route add default 192.168.1.1
echo "nameserver 192.168.1.1" > /etc/resolv.conf

ping 192.168.1.1 (No response, non-deterministic as above results described)

netstat -rn and arp -a showed what you would expect in a correct working network (like the default gateway being set to 192.168.1.1 and the iface it used as wlan0).

---

So yea, lots of different interactions here. I also noticed that if I left the ifconfig_ue0="DHCP" line in the file, sometimes I wouldn't get a DHCP request on the wireless card side.. but many hours later I tried leaving both of them enabled and it was still working fine.. so it could have just been some weirdness. I even rebooted my router just to make sure (even though everything was working, including multiple other machines connected to the same router).

this is my current /etc/rc.conf for further info

root@leslie:~ # cat /etc/rc.conf
hostname="leslie"
dumpdev="NO"
zfs_enable="YES"

kld_list="i915kms if_iwlwifi"
devmatch_blacklist="if_iwm"

ifconfig_ue0="DHCP"

wlans_iwlwifi0="wlan0"
ifconfig_wlan0="WPA SYNCDHCP"

create_args_wlan0="regdomain FCC country US"

dbus_enable="YES"

##### Other

#wlans_rtwn0="wlan0"
#ifconfig_wlan0="inet 192.168.1.101 netmask 255.255.255.0 ssid Summerland WPA"
#ifconfig_wlan0="WPA SYNCDHCP"
#defaultrouter="192.168.1.1"

Either way, at least we were able to find another bug (the static ip assignment on the iwlwifi0 card causes a crash).
Comment 9 Bjoern A. Zeeb freebsd_committer freebsd_triage 2022-04-28 06:51:58 UTC
(1) For the AX210 iwm(4) blocklisting should make zero difference as the IDs won't be in the driver and so it'll not probe or try to attach.  Something else funky is going on and it highly smells like "timing".

(2) The ifconfig wlan0 down is currently known to leave state behind on iwlwifi which will then in the follow-up result in the SW crash;  I was hoping I could fix that while I was on the road but will try to do so when I am back in the office next week.

(3) I'd highly appreciate if the bug report could not be convoluted with too many different issues as it'll be hard for me to follow otherwise.

(4) If DHCP is an issue ifconfig list scan showing the AP doesn't mean you are associated.  You need to check for the "status: associated" in ifconfig wlan0 or otherwise possibly in wpa.  Likewise for manual configuration.
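
Point (4) can be checked mechanically. The following is a minimal sketch and not part of the original report: `is_associated` is a hypothetical helper name, and it only assumes what the comment states, namely that `ifconfig wlan0` prints a "status: associated" line once the station has actually joined the AP.

```sh
#!/bin/sh
# Hypothetical helper: reads ifconfig output on stdin and succeeds only
# if it contains a "status: associated" line.
is_associated() {
    grep -q 'status: associated'
}

# Example use on a live system (not executed here):
#   ifconfig wlan0 | is_associated && echo "wlan0 is associated"
```

A scan result listing the AP would not satisfy this check, which is the distinction the comment draws.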
Comment 10 Jonathan Vasquez 2022-04-28 13:36:03 UTC
Hey Bjoern,

Thanks for taking a look into this.

1. Yup that may be the case.

2. Sounds good.

3. I only posted particular pieces of information (all networking related) that I thought may affect this particular PR. So it's done primarily for the sake of completeness to make sure everything is accounted for. I know the FreeBSD community is highly RTFM so I wanted to make sure I covered as much as I could regarding the networking situation.

4. Yup I'm aware of that. My posts mentioned that ifconfig wlan0 scan displayed the ssid since previous instructions I've read mentioned to check if your wifi adapter even displays the ssid in the first place (or any wifi networks at all). If you check my ifconfig output, it will display that it is in fact associated.

Let me know if there is anything else you want me to test out in the meantime that can further assist you.

- Jonathan
Comment 11 Jonathan Vasquez 2022-04-28 18:48:50 UTC
I just retested again without having the devmatch_blacklist="if_iwm" and also without "if_iwlwifi" in kld_list. It seems to have no effect now (as you said Bjoern). I noticed that 'if_iwlwifi' gets autoloaded a bit before the firmware is used either way, so that line probably doesn't have much effect in this case.
Comment 12 Jonathan Vasquez 2022-05-04 03:59:23 UTC
This is probably related: I noticed that if I did multiple reboots in sequence, the ability of the wifi card to re-establish a connection diminished. I'm guessing this may be about state, not OS state but firmware state; there may be some sort of SRAM chip in there that maintains the previous state across multiple subsequent reboots. This may explain why, if I wait a bit (I kinda have to, right lol), it is eventually able to connect again.
Comment 13 Bjoern A. Zeeb freebsd_committer freebsd_triage 2022-08-07 21:29:41 UTC
(In reply to Jonathan Vasquez from comment #12)

Do you have the ability to check what your AP is thinking?
Comment 14 Bjoern A. Zeeb freebsd_committer freebsd_triage 2022-08-07 21:38:11 UTC
(In reply to Bjoern A. Zeeb from comment #13)

Also reading back to the beginning of the PR, this highly sounds like a wpa_supplicant issue which was fixed a few weeks ago.

In addition LinuxKPI for iwlwifi has moved forward since the original opening.
Can you update the PR with the latest status?

I am very sorry, I apparently hadn't assigned the PR to myself and completely missed it later on.
Comment 15 Gleb Popov freebsd_committer freebsd_triage 2022-09-16 05:48:42 UTC
My crash on the recent CURRENT with Intel(R) Dual Band Wireless AC 7265, which seems highly related:

http://arrowd.name/iwlwifi1.jpg
http://arrowd.name/iwlwifi2.jpg

The crash happens on "service netif start", presumably when the IP address is being set.
Comment 16 Bjoern A. Zeeb freebsd_committer freebsd_triage 2022-09-16 16:52:43 UTC
(In reply to Gleb Popov from comment #15)

Do not service netif restart wlan0 for the moment.
That'll destroy and re-create your wlan0 interface, and while we tear down the state and build it up, the firmware does not seem to like that.

Please try ifconfig wlan0 down && sleep 1 && ifconfig wlan0 up  if you have to.
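
The workaround in this comment can be wrapped in a small function. This is only a sketch: `bounce_wlan` and the `RUN` variable are hypothetical names introduced here for illustration, and the one-second pause is taken directly from the command quoted above. Setting `RUN=echo` prints the commands instead of executing them, which is useful on a machine where toggling the interface might still panic.

```sh
#!/bin/sh
# Hypothetical wrapper around the down / sleep / up sequence quoted above.
# Set RUN=echo to print the commands instead of executing them.
bounce_wlan() {
    ifn="${1:-wlan0}"
    ${RUN:-} ifconfig "$ifn" down &&
        ${RUN:-} sleep 1 &&
        ${RUN:-} ifconfig "$ifn" up
}

# Example use (not executed here):
#   RUN=echo; bounce_wlan wlan0    # dry run
#   RUN=;     bounce_wlan wlan0    # really toggle the interface
```

Unlike `service netif restart wlan0`, this keeps the existing wlan0 interface rather than destroying and re-creating it.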
Comment 17 Jonathan Vasquez 2022-10-18 14:42:56 UTC
No prob Bjoern :). I'm currently on 13-STABLE (stable/13-n252739-af335a43669) but am currently building the latest 13-STABLE as of (ef2aa77530127f) which includes your recent wireless changes. I'll report back once I have this built.

As for my ability to check what my AP is thinking, I'm not necessarily sure how to do that, but I am running DD-WRT on my Linksys WRT3200ACM if that helps.
Comment 18 Jonathan Vasquez 2022-10-18 18:27:27 UTC
Created attachment 237438 [details]
stable-13-n252783-ef2aa775301-20221018

Hey Bjoern,

I finished testing again on FreeBSD 13.1-STABLE #0 stable/13-n252783-ef2aa775301. I was able to associate the wireless card with my AP, but it failed to get a DHCP request (this worked before but was very flaky; this time nothing though). There are no MAC address conflicts on my router, and my router is working fine and delegating DHCP addresses properly. I can also see all of the DHCP clients connected to my AP from the DD-WRT admin menu, and I can see that the MAC address for my wireless card is indeed connected and associated with the AP. I did notice that in the "Info" column under the "Wireless Nodes" section of DD-WRT, the iwlwifi card is being reported as "LEGACY", whereas the other wireless nodes are either "VHT20SGI" or "HT20".

I also tried to re-test assigning a static IP to the card and see what happens in that case. The same thing as before occurred where immediately upon restarting the card, it crashed the entire system. A subsequent reboot automatically crashed the system on boot as well (since the direct ip assignments were still in /etc/rc.conf), and lastly doing another subsequent reboot (so two reboots back to back) allowed the system to boot up properly and assign the IP to the card, but it failed to associate to my AP (it actually associated with another random AP in the area since I didn't specify "WPA" in the config and thus it didn't limit it to the AP defined in /etc/wpa_supplicant.conf).

I've attached some of the relevant info from various parts of the system in the attached document (stable-13-n252783-ef2aa775301-20221018.txt).

- Jonathan
Comment 19 Jonathan Vasquez 2022-10-21 00:51:16 UTC
Created attachment 237491 [details]
wpa_supplicant and dhcp static lease

Hey Bjoern,

Good news. After some time experimenting with the wireless card and the AP, I was able to try something "clever" in order to get an IP from the router.

After some experiments, I continued to notice that the router didn't want to give a DHCPOFFER to the wireless card, even though the router has no issues giving the card an IP (the same 192.168.1.136 address) when using wifibox on FreeBSD (which is using iwlwifi directly on Linux via bhyve's PCI passthrough). We all know this workaround. That made me think that maybe if I can just tell the router's DHCP server to always give this IP address to X MAC address, then maybe there -will- be a DHCPOFFER available.. maybe the router is getting confused because the FreeBSD iwlwifi driver is communicating with it in a way that it doesn't understand. Fast forward: after adding my wlan card's MAC address directly to my router's DHCP server's Static Leases table, I restarted the wlan0 interface on FreeBSD, re-associated with wpa_supplicant (it happened to be on the 5G freq at this stage but earlier it was on 2.4, since I have identical SSIDs for both freqs and the AP/card can decide what's best for them), killed any old dhclients, re-ran dhclient wlan0, and it got the static IP offer immediately over DHCP. This means I can start testing this card on a more day-to-day basis now.

This obviously opens questions as to why the AP/card failed to exchange a proper DHCP IP (there are plenty of IP slots available in my DHCP range). Originally I was thinking maybe there was a conflict with some old lease file in /var/db/dhclient.*.*, given that I'm switching between wifibox and the native driver on the same device. But this didn't have an effect.

I've attached the logs from my debugging and successful connection. I'll be continuing to experiment with some wpa_supplicant settings and doing some more reboots to test the performance of the reassociation and make sure that the driver can continue to reconnect. I also want to remove the DHCP static lease and see if it "works" automatically afterwards.
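
The manual DHCP retry described in this comment (kill any old dhclient, then re-run `dhclient wlan0`) can be sketched as below. This is an illustration under assumptions, not the reporter's actual script: `renew_lease` and `RUN` are hypothetical names, and the lease file path `/var/db/dhclient.leases.<ifname>` is assumed from dhclient's defaults rather than taken from the report, which only mentions `/var/db/dhclient.*.*`. Note that `pkill dhclient` stops every dhclient instance, including one running for ue0.

```sh
#!/bin/sh
# Hypothetical helper sketching the manual retry from this comment.
# Set RUN=echo to print the commands instead of executing them.
renew_lease() {
    ifn="${1:-wlan0}"
    # Stop any dhclient left over from an earlier attempt.
    ${RUN:-} pkill dhclient
    # Assumed lease file location; drop the old lease for this interface.
    ${RUN:-} rm -f "/var/db/dhclient.leases.$ifn"
    # Ask for a fresh lease.
    ${RUN:-} dhclient "$ifn"
}

# Example use (not executed here):
#   RUN=echo; renew_lease wlan0    # dry run
```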
Comment 20 Jonathan Vasquez 2022-10-21 01:22:21 UTC
Created attachment 237492 [details]
wpa_supplicant and dhcp static lease Part 2

Just finished my testing. More good news.

1. I was able to wipe out any DHCP configs on my laptop (/var/db/dhclient*) and attempt to cleanly re-retrieve the static lease from the AP. This worked fine and I was given the 192.168.1.140 static address.

2. After this I wiped out the wireless leases again from the laptop and removed the 192.168.1.140 static lease from the router. I rebooted the router afterwards. IIRC I also rebooted the laptop. Once rebooted, I attempted to retrieve a DHCP address again. This actually worked this time, and to make things more interesting, it re-gave me the old IP address that it was giving the wifibox bhyve VM, which was 192.168.1.136. So this definitely means that the router did remember this MAC address within its DB somewhere, but it didn't want to give it to me before.

3. I wiped out the configs again and restored my original wifi settings:

(same /etc/wpa_supplicant.conf)

re-enabled the `ifconfig_wlan0="WPA SYNCDHCP"` line so that upon reboot it will automatically start wpa_supplicant, associate, start dhclient, and request an IP.

After that I rebooted the machine, and the system booted up perfectly fine and retrieved the 192.168.1.136 offer again and was able to connect to the internet.

I remember a few months ago I went to a friend's house and their AP did not want to give me an IP address, (I believe it did associate though), so I'm wondering if this was the same bug that was affecting this..
Comment 21 Jonathan Vasquez 2022-10-25 02:26:40 UTC
Created attachment 237594 [details]
dmesg (DHCPOFFER on loop)

Some minor updates:

1. Sometimes the card takes a while to get a DHCPOFFER, or stays in the DHCPREQUEST stage for a while. Eventually it seems to get it though (Maybe after many minutes).

2. Today the card decided to go a little crazy when it was booting up (It was off for more than 12 hours). The card did receive a DHCPOFFER request, but then continuously seemed to not accept it and it continued to receive a DHCPOFFER request. I've attached a picture and the dmesg output.

After I hit Ctrl+C a bunch of times at boot, I eventually got to the login prompt. I logged in and ran `service netif restart wlan0`; after that it connected and got an IP. Seems I had to kick it once (maybe to clear some buggy state) haha.
Comment 22 Jonathan Vasquez 2022-10-25 02:28:04 UTC
Created attachment 237595 [details]
DHCPOFFER on loop - Picture
Comment 23 Jonathan Vasquez 2022-11-04 19:11:17 UTC
Created attachment 237862 [details]
core.txt.1 (setting ip)

Good news! I was finally able to fix my issue with saving crash dumps and I now have a dump to show. (I was using encrypted swap, with the .eli extension in /etc/fstab, and that was preventing my cores from being saved. For whatever reason I thought the system was "smart enough" to extract the core before re-encrypting that swap partition on boot.) Anyways, I've attached the extract as core.txt.1 (setting ip). These are the network-related settings in /etc/rc.conf when it crashes:

wlans_iwlwifi0="wlan0"
#ifconfig_wlan0="WPA SYNCDHCP"
ifconfig_wlan0="WPA inet 192.168.1.135/24"
create_args_wlan0="country US regdomain FCC"
defaultroute_delay="0"

I'm not familiar with kernel debugging, but since I got crash dumps and gdb working, I'll see if I can poke around and see what happens. I do know a bit of basic "pdb (for python development)" and I know that pdb was supposed to be inspired by gdb, so maybe it will help haha.
Comment 24 Peter Much 2022-11-21 14:29:38 UTC
Working myself thru this material, I come to the conclusion that my bug #266887 might actually be a duplicate of this.

From the initial description here, the basic reproducible behaviour is to
  ifconfig down; ifconfig up
(after the iface was already up).

And this is exactly what I found as my core issue:
Doing a sequence of ifconfig up/down/up will always crash the system, either immediately or after a few seconds.

It doesn't matter what is done with the interface, if there is a connection established or dhcp working, or whatever. We can do it in single-user with no networking at all, and get the same result:
  kldload if_iwlwifi
  ifconfig wlan0 create wlandev iwlwifi0
  ifconfig wlan0 up
  ifconfig wlan0 down
  ifconfig wlan0 up -> *KAPUTT*

Looking into bug #267029, that also talks about *re*establishing a network - so the underlying cause might also be this same sequence of up/down/up. Possibly some others, too.

As long as the interface is just brought up and used, it works - which may be misleading to people, because there are system programs (e.g. devd) that may bring the interface down and up on certain conditions, and getting a crash after that might come as a surprise.

@Bjoern: in comment 9 you talk about this being a known issue that will be fixed.
That was half a year ago, and my experience is this being still the basic issue (in STABLE-13). What's the state now?
Comment 25 Jonathan Vasquez 2022-11-21 14:45:02 UTC
I'm running latest stable/13 and it still happens and should be the same in main as well (14 current).
Comment 26 Peter Much 2022-11-21 15:02:46 UTC
(In reply to Jonathan Vasquez from comment #23)

Thank You, Jonathan. Looking into Your attachment, I see this:

#10 __mtx_lock_sleep (c=0xfffffe0156b201b0, v=<optimized out>)
    at /usr/src/sys/kern/kern_mutex.c:594
#11 0xffffffff80d88d43 in psq_drain (psq=0xfffffe0156b20198)
    at /usr/src/sys/net80211/ieee80211_power.c:187
#12 ieee80211_node_psq_drain (ni=ni@entry=0xfffffe0156b19000)
    at /usr/src/sys/net80211/ieee80211_power.c:214

And this is exactly where I am currently looking into, because my kernel crashes just there. (I'm on a Fujitsu A3511 with AX201, but this here looks very much identical to my issue.)

Background: since ifconfig up/down/up does crash here, I resorted to do kldunload/load instead. This didn't work, and I had to fix issue #267869 first.
And now, when doing kldunload, I get this crash in mtx_lock() from ieee80211_node_psq_drain() - not always, but often.
Comment 27 Bjoern A. Zeeb freebsd_committer freebsd_triage 2022-11-21 15:21:59 UTC
Hi,

the backtrace is interesting as the ni still seems to be valid.

I am currently trying to hunt down what looks like a node reference count problem as I get a 0xdeadc0dedeadcXXX situation at times which indicates that the node was freed and still being used.
I have some extra local [wlan]debug code to give extra "landmarks";  I'll try to put it into main and follow-up here with what to run.
Comment 28 Jonathan Vasquez 2022-11-21 15:45:47 UTC
Thanks Bjoern ;).
Comment 29 Peter Much 2022-11-21 16:18:45 UTC
Bjoern, the lock seems to have a problem.

I have no idea what we are doing here, but I can code K&R C and I can learn. So I spoiled my kernel with printf(). Specifically, I put one at the entry of psq_drain():

        printf("entering psq_drain psq=%lld, lock=%lu\n",
                         (long long)psq, psq->psq_lock.mtx_lock);

Now when everything goes well, it works like this:

# ifconfig wlan0 down

[56] entering psq_drain psq=-2195130981992, lock=0
<6>[56] wlan0: link state changed to DOWN

# kldunload if_iwlwifi

[58] entering psq_drain psq=-2195461594728, lock=0


By contrast, when it crashes, it looks like this:

# ifconfig wlan0 down

[135] entering psq_drain psq=-2195141467752, lock=0
<6>[135] wlan0: link state changed to DOWN

# kldunload if_iwlwifi

[157] entering psq_drain psq=-2195120246376, lock=0
[157] entering psq_drain psq=-2195120246376, lock=4
[157]
[157] Fatal trap 12: page fault while in kernel mode

Tentatively, as far as I currently understand, that mtx_lock=4 is not supposed to go into __mtx_lock_sleep().
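
One way to read the `lock=4` value printed above is to split it into its low flag bits. The sketch below is not from the thread: `decode_mtx_lock` is a hypothetical helper, and the flag values (MTX_RECURSED=0x1, MTX_CONTESTED=0x2, MTX_DESTROYED=0x4, with the remaining bits holding the owning thread) are an assumption recalled from sys/sys/mutex.h on recent FreeBSD and should be verified against the tree in use.

```sh
#!/bin/sh
# Hypothetical decoder for a raw mtx_lock value.
# Assumed flag bits (verify against sys/sys/mutex.h in your tree):
#   MTX_RECURSED=0x1  MTX_CONTESTED=0x2  MTX_DESTROYED=0x4
decode_mtx_lock() {
    v="$1"
    out=""
    if [ $((v & 1)) -ne 0 ]; then out="$out RECURSED"; fi
    if [ $((v & 2)) -ne 0 ]; then out="$out CONTESTED"; fi
    if [ $((v & 4)) -ne 0 ]; then out="$out DESTROYED"; fi
    if [ -z "$out" ]; then out=" none"; fi
    # Everything above the three flag bits is treated as the owner field.
    echo "flags:$out owner:$((v & ~7))"
}

decode_mtx_lock 0    # the value seen in the working case
decode_mtx_lock 4    # the value seen right before the page fault
```

If those assumed flag values hold, a value of 4 would mean the mutex had already been destroyed when psq_drain() was entered the second time, which would fit the suspicion that the lock should not be reaching __mtx_lock_sleep() in that state.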
Comment 30 nbari 2023-02-11 17:03:13 UTC
In my case  if_iwlwifi.ko is loaded and not crashing but I can't create the interface, more details here (https://forums.freebsd.org/threads/wi-fi-6-ax200-iwlwifi0-siocifcreate2-wlan0.88000/#post-598051)

Something that I notice is that if I unload the module it never gets removed:

kldunload if_iwlwifi

it keeps adding it:

iwlwifi0: detached
pci7: <network> at device 0.0 (no driver attached)
Warning: memory type lkpikmalloc leaked memory on destroy (1 allocations, 64 bytes leaked).
Intel(R) Wireless WiFi based driver for FreeBSD
iwlwifi0: <iwlwifi> mem 0xfc600000-0xfc603fff at device 0.0 on pci7
iwlwifi0: successfully loaded firmware image 'iwlwifi-cc-a0-73.ucode'
iwlwifi0: api flags index 2 larger than supported by driver
iwlwifi0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
iwlwifi0: loaded firmware version 73.35c0a2c6.0 cc-a0-73.ucode op_mode iwlmvm
iwlwifi0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
iwlwifi0: Detected RF HR B3, rfid=0x10a100
iwlwifi0: base HW address: 50:e0:85:87:b5:18

Any ideas on how I could create the interface?
Comment 31 Graham Perrin 2023-09-21 07:04:25 UTC
Unlike the kernel panic that is photographed at 263632 comment 16 (nothing, in the photo, overtly related to Wi-Fi) …

… in this bug 263613 I do see relevance to Wi-Fi, in (at least) the backtrace in attachment 233545 [details]. So, 

----

^Triage: 

* kern (component) for a kernel panic
* expand the summary line to indicate that a crash is a kernel panic
* no change of assignment
* wireless@ remains on the CC list.
Comment 32 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-10-25 21:36:52 UTC
Has anyone tried any recent FreeBSD 14 or 15 to see what the current state for you is?
Comment 33 Gleb Popov freebsd_committer freebsd_triage 2023-10-28 11:53:47 UTC
I'm going to do a full upgrade to latest CURRENT for my laptop in the next couple of days. I'll try out iwlwifi too and report back here.
Comment 34 Gleb Popov freebsd_committer freebsd_triage 2023-10-29 17:53:49 UTC
It started working for me! I'm writing this comment while being connected via iwlwifi.

The "service netif restart" stopped panicking too, however it spits out the following messages:

Oct 29 20:32:53 sbreeze kernel: wlan0: link state changed to DOWN
Oct 29 20:32:53 sbreeze kernel: iwlwifi0: Couldn't drain frames for staid 0, status 0x8
Oct 29 20:32:53 sbreeze kernel: iwlwifi0: lkpi_sta_run_to_init:1954: mo_sta_state(NOTEXIST) failed: -5
Oct 29 20:32:53 sbreeze kernel: iwlwifi0: lkpi_iv_newstate: error -5 during state transition 5 (RUN) -> 0 (INIT)
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: Microcode SW error detected.  Restarting 0x2000000.
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: Start IWL Error Log Dump:
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: Transport status: 0x0000004B, valid: 6
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: Loaded firmware version: 29.4063824552.0 7265D-29.ucode
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00003418 | ADVANCED_SYSASSERT          
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000220 | trm_hw_status0
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | trm_hw_status1
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00043D6C | branchlink2
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x0004B002 | interruptlink1
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | interruptlink2
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0xDEADBEEF | data1
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0xDEADBEEF | data2
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0xDEADBEEF | data3
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x210046B6 | beacon time
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0xD39E294A | tsf low
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000135 | tsf hi
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | time gp1
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x014E3B60 | time gp2
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000001 | uCode revision type
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x0000001D | uCode version major
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0xF2390AA8 | uCode version minor
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000210 | hw version
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00489200 | board version
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x005C0128 | hcmd
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x24022080 | isr0
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | isr1
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000002 | isr2
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x004150C0 | isr3
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | isr4
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x005B0118 | last cmd Id
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | wait_event
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x000000D4 | l2p_control
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00018010 | l2p_duration
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000007 | l2p_mhvalid
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | l2p_addr_match
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000005 | lmpm_pmg_sel
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x03031934 | timestamp
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x0000C0C8 | flow_handler
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: Fseq Registers:
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | FSEQ_ERROR_CODE
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | FSEQ_TOP_INIT_VERSION
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | FSEQ_CNVIO_INIT_VERSION
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | FSEQ_OTP_VERSION
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | FSEQ_TOP_CONTENT_VERSION
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | FSEQ_ALIVE_TOKEN
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | FSEQ_CNVI_ID
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | FSEQ_CNVR_ID
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | CNVI_AUX_MISC_CHIP
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | CNVR_AUX_MISC_CHIP
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | CNVR_SCU_SD_REGS_SD_REG_DIG_DCDC_VTRIM
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | CNVR_SCU_SD_REGS_SD_REG_ACTIVE_VDIG_MIRROR
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | FSEQ_PREV_CNVIO_INIT_VERSION
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | FSEQ_WIFI_FSEQ_VERSION
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | FSEQ_BT_FSEQ_VERSION
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: 0x00000000 | FSEQ_CLASS_TP_VERSION
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: Collecting data: trigger 2 fired.
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: FW error in SYNC CMD MAC_CONTEXT_CMD
Oct 29 20:32:54 sbreeze kernel: #0 0xffffffff80a5fb9b at linux_dump_stack+0x1b
Oct 29 20:32:54 sbreeze kernel: #1 0xffffffff82cf6773 at iwl_trans_txq_send_hcmd+0x3f3
Oct 29 20:32:54 sbreeze kernel: #2 0xffffffff82c9229e at iwl_trans_send_cmd+0xce
Oct 29 20:32:54 sbreeze kernel: #3 0xffffffff82cd5319 at iwl_mvm_send_cmd_pdu+0x49
Oct 29 20:32:54 sbreeze kernel: #4 0xffffffff82ca27ba at iwl_mvm_mac_ctxt_remove+0x7a
Oct 29 20:32:54 sbreeze kernel: #5 0xffffffff82cadbcc at iwl_mvm_mac_remove_interface+0x1ec
Oct 29 20:32:54 sbreeze kernel: #6 0xffffffff80a59cae at lkpi_80211_mo_remove_interface+0x8e
Oct 29 20:32:54 sbreeze kernel: #7 0xffffffff80a555a6 at lkpi_ic_vap_delete+0xd6
Oct 29 20:32:54 sbreeze kernel: #8 0xffffffff8095f362 at wlan_clone_destroy+0x12
Oct 29 20:32:54 sbreeze kernel: #9 0xffffffff80914811 at if_clone_destroy+0x91
Oct 29 20:32:54 sbreeze kernel: #10 0xffffffff80910719 at ifioctl+0x899
Oct 29 20:32:54 sbreeze kernel: #11 0xffffffff8085ad45 at kern_ioctl+0x255
Oct 29 20:32:54 sbreeze kernel: #12 0xffffffff8085aa83 at sys_ioctl+0x123
Oct 29 20:32:54 sbreeze kernel: #13 0xffffffff80c21eb9 at amd64_syscall+0x109
Oct 29 20:32:54 sbreeze kernel: #14 0xffffffff80bf77bb at fast_syscall_common+0xf8
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: Failed to send MAC_CONTEXT_CMD (action:3): -5
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: Failed to disable queue 1 (ret=-5)
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: Failed to remove station. Id=1
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: Failed sending remove station
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: Applying debug destination EXTERNAL_DRAM
Oct 29 20:32:54 sbreeze syslogd: last message repeated 1 times
Oct 29 20:32:54 sbreeze kernel: iwlwifi0: FW already configured (0) - re-configuring
Oct 29 20:32:54 sbreeze kernel: WARNING !!!(({ __typeof(((volatile const unsigned long *)(&mvm->status))[((IWL_MVM_STATUS_IN_HW_RESTART) / 64)]) __var = ({ __asm__ __volatile__("": : :"memory"); (*(const volatile __typeof(((volatile const unsigned long *)(&mvm->status))[((IWL_MVM_STATUS_IN_HW_RESTART) / 64)]) *)&(((volatile const unsigned long *)(&mvm->status))[((IWL_MVM_STATUS_IN_HW_RESTART) / 64)])); }); __asm__ __volatile__("": : :"memory"); __var; }) & (1UL << ((IWL_MVM_STATUS_IN_HW_RESTART) & (64 - 1)))) && ctxt->ref failed at /usr/src/sys/contrib/dev/iwlwifi/mvm/phy-ctxt.c:269
Oct 29 20:32:54 sbreeze kernel: wlan0: Ethernet address: 9c:da:3e:77:91:d3
Oct 29 20:32:56 sbreeze kernel: wlan0: link state changed to UP

I will keep running iwlwifi on my laptop and will report back any issues I encounter. Thanks again for the great work!
Comment 35 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-12-19 01:05:33 UTC
Jonathan, any update from you?

If I understand Gleb correctly, this problem is gone for him, but the one from PR 271979 is still seen (tracked there).
Comment 36 Jonathan Vasquez 2023-12-19 14:23:53 UTC
Hey Bjoern,

I hope you are doing well ;). I’ve switched out my AX210 for an Atheros AR9462, but I’m happy to see that things have improved. In the future I may upgrade back to the AX210. Does the AX210 support 802.11n or higher speeds now, or is it still capped at the lower speeds?
Comment 37 commit-hook freebsd_committer freebsd_triage 2024-02-14 19:50:20 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=2ac8a2189ac6707f48f77ef2e36baf696a0d2f40

commit 2ac8a2189ac6707f48f77ef2e36baf696a0d2f40
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-03 16:33:56 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-14 19:47:53 +0000

    LinuxKPI: 802.11: band-aid for invalid state changes after (*iv_update_bss)

    With firmware based solutions we cannot just jump from an active session
    to a new iv_bss node without tearing down state for the old and bringing
    up the new node.  This likely used to work on softmac based cards/drivers
    where one could essentially set the state and fire at will.

    We track (*iv_update_bss) calls from net80211 and set a local flag that
    we are out of synch and do not allow any further operations up the state
    machine until we hit INIT or SCAN.  That means someone will take the state
    down, clean up firmware state and then we can join again and build up
    state.

    Apparently this problem has been "known" for a while as native iwm(4) and
    others have similar workarounds (though less strict) and can be equally
    pestered into bad states.  For LinuxKPI all the KASSERTs just massively
    brought this problem out.  The solution will be some rewrites in net80211.
    Until then, try to keep us more stable at least and not die on second
    join1() calls triggered by service netif start wlan0 and similar.

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (2023, partial)
    MFC after:      3 days
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43725

 sys/compat/linuxkpi/common/src/linux_80211.c | 309 +++++++++++++++++++--------
 sys/compat/linuxkpi/common/src/linux_80211.h |   2 +
 2 files changed, 216 insertions(+), 95 deletions(-)
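
The gating described in the commit message above can be illustrated with a small stand-alone sketch in C. This is not the code from linux_80211.c: the names `sketch_vif`, `sketch_update_bss()` and `sketch_newstate()`, the reduced state list, and the choice to refuse every transition other than INIT or SCAN while the flag is set are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced state list; the real net80211 state machine has more states. */
enum sketch_state { ST_INIT, ST_SCAN, ST_AUTH, ST_ASSOC, ST_RUN };

/* Hypothetical per-interface bookkeeping. */
struct sketch_vif {
	enum sketch_state state;
	bool out_of_sync;	/* set after a BSS update was seen */
};

/* Called when the stack swaps the BSS node underneath an active session. */
static void
sketch_update_bss(struct sketch_vif *v)
{
	assert(v != NULL);
	v->out_of_sync = true;
}

/*
 * Request a state change.  While out of sync, only INIT or SCAN are
 * accepted; reaching either of them clears the flag so that a fresh
 * join can build state up again.  Returns false if the request was refused.
 */
static bool
sketch_newstate(struct sketch_vif *v, enum sketch_state next)
{
	assert(v != NULL);
	if (next == ST_INIT || next == ST_SCAN) {
		v->out_of_sync = false;	/* old state is considered torn down */
		v->state = next;
		return (true);
	}
	if (v->out_of_sync)
		return (false);		/* refuse to move up the state machine */
	v->state = next;
	return (true);
}
```

With this sketch, an interface in ST_RUN that sees `sketch_update_bss()` refuses a request for ST_AUTH, accepts ST_SCAN, and only then accepts ST_AUTH again.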
Comment 38 commit-hook freebsd_committer freebsd_triage 2024-02-14 19:50:37 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=713db49d06deee90dd358b2e4b9ca05368a5eaf6

commit 713db49d06deee90dd358b2e4b9ca05368a5eaf6
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-01-10 10:14:16 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-14 19:47:21 +0000

    net80211: deal with lost state transitions

    Since 5efea30f039c4 we can possibly lose a state transition which can
    cause trouble further down the road.
    The reproducer from 643d6dce6c1e can trigger these for example.
    Drivers for firmware based wireless cards have worked around some of
    this (and other) problems in the past.

    Add an array of tasks rather than a single one as we would simply
    get npending > 1 and lose order with other tasks.  Try to keep state
    changes updated as queued in case we end up with more than one at a
    time.  While this is not ideal either (call it a hack) it will sort
    the problem for now.
    We will queue in ieee80211_new_state_locked() and do checks there
    and dequeue in ieee80211_newstate_cb().
    If we still overrun the (currently) 8 slots we will drop the state
    change rather than overwrite the last one.
    When dequeing we will update iv_nstate and keep it around for historic
    reasons for the moment.

    The longer term we should make the callers of
    ieee80211_new_state[_locked]() actually use the returned errors
    and act appropriately but that will touch a lot more places and
    drivers (possibly incl. changed behaviour for ioctls).

    rtwn(4) and rum(4) should probably be revisted and net80211 internals
    removed (for rum(4) at least the current logic still seems prone to
    races).

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (in 2023)
    MFC after:      3 days
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43389

 sys/dev/rtwn/if_rtwn.c         |   4 +-
 sys/dev/usb/wlan/if_rum.c      |   4 +-
 sys/net80211/ieee80211.c       |   4 +-
 sys/net80211/ieee80211_ddb.c   |  13 ++++-
 sys/net80211/ieee80211_proto.c | 124 ++++++++++++++++++++++++++++++++++-------
 sys/net80211/ieee80211_var.h   |  13 ++++-
 6 files changed, 134 insertions(+), 28 deletions(-)
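
The queueing behaviour described in the commit message above (a fixed number of slots, processed in order, with a new request dropped rather than overwriting the last one when all slots are in use) can be sketched as a small ring buffer in C. This is not the net80211 code: `state_queue`, `pending_state`, `sq_enqueue()` and `sq_dequeue()` are hypothetical names, and only the slot count of 8 is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSTATE_SLOTS	8	/* "(currently) 8 slots" per the commit message */

/* Hypothetical stand-in for one queued state change request. */
struct pending_state {
	int nstate;	/* requested new state */
	int arg;	/* argument passed along with the request */
};

/* Fixed-size FIFO of pending state changes. */
struct state_queue {
	struct pending_state slots[NSTATE_SLOTS];
	size_t head;	/* index of the oldest entry */
	size_t count;	/* number of entries in use */
};

static void
sq_init(struct state_queue *q)
{
	q->head = 0;
	q->count = 0;
}

/* Queue a request; if all slots are in use, drop it and return false. */
static bool
sq_enqueue(struct state_queue *q, int nstate, int arg)
{
	size_t tail;

	assert(q->count <= NSTATE_SLOTS);
	if (q->count == NSTATE_SLOTS)
		return (false);		/* drop, never overwrite the last one */
	tail = (q->head + q->count) % NSTATE_SLOTS;
	q->slots[tail].nstate = nstate;
	q->slots[tail].arg = arg;
	q->count++;
	return (true);
}

/* Take the oldest request, preserving order; returns false if empty. */
static bool
sq_dequeue(struct state_queue *q, struct pending_state *out)
{
	if (q->count == 0)
		return (false);
	*out = q->slots[q->head];
	q->head = (q->head + 1) % NSTATE_SLOTS;
	q->count--;
	return (true);
}
```

In this sketch the caller learns about a dropped request from the return value, which matches the commit message's longer-term point that callers should act on the returned errors.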
Comment 39 commit-hook freebsd_committer freebsd_triage 2024-02-18 21:12:16 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b392b36d3776b696601ce0253256803276d24ea2

commit b392b36d3776b696601ce0253256803276d24ea2
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-01-10 10:14:16 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-18 18:31:17 +0000

    net80211: deal with lost state transitions

    Since 5efea30f039c4 we can possibly lose a state transition which can
    cause trouble further down the road.
    The reproducer from 643d6dce6c1e can trigger these for example.
    Drivers for firmware based wireless cards have worked around some of
    this (and other) problems in the past.

    Add an array of tasks rather than a single one as we would simply
    get npending > 1 and lose order with other tasks.  Try to keep state
    changes updated as queued in case we end up with more than one at a
    time.  While this is not ideal either (call it a hack) it will sort
    the problem for now.
    We will queue in ieee80211_new_state_locked() and do checks there
    and dequeue in ieee80211_newstate_cb().
    If we still overrun the (currently) 8 slots we will drop the state
    change rather than overwrite the last one.
    When dequeing we will update iv_nstate and keep it around for historic
    reasons for the moment.

    The longer term we should make the callers of
    ieee80211_new_state[_locked]() actually use the returned errors
    and act appropriately but that will touch a lot more places and
    drivers (possibly incl. changed behaviour for ioctls).

    rtwn(4) and rum(4) should probably be revisted and net80211 internals
    removed (for rum(4) at least the current logic still seems prone to
    races).

    Given this changes the internal structure of 'struct ieee80211vap',
    which gets allocated by the drivers, and we do not have enough
    spares, all wireless drivers need to be recompiled.
    Given we are forced to do the update, we leave fields in the middle
    of the struct and add more spares at the same time.
    __FreeBSD_version gets updated to 1400509 to be able to detect
    this change.

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (in 2023)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43389

    (cherry picked from commit 713db49d06deee90dd358b2e4b9ca05368a5eaf6)
    (cherry picked from commit a890a3a5ddf33acb0a4000885945b89156799b07)

 UPDATING                       |   6 ++
 sys/dev/rtwn/if_rtwn.c         |   4 +-
 sys/dev/usb/wlan/if_rum.c      |   4 +-
 sys/net80211/ieee80211.c       |   4 +-
 sys/net80211/ieee80211_ddb.c   |  13 ++++-
 sys/net80211/ieee80211_proto.c | 124 ++++++++++++++++++++++++++++++++++-------
 sys/net80211/ieee80211_var.h   |  15 +++--
 sys/sys/param.h                |   2 +-
 8 files changed, 142 insertions(+), 30 deletions(-)
Comment 40 commit-hook freebsd_committer freebsd_triage 2024-02-18 21:12:20 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=8c450ea1083b03f30871506b59034f26bc608972

commit 8c450ea1083b03f30871506b59034f26bc608972
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-03 16:33:56 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-18 18:31:17 +0000

    LinuxKPI: 802.11: band-aid for invalid state changes after (*iv_update_bss)

    With firmware based solutions we cannot just jump from an active session
    to a new iv_bss node without tearing down state for the old and bringing
    up the new node.  This likely used to work on softmac based cards/drivers
    where one could essentially set the state and fire at will.

    We track (*iv_update_bss) calls from net80211 and set a local flag that
    we are out of synch and do not allow any further operations up the state
    machine until we hit INIT or SCAN.  That means someone will take the state
    down, clean up firmware state and then we can join again and build up
    state.

    Apparently this problem has been "known" for a while as native iwm(4) and
    others have similar workarounds (though less strict) and can be equally
    pestered into bad states.  For LinuxKPI all the KASSERTs just massively
    brought this problem out.  The solution will be some rewrites in net80211.
    Until then, try to keep us more stable at least and not die on second
    join1() calls triggered by service netif start wlan0 and similar.

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (2023, partial)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43725

    (cherry picked from commit 2ac8a2189ac6707f48f77ef2e36baf696a0d2f40)

 sys/compat/linuxkpi/common/src/linux_80211.c | 309 +++++++++++++++++++--------
 sys/compat/linuxkpi/common/src/linux_80211.h |   2 +
 2 files changed, 216 insertions(+), 95 deletions(-)
Comment 41 commit-hook freebsd_committer freebsd_triage 2024-02-19 08:09:15 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=184ccc414686ea32c64f063c081c7cc1adeae7c3

commit 184ccc414686ea32c64f063c081c7cc1adeae7c3
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-03 16:33:56 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 08:02:02 +0000

    LinuxKPI: 802.11: band-aid for invalid state changes after (*iv_update_bss)

    With firmware based solutions we cannot just jump from an active session
    to a new iv_bss node without tearing down state for the old and bringing
    up the new node.  This likely used to work on softmac based cards/drivers
    where one could essentially set the state and fire at will.

    We track (*iv_update_bss) calls from net80211 and set a local flag that
    we are out of synch and do not allow any further operations up the state
    machine until we hit INIT or SCAN.  That means someone will take the state
    down, clean up firmware state and then we can join again and build up
    state.

    Apparently this problem has been "known" for a while as native iwm(4) and
    others have similar workarounds (though less strict) and can be equally
    pestered into bad states.  For LinuxKPI all the KASSERTs just massively
    brought this problem out.  The solution will be some rewrites in net80211.
    Until then, try to keep us more stable at least and not die on second
    join1() calls triggered by service netif start wlan0 and similar.

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (2023, partial)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43725

    (cherry picked from commit 2ac8a2189ac6707f48f77ef2e36baf696a0d2f40)

 sys/compat/linuxkpi/common/src/linux_80211.c | 309 +++++++++++++++++++--------
 sys/compat/linuxkpi/common/src/linux_80211.h |   2 +
 2 files changed, 216 insertions(+), 95 deletions(-)
Comment 42 commit-hook freebsd_committer freebsd_triage 2024-02-19 08:09:23 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a7e1fc7f620d3341549c1380f550aaafbdb45622

commit a7e1fc7f620d3341549c1380f550aaafbdb45622
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-01-10 10:14:16 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 08:02:01 +0000

    net80211: deal with lost state transitions

    Since 5efea30f039c4 we can possibly lose a state transition which can
    cause trouble further down the road.
    The reproducer from 643d6dce6c1e can trigger these for example.
    Drivers for firmware based wireless cards have worked around some of
    this (and other) problems in the past.

    Add an array of tasks rather than a single one as we would simply
    get npending > 1 and lose order with other tasks.  Try to keep state
    changes updated as queued in case we end up with more than one at a
    time.  While this is not ideal either (call it a hack) it will sort
    the problem for now.
    We will queue in ieee80211_new_state_locked() and do checks there
    and dequeue in ieee80211_newstate_cb().
    If we still overrun the (currently) 8 slots we will drop the state
    change rather than overwrite the last one.
    When dequeing we will update iv_nstate and keep it around for historic
    reasons for the moment.

    The longer term we should make the callers of
    ieee80211_new_state[_locked]() actually use the returned errors
    and act appropriately but that will touch a lot more places and
    drivers (possibly incl. changed behaviour for ioctls).

    rtwn(4) and rum(4) should probably be revisted and net80211 internals
    removed (for rum(4) at least the current logic still seems prone to
    races).

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (in 2023)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43389

    (cherry picked from commit 713db49d06deee90dd358b2e4b9ca05368a5eaf6)

    Given this changes the internal structure of 'struct ieee80211vap',
    which gets allocated by the drivers, and we do not have enough
    spares, all wireless drivers need to be recompiled.
    Given we are forced to do the update, we leave fields in the middle
    of the struct and add more spares at the same time.
    __FreeBSD_version gets updated to 1303501 to be able to detect
    this change.

    (cherry picked from commit a890a3a5ddf33acb0a4000885945b89156799b07)

 UPDATING                       |   6 ++
 sys/dev/rtwn/if_rtwn.c         |   4 +-
 sys/dev/usb/wlan/if_rum.c      |   4 +-
 sys/net80211/ieee80211.c       |   4 +-
 sys/net80211/ieee80211_ddb.c   |  15 ++++-
 sys/net80211/ieee80211_proto.c | 124 ++++++++++++++++++++++++++++++++++-------
 sys/net80211/ieee80211_var.h   |  18 +++---
 sys/sys/param.h                |   2 +-
 8 files changed, 143 insertions(+), 34 deletions(-)
Comment 43 commit-hook freebsd_committer freebsd_triage 2024-02-19 16:10:48 UTC
A commit in branch releng/13.3 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d4b4efc6db6c6c3a9abf2f187ba1ccc0e40028cf

commit d4b4efc6db6c6c3a9abf2f187ba1ccc0e40028cf
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-03 16:33:56 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 16:09:22 +0000

    LinuxKPI: 802.11: band-aid for invalid state changes after (*iv_update_bss)

    With firmware based solutions we cannot just jump from an active session
    to a new iv_bss node without tearing down state for the old and bringing
    up the new node.  This likely used to work on softmac based cards/drivers
    where one could essentially set the state and fire at will.

    We track (*iv_update_bss) calls from net80211 and set a local flag that
    we are out of synch and do not allow any further operations up the state
    machine until we hit INIT or SCAN.  That means someone will take the state
    down, clean up firmware state and then we can join again and build up
    state.

    Apparently this problem has been "known" for a while as native iwm(4) and
    others have similar workarounds (though less strict) and can be equally
    pestered into bad states.  For LinuxKPI all the KASSERTs just massively
    brought this problem out.  The solution will be some rewrites in net80211.
    Until then, try to keep us more stable at least and not die on second
    join1() calls triggered by service netif start wlan0 and similar.

    Approved by:    re (cperciva)
    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (2023, partial)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43725

    (cherry picked from commit 2ac8a2189ac6707f48f77ef2e36baf696a0d2f40)
    (cherry picked from commit 184ccc414686ea32c64f063c081c7cc1adeae7c3)

 sys/compat/linuxkpi/common/src/linux_80211.c | 309 +++++++++++++++++++--------
 sys/compat/linuxkpi/common/src/linux_80211.h |   2 +
 2 files changed, 216 insertions(+), 95 deletions(-)
Comment 44 commit-hook freebsd_committer freebsd_triage 2024-02-19 16:10:57 UTC
A commit in branch releng/13.3 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9b998db87c28356fce21784c4f8bfb8737615e1f

commit 9b998db87c28356fce21784c4f8bfb8737615e1f
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-01-10 10:14:16 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 16:07:20 +0000

    net80211: deal with lost state transitions

    Since 5efea30f039c4 we can possibly lose a state transition which can
    cause trouble further down the road.
    The reproducer from 643d6dce6c1e can trigger these for example.
    Drivers for firmware based wireless cards have worked around some of
    this (and other) problems in the past.

    Add an array of tasks rather than a single one as we would simply
    get npending > 1 and lose order with other tasks.  Try to keep state
    changes updated as queued in case we end up with more than one at a
    time.  While this is not ideal either (call it a hack) it will sort
    the problem for now.
    We will queue in ieee80211_new_state_locked() and do checks there
    and dequeue in ieee80211_newstate_cb().
    If we still overrun the (currently) 8 slots we will drop the state
    change rather than overwrite the last one.
    When dequeing we will update iv_nstate and keep it around for historic
    reasons for the moment.

    The longer term we should make the callers of
    ieee80211_new_state[_locked]() actually use the returned errors
    and act appropriately but that will touch a lot more places and
    drivers (possibly incl. changed behaviour for ioctls).

    rtwn(4) and rum(4) should probably be revisted and net80211 internals
    removed (for rum(4) at least the current logic still seems prone to
    races).

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (in 2023)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43389

    (cherry picked from commit 713db49d06deee90dd358b2e4b9ca05368a5eaf6)

    Given this changes the internal structure of 'struct ieee80211vap',
    which gets allocated by the drivers, and we do not have enough
    spares, all wireless drivers need to be recompiled.
    Given we are forced to do the update, we leave fields in the middle
    of the struct and add more spares at the same time.
    __FreeBSD_version will get updated to 1303001 to be able to detect
    this change.

    Approved by:    re (cperciva)

    (cherry picked from commit a890a3a5ddf33acb0a4000885945b89156799b07)
    (cherry picked from commit a7e1fc7f620d3341549c1380f550aaafbdb45622)

 sys/dev/rtwn/if_rtwn.c         |   4 +-
 sys/dev/usb/wlan/if_rum.c      |   4 +-
 sys/net80211/ieee80211.c       |   4 +-
 sys/net80211/ieee80211_ddb.c   |  15 ++++-
 sys/net80211/ieee80211_proto.c | 124 ++++++++++++++++++++++++++++++++++-------
 sys/net80211/ieee80211_var.h   |  18 +++---
 6 files changed, 136 insertions(+), 33 deletions(-)
Comment 45 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-02-19 16:42:59 UTC
I believe all the firmware crashes reported here have been solved for a few months now, and 15, 14, 13, and 13.3 should not have these problems anymore.

I'll close this, but if you still see the problem, please re-open.
Also, if anyone can confirm it is working, please let us know.

Note: there is still the "Invalid TXQ id" problem, which I know some people are seeing; it is tracked in PR 274382 (if you test and hit that, please follow up there).