Bug 274003 - iwlwifi driver update 2023-09-21
Summary: iwlwifi driver update 2023-09-21
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Bjoern A. Zeeb
URL:
Keywords: crash, tracking
Depends on:
Blocks: iwlwifi
Reported: 2023-09-21 17:47 UTC by Bjoern A. Zeeb
Modified: 2024-02-19 21:41 UTC
CC List: 5 users

See Also:
bz: mfc-stable14+
bz: mfc-stable13+


Description Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-09-21 17:47:59 UTC
"meta-bug" for collecting all (new) issues after the iwlwifi driver and firmware update 2023-09-21.
Comment 1 Bakul Shah 2023-10-01 06:06:14 UTC
guest VM: -current (commit: 8a77bc5e1b - past your 16e688b2a commit)
host: stable/14 host (Ryzen)

It came up fine but *crashed* on "service netif restart wlan0".
In case it matters, I am running GENERIC-NODEBUG, wifi: Intel 9260
[Note: This is the first time it worked at all in a VM!]

dhclient exiting
Stopping wpa_supplicant.
Waiting for PIDS: 354iwlwifi0: Couldn't drain frames for staid 0, status 0x8
iwlwifi0: lkpi_iv_newstate: error -5 during state transition 5 (RUN) -> 0 (INIT)
.
Stopping Network: wlan0.
wlan0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=0
        ether xxxxxxxxxxxxx
        groups: wlan
        ssid "" channel 8 (2447 MHz 11g)
        regdomain FCC country US authmode OPEN privacy OFF txpower 30 bmiss 7
        scanvalid 60 protmode CTS wme
        parent interface: iwlwifi0
        media: IEEE 802.11 Wireless Ethernet autoselect (autoselect)
        status: no carrier
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
Sep 30 22:48:50 fbsd15 dhclient[666]: connection closed
Sep 30 22:48:50 fbsd15 dhclient[666]: exiting.



Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x458
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80b27cac
stack pointer           = 0x28:0xfffffe00748e29c0
frame pointer           = 0x28:0xfffffe00748e2a40
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 2378 (ifconfig)
rdi: fffffe0075f7c1b0 rsi: 0000000000000004 rdx: 0000000000000000
rcx: fffffe00759ab740  r8: 0000000000000000  r9: fffff80001054b98
rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe00748e2a40
r10: 0000000000000000 r11: fffff8000103dd70 r12: fffffe00748e29e0
r13: fffffe00759ab740 r14: 0000000000000000 r15: fffffe0075f7c1b0
trap number             = 12
...
panic() at panic+0x43/frame 0xfffffe00748e2830
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00748e2890
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00748e28f0
calltrap() at calltrap+0x8/frame 0xfffffe00748e28f0
--- trap 0xc, rip = 0xffffffff80b27cac, rsp = 0xfffffe00748e29c0, rbp = 0xfffffe00748e2a40 ---
__mtx_lock_sleep() at __mtx_lock_sleep+0xbc/frame 0xfffffe00748e2a40
ieee80211_node_psq_drain() at ieee80211_node_psq_drain+0x100/frame 0xfffffe00748e2a90
node_cleanup() at node_cleanup+0x65/frame 0xfffffe00748e2ac0
node_free() at node_free+0x25/frame 0xfffffe00748e2ae0
ieee80211_node_vdetach() at ieee80211_node_vdetach+0x2b/frame 0xfffffe00748e2b00
ieee80211_vap_detach() at ieee80211_vap_detach+0x41d/frame 0xfffffe00748e2b40
lkpi_ic_vap_delete() at lkpi_ic_vap_delete+0x9d/frame 0xfffffe00748e2b80
wlan_clone_destroy() at wlan_clone_destroy+0x12/frame 0xfffffe00748e2b90
if_clone_destroy() at if_clone_destroy+0x91/frame 0xfffffe00748e2bd0
ifioctl() at ifioctl+0x899/frame 0xfffffe00748e2cc0
kern_ioctl() at kern_ioctl+0x255/frame 0xfffffe00748e2d30
sys_ioctl() at sys_ioctl+0x123/frame 0xfffffe00748e2e00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe00748e2f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00748e2f30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x1188e953e28a, rsp = 0x1188e5b34f68, rbp = 0x1188e5b34fa0 ---
KDB: enter: panic
[ thread pid 2378 tid 100283 ]
Stopped at      kdb_enter+0x32: movq    $0,0xe29e83(%rip)
db>
Comment 2 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-10-02 11:38:40 UTC
(In reply to Bakul Shah from comment #1)

Initially, looking at the backtrace, I thought it was a problem related to what I posted in the follow-up (backtraces may differ): https://lists.freebsd.org/archives/freebsd-wireless/2023-September/001449.html

But thanks to the full output being posted, the real problem shows up in iwlwifi:
iwlwifi0: Couldn't drain frames for staid 0, status 0x8
That is already a FreeBSD-enhanced error message.
Given the status I added to that error, it seems the sta is already gone when we try to drain (ADD_STA_MODIFY_NON_EXISTING_STA: the driver requested to modify a station that doesn't exist).
So after all it could be related to the node_free() problem in your backtrace, which I started tracing on Saturday.  I'll follow up when that is supposed to be fixed and we'll see if this goes away then too.
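
For readers following along, here is a minimal sketch of how the status value in that message could be turned into a name. Only the pairing of 0x8 with ADD_STA_MODIFY_NON_EXISTING_STA comes from the comment above; the helper name add_sta_status_str and the "unknown" fallback are made up for illustration and are not iwlwifi code.

```c
#include <stdint.h>

/*
 * Value taken from the comment above: status 0x8 corresponds to
 * ADD_STA_MODIFY_NON_EXISTING_STA.  Other firmware status codes are
 * deliberately left out because this thread does not list them.
 */
#define ADD_STA_MODIFY_NON_EXISTING_STA	0x8

/* Hypothetical helper, not part of iwlwifi: name a status for a log line. */
static const char *
add_sta_status_str(uint32_t status)
{
	if (status == ADD_STA_MODIFY_NON_EXISTING_STA)
		return ("ADD_STA_MODIFY_NON_EXISTING_STA");
	return ("unknown");
}
```

Printing add_sta_status_str(0x8) next to the raw value would make a message like "Couldn't drain frames" easier to interpret without looking up the firmware API.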
Comment 3 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-10-02 11:52:41 UTC
(In reply to Bjoern A. Zeeb from comment #2)

Also seems to be related to PR 273985.
Comment 4 Bakul Shah 2023-10-03 02:17:50 UTC
After exiting the VM, the PCI slot has to be reset (from the host) before the interface works again. But even then it rarely worked. Typically I would see:

Oct  2 11:58:39 fbsd15 dhclient[999]: send_packet: No buffer space available
iwlwifi0: No beacon heard and the time event is over already...
iwlwifi0: Couldn't drain frames for staid 0, status 0x8
iwlwifi0: lkpi_iv_newstate: error -5 during state transition 5 (RUN) -> 0 (INIT)

iwlwifi0: Queue 5 is active on fifo 3 and stuck for 10000 ms. SW [2, 3] HW [2, 3] FH TRB=0x080305001
iwlwifi0: Microcode SW error detected. Restarting 0x0.
iwlwifi0: Start IWL Error Log Dump:
iwlwifi0: Transport status: 0x0000004A, valid: 6
iwlwifi0: Loaded firmware version: 46.ff18e32a.0 9260-th-b0-jf-b0-46.ucode
iwlwifi0: 0x00000084 | NMI_INTERRUPT_UNKNOWN
iwlwifi0: 0x00A022F0 | trm_hw_status0
iwlwifi0: 0x00000000 | trm_hw_status1
iwlwifi0: 0x00481CEE | branchlink2

etc.

On "service netif stop wlan0" the interface disappears. Then "service netif start wlan0" doesn't work with

iwlwifi0: lkpi_ic_vap_create: failed to start hw: 17
ifconfig: SIOCIFCREATE2 (wlan0): Input/output error

I then recompiled the kernel from scratch with debug symbols (by removing the following settings):

WITHOUT_DEBUG_FILES=yes
MK_DEBUG_FILES=no

Now it never works, with the same symptoms, and it panics after a while:


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xffffffff81f855f2
stack pointer           = 0x28:0xfffffe0074175de0
frame pointer           = 0x28:0xfffffe0074175de0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 6 (txg_thread_enter)
rdi: fffffe006aa4d130 rsi: fffff80003592880 rdx: fffffe006aa4d140
rcx: 7fffffff00000000  r8: 0000000000000000  r9: fffffe0074176000
rax: fffff80004cfcd60 rbx: fffffe006aa4c000 rbp: fffffe0074175de0
r10: 0000000000000000 r11: 000000007fffdb42 r12: fffffe006aa4d130
r13: 0000000000000064 r14: fffffe006aa4d110 r15: fffffe006aaae020
trap number             = 9
panic: general protection fault
cpuid = 0
time = 1696273625
KDB: stack backtrace:

I can try to capture more data now that I can use gdb from the host to the VM.
Comment 5 Graham Perrin 2023-10-03 05:57:48 UTC
^Triage: 

* kern (component) and crash (keyword) for kernel panics.

(In reply to Bjoern A. Zeeb from comment #0)

> "meta-bug" for collecting all (new) issues after the iwlwifi 
> driver and firmware update 2023-09-21.

Would you like people's different cases to be mixed into this one report, or separated out?
Comment 6 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-10-03 19:00:26 UTC
(In reply to Graham Perrin from comment #5)

Feel free to put them here.
Comment 7 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-10-03 19:04:31 UTC
(In reply to Bakul Shah from comment #4)

(1) The "etc." bit is actually important.

(2) Given the firmware version, is this a 9xxx or an 8xxx?  I cannot remember what device you had.

(3) "iwlwifi0: lkpi_ic_vap_create: failed to start hw: 17" is understood and mentioned in https://cgit.FreeBSD.org/src/commit/?id=dbf7691999abe501e0ebc0fe4d8d9e97718d3890
Comment 8 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-10-03 21:21:12 UTC
(In reply to Bjoern A. Zeeb from comment #7)

# (3) should be fixed by https://cgit.FreeBSD.org/src/commit/?id=6c38c6b1b917957d420902213f318bf0153214f2
Comment 9 Bakul Shah 2023-10-03 21:33:15 UTC
(2) Device:

iwlwifi0@pci0:0:7:0:    class=0x028000 rev=0x29 hdr=0x00 vendor=0x8086 device=0x2526 subvendor=0x8086 subdevice=0x0014
    vendor     = 'Intel Corporation'
    device     = 'Wireless-AC 9260'
    class      = network

(1) What the system reports (I haven't applied your latest fix):

Autoloading module: if_iwlwifi
Intel(R) Wireless WiFi based driver for FreeBSD
iwlwifi0: <iwlwifi> mem 0xc1034000-0xc1037fff at device 7.0 on pci0
iwlwifi0: Detected crf-id 0x2816, cnv-id 0x1000200 wfpm id 0x80000000
iwlwifi0: PCI dev 2526/0014, rev=0x321, rfid=0x105110
iwlwifi0: successfully loaded firmware image 'iwlwifi-9260-th-b0-jf-b0-46.ucode'
iwlwifi0: WRT: Overriding region id 0
iwlwifi0: WRT: Overriding region id 1
iwlwifi0: WRT: Overriding region id 2
iwlwifi0: WRT: Overriding region id 3
iwlwifi0: WRT: Overriding region id 4
iwlwifi0: WRT: Overriding region id 6
iwlwifi0: WRT: Overriding region id 8
iwlwifi0: WRT: Overriding region id 9
iwlwifi0: WRT: Overriding region id 10
iwlwifi0: WRT: Overriding region id 11
iwlwifi0: WRT: Overriding region id 15
iwlwifi0: WRT: Overriding region id 16
iwlwifi0: WRT: Overriding region id 18
iwlwifi0: WRT: Overriding region id 19
iwlwifi0: WRT: Overriding region id 20
iwlwifi0: WRT: Overriding region id 21
iwlwifi0: WRT: Overriding region id 28
iwlwifi0: loaded firmware version 46.ff18e32a.0 9260-th-b0-jf-b0-46.ucode op_mode iwlmvm
iwlwifi0: Detected Intel(R) Wireless-AC 9260 160MHz, REV=0x321
iwlwifi0: SecBoot CPU1 Status: 0x3000001, CPU2 Status: 0x0
iwlwifi0: WFPM_ARC1_PD_NOTIFICATION: 0x2f
iwlwifi0: HPM_SECONDARY_DEVICE_STATE: 0x42
iwlwifi0: WFPM_MAC_OTP_CFG7_ADDR: 0x0
iwlwifi0: WFPM_MAC_OTP_CFG7_DATA: 0x4
iwlwifi0: UMAC PC: 0xc0080000
iwlwifi0: LMAC PC: 0x605dc
iwlwifi0: WRT: Collecting data: ini trigger 13 fired (delay=0ms).
iwlwifi0: Not valid error log pointer 0x00000000 for Init uCode
iwlwifi0: IML/ROM dump:
iwlwifi0: 0x00000000 | IML/ROM error/state
iwlwifi0: 0x03000001 | IML/ROM data1
iwlwifi0: Fseq Registers:
iwlwifi0: 0xE3667178 | FSEQ_ERROR_CODE
iwlwifi0: 0x00000000 | FSEQ_TOP_INIT_VERSION
iwlwifi0: 0xDCA958AE | FSEQ_CNVIO_INIT_VERSION
iwlwifi0: 0x0000A371 | FSEQ_OTP_VERSION
iwlwifi0: 0x3DDC2BB5 | FSEQ_TOP_CONTENT_VERSION
iwlwifi0: 0xE6DEEC56 | FSEQ_ALIVE_TOKEN
iwlwifi0: 0x2D435D4F | FSEQ_CNVI_ID
iwlwifi0: 0x70DAA327 | FSEQ_CNVR_ID
iwlwifi0: 0x01000200 | CNVI_AUX_MISC_CHIP
iwlwifi0: 0x01300202 | CNVR_AUX_MISC_CHIP
iwlwifi0: 0x0000485B | CNVR_SCU_SD_REGS_SD_REG_DIG_DCDC_VTRIM
iwlwifi0: 0x0BADCAFE | CNVR_SCU_SD_REGS_SD_REG_ACTIVE_VDIG_MIRROR
iwlwifi0: 0xC90E5BC2 | FSEQ_PREV_CNVIO_INIT_VERSION
iwlwifi0: 0xEB17CF2B | FSEQ_WIFI_FSEQ_VERSION
iwlwifi0: 0x03176B8B | FSEQ_BT_FSEQ_VERSION
iwlwifi0: 0xC27A3CBE | FSEQ_CLASS_TP_VERSION
iwlwifi0: Failed to start INIT ucode: -60
iwlwifi0: WRT: Collecting data: ini trigger 13 fired (delay=0ms).
iwlwifi0: Failed to run INIT ucode: -60
iwlwifi0: retry init count 0
iwlwifi0: Detected Intel(R) Wireless-AC 9260 160MHz, REV=0x321
Invalid rxb from HW 0
iwlwifi0: Microcode SW error detected. Restarting 0x0.
iwlwifi0: Not valid error log pointer 0x00000000 for Init uCode
iwlwifi0: IML/ROM dump:
iwlwifi0: 0x00002320 | IML/ROM error/state
iwlwifi0: 0x00000003 | IML/ROM data1
iwlwifi0: Fseq Registers:
iwlwifi0: 0xE3667178 | FSEQ_ERROR_CODE
iwlwifi0: 0x00000000 | FSEQ_TOP_INIT_VERSION
iwlwifi0: 0xDCA958AE | FSEQ_CNVIO_INIT_VERSION
iwlwifi0: 0x0000A371 | FSEQ_OTP_VERSION
iwlwifi0: 0x3DDC2BB5 | FSEQ_TOP_CONTENT_VERSION
iwlwifi0: 0xE6DEEC56 | FSEQ_ALIVE_TOKEN
iwlwifi0: 0x2D435D4F | FSEQ_CNVI_ID
iwlwifi0: 0x70DAA327 | FSEQ_CNVR_ID
iwlwifi0: 0x01000200 | CNVI_AUX_MISC_CHIP
iwlwifi0: 0x01300202 | CNVR_AUX_MISC_CHIP
iwlwifi0: 0x0000485B | CNVR_SCU_SD_REGS_SD_REG_DIG_DCDC_VTRIM
iwlwifi0: 0x0BADCAFE | CNVR_SCU_SD_REGS_SD_REG_ACTIVE_VDIG_MIRROR
iwlwifi0: 0xC90E5BC2 | FSEQ_PREV_CNVIO_INIT_VERSION
iwlwifi0: 0xEB17CF2B | FSEQ_WIFI_FSEQ_VERSION
iwlwifi0: 0x03176B8B | FSEQ_BT_FSEQ_VERSION
iwlwifi0: 0xC27A3CBE | FSEQ_CLASS_TP_VERSION
Invalid rxb from HW 0
iwlwifi0: SecBoot CPU1 Status: 0x3, CPU2 Status: 0x2320
iwlwifi0: WFPM_ARC1_PD_NOTIFICATION: 0x20
iwlwifi0: HPM_SECONDARY_DEVICE_STATE: 0x42
iwlwifi0: WFPM_MAC_OTP_CFG7_ADDR: 0x0
iwlwifi0: WFPM_MAC_OTP_CFG7_DATA: 0x4
iwlwifi0: UMAC PC: 0x8044f384
iwlwifi0: LMAC PC: 0xe8
iwlwifi0: WRT: Collecting data: ini trigger 13 fired (delay=0ms).
iwlwifi0: Failed to start INIT ucode: -60
iwlwifi0: WRT: Collecting data: ini trigger 13 fired (delay=0ms).
iwlwifi0: Failed to run INIT ucode: -60
iwlwifi0: retry init count 1
iwlwifi0: Detected Intel(R) Wireless-AC 9260 160MHz, REV=0x321
iwlwifi0: base HW address: 8c:a9:82:fc:e8:9c, OTP minor version: 0x4
ELF ldconfig path: /lib /usr/lib /usr/lib/compat /usr/local/lib /usr/local/lib/compat/pkg /usr/local/lib/compat/pkg /usr/local/lib/perl5/5.34/mach/CORE /usr/local/llvm15/lib
32-bit compatibility ldconfig path: /usr/lib32
Setting hostname: fbsd15.bitblocks.com.
Setting up harvesting: PURE_RDRAND,[CALLOUT],[UMA],[FS_ATIME],SWI,INTERRUPT,NET_NG,[NET_ETHER],NET_TUN,MOUSE,KEYBOARD,ATTACH,CACHED
Feeding entropy: .
wlan0: Ethernet address: 8c:a9:82:fc:e8:9c
Created wlan(4) interfaces: wlan0.
lo0: link state changed to UP
Starting wpa_supplicant.
Starting Network: lo0 em0 wlan0.
lo0: flags=1008049<UP,LOOPBACK,RUNNING,MULTICAST,LOWER_UP> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
em0: flags=1008802<BROADCAST,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        options=4e504bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether 58:9c:fc:0c:17:bb
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
...
panic: lkpi_sta_scan_to_auth: lsta 0xfffff800049b5000 state not NOTEXIST: 0x1

cpuid = 0
time = 1696343345
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00741dfb70
vpanic() at vpanic+0x132/frame 0xfffffe00741dfca0
panic() at panic+0x43/frame 0xfffffe00741dfd00
lkpi_sta_scan_to_auth() at lkpi_sta_scan_to_auth+0x602/frame 0xfffffe00741dfd80
lkpi_iv_newstate() at lkpi_iv_newstate+0x253/frame 0xfffffe00741dfdf0
ieee80211_newstate_cb() at ieee80211_newstate_cb+0x1e7/frame 0xfffffe00741dfe40
taskqueue_run_locked() at taskqueue_run_locked+0xab/frame 0xfffffe00741dfec0
taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe00741dfef0
fork_exit() at fork_exit+0x82/frame 0xfffffe00741dff30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00741dff30
--- trap 0x16, rip = 0x2fcfc427ccba, rsp = 0x2fcfca331f48, rbp = 0x2fcfca331f60 ---
KDB: enter: panic
[ thread pid 0 tid 100264 ]
Stopped at      kdb_enter+0x32: movq    $0,0xe2aa13(%rip)
db>
Comment 10 Bakul Shah 2023-10-05 00:25:56 UTC
Running the latest kernel. Now it panics every time. Resetting the PCI slot on the host doesn't help. The system does come up multiuser but panics after a few seconds.

Autoloading module: if_iwlwifi
Intel(R) Wireless WiFi based driver for FreeBSD
iwlwifi0: <iwlwifi> mem 0xc1034000-0xc1037fff at device 7.0 on pci0
iwlwifi0: Detected crf-id 0x2816, cnv-id 0x1000200 wfpm id 0x80000000
iwlwifi0: PCI dev 2526/0014, rev=0x321, rfid=0x105110
iwlwifi0: successfully loaded firmware image 'iwlwifi-9260-th-b0-jf-b0-46.ucode'
iwlwifi0: WRT: Overriding region id 0
iwlwifi0: WRT: Overriding region id 1
iwlwifi0: WRT: Overriding region id 2
iwlwifi0: WRT: Overriding region id 3
iwlwifi0: WRT: Overriding region id 4
iwlwifi0: WRT: Overriding region id 6
iwlwifi0: WRT: Overriding region id 8
iwlwifi0: WRT: Overriding region id 9
iwlwifi0: WRT: Overriding region id 10
iwlwifi0: WRT: Overriding region id 11
iwlwifi0: WRT: Overriding region id 15
iwlwifi0: WRT: Overriding region id 16
iwlwifi0: WRT: Overriding region id 18
iwlwifi0: WRT: Overriding region id 19
iwlwifi0: WRT: Overriding region id 20
iwlwifi0: WRT: Overriding region id 21
iwlwifi0: WRT: Overriding region id 28
iwlwifi0: loaded firmware version 46.ff18e32a.0 9260-th-b0-jf-b0-46.ucode op_mode iwlmvm
iwlwifi0: Detected Intel(R) Wireless-AC 9260 160MHz, REV=0x321
iwlwifi0: SecBoot CPU1 Status: 0xa5a5a5a2, CPU2 Status: 0xa5a5a5a2
iwlwifi0: WFPM_ARC1_PD_NOTIFICATION: 0xa5a5a5a2
iwlwifi0: HPM_SECONDARY_DEVICE_STATE: 0xa5a5a5a2
iwlwifi0: WFPM_MAC_OTP_CFG7_ADDR: 0xa5a5a5a2
iwlwifi0: WFPM_MAC_OTP_CFG7_DATA: 0xa5a5a5a2
iwlwifi0: UMAC PC: 0xa5a5a5a2
iwlwifi0: LMAC PC: 0xa5a5a5a2
iwlwifi0: WRT: Collecting data: ini trigger 13 fired (delay=0ms).
iwlwifi0: Not valid error log pointer 0x00000000 for Init uCode
iwlwifi0: Hardware error detected. Restarting.
iwlwifi0: IML/ROM dump:
iwlwifi0: 0xA5A5 | IML/ROM SYSASSERT
iwlwifi0: 0xA5A5A5A2 | IML/ROM error/state
iwlwifi0: 0xA5A5A5A2 | IML/ROM data1
iwlwifi0: Fseq Registers:
iwlwifi0: 0xA5A5A5A2 | FSEQ_ERROR_CODE
iwlwifi0: 0xA5A5A5A2 | FSEQ_TOP_INIT_VERSION
iwlwifi0: 0xA5A5A5A2 | FSEQ_CNVIO_INIT_VERSION
iwlwifi0: 0xA5A5A5A2 | FSEQ_OTP_VERSION
iwlwifi0: 0xA5A5A5A2 | FSEQ_TOP_CONTENT_VERSION
iwlwifi0: 0xA5A5A5A2 | FSEQ_ALIVE_TOKEN
iwlwifi0: 0xA5A5A5A2 | FSEQ_CNVI_ID
iwlwifi0: 0xA5A5A5A2 | FSEQ_CNVR_ID
iwlwifi0: 0xA5A5A5A2 | CNVI_AUX_MISC_CHIP
iwlwifi0: 0xA5A5A5A2 | CNVR_AUX_MISC_CHIP
iwlwifi0: 0xA5A5A5A2 | CNVR_SCU_SD_REGS_SD_REG_DIG_DCDC_VTRIM
iwlwifi0: 0xA5A5A5A2 | CNVR_SCU_SD_REGS_SD_REG_ACTIVE_VDIG_MIRROR
iwlwifi0: 0xA5A5A5A2 | FSEQ_PREV_CNVIO_INIT_VERSION
iwlwifi0: 0xA5A5A5A2 | FSEQ_WIFI_FSEQ_VERSION
iwlwifi0: 0xA5A5A5A2 | FSEQ_BT_FSEQ_VERSION
iwlwifi0: 0xA5A5A5A2 | FSEQ_CLASS_TP_VERSION
iwlwifi0: Failed to start INIT ucode: -60
iwlwifi0: WRT: Collecting data: ini trigger 13 fired (delay=0ms).
iwlwifi0: Hardware error detected. Restarting.
iwlwifi0: Hardware error detected. Restarting.
iwlwifi0: WRT: Failed to dump region: id=1, type=10
iwlwifi0: Hardware error detected. Restarting.
iwlwifi0: Hardware error detected. Restarting.
iwlwifi0: Hardware error detected. Restarting.
iwlwifi0: Hardware error detected. Restarting.
iwlwifi0: Hardware error detected. Restarting.
iwlwifi0: WRT: Failed to dump region: id=21, type=10
iwlwifi0: WRT: Failed to dump region: id=1, type=10
iwlwifi0: WRT: Failed to dump region: id=21, type=10
iwlwifi0: Failing on timeout while stopping DMA channel 8 [0xa5a5a5a2]
iwlwifi0: Failed to run INIT ucode: -60
iwlwifi0: retry init count 0
iwlwifi0: Detected Intel(R) Wireless-AC 9260 160MHz, REV=0x321
iwlwifi0: base HW address: 8c:a9:82:fc:e8:9c, OTP minor version: 0x4
ELF ldconfig path: /lib /usr/lib /usr/lib/compat /usr/local/lib /usr/local/lib/compat/pkg /usr/local/lib/compat/pkg /usr/local/lib/perl5/5.34/mach/CORE /usr/local/llvm15/lib
32-bit compatibility ldconfig path: /usr/lib32
Setting hostname: fbsd15.bitblocks.com.
Setting up harvesting: PURE_RDRAND,[CALLOUT],[UMA],[FS_ATIME],SWI,INTERRUPT,NET_NG,[NET_ETHER],NET_TUN,MOUSE,KEYBOARD,ATTACH,CACHED
Feeding entropy: .
wlan0: Ethernet address: 8c:a9:82:fc:e8:9c
Created wlan(4) interfaces: wlan0.

[comes up multiuser]

Oct  4 10:20:14 fbsd15 ntpd[887]: error resolving pool 0.freebsd.pool.ntp.org: Name does not resolve (8)
Oct  4 10:20:15 fbsd15 ntpd[887]: error resolving pool 2.freebsd.pool.ntp.org: Name does not resolve (8)
iwlwifi0: No beacon heard and the time event is over already...
Oct  4 10:20:21 fbsd15 wpa_supplicant[352]: ioctl[SIOCS80211, op=20, val=0, arg_len=7]: Can't assign requested address
iwlwifi0: Couldn't drain frames for staid 0, status 0x8
iwlwifi0: lkpi_iv_newstate: error -5 during state transition 5 (RUN) -> 0 (INIT)
panic: lkpi_sta_scan_to_auth: lsta 0xfffff80004419000 state not NOTEXIST: 0x1

cpuid = 0
time = 1696440022
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00741dfb70
vpanic() at vpanic+0x132/frame 0xfffffe00741dfca0
panic() at panic+0x43/frame 0xfffffe00741dfd00
lkpi_sta_scan_to_auth() at lkpi_sta_scan_to_auth+0x602/frame 0xfffffe00741dfd80
lkpi_iv_newstate() at lkpi_iv_newstate+0x253/frame 0xfffffe00741dfdf0
ieee80211_newstate_cb() at ieee80211_newstate_cb+0x1e7/frame 0xfffffe00741dfe40
taskqueue_run_locked() at taskqueue_run_locked+0xab/frame 0xfffffe00741dfec0
taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe00741dfef0
fork_exit() at fork_exit+0x82/frame 0xfffffe00741dff30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00741dff30
--- trap 0x16, rip = 0x6bb99f17cba, rsp = 0x6bb9f531f48, rbp = 0x6bb9f531f60 ---
KDB: enter: panic
[ thread pid 0 tid 100264 ]
Stopped at      kdb_enter+0x32: movq    $0,0xe2a9d3(%rip)
db>
Comment 11 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-10-05 13:36:52 UTC
(In reply to Bakul Shah from comment #10)

Does this also happen if you power off your *host* for a few seconds and boot up again?  Given you say you have to reset PCI on the host (and now that doesn't help either), I don't want to rule out a bhyve/passthru problem.  Which FreeBSD version are you running on the host?

iwlwifi0: Detected Intel(R) Wireless-AC 9260 160MHz, REV=0x321
iwlwifi0: SecBoot CPU1 Status: 0xa5a5a5a2, CPU2 Status: 0xa5a5a5a2
^^^^^^^^^^^^^

0xa5a5a5a2 everywhere is a different problem, highly likely unrelated to iwlwifi or LinuxKPI.
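
As a rough illustration of why that dump stands out, the sketch below checks whether every word in a register dump is identical. This is a heuristic written for this thread, not driver code; the name dump_is_uniform is hypothetical. The idea is only that unrelated registers all reading back the same word (0xa5a5a5a2 in the log above) suggests the reads never returned real device state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Heuristic sketch, not driver code: report whether every value in a
 * register dump is the same word.  Unrelated registers that all read
 * back identically are unlikely to reflect real device state.
 */
static bool
dump_is_uniform(const uint32_t *regs, size_t n)
{
	if (n < 2)
		return (false);	/* too few samples to say anything */
	for (size_t i = 1; i < n; i++) {
		if (regs[i] != regs[0])
			return (false);
	}
	return (true);
}
```

Fed the values from comment 10, such a check would flag the dump, while the mixed values from comment 9 (0x3000001, 0x0, 0x2f, ...) would pass as a plausible dump.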
Comment 12 Bakul Shah 2023-10-06 21:51:00 UTC
It came up fine after I rebooted the host, but it crashed on the first "service netif restart wlan0":

service netif restart wlan0
Stopping wpa_supplicant.
iwlwifi0: Couldn't drain frames for staid 0, status 0x8
iwlwifi0: lkpi_iv_newstate: error -5 during state transition 5 (RUN) -> 0 (INIT)
Oct  6 07:33:31 fbsd15 dhclient[5603]: Interface wlan0 is down, dhclient exiting
Oct  6 07:33:31 fbsd15 dhclient[5603]: connection closed
Oct  6 07:33:31 fbsd15 dhclient[5603]: exiting.
Stopping Network: wlan0.
wlan0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=0
        ether 8c:a9:82:fc:e8:9c
        groups: wlan
        ssid "" channel 8 (2447 MHz 11g)
        regdomain FCC country US authmode OPEN privacy OFF txpower 30 bmiss 7
        scanvalid 60 protmode CTS wme
        parent interface: iwlwifi0
        media: IEEE 802.11 Wireless Ethernet autoselect (autoselect)
        status: no carrier
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x20:0xffffffff80cf1871
stack pointer           = 0x28:0xfffffe0074ad8ab0
frame pointer           = 0x28:0xfffffe0074ad8ac0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 17317 (ifconfig)
rdi: fffffe007cb81000 rsi: fffff8000103daa0 rdx: 0000000000000005
rcx: fffff8007e667c80  r8: fffff800bac2d788  r9: 00000000baba7800
rax: deadc0dedeadc0de rbx: fffffe007cb81000 rbp: fffffe0074ad8ac0
r10: 0000000000000000 r11: 0000000000010000 r12: fffffe007547c038
r13: fffffe007547c000 r14: deadc0dedeadc0de r15: fffff80001773800
trap number             = 9
panic: general protection fault
cpuid = 0
time = 1696602811
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0074ad87f0
vpanic() at vpanic+0x132/frame 0xfffffe0074ad8920
panic() at panic+0x43/frame 0xfffffe0074ad8980
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe0074ad89e0
calltrap() at calltrap+0x8/frame 0xfffffe0074ad89e0
--- trap 0x9, rip = 0xffffffff80cf1871, rsp = 0xfffffe0074ad8ab0, rbp = 0xfffffe0074ad8ac0 ---
node_free() at node_free+0x11/frame 0xfffffe0074ad8ac0
ieee80211_node_vdetach() at ieee80211_node_vdetach+0x2b/frame 0xfffffe0074ad8ae0
ieee80211_vap_detach() at ieee80211_vap_detach+0x612/frame 0xfffffe0074ad8b20
lkpi_ic_vap_delete() at lkpi_ic_vap_delete+0xae/frame 0xfffffe0074ad8b50
wlan_clone_destroy() at wlan_clone_destroy+0x12/frame 0xfffffe0074ad8b60
if_clone_destroyif_flags() at if_clone_destroyif_flags+0x6a/frame 0xfffffe0074ad8ba0
if_clone_destroy() at if_clone_destroy+0x100/frame 0xfffffe0074ad8be0
ifioctl() at ifioctl+0x8a5/frame 0xfffffe0074ad8cd0
kern_ioctl() at kern_ioctl+0x286/frame 0xfffffe0074ad8d30
sys_ioctl() at sys_ioctl+0x152/frame 0xfffffe0074ad8e00
amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe0074ad8f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0074ad8f30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x3c0f3a32629a, rsp = 0x3c0f37307a48, rbp = 0x3c0f37307a80 ---
KDB: enter: panic
[ thread pid 17317 tid 100521 ]
Stopped at      kdb_enter+0x32: movq    $0,0xe2a953(%rip)
db>
Comment 13 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-10-06 22:48:04 UTC
(In reply to Bakul Shah from comment #12)

That's the currently known one: a net80211 node reference count problem or race. That's a work in progress.
Comment 14 Bakul Shah 2023-10-11 05:26:21 UTC
Using "bhyve -G 1234" I was able to glean some more details.
"service netif restart wlan0" triggers a panic. The relevant part of the backtrace:

#16 <signal handler called>
#17 ieee80211_ratectl_node_deinit (ni=0xfffffe0075cb2000)
    at /home/FreeBSD/current/sys/net80211/ieee80211_ratectl.h:127
#18 node_free (ni=0xfffffe0075cb2000)
    at /home/FreeBSD/current/sys/net80211/ieee80211_node.c:1301
#19 0xffffffff80cf237b in ieee80211_node_vdetach (
    vap=vap@entry=0xfffffe00757e9010)
    at /home/FreeBSD/current/sys/net80211/ieee80211_node.c:206
#20 0xffffffff80cc52d2 in ieee80211_vap_detach (
    vap=vap@entry=0xfffffe00757e9010)

Poking around with gdb, notice the value of vap (printed at the end of the session below), which seems to be uninitialized, while ni seems OK. That seems weird. Is this a race condition, some precondition not being checked, or gdb lying? I will continue looking. Should this be reported under some other bug number?

(gdb) f 17
#17 ieee80211_ratectl_node_deinit (ni=0xfffffe0075cb2000)
    at /home/FreeBSD/current/sys/net80211/ieee80211_ratectl.h:127
127             vap->iv_rate->ir_node_deinit(ni);
(gdb) l
122     static __inline void
123     ieee80211_ratectl_node_deinit(struct ieee80211_node *ni)
124     {
125             const struct ieee80211vap *vap = ni->ni_vap;
126
127             vap->iv_rate->ir_node_deinit(ni);
128     }
129
130     static int __inline
131     ieee80211_ratectl_rate(struct ieee80211_node *ni, void *arg, uint32_t iarg)
(gdb) p vap
$6 = (const struct ieee80211vap *) 0xdeadc0dedeadc0de
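
One possible reading of that value, offered as an assumption rather than a diagnosis: 0xdeadc0de is the fill word FreeBSD debugging kernels (INVARIANTS) write over freed memory, so a pointer made of two such words may mean ni itself had already been freed when ni->ni_vap was read. The same value also appears in rax and r14 in the register dump in comment 12. A minimal sketch of that check follows; the helper name looks_like_freed_fill is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fill word assumed here: 0xdeadc0de repeated, as seen in the gdb output. */
#define FREED_FILL_32	UINT32_C(0xdeadc0de)

/* Hypothetical helper: true if a 64-bit pointer value is two fill words. */
static bool
looks_like_freed_fill(uint64_t ptrval)
{
	uint32_t lo = (uint32_t)(ptrval & UINT32_C(0xffffffff));
	uint32_t hi = (uint32_t)(ptrval >> 32);

	return (lo == FREED_FILL_32 && hi == FREED_FILL_32);
}
```

Under that assumption the vap pointer would not be uninitialized; it would be stale data read out of a freed node, which fits the reference count or race theory from comment 13.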
Comment 15 commit-hook freebsd_committer freebsd_triage 2024-02-14 19:50:18 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0936c648ad0ee5152dc19f261e77fe9c1833fe05

commit 0936c648ad0ee5152dc19f261e77fe9c1833fe05
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-05 14:51:08 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-14 19:48:04 +0000

    LinuxKPI: 802.11: update the ni/lsta reference cycle

    Update the ni/lsta reference cycle, add extra checks and assertions.
    This is to accomodate problems we were seeing based on net80211
    behaviour (join1() and (*iv_update_bss)() as well as state changes for
    new iv_bss nodes during an active session).
    This should hopefully help to stabilise behaviour until the underlying
    problems gets properly addressed (for this and all other device drivers).

    PR:             272607, 273985, 274003
    MFC after:      3 days
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43753

 sys/compat/linuxkpi/common/src/linux_80211.c | 209 +++++++++++++++++----------
 sys/compat/linuxkpi/common/src/linux_80211.h |   1 +
 2 files changed, 130 insertions(+), 80 deletions(-)
Comment 16 commit-hook freebsd_committer freebsd_triage 2024-02-14 19:50:30 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=2ac8a2189ac6707f48f77ef2e36baf696a0d2f40

commit 2ac8a2189ac6707f48f77ef2e36baf696a0d2f40
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-03 16:33:56 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-14 19:47:53 +0000

    LinuxKPI: 802.11: band-aid for invalid state changes after (*iv_update_bss)

    With firmware based solutions we cannot just jump from an active session
    to a new iv_bss node without tearing down state for the old and bringing
    up the new node.  This likely used to work on softmac based cards/drivers
    where one could essentially set the state and fire at will.

    We track (*iv_update_bss) calls from net80211 and set a local flag that
    we are out of synch and do not allow any further operations up the state
    machine until we hit INIT or SCAN.  That means someone will take the state
    down, clean up firmware state and then we can join again and build up
    state.

    Apparently this problem has been "known" for a while as native iwm(4) and
    others have similar workarounds (though less strict) and can be equally
    pestered into bad states.  For LinuxKPI all the KASSERTs just massively
    brought this problem out.  The solution will be some rewrites in net80211.
    Until then, try to keep us more stable at least and not die on second
    join1() calls triggered by service netif start wlan0 and similar.

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (2023, partial)
    MFC after:      3 days
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43725

 sys/compat/linuxkpi/common/src/linux_80211.c | 309 +++++++++++++++++++--------
 sys/compat/linuxkpi/common/src/linux_80211.h |   2 +
 2 files changed, 216 insertions(+), 95 deletions(-)
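
The band-aid described in that commit message can be pictured with the simplified sketch below. It is not the linux_80211.c code: the sketch_ names, the reduced state enum, and the -1 return value are all invented for illustration. It only models the stated rule that, once a BSS update has been seen during an active session, transitions up the state machine are refused until INIT or SCAN is reached.

```c
#include <stdbool.h>

/* Simplified stand-in for the net80211 states; only an ordering is needed. */
enum sketch_state { S_INIT, S_SCAN, S_AUTH, S_ASSOC, S_RUN };

struct sketch_vif {
	enum sketch_state state;
	bool out_of_sync;	/* set when the BSS node was swapped under us */
};

/* Called when the stack replaces the BSS node during an active session. */
static void
sketch_update_bss(struct sketch_vif *vif)
{
	if (vif->state > S_SCAN)
		vif->out_of_sync = true;
}

/* Returns 0 if the transition was applied, -1 if it was refused. */
static int
sketch_newstate(struct sketch_vif *vif, enum sketch_state nstate)
{
	/* Reaching INIT or SCAN means state was torn down; clear the flag. */
	if (nstate == S_INIT || nstate == S_SCAN)
		vif->out_of_sync = false;
	/* While out of sync, refuse to move up the state machine. */
	else if (vif->out_of_sync && nstate > vif->state)
		return (-1);
	vif->state = nstate;
	return (0);
}
```

In this model a second join during an active session cannot progress until something takes the interface back down to INIT or SCAN, which matches the "someone will take the state down" wording above.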
Comment 17 commit-hook freebsd_committer freebsd_triage 2024-02-14 19:50:31 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=713db49d06deee90dd358b2e4b9ca05368a5eaf6

commit 713db49d06deee90dd358b2e4b9ca05368a5eaf6
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-01-10 10:14:16 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-14 19:47:21 +0000

    net80211: deal with lost state transitions

    Since 5efea30f039c4 we can possibly lose a state transition which can
    cause trouble further down the road.
    The reproducer from 643d6dce6c1e can trigger these for example.
    Drivers for firmware based wireless cards have worked around some of
    this (and other) problems in the past.

    Add an array of tasks rather than a single one as we would simply
    get npending > 1 and lose order with other tasks.  Try to keep state
    changes updated as queued in case we end up with more than one at a
    time.  While this is not ideal either (call it a hack) it will sort
    the problem for now.
    We will queue in ieee80211_new_state_locked() and do checks there
    and dequeue in ieee80211_newstate_cb().
    If we still overrun the (currently) 8 slots we will drop the state
    change rather than overwrite the last one.
    When dequeing we will update iv_nstate and keep it around for historic
    reasons for the moment.

    The longer term we should make the callers of
    ieee80211_new_state[_locked]() actually use the returned errors
    and act appropriately but that will touch a lot more places and
    drivers (possibly incl. changed behaviour for ioctls).

    rtwn(4) and rum(4) should probably be revisited and net80211 internals
    removed (for rum(4) at least the current logic still seems prone to
    races).

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (in 2023)
    MFC after:      3 days
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43389

 sys/dev/rtwn/if_rtwn.c         |   4 +-
 sys/dev/usb/wlan/if_rum.c      |   4 +-
 sys/net80211/ieee80211.c       |   4 +-
 sys/net80211/ieee80211_ddb.c   |  13 ++++-
 sys/net80211/ieee80211_proto.c | 124 ++++++++++++++++++++++++++++++++++-------
 sys/net80211/ieee80211_var.h   |  13 ++++-
 6 files changed, 134 insertions(+), 28 deletions(-)
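The queueing idea described in the "net80211: deal with lost state transitions" commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the committed code: nstate_queue, nstate_enqueue(), nstate_dequeue() and the sketch_state enum are hypothetical names, and the real change uses an array of tasks run from a taskqueue with the appropriate locking, all of which is left out here. Only the slot count of 8 and the drop-on-overrun behaviour are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the net80211 code. */
#define NSTATE_SLOTS	8	/* the commit message mentions (currently) 8 slots */

enum sketch_state { S_INIT, S_SCAN, S_AUTH, S_ASSOC, S_RUN };

struct nstate_queue {
	enum sketch_state slots[NSTATE_SLOTS];
	size_t head;	/* index of the oldest queued transition */
	size_t count;	/* number of queued transitions */
};

/*
 * Queue a requested transition.  On overrun the new request is dropped
 * and the already queued ones are left untouched.
 */
static bool
nstate_enqueue(struct nstate_queue *q, enum sketch_state s)
{
	assert(q != NULL);
	if (q->count == NSTATE_SLOTS)
		return (false);
	q->slots[(q->head + q->count) % NSTATE_SLOTS] = s;
	q->count++;
	return (true);
}

/* Hand out the oldest queued transition so that order is preserved. */
static bool
nstate_dequeue(struct nstate_queue *q, enum sketch_state *out)
{
	assert(q != NULL && out != NULL);
	if (q->count == 0)
		return (false);
	*out = q->slots[q->head];
	q->head = (q->head + 1) % NSTATE_SLOTS;
	q->count--;
	return (true);
}
```

In this sketch the caller learns about a dropped request through the false return value, which mirrors the commit message's remark that callers should eventually act on the errors returned when a state change cannot be queued.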
Comment 18 commit-hook freebsd_committer freebsd_triage 2024-02-18 21:12:06 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=12887199b37469c98a47baf66cd3cc182c79fbd6

commit 12887199b37469c98a47baf66cd3cc182c79fbd6
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-05 14:51:08 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-18 18:31:17 +0000

    LinuxKPI: 802.11: update the ni/lsta reference cycle

    Update the ni/lsta reference cycle, add extra checks and assertions.
    This is to accommodate problems we were seeing based on net80211
    behaviour (join1() and (*iv_update_bss)() as well as state changes for
    new iv_bss nodes during an active session).
    This should hopefully help to stabilise behaviour until the underlying
    problems get properly addressed (for this and all other device drivers).

    PR:             272607, 273985, 274003
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43753

    (cherry picked from commit 0936c648ad0ee5152dc19f261e77fe9c1833fe05)

 sys/compat/linuxkpi/common/src/linux_80211.c | 209 +++++++++++++++++----------
 sys/compat/linuxkpi/common/src/linux_80211.h |   1 +
 2 files changed, 130 insertions(+), 80 deletions(-)
Comment 19 commit-hook freebsd_committer freebsd_triage 2024-02-18 21:12:22 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b392b36d3776b696601ce0253256803276d24ea2

commit b392b36d3776b696601ce0253256803276d24ea2
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-01-10 10:14:16 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-18 18:31:17 +0000

    net80211: deal with lost state transitions

    Since 5efea30f039c4 we can possibly lose a state transition which can
    cause trouble further down the road.
    The reproducer from 643d6dce6c1e can trigger these for example.
    Drivers for firmware based wireless cards have worked around some of
    this (and other) problems in the past.

    Add an array of tasks rather than a single one as we would simply
    get npending > 1 and lose order with other tasks.  Try to keep state
    changes updated as queued in case we end up with more than one at a
    time.  While this is not ideal either (call it a hack) it will sort
    the problem for now.
    We will queue in ieee80211_new_state_locked() and do checks there
    and dequeue in ieee80211_newstate_cb().
    If we still overrun the (currently) 8 slots we will drop the state
    change rather than overwrite the last one.
    When dequeuing we will update iv_nstate and keep it around for historic
    reasons for the moment.

    The longer term we should make the callers of
    ieee80211_new_state[_locked]() actually use the returned errors
    and act appropriately but that will touch a lot more places and
    drivers (possibly incl. changed behaviour for ioctls).

    rtwn(4) and rum(4) should probably be revisited and net80211 internals
    removed (for rum(4) at least the current logic still seems prone to
    races).

    Given this changes the internal structure of 'struct ieee80211vap',
    which gets allocated by the drivers, and we do not have enough
    spares, all wireless drivers need to be recompiled.
    Given we are forced to do the update, we leave fields in the middle
    of the struct and add more spares at the same time.
    __FreeBSD_version gets updated to 1400509 to be able to detect
    this change.

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (in 2023)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43389

    (cherry picked from commit 713db49d06deee90dd358b2e4b9ca05368a5eaf6)
    (cherry picked from commit a890a3a5ddf33acb0a4000885945b89156799b07)

 UPDATING                       |   6 ++
 sys/dev/rtwn/if_rtwn.c         |   4 +-
 sys/dev/usb/wlan/if_rum.c      |   4 +-
 sys/net80211/ieee80211.c       |   4 +-
 sys/net80211/ieee80211_ddb.c   |  13 ++++-
 sys/net80211/ieee80211_proto.c | 124 ++++++++++++++++++++++++++++++++++-------
 sys/net80211/ieee80211_var.h   |  15 +++--
 sys/sys/param.h                |   2 +-
 8 files changed, 142 insertions(+), 30 deletions(-)
Comment 20 commit-hook freebsd_committer freebsd_triage 2024-02-18 21:12:23 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=8c450ea1083b03f30871506b59034f26bc608972

commit 8c450ea1083b03f30871506b59034f26bc608972
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-03 16:33:56 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-18 18:31:17 +0000

    LinuxKPI: 802.11: band-aid for invalid state changes after (*iv_update_bss)

    With firmware based solutions we cannot just jump from an active session
    to a new iv_bss node without tearing down state for the old and bringing
    up the new node.  This likely used to work on softmac based cards/drivers
    where one could essentially set the state and fire at will.

    We track (*iv_update_bss) calls from net80211 and set a local flag that
    we are out of synch and do not allow any further operations up the state
    machine until we hit INIT or SCAN.  That means someone will take the state
    down, clean up firmware state and then we can join again and build up
    state.

    Apparently this problem has been "known" for a while as native iwm(4) and
    others have similar workarounds (though less strict) and can be equally
    pestered into bad states.  For LinuxKPI all the KASSERTs just massively
    brought this problem out.  The solution will be some rewrites in net80211.
    Until then, try to keep us more stable at least and not die on second
    join1() calls triggered by service netif start wlan0 and similar.

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (2023, partial)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43725

    (cherry picked from commit 2ac8a2189ac6707f48f77ef2e36baf696a0d2f40)

 sys/compat/linuxkpi/common/src/linux_80211.c | 309 +++++++++++++++++++--------
 sys/compat/linuxkpi/common/src/linux_80211.h |   2 +
 2 files changed, 216 insertions(+), 95 deletions(-)
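The band-aid described in the commit message above (remember that the bss node was swapped, then refuse to move up the state machine until INIT or SCAN has been reached) can be sketched as below. This is only an illustration under stated assumptions: sketch_vif, sketch_update_bss(), sketch_newstate() and the bss_out_of_sync field are hypothetical names and not the LinuxKPI ones, and the condition under which the flag gets set (a swap while past SCAN) is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the LinuxKPI 802.11 code. */
enum sketch_state { S_INIT, S_SCAN, S_AUTH, S_ASSOC, S_RUN };

struct sketch_vif {
	enum sketch_state state;
	bool bss_out_of_sync;	/* bss node was swapped during a session */
};

/*
 * Stand-in for the (*iv_update_bss) hook: remember that the firmware
 * state no longer matches the node the stack wants to use.
 * Assumption: only a swap during an active session (past SCAN) matters.
 */
static void
sketch_update_bss(struct sketch_vif *vif)
{
	assert(vif != NULL);
	if (vif->state > S_SCAN)
		vif->bss_out_of_sync = true;
}

/* Returns 0 on success, -1 if the transition is refused. */
static int
sketch_newstate(struct sketch_vif *vif, enum sketch_state nstate)
{
	assert(vif != NULL);
	if (nstate == S_INIT || nstate == S_SCAN) {
		/* State is taken down; firmware cleanup would happen here. */
		vif->bss_out_of_sync = false;
	} else if (vif->bss_out_of_sync) {
		/* Refuse to build up state on top of stale firmware state. */
		return (-1);
	}
	vif->state = nstate;
	return (0);
}
```

The point of the design is that a refused transition leaves the old state in place, so something has to drive the interface down to INIT or SCAN before a new join can succeed.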
Comment 21 commit-hook freebsd_committer freebsd_triage 2024-02-19 08:09:12 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=184ccc414686ea32c64f063c081c7cc1adeae7c3

commit 184ccc414686ea32c64f063c081c7cc1adeae7c3
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-03 16:33:56 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 08:02:02 +0000

    LinuxKPI: 802.11: band-aid for invalid state changes after (*iv_update_bss)

    With firmware based solutions we cannot just jump from an active session
    to a new iv_bss node without tearing down state for the old and bringing
    up the new node.  This likely used to work on softmac based cards/drivers
    where one could essentially set the state and fire at will.

    We track (*iv_update_bss) calls from net80211 and set a local flag that
    we are out of synch and do not allow any further operations up the state
    machine until we hit INIT or SCAN.  That means someone will take the state
    down, clean up firmware state and then we can join again and build up
    state.

    Apparently this problem has been "known" for a while as native iwm(4) and
    others have similar workarounds (though less strict) and can be equally
    pestered into bad states.  For LinuxKPI all the KASSERTs just massively
    brought this problem out.  The solution will be some rewrites in net80211.
    Until then, try to keep us more stable at least and not die on second
    join1() calls triggered by service netif start wlan0 and similar.

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (2023, partial)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43725

    (cherry picked from commit 2ac8a2189ac6707f48f77ef2e36baf696a0d2f40)

 sys/compat/linuxkpi/common/src/linux_80211.c | 309 +++++++++++++++++++--------
 sys/compat/linuxkpi/common/src/linux_80211.h |   2 +
 2 files changed, 216 insertions(+), 95 deletions(-)
Comment 22 commit-hook freebsd_committer freebsd_triage 2024-02-19 08:09:26 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a7e1fc7f620d3341549c1380f550aaafbdb45622

commit a7e1fc7f620d3341549c1380f550aaafbdb45622
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-01-10 10:14:16 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 08:02:01 +0000

    net80211: deal with lost state transitions

    Since 5efea30f039c4 we can possibly lose a state transition which can
    cause trouble further down the road.
    The reproducer from 643d6dce6c1e can trigger these for example.
    Drivers for firmware based wireless cards have worked around some of
    this (and other) problems in the past.

    Add an array of tasks rather than a single one as we would simply
    get npending > 1 and lose order with other tasks.  Try to keep state
    changes updated as queued in case we end up with more than one at a
    time.  While this is not ideal either (call it a hack) it will sort
    the problem for now.
    We will queue in ieee80211_new_state_locked() and do checks there
    and dequeue in ieee80211_newstate_cb().
    If we still overrun the (currently) 8 slots we will drop the state
    change rather than overwrite the last one.
    When dequeuing we will update iv_nstate and keep it around for historic
    reasons for the moment.

    The longer term we should make the callers of
    ieee80211_new_state[_locked]() actually use the returned errors
    and act appropriately but that will touch a lot more places and
    drivers (possibly incl. changed behaviour for ioctls).

    rtwn(4) and rum(4) should probably be revisited and net80211 internals
    removed (for rum(4) at least the current logic still seems prone to
    races).

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (in 2023)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43389

    (cherry picked from commit 713db49d06deee90dd358b2e4b9ca05368a5eaf6)

    Given this changes the internal structure of 'struct ieee80211vap',
    which gets allocated by the drivers, and we do not have enough
    spares, all wireless drivers need to be recompiled.
    Given we are forced to do the update, we leave fields in the middle
    of the struct and add more spares at the same time.
    __FreeBSD_version gets updated to 1303501 to be able to detect
    this change.

    (cherry picked from commit a890a3a5ddf33acb0a4000885945b89156799b07)

 UPDATING                       |   6 ++
 sys/dev/rtwn/if_rtwn.c         |   4 +-
 sys/dev/usb/wlan/if_rum.c      |   4 +-
 sys/net80211/ieee80211.c       |   4 +-
 sys/net80211/ieee80211_ddb.c   |  15 ++++-
 sys/net80211/ieee80211_proto.c | 124 ++++++++++++++++++++++++++++++++++-------
 sys/net80211/ieee80211_var.h   |  18 +++---
 sys/sys/param.h                |   2 +-
 8 files changed, 143 insertions(+), 34 deletions(-)
Comment 23 commit-hook freebsd_committer freebsd_triage 2024-02-19 08:09:37 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=223edc1a3c2fc86dbc7fa0ecd00f26a85d7c7b43

commit 223edc1a3c2fc86dbc7fa0ecd00f26a85d7c7b43
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-05 14:51:08 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 08:02:02 +0000

    LinuxKPI: 802.11: update the ni/lsta reference cycle

    Update the ni/lsta reference cycle, add extra checks and assertions.
    This is to accommodate problems we were seeing based on net80211
    behaviour (join1() and (*iv_update_bss)() as well as state changes for
    new iv_bss nodes during an active session).
    This should hopefully help to stabilise behaviour until the underlying
    problems get properly addressed (for this and all other device drivers).

    PR:             272607, 273985, 274003
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43753

    (cherry picked from commit 0936c648ad0ee5152dc19f261e77fe9c1833fe05)

 sys/compat/linuxkpi/common/src/linux_80211.c | 209 +++++++++++++++++----------
 sys/compat/linuxkpi/common/src/linux_80211.h |   1 +
 2 files changed, 130 insertions(+), 80 deletions(-)
Comment 24 commit-hook freebsd_committer freebsd_triage 2024-02-19 16:10:41 UTC
A commit in branch releng/13.3 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9b2da4bc5a68294bc1dcfdd0d0ccadf747bafd67

commit 9b2da4bc5a68294bc1dcfdd0d0ccadf747bafd67
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-05 14:51:08 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 16:09:22 +0000

    LinuxKPI: 802.11: update the ni/lsta reference cycle

    Update the ni/lsta reference cycle, add extra checks and assertions.
    This is to accommodate problems we were seeing based on net80211
    behaviour (join1() and (*iv_update_bss)() as well as state changes for
    new iv_bss nodes during an active session).
    This should hopefully help to stabilise behaviour until the underlying
    problems get properly addressed (for this and all other device drivers).

    Approved by:    re (cperciva)
    PR:             272607, 273985, 274003
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43753

    (cherry picked from commit 0936c648ad0ee5152dc19f261e77fe9c1833fe05)
    (cherry picked from commit 223edc1a3c2fc86dbc7fa0ecd00f26a85d7c7b43)

 sys/compat/linuxkpi/common/src/linux_80211.c | 209 +++++++++++++++++----------
 sys/compat/linuxkpi/common/src/linux_80211.h |   1 +
 2 files changed, 130 insertions(+), 80 deletions(-)
Comment 25 commit-hook freebsd_committer freebsd_triage 2024-02-19 16:10:55 UTC
A commit in branch releng/13.3 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d4b4efc6db6c6c3a9abf2f187ba1ccc0e40028cf

commit d4b4efc6db6c6c3a9abf2f187ba1ccc0e40028cf
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-03 16:33:56 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 16:09:22 +0000

    LinuxKPI: 802.11: band-aid for invalid state changes after (*iv_update_bss)

    With firmware based solutions we cannot just jump from an active session
    to a new iv_bss node without tearing down state for the old and bringing
    up the new node.  This likely used to work on softmac based cards/drivers
    where one could essentially set the state and fire at will.

    We track (*iv_update_bss) calls from net80211 and set a local flag that
    we are out of synch and do not allow any further operations up the state
    machine until we hit INIT or SCAN.  That means someone will take the state
    down, clean up firmware state and then we can join again and build up
    state.

    Apparently this problem has been "known" for a while as native iwm(4) and
    others have similar workarounds (though less strict) and can be equally
    pestered into bad states.  For LinuxKPI all the KASSERTs just massively
    brought this problem out.  The solution will be some rewrites in net80211.
    Until then, try to keep us more stable at least and not die on second
    join1() calls triggered by service netif start wlan0 and similar.

    Approved by:    re (cperciva)
    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (2023, partial)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43725

    (cherry picked from commit 2ac8a2189ac6707f48f77ef2e36baf696a0d2f40)
    (cherry picked from commit 184ccc414686ea32c64f063c081c7cc1adeae7c3)

 sys/compat/linuxkpi/common/src/linux_80211.c | 309 +++++++++++++++++++--------
 sys/compat/linuxkpi/common/src/linux_80211.h |   2 +
 2 files changed, 216 insertions(+), 95 deletions(-)
Comment 26 commit-hook freebsd_committer freebsd_triage 2024-02-19 16:11:04 UTC
A commit in branch releng/13.3 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9b998db87c28356fce21784c4f8bfb8737615e1f

commit 9b998db87c28356fce21784c4f8bfb8737615e1f
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-01-10 10:14:16 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 16:07:20 +0000

    net80211: deal with lost state transitions

    Since 5efea30f039c4 we can possibly lose a state transition which can
    cause trouble further down the road.
    The reproducer from 643d6dce6c1e can trigger these for example.
    Drivers for firmware based wireless cards have worked around some of
    this (and other) problems in the past.

    Add an array of tasks rather than a single one as we would simply
    get npending > 1 and lose order with other tasks.  Try to keep state
    changes updated as queued in case we end up with more than one at a
    time.  While this is not ideal either (call it a hack) it will sort
    the problem for now.
    We will queue in ieee80211_new_state_locked() and do checks there
    and dequeue in ieee80211_newstate_cb().
    If we still overrun the (currently) 8 slots we will drop the state
    change rather than overwrite the last one.
    When dequeuing we will update iv_nstate and keep it around for historic
    reasons for the moment.

    The longer term we should make the callers of
    ieee80211_new_state[_locked]() actually use the returned errors
    and act appropriately but that will touch a lot more places and
    drivers (possibly incl. changed behaviour for ioctls).

    rtwn(4) and rum(4) should probably be revisited and net80211 internals
    removed (for rum(4) at least the current logic still seems prone to
    races).

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (in 2023)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43389

    (cherry picked from commit 713db49d06deee90dd358b2e4b9ca05368a5eaf6)

    Given this changes the internal structure of 'struct ieee80211vap',
    which gets allocated by the drivers, and we do not have enough
    spares, all wireless drivers need to be recompiled.
    Given we are forced to do the update, we leave fields in the middle
    of the struct and add more spares at the same time.
    __FreeBSD_version will get updated to 1303001 to be able to detect
    this change.

    Approved by:    re (cperciva)

    (cherry picked from commit a890a3a5ddf33acb0a4000885945b89156799b07)
    (cherry picked from commit a7e1fc7f620d3341549c1380f550aaafbdb45622)

 sys/dev/rtwn/if_rtwn.c         |   4 +-
 sys/dev/usb/wlan/if_rum.c      |   4 +-
 sys/net80211/ieee80211.c       |   4 +-
 sys/net80211/ieee80211_ddb.c   |  15 ++++-
 sys/net80211/ieee80211_proto.c | 124 ++++++++++++++++++++++++++++++++++-------
 sys/net80211/ieee80211_var.h   |  18 +++---
 6 files changed, 136 insertions(+), 33 deletions(-)
Comment 27 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-02-19 16:32:07 UTC
I believe all reports should be fixed in all branches now.
Any further problems are better tracked individually at this point.
Please check the bugs listed in the iwlwifi meta-bug https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=273620 and add to one of those, re-open one, or open a new one.

Thanks for all the testing and reporting!
Comment 28 Bakul Shah 2024-02-19 20:42:22 UTC
Still panics in a VM (same setup as in comment 1). Running ef75877fc2d9

I did

# ifconfig wlan create wlandev iwlwifi0
# wpa_supplicant -i wlan0 -c /etc/wpa_supplicant.conf &
<after a while>
# ifconfig wlan0 down
wlan0: CTRL-EVENT-DISCONNECTED bssid=XXXXXXXXX reason=3 locally_generated=1
Feb 19 12:33:10 fbsd15 dhclient[1386]: Interface wlan0 is down, dhclient exiting
iwlwifi0: Couldn't drain frames for staid 0, status 0x8
iwlwifi0: lkpi_sta_run_to_init:2304: mo_sta_state(NOTEXIST) failed: -5
iwlwifi0: lkpi_iv_newstate: error -5 during state transition 5 (RUN) -> 0 (INIT)


and it panicked; in gdb:

(gdb) f 1
#1  0xffffffff80d06653 in ieee80211_newstate_cb (xvap=0xfffffe00750cc010,
    npending=<optimized out>)
    at /home/FreeBSD/current/sys/net80211/ieee80211_proto.c:2616
2616                    KASSERT(nstate != IEEE80211_S_INIT,
Comment 29 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-02-19 21:01:17 UTC
(In reply to Bakul Shah from comment #28)

Still a 9260?

Given you are only doing an ifconfig down, the panic is different from #c1, and also not the #c9, #c10, #c12 or #c14 bits.  I've seen this elsewhere as well.  We'll need to go and see if there's a bss_conf update on the pre-22000 cards which removes the sta for us.  I'll try to find the PR.

Meanwhile, can you email me the full gdb backtrace to bz@?
Comment 30 Bakul Shah 2024-02-19 21:14:18 UTC
Still a 9260! Yes, there may have been other bugs. This one is pretty easy to trigger, but once in a while the system also panics when running wpa_supplicant, before ifconfig wlan0 down, in the same way, in ieee80211_newstate_cb().

There is not much to the gdb backtrace, see below, but I can try to extract more info.

(gdb) where
#0  panic (fmt=0xffffffff811ca7a4 "INIT state change failed")
    at /home/FreeBSD/current/sys/kern/kern_shutdown.c:888
#1  0xffffffff80d06653 in ieee80211_newstate_cb (xvap=0xfffffe0074e08010,
    npending=<optimized out>)
    at /home/FreeBSD/current/sys/net80211/ieee80211_proto.c:2616
#2  0xffffffff80bbb41b in taskqueue_run_locked (
    queue=queue@entry=0xfffff800038caa00)
    at /home/FreeBSD/current/sys/kern/subr_taskqueue.c:517
#3  0xffffffff80bbc4d3 in taskqueue_thread_loop (
    arg=arg@entry=0xfffffe0075006110)
    at /home/FreeBSD/current/sys/kern/subr_taskqueue.c:829
#4  0xffffffff80b09882 in fork_exit (
    callout=0xffffffff80bbc400 <taskqueue_thread_loop>,
    arg=0xfffffe0075006110, frame=0xfffffe007411cf40)
    at /home/FreeBSD/current/sys/kern/kern_fork.c:1157
#5  <signal handler called>
#6  0x0000174eb7b2751a in ?? ()
Backtrace stopped: Cannot access memory at address 0x174ec2d27f48
Comment 31 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-02-19 21:41:17 UTC
(In reply to Bakul Shah from comment #30)

Let us move this to PR 275255.  It seems to be the same problem as in #c28 and is also with a pre-22000 card.