Bug 275255 - iwlwifi: panic after iwlwifi0: lkpi_iv_newstate: error -5 during state transition 5 (RUN) -> 0 (INIT) (8xxx/9xxx chipsets)
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: wireless
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Bjoern A. Zeeb
URL:
Keywords:
Depends on:
Blocks: iwlwifi
Reported: 2023-11-22 06:41 UTC by Xin LI
Modified: 2024-06-14 20:05 UTC

See Also:
bz: mfc-stable14+
bz: mfc-stable13+


Description Xin LI freebsd_committer freebsd_triage 2023-11-22 06:41:19 UTC
(This is a laptop running with fresh -CURRENT; the device was:

iwlwifi0: <iwlwifi> mem 0xecc00000-0xecc01fff at device 0.0 on pci3
iwlwifi0: Detected crf-id 0xbadcafe, cnv-id 0x10 wfpm id 0x80000000
iwlwifi0: PCI dev 24fd/0010, rev=0x230, rfid=0xd55555d5
iwlwifi0: successfully loaded firmware image 'iwlwifi-8265-36.ucode'
iwlwifi0: loaded firmware version 36.ca7b901d.0 8265-36.ucode op_mode iwlmvm
iwlwifi0: Detected Intel(R) Dual Band Wireless AC 8265, REV=0x230

The kernel is built with WITNESS / INVARIANT enabled.

It seems that the 802.11 stack was trying to transition from RUN to INIT, and the driver returned -EIO because the firmware reported ADD_STA_MODIFY_NON_EXISTING_STA (=0x8) in iwl_mvm_drain_sta().

)
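
For readers following along, the error path described in the note above can be sketched roughly as below. This is only an illustration, not the driver's code: every drain_* name is hypothetical, the 0x8 value is the one quoted in this report, and the success value and status mask are assumptions.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical names; only the 0x8 value is taken from this report. */
#define DRAIN_STA_SUCCESS			0x1	/* assumed value */
#define DRAIN_STA_MODIFY_NON_EXISTING_STA	0x8	/* quoted in the report */
#define DRAIN_STA_STATUS_MASK			0xFF	/* assumed mask */

/*
 * Translate the firmware's reply to a "drain frames" request into a
 * Linux-style negative errno.  Anything other than success, including
 * "station does not exist" (0x8), is reported as -EIO.
 */
static int
drain_status_to_errno(uint32_t status)
{
	switch (status & DRAIN_STA_STATUS_MASK) {
	case DRAIN_STA_SUCCESS:
		return (0);
	default:
		return (-EIO);
	}
}
```

With status 0x8 the helper returns -EIO, which is the -5 that shows up in the lkpi_iv_newstate message below.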


Tue Nov 21 22:33:33 PST 2023

FreeBSD p51.home.us.delphij.net 15.0-CURRENT FreeBSD 15.0-CURRENT #1 main-n266520-f930dac6d584: Mon Nov 20 15:48:41 PST 2023     delphij@p51.home.us.delphij.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64

panic: INIT state change failed

GNU gdb (GDB) 13.2 [GDB v13.2 for FreeBSD]
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd15.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:
iwlwifi0: linuxkpi_ieee80211_connection_loss: vif 0xfffffe01773abc80 vap 0xfffffe01773ab010 state RUN
<6>wlan0: link state changed to DOWN
<118>Nov 21 22:32:11 p51 wpa_supplicant[423]: ioctl[SIOCS80211, op=20, val=0, arg_len=7]: Can't assign requested address
iwlwifi0: Couldn't drain frames for staid 0, status 0x8
iwlwifi0: lkpi_sta_run_to_init:1954: mo_sta_state(NOTEXIST) failed: -5
iwlwifi0: lkpi_iv_newstate: error -5 during state transition 5 (RUN) -> 0 (INIT)
Dumping 2446 out of 32422 MB: (CTRL-C to abort)  (CTRL-C to abort) ..1% (CTRL-C to abort)  (CTRL-C to abort)  (CTRL-C to abort) ..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
57		__asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
        td = <optimized out>
#1  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:405
        error = 0
        coredump = <optimized out>
#2  0xffffffff85dee143 in vt_kms_postswitch () from /boot/modules/drm.ko
No symbol table info available.
#3  0xffffffff8099ac81 in vt_window_switch (vw=0xfffff8000233bd80, 
    vw@entry=0xffffffff816a9c98 <vt_conswindow>)
    at /usr/src/sys/dev/vt/vt_core.c:612
        vd = 0xffffffff816a9de8 <vt_consdev>
        curvw = 0xfffff80006dfcd80
        kbd = <optimized out>
#4  0xffffffff8099bfdf in vtterm_cngrab (tm=<unavailable>, 
    tm@entry=<error reading variable: value is not available>)
    at /usr/src/sys/dev/vt/vt_core.c:1863
        vw = 0xffffffff816a9c98 <vt_conswindow>
        vd = 0xffffffff816a9de8 <vt_consdev>
#5  0xffffffff80aeb106 in cngrab () at /usr/src/sys/kern/kern_cons.c:385
        cnd = 0xffffffff8196d7e0 <cn_devtab>
        cn = <unavailable>
#6  0xffffffff80b5bd7f in vpanic (
    fmt=0xffffffff8120c4c9 "INIT state change failed", 
    ap=ap@entry=0xffffffff82761dd0) at /usr/src/sys/kern/kern_shutdown.c:942
        buf = "INIT state change failed", '\000' <repeats 231 times>
        __pc = <optimized out>
        __pc = <optimized out>
        __pc = <optimized out>
        other_cpus = {__bits = {127, 0 <repeats 15 times>}}
        td = 0xfffff80001f31000
        bootopt = 256
        newpanic = <optimized out>
#7  0xffffffff80b5bbf3 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:894
        ap = {{gp_offset = 8, fp_offset = 48, 
            overflow_arg_area = 0xffffffff82761e00, 
            reg_save_area = 0xffffffff82761da0}}
#8  0xffffffff80d104e1 in ieee80211_newstate_cb (xvap=0xfffffe01773ab010, 
    npending=<optimized out>) at /usr/src/sys/net80211/ieee80211_proto.c:2552
        vap = 0xfffffe01773ab010
        ic = <optimized out>
        arg = 0
        ostate = IEEE80211_S_RUN
        rc = -5
        nstate = <optimized out>
#9  0xffffffff80bc1f8b in taskqueue_run_locked (
    queue=queue@entry=0xfffff800028c6000)
    at /usr/src/sys/kern/subr_taskqueue.c:512
        et = {et_link = {tqe_next = 0x0, tqe_prev = 0x8}, 
          et_td = 0xffffffff811ba967, et_section = {bucket = 0}, 
          et_old_priority = 0 '\000'}
        tb = {tb_running = 0xfffffe01773ab320, tb_seq = 25, 
          tb_canceling = false, tb_link = {le_next = 0x0, 
            le_prev = 0xfffff800028c6010}}
        in_net_epoch = false
        task = 0xfffffe01773ab320
        pending = 1
#10 0xffffffff80bc3043 in taskqueue_thread_loop (
    arg=arg@entry=0xfffffe0176a55110)
    at /usr/src/sys/kern/subr_taskqueue.c:824
        tqp = <optimized out>
        tq = 0xfffff800028c6000
#11 0xffffffff80b11372 in fork_exit (
    callout=0xffffffff80bc2f70 <taskqueue_thread_loop>, 
    arg=0xfffffe0176a55110, frame=0xffffffff82761f40)
    at /usr/src/sys/kern/kern_fork.c:1160
        __pc = <optimized out>
        __pc = <optimized out>
        td = 0xfffff80001f31000
        p = 0xffffffff8196c4c0 <proc0>
        dtd = <optimized out>
#12 <signal handler called>
No locals.
#13 0x00001dd895cec5ba in ?? ()
No symbol table info available.
Backtrace stopped: Cannot access memory at address 0x1dd89daf8f48
(kgdb)
Comment 1 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-11-22 22:30:20 UTC
Despite looking different, this is a duplicate of bug 271979; the states differ as a result of the node swap from net80211.

*** This bug has been marked as a duplicate of bug 271979 ***
Comment 2 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-11-30 22:51:59 UTC
I'll re-open this.

lkpi_sta_run_to_init() should not be affected by the problem in 271979 given it does not re-lookup ni -- at least not in my local tree.

Also you made it all the way to mo_sta_state(NOTEXIST).
That smells like a bss_conf update removed the sta for us.
I've hit this before.
Maybe only with pre-22000 cards?


Test case (likely to reproduce):
1. Get into RUN.
2. Take your AP away.
3. Wait for connection loss to happen; that will trigger the RUN -> INIT newstate change.
4. Things will proceed from there.

My 8265 is currently buried somewhere but I'll go and give it a try in a few days hopefully.
Can you reproduce this somehow?
Comment 3 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-12-01 00:39:20 UTC
(In reply to Bjoern A. Zeeb from comment #2)

Obviously we do hit the beacon loss first:

[2339.509731] iwlwifi0: linuxkpi_ieee80211_beacon_loss: vif 0xffff0001d6b09c80 vap 0xffff0001d6b09010 state RUN
[2339.612089] iwlwifi0: linuxkpi_ieee80211_beacon_loss: vif 0xffff0001d6b09c80 vap 0xffff0001d6b09010 state RUN
[2339.618158] wlan1: link state changed to DOWN

which will get us into sta_beacon_miss and switch to SCAN.

And lkpi_sta_run_to_scan simply calls lkpi_sta_run_to_init.
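
A rough sketch of that delegation, assuming a simple table-driven dispatcher. All tbl_* names are hypothetical and the handlers are stand-ins; only the state numbers (INIT 0, SCAN 1, RUN 5) come from the log lines in this report.

```c
#include <stddef.h>

/* State numbers as printed in the log lines of this report. */
enum tbl_state {
	TBL_S_INIT = 0,
	TBL_S_SCAN = 1,
	TBL_S_RUN  = 5
};

typedef int (*tbl_handler_t)(int *teardowns);

/* Stand-in for the real teardown work; just counts invocations. */
static int
tbl_run_to_init(int *teardowns)
{
	(*teardowns)++;
	return (0);
}

/* RUN -> SCAN needs the same teardown, so it simply delegates. */
static int
tbl_run_to_scan(int *teardowns)
{
	return (tbl_run_to_init(teardowns));
}

struct tbl_transition {
	enum tbl_state	from;
	enum tbl_state	to;
	tbl_handler_t	handler;
};

static const struct tbl_transition tbl_table[] = {
	{ TBL_S_RUN, TBL_S_INIT, tbl_run_to_init },
	{ TBL_S_RUN, TBL_S_SCAN, tbl_run_to_scan },
};

/* Look up the handler for (from, to); -1 means no handler is listed. */
static int
tbl_newstate(enum tbl_state from, enum tbl_state to, int *teardowns)
{
	size_t i;

	for (i = 0; i < sizeof(tbl_table) / sizeof(tbl_table[0]); i++) {
		if (tbl_table[i].from == from && tbl_table[i].to == to)
			return (tbl_table[i].handler(teardowns));
	}
	return (-1);
}
```

The point is only that a beacon-loss driven RUN -> SCAN change ends up on the same teardown path as RUN -> INIT, so a failure in one should show up in the other as well.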

Doesn't go kaboom on a modern card; in this case a B200.
I'll go and find my 8265 and try again in the next few days.
Comment 4 commit-hook freebsd_committer freebsd_triage 2024-02-14 19:50:23 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=713db49d06deee90dd358b2e4b9ca05368a5eaf6

commit 713db49d06deee90dd358b2e4b9ca05368a5eaf6
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-01-10 10:14:16 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-14 19:47:21 +0000

    net80211: deal with lost state transitions

    Since 5efea30f039c4 we can possibly lose a state transition which can
    cause trouble further down the road.
    The reproducer from 643d6dce6c1e can trigger these for example.
    Drivers for firmware based wireless cards have worked around some of
    this (and other) problems in the past.

    Add an array of tasks rather than a single one as we would simply
    get npending > 1 and lose order with other tasks.  Try to keep state
    changes updated as queued in case we end up with more than one at a
    time.  While this is not ideal either (call it a hack) it will sort
    the problem for now.
    We will queue in ieee80211_new_state_locked() and do checks there
    and dequeue in ieee80211_newstate_cb().
    If we still overrun the (currently) 8 slots we will drop the state
    change rather than overwrite the last one.
    When dequeing we will update iv_nstate and keep it around for historic
    reasons for the moment.

    The longer term we should make the callers of
    ieee80211_new_state[_locked]() actually use the returned errors
    and act appropriately but that will touch a lot more places and
    drivers (possibly incl. changed behaviour for ioctls).

    rtwn(4) and rum(4) should probably be revisted and net80211 internals
    removed (for rum(4) at least the current logic still seems prone to
    races).

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (in 2023)
    MFC after:      3 days
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43389

 sys/dev/rtwn/if_rtwn.c         |   4 +-
 sys/dev/usb/wlan/if_rum.c      |   4 +-
 sys/net80211/ieee80211.c       |   4 +-
 sys/net80211/ieee80211_ddb.c   |  13 ++++-
 sys/net80211/ieee80211_proto.c | 124 ++++++++++++++++++++++++++++++++++-------
 sys/net80211/ieee80211_var.h   |  13 ++++-
 6 files changed, 134 insertions(+), 28 deletions(-)
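
The queueing idea from the commit message above (a fixed number of slots, processed in order, with new requests dropped on overrun rather than overwriting the last one) can be sketched as a small ring buffer. This is a minimal illustration under stated assumptions, not the net80211 change itself: the q_* names and the struct layout are hypothetical, and only the slot count of 8 and the "drop, do not overwrite" rule come from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

#define Q_NSTATE_SLOTS	8	/* "(currently) 8 slots" per the commit message */

struct q_vap {
	int	pending[Q_NSTATE_SLOTS];	/* queued target states, FIFO */
	size_t	head;				/* next slot to dequeue */
	size_t	count;				/* number of queued entries */
	int	nstate;				/* last dequeued target state */
};

/* Queue a state change; on overrun drop it instead of overwriting. */
static bool
q_enqueue(struct q_vap *v, int nstate)
{
	if (v->count == Q_NSTATE_SLOTS)
		return (false);
	v->pending[(v->head + v->count) % Q_NSTATE_SLOTS] = nstate;
	v->count++;
	return (true);
}

/* Take the oldest queued state change and record it in nstate. */
static bool
q_dequeue(struct q_vap *v)
{
	if (v->count == 0)
		return (false);
	v->nstate = v->pending[v->head];
	v->head = (v->head + 1) % Q_NSTATE_SLOTS;
	v->count--;
	return (true);
}
```

A caller would enqueue where the state change is requested and dequeue in the task callback; a ninth request made while eight are still pending is rejected, so the order of the first eight is preserved.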
Comment 5 commit-hook freebsd_committer freebsd_triage 2024-02-14 19:50:33 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=2ac8a2189ac6707f48f77ef2e36baf696a0d2f40

commit 2ac8a2189ac6707f48f77ef2e36baf696a0d2f40
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-03 16:33:56 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-14 19:47:53 +0000

    LinuxKPI: 802.11: band-aid for invalid state changes after (*iv_update_bss)

    With firmware based solutions we cannot just jump from an active session
    to a new iv_bss node without tearing down state for the old and bringing
    up the new node.  This likely used to work on softmac based cards/drivers
    where one could essentially set the state and fire at will.

    We track (*iv_update_bss) calls from net80211 and set a local flag that
    we are out of synch and do not allow any further operations up the state
    machine until we hit INIT or SCAN.  That means someone will take the state
    down, clean up firmware state and then we can join again and build up
    state.

    Apparently this problem has been "known" for a while as native iwm(4) and
    others have similar workarounds (though less strict) and can be equally
    pestered into bad states.  For LinuxKPI all the KASSERTs just massively
    brought this problem out.  The solution will be some rewrites in net80211.
    Until then, try to keep us more stable at least and not die on second
    join1() calls triggered by service netif start wlan0 and similar.

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (2023, partial)
    MFC after:      3 days
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43725

 sys/compat/linuxkpi/common/src/linux_80211.c | 309 +++++++++++++++++++--------
 sys/compat/linuxkpi/common/src/linux_80211.h |   2 +
 2 files changed, 216 insertions(+), 95 deletions(-)
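
The "out of synch" gate described in the commit message above might look roughly like the following. This is a simplified sketch, not the LinuxKPI code: the syn_* names are hypothetical, the teardown result is passed in as a parameter for illustration, and the exact point where the flag is set again is an assumption. The use of EBUSY is also an assumption, chosen because an "error 16" is logged for SCAN -> AUTH with "synched 0" later in this report.

```c
#include <errno.h>
#include <stdbool.h>

/* State numbers as printed in the log lines of this report. */
enum {
	SYN_S_INIT = 0,
	SYN_S_SCAN = 1,
	SYN_S_AUTH = 2,
	SYN_S_RUN  = 5
};

struct syn_lvif {
	bool	synched;	/* false after net80211 swapped the bss node */
};

/* Called when net80211 replaces iv_bss behind the driver's back. */
static void
syn_update_bss(struct syn_lvif *l)
{
	l->synched = false;
}

/*
 * Gate a state change.  teardown_rc stands in for the result of the
 * firmware cleanup performed on the way down (0 on success).
 */
static int
syn_newstate(struct syn_lvif *l, int nstate, int teardown_rc)
{
	if (nstate == SYN_S_INIT || nstate == SYN_S_SCAN) {
		/* Going down: only a clean teardown gets us back in synch. */
		if (teardown_rc != 0)
			return (teardown_rc);
		l->synched = true;
		return (0);
	}
	/* Going up: refuse while firmware and net80211 disagree. */
	if (!l->synched)
		return (EBUSY);
	return (0);
}
```

In this model a failed RUN -> INIT leaves the flag cleared, so a following SCAN -> AUTH attempt is refused until a later teardown succeeds.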
Comment 6 commit-hook freebsd_committer freebsd_triage 2024-02-18 21:12:10 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b392b36d3776b696601ce0253256803276d24ea2

commit b392b36d3776b696601ce0253256803276d24ea2
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-01-10 10:14:16 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-18 18:31:17 +0000

    net80211: deal with lost state transitions

    Since 5efea30f039c4 we can possibly lose a state transition which can
    cause trouble further down the road.
    The reproducer from 643d6dce6c1e can trigger these for example.
    Drivers for firmware based wireless cards have worked around some of
    this (and other) problems in the past.

    Add an array of tasks rather than a single one as we would simply
    get npending > 1 and lose order with other tasks.  Try to keep state
    changes updated as queued in case we end up with more than one at a
    time.  While this is not ideal either (call it a hack) it will sort
    the problem for now.
    We will queue in ieee80211_new_state_locked() and do checks there
    and dequeue in ieee80211_newstate_cb().
    If we still overrun the (currently) 8 slots we will drop the state
    change rather than overwrite the last one.
    When dequeing we will update iv_nstate and keep it around for historic
    reasons for the moment.

    The longer term we should make the callers of
    ieee80211_new_state[_locked]() actually use the returned errors
    and act appropriately but that will touch a lot more places and
    drivers (possibly incl. changed behaviour for ioctls).

    rtwn(4) and rum(4) should probably be revisted and net80211 internals
    removed (for rum(4) at least the current logic still seems prone to
    races).

    Given this changes the internal structure of 'struct ieee80211vap',
    which gets allocated by the drivers, and we do not have enough
    spares, all wireless drivers need to be recompiled.
    Given we are forced to do the update, we leave fields in the middle
    of the struct and add more spares at the same time.
    __FreeBSD_version gets updated to 1400509 to be able to detect
    this change.

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (in 2023)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43389

    (cherry picked from commit 713db49d06deee90dd358b2e4b9ca05368a5eaf6)
    (cherry picked from commit a890a3a5ddf33acb0a4000885945b89156799b07)

 UPDATING                       |   6 ++
 sys/dev/rtwn/if_rtwn.c         |   4 +-
 sys/dev/usb/wlan/if_rum.c      |   4 +-
 sys/net80211/ieee80211.c       |   4 +-
 sys/net80211/ieee80211_ddb.c   |  13 ++++-
 sys/net80211/ieee80211_proto.c | 124 ++++++++++++++++++++++++++++++++++-------
 sys/net80211/ieee80211_var.h   |  15 +++--
 sys/sys/param.h                |   2 +-
 8 files changed, 142 insertions(+), 30 deletions(-)
Comment 7 commit-hook freebsd_committer freebsd_triage 2024-02-18 21:12:13 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=8c450ea1083b03f30871506b59034f26bc608972

commit 8c450ea1083b03f30871506b59034f26bc608972
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-03 16:33:56 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-18 18:31:17 +0000

    LinuxKPI: 802.11: band-aid for invalid state changes after (*iv_update_bss)

    With firmware based solutions we cannot just jump from an active session
    to a new iv_bss node without tearing down state for the old and bringing
    up the new node.  This likely used to work on softmac based cards/drivers
    where one could essentially set the state and fire at will.

    We track (*iv_update_bss) calls from net80211 and set a local flag that
    we are out of synch and do not allow any further operations up the state
    machine until we hit INIT or SCAN.  That means someone will take the state
    down, clean up firmware state and then we can join again and build up
    state.

    Apparently this problem has been "known" for a while as native iwm(4) and
    others have similar workarounds (though less strict) and can be equally
    pestered into bad states.  For LinuxKPI all the KASSERTs just massively
    brought this problem out.  The solution will be some rewrites in net80211.
    Until then, try to keep us more stable at least and not die on second
    join1() calls triggered by service netif start wlan0 and similar.

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (2023, partial)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43725

    (cherry picked from commit 2ac8a2189ac6707f48f77ef2e36baf696a0d2f40)

 sys/compat/linuxkpi/common/src/linux_80211.c | 309 +++++++++++++++++++--------
 sys/compat/linuxkpi/common/src/linux_80211.h |   2 +
 2 files changed, 216 insertions(+), 95 deletions(-)
Comment 8 commit-hook freebsd_committer freebsd_triage 2024-02-19 08:09:14 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a7e1fc7f620d3341549c1380f550aaafbdb45622

commit a7e1fc7f620d3341549c1380f550aaafbdb45622
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-01-10 10:14:16 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 08:02:01 +0000

    net80211: deal with lost state transitions

    Since 5efea30f039c4 we can possibly lose a state transition which can
    cause trouble further down the road.
    The reproducer from 643d6dce6c1e can trigger these for example.
    Drivers for firmware based wireless cards have worked around some of
    this (and other) problems in the past.

    Add an array of tasks rather than a single one as we would simply
    get npending > 1 and lose order with other tasks.  Try to keep state
    changes updated as queued in case we end up with more than one at a
    time.  While this is not ideal either (call it a hack) it will sort
    the problem for now.
    We will queue in ieee80211_new_state_locked() and do checks there
    and dequeue in ieee80211_newstate_cb().
    If we still overrun the (currently) 8 slots we will drop the state
    change rather than overwrite the last one.
    When dequeing we will update iv_nstate and keep it around for historic
    reasons for the moment.

    The longer term we should make the callers of
    ieee80211_new_state[_locked]() actually use the returned errors
    and act appropriately but that will touch a lot more places and
    drivers (possibly incl. changed behaviour for ioctls).

    rtwn(4) and rum(4) should probably be revisted and net80211 internals
    removed (for rum(4) at least the current logic still seems prone to
    races).

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (in 2023)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43389

    (cherry picked from commit 713db49d06deee90dd358b2e4b9ca05368a5eaf6)

    Given this changes the internal structure of 'struct ieee80211vap',
    which gets allocated by the drivers, and we do not have enough
    spares, all wireless drivers need to be recompiled.
    Given we are forced to do the update, we leave fields in the middle
    of the struct and add more spares at the same time.
    __FreeBSD_version gets updated to 1303501 to be able to detect
    this change.

    (cherry picked from commit a890a3a5ddf33acb0a4000885945b89156799b07)

 UPDATING                       |   6 ++
 sys/dev/rtwn/if_rtwn.c         |   4 +-
 sys/dev/usb/wlan/if_rum.c      |   4 +-
 sys/net80211/ieee80211.c       |   4 +-
 sys/net80211/ieee80211_ddb.c   |  15 ++++-
 sys/net80211/ieee80211_proto.c | 124 ++++++++++++++++++++++++++++++++++-------
 sys/net80211/ieee80211_var.h   |  18 +++---
 sys/sys/param.h                |   2 +-
 8 files changed, 143 insertions(+), 34 deletions(-)
Comment 9 commit-hook freebsd_committer freebsd_triage 2024-02-19 08:09:24 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=184ccc414686ea32c64f063c081c7cc1adeae7c3

commit 184ccc414686ea32c64f063c081c7cc1adeae7c3
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-03 16:33:56 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 08:02:02 +0000

    LinuxKPI: 802.11: band-aid for invalid state changes after (*iv_update_bss)

    With firmware based solutions we cannot just jump from an active session
    to a new iv_bss node without tearing down state for the old and bringing
    up the new node.  This likely used to work on softmac based cards/drivers
    where one could essentially set the state and fire at will.

    We track (*iv_update_bss) calls from net80211 and set a local flag that
    we are out of synch and do not allow any further operations up the state
    machine until we hit INIT or SCAN.  That means someone will take the state
    down, clean up firmware state and then we can join again and build up
    state.

    Apparently this problem has been "known" for a while as native iwm(4) and
    others have similar workarounds (though less strict) and can be equally
    pestered into bad states.  For LinuxKPI all the KASSERTs just massively
    brought this problem out.  The solution will be some rewrites in net80211.
    Until then, try to keep us more stable at least and not die on second
    join1() calls triggered by service netif start wlan0 and similar.

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (2023, partial)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43725

    (cherry picked from commit 2ac8a2189ac6707f48f77ef2e36baf696a0d2f40)

 sys/compat/linuxkpi/common/src/linux_80211.c | 309 +++++++++++++++++++--------
 sys/compat/linuxkpi/common/src/linux_80211.h |   2 +
 2 files changed, 216 insertions(+), 95 deletions(-)
Comment 10 commit-hook freebsd_committer freebsd_triage 2024-02-19 16:10:54 UTC
A commit in branch releng/13.3 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d4b4efc6db6c6c3a9abf2f187ba1ccc0e40028cf

commit d4b4efc6db6c6c3a9abf2f187ba1ccc0e40028cf
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-03 16:33:56 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 16:09:22 +0000

    LinuxKPI: 802.11: band-aid for invalid state changes after (*iv_update_bss)

    With firmware based solutions we cannot just jump from an active session
    to a new iv_bss node without tearing down state for the old and bringing
    up the new node.  This likely used to work on softmac based cards/drivers
    where one could essentially set the state and fire at will.

    We track (*iv_update_bss) calls from net80211 and set a local flag that
    we are out of synch and do not allow any further operations up the state
    machine until we hit INIT or SCAN.  That means someone will take the state
    down, clean up firmware state and then we can join again and build up
    state.

    Apparently this problem has been "known" for a while as native iwm(4) and
    others have similar workarounds (though less strict) and can be equally
    pestered into bad states.  For LinuxKPI all the KASSERTs just massively
    brought this problem out.  The solution will be some rewrites in net80211.
    Until then, try to keep us more stable at least and not die on second
    join1() calls triggered by service netif start wlan0 and similar.

    Approved by:    re (cperciva)
    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (2023, partial)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43725

    (cherry picked from commit 2ac8a2189ac6707f48f77ef2e36baf696a0d2f40)
    (cherry picked from commit 184ccc414686ea32c64f063c081c7cc1adeae7c3)

 sys/compat/linuxkpi/common/src/linux_80211.c | 309 +++++++++++++++++++--------
 sys/compat/linuxkpi/common/src/linux_80211.h |   2 +
 2 files changed, 216 insertions(+), 95 deletions(-)
Comment 11 commit-hook freebsd_committer freebsd_triage 2024-02-19 16:11:05 UTC
A commit in branch releng/13.3 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9b998db87c28356fce21784c4f8bfb8737615e1f

commit 9b998db87c28356fce21784c4f8bfb8737615e1f
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-01-10 10:14:16 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 16:07:20 +0000

    net80211: deal with lost state transitions

    Since 5efea30f039c4 we can possibly lose a state transition which can
    cause trouble further down the road.
    The reproducer from 643d6dce6c1e can trigger these for example.
    Drivers for firmware based wireless cards have worked around some of
    this (and other) problems in the past.

    Add an array of tasks rather than a single one as we would simply
    get npending > 1 and lose order with other tasks.  Try to keep state
    changes updated as queued in case we end up with more than one at a
    time.  While this is not ideal either (call it a hack) it will sort
    the problem for now.
    We will queue in ieee80211_new_state_locked() and do checks there
    and dequeue in ieee80211_newstate_cb().
    If we still overrun the (currently) 8 slots we will drop the state
    change rather than overwrite the last one.
    When dequeing we will update iv_nstate and keep it around for historic
    reasons for the moment.

    The longer term we should make the callers of
    ieee80211_new_state[_locked]() actually use the returned errors
    and act appropriately but that will touch a lot more places and
    drivers (possibly incl. changed behaviour for ioctls).

    rtwn(4) and rum(4) should probably be revisted and net80211 internals
    removed (for rum(4) at least the current logic still seems prone to
    races).

    PR:             271979, 271988, 275255, 263613, 274003
    Sponsored by:   The FreeBSD Foundation (in 2023)
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43389

    (cherry picked from commit 713db49d06deee90dd358b2e4b9ca05368a5eaf6)

    Given this changes the internal structure of 'struct ieee80211vap',
    which gets allocated by the drivers, and we do not have enough
    spares, all wireless drivers need to be recompiled.
    Given we are forced to do the update, we leave fields in the middle
    of the struct and add more spares at the same time.
    __FreeBSD_version will get updated to 1303001 to be able to detect
    this change.

    Approved by:    re (cperciva)

    (cherry picked from commit a890a3a5ddf33acb0a4000885945b89156799b07)
    (cherry picked from commit a7e1fc7f620d3341549c1380f550aaafbdb45622)

 sys/dev/rtwn/if_rtwn.c         |   4 +-
 sys/dev/usb/wlan/if_rum.c      |   4 +-
 sys/net80211/ieee80211.c       |   4 +-
 sys/net80211/ieee80211_ddb.c   |  15 ++++-
 sys/net80211/ieee80211_proto.c | 124 ++++++++++++++++++++++++++++++++++-------
 sys/net80211/ieee80211_var.h   |  18 +++---
 6 files changed, 136 insertions(+), 33 deletions(-)
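
Since the MFC commit messages above say that all wireless drivers need to be recompiled and that __FreeBSD_version was bumped to make the change detectable, an out-of-tree driver could check for it along these lines. This is a sketch under assumptions: the ver_* names are hypothetical, the three version numbers are the ones quoted in the commit messages, the treatment of 1303500 (stable/13 just before the bump) as lacking the change is inferred, and the value used on main is not quoted in this report, so it is reported as unknown.

```c
/* Version numbers quoted in the MFC commit messages in this report. */
#define VER_STABLE14	1400509
#define VER_STABLE13	1303501
#define VER_RELENG133	1303001

/*
 * Returns 1 if the given __FreeBSD_version should have the changed
 * 'struct ieee80211vap' layout, 0 if not, and -1 if this report does
 * not say (branches other than 13.x and 14.x).
 */
static int
ver_has_new_vap_layout(int osversion)
{
	if (osversion >= 1500000 || osversion < 1300000)
		return (-1);			/* not covered here */
	if (osversion >= 1400000)
		return (osversion >= VER_STABLE14);
	if (osversion >= 1303500)		/* stable/13 after the 13.3 branch (assumed) */
		return (osversion >= VER_STABLE13);
	return (osversion >= VER_RELENG133);	/* releng/13.3 and older 13.x */
}
```

In a real driver this would more likely be a preprocessor check on __FreeBSD_version; the function form is used here only so the ranges are easy to read.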
Comment 12 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-02-19 16:28:58 UTC
I believe this should be fixed in all branches now.
Can you re-test or can we close this?
Comment 13 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-02-19 21:42:46 UTC
Also just reported here with a 9xxx card:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274003#c28

Will try to track this here.
Comment 14 ml 2024-04-22 17:27:55 UTC
I'm seeing very frequent panics (several times per day) after upgrading from 13.2R to 13.3R, and they are possibly related to this bug.

I'm not able to get a proper core on this machine (and I still can't understand why), but in the logs I find:


Apr 18 15:00:25 hector kernel: iwlwifi0: linuxkpi_ieee80211_beacon_loss: vif 0xfffffe00ac5b1e80 vap 0xfffffe00ac5b1010 state RUN
Apr 18 15:00:28 hector syslogd: last message repeated 1 times
Apr 18 15:00:29 hector kernel: ipfw: 9999 Deny UDP 192.168.113.82:137 192.168.113.255:137 in via wlan0
Apr 18 15:00:29 hector wpa_supplicant[93738]: wlan0: CTRL-EVENT-DISCONNECTED bssid=fc:f5:28:ca:f1:12 reason=0
Apr 18 15:00:29 hector kernel: iwlwifi0: linuxkpi_ieee80211_connection_loss: vif 0xfffffe00ac5b1e80 vap 0xfffffe00ac5b1010 state RUN
Apr 18 15:00:29 hector kernel: wlan0: link state changed to DOWN
Apr 18 15:00:29 hector wpa_supplicant[93738]: BSSID fc:f5:28:ca:f1:12 ignore list count incremented to 2, ignoring for 10 seconds
Apr 18 15:00:29 hector wpa_supplicant[93738]: ioctl[SIOCS80211, op=20, val=0, arg_len=7]: Can't assign requested address
Apr 18 15:00:29 hector dhclient[97089]: wlan0 link state up -> down
Apr 18 15:00:29 hector devd[99040]: Processing event '!system=IFNET subsystem=wlan0 type=LINK_DOWN'
Apr 18 15:00:29 hector devd[99040]: Pushing table
Apr 18 15:00:29 hector devd[99040]: Processing notify event
Apr 18 15:00:29 hector devd[99040]: Popping table
Apr 18 15:00:29 hector dbus-daemon[3053]: [system] Activating service name='org.freedesktop.ConsoleKit' requested by ':1.2' (uid=0 pid=24346 comm="") (using servicehelper)
Apr 18 15:00:29 hector kernel: iwlwifi0: Couldn't drain frames for staid 0, status 0x8
Apr 18 15:00:29 hector kernel: iwlwifi0: lkpi_sta_run_to_init:2173: mo_sta_state(NOTEXIST) failed: -5
Apr 18 15:00:29 hector kernel: iwlwifi0: lkpi_iv_newstate: error -5 during state transition 5 (RUN) -> 0 (INIT)
Apr 18 15:00:29 hector dbus-daemon[3053]: [system] Activating service name='org.freedesktop.PolicyKit1' requested by ':1.3' (uid=0 pid=24293 comm="") (using servicehelper)
Apr 18 15:00:29 hector dbus-daemon[3053]: [system] Successfully activated service 'org.freedesktop.ConsoleKit'
Apr 18 15:00:29 hector polkitd[24756]: Started polkitd version 124
Apr 18 15:00:29 hector polkitd[24756]: Loading rules from directory /usr/local/etc/polkit-1/rules.d
Apr 18 15:00:29 hector polkitd[24756]: Loading rules from directory /usr/local/share/polkit-1/rules.d
Apr 18 15:00:29 hector polkitd[24756]: Finished loading, compiling and executing 1 rules
Apr 18 15:00:29 hector dbus-daemon[3053]: [system] Successfully activated service 'org.freedesktop.PolicyKit1'
Apr 18 15:00:29 hector polkitd[24756]: Acquired the name org.freedesktop.PolicyKit1 on the system bus
Apr 18 15:00:29 hector dbus-daemon[29677]: [session uid=1001 pid=28981] Activating service name='org.a11y.Bus' requested by ':1.0' (uid=1001 pid=25777 comm="")
Apr 18 15:00:29 hector dbus-daemon[29677]: [session uid=1001 pid=28981] Successfully activated service 'org.a11y.Bus'
Apr 18 15:00:29 hector dbus-daemon[29677]: [session uid=1001 pid=28981] Activating service name='org.xfce.Xfconf' requested by ':1.2' (uid=1001 pid=25777 comm="")
Apr 18 15:00:29 hector dbus-daemon[29677]: [session uid=1001 pid=28981] Successfully activated service 'org.xfce.Xfconf'
Apr 18 15:00:30 hector dbus-daemon[29677]: [session uid=1001 pid=28981] Activating service name='org.gtk.vfs.Daemon' requested by ':1.6' (uid=1001 pid=36842 comm="")
Apr 18 15:00:30 hector dbus-daemon[29677]: [session uid=1001 pid=28981] Successfully activated service 'org.gtk.vfs.Daemon'
Apr 18 15:00:30 hector dbus-daemon[3053]: [system] Activating service name='org.freedesktop.UPower' requested by ':1.6' (uid=1001 pid=40674 comm="") (using servicehelper)
Apr 18 15:00:30 hector dbus-daemon[3053]: [system] Successfully activated service 'org.freedesktop.UPower'
Apr 18 15:00:30 hector wpa_supplicant[93738]: wlan0: Trying to associate with fc:f5:28:ca:f1:13 (SSID='CCBiesse' freq=5180 MHz)
Apr 18 15:00:30 hector kernel: iwlwifi0: lkpi_sta_scan_to_auth:1033: lvif 0xfffffe00ac5b1000 vap 0xfffffe00ac5b1010 iv_bss 0xfffffe00adb4c000 lvif_bss 0xfffff8000565a000 lvif_bss->ni 0xfffffe00aed99000 synched 0
Apr 18 15:00:30 hector kernel: iwlwifi0: lkpi_iv_newstate: error 16 during state transition 1 (SCAN) -> 2 (AUTH)
Apr 18 15:01:09 hector syslogd: restart
Apr 18 15:01:09 hector syslogd: kernel boot file is /boot/kernel/kernel
Apr 18 15:01:09 hector kernel: Sleeping thread (tid 100785, pid 0) owns a non-sleepable lock
Apr 18 15:01:09 hector kernel: KDB: stack backtrace of thread 100785:
Apr 18 15:01:09 hector kernel: sched_switch() at sched_switch+0x7d1/frame 0xfffffe00ab30fe20
Apr 18 15:01:09 hector kernel: mi_switch() at mi_switch+0xbf/frame 0xfffffe00ab30fe40
Apr 18 15:01:09 hector kernel: _sleep() at _sleep+0x1f0/frame 0xfffffe00ab30fec0
Apr 18 15:01:09 hector kernel: taskqueue_thread_loop() at taskqueue_thread_loop+0xb1/frame 0xfffffe00ab30fef0
Apr 18 15:01:09 hector kernel: fork_exit() at fork_exit+0x7d/frame 0xfffffe00ab30ff30
Apr 18 15:01:09 hector kernel: fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00ab30ff30
Apr 18 15:01:09 hector kernel: --- trap 0xc, rip = 0x63639d6e3da, rsp = 0x6364d128f48, rbp = 0x6364d128f60 ---
Apr 18 15:01:09 hector kernel: panic: sleeping thread
Apr 18 15:01:09 hector kernel: cpuid = 2
Apr 18 15:01:09 hector kernel: time = 1713445230
Apr 18 15:01:09 hector kernel: KDB: stack backtrace:
Apr 18 15:01:09 hector kernel: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00ab2ba960
Apr 18 15:01:09 hector kernel: vpanic() at vpanic+0x152/frame 0xfffffe00ab2ba9b0
Apr 18 15:01:09 hector kernel: panic() at panic+0x43/frame 0xfffffe00ab2baa10
Apr 18 15:01:09 hector kernel: propagate_priority() at propagate_priority+0x293/frame 0xfffffe00ab2baa50
Apr 18 15:01:09 hector kernel: turnstile_wait() at turnstile_wait+0x314/frame 0xfffffe00ab2baaa0
Apr 18 15:01:09 hector kernel: __mtx_lock_sleep() at __mtx_lock_sleep+0x17b/frame 0xfffffe00ab2bab30
Apr 18 15:01:09 hector kernel: linuxkpi_ieee80211_find_sta() at linuxkpi_ieee80211_find_sta+0xd0/frame 0xfffffe00ab2bab70
Apr 18 15:01:09 hector kernel: linuxkpi_ieee80211_find_sta_by_ifaddr() at linuxkpi_ieee80211_find_sta_by_ifaddr+0x7f/frame 0xfffffe00ab2babc0
Apr 18 15:01:09 hector kernel: iwl_mvm_rx_mpdu_mq() at iwl_mvm_rx_mpdu_mq+0x420/frame 0xfffffe00ab2bacd0
Apr 18 15:01:09 hector kernel: iwl_pcie_rx_handle() at iwl_pcie_rx_handle+0x444/frame 0xfffffe00ab2badd0
Apr 18 15:01:09 hector kernel: iwl_pcie_napi_poll_msix() at iwl_pcie_napi_poll_msix+0x30/frame 0xfffffe00ab2bae20
Apr 18 15:01:09 hector kernel: lkpi_napi_task() at lkpi_napi_task+0xf/frame 0xfffffe00ab2bae40
Apr 18 15:01:09 hector kernel: taskqueue_run_locked() at taskqueue_run_locked+0x182/frame 0xfffffe00ab2baec0
Apr 18 15:01:09 hector kernel: taskqueue_thread_loop() at taskqueue_thread_loop+0xc2/frame 0xfffffe00ab2baef0
Apr 18 15:01:09 hector kernel: fork_exit() at fork_exit+0x7d/frame 0xfffffe00ab2baf30
Apr 18 15:01:09 hector kernel: fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00ab2baf30
Apr 18 15:01:09 hector kernel: --- trap 0xc, rip = 0x63639d6e3da, rsp = 0x63644046f48, rbp = 0x63644046f60 ---
Apr 18 15:01:09 hector kernel: Uptime: 33s



Another:
Apr 18 15:05:32 hector wpa_supplicant[21281]: wlan0: CTRL-EVENT-DISCONNECTED bssid=fc:f5:28:ca:f1:13 reason=0
Apr 18 15:05:32 hector kernel: iwlwifi0: linuxkpi_ieee80211_beacon_loss: vif 0xfffffe00abd8de80 vap 0xfffffe00abd8d010 state RUN
Apr 18 15:05:32 hector syslogd: last message repeated 1 times
Apr 18 15:05:32 hector kernel: wlan0: link state changed to DOWN
Apr 18 15:05:32 hector devd[76299]: Processing event '!system=IFNET subsystem=wlan0 type=LINK_DOWN'
Apr 18 15:05:32 hector dhclient[22403]: wlan0 link state up -> down
Apr 18 15:05:32 hector devd[76299]: Pushing table
Apr 18 15:05:32 hector devd[76299]: Processing notify event
Apr 18 15:05:32 hector kernel: iwlwifi0: Couldn't drain frames for staid 0, status 0x8
Apr 18 15:05:32 hector kernel: iwlwifi0: lkpi_sta_run_to_init:2173: mo_sta_state(NOTEXIST) failed: -5
Apr 18 15:05:32 hector kernel: iwlwifi0: lkpi_iv_newstate: error -5 during state transition 5 (RUN) -> 1 (SCAN)
Apr 18 15:05:32 hector devd[76299]: Popping table
Apr 18 15:05:33 hector wpa_supplicant[21281]: wlan0: Trying to associate with fc:f5:28:ca:f1:13 (SSID='CCBiesse' freq=5180 MHz)
Apr 18 15:05:33 hector kernel: iwlwifi0: lkpi_sta_scan_to_auth:1033: lvif 0xfffffe00abd8d000 vap 0xfffffe00abd8d010 iv_bss 0xfffffe00ade45000 lvif_bss 0xfffff80005c8f800 lvif_bss->ni 0xfffffe00ac03c000 synched 0
Apr 18 15:05:33 hector kernel: iwlwifi0: lkpi_iv_newstate: error 16 during state transition 1 (SCAN) -> 2 (AUTH)
Apr 18 15:05:33 hector kernel: Sleeping thread (tid 100787, pid 0) owns a non-sleepable lock
Apr 18 15:05:33 hector kernel: KDB: stack backtrace of thread 100787:
Apr 18 15:05:33 hector kernel: sched_switch() at sched_switch+0x7d1/frame 0xfffffe00b0d0ae20
Apr 18 15:05:33 hector kernel: mi_switch() at mi_switch+0xbf/frame 0xfffffe00b0d0ae40
Apr 18 15:05:33 hector kernel: _sleep() at _sleep+0x1f0/frame 0xfffffe00b0d0aec0
Apr 18 15:06:26 hector syslogd: restart
Apr 18 15:06:26 hector syslogd: kernel boot file is /boot/kernel/kernel
Apr 18 15:06:26 hector kernel: fork_exit() at fork_exit+0x7d/frame 0xfffffe00b0d0af30
Apr 18 15:06:26 hector kernel: fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00b0d0af30
Apr 18 15:06:26 hector kernel: --- trap 0xfc9226bb, rip = 0x9e2138e7aa6c807d, rsp = 0xce04afe17783e5b4, rbp = 0x7adcc9aeee864891 ---
Apr 18 15:06:26 hector kernel: panic: sleeping thread
Apr 18 15:06:26 hector kernel: cpuid = 3
Apr 18 15:06:26 hector kernel: time = 1713445533
Apr 18 15:06:26 hector kernel: KDB: stack backtrace:
Apr 18 15:06:26 hector kernel: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00b0dbc960
Apr 18 15:06:26 hector kernel: vpanic() at vpanic+0x152/frame 0xfffffe00b0dbc9b0
Apr 18 15:06:26 hector kernel: panic() at panic+0x43/frame 0xfffffe00b0dbca10
Apr 18 15:06:26 hector kernel: propagate_priority() at propagate_priority+0x293/frame 0xfffffe00b0dbca50
Apr 18 15:06:26 hector kernel: turnstile_wait() at turnstile_wait+0x314/frame 0xfffffe00b0dbcaa0
Apr 18 15:06:26 hector kernel: __mtx_lock_sleep() at __mtx_lock_sleep+0x17b/frame 0xfffffe00b0dbcb30
Apr 18 15:06:26 hector kernel: linuxkpi_ieee80211_find_sta() at linuxkpi_ieee80211_find_sta+0xd0/frame 0xfffffe00b0dbcb70
Apr 18 15:06:26 hector kernel: linuxkpi_ieee80211_find_sta_by_ifaddr() at linuxkpi_ieee80211_find_sta_by_ifaddr+0x7f/frame 0xfffffe00b0dbcbc0
Apr 18 15:06:26 hector kernel: iwl_mvm_rx_mpdu_mq() at iwl_mvm_rx_mpdu_mq+0x420/frame 0xfffffe00b0dbccd0
Apr 18 15:06:26 hector kernel: iwl_pcie_rx_handle() at iwl_pcie_rx_handle+0x444/frame 0xfffffe00b0dbcdd0
Apr 18 15:06:26 hector kernel: iwl_pcie_napi_poll_msix() at iwl_pcie_napi_poll_msix+0x30/frame 0xfffffe00b0dbce20
Apr 18 15:06:26 hector kernel: lkpi_napi_task() at lkpi_napi_task+0xf/frame 0xfffffe00b0dbce40
Apr 18 15:06:26 hector kernel: taskqueue_run_locked() at taskqueue_run_locked+0x182/frame 0xfffffe00b0dbcec0
Apr 18 15:06:26 hector kernel: taskqueue_thread_loop() at taskqueue_thread_loop+0xc2/frame 0xfffffe00b0dbcef0
Apr 18 15:06:26 hector kernel: fork_exit() at fork_exit+0x7d/frame 0xfffffe00b0dbcf30
Apr 18 15:06:26 hector kernel: fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00b0dbcf30
Apr 18 15:06:26 hector kernel: --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Apr 18 15:06:26 hector kernel: Uptime: 4m36s




I've got several others.
Once I also entered a loop where the laptop would panic and reboot before I reached the prompt.
Luckily, entering single-user mode, starting the network from there and then moving to multi-user mode solved it.



# pciconf -lv iwlwifi0
iwlwifi0@pci0:0:12:0:    class=0x028000 rev=0x06 hdr=0x00 vendor=0x8086 device=0x31dc subvendor=0x8086 subdevice=0x0264
    vendor     = 'Intel Corporation'
    device     = 'Gemini Lake PCH CNVi WiFi'
    class      = network 

(should be Intel 9461).
Comment 15 ml 2024-04-22 17:30:02 UTC
(In reply to ml from comment #14)

P.S.
I'm writing here at Bjoern's invitation (he was the one who said I'm possibly seeing this specific issue).
Comment 16 Bakul Shah 2024-05-14 06:31:48 UTC
I tested this in a VM with a fairly recent -current (commit da4230af3fda).

ifconfig wlan0 create wlandev iwlwifi0
service netif start wlan0

works fine and the machine stayed up quite a while. But when I did
a "shutdown now" it panicked. I repeated this a few times:


syslogd: exiting on signal 15
iwlwifi0: Couldn't drain frames for staid 0, status 0x8
iwlwifi0: lkpi_sta_run_to_init:2309: mo_sta_state(NOTEXIST) failed: -5
iwlwifi0: lkpi_iv_newstate: error -5 during state transition 5 (RUN) -> 0 (INIT)
panic: INIT state change failed
cpuid = 0
time = 1715667397
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00688bbc60
vpanic() at vpanic+0x13f/frame 0xfffffe00688bbd90
panic() at panic+0x43/frame 0xfffffe00688bbdf0
ieee80211_newstate_cb() at ieee80211_newstate_cb+0x422/frame 0xfffffe00688bbe40
taskqueue_run_locked() at taskqueue_run_locked+0x1c2/frame 0xfffffe00688bbec0
taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe00688bbef0
fork_exit() at fork_exit+0x82/frame 0xfffffe00688bbf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00688bbf30
--- trap 0xc, rip = 0x39f11dbd331a, rsp = 0x39f122c47f48, rbp = 0x39f122c47f60 ---
KDB: enter: panic
[ thread pid 0 tid 100266 ]
Stopped at      kdb_enter+0x33: movq    $0,0x1053452(%rip)
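
To check whether other logs show the same failure signature as above, a minimal sketch follows. The function name find_newstate_errors and the default log path are assumptions for illustration; the pattern is taken from the messages quoted in this report.

```shell
#!/bin/sh
# Hypothetical helper: print lkpi_iv_newstate failures found in a log file.
# The pattern matches lines such as
#   iwlwifi0: lkpi_iv_newstate: error -5 during state transition 5 (RUN) -> 0 (INIT)
find_newstate_errors() {
    # $1: log file to scan (defaults to /var/log/messages, an assumption)
    grep -E 'lkpi_iv_newstate: error -?[0-9]+ during state transition' \
        "${1:-/var/log/messages}"
}
```

Calling find_newstate_errors /var/log/messages prints every matching line; no output means the signature was not found in that file.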
Comment 17 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-05-22 02:34:17 UTC
Can everyone please try the patch from https://reviews.freebsd.org/D45293 ?

You can download the raw diff from https://reviews.freebsd.org/D45293?download=true if you don't use arc.

The change should apply to main, stable/14 and stable/13.
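
A minimal sketch of fetching and applying the raw diff without arc. It assumes the source tree lives in /usr/src (override SRC otherwise), that fetch(1) and patch(1) from the base system are used, and that the diff applies with a strip level of -p1; the function name apply_d45293 is hypothetical. Rebuilding and installing the kernel afterwards is not shown.

```shell
#!/bin/sh
# Sketch only: fetch the raw diff for D45293 and apply it to a source tree.
# SRC and the -p1 strip level are assumptions; adjust them for your setup.
SRC="${SRC:-/usr/src}"
DIFF_URL='https://reviews.freebsd.org/D45293?download=true'

apply_d45293() {
    diff_file="${TMPDIR:-/tmp}/D45293.diff"
    # fetch(1) is the FreeBSD base-system download tool.
    fetch -o "$diff_file" "$DIFF_URL" || return 1
    # Dry run first so a tree the diff does not match is left untouched.
    (cd "$SRC" && patch -p1 --dry-run < "$diff_file") || return 1
    (cd "$SRC" && patch -p1 < "$diff_file")
}
```

Run apply_d45293 as a user with write access to the tree. If the dry run fails, nothing is modified.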
Comment 18 Bakul Shah 2024-05-22 03:34:34 UTC
I applied the patch on -current and the system worked fine and didn't crash at reboot time. It worked fine for several more reboots.
Comment 19 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-05-22 03:45:22 UTC
(In reply to Bakul Shah from comment #18)

Wow, that was fast.   Thank you!  Can you remind me of the PCI ID of your card?
Comment 20 Bakul Shah 2024-05-22 03:51:18 UTC
iwlwifi0@pci0:0:7:0:    class=0x028000 rev=0x29 hdr=0x00 vendor=0x8086 device=0x2526 subvendor=0x8086 subdevice=0x0014
    vendor     = 'Intel Corporation'
    device     = 'Wi-Fi 5(802.11ac) Wireless-AC 9x6x [Thunder Peak]'
    class      = network


iwlwifi0: Detected Intel(R) Wireless-AC 9260 160MHz, REV=0x321
iwlwifi0: base HW address: 8c:a9:82:fc:e8:9c, OTP minor version: 0x4
iwlwifi0: <iwlwifi> mem 0xc1034000-0xc1037fff at device 7.0 on pci0
iwlwifi0: Detected crf-id 0x2816, cnv-id 0x1000200 wfpm id 0x80000000
iwlwifi0: PCI dev 2526/0014, rev=0x321, rfid=0x105110
iwlwifi0: successfully loaded firmware image 'iwlwifi-9260-th-b0-jf-b0-46.ucode'

Note that on reboot the device seems to be in some odd state and I see messages like

iwlwifi0: loaded firmware version 46.ff18e32a.0 9260-th-b0-jf-b0-46.ucode op_mode iwlmvm
iwlwifi0: Detected Intel(R) Wireless-AC 9260 160MHz, REV=0x321
iwlwifi0: Failed to load firmware chunk!
iwlwifi0: iwlwifi transaction failed, dumping registers
iwlwifi0: iwlwifi device config registers:
iwlwifi 0000:00:07.0: 0000   86 80 26 25 07 04 10 00 29 00 80 02 00 00 00 00  |..
...
iwlwifi0: Could not load the [0] uCode section
iwlwifi0: Failed to start INIT ucode: -60
iwlwifi0: WRT: Collecting data: ini trigger 13 fired (delay=0ms).
iwlwifi0: Not valid error log pointer 0x00000000 for Init uCode
iwlwifi0: IML/ROM dump:
...
iwlwifi0: 0xE27C6CB6 | FSEQ_CLASS_TP_VERSION
iwlwifi0: Failed to run INIT ucode: -60
iwlwifi0: retry init count 0
iwlwifi0: Detected Intel(R) Wireless-AC 9260 160MHz, REV=0x321

But it seems to work fine (at least with my manual ifconfig ... create...). Haven't tried enabling it from rc.conf yet.
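
For reference, the rc.conf equivalent of the manual ifconfig ... create would look roughly like the fragment below. This is the usual FreeBSD wlan setup rather than something tested in this report, and "WPA SYNCDHCP" assumes a WPA network already configured in /etc/wpa_supplicant.conf.

```shell
# /etc/rc.conf fragment (assumed setup, not tested in this report)
wlans_iwlwifi0="wlan0"          # create wlan0 on top of iwlwifi0 at boot
ifconfig_wlan0="WPA SYNCDHCP"   # start wpa_supplicant, then run DHCP
```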
Comment 21 commit-hook freebsd_committer freebsd_triage 2024-05-22 21:08:28 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=5a4d24610fc6143ac1d570fe2b5160e8ae893c2c

commit 5a4d24610fc6143ac1d570fe2b5160e8ae893c2c
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-05-22 02:24:51 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-05-22 21:04:19 +0000

    LinuxKPI: 802.11: change teardown order to avoid iwlwifi firmware crashes

    While the previous order worked well for iwlwifi 22000 and later chipsets
    (AXxxx, BE200), earlier chipsets had trouble and ran into firmware crashes.
    Change the teardown order to avoid these problems.  The inline comments
    in lkpi_sta_run_to_init() (and lkpi_disassoc()) try to document the new
    order and also the old problems we were seeing (too early sta removal or
    silent non-removal) leading to follow-up problems.

    There is a possible further problem still lingering but a lot harder to
    trigger (see comment in review) and likely related to some other doings
    so we'll track it separately.

    Sponsored by:   The FreeBSD Foundation
    MFC after:      3 days
    PR:             275255
    Tested with:    AX210, 8265 (bz); 9260 (Bakul Shah)
    Differential Revision: https://reviews.freebsd.org/D45293

 sys/compat/linuxkpi/common/src/linux_80211.c | 84 ++++++++++++++++++----------
 1 file changed, 55 insertions(+), 29 deletions(-)
Comment 22 commit-hook freebsd_committer freebsd_triage 2024-06-12 16:42:32 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=7ad7453748e2adafa1e1a3e44b02fc852d4c5301

commit 7ad7453748e2adafa1e1a3e44b02fc852d4c5301
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-05-22 02:24:51 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-06-12 13:58:36 +0000

    LinuxKPI: 802.11: change teardown order to avoid iwlwifi firmware crashes

    While the previous order worked well for iwlwifi 22000 and later chipsets
    (AXxxx, BE200), earlier chipsets had trouble and ran into firmware crashes.
    Change the teardown order to avoid these problems.  The inline comments
    in lkpi_sta_run_to_init() (and lkpi_disassoc()) try to document the new
    order and also the old problems we were seeing (too early sta removal or
    silent non-removal) leading to follow-up problems.

    There is a possible further problem still lingering but a lot harder to
    trigger (see comment in review) and likely related to some other doings
    so we'll track it separately.

    Sponsored by:   The FreeBSD Foundation
    PR:             275255
    Tested with:    AX210, 8265 (bz); 9260 (Bakul Shah)
    Differential Revision: https://reviews.freebsd.org/D45293

    (cherry picked from commit 5a4d24610fc6143ac1d570fe2b5160e8ae893c2c)

 sys/compat/linuxkpi/common/src/linux_80211.c | 84 ++++++++++++++++++----------
 1 file changed, 55 insertions(+), 29 deletions(-)
Comment 23 commit-hook freebsd_committer freebsd_triage 2024-06-14 18:43:24 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=def43d8a4a3c0a2868fe74ef6aefe16c435ea19c

commit def43d8a4a3c0a2868fe74ef6aefe16c435ea19c
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-05-22 02:24:51 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-06-14 14:55:16 +0000

    LinuxKPI: 802.11: change teardown order to avoid iwlwifi firmware crashes

    While the previous order worked well for iwlwifi 22000 and later chipsets
    (AXxxx, BE200), earlier chipsets had trouble and ran into firmware crashes.
    Change the teardown order to avoid these problems.  The inline comments
    in lkpi_sta_run_to_init() (and lkpi_disassoc()) try to document the new
    order and also the old problems we were seeing (too early sta removal or
    silent non-removal) leading to follow-up problems.

    There is a possible further problem still lingering but a lot harder to
    trigger (see comment in review) and likely related to some other doings
    so we'll track it separately.

    Sponsored by:   The FreeBSD Foundation
    PR:             275255
    Tested with:    AX210, 8265 (bz); 9260 (Bakul Shah)
    Differential Revision: https://reviews.freebsd.org/D45293

    (cherry picked from commit 5a4d24610fc6143ac1d570fe2b5160e8ae893c2c)

 sys/compat/linuxkpi/common/src/linux_80211.c | 84 ++++++++++++++++++----------
 1 file changed, 55 insertions(+), 29 deletions(-)
Comment 24 Bjoern A. Zeeb freebsd_committer freebsd_triage 2024-06-14 20:05:56 UTC
We believe this is fixed in main, stable/13 and stable/14.

If you run a release before (not including) 14.2 or 13.4, there is no need to report it anymore.
If you have a chance in these cases to try a stable branch after the commits, that would be great.

Thanks to everyone who reported the problem, provided debug information or tested the change(s).