Bug 276083 - iwlwifi development in private build: 802.11n crash in iwl_mvm_sta_fw_id_mask()
Summary: iwlwifi development in private build: 802.11n crash in iwl_mvm_sta_fw_id_mask()
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: wireless (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: Cheng Cui
URL:
Keywords: tracking
Depends on:
Blocks: 276095
  Show dependency treegraph
 
Reported: 2024-01-02 19:33 UTC by Cheng Cui
Modified: 2024-04-01 23:47 UTC (History)
4 users (show)

See Also:


Attachments
info for a crash of 11na in an iwlwifi private development build (19.61 KB, text/plain)
2024-01-02 19:33 UTC, Cheng Cui
no flags Details
grp,cmd prints during 802.11n association of iwlwifi Linux (63.00 KB, text/plain)
2024-01-10 20:29 UTC, Cheng Cui
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Cheng Cui freebsd_committer freebsd_triage 2024-01-02 19:33:26 UTC
Created attachment 247417 [details]
info for a crash of 11na in an iwlwifi private development build

Per discussion with Bojern, debug the firmware error "lkpi_ic_ampdu_rx_start: mo_ampdu_action returned -60" after enabling 11na in a private build for development.

Attached log file contains reproduce method and the crash backtraces in detail.

quote from Bjoern:
"I would suggest you can go and start debugging the code path from the
call from LinuxKPI to mo_ampdu_action -> iwlwifi -> downcall into the
firmware and see if you find the problem that makes the firmware
"stall"."

quote of the firmware error messages:
iwlwifi0: lkpi_ic_ampdu_rx_start: mo_ampdu_action returned -60. ni 0xfffffe00bdfbb000 rap 0xfffffe00bdfbbd58
iwlwifi0: HCMD_ACTIVE already clear for command UNKNOWN
iwlwifi0: No beacon heard and the session protection is over already...
iwlwifi0: linuxkpi_ieee80211_connection_loss: vif 0xfffffe00bdf1fc80 vap 0xfffffe00bdf1f010 state RUN
iwlwifi0: Failed to send flush command (-5)
iwlwifi0: flush request fail
iwlwifi0: Couldn't send the SESSION_PROTECTION_CMD
WARNING trans->state != IWL_TRANS_FW_ALIVE failed at /usr/src/sys/contrib/dev/iwlwifi/iwl-trans.h:1446
iwlwifi0: iwl_trans_wait_txq_empty bad state = 0
iwlwifi0: Failed to trigger RX queues sync (-5)
iwlwifi0: Failed to synchronize multicast groups update
iwlwifi0: Failed to send MAC_CONFIG_CMD (action:2): -5
iwlwifi0: failed to update MAC 0xfffffe00bdf1fc9eM
iwlwifi0: Failed to send flush command (-5)
iwlwifi0: Failed to send MAC_CONFIG_CMD (action:2): -5
iwlwifi0: failed to update MAC 0xfffffe00bdf1fc9eM
iwlwifi0: Failed to send LINK_CONFIG_CMD (action:2): -5
iwlwifi0: PHY ctxt cmd error. ret=-5
Comment 1 Cheng Cui freebsd_committer freebsd_triage 2024-01-05 15:00:31 UTC
For the iwlwifi0: lkpi_ic_ampdu_rx_start: mo_ampdu_action returned -60 error,
"-60" means "-ETIMEDOUT".

Here are the function call sequences:
lkpi_ic_ampdu_rx_start
 lkpi_80211_mo_ampdu_action
  iwl_mvm_mac_ampdu_action
   iwl_mvm_sta_rx_agg
    iwl_mvm_fw_baid_op
     iwl_mvm_fw_baid_op_cmd
      iwl_mvm_send_cmd_pdu_status
       iwl_mvm_send_cmd_status
        iwl_trans_send_cmd
         iwl_trans_txq_send_hcmd
          iwl_trans_txq_send_hcmd_sync

In iwl_trans_txq_send_hcmd_sync(), the code shows the event wait timeout:

static int iwl_trans_txq_send_hcmd_sync(struct iwl_trans *trans,
					struct iwl_host_cmd *cmd)
{
...
	cmd_idx = trans->ops->send_cmd(trans, cmd);   << cmd_idx=20 in my print
	if (cmd_idx < 0) {
		ret = cmd_idx;
		clear_bit(STATUS_SYNC_HCMD_ACTIVE, &trans->status);
		IWL_ERR(trans, "Error sending %s: enqueue_hcmd failed: %d\n",
			cmd_str, ret);
		return ret;
	}

	ret = wait_event_timeout(trans->wait_command_queue,
				 !test_bit(STATUS_SYNC_HCMD_ACTIVE,
					   &trans->status),
				 HOST_COMPLETE_TIMEOUT);
	if (!ret) {
		IWL_ERR(trans, "Error sending %s: time out after %dms.\n",
			cmd_str, jiffies_to_msecs(HOST_COMPLETE_TIMEOUT));

		IWL_ERR(trans, "Current CMD queue read_ptr %d write_ptr %d\n",
			txq->read_ptr, txq->write_ptr);

		clear_bit(STATUS_SYNC_HCMD_ACTIVE, &trans->status);
		IWL_DEBUG_INFO(trans, "Clearing HCMD_ACTIVE for command %s\n",
			       cmd_str);
		ret = -ETIMEDOUT;                       << -ETIMEDOUT is returned here

		iwl_trans_sync_nmi(trans);
		goto cancel;
	}
...

Will dig further from here.
Comment 2 Cheng Cui freebsd_committer freebsd_triage 2024-01-05 15:04:23 UTC
(In reply to Cheng Cui from comment #1)

Also the cmd_str shows "UNKNOWN", I guess the cmd_idx=20 played something wrong or the idx is larger than the CMD queue size.

based on error messages:
iwlwifi0: Error sending UNKNOWN: time out after 2000ms.
iwlwifi0: Current CMD queue read_ptr 16 write_ptr 17
Comment 3 Cheng Cui freebsd_committer freebsd_triage 2024-01-05 20:29:45 UTC
(In reply to Cheng Cui from comment #2)

In private build I triggered a panic to see backtraces:
iwlwifi0: cc: iwl_get_cmd_string ret->cmd_id=0 ret_str=UNKNOWN
iwlwifi0: cc: iwl_trans_txq_send_hcmd_sync cmd_idx=12 ret=0
iwlwifi0: Error sending UNKNOWN: time out after 2000ms.
iwlwifi0: Current CMD queue read_ptr 140 write_ptr 141
panic: cc: iwl_trans_txq_send_hcmd_sync shall panic here

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=textdump@entry=0) at /usr/src/sys/kern/kern_shutdown.c:403
#2  0xffffffff804a4c1a in db_dump (dummy=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, 
    dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:590
#3  0xffffffff804a4a1d in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=true)
    at /usr/src/sys/ddb/db_command.c:503
#4  0xffffffff804a46dd in db_command_loop () at /usr/src/sys/ddb/db_command.c:550
#5  0xffffffff804a7db6 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:267
#6  0xffffffff80ba0a63 in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe007ebd5390)
    at /usr/src/sys/kern/subr_kdb.c:790
#7  0xffffffff81058839 in trap (frame=0xfffffe007ebd5390) at /usr/src/sys/amd64/amd64/trap.c:606
#8  <signal handler called>
#9  kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556
#10 0xffffffff80b51623 in vpanic (fmt=0xffffffff82680f62 "cc: %s shall panic here\n", ap=ap@entry=0xfffffe007ebd55c0)
    at /usr/src/sys/kern/kern_shutdown.c:953
#11 0xffffffff80b51403 in panic (fmt=0xffffffff81980420 <cnputs_mtx> "MW\025\201\377\377\377\377")
    at /usr/src/sys/kern/kern_shutdown.c:889
#12 0xffffffff8267ecd7 in iwl_trans_txq_send_hcmd_sync (trans=0xfffffe00804d4040, cmd=0xfffffe007ebd56e8)
    at /usr/src/sys/contrib/dev/iwlwifi/queue/tx.c:1816
#13 iwl_trans_txq_send_hcmd (trans=0xfffffe00804d4040, cmd=0xfffffe007ebd56e8)
    at /usr/src/sys/contrib/dev/iwlwifi/queue/tx.c:1904
#14 0xffffffff8261b12c in iwl_trans_send_cmd (trans=0xfffffe00804d4040, cmd=cmd@entry=0xfffffe007ebd56e8)
    at /usr/src/sys/contrib/dev/iwlwifi/iwl-trans.c:191
#15 0xffffffff8265e054 in iwl_mvm_send_cmd_status (mvm=mvm@entry=0xfffffe00bde30448, cmd=cmd@entry=0xfffffe007ebd56e8, 
    status=status@entry=0xfffffe007ebd57a0) at /usr/src/sys/contrib/dev/iwlwifi/mvm/utils.c:100
#16 0xffffffff8265e17f in iwl_mvm_send_cmd_pdu_status (mvm=0xffffffff81980420 <cnputs_mtx>, mvm@entry=0xfffffe00bde30448, 
    id=id@entry=1302, len=len@entry=16, data=0xffffffff8121d172, data@entry=0xfffffe007ebd5748, status=0x0, 
    status@entry=0xfffffe007ebd57a0) at /usr/src/sys/contrib/dev/iwlwifi/mvm/utils.c:139
#17 0xffffffff82650b05 in iwl_mvm_fw_baid_op_cmd (mvm=0xfffffe00bde30448, sta=0xfffff80003dc2880, start=true, tid=0, ssn=0, 
    buf_size=64, baid=-5) at /usr/src/sys/contrib/dev/iwlwifi/mvm/sta.c:2849
#18 iwl_mvm_fw_baid_op (mvm=0xfffffe00bde30448, sta=0xfffff80003dc2880, start=true, tid=0, ssn=0, buf_size=64, baid=-1)
    at /usr/src/sys/contrib/dev/iwlwifi/mvm/sta.c:2876
#19 iwl_mvm_sta_rx_agg (mvm=mvm@entry=0xfffffe00bde30448, sta=sta@entry=0xfffff80003dc2880, tid=0, ssn=0, start=true, 
    buf_size=64, timeout=0) at /usr/src/sys/contrib/dev/iwlwifi/mvm/sta.c:2949
#20 0xffffffff8262e640 in iwl_mvm_mac_ampdu_action (hw=<optimized out>, vif=0xfffffe00bdf67c80, params=<optimized out>)
    at /usr/src/sys/contrib/dev/iwlwifi/mvm/mac80211.c:983
#21 0xffffffff80dd9e31 in lkpi_ic_ampdu_rx_start (ni=0xfffffe00be003000, rap=0xfffffe00be003d58, baparamset=4099, 
    batimeout=0, baseqctl=0) at /usr/src/sys/compat/linuxkpi/common/src/linux_80211.c:3788
#22 0xffffffff80ce1033 in ht_recv_action_ba_addba_request (ni=0xfffffe00be003000, wh=<optimized out>, frm=<optimized out>, 
    efrm=<optimized out>) at /usr/src/sys/net80211/ieee80211_ht.c:2441
#23 0xffffffff80d0b5d3 in sta_input (ni=0xfffffe00be003000, m=0xfffff80007b01500, rxs=0xfffffe007ebd59f8, rssi=-110, 
    nf=<optimized out>) at /usr/src/sys/net80211/ieee80211_sta.c:973
#24 0xffffffff80ce5816 in ieee80211_input_mimo (ni=ni@entry=0xfffffe00be003000, m=m@entry=0xfffff80007b01500)
    at /usr/src/sys/net80211/ieee80211_input.c:100
#25 0xffffffff80ddace9 in linuxkpi_ieee80211_rx (hw=0xfffffe00bde30380, skb=0xfffff80005770800, 
    sta=sta@entry=0xfffff80003dc2880, napi=<optimized out>, list=list@entry=0x0)
    at /usr/src/sys/compat/linuxkpi/common/src/linux_80211.c:4845
#26 0xffffffff82643587 in ieee80211_rx_napi (hw=0xffffffff81980420 <cnputs_mtx>, sta=0xfffff80003dc2880, 
    skb=0xfffff80005770800, napi=0xfffffe00bde3c098) at /usr/src/sys/compat/linuxkpi/common/include/net/mac80211.h:1618
#27 iwl_mvm_pass_packet_to_mac80211 (mvm=0xfffffe00bde30448, napi=0xfffffe00bde3c098, skb=0xfffff80005770800, queue=0, 
    sta=0xfffff80003dc2880, link_sta=0xfffff80003dc2938) at /usr/src/sys/contrib/dev/iwlwifi/mvm/rxmq.c:257
#28 iwl_mvm_rx_mpdu_mq (mvm=0xfffffe00bde30448, napi=0xfffffe00bde3c098, rxb=<optimized out>, queue=0)
    at /usr/src/sys/contrib/dev/iwlwifi/mvm/rxmq.c:2645
#29 0xffffffff8266de5f in iwl_op_mode_rx (op_mode=0xffffffff81980420 <cnputs_mtx>, napi=0xfffffe00bde3c098, 
    rxb=0xfffffe007ebd5d50) at /usr/src/sys/contrib/dev/iwlwifi/iwl-op-mode.h:139
#30 iwl_pcie_rx_handle_rb (rxq=0xfffffe00bde3c000, i=<optimized out>, trans=<optimized out>, rxb=<optimized out>, 
    emergency=<optimized out>) at /usr/src/sys/contrib/dev/iwlwifi/pcie/rx.c:1381
#31 iwl_pcie_rx_handle (trans=trans@entry=0xfffffe00804d4040, queue=<optimized out>, budget=budget@entry=64)
    at /usr/src/sys/contrib/dev/iwlwifi/pcie/rx.c:1568
#32 0xffffffff8266d8fd in iwl_pcie_napi_poll_msix (napi=0xfffffe00bde3c098, budget=64)
    at /usr/src/sys/contrib/dev/iwlwifi/pcie/rx.c:1044
#33 0xffffffff80deb8ff in lkpi_napi_task (ctx=0xfffffe00bde3c098, pending=<optimized out>)
    at /usr/src/sys/compat/linuxkpi/common/src/linux_netdev.c:302
#34 0xffffffff80bb743b in taskqueue_run_locked (queue=queue@entry=0xfffff8000782e900)
    at /usr/src/sys/kern/subr_taskqueue.c:511
#35 0xffffffff80bb84f3 in taskqueue_thread_loop (arg=arg@entry=0xfffffe00804d6120) at /usr/src/sys/kern/subr_taskqueue.c:823
#36 0xffffffff80b06ec2 in fork_exit (callout=0xffffffff80bb8420 <taskqueue_thread_loop>, arg=0xfffffe00804d6120, 
    frame=0xfffffe007ebd5f40) at /usr/src/sys/kern/kern_fork.c:1157
#37 <signal handler called>

(kgdb) f 12
#12 0xffffffff8267ecd7 in iwl_trans_txq_send_hcmd_sync (trans=0xfffffe00804d4040, cmd=0xfffffe007ebd56e8)
    at /usr/src/sys/contrib/dev/iwlwifi/queue/tx.c:1816
1816			panic("cc: %s shall panic here\n", __func__);
(kgdb) l iwl_trans_txq_send_hcmd_sync
1772	#define HOST_COMPLETE_TIMEOUT	(2 * HZ)
1773	
1774	static int iwl_trans_txq_send_hcmd_sync(struct iwl_trans *trans,
1775						struct iwl_host_cmd *cmd)
1776	{
1777		const char *cmd_str = iwl_get_cmd_string(trans, cmd->id);
1778		struct iwl_txq *txq = trans->txqs.txq[trans->txqs.cmd.q_id];
1779		int cmd_idx;
1780		int ret;
1781	
(kgdb) 
1782		IWL_DEBUG_INFO(trans, "Attempting to send sync command %s\n", cmd_str);
1783	
1784		if (WARN(test_and_set_bit(STATUS_SYNC_HCMD_ACTIVE,
1785					  &trans->status),
1786			 "Command %s: a command is already active!\n", cmd_str))
1787			return -EIO;
1788	
1789		IWL_DEBUG_INFO(trans, "Setting HCMD_ACTIVE for command %s\n", cmd_str);
1790	
1791		cmd_idx = trans->ops->send_cmd(trans, cmd);
(kgdb) 
1792		if (cmd_idx < 0) {
1793			ret = cmd_idx;
1794			clear_bit(STATUS_SYNC_HCMD_ACTIVE, &trans->status);
1795			IWL_ERR(trans, "Error sending %s: enqueue_hcmd failed: %d\n",
1796				cmd_str, ret);
1797			return ret;
1798		}
1799	
1800		ret = wait_event_timeout(trans->wait_command_queue,
1801					 !test_bit(STATUS_SYNC_HCMD_ACTIVE,
(kgdb) 
1802						   &trans->status),
1803					 HOST_COMPLETE_TIMEOUT);
1804		if (!ret) {
1805			IWL_ERR(trans, "cc: %s cmd_idx=%d ret=%d\n", __func__, cmd_idx, ret);
1806			IWL_ERR(trans, "Error sending %s: time out after %dms.\n",
1807				cmd_str, jiffies_to_msecs(HOST_COMPLETE_TIMEOUT));
1808	
1809			IWL_ERR(trans, "Current CMD queue read_ptr %d write_ptr %d\n",
1810				txq->read_ptr, txq->write_ptr);
1811	
(kgdb) 
1812			clear_bit(STATUS_SYNC_HCMD_ACTIVE, &trans->status);
1813			IWL_DEBUG_INFO(trans, "Clearing HCMD_ACTIVE for command %s\n",
1814				       cmd_str);
1815	
1816			panic("cc: %s shall panic here\n", __func__);          << hit here
1817			ret = -ETIMEDOUT;
1818	
1819			iwl_trans_sync_nmi(trans);
1820			goto cancel;
1821		}
(kgdb) p cmd_str
$1 = 0xffffffff82682169 "UNKNOWN"
(kgdb) p cmd
$4 = (struct iwl_host_cmd *) 0xfffffe007ebd56e8
(kgdb) p *cmd
$5 = {data = {0xfffffe007ebd5748, 0x0}, resp_pkt = 0x0, _page = 0x0, _rx_page_order = 0, flags = 2, id = 1302, len = {16, 0}, 
  dataflags = "\000"}
(kgdb) p cmd->id
$6 = 1302
(kgdb) p (cmd->id & 0xFF00) >> 8	// iwl_cmd_groupid
$7 = 5                              << grp=5
(kgdb) p cmd->id & 0xFF				// iwl_cmd_opcode
$8 = 22                             << cmd=22
(kgdb) p trans->command_groups
$9 = (const struct iwl_hcmd_arr *) 0xffffffff8269a080 <iwl_mvm_groups>
(kgdb) p trans->command_groups_size
$10 = 13
(kgdb) p trans->command_groups[5].arr
$11 = (const struct iwl_hcmd_names *) 0xffffffff8269a840 <iwl_mvm_data_path_names>
(kgdb) p *trans->command_groups[5].arr
$14 = {cmd_id = 0 '\000', cmd_name = 0xffffffff8268699b "DQA_ENABLE_CMD"}
(kgdb) p *(trans->command_groups[5].arr+1)
$15 = {cmd_id = 1 '\001', cmd_name = 0xffffffff82685561 "UPDATE_MU_GROUPS_CMD"}
(kgdb) p *(trans->command_groups[5].arr+2)
$16 = {cmd_id = 2 '\002', cmd_name = 0xffffffff82680a46 "TRIGGER_RX_QUEUES_NOTIF_CMD"}
(kgdb) p *(trans->command_groups[5].arr+3)
$17 = {cmd_id = 7 '\a', cmd_name = 0xffffffff8268d23b "STA_HE_CTXT_CMD"}
(kgdb) p *(trans->command_groups[5].arr+4)
$18 = {cmd_id = 8 '\b', cmd_name = 0xffffffff8268b4e1 "RLC_CONFIG_CMD"}
(kgdb) p *(trans->command_groups[5].arr+5)
$19 = {cmd_id = 13 '\r', cmd_name = 0xffffffff82681dc9 "RFH_QUEUE_CONFIG_CMD"}
(kgdb) p *(trans->command_groups[5].arr+6)
$20 = {cmd_id = 15 '\017', cmd_name = 0xffffffff82680030 "TLC_MNG_CONFIG_CMD"}
(kgdb) p *(trans->command_groups[5].arr+7)
$21 = {cmd_id = 20 '\024', cmd_name = 0xffffffff82685576 "CHEST_COLLECTOR_FILTER_CONFIG_CMD"}
(kgdb) p *(trans->command_groups[5].arr+8)
$22 = {cmd_id = 23 '\027', cmd_name = 0xffffffff8268887b "SCD_QUEUE_CONFIG_CMD"}
(kgdb) p *(trans->command_groups[5].arr+9)
$23 = {cmd_id = 24 '\030', cmd_name = 0xffffffff82691716 "SEC_KEY_CMD"}
(kgdb) p *(trans->command_groups[5].arr+10)
$24 = {cmd_id = 244 '\364', cmd_name = 0xffffffff82687db2 "MONITOR_NOTIF"}
(kgdb) p *(trans->command_groups[5].arr+11)
$25 = {cmd_id = 246 '\366', cmd_name = 0xffffffff8268a000 "THERMAL_DUAL_CHAIN_REQUEST"}
(kgdb) p *(trans->command_groups[5].arr+12)
$26 = {cmd_id = 253 '\375', cmd_name = 0xffffffff8268aa94 "STA_PM_NOTIF"}
(kgdb) p *(trans->command_groups[5].arr+13)
$27 = {cmd_id = 254 '\376', cmd_name = 0xffffffff826920ea "MU_GROUP_MGMT_NOTIF"}
(kgdb) p *(trans->command_groups[5].arr+14)
$28 = {cmd_id = 255 '\377', cmd_name = 0xffffffff8268e54f "RX_QUEUES_NOTIFICATION"}

In iwl_get_cmd_string(), the ret was NULL because of no match found. 
	ret = bsearch(&cmd, arr->arr, arr->size, size, iwl_hcmd_names_cmp);

So there was no match of such "cmd=22" from the "trans->command_groups[5].arr".

cmd=22== 0x16, which shall be requesting: "RX_BAID_ALLOCATION_CONFIG_CMD" according to "datapath.h".

Stay tuned.
Comment 4 Cheng Cui freebsd_committer freebsd_triage 2024-01-08 17:50:01 UTC
According to iwl_mvm_fw_baid_op_cmd(), the group is DATA_PATH_GROUP (grp=5)
and the cmd is 22 which means "RX_BAID_ALLOCATION_CONFIG_CMD". I am wondering
why trans->command_groups[5].arr did not have the "cmd_id = 22" entry.

enum iwl_mvm_command_groups {
	LEGACY_GROUP = 0x0,
	LONG_GROUP = 0x1,
	SYSTEM_GROUP = 0x2,
	MAC_CONF_GROUP = 0x3,
	PHY_OPS_GROUP = 0x4,
	DATA_PATH_GROUP = 0x5,		<< grp=5
	SCAN_GROUP = 0x6,
	NAN_GROUP = 0x7,
	LOCATION_GROUP = 0x8,
	PROT_OFFLOAD_GROUP = 0xb,
	REGULATORY_AND_NVM_GROUP = 0xc,
	DEBUG_GROUP = 0xf,
};

enum iwl_data_path_subcmd_ids {
...
	/**
	 * @RX_BAID_ALLOCATION_CONFIG_CMD: Allocate/deallocate a BAID for an RX
	 *	blockack session, uses &struct iwl_rx_baid_cfg_cmd for the
	 *	command, and &struct iwl_rx_baid_cfg_resp as a response.
	 */
	RX_BAID_ALLOCATION_CONFIG_CMD = 0x16,	<< 0x16==22
...
}
Comment 5 Cheng Cui freebsd_committer freebsd_triage 2024-01-08 21:47:43 UTC
(In reply to Cheng Cui from comment #4)

Here, "RX_BAID_ALLOCATION_CONFIG_CMD" is not added in iwl_mvm_data_path_names[]. And this is not just a problem in 11na, 11a also has this same problem. I wonder why iwl_mvm_fw_baid_op_cmd() was added in below Linux patch without adding "RX_BAID_ALLOCATION_CONFIG_CMD" in iwl_mvm_data_path_names[].

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=97f70c56e6ca1e9bad1cbf6d3b0592f150bb95b1

/* Please keep this array *SORTED* by hex value.
 * Access is done through binary search
 */
static const struct iwl_hcmd_names iwl_mvm_data_path_names[] = {
	HCMD_NAME(DQA_ENABLE_CMD),
	HCMD_NAME(UPDATE_MU_GROUPS_CMD),
	HCMD_NAME(TRIGGER_RX_QUEUES_NOTIF_CMD),
	HCMD_NAME(STA_HE_CTXT_CMD),
	HCMD_NAME(RLC_CONFIG_CMD),
	HCMD_NAME(RFH_QUEUE_CONFIG_CMD),
	HCMD_NAME(TLC_MNG_CONFIG_CMD),
	HCMD_NAME(CHEST_COLLECTOR_FILTER_CONFIG_CMD),
	HCMD_NAME(SCD_QUEUE_CONFIG_CMD),
	HCMD_NAME(SEC_KEY_CMD),
	HCMD_NAME(MONITOR_NOTIF),
	HCMD_NAME(THERMAL_DUAL_CHAIN_REQUEST),
	HCMD_NAME(STA_PM_NOTIF),
	HCMD_NAME(MU_GROUP_MGMT_NOTIF),
	HCMD_NAME(RX_QUEUES_NOTIFICATION),
};
Comment 6 Cheng Cui freebsd_committer freebsd_triage 2024-01-09 20:12:01 UTC
(In reply to Cheng Cui from comment #5)

This problem grp=5 cmd=22 exists in 11na and 11ng association.

on FreeBSD 11a, 11b and 11g, there is no grp=5 cmd=22 during association.

Also, I noticed in 11a, the iwl_mvm_mld_free_sta_link() is called frequently after association, but not sure if it is related with the panic.

Next I will test a custom Linux kernel to see the pattern of grp & cmd.
Comment 7 Cheng Cui freebsd_committer freebsd_triage 2024-01-10 20:29:24 UTC
Created attachment 247579 [details]
grp,cmd prints during 802.11n association of iwlwifi Linux
Comment 8 Cheng Cui freebsd_committer freebsd_triage 2024-01-10 20:30:44 UTC
(In reply to Cheng Cui from comment #7)

Built a custom Linux kernel in my Ubuntu22.04 VM with the same debug print
added in iwl_get_cmd_string().

With 2.4Ghz and 5Ghz bands, tested iwlwifi of Linux 802.11n in both bands
(11ng and 11na), but didn't find the combination of (grp=5 cmd=22). Full print
info is attached in file named "linux_iwl_print.log".

So there are some possibilities of the problem that makes the firmware "stall".
[1] the firmware does not recognize (grp=5 cmd=22), so it freeze
[2] iwl_mvm_fw_baid_op_cmd() is not used in Linux, but used in FreeBSD

Will further explore the above speculations.
Comment 9 Cheng Cui freebsd_committer freebsd_triage 2024-01-12 16:12:41 UTC
(In reply to Cheng Cui from comment #8)

In Linux kernel v5.10, which is used in my Ubuntu Linux 22.04 VM (v5.15.0-91),
iwlagn_mac_ampdu_action() is used instead of iwl_mvm_mac_ampdu_action() for
option IEEE80211_AMPDU_RX_START. The following are my inspections of the
calling traces that prove iwl_mvm_fw_baid_op_cmd() is no longer used to send
out (grp=5 cmd=22 == RX_BAID_ALLOCATION_CONFIG_CMD) for RX_AMPDU.

linux kernel v5.10:
 iwl_sta_rx_agg_start
  iwl_send_add_sta
   iwl_dvm_send_cmd
    iwl_trans_send_cmd
     iwl_trans_pcie_gen2_send_hcmd
Comment 10 Cheng Cui freebsd_committer freebsd_triage 2024-01-12 16:13:58 UTC
(In reply to Cheng Cui from comment #9)
In FreeBSD, once I disabled HT_RXAGG, the 802.11na association worked as
follows:

root@n1_iwl_vm:~ # sysctl -d compat.linuxkpi.iwlwifi_11n_disable
compat.linuxkpi.iwlwifi_11n_disable: disable 11n functionality, bitmap: 1: full, 2: disable agg TX, 4: disable agg RX, 8 enable agg TX
root@n1_iwl_vm:~ # sysctl compat.linuxkpi.iwlwifi_11n_disable
compat.linuxkpi.iwlwifi_11n_disable: 4

root@n1_iwl_vm:~ # ifconfig wlan0
wlan0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=0
	ether e0:2e:0b:92:e5:82
	groups: wlan
	ssid TP-LINK_5G channel 48 (5240 MHz 11a ht/20) bssid 30:b5:c2:aa:83:34
	regdomain FCC country US authmode WPA2/802.11i privacy ON
	deftxkey UNDEF TKIP 2:128-bit txpower 17 bmiss 7 mcastrate 6
	mgmtrate 6 scanvalid 60 ampdulimit 64k ampdudensity 4 -amsdutx amsdurx
	shortgi -ldpctx ldpcrx -uapsd wme roaming MANUAL
	parent interface: iwlwifi0
	media: IEEE 802.11 Wireless Ethernet MCS mode 11na
	status: associated
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
root@n1_iwl_vm:~ # dhclient wlan0
DHCPREQUEST on wlan0 to 255.255.255.255 port 67
DHCPACK from 192.168.0.1
Bogus Host Name option 12: n1_iwl_vm (n1_iwl_vm)
bound to 192.168.0.190 -- renewal in 3600 seconds.
root@n1_iwl_vm:~ # ifconfig wlan0
wlan0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=0
	ether e0:2e:0b:92:e5:82
	inet 192.168.0.190 netmask 0xffffff00 broadcast 192.168.0.255
	groups: wlan
	ssid TP-LINK_5G channel 48 (5240 MHz 11a ht/20) bssid 30:b5:c2:aa:83:34
	regdomain FCC country US authmode WPA2/802.11i privacy ON
	deftxkey UNDEF TKIP 2:128-bit txpower 17 bmiss 7 mcastrate 6
	mgmtrate 6 scanvalid 60 ampdulimit 64k ampdudensity 4 -amsdutx amsdurx
	shortgi -ldpctx ldpcrx -uapsd wme roaming MANUAL
	parent interface: iwlwifi0
	media: IEEE 802.11 Wireless Ethernet MCS mode 11na
	status: associated
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
Comment 11 Cheng Cui freebsd_committer freebsd_triage 2024-01-12 16:42:59 UTC
(In reply to Cheng Cui from comment #9)
Well, comment #9 is not quite accurate.

Linux kernel v5.10 does not call iwl_mvm_fw_baid_op_cmd() to send out (grp=5 cmd=22), but the call traces are not right as they are from dvm, not mvm.

And the latest Linux kernel master branch has the call to iwl_mvm_fw_baid_op_cmd(), I need more time to figure out what's going on.
Comment 12 Cheng Cui freebsd_committer freebsd_triage 2024-01-12 21:25:20 UTC
(In reply to Cheng Cui from comment #11)
The latest Linux kernel v6.7.0 does send out the (grp=5 cmd=22) for RX_AMPDU.
I can confirm this from my custom build with prints in iwl_get_cmd_string(). Although the return is also "UNKNOWN", there seems to be nothing wrong with the user experience on using 802.11na in Linux.

And the firmware version is the same as it in the current FreeBSD.

root@n2linuxvm:~ # lshw -C network | egrep firmware
       configuration: broadcast=yes driver=iwlwifi driverversion=6.7.0+ firmware=77.2df8986f.0 cc-a0-77.ucode latency=0 link=no multicast=yes wireless=IEEE 802.11

Need to dig deeper.
Comment 13 Cheng Cui freebsd_committer freebsd_triage 2024-01-18 21:37:24 UTC
(In reply to Cheng Cui from comment #12)
in fbsd:
lkpi_ic_ampdu_rx_start
 lkpi_80211_mo_ampdu_action
  iwl_mvm_mac_ampdu_action				
   iwl_mvm_sta_rx_agg					
    iwl_mvm_fw_baid_op
     iwl_mvm_fw_baid_op_cmd				<< RX_BAID_ALLOCATION_CONFIG_CMD added here
      iwl_mvm_send_cmd_pdu_status
       iwl_mvm_send_cmd_status
        iwl_trans_send_cmd
         iwl_trans_txq_send_hcmd
          iwl_trans_txq_send_hcmd_sync
           iwl_pcie_gen2_enqueue_hcmd

iwl_trans_txq_send_hcmd_sync called:
iwl_trans_sync_nmi
 iwl_trans_pcie_sync_nmi
  iwl_trans_sync_nmi_with_addr
   iwl_trans_fw_error				<< set IWL_TRANS_NO_FW
	iwl_op_mode_nic_error
	 iwl_mvm_nic_error				<< if fw crashes, nic will not be started
     
in linux v6.7
iwl_mvm_mac_ampdu_action		<< returns 0 on (grp=5 cmd=22)
 iwl_mvm_sta_rx_agg
  iwl_mvm_fw_baid_op
   iwl_mvm_fw_baid_op_cmd		<< start=true, (grp=5 cmd=22==RX_BAID_ALLOCATION_CONFIG_CMD) added here
    iwl_mvm_send_cmd_pdu_status
     iwl_mvm_send_cmd_status
      iwl_trans_send_cmd
       iwl_trans_txq_send_hcmd
        iwl_trans_txq_send_hcmd_sync << from which iwl_get_cmd_string() prints cc: *
         iwl_pcie_gen2_enqueue_hcmd  << linked by trans->ops->send_cmd
Comment 14 Cheng Cui freebsd_committer freebsd_triage 2024-01-23 00:17:46 UTC
(In reply to Cheng Cui from comment #13)
Also tried different versions of iwlwifi firmwares that could be found from 
the linux-firmware.git. For versions cc-a0-{77,74,73,72} the firmware crashes
on (grp=5 cmd=22), but for older versions cc-a0-{66,59,50} the firmware crashes
on (grp=1 cmd=24). So this problem is more likely from upper layer that sent
unacceptable/disordered cmds or a missing command, or a flag, etc..
Comment 15 Cheng Cui freebsd_committer freebsd_triage 2024-01-26 15:03:49 UTC
(In reply to Cheng Cui from comment #14)

Played with trace-cmd in Linux kernel for the iwlwifi function traces for more
than two days. It looks to be powerful, but too complex. At least for now, I still cannot figure out how to backtrack enough function traces from the iwl_trans_send_cmd() for various iwl cmds to the fw.

e.g.
"trace-cmd record -p function_graph -e iwlwifi nmcli radio wifi on"

Meanwhile, I am dumping out all the iwl cmds to the fw (Linux kernel 6.7.0 with the same fw version) during authentication/association, and try to understand the orders/workflow just under a few pkts exchange from the AP.
Comment 16 Cheng Cui freebsd_committer freebsd_triage 2024-02-08 17:04:30 UTC
(In reply to Cheng Cui from comment #15)

Dumped hex values of the hcmd (grp=5 cmd=22) manually in freebsd together with comparison with the same hcmd in Linux. Turned out there is nothing wrong with
the hcmd in values or size.

I also tested on a second AP where the association went well until the Add Block Ack Request packet was received to trigger the A-MPDU processing, which caused fw timeout:

iwlwifi0: 100162 2147469605 iwl_mvm_mac_sta_state_common IWL_DL_MAC80211 station 0xfffff8000d0ea928M state change 3->4
iwlwifi0: 100162 2147469606 iwl_pcie_gen2_enqueue_hcmd IWL_DL_HC Sending command REPLY_BEACON_FILTERING_CMD (01.d2), seq: 0x005F, 68 bytes at 95[31]:0
iwlwifi0: 100162 2147469606 iwl_pcie_gen2_enqueue_hcmd IWL_DL_HC Sending command MAC_CONFIG_CMD (03.08), seq: 0x0060, 60 bytes at 96[0]:0
iwlwifi0: 100162 2147469606 iwl_pcie_gen2_enqueue_hcmd IWL_DL_HC Sending command MAC_CONFIG_CMD (03.08), seq: 0x0061, 60 bytes at 97[1]:0
iwlwifi0: 100024 2147471399 iwl_pcie_gen2_enqueue_hcmd IWL_DL_HC Sending command STATISTICS_CMD (01.9c), seq: 0x0062, 12 bytes at 98[2]:0
iwlwifi0: 100024 2147471399 iwl_pcie_gen2_enqueue_hcmd IWL_DL_HC Sending command SCAN_CFG_CMD (01.0c), seq: 0x0063, 20 bytes at 99[3]:0
iwlwifi0: 100023 2147471725 iwl_pcie_gen2_enqueue_hcmd IWL_DL_HC Sending command SCAN_CFG_CMD (01.0c), seq: 0x0064, 20 bytes at 100[4]:0
iwlwifi0: 100160 2147471725 iwl_mvm_mac_ampdu_action IWL_DL_HT A-MPDU action on addr 0xfffff8000d0ea928M tid 6: action 0
iwlwifi0: Error sending SCAN_CFG_CMD: time out after 2000ms.         <<<
iwlwifi0: Current CMD queue read_ptr 100 write_ptr 101
iwlwifi0: Start IWL Error Log Dump:
iwlwifi0: Transport status: 0x0000004A, valid: 6
iwlwifi0: Loaded firmware version: 77.2df8986f.0 cc-a0-77.ucode
Comment 17 commit-hook freebsd_committer freebsd_triage 2024-02-22 22:49:34 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=759a996d610d9354aac5c48a6bdc9cedcba2f48b

commit 759a996d610d9354aac5c48a6bdc9cedcba2f48b
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-12 16:03:13 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-22 22:46:48 +0000

    LinuxKPI: 802.11: implement a deferred RX path

    Some calls, e.g., action frames cause us to call through all the
    way down to firmware from the RX path without any deferral in
    net80211.

    For LinuxKPI and iwlwifi this goes (with omissions) like this:
    lkpi_napi_task -> linuxkpi_ieee80211_rx -> ieee80211_input_mimo ->
    sta_input -> ht_recv_action_ba_addba_request ->
    lkpi_ic_ampdu_rx_start -> iwl_mvm_mac_ampdu_action ->
    iwl_trans_txq_send_hcmd.  At that point we are waiting for an
    interrupt from the firmware but given the lkpi_napi_task has not
    finished (and may have more to dispatch based on budget and what
    was received) we will not see the new interrupt/fw response.
    With no answer from the firmware, the software timeout in the
    driver kills the command and the firmware and issues a complete
    restart.

    Implement the deferred RX path in LinuxKPI for the moment.
    At a later point we should carefully shift this into net80211.

    This fixes the hangs for (*ic_ampdu_rx_start)() calls with iwlwifi.

    MFC after:      3 days
    PR:             276083
    Reviewed by:    cc
    Differential Revision: https://reviews.freebsd.org/D43968

 sys/compat/linuxkpi/common/src/linux_80211.c | 134 ++++++++++++++++++++++++---
 sys/compat/linuxkpi/common/src/linux_80211.h |  31 +++++++
 2 files changed, 154 insertions(+), 11 deletions(-)
Comment 18 commit-hook freebsd_committer freebsd_triage 2024-04-01 23:47:40 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=dd167e59079b65a53a45744055db6634f3c19a56

commit dd167e59079b65a53a45744055db6634f3c19a56
Author:     Bjoern A. Zeeb <bz@FreeBSD.org>
AuthorDate: 2024-02-12 16:03:13 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-04-01 23:46:10 +0000

    LinuxKPI: 802.11: implement a deferred RX path

    Some calls, e.g., action frames cause us to call through all the
    way down to firmware from the RX path without any deferral in
    net80211.

    For LinuxKPI and iwlwifi this goes (with omissions) like this:
    lkpi_napi_task -> linuxkpi_ieee80211_rx -> ieee80211_input_mimo ->
    sta_input -> ht_recv_action_ba_addba_request ->
    lkpi_ic_ampdu_rx_start -> iwl_mvm_mac_ampdu_action ->
    iwl_trans_txq_send_hcmd.  At that point we are waiting for an
    interrupt from the firmware but given the lkpi_napi_task has not
    finished (and may have more to dispatch based on budget and what
    was received) we will not see the new interrupt/fw response.
    With no answer from the firmware, the software timeout in the
    driver kills the command and the firmware and issues a complete
    restart.

    Implement the deferred RX path in LinuxKPI for the moment.
    At a later point we should carefully shift this into net80211.

    This fixes the hangs for (*ic_ampdu_rx_start)() calls with iwlwifi.

    PR:             276083
    Reviewed by:    cc

    (cherry picked from commit 759a996d610d9354aac5c48a6bdc9cedcba2f48b)

 sys/compat/linuxkpi/common/src/linux_80211.c | 134 ++++++++++++++++++++++++---
 sys/compat/linuxkpi/common/src/linux_80211.h |  31 +++++++
 2 files changed, 154 insertions(+), 11 deletions(-)