Bug 275616 - net80211: ieee80211_mbuf_adjust() et al mbuf assumptions no longer hold [was: bwn driver causes kernel panic after connect]
Summary: net80211: ieee80211_mbuf_adjust() et al mbuf assumptions no longer hold [was:...
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: wireless (show other bugs)
Version: 14.0-RELEASE
Hardware: Any Any
: --- Affects Some People
Assignee: Bjoern A. Zeeb
URL:
Keywords:
Depends on:
Blocks: 277512
  Show dependency treegraph
 
Reported: 2023-12-07 22:23 UTC by Frank Hilgendorf
Modified: 2024-03-05 20:16 UTC (History)
2 users (show)

See Also:
linimon: mfc-stable14?
linimon: mfc-stable13?


Attachments
extract from /var/log/messages with KASAN output (665.04 KB, text/plain)
2023-12-07 22:23 UTC, Frank Hilgendorf
no flags Details
vmcore backtrace with crypto_tkip used (6.86 KB, text/plain)
2023-12-08 13:39 UTC, Frank Hilgendorf
no flags Details
vmcore backtrace with crypto_ccmp used (7.04 KB, text/plain)
2023-12-08 13:40 UTC, Frank Hilgendorf
no flags Details
output of gdb, ieee80211_encap() *m (2.02 KB, text/plain)
2023-12-10 17:56 UTC, Frank Hilgendorf
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Frank Hilgendorf 2023-12-07 22:23:27 UTC
Created attachment 246887 [details]
extract from /var/log/messages with KASAN output

The broadcom bwn Wifi driver causes a kernel panic soon after a connection with an encrypted network is established. Soon means: immediate to approx. 30sec. after the connection is established.

- This did not happen with FreeBSD RELEASE 13.x
- This does not happen when connecting to an unencrypted network
- It crashes both with CCMP and TKIP encryption

Hardware:
————————-
Macbook Pro 3,1 with Broadcom BCM4321 wireless card

Setup:
______
At boot time, the bwn_v4_ucode firmware module is not loaded, the machine goes up and running.
When subsequently the firmware is loaded and the interface restarted,
it crashes after the connection is established.

  kldload bwn_v4_ucode
  service netif restart

The WLAN is a 2,4GHz 802.11g network.

wpa_supplicant.conf:

network={
	ssid="SomeNet"
	psk="SomePassword"
	pairwise=TKIP      # or CCMP
}

The attached messages extract contains KASAN output indicating 
trouble in the ieee_80211 crypto module.
Comment 1 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-12-07 22:35:49 UTC
What is the actual kernel panic?
Comment 2 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-12-07 22:44:49 UTC
Do you have IEEE80211_DEBUG enabled?

Also, if you have src+obj code, can you lookup and verify my assuptions based on main w/o kasan in kernel:

ieee80211_crypto_ccmp.c:669 ff:
  68 #1 0xffffffff81580899 at ccmp_encap+0xe99  (memcpy or debug error after?)
 193 #2 0xffffffff815808df at ccmp_encap+0xedf  (also around the memcpy, stats?)
ieee80211_crypto_ccmp.c:299:
  62 #1 0xffffffff81580953 at ccmp_encap+0xf53  (m_adj?)

I am just curious as the distribution later is quite wide, which indeed seems to indicate that the data buffer at that point isn't valid.

  25 #1 0xffffffff80599a6c at rijndaelEncrypt+0x25c
  25 #1 0xffffffff80599af7 at rijndaelEncrypt+0x2e7
  25 #1 0xffffffff80599b7c at rijndaelEncrypt+0x36c
  24 #1 0xffffffff80599c01 at rijndaelEncrypt+0x3f1
  26 #1 0xffffffff80599ca5 at rijndaelEncrypt+0x495
  27 #1 0xffffffff80599d2c at rijndaelEncrypt+0x51c
  21 #1 0xffffffff80599db1 at rijndaelEncrypt+0x5a1
  20 #1 0xffffffff80599e36 at rijndaelEncrypt+0x626

Strangely the ccmp code has not seen functional changes in years.
Comment 3 Frank Hilgendorf 2023-12-08 13:39:18 UTC
Created attachment 246904 [details]
vmcore backtrace with crypto_tkip used
Comment 4 Frank Hilgendorf 2023-12-08 13:40:33 UTC
Created attachment 246905 [details]
vmcore backtrace with crypto_ccmp used
Comment 5 Frank Hilgendorf 2023-12-08 13:44:13 UTC
- The actual panic is caused by a page fault.

- IEEE80211_DEBUG is enabled, the sysctl entry is present.

For me it seems that the root cause could be somewhere in the common crypto code, since both CCMP and TKIP dump core. Pls. find attached two vmcore backtraces.

I am not used working with gdb, but I will what Ican to help resolving the issue.
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2023-12-08 14:23:35 UTC
It would be useful (for me at least) to see the gdb backtrace from a KASAN report.  That is, either use addr2line to answer Bjoern's question from comment 2[*], or get a kernel dump from a KASAN kernel.

[*] I would also like to see what "ieee80211_encap+0x512" is.  To get this info:

$ sudo pkg install binutils
$ echo "ieee80211_encap+0x512" | /usr/local/bin/addr2line -e /usr/lib/debug/boot/kernel/kernel.debug

assuming that your kernel is installed at /boot/kernel.  (If it was installed to, for example, /boot/kernel.test, the debug file path would be /usr/lib/debug/boot/kernel.test/kernel.debug.)
Comment 7 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-12-08 15:57:18 UTC
Given I would assume the release kernel wouldn't ship with KASAN on, is this indeed releng/14.0 or is this stable/14 ?
What hash are you on according to uname -v?

And are you inded on 11g or on 11ng?
Comment 8 Frank Hilgendorf 2023-12-08 20:25:24 UTC
(In reply to Bjoern A. Zeeb from comment #7)
I checked out the releng/14.0 branch and compiled a GENERIC kernel with, KASAN BWN_DEBUG and BWN_GPL_PHY enabled, config file:

  include GENERIC-KASAN
  ident MACBOOKPRO31-KASAN

  options BWN_GPL_PHY
  options BWN_DEBUG

uname -v:

  FreeBSD 14.0-RELEASE-p2 #2 
  releng/14.0-n265396-06497fbd52e2: Fri Dec  8 20:57:05 CET 2023     
  root@hilschubookBSD:/usr/obj/usr/src/amd64.amd64/sys/MACBOOKPRO31-KASAN

(I had to recompile the KASAN Kernel because I unfortunately lost the old one)

WLAN: The access point is configured for 802.11gn, so both are possible.
I can try to figure out whether a g or n connection is established next time before the system crashes
Comment 9 Frank Hilgendorf 2023-12-08 20:58:58 UTC
(In reply to Mark Johnston from comment #6)
"ieee80211_encap+0x512" - the line is:
  /usr/src/sys/net80211/ieee80211_output.c:1666
Comment 10 Frank Hilgendorf 2023-12-08 21:11:56 UTC
(In reply to Bjoern A. Zeeb from comment #2)

ieee80211_crypto_ccmp.c:669 ff:
  68 #1 0xffffffff81580899 at ccmp_encap+0xe99  (memcpy or debug error after?)
    -> ieee80211_crypto_ccmp.c:325

 193 #2 0xffffffff815808df at ccmp_encap+0xedf  (also around the memcpy, stats?)
    -> ieee80211_crypto_ccmp.c:?

ieee80211_crypto_ccmp.c:299:
  62 #1 0xffffffff81580953 at ccmp_encap+0xf53  (m_adj?)
    -> ieee80211_crypto_ccmp.c:325

  25 #1 0xffffffff80599a6c at rijndaelEncrypt+0x25c
    -> rijndael-alg-fst.c:956

  25 #1 0xffffffff80599af7 at rijndaelEncrypt+0x2e7
    -> rijndael-alg-fst.c:962

  25 #1 0xffffffff80599b7c at rijndaelEncrypt+0x36c
    -> rijndael-alg-fst.c:968

  24 #1 0xffffffff80599c01 at rijndaelEncrypt+0x3f1
    -> rijndael-alg-fst.c:974

  26 #1 0xffffffff80599ca5 at rijndaelEncrypt+0x495
    -> rijndael-alg-fst.c:986

  27 #1 0xffffffff80599d2c at rijndaelEncrypt+0x51c
    -> rijndael-alg-fst.c:992

  21 #1 0xffffffff80599db1 at rijndaelEncrypt+0x5a1
    -> rijndael-alg-fst.c:998

  20 #1 0xffffffff80599e36 at rijndaelEncrypt+0x626
    -> rijndael-alg-fst.c:1004
Comment 11 Mark Johnston freebsd_committer freebsd_triage 2023-12-09 15:52:53 UTC
(In reply to Frank Hilgendorf from comment #9)
So that corresponds to:

1660                 /* NB: this could be optimized 'cuz of ieee80211_mbuf_adjust */                                                                                                                                                                                                                                          
1661                 m_adj(m, sizeof(struct ether_header) - sizeof(struct llc));                                                                                                                                                                                                                                              
1662                 llc = mtod(m, struct llc *);                                                                                                                                                                                                                                                                             
1663                 llc->llc_dsap = llc->llc_ssap = LLC_SNAP_LSAP;                                                                                                                                                                                                                                                           
1664                 llc->llc_control = LLC_UI;                                                                                                                                                                                                                                                                               
1665                 llc->llc_snap.org_code[0] = 0;                                                                                                                                                                                                                                                                           
1666                 llc->llc_snap.org_code[1] = 0; <--- starts here                                                                                                                                                                                                                                                                          
1667                 llc->llc_snap.org_code[2] = 0;                                                                                                                                                                                                                                                                           
1668                 llc->llc_snap.ether_type = eh.ether_type;

which seems to make sense, since KASAN reports 2 1-byte writes followed by a 2-byte write.

That is strange.  I suspect the bug is elsewhere.  Perhaps something (bwn?) concurrently freed the mbuf chain?
Comment 12 Frank Hilgendorf 2023-12-09 22:09:46 UTC
I repeated the test 5 times. The 5 reports first entry was always 'ieee80211_encap+0x512'. The bug seems to work deterministic.

If it was kind of a race condition, where thread B frees the mbuf currently processed by thread A, I would expect multiple KASAN reports to start with different code lines because they should not be so exactly synchronized that thread B would always hit exactly the same point of execution of thread A.

One more observation:
Running the test with an unencrypted WLAN connection produces no KASAN report at all.
Comment 13 Mark Johnston freebsd_committer freebsd_triage 2023-12-10 15:43:23 UTC
(In reply to Frank Hilgendorf from comment #12)
I think you're right.  In that case, it'd be useful to see a kernel dump from KASAN, i.e., configure KASAN to panic when a report is generated (which is the default setting).  In particular, I'd want to see the output of "p *m" from the ieee80211_encap() frame.
Comment 14 Mark Johnston freebsd_committer freebsd_triage 2023-12-10 15:56:33 UTC
Hmm, for bwn vap->iv_ic->ic_headroom is pretty big, it's equal to sizeof(struct bwn_txhdr), which is 118 bytes.

So, ieee80211_mbuf_adjust() is going to set needed_space = 118 + 24 + 8 for CCMP. If ieee80211_mbuf_adjust() needs to allocate a new mbuf header, the header will have MHLEN - 150 = 10 bytes of space.

The assertion "needed_space <= MHLEN" should actually be "needed_space + sizeof(struct ether_header) <= MHLEN", and I suspect that that assertion would fail for bwn.  I'm not sure what changed in 14.0 to break - maybe struct mbuf layout changes that made MHLEN smaller?
Comment 15 Mark Johnston freebsd_committer freebsd_triage 2023-12-10 16:20:06 UTC
(In reply to Mark Johnston from comment #14)
Oops, my comment about the assertion is wrong.  I don't really understand the frame layout.  But the problem is clear: ieee80211_mbuf_adjust() may return an mbuf with just MHLEN - 150 == 10 bytes left in its buffer, and ieee80211_encap() then calls m_adj(m, 6), leaving 4 bytes of space, but sizeof(struct llc) is 8 bytes.

And, in stable/13 sizeof(struct pkthdr) is 8 bytes smaller, so there was enough space then, but now it's not enough.
Comment 16 Frank Hilgendorf 2023-12-10 17:56:39 UTC
Created attachment 246957 [details]
output of gdb, ieee80211_encap() *m

in reply to comment #13
Comment 17 Mark Johnston freebsd_committer freebsd_triage 2023-12-10 22:40:57 UTC
(In reply to Frank Hilgendorf from comment #16)
Right, so here m_data is 0xfffffe006f8cb8fc, which is 4 bytes before the end of the buffer, but m_len is 8, which can't be right.  ieee80211_encap() is assuming that ieee80211_mbuf_adjust() returns an mbuf with at least sizeof(ether_header) byte of space, and that assumption is false now.  Hopefully Bjoern can comment on how this should be solved.
Comment 18 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-12-10 23:30:43 UTC
(In reply to Mark Johnston from comment #17)

I'll take it and have a look.

Probably also with an assertion rather than a comment...
Comment 19 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-12-11 02:03:22 UTC
(In reply to Bjoern A. Zeeb from comment #18)

And just as a semi-related comment given I hit it as well with rtw88 before (but hitting different assertions which also were changed during 14-CURRENT):

https://lists.freebsd.org/archives/freebsd-transport/2022-February/000012.html / 3d09d310d9981dde1d6e51fed6ecf9576480b9f7 / 232d323ef227109acce37f5a0d62492673666ee2 / which was later improved by c414347bc572219a0c849853ade6e664a6092f33 allowing us to grow.

In my initial email I already called bwi/bwn out for the large ic_headroom.
Comment 20 Frank Hilgendorf 2023-12-13 20:09:24 UTC
(In reply to Bjoern A. Zeeb from comment #19)
Possible solution?

I was wondering what the bwn driver needs the huge amount of space in each data mbuf for. Ok, the headroom in the mbufs claimed by the bwn driver is intended for a 'struct bwn_txhdr'. This seems to be driver/hardware related stuff that is not going to be sent out onto the network.

Looking into the bwn driver code, I found no piece of code using this space in the mbuf, not writing nor reading. Maybe the reserved space was not used at all?
To check it out, in if_bwn.c I set ic->ic_headroom to 0 and... now the bwn driver works without complaints.

I cannot check this with other broadcom chips, but unless there is something I overlooked it might solve the problem.
Comment 21 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-12-13 22:02:15 UTC
(In reply to Frank Hilgendorf from comment #20)

The net80211 code needs adjusting anyway as even if bwn now works other drivers exist.
In the non-native LinuxKPI based drivers I work around the problem by copying data which is annoying too.  So even if the code now works the next driver will come or avail mbuf size may shrink again and make it go beyond assumptions as well.

I have a draft for net80211 but let me look into bwn for a minute.

Hmm it indeed looks like the code DMAs the txhdr separately from the mbuf data later rather than merging the txhdr with the mbuf and then dmaing it all in possibly a single segment.  Seems it's been doing this since inception on FreeBSD and ic_headroom indeed was never used but only set.  Good spotting!

Can I take your ic_headroom = 0 as a patch you provided and commit it with you as Author and then I can do net80211 bits in peace and quite as a separate thing?
Comment 22 Bjoern A. Zeeb freebsd_committer freebsd_triage 2023-12-13 22:05:46 UTC
For the records:
bwi, malo, and mwl all do seem to use the ic_headroom and M_PREPEND().
And mt76, rtw88 and rtw89 would love to.
Comment 23 Frank Hilgendorf 2023-12-13 23:08:41 UTC
(In reply to Bjoern A. Zeeb from comment #21)
Yes, please proceed.
Comment 24 Frank Hilgendorf 2023-12-13 23:17:24 UTC
(In reply to Bjoern A. Zeeb from comment #21)
It is just a patch for Freebsd, but a solution for me :-) .
Comment 25 commit-hook freebsd_committer freebsd_triage 2023-12-13 23:55:08 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=59dba901f227420647e4856f03cb782a3375c220

commit 59dba901f227420647e4856f03cb782a3375c220
Author:     Frank Hilgendorf <frank.hilgendorf@posteo.de>
AuthorDate: 2023-12-13 23:48:08 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2023-12-13 23:54:05 +0000

    bwn: remove unused ic_headroom

    Unlike bwi(4), bwn(4) does not rely on ic_headroom (despite having it
    set) but splits the bwn_txhdr (first) segment into its own transaction.
    Remove ic_headroom to avoid net80211 troubles with not enough space in
    the mbuf around ieee80211_mbuf_adjust().

    PR:             275616
    MFC after:      3 days

 sys/dev/bwn/if_bwn.c | 2 --
 1 file changed, 2 deletions(-)
Comment 26 commit-hook freebsd_committer freebsd_triage 2024-02-18 21:12:28 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4e219655d0f7e1134c040468c0e7d91c671e57fb

commit 4e219655d0f7e1134c040468c0e7d91c671e57fb
Author:     Frank Hilgendorf <frank.hilgendorf@posteo.de>
AuthorDate: 2023-12-13 23:48:08 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-18 18:31:14 +0000

    bwn: remove unused ic_headroom

    Unlike bwi(4), bwn(4) does not rely on ic_headroom (despite having it
    set) but splits the bwn_txhdr (first) segment into its own transaction.
    Remove ic_headroom to avoid net80211 troubles with not enough space in
    the mbuf around ieee80211_mbuf_adjust().

    PR:             275616

    (cherry picked from commit 59dba901f227420647e4856f03cb782a3375c220)

 sys/dev/bwn/if_bwn.c | 2 --
 1 file changed, 2 deletions(-)
Comment 27 commit-hook freebsd_committer freebsd_triage 2024-02-19 08:09:28 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0f032804b6ec70c8f8ee4fe1ed83a3356946b5e8

commit 0f032804b6ec70c8f8ee4fe1ed83a3356946b5e8
Author:     Frank Hilgendorf <frank.hilgendorf@posteo.de>
AuthorDate: 2023-12-13 23:48:08 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 08:01:59 +0000

    bwn: remove unused ic_headroom

    Unlike bwi(4), bwn(4) does not rely on ic_headroom (despite having it
    set) but splits the bwn_txhdr (first) segment into its own transaction.
    Remove ic_headroom to avoid net80211 troubles with not enough space in
    the mbuf around ieee80211_mbuf_adjust().

    PR:             275616

    (cherry picked from commit 59dba901f227420647e4856f03cb782a3375c220)

 sys/dev/bwn/if_bwn.c | 2 --
 1 file changed, 2 deletions(-)
Comment 28 commit-hook freebsd_committer freebsd_triage 2024-02-19 16:11:02 UTC
A commit in branch releng/13.3 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=dc50f981e17c39a486616315d142785a20f87612

commit dc50f981e17c39a486616315d142785a20f87612
Author:     Frank Hilgendorf <frank.hilgendorf@posteo.de>
AuthorDate: 2023-12-13 23:48:08 +0000
Commit:     Bjoern A. Zeeb <bz@FreeBSD.org>
CommitDate: 2024-02-19 16:05:54 +0000

    bwn: remove unused ic_headroom

    Unlike bwi(4), bwn(4) does not rely on ic_headroom (despite having it
    set) but splits the bwn_txhdr (first) segment into its own transaction.
    Remove ic_headroom to avoid net80211 troubles with not enough space in
    the mbuf around ieee80211_mbuf_adjust().

    PR:             275616
    Approved by:    re (cperciva)

    (cherry picked from commit 59dba901f227420647e4856f03cb782a3375c220)
    (cherry picked from commit 0f032804b6ec70c8f8ee4fe1ed83a3356946b5e8)

 sys/dev/bwn/if_bwn.c | 2 --
 1 file changed, 2 deletions(-)