Bug 264191 - mbuf: debugnet panics with mbuf cache with multiple instances of the same driver
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Mark Johnston
URL:
Keywords: crash, needs-qa
Depends on:
Blocks:
 
Reported: 2022-05-23 20:40 UTC by Bryan Drewery
Modified: 2022-07-18 20:36 UTC

See Also:
koobs: maintainer-feedback? (markj)
koobs: maintainer-feedback? (hselasky)


Description Bryan Drewery freebsd_committer freebsd_triage 2022-05-23 20:40:01 UTC
1. debugnet_mbuf_reinit() is racy.

With netdump, the mbuf cache was populated only when a device was *configured*. Now the cache is populated whenever a device that *supports* debugnet comes up. So if a driver has multiple devices, each device coming up calls debugnet_mbuf_reinit(), and those calls race across multiple threads while touching the mbufqs. This is easily fixed, but it leaves more issues.

Populating the cache at driver link-up makes sense, because the device may not be configured until after a panic, in ddb with .netdump.
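
As a rough illustration of the "easily fixed" part of point 1, here is a user-space model of serializing the drain-and-refill step under a single lock. It is only a sketch: model_queue, model_mbufq, model_reinit_lock and model_reinit() are hypothetical names invented for this example, and a pthread mutex stands in for whatever kernel lock an actual fix would use.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
#define MODEL_QMAX 64

struct model_queue {
	int items[MODEL_QMAX];
	size_t len;
};

static struct model_queue model_mbufq;
static pthread_mutex_t model_reinit_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Drain and refill the cache so that it holds `want` entries (capped at
 * MODEL_QMAX).  The whole drain/refill sequence runs under one lock, so
 * two devices coming up at the same time cannot interleave their updates
 * to the queue.
 */
static size_t
model_reinit(size_t want)
{
	size_t i, len;

	if (want > MODEL_QMAX)
		want = MODEL_QMAX;

	pthread_mutex_lock(&model_reinit_lock);
	model_mbufq.len = 0;			/* drain */
	for (i = 0; i < want; i++)		/* refill */
		model_mbufq.items[model_mbufq.len++] = (int)i;
	assert(model_mbufq.len <= MODEL_QMAX);
	len = model_mbufq.len;
	pthread_mutex_unlock(&model_reinit_lock);

	return (len);
}
```

In this model, concurrent callers of model_reinit() queue up on the lock instead of touching the queue at the same time. It says nothing about the second problem below, which remains even with the race closed.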

2. dn_buf_import() may overflow an mbuf from the queue with trash_init() on <without INVARIANTS>.

If one device uses jumbo frames (MTU 9000) and the other uses the normal MTU of 1500, the hwm/dn_clsize can become MJUM9BYTES (9216).

[This next part may only be a problem for something like mlx4 which has some cached mbufs of its own. This can be seen in mlx4_en_alloc_buf() where it appears to always keep 1 extra mbuf around for each ring. It appears it may use that mbuf at panic time if mlx4_en_alloc_mbuf() fails. The issue I ran into downstream was a very different allocation scenario but the FreeBSD version appears to have a similar issue.]

If the device used at dump time has an MTU of 1500, it can return an mbuf to the dn_clustq that is smaller than expected for that zone (compared with the high water mark of 9216). When that mbuf is removed in dn_buf_import(), trash_init(9216) is run over it rather than the expected MCLBYTES size.
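
The arithmetic behind point 2 can be sketched in user space as follows. This is a model under stated assumptions, not the kernel code: every model_* name is hypothetical, the MTU-to-cluster-size mapping is simplified to the two sizes mentioned above, and the high water mark is assumed to be a simple maximum over the devices seen so far.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes of the two cluster types discussed in the report. */
#define MODEL_MCLBYTES   2048	/* standard cluster */
#define MODEL_MJUM9BYTES 9216	/* 9k jumbo cluster */

/* Assumed, simplified mapping from MTU to cluster size. */
static size_t
model_clsize_for_mtu(size_t mtu)
{
	return (mtu <= 1500 ? MODEL_MCLBYTES : MODEL_MJUM9BYTES);
}

/* Assumed behaviour: the high water mark only ever grows. */
static void
model_hwm_update(size_t *hwm, size_t clsize)
{
	assert(hwm != NULL);
	if (clsize > *hwm)
		*hwm = clsize;
}

/*
 * Number of bytes that would be written past the end of a buffer of
 * `actual` bytes if it were scrubbed as though it were `trash_len`
 * bytes long.
 */
static size_t
model_trash_overrun(size_t actual, size_t trash_len)
{
	return (trash_len > actual ? trash_len - actual : 0);
}
```

With one device at MTU 9000 and one at MTU 1500, the modeled high water mark ends up at 9216, and scrubbing a 2048-byte cluster at that size would touch 7168 bytes beyond the buffer.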
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2022-05-24 00:23:43 UTC
^Triage: Unsure of specific relevance, but see also src 5a7de2b42caf via 258923 given mlx mention here and mlx, panic, debugnet_mbuf_reinit() and debugnet activation there.
Comment 2 Kubilay Kocak freebsd_committer freebsd_triage 2022-05-24 00:24:12 UTC
See also base 5a7de2b42caf via bug 258923 apologies.
Comment 3 Bryan Drewery freebsd_committer freebsd_triage 2022-05-24 16:38:22 UTC
(In reply to Kubilay Kocak from comment #2)

None of this is a recent regression. These are design flaws.