Bug 281714 - mlx5 Kernel panic after implementation of mlx5en support for ipsec offload on vlan if
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 15.0-CURRENT
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Konstantin Belousov
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2024-09-25 15:11 UTC by Marie Helene Kvello-Aune
Modified: 2024-12-15 22:54 UTC

Attachments
full backtrace (12.40 KB, text/plain)
2024-09-25 15:11 UTC, Marie Helene Kvello-Aune
kgdb digging (11.95 KB, text/plain)
2024-09-25 15:11 UTC, Marie Helene Kvello-Aune
relevant pciconf output (2.18 KB, text/plain)
2024-09-25 15:12 UTC, Marie Helene Kvello-Aune
minimal iovctl config (85 bytes, text/plain)
2024-09-25 15:13 UTC, Marie Helene Kvello-Aune
Screenshot of panic (197.01 KB, image/png)
2024-09-27 11:43 UTC, Marie Helene Kvello-Aune

Description Marie Helene Kvello-Aune 2024-09-25 15:11:25 UTC
Created attachment 253814
full backtrace

This kernel panic happens when initializing virtual functions (VFs) on mlx5 hardware, starting with commit `e23731db48ef9c6568d4768b1f87d48514339faa` (22 July). It occurs every time iovctl is run to initialize VFs on the ConnectX-6 Lx.
 
The issue does not appear prior to this commit, but does appear in this commit and in later commits.
 
I’ve attached a backtrace (bt.txt), as well as the output from my kgdb digging (gdb.txt).
 
It seems like there’s something funky with the lock in sys/dev/mlx5/mlx5_core/mlx5_fs_core.c:1056.
 
Info about my system:
AMD64, SR-IOV and ARI enabled, 10-bit Tag enabled (the issue also happens without it), ConnectX-6 Lx dual-port NIC. I’ve also attached the relevant pciconf output for the ConnectX card (mlx5_core0_pciconf.txt).
 
Reproduction steps:
1. Configure the ConnectX-6 Lx with SR-IOV enabled and a handful of supported VFs.
2. Run iovctl with a minimal config file (attached: mlx5_core0.minimal.conf).
3. The kernel panics.
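
For readers without the attachment: a minimal iovctl configuration generally has the shape sketched below. This is an illustration only, not the contents of the attached mlx5_core0.minimal.conf; the device name (inferred from the attachment file names) and the VF count are assumptions.

```
# Hypothetical iovctl config; not the attached file.
# "mlx5_core0" and the VF count of 4 are assumed for illustration.
PF {
	device : "mlx5_core0";
	num_vfs : 4;
}
```

With a file like this, VFs would be created with `iovctl -C -f <config file>`, which is the step that triggers the panic on affected kernels.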
Comment 1 Marie Helene Kvello-Aune 2024-09-25 15:11:58 UTC
Created attachment 253815
kgdb digging
Comment 2 Marie Helene Kvello-Aune 2024-09-25 15:12:32 UTC
Created attachment 253816
relevant pciconf output
Comment 3 Marie Helene Kvello-Aune 2024-09-25 15:13:00 UTC
Created attachment 253817
minimal iovctl config
Comment 4 Konstantin Belousov 2024-09-26 20:02:30 UTC
What is the panic, actually?

Show exact kernel output from the panic.
Comment 5 Marie Helene Kvello-Aune 2024-09-27 11:16:32 UTC
(In reply to Konstantin Belousov from comment #4)
Oops. Seems I missed the most important part, indeed!
Here goes:

panic: Assertion (i2) > 0 && (i2) <= w_max_used_index && (i2) < witness_count failed at /usr/home/mhka/git/freebsd/freebsd-src-bissect/sys/kern/subr_witness.c:2100
cpuid = 10
time = 1727172484
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe018d71e670
vpanic() at vpanic+0x13f/frame 0xfffffe018d71e7a0
panic() at panic+0x43/frame 0xfffffe018d71e800
_isitmyx() at _isitmyx+0x142/frame 0xfffffe018d71e820
witness_checkorder() at witness_checkorder+0x3c3/frame 0xfffffe018d71e9e0
_sx_xlock() at _sx_xlock+0x5d/frame 0xfffffe018d71ea20
connect_fwd_rules() at connect_fwd_rules+0x8f/frame 0xfffffe018d71ea80
__mlx5_create_flow_table() at __mlx5_create_flow_table+0x4b1/frame 0xfffffe018d71eaf0
mlx5_eswitch_enable_sriov() at mlx5_eswitch_enable_sriov+0x173/frame 0xfffffe018d71eb60
mlx5_iov_init() at mlx5_iov_init+0x39/frame 0xfffffe018d71eb80
pci_iov_ioctl() at pci_iov_ioctl+0x426/frame 0xfffffe018d71ec00
devfs_ioctl() at devfs_ioctl+0xd1/frame 0xfffffe018d71ec50
vn_ioctl() at vn_ioctl+0xbc/frame 0xfffffe018d71ecc0
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe018d71ece0
kern_ioctl() at kern_ioctl+0x286/frame 0xfffffe018d71ed40
sys_ioctl() at sys_ioctl+0x12d/frame 0xfffffe018d71ee00
amd64_syscall() at amd64_syscall+0x865/frame 0xfffffe018d71ef30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe018d71ef30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x185c70a1966a, rsp = 0x185c6be6a8f8, rbp = 0x185c6be6a950 ---
KDB: enter: panic
Uptime: 2m16s
Dumping 2109 out of 32586 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
Comment 6 Marie Helene Kvello-Aune 2024-09-27 11:43:44 UTC
Created attachment 253848
Screenshot of panic

I've re-triggered the panic and taken a screenshot via IPMIView. This screenshot includes some information about E-Switch enable, which was not available from the core dump.

I also saved a copy of /var/run/dmesg.boot (booted in verbose mode) prior to triggering the panic - let me know if that would be useful to you.
Comment 7 commit-hook 2024-12-15 22:48:57 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=29a9d7c6ce78825251ea127ade445236607b0b81

commit 29a9d7c6ce78825251ea127ade445236607b0b81
Author:     Ariel Ehrenberg <aehrenberg@nvidia.com>
AuthorDate: 2024-11-12 09:16:26 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-12-15 22:27:31 +0000

    mlx5_core: fix panic on sriov enablement

    Align the code of fdb steering with flow steering core
    and add missing parts in namespace initialization and
    in prio logic

    PR:     281714
    Sponsored by:   NVidia networking

 sys/dev/mlx5/mlx5_core/mlx5_eswitch.c | 18 +++++++++++-------
 sys/dev/mlx5/mlx5_core/mlx5_fs_core.c |  8 ++++++++
 sys/dev/mlx5/mlx5_ifc.h               |  8 --------
 3 files changed, 19 insertions(+), 15 deletions(-)