Created attachment 253814 [details]
full backtrace

This kernel panic happens when initializing virtual functions (VFs) on mlx5 hardware, as of commit `e23731db48ef9c6568d4768b1f87d48514339faa` (22nd July) and later. The panic occurs every time iovctl is run to initialize virtual functions on the ConnectX-6 Lx. The issue does not appear prior to this commit.

I’ve attached a backtrace (bt.txt), as well as output from my kgdb digging (gdb.txt). It seems like something is wrong with the lock at sys/dev/mlx5/mlx5_core/mlx5_fs_core.c:1056.

Info about my system: AMD64, SR-IOV and ARI enabled, 10-bit tag enabled (the issue also happens without it), ConnectX-6 Lx dual-port NIC. I’ve also attached relevant pciconf output for the ConnectX card (mlx5_core0_pciconf.txt).

Reproduction steps:
1. Configure the ConnectX-6 Lx with SR-IOV enabled and a handful of supported VFs.
2. Use a minimal iovctl config file (attached: mlx5_core0.minimal.conf).
3. (kernel panic)
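The attached mlx5_core0.minimal.conf is not reproduced in this report. For orientation only, a minimal file in iovctl.conf(5) syntax might look like the sketch below. The device name mlx5_core0 is inferred from the attachment file name, and the num_vfs value is purely an assumption; neither is taken from the actual attachment.

```
# Hypothetical minimal iovctl configuration (not the attached file).
PF {
	device : "mlx5_core0";	# assumed PF device name
	num_vfs : 4;		# assumed VF count
}
```

A file like this would normally be applied with `iovctl -C -f <file>`, which is the step that leads to the panic according to the report.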
Created attachment 253815 [details] kgdb digging
Created attachment 253816 [details] relevant pciconf output
Created attachment 253817 [details] minimal iovctl config
What is the panic, actually? Show exact kernel output from the panic.
(In reply to Konstantin Belousov from comment #4)
Oops. Seems I missed the most important part, indeed! Here goes:

panic: Assertion (i2) > 0 && (i2) <= w_max_used_index && (i2) < witness_count failed at /usr/home/mhka/git/freebsd/freebsd-src-bissect/sys/kern/subr_witness.c:2100
cpuid = 10
time = 1727172484
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe018d71e670
vpanic() at vpanic+0x13f/frame 0xfffffe018d71e7a0
panic() at panic+0x43/frame 0xfffffe018d71e800
_isitmyx() at _isitmyx+0x142/frame 0xfffffe018d71e820
witness_checkorder() at witness_checkorder+0x3c3/frame 0xfffffe018d71e9e0
_sx_xlock() at _sx_xlock+0x5d/frame 0xfffffe018d71ea20
connect_fwd_rules() at connect_fwd_rules+0x8f/frame 0xfffffe018d71ea80
__mlx5_create_flow_table() at __mlx5_create_flow_table+0x4b1/frame 0xfffffe018d71eaf0
mlx5_eswitch_enable_sriov() at mlx5_eswitch_enable_sriov+0x173/frame 0xfffffe018d71eb60
mlx5_iov_init() at mlx5_iov_init+0x39/frame 0xfffffe018d71eb80
pci_iov_ioctl() at pci_iov_ioctl+0x426/frame 0xfffffe018d71ec00
devfs_ioctl() at devfs_ioctl+0xd1/frame 0xfffffe018d71ec50
vn_ioctl() at vn_ioctl+0xbc/frame 0xfffffe018d71ecc0
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe018d71ece0
kern_ioctl() at kern_ioctl+0x286/frame 0xfffffe018d71ed40
sys_ioctl() at sys_ioctl+0x12d/frame 0xfffffe018d71ee00
amd64_syscall() at amd64_syscall+0x865/frame 0xfffffe018d71ef30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe018d71ef30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x185c70a1966a, rsp = 0x185c6be6a8f8, rbp = 0x185c6be6a950 ---
KDB: enter: panic
Uptime: 2m16s
Dumping 2109 out of 32586 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
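To make the failed assertion easier to read, here is a small userland sketch of the bounds check quoted in the panic message. It only restates the condition from the message text; the function names are hypothetical, the limits passed in by a caller are made-up values, and this is not the kernel's own code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userland restatement of the condition in the panic message:
 *   (i2) > 0 && (i2) <= w_max_used_index && (i2) < witness_count
 * The parameter names mirror the identifiers in that message.
 * Function names are hypothetical, chosen for this sketch only.
 */
bool
witness_index_ok(int i, int w_max_used_index, int witness_count)
{
	return (i > 0 && i <= w_max_used_index && i < witness_count);
}

/* Aborts when the index is out of range, loosely like the kernel assertion. */
void
witness_index_assert(int i, int w_max_used_index, int witness_count)
{
	assert(witness_index_ok(i, w_max_used_index, witness_count));
}
```

With assumed limits of 200 used indices and 1536 total, an index of 5 passes, while 0, a negative value, or 201 fails the check. The trace shows the check failing for the second index handed to _isitmyx() while witness_checkorder() was processing the sx lock taken in connect_fwd_rules().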
Created attachment 253848 [details]
Screenshot of panic

I've re-triggered the panic and taken a screenshot via IPMIView. This screenshot includes some information about E-Switch enable, which was not available from the core dump.

I also saved a copy of /var/run/dmesg.boot (booted in verbose mode) prior to triggering the panic; let me know if that would be useful to you.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=29a9d7c6ce78825251ea127ade445236607b0b81

commit 29a9d7c6ce78825251ea127ade445236607b0b81
Author:     Ariel Ehrenberg <aehrenberg@nvidia.com>
AuthorDate: 2024-11-12 09:16:26 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2024-12-15 22:27:31 +0000

    mlx5_core: fix panic on sriov enablement

    Align the code of fdb steering with flow steering core and
    add missing parts in namespace initialization and in prio logic

    PR:             281714
    Sponsored by:   NVidia networking

 sys/dev/mlx5/mlx5_core/mlx5_eswitch.c | 18 +++++++++++-------
 sys/dev/mlx5/mlx5_core/mlx5_fs_core.c |  8 ++++++++
 sys/dev/mlx5/mlx5_ifc.h               |  8 --------
 3 files changed, 19 insertions(+), 15 deletions(-)