To try 13.0-RC3 on my box with a Mellanox ConnectX-2 card, I checked out the 13.0 branch and created a KERNCONF file from GENERIC, adding and removing a few lines based on the FreeBSD InfiniBand wiki (see the KERNCONF diff below). This kernel panics at boot time. On the other hand, I have confirmed that both of the following work correctly:

- the 13.0-RC3 GENERIC kernel, with the mlx4 drivers loaded as modules
- a 12.2 custom kernel, with the mlx4 drivers compiled in (not as modules)

I don't know why compiling the mlx4 drivers into the 13.0 kernel causes a panic.

---<<BOOT>>---
Copyright (c) 1992-2021 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
 The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 13.0-RC3 #1: Mon Mar 22 18:37:58 JST 2021
 matsuo@build:/usr/obj/usr/src/amd64.amd64/sys/MICROSERVER-PR amd64
FreeBSD clang version 11.0.1 (git@github.com:llvm/llvm-project.git llvmorg-11.0.1-0-g43ff75f2c3fe)
VT(vga): resolution 640x480
CPU: AMD Turion(tm) II Neo N54L Dual-Core Processor (2196.39-MHz K8-class CPU)
  Origin="AuthenticAMD" Id=0x100f63 Family=0x10 Model=0x6 Stepping=3
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x802009<SSE3,MON,CX16,POPCNT>
  AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x837ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT,NodeId>
  SVM: NP,NRIP,NAsids=64
  TSC: P-state invariant
real memory = 8589934592 (8192 MB)
avail memory = 8249397248 (7867 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table: <HP ProLiant>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
random: unblocking device.
Firmware Warning (ACPI): 32/64X length mismatch in FADT/Gpe0Block: 64/32 (20201113/tbfadt-748)
ioapic0 <Version 2.1> irqs 0-23
Launching APs: 1
Timecounter "TSC-low" frequency 1098192980 Hz quality 800
KTLS: Initialized 2 threads
random: entropy device external interface
[ath_hal] loaded
WARNING: Device "kbd" is Giant locked and may be deleted before FreeBSD 14.0.
kbd1 at kbdmux0
000.000052 [4350] netmap_init netmap: loaded module
nexus0
vtvga0: <VT VGA driver>
cryptosoft0: <software crypto>
aesni0: No AES or SHA support.
acpi0: <HP ProLiant>
acpi0: Power Button (fixed)
acpi0: _OSC failed: AE_BUFFER_OVERFLOW
cpu0: <ACPI CPU> on acpi0
attimer0: <AT timer> port 0x40-0x43 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
atrtc0: <AT realtime clock> port 0x70-0x71 irq 8 on acpi0
atrtc0: registered as a time-of-day clock, resolution 1.000000s
Event timer "RTC" frequency 32768 Hz quality 0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 550
Event timer "HPET1" frequency 14318180 Hz quality 450
Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
acpi_timer0: <32-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
apei0: <ACPI Platform Error Interface> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
vgapci0: <VGA-compatible display> port 0xe000-0xe0ff mem 0xfa000000-0xfbffffff,0xfe7f0000-0xfe7fffff,0xfe600000-0xfe6fffff irq 18 at device 5.0 on pci1
vgapci0: Boot video device
pcib2: <ACPI PCI-PCI bridge> irq 18 at device 2.0 on pci0
pci2: <ACPI PCI bus> on pcib2
pci2: <serial bus> at device 0.0 (no driver attached)
pcib3: <ACPI PCI-PCI bridge> irq 18 at device 6.0 on pci0
pci3: <ACPI PCI bus> on pcib3
bge0: <HP NC107i PCIe Gigabit Server Adapter, ASIC rev. 0x5784100> mem 0xfe9f0000-0xfe9fffff irq 18 at device 0.0 on pci3
bge0: CHIP ID 0x05784100; ASIC REV 0x5784; CHIP REV 0x57841; PCI-E
miibus0: <MII bus> on bge0
brgphy0: <BCM5784 10/100/1000baseT PHY> PHY 1 on miibus0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
bge0: Using defaults for TSO: 65518/35/2048
bge0: Ethernet address: fc:15:b4:90:34:f3
ahci0: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port 0xd000-0xd007,0xc000-0xc003,0xb000-0xb007,0xa000-0xa003,0x9000-0x900f mem 0xfe5ffc00-0xfe5fffff irq 19 at device 17.0 on pci0
ahci0: AHCI v1.20 with 6 3Gbps ports, Port Multiplier supported
ahci0: quirks=0x22000<ATI_PMP_BUG,1MSI>
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich4: <AHCI channel> at channel 4 on ahci0
ahcich5: <AHCI channel> at channel 5 on ahci0
ohci0: <AMD SB7x0/SB8x0/SB9x0 USB controller> mem 0xfe5fe000-0xfe5fefff irq 18 at device 18.0 on pci0
usbus0 on ohci0
usbus0: 12Mbps Full Speed USB v1.0
ehci0: <AMD SB7x0/SB8x0/SB9x0 USB 2.0 controller> mem 0xfe5ff800-0xfe5ff8ff irq 17 at device 18.2 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci0
usbus1: 480Mbps High Speed USB v2.0
ohci1: <AMD SB7x0/SB8x0/SB9x0 USB controller> mem 0xfe5fd000-0xfe5fdfff irq 18 at device 19.0 on pci0
usbus2 on ohci1
usbus2: 12Mbps Full Speed USB v1.0
ehci1: <AMD SB7x0/SB8x0/SB9x0 USB 2.0 controller> mem 0xfe5ff400-0xfe5ff4ff irq 17 at device 19.2 on pci0
usbus3: EHCI version 1.0
usbus3 on ehci1
usbus3: 480Mbps High Speed USB v2.0
isab0: <PCI-ISA bridge> at device 20.3 on pci0
isa0: <ISA bus> on isab0
pcib4: <ACPI PCI-PCI bridge> at device 20.4 on pci0
pci4: <ACPI PCI bus> on pcib4
ohci2: <AMD SB7x0/SB8x0/SB9x0 USB controller> mem 0xfe5fc000-0xfe5fcfff irq 18 at device 22.0 on pci0
usbus4 on ohci2
usbus4: 12Mbps Full Speed USB v1.0
ehci2: <AMD SB7x0/SB8x0/SB9x0 USB 2.0 controller> mem 0xfe5ff000-0xfe5ff0ff irq 17 at device 22.2 on pci0
usbus5: EHCI version 1.0
usbus5 on ehci2
usbus5: 480Mbps High Speed USB v2.0
acpi_button0: <Power Button> on acpi0
hwpstate0: <Cool`n'Quiet 2.0> on cpu0
Timecounters tick every 1.000 msec
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
ugen2.1: <ATI OHCI root HUB> at usbus2
ugen4.1: <ATI OHCI root HUB> at usbus4
uhub0 on usbus2
uhub1 on usbus4
uhub0: <ATI OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus2
uhub1: <ATI OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus4
ugen1.1: <ATI EHCI root HUB> at usbus1
ugen0.1: <ATI OHCI root HUB> at usbus0
uhub2 on usbus1
uhub3 on usbus0
uhub2: <ATI EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
uhub3: <ATI OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
mlx4_core0: <mlx4_core> mem 0xfe800000-0xfe8fffff,0xfd800000-0xfdffffff irq 18 at device 0.0 on pci2
mlx4_core: Mellanox ConnectX core driver v3.6.0 (December 2020)
mlx4_core: Initializing mlx4_core
ugen5.1: <ATI EHCI root HUB> at usbus5
uhub4 on usbus5
uhub4: <ATI EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus5
ugen3.1: <ATI EHCI root HUB> at usbus3
uhub5 on usbus3
uhub5: <ATI EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus3
bge0: link state changed to UP
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WD30EZRX-00D8PB0 80.00A80> ACS-2 ATA SATA 3.x device
ada0: Serial Number WD-WMC4N0D37EH5
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 2861588MB (5860533168 512 byte sectors)
ada0: quirks=0x1<4K>
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD30EZRX-00D8PB0 80.00A80> ACS-2 ATA SATA 3.x device
ada1: Serial Number WD-WMC4N0D7W637
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 2861588MB (5860533168 512 byte sectors)
ada1: quirks=0x1<4K>
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <WDC WD30EZRX-00D8PB0 80.00A80> ACS-2 ATA SATA 3.x device
ada2: Serial Number WD-WMC4N0D6EVLR
ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 2861588MB (5860533168 512 byte sectors)
ada2: quirks=0x1<4K>
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <WDC WD30EZRX-00D8PB0 80.00A80> ACS-2 ATA SATA 3.x device
ada3: Serial Number WD-WMC4N0DA7JCC
ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 2861588MB (5860533168 512 byte sectors)
ada3: quirks=0x1<4K>
ada4 at ahcich5 bus 0 scbus5 target 0 lun 0
ada4: <WDC WD5000AAJS-55A8B2 01.03B01> ATA8-ACS SATA 2.x device
ada4: Serial Number WD-WCASY8895731
ada4: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada4: Command Queueing enabled
ada4: 476940MB (976773168 512 byte sectors)
uhub1: 4 ports with 4 removable, self powered
uhub3: 5 ports with 5 removable, self powered
uhub0: 5 ports with 5 removable, self powered
mlx4_core0: Old device ETS support detected
mlx4_core0: Consider upgrading device FW.
mlx4_core0: Unable to determine PCI device chain minimum BW
<mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v3.6.0 (December 2020)
<mlx4_ib> mlx4_ib_add: counter index 0 for port 1 allocated 0
<mlx4_ib> mlx4_ib_add: counter index 1 for port 2 allocated 0
ib0: link state changed to DOWN
ib0: post srq failed for buf 0 (-22)
ib0: ipoib_cm_post_receive_srq failed for buf 0


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x1f4bd438
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80ea7f03
stack pointer = 0x28:0xffffffff829ba990
frame pointer = 0x28:0xffffffff829ba9b0
code segment = base rx0, limit 0xfffff, type 0x1b
 = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (swapper)
trap number = 12
panic: page fault
cpuid = 0
time = 5
KDB: stack backtrace:
#0 0xffffffff80c60b55 at kdb_backtrace+0x65
#1 0xffffffff80c13771 at vpanic+0x181
#2 0xffffffff80c135e3 at panic+0x43
#3 0xffffffff81135187 at trap_fatal+0x387
#4 0xffffffff811351df at trap_pfault+0x4f
#5 0xffffffff8113483d at trap+0x27d
#6 0xffffffff8110c028 at calltrap+0x8
#7 0xffffffff80ea7794 at ipoib_cm_dev_cleanup+0x94
#8 0xffffffff80ea6976 at ipoib_cm_dev_init+0x536
#9 0xffffffff80eaf242 at ipoib_transport_dev_init+0xf2
#10 0xffffffff80ea98d1 at ipoib_ib_dev_init+0x31
#11 0xffffffff80eaaf07 at ipoib_dev_init+0x97
#12 0xffffffff80eac812 at ipoib_add_one+0x312
#13 0xffffffff80e71848 at ib_register_device+0x768
#14 0xffffffff80ee2013 at mlx4_ib_add+0x1033
#15 0xffffffff80f00d40 at mlx4_add_device+0x40
#16 0xffffffff80f00c68 at mlx4_register_interface+0xb8

----- KERNCONF diff ----------
--- GENERIC 2021-03-21 03:48:03.373297000 +0900
+++ MICROSERVER-PR 2021-03-22 09:22:06.646143000 +0900
@@ -19,7 +19,7 @@
 # $FreeBSD$
 
 cpu HAMMER
-ident GENERIC
+ident MICROSERVER-PR
 
 makeoptions DEBUG=-g # Build kernel with gdb(1) debug symbols
 makeoptions WITH_CTF=1 # Run ctfconvert(1) for DTrace support
@@ -249,9 +249,23 @@
 
 # Nvidia/Mellanox Connect-X 4 and later, Ethernet only
 # mlx5ib requires ibcore infra and is not included by default
-device mlx5 # Base driver
-device mlxfw # Firmware update
-device mlx5en # Ethernet driver
+#device mlx5 # Base driver
+#device mlxfw # Firmware update
+#device mlx5en # Ethernet driver
+
+
+# Mellanox
+options OFED
+options SDP
+options IPOIB_CM
+
+device ipoib
+device mlx4
+device mlx4ib
+device mlx4en
+device mthca
+
+
 
 # PCI Ethernet NICs that use the common MII bus controller code.
 # NOTE: Be sure to keep the 'device miibus' line in order to use these NICs!
Try removing:

options IPOIB_CM

from your kernel configuration file.

--HPS
Created attachment 223501
Patch to try

Please also try this patch with the IPOIB_CM option enabled. Thank you!
With options IPOIB_CM simply removed, the system boots completely. However, I need my servers to run in connected mode, so the second approach (your patch) suits me much better. I will try your patch; could you wait a moment? Thank you.
I tried your patch and can confirm that it works well. Please close this bug report.

By the way, I have the impression that the IPOIB_CM (connected mode) link speed on 13.0-RC3 is slower than on 12-RELEASE. However, I have no evidence, only my memory from before the upgrade. If I find something worth reporting, I will file a separate PR. Thank you.
Sure, let me know how the testing goes! --HPS
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4e38478c595a9e6225b525890d7ee269a203c200

commit 4e38478c595a9e6225b525890d7ee269a203c200
Author:     Hans Petter Selasky <hselasky@FreeBSD.org>
AuthorDate: 2021-03-25 15:55:02 +0000
Commit:     Hans Petter Selasky <hselasky@FreeBSD.org>
CommitDate: 2021-03-25 15:55:37 +0000

    ipoib: Fix incorrectly computed IPOIB_CM_RX_SG value.

    The computed IPOIB_CM_RX_SG is too small. It doesn't account for fallback
    to mbuf clusters when jumbo frames are not available and it also doesn't
    account for the packet header and trailer mbuf. This causes a memory
    overwrite situation when IPOIB_CM is configured.

    While at it add a kernel assert to ensure the mapping array is not
    overwritten.

    PR: 254474
    MFC after: 1 week
    Sponsored by: Mellanox Technologies // NVIDIA Networking

 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib.h    | 7 +++----
 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +-
 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_ib.c | 7 ++++---
 3 files changed, 8 insertions(+), 8 deletions(-)
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=eb008a6793dbbda39ad0a2fb3136eb542d7424e2

commit eb008a6793dbbda39ad0a2fb3136eb542d7424e2
Author:     Hans Petter Selasky <hselasky@FreeBSD.org>
AuthorDate: 2021-03-25 15:55:02 +0000
Commit:     Hans Petter Selasky <hselasky@FreeBSD.org>
CommitDate: 2021-04-01 09:19:42 +0000

    MFC 4e38478c595a: ipoib: Fix incorrectly computed IPOIB_CM_RX_SG value.

    The computed IPOIB_CM_RX_SG is too small. It doesn't account for fallback
    to mbuf clusters when jumbo frames are not available and it also doesn't
    account for the packet header and trailer mbuf. This causes a memory
    overwrite situation when IPOIB_CM is configured.

    While at it add a kernel assert to ensure the mapping array is not
    overwritten.

    PR: 254474
    Sponsored by: Mellanox Technologies // NVIDIA Networking

    (cherry picked from commit 4e38478c595a9e6225b525890d7ee269a203c200)

 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib.h    | 7 +++----
 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +-
 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_ib.c | 7 ++++---
 3 files changed, 8 insertions(+), 8 deletions(-)
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4316db31c2c861c60ba39af35f5a781ce7db95da

commit 4316db31c2c861c60ba39af35f5a781ce7db95da
Author:     Hans Petter Selasky <hselasky@FreeBSD.org>
AuthorDate: 2021-03-25 15:55:02 +0000
Commit:     Hans Petter Selasky <hselasky@FreeBSD.org>
CommitDate: 2021-04-01 09:20:44 +0000

    MFC 4e38478c595a: ipoib: Fix incorrectly computed IPOIB_CM_RX_SG value.

    The computed IPOIB_CM_RX_SG is too small. It doesn't account for fallback
    to mbuf clusters when jumbo frames are not available and it also doesn't
    account for the packet header and trailer mbuf. This causes a memory
    overwrite situation when IPOIB_CM is configured.

    While at it add a kernel assert to ensure the mapping array is not
    overwritten.

    PR: 254474
    Sponsored by: Mellanox Technologies // NVIDIA Networking

    (cherry picked from commit 4e38478c595a9e6225b525890d7ee269a203c200)

 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib.h    | 7 +++----
 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +-
 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_ib.c | 7 ++++---
 3 files changed, 8 insertions(+), 8 deletions(-)
A commit in branch stable/11 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9d03dfae43d2a1a91c846db38223bc33516d1130

commit 9d03dfae43d2a1a91c846db38223bc33516d1130
Author:     Hans Petter Selasky <hselasky@FreeBSD.org>
AuthorDate: 2021-03-25 15:55:02 +0000
Commit:     Hans Petter Selasky <hselasky@FreeBSD.org>
CommitDate: 2021-04-01 09:25:55 +0000

    MFC 4e38478c595a: ipoib: Fix incorrectly computed IPOIB_CM_RX_SG value.

    The computed IPOIB_CM_RX_SG is too small. It doesn't account for fallback
    to mbuf clusters when jumbo frames are not available and it also doesn't
    account for the packet header and trailer mbuf. This causes a memory
    overwrite situation when IPOIB_CM is configured.

    While at it add a kernel assert to ensure the mapping array is not
    overwritten.

    PR: 254474
    Sponsored by: Mellanox Technologies // NVIDIA Networking

    (cherry picked from commit 4e38478c595a9e6225b525890d7ee269a203c200)

 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib.h    | 7 +++----
 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +-
 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_ib.c | 7 ++++---
 3 files changed, 8 insertions(+), 8 deletions(-)
A commit in branch stable/10 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=ca32a11644b182ae9176631cdff1f4dcd7e49b32

commit ca32a11644b182ae9176631cdff1f4dcd7e49b32
Author:     Hans Petter Selasky <hselasky@FreeBSD.org>
AuthorDate: 2021-03-25 15:55:02 +0000
Commit:     Hans Petter Selasky <hselasky@FreeBSD.org>
CommitDate: 2021-04-01 09:34:24 +0000

    MFC 4e38478c595a: ipoib: Fix incorrectly computed IPOIB_CM_RX_SG value.

    The computed IPOIB_CM_RX_SG is too small. It doesn't account for fallback
    to mbuf clusters when jumbo frames are not available and it also doesn't
    account for the packet header and trailer mbuf. This causes a memory
    overwrite situation when IPOIB_CM is configured.

    While at it add a kernel assert to ensure the mapping array is not
    overwritten.

    PR: 254474
    Sponsored by: Mellanox Technologies // NVIDIA Networking

    (cherry picked from commit 4e38478c595a9e6225b525890d7ee269a203c200)

 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib.h    | 7 +++----
 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +-
 sys/ofed/drivers/infiniband/ulp/ipoib/ipoib_ib.c | 7 ++++---
 3 files changed, 8 insertions(+), 8 deletions(-)