Hello,

My ConnectX-2 is not working in 13.1-RELEASE and was working up through 13.0-RELEASE.

I do have all of the required modules loaded:

Id Refs Address                Size Name
 1   98 0xffffffff80200000  1f30590 kernel
(trimmed)
33    2 0xffffffff82aec000    3ef50 mlx4.ko
34    1 0xffffffff82b2b000    13354 mlx4en.ko

It does not show anything in dmesg. I have only these sysctls:

# sysctl -a | grep mlx
device mlx
device mlx5
device mlxfw
device mlx5en
hw.mlx5.calibr.normal: 30
hw.mlx5.calibr.fast: 1
hw.mlx5.calibr.duration: 20
hw.mlx5.relaxed_ordering_write: 0
hw.mlx5.auto_fw_update: 0
hw.mlx5.fast_unload_enabled: 1
hw.mlx5.prof_sel: 2
hw.mlx5.debug_mask: 0
hw.mlx5.sw_reset_timeout: 1200
hw.mlx5.fw_reset_enable: 1
hw.mlx5.fw_dump_enable: 1
compat.linuxkpi.mlx4_inline_thold: 104
compat.linuxkpi.mlx4_pfcrx: 0
compat.linuxkpi.mlx4_pfctx: 0
compat.linuxkpi.mlx4_udp_rss: 1
compat.linuxkpi.mlx4_log_mtts_per_seg: 3
compat.linuxkpi.mlx4_use_prio: 0
compat.linuxkpi.mlx4_log_num_vlan: 0
compat.linuxkpi.mlx4_log_num_mac: 7
compat.linuxkpi.mlx4_enable_4k_uar: 0
compat.linuxkpi.mlx4_enable_64b_cqe_eqe: 1
compat.linuxkpi.mlx4_log_num_mgm_entry_size: 10
compat.linuxkpi.mlx4_msi_x: 1
compat.linuxkpi.mlx4_enable_qos: 0
compat.linuxkpi.mlx4_internal_err_reset: 1

# pciconf -e -l
hostb0@pci0:0:0:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1480 subvendor=0x1849 subdevice=0x1480
none0@pci0:0:0:2: class=0x080600 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1481 subvendor=0x1022 subdevice=0x1481
hostb1@pci0:0:1:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
pcib1@pci0:0:1:3: class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x1483 subvendor=0x1022 subdevice=0x1234
hostb2@pci0:0:2:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
hostb3@pci0:0:3:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
hostb4@pci0:0:4:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
hostb5@pci0:0:5:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
hostb6@pci0:0:7:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
pcib10@pci0:0:7:1: class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x1484 subvendor=0x1022 subdevice=0x1484
hostb7@pci0:0:8:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
pcib11@pci0:0:8:1: class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x1484 subvendor=0x1022 subdevice=0x1484
pcib12@pci0:0:8:2: class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x1484 subvendor=0x1022 subdevice=0x1484
pcib13@pci0:0:8:3: class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x1484 subvendor=0x1022 subdevice=0x1484
intsmb0@pci0:0:20:0: class=0x0c0500 rev=0x61 hdr=0x00 vendor=0x1022 device=0x790b subvendor=0x1849 subdevice=0xffff
isab0@pci0:0:20:3: class=0x060100 rev=0x51 hdr=0x00 vendor=0x1022 device=0x790e subvendor=0x1849 subdevice=0xffff
hostb8@pci0:0:24:0: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1440 subvendor=0x0000 subdevice=0x0000
hostb9@pci0:0:24:1: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1441 subvendor=0x0000 subdevice=0x0000
hostb10@pci0:0:24:2: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1442 subvendor=0x0000 subdevice=0x0000
hostb11@pci0:0:24:3: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1443 subvendor=0x0000 subdevice=0x0000
hostb12@pci0:0:24:4: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1444 subvendor=0x0000 subdevice=0x0000
hostb13@pci0:0:24:5: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1445 subvendor=0x0000 subdevice=0x0000
hostb14@pci0:0:24:6: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1446 subvendor=0x0000 subdevice=0x0000
hostb15@pci0:0:24:7: class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1447 subvendor=0x0000 subdevice=0x0000
xhci0@pci0:3:0:0: class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x1022 device=0x43d0 subvendor=0x1b21 subdevice=0x1142
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Advisory Non-Fatal Error
ahci0@pci0:3:0:1: class=0x010601 rev=0x01 hdr=0x00 vendor=0x1022 device=0x43c8 subvendor=0x1b21 subdevice=0x1062
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Advisory Non-Fatal Error
pcib2@pci0:3:0:2: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c6 subvendor=0x1b21 subdevice=0x0201
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Advisory Non-Fatal Error
pcib3@pci0:32:0:0: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c7 subvendor=0x1b21 subdevice=0x3306
pcib5@pci0:32:1:0: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c7 subvendor=0x1b21 subdevice=0x3306
pcib6@pci0:32:2:0: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c7 subvendor=0x1b21 subdevice=0x3306
pcib7@pci0:32:3:0: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c7 subvendor=0x1b21 subdevice=0x3306
pcib8@pci0:32:4:0: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c7 subvendor=0x1b21 subdevice=0x3306
pcib9@pci0:32:8:0: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c7 subvendor=0x1b21 subdevice=0x3306
pcib4@pci0:33:0:0: class=0x060400 rev=0x04 hdr=0x01 vendor=0x1a03 device=0x1150 subvendor=0x1849 subdevice=0x1150
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Receiver Error
                   Advisory Non-Fatal Error
vgapci0@pci0:34:0:0: class=0x030000 rev=0x41 hdr=0x00 vendor=0x1a03 device=0x2000 subvendor=0x1a03 subdevice=0x2000
igb0@pci0:35:0:0: class=0x020000 rev=0x03 hdr=0x00 vendor=0x8086 device=0x1533 subvendor=0x1849 subdevice=0x1533
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Replay Timer Timeout
                   Advisory Non-Fatal Error
igb1@pci0:36:0:0: class=0x020000 rev=0x03 hdr=0x00 vendor=0x8086 device=0x1533 subvendor=0x1849 subdevice=0x1533
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Advisory Non-Fatal Error
ahci1@pci0:37:0:0: class=0x010601 rev=0x02 hdr=0x00 vendor=0x1b21 device=0x0612 subvendor=0x1849 subdevice=0x0612
nvme0@pci0:38:0:0: class=0x010802 rev=0x03 hdr=0x00 vendor=0x2646 device=0x2263 subvendor=0x2646 subdevice=0x2263
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Advisory Non-Fatal Error
nvme1@pci0:42:0:0: class=0x010802 rev=0x01 hdr=0x00 vendor=0x1987 device=0x5012 subvendor=0x1987 subdevice=0x5012
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Receiver Error
                   Advisory Non-Fatal Error
none1@pci0:43:0:0: class=0x130000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x148a subvendor=0x1022 subdevice=0x148a
none2@pci0:44:0:0: class=0x130000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1485 subvendor=0x1022 subdevice=0x1485
none3@pci0:44:0:1: class=0x108000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1486 subvendor=0x1022 subdevice=0x1486
xhci1@pci0:44:0:3: class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x149c subvendor=0x1849 subdevice=0xffff
hdac0@pci0:44:0:4: class=0x040300 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1487 subvendor=0x1022 subdevice=0xc950
ahci2@pci0:48:0:0: class=0x010601 rev=0x51 hdr=0x00 vendor=0x1022 device=0x7901 subvendor=0x1849 subdevice=0xffff
ahci3@pci0:49:0:0: class=0x010601 rev=0x51 hdr=0x00 vendor=0x1022 device=0x7901 subvendor=0x1849 subdevice=0xffff

Let me know if there's any other info I can provide.
I think the card might just be flaky. After several attempts at booting into single user mode and loading the driver by hand, it did finally get recognized, and its device ID isn't even listed in my pciconf output above, so in that boot the card was not visible on the PCI bus at all. It could also be really old firmware. I've never updated it.
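For anyone triaging a similar report, a rough way to tell "card never enumerated on the PCI bus" apart from "card present but no driver attached" is to look at the first field of each `pciconf -l` line for the vendor in question. The sketch below is only an illustration: `pci_vendor_state` is a made-up helper name, 0x15b3 is the Mellanox vendor ID as it appears in the pciconf line quoted later in this report, and it assumes the `vendor=` field layout shown in the output above (devices without a driver show up as `noneN@...`).

```sh
#!/bin/sh
# Hypothetical helper: classify how a PCI vendor shows up in `pciconf -l`
# output read from stdin. Prints one line per matching device, or "absent"
# if no device with that vendor ID was enumerated at all.
pci_vendor_state() {
    awk -v v="vendor=$1" '
        {
            # Compare whole fields so that "subvendor=..." does not match.
            for (i = 2; i <= NF; i++)
                if ($i == v) {
                    found = 1
                    if ($1 ~ /^none[0-9]+@/)
                        print "unattached " $1   # on the bus, no driver
                    else
                        print "attached " $1     # a driver claimed it
                }
        }
        END { if (!found) print "absent" }
    '
}

# Usage on FreeBSD:
#   pciconf -l | pci_vendor_state 0x15b3
```

An "absent" result, as in the output above, points at the hardware, firmware, or PCIe link rather than at the mlx4 driver, since the kernel has nothing to attach to.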
I was on firmware 2.9.1000, which was one release behind. The changelog between that and the latest version, 2.9.1200, includes the following items that seem likely to be related:

---
On rare occasions, a PCIe link failure of MLNX 10GbE single and dual port card occurred due to PCIe receiver detection failure

PCIe FunctionLevelReset occasionally failed when the driver was down
---

In the past I had issues where a normal reboot sometimes made the card disappear and I needed a clean power cycle to get it recognized, so for a long time I just power cycled the server when I ran updates. It always worked after a clean power cycle. As of FreeBSD 13.1 it was no longer being detected after a power cycle. I had to do many power cycles and boots into single user mode until I got lucky and it was detected.
I also want to note to anyone trying to use the official mst / flint tool from Nvidia (Mellanox) on FreeBSD that it will screw up your FreeBSD install by changing the following directories to have an owner of uid 70705:

/usr
/usr/lib
/usr/lib/bash_libs
/usr/bin
/usr/include
/usr/share
/usr/share/man
/usr/share/man/man1
/etc
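If you have already run that tool, a quick check like the one below can show which of those directories were affected. This is a minimal sketch: `owned_by_uid` is a hypothetical helper name, and the commented-out `chown` assumes root is the intended owner, which you should confirm against a clean install before running it.

```sh
#!/bin/sh
# Hypothetical helper: print each listed path whose owner is the given
# numeric uid. `-prune` makes find test the path itself without descending.
owned_by_uid() {
    uid=$1
    shift
    for p in "$@"; do
        find "$p" -prune -user "$uid" -print 2>/dev/null
    done
}

# Usage (directories taken from the list above):
#   owned_by_uid 70705 /usr /usr/lib /usr/lib/bash_libs /usr/bin \
#       /usr/include /usr/share /usr/share/man /usr/share/man/man1 /etc
#
# To restore the owner on the affected directories only (not recursive),
# ASSUMING root is the correct owner on your system:
#   owned_by_uid 70705 /usr /usr/lib /usr/bin /etc | xargs chown root
```

The check is deliberately non-recursive because the report above only mentions the directories themselves changing owner.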
Created attachment 234590
ConnectX2-FW-2_9_1200-release_notes.pdf

Attaching this PDF as it's not officially hosted anymore and may be useful for future research on these cards.
This is now coming up for me on every boot, so I think the firmware update is the fix.

Useful notes for future explorers:

1. Use the sysutils/mstflint package.

2. You can get the board and firmware version from sysctl, e.g.:

> sysctl sys.device.mlx4_core0
sys.device.mlx4_core0.hw.board_id: MT_0F60110010
sys.device.mlx4_core0.hw.fw_version: 2.9.1200
sys.device.mlx4_core0.mlx4_port1_mtu: 4096
sys.device.mlx4_core0.mlx4_port1: eth

3. The mstflint examples for Linux pass the device as a PCI address formatted like:

> mstflint -d 02:00.0 q

On FreeBSD the pciconf output may show the following:

mlx4_core0@pci0:1:0:0: class=0x020000 rev=0xb0 hdr=0x00 vendor=0x15b3 device=0x6750 subvendor=0x15b3 subdevice=0x0015

The correct device specifier is the selector after the "@", pci0:1:0:0, so the command would be as follows:

> mstflint -d pci0:1:0:0 q

Hope this helps.

Closing, I don't think this is a kernel bug anymore.
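As a footnote to item 3 above, the selector can be pulled out of `pciconf -l` mechanically instead of being copied by hand. The following is a small sketch, not a tested procedure: `mlx_selectors` is a made-up helper name, and it assumes the `vendor=0x15b3` field layout shown in the pciconf line above.

```sh
#!/bin/sh
# Hypothetical helper: read `pciconf -l` output on stdin and print the
# selector (e.g. "pci0:1:0:0") of every device whose vendor is 0x15b3.
mlx_selectors() {
    awk '{
        # Compare whole fields so that "subvendor=0x15b3" does not match.
        for (i = 2; i <= NF; i++)
            if ($i == "vendor=0x15b3") {
                sel = $1
                sub(/^[^@]*@/, "", sel)   # drop the "mlx4_core0@" prefix
                sub(/:$/, "", sel)        # drop the trailing ":"
                print sel
            }
    }'
}

# Usage on FreeBSD, as root, with sysutils/mstflint installed:
#   pciconf -l | mlx_selectors | while read -r sel; do
#       mstflint -d "$sel" q
#   done
```

This only works when the card is actually enumerated; if the helper prints nothing, the card is not visible on the bus and mstflint has no device to query.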