Bug 264583 - mlx4en(4): Mellanox ConnectX-2 10g not working in 13.1-RELEASE (fine on 13.0-RELEASE)
Status: Closed Not A Bug
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: needs-qa, regression
Depends on:
Blocks: 264030
Reported: 2022-06-10 01:15 UTC by Mark Felder
Modified: 2022-06-10 02:56 UTC (History)

See Also:


Attachments
ConnectX2-FW-2_9_1200-release_notes.pdf (407.46 KB, application/pdf)
2022-06-10 02:48 UTC, Mark Felder

Description Mark Felder freebsd_committer freebsd_triage 2022-06-10 01:15:41 UTC
Hello,

My ConnectX-2 worked up through 13.0-RELEASE but is not working in 13.1-RELEASE. All of the required modules are loaded:

Id Refs Address                Size Name
 1   98 0xffffffff80200000  1f30590 kernel
(trimmed)
33    2 0xffffffff82aec000    3ef50 mlx4.ko
34    1 0xffffffff82b2b000    13354 mlx4en.ko


Nothing about the card shows up in dmesg.

I have only these sysctls:

# sysctl -a | grep mlx
device	mlx
device	mlx5
device	mlxfw
device	mlx5en
hw.mlx5.calibr.normal: 30
hw.mlx5.calibr.fast: 1
hw.mlx5.calibr.duration: 20
hw.mlx5.relaxed_ordering_write: 0
hw.mlx5.auto_fw_update: 0
hw.mlx5.fast_unload_enabled: 1
hw.mlx5.prof_sel: 2
hw.mlx5.debug_mask: 0
hw.mlx5.sw_reset_timeout: 1200
hw.mlx5.fw_reset_enable: 1
hw.mlx5.fw_dump_enable: 1
compat.linuxkpi.mlx4_inline_thold: 104
compat.linuxkpi.mlx4_pfcrx: 0
compat.linuxkpi.mlx4_pfctx: 0
compat.linuxkpi.mlx4_udp_rss: 1
compat.linuxkpi.mlx4_log_mtts_per_seg: 3
compat.linuxkpi.mlx4_use_prio: 0
compat.linuxkpi.mlx4_log_num_vlan: 0
compat.linuxkpi.mlx4_log_num_mac: 7
compat.linuxkpi.mlx4_enable_4k_uar: 0
compat.linuxkpi.mlx4_enable_64b_cqe_eqe: 1
compat.linuxkpi.mlx4_log_num_mgm_entry_size: 10
compat.linuxkpi.mlx4_msi_x: 1
compat.linuxkpi.mlx4_enable_qos: 0
compat.linuxkpi.mlx4_internal_err_reset: 1


# pciconf -e -l
hostb0@pci0:0:0:0:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1480 subvendor=0x1849 subdevice=0x1480
none0@pci0:0:0:2:	class=0x080600 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1481 subvendor=0x1022 subdevice=0x1481
hostb1@pci0:0:1:0:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
pcib1@pci0:0:1:3:	class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x1483 subvendor=0x1022 subdevice=0x1234
hostb2@pci0:0:2:0:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
hostb3@pci0:0:3:0:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
hostb4@pci0:0:4:0:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
hostb5@pci0:0:5:0:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
hostb6@pci0:0:7:0:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
pcib10@pci0:0:7:1:	class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x1484 subvendor=0x1022 subdevice=0x1484
hostb7@pci0:0:8:0:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1482 subvendor=0x0000 subdevice=0x0000
pcib11@pci0:0:8:1:	class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x1484 subvendor=0x1022 subdevice=0x1484
pcib12@pci0:0:8:2:	class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x1484 subvendor=0x1022 subdevice=0x1484
pcib13@pci0:0:8:3:	class=0x060400 rev=0x00 hdr=0x01 vendor=0x1022 device=0x1484 subvendor=0x1022 subdevice=0x1484
intsmb0@pci0:0:20:0:	class=0x0c0500 rev=0x61 hdr=0x00 vendor=0x1022 device=0x790b subvendor=0x1849 subdevice=0xffff
isab0@pci0:0:20:3:	class=0x060100 rev=0x51 hdr=0x00 vendor=0x1022 device=0x790e subvendor=0x1849 subdevice=0xffff
hostb8@pci0:0:24:0:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1440 subvendor=0x0000 subdevice=0x0000
hostb9@pci0:0:24:1:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1441 subvendor=0x0000 subdevice=0x0000
hostb10@pci0:0:24:2:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1442 subvendor=0x0000 subdevice=0x0000
hostb11@pci0:0:24:3:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1443 subvendor=0x0000 subdevice=0x0000
hostb12@pci0:0:24:4:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1444 subvendor=0x0000 subdevice=0x0000
hostb13@pci0:0:24:5:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1445 subvendor=0x0000 subdevice=0x0000
hostb14@pci0:0:24:6:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1446 subvendor=0x0000 subdevice=0x0000
hostb15@pci0:0:24:7:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1447 subvendor=0x0000 subdevice=0x0000
xhci0@pci0:3:0:0:	class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x1022 device=0x43d0 subvendor=0x1b21 subdevice=0x1142
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Advisory Non-Fatal Error
ahci0@pci0:3:0:1:	class=0x010601 rev=0x01 hdr=0x00 vendor=0x1022 device=0x43c8 subvendor=0x1b21 subdevice=0x1062
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Advisory Non-Fatal Error
pcib2@pci0:3:0:2:	class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c6 subvendor=0x1b21 subdevice=0x0201
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Advisory Non-Fatal Error
pcib3@pci0:32:0:0:	class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c7 subvendor=0x1b21 subdevice=0x3306
pcib5@pci0:32:1:0:	class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c7 subvendor=0x1b21 subdevice=0x3306
pcib6@pci0:32:2:0:	class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c7 subvendor=0x1b21 subdevice=0x3306
pcib7@pci0:32:3:0:	class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c7 subvendor=0x1b21 subdevice=0x3306
pcib8@pci0:32:4:0:	class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c7 subvendor=0x1b21 subdevice=0x3306
pcib9@pci0:32:8:0:	class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x43c7 subvendor=0x1b21 subdevice=0x3306
pcib4@pci0:33:0:0:	class=0x060400 rev=0x04 hdr=0x01 vendor=0x1a03 device=0x1150 subvendor=0x1849 subdevice=0x1150
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Receiver Error
                 Advisory Non-Fatal Error
vgapci0@pci0:34:0:0:	class=0x030000 rev=0x41 hdr=0x00 vendor=0x1a03 device=0x2000 subvendor=0x1a03 subdevice=0x2000
igb0@pci0:35:0:0:	class=0x020000 rev=0x03 hdr=0x00 vendor=0x8086 device=0x1533 subvendor=0x1849 subdevice=0x1533
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Replay Timer Timeout
                 Advisory Non-Fatal Error
igb1@pci0:36:0:0:	class=0x020000 rev=0x03 hdr=0x00 vendor=0x8086 device=0x1533 subvendor=0x1849 subdevice=0x1533
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Advisory Non-Fatal Error
ahci1@pci0:37:0:0:	class=0x010601 rev=0x02 hdr=0x00 vendor=0x1b21 device=0x0612 subvendor=0x1849 subdevice=0x0612
nvme0@pci0:38:0:0:	class=0x010802 rev=0x03 hdr=0x00 vendor=0x2646 device=0x2263 subvendor=0x2646 subdevice=0x2263
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Advisory Non-Fatal Error
nvme1@pci0:42:0:0:	class=0x010802 rev=0x01 hdr=0x00 vendor=0x1987 device=0x5012 subvendor=0x1987 subdevice=0x5012
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Receiver Error
                 Advisory Non-Fatal Error
none1@pci0:43:0:0:	class=0x130000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x148a subvendor=0x1022 subdevice=0x148a
none2@pci0:44:0:0:	class=0x130000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1485 subvendor=0x1022 subdevice=0x1485
none3@pci0:44:0:1:	class=0x108000 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1486 subvendor=0x1022 subdevice=0x1486
xhci1@pci0:44:0:3:	class=0x0c0330 rev=0x00 hdr=0x00 vendor=0x1022 device=0x149c subvendor=0x1849 subdevice=0xffff
hdac0@pci0:44:0:4:	class=0x040300 rev=0x00 hdr=0x00 vendor=0x1022 device=0x1487 subvendor=0x1022 subdevice=0xc950
ahci2@pci0:48:0:0:	class=0x010601 rev=0x51 hdr=0x00 vendor=0x1022 device=0x7901 subvendor=0x1849 subdevice=0xffff
ahci3@pci0:49:0:0:	class=0x010601 rev=0x51 hdr=0x00 vendor=0x1022 device=0x7901 subvendor=0x1849 subdevice=0xffff


Let me know if there's any other info I can provide.
Comment 1 Mark Felder freebsd_committer freebsd_triage 2022-06-10 01:59:36 UTC
I think the card might just be flaky. After several attempts at booting into single-user mode and testing, it was finally recognized when I loaded the driver. Its device ID isn't even listed in my pciconf output above.

It could also be really old firmware. I've never updated it.
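
For anyone retracing this, a quick way to tell "card not enumerated at all" from "card enumerated but no driver attached" is to filter pciconf output by the PCI vendor ID 0x15b3 (the vendor value that the working card reports further down in this bug). This is only a sketch: the helper name mlx_devices is made up for illustration, and it assumes the vendor=/device= output format shown in this report.

```sh
# Read `pciconf -l` text on stdin and print "<name> <state>" for every
# device whose vendor field is exactly 0x15b3. The subvendor field is a
# separate token, so it is deliberately not matched.
mlx_devices() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "vendor=0x15b3") {
                split($1, id, "@")
                # pciconf names devices without a driver "noneN"
                state = (id[1] ~ /^none[0-9]+$/) ? "no-driver" : "attached"
                print id[1], state
                next
            }
    }'
}

# Typical use on the affected host:
#   pciconf -l | mlx_devices
```

No output at all would correspond to the situation in the description, where the card is missing from the bus listing entirely and loading mlx4en cannot help.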
Comment 2 Mark Felder freebsd_committer freebsd_triage 2022-06-10 02:43:59 UTC
I was on firmware 2.9.1000, which is one release behind the latest version, 2.9.1200. The changelog between the two includes the following items, which seem likely to be relevant:


---
On rare occasions, a PCIe link failure of MLNX
10GbE single and dual port card occurred due to
PCIe receiver detection failure

PCIe FunctionLevelReset occasionally failed when
the driver was down
---

In the past, a normal reboot sometimes made the card disappear and I needed a clean power cycle to get it recognized, so for a long time I just power cycled the server whenever I ran updates. However, it always worked after a clean reboot.

As of FreeBSD 13.1 it was no longer being detected after a power cycle. I had to power cycle and boot into single-user mode many times before I got lucky and it was detected.
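
To check whether a card is still on the older firmware, the version strings can be compared field by field. A rough sketch follows; the function names are hypothetical, and the sysctl node (sys.device.mlx4_core0) is the one quoted in comment 5, which only exists once the driver has attached.

```sh
# Pull the firmware version out of `sysctl sys.device.mlx4_core0` output
# supplied on stdin.
fw_version_from_sysctl() {
    awk '$1 ~ /\.hw\.fw_version:$/ { print $2; exit }'
}

# Succeed (exit 0) only if dotted version $1 is strictly older than $2.
# Fields are compared numerically, left to right; missing fields count as 0.
fw_older_than() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "[.]"); nb = split(b, y, "[.]")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = x[i] + 0; yi = y[i] + 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 1
    }'
}

# Typical use on the affected host:
#   fw=$(sysctl sys.device.mlx4_core0 | fw_version_from_sysctl)
#   fw_older_than "$fw" 2.9.1200 && echo "firmware $fw predates 2.9.1200"
```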
Comment 3 Mark Felder freebsd_committer freebsd_triage 2022-06-10 02:45:01 UTC
A note for anyone trying to use the official mst / flint tools from Nvidia (Mellanox) on FreeBSD: they will screw up your FreeBSD install by changing the owner of the following directories to uid 70705:

/usr
/usr/lib
/usr/lib/bash_libs
/usr/bin
/usr/include
/usr/share
/usr/share/man
/usr/share/man/man1
/etc
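
If you already ran the vendor tools, something like the following can list what ended up owned by that uid. It is a sketch only: owned_by_uid is a made-up helper name, and it reports paths without changing anything.

```sh
# List every path under the given directories that is owned by the
# numeric uid in $1, without crossing filesystem boundaries.
owned_by_uid() {
    uid=$1
    shift
    find "$@" -xdev -user "$uid" -print 2>/dev/null
}

# Typical use on the affected host:
#   owned_by_uid 70705 /usr /etc
```

On a stock install these system directories are normally owned by root, so chown back to root is the likely repair, but that is an assumption worth checking against a clean system before changing anything.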
Comment 4 Mark Felder freebsd_committer freebsd_triage 2022-06-10 02:48:08 UTC
Created attachment 234590
ConnectX2-FW-2_9_1200-release_notes.pdf

Attaching this PDF because it is no longer officially hosted and may be useful for future research on these cards.
Comment 5 Mark Felder freebsd_committer freebsd_triage 2022-06-10 02:56:12 UTC
The card is now coming up on every boot, so I think the firmware update is the fix.

Useful notes for future explorers:

1. Use the sysutils/mstflint package

2. You can get the board ID and firmware version from sysctl, e.g.:

> sysctl sys.device.mlx4_core0
sys.device.mlx4_core0.hw.board_id: MT_0F60110010
sys.device.mlx4_core0.hw.fw_version: 2.9.1200
sys.device.mlx4_core0.mlx4_port1_mtu: 4096
sys.device.mlx4_core0.mlx4_port1: eth

3. The mstflint examples for Linux show the device ID formatted like:

> mstflint -d 02:00.0 q

On FreeBSD the pciconf output may show the following:

mlx4_core0@pci0:1:0:0:	class=0x020000 rev=0xb0 hdr=0x00 vendor=0x15b3 device=0x6750 subvendor=0x15b3 subdevice=0x0015

The correct device argument is pci0:1:0:0 (the selector after the "@", without the trailing colon), so the command would be as follows:

> mstflint -d pci0:1:0:0 q
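
The translation from a pciconf line to the mstflint argument can be scripted. A minimal sketch, assuming the pciconf output format shown above and the 0x15b3 vendor ID from that line; the helper name mstflint_dev is hypothetical.

```sh
# Read `pciconf -l` text on stdin and print the selector (for example
# pci0:1:0:0) of each device whose vendor field is exactly 0x15b3.
mstflint_dev() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "vendor=0x15b3") {
                sel = $1
                sub(/^[^@]*@/, "", sel)   # drop the "mlx4_core0@" prefix
                sub(/:$/, "", sel)        # drop the trailing colon
                print sel
                next
            }
    }'
}

# Typical use on the affected host (first matching card only):
#   mstflint -d "$(pciconf -l | mstflint_dev | head -n 1)" q
```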


Hope this helps. Closing, as I no longer think this is a kernel bug.