Created attachment 247214
core

A kernel panic (page fault) happens when I try to load mlx4en.
My machine has a Mellanox ConnectX-3.

----
% pciconf -vl
hostb0@pci0:0:0:0:	class=0x060000 rev=0x00 hdr=0x00 vendor=0x8086 device=0x4e24 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = HOST-PCI
vgapci0@pci0:0:2:0:	class=0x030000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4e61 subvendor=0x8086 subdevice=0x2212
    vendor     = 'Intel Corporation'
    device     = 'JasperLake [UHD Graphics]'
    class      = display
    subclass   = VGA
xhci0@pci0:0:20:0:	class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4ded subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = serial bus
    subclass   = USB
none0@pci0:0:20:2:	class=0x050000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4def subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = memory
    subclass   = RAM
none1@pci0:0:22:0:	class=0x078000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4de0 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    device     = 'Management Engine Interface'
    class      = simple comms
sdhci_pci0@pci0:0:26:0:	class=0x080501 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4dc4 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = base peripheral
    subclass   = SD host controller
pcib1@pci0:0:28:0:	class=0x060400 rev=0x01 hdr=0x01 vendor=0x8086 device=0x4db8 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-PCI
pcib2@pci0:0:28:1:	class=0x060400 rev=0x01 hdr=0x01 vendor=0x8086 device=0x4db9 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-PCI
pcib3@pci0:0:28:2:	class=0x060400 rev=0x01 hdr=0x01 vendor=0x8086 device=0x4dba subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-PCI
pcib4@pci0:0:28:3:	class=0x060400 rev=0x01 hdr=0x01 vendor=0x8086 device=0x4dbb subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-PCI
pcib5@pci0:0:28:4:	class=0x060400 rev=0x01 hdr=0x01 vendor=0x8086 device=0x4dbc subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-PCI
isab0@pci0:0:31:0:	class=0x060100 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4d87 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-ISA
none2@pci0:0:31:3:	class=0x040300 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4dc8 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    device     = 'Jasper Lake HD Audio'
    class      = multimedia
    subclass   = HDA
none3@pci0:0:31:4:	class=0x0c0500 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4da3 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    device     = 'Jasper Lake SMBus'
    class      = serial bus
    subclass   = SMBus
none4@pci0:0:31:5:	class=0x0c8000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4da4 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    device     = 'Jasper Lake SPI Controller'
    class      = serial bus
igc0@pci0:1:0:0:	class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-V'
    class      = network
    subclass   = ethernet
igc1@pci0:2:0:0:	class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-V'
    class      = network
    subclass   = ethernet
igc2@pci0:3:0:0:	class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-V'
    class      = network
    subclass   = ethernet
nvme0@pci0:4:0:0:	class=0x010802 rev=0x03 hdr=0x00 vendor=0x8086 device=0xf1a6 subvendor=0x8086 subdevice=0x390b
    vendor     = 'Intel Corporation'
    device     = 'SSD Pro 7600p/760p/E 6100p Series'
    class      = mass storage
    subclass   = NVM
mlx4_core0@pci0:5:0:0:	class=0x020000 rev=0x00 hdr=0x00 vendor=0x15b3 device=0x1003 subvendor=0x15b3 subdevice=0x0113
    vendor     = 'Mellanox Technologies'
    device     = 'MT27500 Family [ConnectX-3]'
    class      = network
    subclass   = ethernet
----

Steps to reproduce:
# kldload mlx4en

The core dump is attached.

Analysis:
From the stack trace in the core, the kernel panic occurred because "ifm->ifm_status" was NULL at
https://cgit.freebsd.org/src/tree/sys/net/if_media.c?h=releng/14.0#n293
and that statement was reached while mlx4en was calling ether_ifattach().
https://cgit.freebsd.org/src/tree/sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c?h=releng/14.0#n2296

The ifm_status callback appears to be set in ifmedia_init()
https://cgit.freebsd.org/src/tree/sys/net/if_media.c?h=releng/14.0#n87
but mlx4en calls ifmedia_init() only after it has already called ether_ifattach().
https://cgit.freebsd.org/src/tree/sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c?h=releng/14.0#n2298

I think that is the root cause. I'd like to propose the patch below to fix it. It reorders the statements so that ether_ifattach() is called after ifmedia_init().

----
diff --git a/sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c b/sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c
index c26afc0099b5..583de1816d1b 100644
--- a/sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c
+++ b/sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c
@@ -2293,7 +2293,6 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 		dev_addr[ETHER_ADDR_LEN - 1 - i] = (u8) (priv->mac >> (8 * i));
 
 
-	ether_ifattach(dev, dev_addr);
 	if_link_state_change(dev, LINK_STATE_DOWN);
 	ifmedia_init(&priv->media, IFM_IMASK | IFM_ETH_FMASK,
 	    mlx4_en_media_change, mlx4_en_media_status);
@@ -2306,6 +2305,8 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 
 	DEBUGNET_SET(dev, mlx4_en);
 
+	ether_ifattach(dev, dev_addr);
+
 	en_warn(priv, "Using %d TX rings\n", prof->tx_ring_num);
 	en_warn(priv, "Using %d RX rings\n", prof->rx_ring_num);
 
----
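For readers who want to see the ordering dependency in isolation, here is a minimal user-space sketch of it. It is not kernel code: struct media_model, model_media_init(), model_attach() and the two ordering helpers are hypothetical stand-ins for struct ifmedia, ifmedia_init() and the media query reached from ether_ifattach(). The NULL check in model_attach() is also an addition for illustration so the sketch returns an error rather than faulting; per the analysis above, the real path calls through the pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ifmedia: only the status callback. */
struct media_model {
	void (*status_cb)(int *status_out);
};

/* Hypothetical stand-in for the driver's media status callback. */
static void
model_media_status(int *status_out)
{
	*status_out = 1;
}

/* Models ifmedia_init(): this is where the callback gets installed. */
static void
model_media_init(struct media_model *m)
{
	assert(m != NULL);
	m->status_cb = model_media_status;
}

/*
 * Models the media query made during attach.
 * Returns 0 on success, -1 if the callback has not been installed yet.
 * The guard is illustrative only; without it this call would fault.
 */
static int
model_attach(struct media_model *m, int *status_out)
{
	assert(m != NULL && status_out != NULL);
	if (m->status_cb == NULL)
		return (-1);
	m->status_cb(status_out);
	return (0);
}

/* Order before the patch: attach first, media init second. */
static int
attach_then_init(void)
{
	struct media_model m = { NULL };
	int status = 0;
	int rc;

	rc = model_attach(&m, &status);
	model_media_init(&m);
	return (rc);
}

/* Order after the patch: media init first, attach second. */
static int
init_then_attach(void)
{
	struct media_model m = { NULL };
	int status = 0;

	model_media_init(&m);
	return (model_attach(&m, &status));
}
```

In this model attach_then_init() fails because the callback is still NULL when the query runs, while init_then_attach() succeeds, which is the behavior the reordering in the patch is meant to achieve.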
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=553ed8e38bfdd4832deecdec1c0b023824dcff94

commit 553ed8e38bfdd4832deecdec1c0b023824dcff94
Author:     Yuji Hagiwara <yuuzi41@hotmail.com>
AuthorDate: 2023-12-23 20:53:02 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2023-12-23 20:53:02 +0000

    mlx4(5): fix driver initialization

    After netlinkification, ether_ifattach() requires ifmedia_init() to be
    done before it.

    PR:             275897
    MFC after:      1 week

 sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=1e9df419f14c059aba8d6704256da5c7af4f182a

commit 1e9df419f14c059aba8d6704256da5c7af4f182a
Author:     Yuji Hagiwara <yuuzi41@hotmail.com>
AuthorDate: 2023-12-23 20:53:02 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2023-12-30 00:24:07 +0000

    mlx4(5): fix driver initialization

    PR:             275897
    MFC after:      1 week

    (cherry picked from commit 553ed8e38bfdd4832deecdec1c0b023824dcff94)

 sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Thanks, Yuji! That saved me a bit of work!

I have a question for a maintainer with commit access: will the fix be merged into the next 14.x release? I hit the same kernel panic while trying to load mlx4en for a 2x40G ConnectX-3.

One more note on that. In the meantime, I hand-merged this change into /usr/src and recompiled the kernel, and I also made a 2-line change in iface.c to check a return value for NULL before dereferencing it. That change won't make any failing driver kldload successfully, but it at least reduces a kernel panic to an error message.

Thanks in advance!
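The 2-line iface.c change is not shown in this comment, so the following is only a generic sketch of the pattern it describes (check a returned pointer for NULL and report an error instead of dereferencing it). Every name here, including struct link_info, lookup_link_info() and read_link_speed(), is hypothetical and not taken from iface.c, and the choice of ENXIO as the error code is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical record; not a structure from iface.c. */
struct link_info {
	int speed_mbps;
};

/* Hypothetical lookup that can fail and return NULL. */
static const struct link_info *
lookup_link_info(int have_media)
{
	static const struct link_info info = { 40000 };

	return (have_media ? &info : NULL);
}

/*
 * Checks the lookup result before dereferencing it.
 * Returns 0 on success, or ENXIO if no record is available,
 * so the caller sees an error code instead of a fault.
 */
static int
read_link_speed(int have_media, int *speed_out)
{
	const struct link_info *li;

	assert(speed_out != NULL);
	li = lookup_link_info(have_media);
	if (li == NULL)
		return (ENXIO);
	*speed_out = li->speed_mbps;
	return (0);
}
```

As the comment says, a guard like this does not make a misordered driver load correctly; it only changes the failure from a panic to a reported error.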
The change has been in stable/14 for quite some time, so it is already in the releng branch that is on the road to 14.2.