Bug 275897 - mlx4en: Panic when mlx4en is loaded
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 14.0-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Konstantin Belousov
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2023-12-23 13:38 UTC by Yuji Hagiwara
Modified: 2024-05-07 12:23 UTC

See Also:
Flags: linimon: mfc-stable13?


Attachments
core (67.78 KB, text/plain)
2023-12-23 13:38 UTC, Yuji Hagiwara

Description Yuji Hagiwara 2023-12-23 13:38:50 UTC
Created attachment 247214
core

A kernel panic (page fault) occurs when I try to load mlx4en.

My machine has a Mellanox ConnectX-3.

----
% pciconf -vl
hostb0@pci0:0:0:0:      class=0x060000 rev=0x00 hdr=0x00 vendor=0x8086 device=0x4e24 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = HOST-PCI
vgapci0@pci0:0:2:0:     class=0x030000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4e61 subvendor=0x8086 subdevice=0x2212
    vendor     = 'Intel Corporation'
    device     = 'JasperLake [UHD Graphics]'
    class      = display
    subclass   = VGA
xhci0@pci0:0:20:0:      class=0x0c0330 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4ded subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = serial bus
    subclass   = USB
none0@pci0:0:20:2:      class=0x050000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4def subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = memory
    subclass   = RAM
none1@pci0:0:22:0:      class=0x078000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4de0 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    device     = 'Management Engine Interface'
    class      = simple comms
sdhci_pci0@pci0:0:26:0: class=0x080501 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4dc4 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = base peripheral
    subclass   = SD host controller
pcib1@pci0:0:28:0:      class=0x060400 rev=0x01 hdr=0x01 vendor=0x8086 device=0x4db8 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-PCI
pcib2@pci0:0:28:1:      class=0x060400 rev=0x01 hdr=0x01 vendor=0x8086 device=0x4db9 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-PCI
pcib3@pci0:0:28:2:      class=0x060400 rev=0x01 hdr=0x01 vendor=0x8086 device=0x4dba subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-PCI
pcib4@pci0:0:28:3:      class=0x060400 rev=0x01 hdr=0x01 vendor=0x8086 device=0x4dbb subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-PCI
pcib5@pci0:0:28:4:      class=0x060400 rev=0x01 hdr=0x01 vendor=0x8086 device=0x4dbc subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-PCI
isab0@pci0:0:31:0:      class=0x060100 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4d87 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-ISA
none2@pci0:0:31:3:      class=0x040300 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4dc8 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    device     = 'Jasper Lake HD Audio'
    class      = multimedia
    subclass   = HDA
none3@pci0:0:31:4:      class=0x0c0500 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4da3 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    device     = 'Jasper Lake SMBus'
    class      = serial bus
    subclass   = SMBus
none4@pci0:0:31:5:      class=0x0c8000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x4da4 subvendor=0x8086 subdevice=0x7270
    vendor     = 'Intel Corporation'
    device     = 'Jasper Lake SPI Controller'
    class      = serial bus
igc0@pci0:1:0:0:        class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-V'
    class      = network
    subclass   = ethernet
igc1@pci0:2:0:0:        class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-V'
    class      = network
    subclass   = ethernet
igc2@pci0:3:0:0:        class=0x020000 rev=0x04 hdr=0x00 vendor=0x8086 device=0x125c subvendor=0x8086 subdevice=0x0000
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller I226-V'
    class      = network
    subclass   = ethernet
nvme0@pci0:4:0:0:       class=0x010802 rev=0x03 hdr=0x00 vendor=0x8086 device=0xf1a6 subvendor=0x8086 subdevice=0x390b
    vendor     = 'Intel Corporation'
    device     = 'SSD Pro 7600p/760p/E 6100p Series'
    class      = mass storage
    subclass   = NVM
mlx4_core0@pci0:5:0:0:  class=0x020000 rev=0x00 hdr=0x00 vendor=0x15b3 device=0x1003 subvendor=0x15b3 subdevice=0x0113
    vendor     = 'Mellanox Technologies'
    device     = 'MT27500 Family [ConnectX-3]'
    class      = network
    subclass   = ethernet
----

Steps to reproduce:
# kldload mlx4en

The core is attached.

Analysis:

Judging from the stack trace in the core, the kernel panicked because "ifm->ifm_status" was NULL at
https://cgit.freebsd.org/src/tree/sys/net/if_media.c?h=releng/14.0#n293
and that statement was reached while mlx4en was calling ether_ifattach():
https://cgit.freebsd.org/src/tree/sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c?h=releng/14.0#n2296

The ifm_status callback appears to be set in ifmedia_init() https://cgit.freebsd.org/src/tree/sys/net/if_media.c?h=releng/14.0#n87
but mlx4en calls ifmedia_init() only after it has called ether_ifattach():
https://cgit.freebsd.org/src/tree/sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c?h=releng/14.0#n2298

I believe this is the root cause.

I'd like to propose the patch below, which fixes it by moving the ether_ifattach() call after ifmedia_init().
----
diff --git a/sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c b/sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c
index c26afc0099b5..583de1816d1b 100644
--- a/sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c
+++ b/sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c
@@ -2293,7 +2293,6 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
                dev_addr[ETHER_ADDR_LEN - 1 - i] = (u8) (priv->mac >> (8 * i));


-       ether_ifattach(dev, dev_addr);
        if_link_state_change(dev, LINK_STATE_DOWN);
        ifmedia_init(&priv->media, IFM_IMASK | IFM_ETH_FMASK,
            mlx4_en_media_change, mlx4_en_media_status);
@@ -2306,6 +2305,8 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,

        DEBUGNET_SET(dev, mlx4_en);

+       ether_ifattach(dev, dev_addr);
+
        en_warn(priv, "Using %d TX rings\n", prof->tx_ring_num);
        en_warn(priv, "Using %d RX rings\n", prof->rx_ring_num);
----
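
For readers who want to see the ordering problem in isolation, here is a minimal userland sketch. It is not FreeBSD code: every name in it (struct demo_media, demo_media_init(), demo_attach(), and so on) is a hypothetical stand-in, and demo_attach() returns an error where the real kernel would take a page fault. It only illustrates why an attach step that invokes a status callback has to run after the step that installs that callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real FreeBSD types. */
struct demo_media_req {
	int status;
};

typedef void (*demo_status_cb)(struct demo_media_req *);

struct demo_media {
	demo_status_cb status_cb;	/* plays the role of ifm_status */
};

struct demo_nic {
	struct demo_media media;
	int attached;
};

/* Example status callback: reports a fixed, made-up status value. */
static void
demo_media_status(struct demo_media_req *req)
{
	req->status = 1;
}

/* Plays the role of ifmedia_init(): installs the status callback. */
static void
demo_media_init(struct demo_media *m, demo_status_cb cb)
{
	assert(m != NULL);
	m->status_cb = cb;
}

/*
 * Plays the role of ether_ifattach(): queries the media status while
 * attaching.  Assumption for the demo: a missing callback is reported
 * as -1 here, whereas the kernel would dereference the NULL pointer.
 */
static int
demo_attach(struct demo_nic *n)
{
	struct demo_media_req req = { 0 };

	assert(n != NULL);
	if (n->media.status_cb == NULL)
		return (-1);
	n->media.status_cb(&req);
	n->attached = 1;
	return (0);
}

/* Order before the patch: attach first, install the callback afterwards. */
static int
demo_init_old_order(struct demo_nic *n)
{
	int rc = demo_attach(n);

	demo_media_init(&n->media, demo_media_status);
	return (rc);
}

/* Order after the patch: install the callback, then attach. */
static int
demo_init_new_order(struct demo_nic *n)
{
	demo_media_init(&n->media, demo_media_status);
	return (demo_attach(n));
}
```

On a zero-initialized struct demo_nic, demo_init_old_order() fails because the callback is still NULL at attach time, while demo_init_new_order() succeeds and marks the device attached.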
Comment 1 commit-hook freebsd_committer freebsd_triage 2023-12-23 20:59:04 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=553ed8e38bfdd4832deecdec1c0b023824dcff94

commit 553ed8e38bfdd4832deecdec1c0b023824dcff94
Author:     Yuji Hagiwara <yuuzi41@hotmail.com>
AuthorDate: 2023-12-23 20:53:02 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2023-12-23 20:53:02 +0000

    mlx4(5): fix driver initialization

    After netlinkification, ether_ifattach() requires ifmedia_init() to be
    done before it.

    PR:     275897
    MFC after:      1 week

 sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Comment 2 commit-hook freebsd_committer freebsd_triage 2023-12-30 00:25:15 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=1e9df419f14c059aba8d6704256da5c7af4f182a

commit 1e9df419f14c059aba8d6704256da5c7af4f182a
Author:     Yuji Hagiwara <yuuzi41@hotmail.com>
AuthorDate: 2023-12-23 20:53:02 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2023-12-30 00:24:07 +0000

    mlx4(5): fix driver initialization

    PR:     275897
    MFC after:      1 week

    (cherry picked from commit 553ed8e38bfdd4832deecdec1c0b023824dcff94)

 sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Comment 3 Jory Folker 2024-05-07 04:04:09 UTC
Thanks, Yuji! That saved me a bit of effort!

I have a question for a maintainer with pull access...
Will the fix be merged into the next 14.x release? I encountered the same kernel panic while trying to load mlx4en for a 2x40G ConnectX-3.

I also have a comment on that note...
In the meantime, I hand-merged this change into /usr/src and recompiled the kernel, but I also made a 2-line change in iface.c to check a return value for NULL before dereferencing it. It won't make any failing drivers kldload successfully, but it at least reduces a kernel panic to an error message.

Thanks in advance!
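
To illustrate the kind of defensive check I mean, here is a rough userland sketch. It is not the actual iface.c change: demo_get_state(), demo_dump_state() and struct demo_media_state are hypothetical names, and the -1 return value is an assumption made for the demo. The point is only that checking the returned pointer turns a NULL dereference into a reported error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical state object; not a real kernel structure. */
struct demo_media_state {
	int active;
};

/* Hypothetical lookup that may return NULL when no state is available. */
static struct demo_media_state *
demo_get_state(struct demo_media_state *maybe)
{
	return (maybe);
}

/*
 * Copies the "active" field to *out_active.
 * Returns 0 on success, or -1 (and logs a message) when the lookup
 * yields NULL, instead of dereferencing the NULL pointer.
 */
static int
demo_dump_state(struct demo_media_state *maybe, int *out_active)
{
	struct demo_media_state *st;

	assert(out_active != NULL);
	st = demo_get_state(maybe);
	if (st == NULL) {
		fprintf(stderr, "demo: media state unavailable\n");
		return (-1);
	}
	*out_active = st->active;
	return (0);
}
```

As noted above, a check like this does not make a misordered driver load successfully; it only changes the failure mode from a panic to an error.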
Comment 4 Konstantin Belousov freebsd_committer freebsd_triage 2024-05-07 12:23:03 UTC
The change has been in stable/14 for quite some time, so it is already in the
releng branch that is on the road to 14.2.