Bug 259748 - Loading mlx4en kernel module hangs
Summary: Loading mlx4en kernel module hangs
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-11-10 10:23 UTC by Tom Jones
Modified: 2021-12-05 01:03 UTC

See Also:


Attachments
procstat -akk (780 bytes, text/plain)
2021-11-10 16:54 UTC, Tom Jones
mstflint query output (534 bytes, text/plain)
2021-11-10 17:19 UTC, Tom Jones
procstat -akk run as root (85.10 KB, text/plain)
2021-11-11 16:46 UTC, Tom Jones

Description Tom Jones freebsd_committer 2021-11-10 10:23:36 UTC
Loading mlx4en on CURRENT (main-n250453-7ac82c96fe7) hangs kldload. Pressing ^C allows the load to continue, and the module then loads properly.

This happens both when running kldload from a shell and when the module is loaded via 'kld_list'.

The mlx4en part of dmesg is:

mlx4_core0: <mlx4_core> mem 0xfce00000-0xfcefffff,0xe0000000-0xe07fffff irq 54 at device 0.0 on pci9
mlx4_core: Mellanox ConnectX core driver v3.7.0 (July 2021)
mlx4_core: Initializing mlx4_core
mlx4_core0: Unable to determine PCI device chain minimum BW
mlx4_en mlx4_core0: Activating port:1
mlxen0: Ethernet address: ec:0d:9a:e1:34:20
mlx4_en: mlx4_core0: Port 1: Using 16 TX rings
mlxen0: link state changed to DOWN
mlx4_en: mlx4_core0: Port 1: Using 16 RX rings
mlx4_en: mlxen0: Using 16 TX rings
mlx4_en: mlxen0: Using 16 RX rings
mlx4_en: mlxen0: Initializing port
mlx4_en mlx4_core0: Activating port:2
mlxen1: Ethernet address: ec:0d:9a:e1:34:21
mlx4_en: mlx4_core0: Port 2: Using 16 TX rings
mlxen1: link state changed to DOWN
mlx4_en: mlx4_core0: Port 2: Using 16 RX rings
mlx4_en: mlxen1: Using 16 TX rings
mlx4_en: mlxen1: Using 16 RX rings
mlx4_en: mlxen1: Initializing port
mlx4_en: mlxen1: Link Up
mlxen1: link state changed to UP
mlx4_en: mlxen0: Link Up
mlxen0: link state changed to UP

This happens on my Ryzen 3700X system, and it has previously happened on other systems as well. 13-RELEASE does not have this problem.
Comment 1 Hans Petter Selasky freebsd_committer 2021-11-10 14:55:38 UTC
Can you run procstat -akk when this happens?

Have you updated the firmware on this device?

--HPS
Comment 2 Tom Jones freebsd_committer 2021-11-10 16:54:41 UTC
Created attachment 229410
procstat -akk
Comment 3 Tom Jones freebsd_committer 2021-11-10 17:19:04 UTC
Created attachment 229411
mstflint query output
Comment 4 Tom Jones freebsd_committer 2021-11-10 17:19:41 UTC
Updated to the latest firmware for this card (2.42.5000), and kldload still hangs.
Comment 5 Hans Petter Selasky freebsd_committer 2021-11-11 16:36:05 UTC
Did you run the procstat -akk as root?
Comment 6 Tom Jones freebsd_committer 2021-11-11 16:46:17 UTC
Created attachment 229432
procstat -akk run as root
Comment 7 Hans Petter Selasky freebsd_committer 2021-11-11 17:01:56 UTC
Comment on attachment 229432
procstat -akk run as root

@kib:

The LinuxKPI can load modules inside kldload:

 1050 100580 kldload             -                   mi_switch+0x155 sleepq_switch+0x119 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9 _sleep+0x294 kern_kldload+0xd5 mlx4_request_modules+0x9e mlx4_load_one+0x2f8d mlx4_init_one+0x4cc linux_pci_attach_device+0x42e device_attach+0x3c1 device_probe_and_attach+0x70 pci_driver_added+0xf3 devclass_driver_added+0x39 devclass_add_driver+0x147 _linux_pci_register_driver+0xcf 

This is a regression after:

commit e266a0f7f001c7886eab56d8c058d92d87010400
Author: Konstantin Belousov <kib@FreeBSD.org>
Date:   Thu May 20 17:50:43 2021 +0300

    kern linker: do not allow more than one kldload and kldunload syscalls simultaneously
    
    kld_sx is dropped e.g. for executing sysinits, which allows user
    to initiate kldunload while module is not yet fully initialized.
    
    Reviewed by:    markj
    Differential revision:  https://reviews.freebsd.org/D30456
    Sponsored by:   The FreeBSD Foundation
    MFC after:      1 week

Maybe allow recursion here? Or use a taskqueue to load the module?


--HPS
Comment 8 Konstantin Belousov freebsd_committer 2021-11-12 19:53:20 UTC
Please try https://reviews.freebsd.org/D32972
Comment 9 Hans Petter Selasky freebsd_committer 2021-11-27 09:54:30 UTC
Tom Jones: ping
Comment 10 Tom Jones freebsd_committer 2021-11-27 12:04:15 UTC
kldload doesn't hang with this patch applied; tested on e383c423c492781bd7e7a0de9dfe433e4d6a4eed.
Comment 11 commit-hook freebsd_committer 2021-11-28 08:37:01 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=4f924a786ae08af496dfe55230f8fe1e2ca16150

commit 4f924a786ae08af496dfe55230f8fe1e2ca16150
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2021-11-12 19:45:06 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2021-11-28 08:36:09 +0000

    linker_kldload_busy(): allow recursion

    Some drivers recursively loads modules by explicit calls to kldload
    during initialization, which might occur during kldload.

    PR:     259748
    Reported and tested by: thj
    Reviewed by:    markj
    Sponsored by:   Nvidia networking
    MFC after:      1 week
    Differential revision:  https://reviews.freebsd.org/D32972

 sys/kern/kern_linker.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)
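
The idea behind this fix can be illustrated with a small userland model. This is only a sketch and not the code from sys/kern/kern_linker.c: it assumes a POSIX threads environment, and the names load_gate, gate_enter, gate_exit and load_module are hypothetical. It models a "busy" gate that makes other threads wait while letting the thread that already holds it re-enter, which is the situation in the backtrace from comment 7, where mlx4_request_modules calls kern_kldload from inside a kldload that is still in progress.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of a "load in progress" gate that its owner may re-enter. */
struct load_gate {
	pthread_mutex_t	lock;	/* protects the fields below */
	pthread_cond_t	cv;	/* signalled when the gate becomes free */
	bool		owned;	/* true while some thread holds the gate */
	pthread_t	owner;	/* valid only while owned is true */
	unsigned	depth;	/* recursion depth of the owner */
};

static void
gate_init(struct load_gate *g)
{
	pthread_mutex_init(&g->lock, NULL);
	pthread_cond_init(&g->cv, NULL);
	g->owned = false;
	g->depth = 0;
}

static void
gate_enter(struct load_gate *g)
{
	pthread_mutex_lock(&g->lock);
	if (g->owned && pthread_equal(g->owner, pthread_self())) {
		/* Recursive entry by the current owner: count it, do not sleep. */
		g->depth++;
	} else {
		/* Any other thread waits until the gate is released. */
		while (g->owned)
			pthread_cond_wait(&g->cv, &g->lock);
		g->owned = true;
		g->owner = pthread_self();
		g->depth = 1;
	}
	pthread_mutex_unlock(&g->lock);
}

static void
gate_exit(struct load_gate *g)
{
	pthread_mutex_lock(&g->lock);
	assert(g->owned && pthread_equal(g->owner, pthread_self()));
	if (--g->depth == 0) {
		/* Outermost exit: release the gate and wake any waiters. */
		g->owned = false;
		pthread_cond_broadcast(&g->cv);
	}
	pthread_mutex_unlock(&g->lock);
}

/*
 * Models a module whose initialization loads "nested" further modules
 * from the same thread. Returns the number of modules loaded.
 */
static int
load_module(struct load_gate *g, int nested)
{
	int loaded = 1;

	gate_enter(g);
	if (nested > 0)
		loaded += load_module(g, nested - 1);
	gate_exit(g);
	return (loaded);
}
```

With a gate that does not track its owner, the nested call in load_module would wait on itself, which matches the hang reported here. In this model, calling load_module(&g, 2) on a gate set up with gate_init returns 3 and leaves the gate released.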
Comment 12 commit-hook freebsd_committer 2021-12-05 01:03:38 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=2c52eba4f46e2cc9a4fda3a9e6e81e06fb8daf57

commit 2c52eba4f46e2cc9a4fda3a9e6e81e06fb8daf57
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2021-11-12 19:45:06 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2021-12-05 01:02:57 +0000

    linker_kldload_busy(): allow recursion

    PR:     259748

    (cherry picked from commit 4f924a786ae08af496dfe55230f8fe1e2ca16150)

 sys/kern/kern_linker.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)