Bug 241639 - Fatal trap 12: page fault ... current process = 0 (vmbusdev) when using mlx4en
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2019-11-01 09:42 UTC by Michael
Modified: 2022-10-12 00:49 UTC
CC List: 3 users

See Also:


Attachments
Attachment 1 (Network adapter) (46.44 KB, image/png), 2019-11-01 09:42 UTC, Michael
Attachment 2 (VM setup) (74.71 KB, image/png), 2019-11-01 09:44 UTC, Michael
Attachment 3 (FS) (20.48 KB, image/png), 2019-11-01 09:45 UTC, Michael
Attachment 4 (ZFS) (40.80 KB, image/png), 2019-11-01 09:46 UTC, Michael
Attachment 5 (enable SR-IOV on port) (76.73 KB, image/png), 2019-11-01 09:47 UTC, Michael
Attachment 6 (work condition) (27.78 KB, image/png), 2019-11-01 09:48 UTC, Michael
Attachment 7 (52.38 KB, image/png), 2019-11-01 09:49 UTC, Michael
Attachment 8 (boot-2) (39.65 KB, image/png), 2019-11-01 09:50 UTC, Michael
Attachment 9 (boot-3) (39.55 KB, image/png), 2019-11-01 09:51 UTC, Michael
Attachment 10 (boot-4) (49.04 KB, image/png), 2019-11-01 09:52 UTC, Michael
Attachment 11 (Fatal trap 12) (50.06 KB, image/png), 2019-11-01 09:52 UTC, Michael
VFS PATCH (510 bytes, patch), 2019-11-25 09:23 UTC, Hans Petter Selasky
Kernel Linker patch (697 bytes, patch), 2019-11-25 10:27 UTC, Hans Petter Selasky
screenshot of fatal trap 12 (105.46 KB, image/png), 2019-11-27 10:15 UTC, Michael
verbose output of boot process, and, fatal trap 12 (44.23 KB, text/plain), 2019-11-27 10:16 UTC, Michael
verbose output of boot process with printf TD_PROC & P_FD (44.16 KB, text/plain), 2019-11-27 14:05 UTC, Michael
Turn ON SR-IOV on working VM (93.31 KB, image/jpeg), 2019-11-27 15:07 UTC, Michael
make + patch (35.58 KB, image/jpeg), 2019-11-27 15:10 UTC, Michael
verbose output with working patch (44.58 KB, text/plain), 2019-11-27 15:27 UTC, Michael
verbose output when 'chekpoint' have been made (4.18 KB, text/plain), 2019-11-27 15:28 UTC, Michael
verbose boot log with 'patch' and precompiled mlx4 (44.71 KB, text/plain), 2019-11-27 20:46 UTC, Michael
verbose log 'chekpoint' maded (4.33 KB, text/plain), 2019-11-27 20:47 UTC, Michael

Description Michael 2019-11-01 09:42:26 UTC
Created attachment 208757
Attachment 1 (Network adapter)

Fatal trap 12: page fault ... current process = 0 (vmbusdev)

This applies to these OS versions:
FreeBSD-11.2 RELEASE amd64
FreeBSD-11.3 RELEASE amd64
FreeBSD-11.3 STABLE amd64 20191025 r354051
FreeBSD-12.0 RELEASE amd64
FreeBSD-12.1 RC1 amd64
FreeBSD-12.1 RC2 amd64
FreeBSD-12.1 STABLE amd64 20191025 r354051
FreeBSD-13.0 CURRENT amd64 20191025 r354057

The bug behaves identically on all of these versions.

Hypervisor: Windows Server 2019 Hyper-V (all updates and drivers are the latest).
VM: Hyper-V generation V2.
Network adapter: Mellanox ConnectX-3 EN 10G, SR-IOV enabled (Attachment 1).

All guest OSs were installed under the same conditions (Attachment 2):
mem: 6GB, HDD: 127GB;
FS: Auto (ZFS) Guided Root-on-ZFS (Attachment 3);
ZFS: GPT (BIOS+UEFI) (Attachment 4);
LAN: one HV-V2 network adapter (Attachment 2).

When we enable SR-IOV on the port (Attachment 5)
and load the Mellanox driver in the OS
  root@frw03v2:~ # kldload mlx4en
everything works as it should.
The network connection status is OK (SR-IOV support active) (Attachment 6).
Then add the following to /boot/loader.conf to load the ConnectX-3 driver automatically
  mlx4en_load="YES"
and reboot the guest OS.
Result: a fatal trap during the loading stage (Attachments 7-11).

As a temporary workaround,
the pfSense community suggested adding this
to /boot/loader.conf:
  kern.cam.boot_delay="10000"
Indeed, with this setting all of the FreeBSD versions listed above boot OK,
but what guarantees that this trick
keeps the system stable under load?
Moreover, the trick stops working when
kern.cam.boot_delay is less than 1000.
For example, with kern.cam.boot_delay="500"
the system hits the same fatal trap at boot.
Comment 1 Michael 2019-11-01 09:44:25 UTC
Created attachment 208758
Attachment 2 (VM setup)
Comment 2 Michael 2019-11-01 09:45:22 UTC
Created attachment 208759
Attachment 3 (FS)
Comment 3 Michael 2019-11-01 09:46:19 UTC
Created attachment 208760
Attachment 4 (ZFS)
Comment 4 Michael 2019-11-01 09:47:21 UTC
Created attachment 208761
Attachment 5 (enable SR-IOV on port)
Comment 5 Michael 2019-11-01 09:48:43 UTC
Created attachment 208762
Attachment 6 (work condition)
Comment 6 Michael 2019-11-01 09:49:37 UTC
Created attachment 208763
Attachment 7
Comment 7 Michael 2019-11-01 09:50:42 UTC
Created attachment 208764
Attachment 8 (boot-2)
Comment 8 Michael 2019-11-01 09:51:08 UTC
Created attachment 208765
Attachment 9 (boot-3)
Comment 9 Michael 2019-11-01 09:52:18 UTC
Created attachment 208766
Attachment 10 (boot-4)
Comment 10 Michael 2019-11-01 09:52:51 UTC
Created attachment 208767
Attachment 11 (Fatal trap 12)
Comment 11 Michael 2019-11-23 19:01:10 UTC
After commit 4a46b2449c63e010014dc0fb2a3caa5e20b97933 was applied, the kern.cam.boot_delay="10000" setting in /boot/loader.conf stopped working as a workaround. Catastrophe! Now I have to load mlx4en.ko from the firewall rules
(by adding "kldload mlx4en" at the end of /etc/rc.firewall).

Please fix this.

committer	mav <mav@FreeBSD.org>	
Fri, 22 Nov 2019 20:39:51 +0200 (18:39 +0000)
commit	4a46b2449c63e010014dc0fb2a3caa5e20b97933
Make CAM use root_mount_hold_token() to delay boot.

Before this change CAM used config_intrhook_establish() for this purpose,
but that approach does not allow to delay it again after releasing once.

USB stack uses root_mount_hold() to delay boot until bus scan is complete.
But once it is, CAM had no time to scan SCSI bus, registered by umass(4),
if it already done other scans and called config_intrhook_disestablish().
The new approach makes it work smooth, assuming the USB device is found
during the initial bus scan.  Devices appearing on USB bus later may still
require setting kern.cam.boot_delay, but hopefully those are minority.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Comment 12 Alexander Motin freebsd_committer freebsd_triage 2019-11-23 19:57:14 UTC
As I answered in a private email to the originator, my commit mentioned above changed the implementation of kern.cam.boot_delay within its original semantics.  The fact that it can no longer be used as a workaround is unfortunate, but it does not mean the commit is wrong.  It is the original problem that needs to be diagnosed, not the workaround.

Hans, any idea what is going wrong with the mlx4 driver here?
Comment 13 Hans Petter Selasky freebsd_committer freebsd_triage 2019-11-25 09:23:23 UTC
Created attachment 209403
VFS PATCH

(In reply to Alexander Motin from comment #12)

Can you try this patch?

It is not a bug in mlx4en :-)
Comment 14 Hans Petter Selasky freebsd_committer freebsd_triage 2019-11-25 09:25:39 UTC
kib: Can you quickly look at my VFS patch?
Comment 15 Konstantin Belousov freebsd_committer freebsd_triage 2019-11-25 09:51:55 UTC
(In reply to Hans Petter Selasky from comment #14)
vfs_lookup() is not an appropriate place to do this.

I think that kern_kldload() is a much better place to put the check.
Comment 16 Konstantin Belousov freebsd_committer freebsd_triage 2019-11-25 09:53:11 UTC
(In reply to Konstantin Belousov from comment #15)
And there, you would check rootvnode != NULL.
Comment 17 Hans Petter Selasky freebsd_committer freebsd_triage 2019-11-25 10:27:40 UTC
Created attachment 209407
Kernel Linker patch

Can you try this patch?
Comment 18 Konstantin Belousov freebsd_committer freebsd_triage 2019-11-25 14:10:49 UTC
(In reply to Hans Petter Selasky from comment #17)
Note that in the 'else' branch around your patch, there is already a check for rootvnode.
Comment 19 Hans Petter Selasky freebsd_committer freebsd_triage 2019-11-25 14:20:49 UTC
Yes, but that else check is skipped if the check before is true.

--HPS
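
The point made in comments #18 and #19 can be pictured with a small user-space model. This is only a schematic, not the code in sys/kern/kern_linker.c: the function model_load(), its two parameters, the variable model_rootvnode, and the ENXIO return value are all assumptions made for illustration. It shows that a NULL check placed only in the else branch never runs when the first condition is true.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's rootvnode; NULL = root not mounted. */
static const void *model_rootvnode = NULL;

/*
 * Schematic only. 'first_check' stands for "the check before is true";
 * 'patched' selects whether the first branch also tests model_rootvnode.
 * Returns 0 if the modelled load would go on to use the file system,
 * ENXIO (chosen for illustration) if it is refused.
 */
static int
model_load(bool first_check, bool patched)
{
	if (first_check) {
		if (patched && model_rootvnode == NULL)
			return (ENXIO);
		return (0);
	} else {
		/* The pre-existing check lives only in this branch. */
		if (model_rootvnode == NULL)
			return (ENXIO);
		return (0);
	}
}

/* Walks the three cases; call it from main() to try the model. */
void
demo_branch_skip(void)
{
	assert(model_load(false, false) == ENXIO); /* else branch is guarded */
	assert(model_load(true, false) == 0);      /* guard skipped */
	assert(model_load(true, true) == ENXIO);   /* guard in both branches */
}
```

In this model the middle case is the problematic one: the load proceeds even though the root pointer is still NULL.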
Comment 20 Konstantin Belousov freebsd_committer freebsd_triage 2019-11-25 14:33:33 UTC
(In reply to Hans Petter Selasky from comment #19)
Put the patch into phab.
Comment 21 Hans Petter Selasky freebsd_committer freebsd_triage 2019-11-25 14:38:36 UTC
https://reviews.freebsd.org/D22545
Comment 22 commit-hook freebsd_committer freebsd_triage 2019-11-26 12:21:03 UTC
A commit references this bug:

Author: hselasky
Date: Tue Nov 26 12:20:44 UTC 2019
New revision: 355108
URL: https://svnweb.freebsd.org/changeset/base/355108

Log:
  Fix panic when loading kernel modules before root file system is mounted.
  Make sure the rootvnode is always NULL checked.

  Differential Revision:	https://reviews.freebsd.org/D22545
  PR:		241639
  MFC after:	1 week
  Sponsored by:	Mellanox Technologies

Changes:
  head/sys/kern/kern_linker.c
Comment 23 Michael 2019-11-26 21:01:43 UTC
Unfortunately, this patch did not help. The boot still ends with "Fatal trap 12: page fault while in kernel mode".
Comment 24 Konstantin Belousov freebsd_committer freebsd_triage 2019-11-26 21:46:10 UTC
(In reply to Michael from comment #23)
Try this.

diff --git a/sys/kern/kern_linker.c b/sys/kern/kern_linker.c
index 6dc21886066..89b575b0ab7 100644
--- a/sys/kern/kern_linker.c
+++ b/sys/kern/kern_linker.c
@@ -1066,6 +1066,9 @@ kern_kldload(struct thread *td, const char *file, int *fileid)
 	if ((error = priv_check(td, PRIV_KLD_LOAD)) != 0)
 		return (error);
 
+	if (td->td_proc->p_fd == NULL)
+		return (EINVAL);
+
 	/*
 	 * It is possible that kldloaded module will attach a new ifnet,
 	 * so vnet context must be set when this ocurs.
Comment 25 Hans Petter Selasky freebsd_committer freebsd_triage 2019-11-26 21:48:06 UTC
Please double check patches are applied and not rejected.

--HPS
Comment 26 Michael 2019-11-27 10:14:10 UTC
To clarify when the fatal trap appears:
When SR-IOV is off, the VM boots normally.
If you boot the VM with SR-IOV turned on for the port, a fatal trap occurs.
I am attaching a screenshot and a verbose message log as a text file.
Comment 27 Michael 2019-11-27 10:15:29 UTC
Created attachment 209474
screenshot of fatal trap 12
Comment 28 Michael 2019-11-27 10:16:49 UTC
Created attachment 209475
verbose output of boot process, and, fatal trap 12
Comment 29 Michael 2019-11-27 10:27:04 UTC
(In reply to Konstantin Belousov from comment #24)
This patch did not help either. The fatal trap is exactly the same as without the patch.
Comment 30 Hans Petter Selasky freebsd_committer freebsd_triage 2019-11-27 10:34:11 UTC
Can you show the patch you tried as a diff in your source tree?

It is very strange that neither of the patches works.

--HPS
Comment 31 Michael 2019-11-27 11:40:27 UTC
Only the patch from comment #24 was applied and the GENERIC kernel was compiled. No other changes were made to the source code.
The kernel was compiled with these options:

include         GENERIC-NODEBUG
ident           TEST-MASTER
options         SC_HISTORY_SIZE=8000
nooptions       USB_DEBUG
options         MSGMNB=8192
options         MSGMNI=40
options         MSGSEG=512
options         MSGSSZ=32
options         MSGTQL=2048
options         ROUTETABLES=122
options         IPSEC
options         TCP_SIGNATURE
device          enc
options         IPFIREWALL
options         IPFIREWALL_VERBOSE
options         IPFIREWALL_VERBOSE_LIMIT=5000
options         IPFIREWALL_NAT
options         IPDIVERT
options         DUMMYNET
device          pf
device          pflog
device          pfsync
device          cpuctl
options         LIBALIAS
options         COMPAT_LINUXKPI
options         NETGRAPH
options         NETGRAPH_ASYNC
options         NETGRAPH_BPF
options         NETGRAPH_BRIDGE
options         NETGRAPH_CAR
options         NETGRAPH_CISCO
options         NETGRAPH_DEFLATE
options         NETGRAPH_ECHO
options         NETGRAPH_EIFACE
options         NETGRAPH_ETHER
options         NETGRAPH_IFACE
options         NETGRAPH_IPFW
options         NETGRAPH_FRAME_RELAY
options         NETGRAPH_HOLE
options         NETGRAPH_KSOCKET
options         NETGRAPH_L2TP
options         NETGRAPH_LMI
options         NETGRAPH_MPPC_ENCRYPTION
options         NETGRAPH_NAT
options         NETGRAPH_NETFLOW
options         NETGRAPH_ONE2MANY
options         NETGRAPH_PIPE
options         NETGRAPH_PPP
options         NETGRAPH_PPPOE
options         NETGRAPH_PPTPGRE
options         NETGRAPH_RFC1490
options         NETGRAPH_PRED1
options         NETGRAPH_SOCKET
options         NETGRAPH_SPLIT
options         NETGRAPH_TEE
options         NETGRAPH_TCPMSS
options         NETGRAPH_TTY
options         NETGRAPH_VJC
options         NETGRAPH_VLAN
options         NETGRAPH_UI
Comment 32 Michael 2019-11-27 11:56:49 UTC
The same happens on a clean system (installed from scratch) with only the GENERIC kernel:
git clone git://github.com/freebsd/freebsd.git --progress -v --single-branch -b master /usr/src
make cleanworld && make cleandir && make -j12 buildworld && make -j12 buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
make installworld
mergemaster -Ui
mlx4en_load="YES" -> /boot/loader.conf
reboot

... fatal trap
Comment 33 Michael 2019-11-27 12:12:13 UTC
The same fatal trap was also seen when taking a checkpoint of the virtual machine. Not always: roughly once in fifty attempts.
Checkpoints as described here: https://www.nakivo.com/blog/need-know-hyper-v-checkpoints/
Comment 34 Hans Petter Selasky freebsd_committer freebsd_triage 2019-11-27 13:21:32 UTC
Hi,

Can you try this patch and get those prints from dmesg?

Maybe some memory is not zero-initialized ...

--HPS

diff --git a/sys/kern/kern_linker.c b/sys/kern/kern_linker.c
index 6dc21886066..89b575b0ab7 100644
--- a/sys/kern/kern_linker.c
+++ b/sys/kern/kern_linker.c
@@ -1066,6 +1066,9 @@ kern_kldload(struct thread *td, const char *file, int *fileid)
 	if ((error = priv_check(td, PRIV_KLD_LOAD)) != 0)
 		return (error);
 
+	printf("TD_PROC=%p\n", td->td_proc);
+	printf("P_FD=%p\n", td->td_proc->p_fd);
+
 	/*
 	 * It is possible that kldloaded module will attach a new ifnet,
 	 * so vnet context must be set when this ocurs.
Comment 35 Michael 2019-11-27 14:05:01 UTC
Created attachment 209482
verbose output of boot process with printf TD_PROC & P_FD
Comment 36 Konstantin Belousov freebsd_committer freebsd_triage 2019-11-27 14:22:39 UTC
(In reply to Michael from comment #35)
Sigh, try this

diff --git a/sys/kern/kern_linker.c b/sys/kern/kern_linker.c
index 6dc21886066..ed6a8f793ea 100644
--- a/sys/kern/kern_linker.c
+++ b/sys/kern/kern_linker.c
@@ -1066,6 +1066,9 @@ kern_kldload(struct thread *td, const char *file, int *fileid)
 	if ((error = priv_check(td, PRIV_KLD_LOAD)) != 0)
 		return (error);
 
+	if (td->td_proc->p_fd->fd_rdir == NULL)
+		return (EINVAL);
+
 	/*
 	 * It is possible that kldloaded module will attach a new ifnet,
 	 * so vnet context must be set when this ocurs.
Comment 37 Michael 2019-11-27 15:02:10 UTC
(In reply to Konstantin Belousov from comment #36)
Yes! It works!
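
One plausible reading of why the check from comment #24 did not help while the one from comment #36 did: at that point in boot the process already has a file descriptor table, but the table's root directory is still unset. The user-space sketch below models that assumed state. The model_* structures are simplified stand-ins, not the kernel definitions; only the field names td_proc, p_fd and fd_rdir are taken from the patches above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Minimal stand-ins; the real kernel structures have many more fields. */
struct model_filedesc { const void *fd_rdir; };
struct model_proc     { struct model_filedesc *p_fd; };
struct model_thread   { struct model_proc *td_proc; };

/* Guard in the style of comment #24: tests only the table pointer. */
static int
guard_p_fd(const struct model_thread *td)
{
	if (td->td_proc->p_fd == NULL)
		return (EINVAL);
	return (0);
}

/* Guard in the style of comment #36: tests the root directory itself. */
static int
guard_fd_rdir(const struct model_thread *td)
{
	if (td->td_proc->p_fd->fd_rdir == NULL)
		return (EINVAL);
	return (0);
}

/* Call from main() to try the model. */
void
demo_guards(void)
{
	/* Assumed early-boot state: table exists, root directory unset. */
	struct model_filedesc fd = { NULL };
	struct model_proc p = { &fd };
	struct model_thread td = { &p };

	assert(guard_p_fd(&td) == 0);         /* first guard lets it through */
	assert(guard_fd_rdir(&td) == EINVAL); /* second guard refuses */

	fd.fd_rdir = &fd;                     /* any non-NULL value: root set */
	assert(guard_fd_rdir(&td) == 0);
}
```

Under this assumption the first guard never fires, which would match the report that the trap was unchanged with the comment #24 patch.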
Comment 38 Michael 2019-11-27 15:06:15 UTC
First, I turned on SR-IOV on the port and took a screenshot
Comment 39 Michael 2019-11-27 15:07:24 UTC
Created attachment 209483
Turn ON SR-IOV on working VM
Comment 40 Michael 2019-11-27 15:09:16 UTC
Then I recompiled the kernel with these changes
Comment 41 Michael 2019-11-27 15:10:11 UTC
Created attachment 209485
make + patch
Comment 42 Michael 2019-11-27 15:27:07 UTC
Created attachment 209486
verbose output with working patch
Comment 43 Konstantin Belousov freebsd_committer freebsd_triage 2019-11-27 15:28:01 UTC
https://reviews.freebsd.org/D22571 should be the committable fix.  Please retest with it.
Comment 44 Michael 2019-11-27 15:28:32 UTC
Created attachment 209487
verbose output when 'chekpoint' have been made
Comment 45 Michael 2019-11-27 18:37:17 UTC
(In reply to Konstantin Belousov from comment #43)
Yes, https://reviews.freebsd.org/D22571 also works.
I made the changes in sys/kern/kern_linker.c and sys/kern/subr_firmware.c.
The OS boots and works OK!
Comment 46 Michael 2019-11-27 20:44:19 UTC
https://reviews.freebsd.org/D22571 with a GENERIC kernel built with these additional lines
device          xz
device          mlxfw
device          firmware
device          mlx4
device          mlx4en
device          mlx5
device          mlx5en
also works OK!
Verbose boot log attached as messages.3.txt
Verbose dmesg log from taking a VM 'checkpoint' attached as messages.4.txt
Comment 47 Michael 2019-11-27 20:46:07 UTC
Created attachment 209493
verbose boot log with 'patch' and precompiled mlx4
Comment 48 Michael 2019-11-27 20:47:45 UTC
Created attachment 209494
verbose log 'chekpoint' maded
Comment 49 Michael 2019-11-28 05:35:54 UTC
If someone could help fix the following bugs
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=238095
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236042
we would get a wonderful super-router for cloud infrastructure.
Comment 50 commit-hook freebsd_committer freebsd_triage 2019-11-28 08:47:53 UTC
A commit references this bug:

Author: hselasky
Date: Thu Nov 28 08:47:36 UTC 2019
New revision: 355170
URL: https://svnweb.freebsd.org/changeset/base/355170

Log:
  Factor out check for mounted root file system.

  Differential Revision:	https://reviews.freebsd.org/D22571
  PR:		241639
  MFC after:	1 week
  Sponsored by:	Mellanox Technologies

Changes:
  head/sys/kern/kern_linker.c
  head/sys/kern/subr_firmware.c
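
The commit log says the root-mounted check was factored out, and the change touches both kern_linker.c and subr_firmware.c. A minimal user-space sketch of that idea, assuming one shared predicate called by two loaders. The names model_root_mounted(), model_module_load() and model_firmware_load(), the combined condition, and the ENXIO return value are hypothetical and are not taken from the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical shared predicate: in this model root counts as mounted
 * only when both the global root pointer and the caller's root
 * directory are set.
 */
static bool
model_root_mounted(const void *rootvnode, const void *fd_rdir)
{
	return (rootvnode != NULL && fd_rdir != NULL);
}

/* Hypothetical module loader: refuses to touch files before root exists. */
static int
model_module_load(const void *rootvnode, const void *fd_rdir)
{
	if (!model_root_mounted(rootvnode, fd_rdir))
		return (ENXIO);
	return (0);
}

/* Hypothetical firmware loader: reuses the same predicate. */
static int
model_firmware_load(const void *rootvnode, const void *fd_rdir)
{
	if (!model_root_mounted(rootvnode, fd_rdir))
		return (ENXIO);
	return (0);
}

/* Call from main() to try the model. */
void
demo_shared_check(void)
{
	int root;	/* its address serves as an arbitrary non-NULL pointer */

	assert(model_module_load(NULL, NULL) == ENXIO);
	assert(model_firmware_load(&root, NULL) == ENXIO);
	assert(model_module_load(&root, &root) == 0);
	assert(model_firmware_load(&root, &root) == 0);
}
```

Keeping the test in one place means both callers agree on what "root is mounted" means, so a later correction to the condition only has to be made once.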
Comment 51 Hans Petter Selasky freebsd_committer freebsd_triage 2019-11-28 08:54:31 UTC
Let me know if this is still an issue.

Thank you!
Comment 52 commit-hook freebsd_committer freebsd_triage 2019-12-05 14:51:15 UTC
A commit references this bug:

Author: hselasky
Date: Thu Dec  5 14:50:46 UTC 2019
New revision: 355417
URL: https://svnweb.freebsd.org/changeset/base/355417

Log:
  MFC r355108 and r355170:
  Fix panic when loading kernel modules before root file system is mounted.
  Make sure the rootvnode is always NULL checked.

  Differential Revision:	https://reviews.freebsd.org/D22545
  PR:		241639
  Sponsored by:	Mellanox Technologies

Changes:
_U  stable/12/
  stable/12/sys/kern/kern_linker.c
  stable/12/sys/kern/subr_firmware.c
Comment 53 commit-hook freebsd_committer freebsd_triage 2019-12-05 14:52:17 UTC
A commit references this bug:

Author: hselasky
Date: Thu Dec  5 14:52:07 UTC 2019
New revision: 355418
URL: https://svnweb.freebsd.org/changeset/base/355418

Log:
  MFC r355108 and r355170:
  Fix panic when loading kernel modules before root file system is mounted.
  Make sure the rootvnode is always NULL checked.

  Differential Revision:	https://reviews.freebsd.org/D22545
  PR:		241639
  Sponsored by:	Mellanox Technologies

Changes:
_U  stable/11/
  stable/11/sys/kern/kern_linker.c
  stable/11/sys/kern/subr_firmware.c
Comment 54 commit-hook freebsd_committer freebsd_triage 2019-12-05 14:54:19 UTC
A commit references this bug:

Author: hselasky
Date: Thu Dec  5 14:53:47 UTC 2019
New revision: 355419
URL: https://svnweb.freebsd.org/changeset/base/355419

Log:
  MFC r355108 and r355170:
  Fix panic when loading kernel modules before root file system is mounted.
  Make sure the rootvnode is always NULL checked.

  Differential Revision:	https://reviews.freebsd.org/D22545
  PR:		241639
  Sponsored by:	Mellanox Technologies

Changes:
_U  stable/10/
  stable/10/sys/kern/kern_linker.c
  stable/10/sys/kern/subr_firmware.c