Created attachment 208757 [details]
Attachment 1 (Network adapter)

Fatal trap 12: page fault
...
current process = 0 (vmbusdev)

This applies to these OS versions:
FreeBSD-11.2 RELEASE amd64
FreeBSD-11.3 RELEASE amd64
FreeBSD-11.3 STABLE amd64 20191025 r354051
FreeBSD-12.0 RELEASE amd64
FreeBSD-12.1 RC1 amd64
FreeBSD-12.1 RC2 amd64
FreeBSD-12.1 STABLE amd64 20191025 r354051
FreeBSD-13.0 CURRENT amd64 20191025 r354057
The bug is the same on all of them.

Hypervisor: Windows Server 2019 Hyper-V (all updates and drivers are the latest).
VM: Hyper-V generation 2.
Network adapter: Mellanox ConnectX-3 EN 10G, SR-IOV enabled (Attachment 1).

All VM OSs were installed under the same conditions (Attachment 2):
mem: 6GB, HDD: 127GB;
FS: Auto (ZFS) Guided Root-on-ZFS (Attachment 3);
ZFS: GPT (BIOS+UEFI) (Attachment 4);
LAN: one HV-V2 network adapter (Attachment 2);

When we enable SR-IOV on the port (Attachment 5) and load the Mellanox driver in the OS

root@frw03v2:~ # kldload mlx4en

everything works as it should. The status of the network connection is OK (SR-IOV support active) (Attachment 6).

To load the ConnectX-3 driver automatically, add this line to /boot/loader.conf

mlx4en_load="YES"

and reboot the VM OS. Result: a fatal trap during boot (Attachments 7-11).

As a temporary workaround, the pfSense community suggested adding this to /boot/loader.conf:

kern.cam.boot_delay="10000"

Indeed, with kern.cam.boot_delay="10000" in /boot/loader.conf all of the FreeBSD versions above boot OK, but where is the guarantee that this trick keeps the system stable under load? Moreover, the trick stops working when kern.cam.boot_delay is less than 1000. For example, with kern.cam.boot_delay="500" the system hits the same fatal trap at boot.
Created attachment 208758 [details] Attachment 2 [details] (VM setup)
Created attachment 208759 [details] Attachment 3 [details] (FS)
Created attachment 208760 [details] Attachment 4 [details] (ZFS)
Created attachment 208761 [details] Attachment 5 [details] (enable SR-IOV on port)
Created attachment 208762 [details] Attachment 6 [details] (work condition)
Created attachment 208763 [details] Attachment 7 [details]
Created attachment 208764 [details] Attachment 8 [details] (boot-2)
Created attachment 208765 [details] Attachment 9 [details] (boot-3)
Created attachment 208766 [details] Attachment 10 [details] (boot-4)
Created attachment 208767 [details] Attachment 11 [details] (Fatal trap 12)
After applying commit 4a46b2449c63e010014dc0fb2a3caa5e20b97933, the kern.cam.boot_delay="10000" parameter in /boot/loader.conf stopped working. Catastrophe! Now I have to load mlx4en.ko from the firewall rules (by adding "kldload mlx4en" at the end of /etc/rc.firewall)! Please correct the situation.

committer mav <mav@FreeBSD.org>
Fri, 22 Nov 2019 20:39:51 +0200 (18:39 +0000)
commit 4a46b2449c63e010014dc0fb2a3caa5e20b97933

Make CAM use root_mount_hold_token() to delay boot.

Before this change CAM used config_intrhook_establish() for this purpose, but that approach does not allow to delay it again after releasing once.

USB stack uses root_mount_hold() to delay boot until bus scan is complete. But once it is, CAM had no time to scan SCSI bus, registered by umass(4), if it already done other scans and called config_intrhook_disestablish(). The new approach makes it work smooth, assuming the USB device is found during the initial bus scan. Devices appearing on USB bus later may still require setting kern.cam.boot_delay, but hopefully those are minority.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
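The difference the commit message describes can be pictured with a small userland model. This is only a sketch under assumptions: the names hold_list, hold_take, hold_release and boot_may_proceed are hypothetical and are not the kernel's root_mount_hold_token() interface. It merely illustrates that a counted hold can be taken again after it has been released once, which is what the commit message says the one-shot config_intrhook approach did not allow.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userland model; not the kernel's root mount hold API. */
struct hold_list {
	int active;	/* number of holds currently delaying the root mount */
};

/* Take a hold: boot has to wait until it is released again. */
void
hold_take(struct hold_list *hl)
{
	hl->active++;
}

/* Release a previously taken hold. */
void
hold_release(struct hold_list *hl)
{
	assert(hl->active > 0);
	hl->active--;
}

/* In this model the root file system may be mounted only when no holds remain. */
bool
boot_may_proceed(const struct hold_list *hl)
{
	return (hl->active == 0);
}
```

In this model a subsystem that has already released its hold can call hold_take() again when a new bus appears, and boot_may_proceed() reports false until the second release.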
As I answered the originator by private email, my commit mentioned above changed the implementation of kern.cam.boot_delay within its original semantics. The fact that it can no longer be used as a workaround is unfortunate, but that does not make the change wrong. It is the original problem that needs to be diagnosed, not the workaround. Hans, any idea what is going wrong with the mlx4 driver here?
Created attachment 209403 [details] VFS PATCH (In reply to Alexander Motin from comment #12) Can you try this patch? It is not a bug in mlx4en :-)
kib: Can you quickly look at my VFS patch?
(In reply to Hans Petter Selasky from comment #14) vfs_lookup() is not the appropriate place to do this. I think that kern_kldload() is a much better place to put the check.
(In reply to Konstantin Belousov from comment #15) And there, you would check rootvnode != NULL.
Created attachment 209407 [details] Kernel Linker patch Can you try this patch?
(In reply to Hans Petter Selasky from comment #17) Note that in the 'else' branch around your patch, there is already a check for rootvnode.
Yes, but that else check is skipped if the check before is true. --HPS
(In reply to Hans Petter Selasky from comment #19) Put the patch into phab.
https://reviews.freebsd.org/D22545
A commit references this bug:

Author: hselasky
Date: Tue Nov 26 12:20:44 UTC 2019
New revision: 355108
URL: https://svnweb.freebsd.org/changeset/base/355108

Log:
  Fix panic when loading kernel modules before root file system is mounted.
  Make sure the rootvnode is always NULL checked.

  Differential Revision: https://reviews.freebsd.org/D22545
  PR: 241639
  MFC after: 1 week
  Sponsored by: Mellanox Technologies

Changes:
  head/sys/kern/kern_linker.c
Unfortunately this patch did not help. Fatal trap 12: page fault while in kernel mode.
(In reply to Michael from comment #23)
Try this.

diff --git a/sys/kern/kern_linker.c b/sys/kern/kern_linker.c
index 6dc21886066..89b575b0ab7 100644
--- a/sys/kern/kern_linker.c
+++ b/sys/kern/kern_linker.c
@@ -1066,6 +1066,9 @@ kern_kldload(struct thread *td, const char *file, int *fileid)
 	if ((error = priv_check(td, PRIV_KLD_LOAD)) != 0)
 		return (error);
 
+	if (td->td_proc->p_fd == NULL)
+		return (EINVAL);
+
 	/*
 	 * It is possible that kldloaded module will attach a new ifnet,
 	 * so vnet context must be set when this ocurs.
Please double check patches are applied and not rejected. --HPS
To clarify when the fatal trap appears: with SR-IOV off, the VM boots normally. If the VM is booted with SR-IOV turned on for the port, a fatal trap occurs. I am attaching a screenshot and a verbose message log as a text file.
Created attachment 209474 [details] screenshot of fatal trap 12
Created attachment 209475 [details] verbose output of boot process, and, fatal trap 12
(In reply to Konstantin Belousov from comment #24) This patch did not help either. The fatal trap is exactly the same as without the patch.
Can you show the patch you tried as a diff in your source tree? It is very strange that neither of the patches works. --HPS
Only the patch from comment #24 was applied, and the GENERIC kernel was compiled. No other changes were made to the source code.

The kernel is compiled with these options:

include GENERIC-NODEBUG
ident TEST-MASTER
options SC_HISTORY_SIZE=8000
nooptions USB_DEBUG
options MSGMNB=8192
options MSGMNI=40
options MSGSEG=512
options MSGSSZ=32
options MSGTQL=2048
options ROUTETABLES=122
options IPSEC
options TCP_SIGNATURE
device enc
options IPFIREWALL
options IPFIREWALL_VERBOSE
options IPFIREWALL_VERBOSE_LIMIT=5000
options IPFIREWALL_NAT
options IPDIVERT
options DUMMYNET
device pf
device pflog
device pfsync
device cpuctl
options LIBALIAS
options COMPAT_LINUXKPI
options NETGRAPH
options NETGRAPH_ASYNC
options NETGRAPH_BPF
options NETGRAPH_BRIDGE
options NETGRAPH_CAR
options NETGRAPH_CISCO
options NETGRAPH_DEFLATE
options NETGRAPH_ECHO
options NETGRAPH_EIFACE
options NETGRAPH_ETHER
options NETGRAPH_IFACE
options NETGRAPH_IPFW
options NETGRAPH_FRAME_RELAY
options NETGRAPH_HOLE
options NETGRAPH_KSOCKET
options NETGRAPH_L2TP
options NETGRAPH_LMI
options NETGRAPH_MPPC_ENCRYPTION
options NETGRAPH_NAT
options NETGRAPH_NETFLOW
options NETGRAPH_ONE2MANY
options NETGRAPH_PIPE
options NETGRAPH_PPP
options NETGRAPH_PPPOE
options NETGRAPH_PPTPGRE
options NETGRAPH_RFC1490
options NETGRAPH_PRED1
options NETGRAPH_SOCKET
options NETGRAPH_SPLIT
options NETGRAPH_TEE
options NETGRAPH_TCPMSS
options NETGRAPH_TTY
options NETGRAPH_VJC
options NETGRAPH_VLAN
options NETGRAPH_UI
With a clean system (installed from scratch) and only the GENERIC kernel:

git clone git://github.com/freebsd/freebsd.git --progress -v --single-branch -b master /usr/src
make cleanworld && make cleandir && make -j12 buildworld && make -j12 buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
make installworld
mergemaster -Ui
mlx4en_load="YES" -> /boot/loader.conf
reboot
...
fatal trap
The same fatal trap was also seen when taking a checkpoint of the virtual machine. It does not happen every time, roughly once in fifty attempts. The checkpoints are of the kind described here: https://www.nakivo.com/blog/need-know-hyper-v-checkpoints/
Hi,

Can you try this patch and get the printed values from dmesg? Maybe some memory is not zero-initialized ...

--HPS

diff --git a/sys/kern/kern_linker.c b/sys/kern/kern_linker.c
index 6dc21886066..89b575b0ab7 100644
--- a/sys/kern/kern_linker.c
+++ b/sys/kern/kern_linker.c
@@ -1066,6 +1066,9 @@ kern_kldload(struct thread *td, const char *file, int *fileid)
 	if ((error = priv_check(td, PRIV_KLD_LOAD)) != 0)
 		return (error);
 
+	printf("TD_PROC=%p\n", td->td_proc);
+	printf("P_FD=%p\n", td->td_proc->p_fd);
+
 	/*
 	 * It is possible that kldloaded module will attach a new ifnet,
 	 * so vnet context must be set when this ocurs.
Created attachment 209482 [details] verbose output of boot process with printf TD_PROC & P_FD
(In reply to Michael from comment #35)
Sigh, try this

diff --git a/sys/kern/kern_linker.c b/sys/kern/kern_linker.c
index 6dc21886066..ed6a8f793ea 100644
--- a/sys/kern/kern_linker.c
+++ b/sys/kern/kern_linker.c
@@ -1066,6 +1066,9 @@ kern_kldload(struct thread *td, const char *file, int *fileid)
 	if ((error = priv_check(td, PRIV_KLD_LOAD)) != 0)
 		return (error);
 
+	if (td->td_proc->p_fd->fd_rdir == NULL)
+		return (EINVAL);
+
 	/*
 	 * It is possible that kldloaded module will attach a new ifnet,
 	 * so vnet context must be set when this ocurs.
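What this guard protects against can be shown in a userland model. The structures below are stripped-down, hypothetical stand-ins that only borrow the field names from the diff (td_proc, p_fd, fd_rdir), and model_kldload is a function invented for illustration, not kernel code. The assumption, based on the later commit log ("Fix panic when loading kernel modules before root file system is mounted"), is that a thread running before the root mount has no root directory vnode, so the load is refused with EINVAL instead of following a NULL pointer during path lookup.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for the kernel structures. */
struct vnode {
	int unused;
};

struct filedesc {
	struct vnode *fd_rdir;	/* root directory; NULL before the root mount */
};

struct proc {
	struct filedesc *p_fd;
};

struct thread {
	struct proc *td_proc;
};

/*
 * Model of the guard: return EINVAL while the calling thread has no
 * root directory yet, and 0 once a path lookup would be safe.
 */
int
model_kldload(const struct thread *td)
{
	if (td->td_proc->p_fd->fd_rdir == NULL)
		return (EINVAL);
	/* A real loader would look up and link the module file here. */
	return (0);
}
```

Calling model_kldload() on a thread whose fd_rdir is still NULL returns EINVAL; once fd_rdir points at a vnode the same call returns 0.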
(In reply to Konstantin Belousov from comment #36) Yes! It works!
First, I turned on SR-IOV on the port and took a screenshot.
Created attachment 209483 [details] Turn ON SR-IOV on working VM
Then I recompiled the kernel with these changes.
Created attachment 209485 [details] make + patch
Created attachment 209486 [details] verbose output with working patch
https://reviews.freebsd.org/D22571 should be the committable fix. Please retest with it.
Created attachment 209487 [details] verbose output when 'chekpoint' have been made
(In reply to Konstantin Belousov from comment #43) Yes, https://reviews.freebsd.org/D22571 also works. I made the changes in sys/kern/kern_linker.c and sys/kern/subr_firmware.c. The OS boots and works OK!
https://reviews.freebsd.org/D22571 and a GENERIC kernel with the options

device xz
device mlxfw
device firmware
device mlx4
device mlx4en
device mlx5
device mlx5en

also work OK! The verbose boot log is attached as messages.3.txt. The verbose dmesg log from taking a VM 'checkpoint' is attached as messages.4.txt.
Created attachment 209493 [details] verbose boot log with 'patch' and precompiled mlx4
Created attachment 209494 [details] verbose log 'chekpoint' maded
If someone could help fix the following bugs

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=238095
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236042

we would get a wonderful super-router for cloud infrastructure.
A commit references this bug:

Author: hselasky
Date: Thu Nov 28 08:47:36 UTC 2019
New revision: 355170
URL: https://svnweb.freebsd.org/changeset/base/355170

Log:
  Factor out check for mounted root file system.

  Differential Revision: https://reviews.freebsd.org/D22571
  PR: 241639
  MFC after: 1 week
  Sponsored by: Mellanox Technologies

Changes:
  head/sys/kern/kern_linker.c
  head/sys/kern/subr_firmware.c
Let me know if this is still an issue. Thank you!
A commit references this bug:

Author: hselasky
Date: Thu Dec 5 14:50:46 UTC 2019
New revision: 355417
URL: https://svnweb.freebsd.org/changeset/base/355417

Log:
  MFC r355108 and r355170:
  Fix panic when loading kernel modules before root file system is mounted.
  Make sure the rootvnode is always NULL checked.

  Differential Revision: https://reviews.freebsd.org/D22545
  PR: 241639
  Sponsored by: Mellanox Technologies

Changes:
_U  stable/12/
  stable/12/sys/kern/kern_linker.c
  stable/12/sys/kern/subr_firmware.c
A commit references this bug:

Author: hselasky
Date: Thu Dec 5 14:52:07 UTC 2019
New revision: 355418
URL: https://svnweb.freebsd.org/changeset/base/355418

Log:
  MFC r355108 and r355170:
  Fix panic when loading kernel modules before root file system is mounted.
  Make sure the rootvnode is always NULL checked.

  Differential Revision: https://reviews.freebsd.org/D22545
  PR: 241639
  Sponsored by: Mellanox Technologies

Changes:
_U  stable/11/
  stable/11/sys/kern/kern_linker.c
  stable/11/sys/kern/subr_firmware.c
A commit references this bug:

Author: hselasky
Date: Thu Dec 5 14:53:47 UTC 2019
New revision: 355419
URL: https://svnweb.freebsd.org/changeset/base/355419

Log:
  MFC r355108 and r355170:
  Fix panic when loading kernel modules before root file system is mounted.
  Make sure the rootvnode is always NULL checked.

  Differential Revision: https://reviews.freebsd.org/D22545
  PR: 241639
  Sponsored by: Mellanox Technologies

Changes:
_U  stable/10/
  stable/10/sys/kern/kern_linker.c
  stable/10/sys/kern/subr_firmware.c