Bug 229167

Summary: [Hyper-V] [Jun 19, 2018] Recently FreeBSD VM panics during boot-up, especially with Mellanox VF configured
Product: Base System
Component: kern
Status: New
Severity: Affects Some People
Priority: ---
Version: 11.2-STABLE
Hardware: amd64
OS: Any
Reporter: Dexuan Cui <decui>
Assignee: freebsd-virtualization (Nobody) <virtualization>
CC: decui, mammoottym, pete, scorpionmage, weh, whu
Keywords: regression

Description Dexuan Cui 2018-06-19 21:49:59 UTC
People have recently been seeing several different call-traces with FreeBSD 11.2 and/or the latest CURRENT code; some of those call-traces are mentioned here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220453

#1: panic in namei():
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220453#c5

#2: panic in vmbus_msghc_exec():
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220453#c7
It's reported that FreeBSD-12.0-CURRENT-amd64-20180618-r335317-disc1.iso also reproduces this panic.

#3: No detailed call-trace. It looks like the panic happens just after root is mounted:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220453#c14

But it looks like I can't reproduce the issue any more after I do a fresh build of a kernel at the same version. I'm not sure if this is some kind of weird memory corruption issue.
Comment 1 Dexuan Cui 2018-06-19 21:59:32 UTC
Thanks pete for reporting the issue! No need for you to test UFS, as I think we can reproduce the issue with UFS, and our internal testing team is trying to narrow down the issue by bisecting.

In particular, symptom #2 happens in the Hyper-V vmbus driver, which hasn't been updated for 8 months except for the PTI patch.

I suspect something unusual, like unexpected memory corruption, is happening here.
Comment 2 Pete French 2018-06-19 22:27:27 UTC
This is definitely a better place to discuss this, sorry for hijacking the other thread!

Just so you know, I rebuilt with the latest STABLE, and have just rebooted using
that, loading the module at boot time, and it seems to be panicking again. Unfortunately my serial console has also stopped working on the test machine, so I can't verify that it's the same panic. Will fiddle about and see if I can get that back somehow and let you know.
Comment 3 Dexuan Cui 2018-06-21 03:12:34 UTC
(In reply to pete from comment #2)
Hi pete, can you please help to test the FreeBSD 11.2-RC3 iso (https://download.freebsd.org/ftp/releases/amd64/amd64/ISO-IMAGES/11.2/FreeBSD-11.2-RC3-amd64-dvd1.iso)?

Our test team didn't find the issue with this .iso.

The latest HEAD is indeed broken, and the first bad commit is  https://github.com/freebsd/freebsd/commit/8222f5cb7cdf1ad98c5f281437e35ec2d5da0b96 (or SVN r335068) according to our test team, but I haven't verified it myself.

It looks like Olivier Houchard fixed it with r335182:
https://www.mail-archive.com/svn-src-all@freebsd.org/msg164885.html
Comment 4 Pete French 2018-06-21 22:06:50 UTC
Am not sure how I would set about booting from an ISO in Azure, as I have always done source upgrades, but I can build the latest 11.2 tree, from which that ISO was made, with no options and try that. Will have a go tomorrow - the test machine I upgraded to -STABLE is locking up in Azure a lot, and is currently inaccessible unfortunately.
Comment 5 Dexuan Cui 2018-06-21 22:48:24 UTC
(In reply to pete from comment #4)
Building the latest 11.2 tree should be equivalent to the 11.2 iso.

11.2 will be released very soon. If there is a major issue with it, we must try to fix it ASAP and request an errata kernel update. We've been short of hands to test FreeBSD and fix bugs recently...

According to our test team, r335182 can't fix the panic issue. :-(
Comment 6 Pete French 2018-06-22 10:58:42 UTC
I am building from 11.2 now. Looking at the commits you referenced, the big change was for kernel malloc to return non-executable memory unless requested, yes? If that was causing the panic though, would it not also affect the module when it is loaded after booting, which seems to work fine?

By the way, when I am using this, I should only have config in rc.conf for the mlxen0 device, and not simultaneously for hn0, if I understand this right?

Will let you know the test results in an hour or so hopefully.
Comment 7 Pete French 2018-06-22 12:54:50 UTC
  ______               ____   _____ _____
 |  ____|             |  _ \ / ____|  __ \
 | |___ _ __ ___  ___ | |_) | (___ | |  | |
 |  ___| '__/ _ \/ _ \|  _ < \___ \| |  | |
 | |   | | |  __/  __/| |_) |____) | |__| |
 |_|   |_|  \___|\___||____/|_____/|_____/    ```                        `
                                             s` `.....---.......--.```   -/
 +============Welcome to FreeBSD===========+ +o   .--`         /y:`      +.
 |                                         |  yo`:.            :o      `+-
 |  1. Boot Multi User [Enter]             |   y/               -/`   -o/
 |  2. Boot [S]ingle User                  |  .-                  ::/sy+:.
 |  3. [Esc]ape to loader prompt           |  /                     `--  /
 |  4. Reboot                              | `:                          :`
 |                                         | `:                          :`
 |  Options:                               |  /                          /
 |  5. [K]ernel: kernel (1 of 2)           |  .-                        -.
 |  6. Configure Boot [O]ptions...         |   --                      -.
 |  7. Select Boot [E]nvironment...        |    `:`                  `:`
 |                                         |      .--             `--.
 |                                         |         .---.....----.
 +=========================================+


/boot/kernel/kernel text=0x153d048 data=0x143f70+0x4bc418 syms=[0x8+0x16ad00
+0x8+0x183cac]
/boot/entropy size=0x1000
/boot/kernel/mlx4.ko size 0x6a6e0 at 0x222d000
loading required module 'linuxkpi'
/boot/kernel/linuxkpi.ko size 0x26550 at 0x2298000
/boot/kernel/aesni.ko size 0xaec8 at 0x22bf000
/boot/kernel/cryptodev.ko size 0x7740 at 0x22ca000
/boot/kernel/zfs.ko size 0x37ff00 at 0x22d2000
loading required module 'opensolaris'
/boot/kernel/opensolaris.ko size 0xa380 at 0x2652000
Booting...
Copyright (c) 1992-2018 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.2-RELEASE #0 r335553: Fri Jun 22 11:10:58 UTC 2018
    webadmin@test-machine:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 6.0.0 (tags/RELEASE_600/final 326565) (based on LLVM 6.0.0)
SRAT: Ignoring memory at addr 0x4e0200000
SRAT: Ignoring memory at addr 0x1000000000
SRAT: Ignoring memory at addr 0x10000200000
SRAT: Ignoring memory at addr 0x20000200000
VT(vga): text 80x25
Hyper-V Version: 10.0.14393 [SP0]
  Features=0x2e7f<VPRUNTIME,TMREFCNT,SYNIC,SYNTM,APIC,HYPERCALL,VPINDEX,REFTSC,IDLE,TMFREQ>
  PM Features=0x0 [C2]
  Features3=0xed7b2<DEBUG,XMMHC,IDLE,NUMA,TMFREQ,SYNCMC,CRASH,NPIEP>
Timecounter "Hyper-V" frequency 10000000 Hz quality 2000
CPU: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz (2394.46-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x306f2  Family=0x6  Model=0x3f  Stepping=2
  Features=0x1f83fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,SS,HTT>
  Features2=0xfed83223<SSE3,PCLMULQDQ,VMX,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND,HV>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x21<LAHF,ABM>
  Structured Extended Features=0x23a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,NFPUSG>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: PAT,HLT,PAUSE,EPT,VPID
Hypervisor: Origin = "Microsoft Hv"
real memory  = 17179869184 (16384 MB)
avail memory = 16513617920 (15748 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table: <VRTUAL MICROSFT>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s) x 2 hardware threads
ioapic0: Changing APIC ID to 0
ioapic0 <Version 1.1> irqs 0-23 on motherboard
SMP: AP CPU #3 Launched!
SMP: AP CPU #1 Launched!
SMP: AP CPU #2 Launched!
random: entropy device external interface
Timecounter "Hyper-V-TSC" frequency 10000000 Hz quality 3000
kbd1 at kbdmux0
netmap: loaded module
module_register_init: MOD_LOAD (vesa, 0xffffffff80feab30, 0) error 19
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
nexus0
vtvga0: <VT VGA driver> on motherboard
cryptosoft0: <software crypto> on motherboard
aesni0: <AES-CBC,AES-XTS,AES-GCM,AES-ICM> on motherboard
acpi0: <VRTUAL MICROSFT> on motherboard
acpi0: Power Button (fixed)
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
attimer0: <AT timer> port 0x40-0x43 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
atrtc0: <AT realtime clock> port 0x70-0x71 irq 8 on acpi0
atrtc0: registered as a time-of-day clock, resolution 1.000000s
Event timer "RTC" frequency 32768 Hz quality 0
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <32-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pcib0: Length mismatch for 3 range: 20000000 vs 1fffffff
vmbus0: <Hyper-V Vmbus> on pcib0
pci0: <ACPI PCI bus> on pcib0
isab0: <PCI-ISA bridge> at device 7.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel PIIX4 UDMA33 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 7.1 on pci0
ata0: <ATA channel> at channel 0 on atapci0
ata1: <ATA channel> at channel 1 on atapci0
pci0: <bridge> at device 7.3 (no driver attached)
vgapci0: <VGA-compatible display> mem 0xf8000000-0xfbffffff irq 11 at device 8.0 on pci0
vgapci0: Boot video device
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: [GIANT-LOCKED]
psm0: model IntelliMouse Explorer, device ID 4
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart0: console (115200,n,8,1)
uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0
fdc0: <floppy drive controller (FDE)> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
vmbus_res0: <Hyper-V Vmbus Resource> irq 5,7 on acpi0
orm0: <ISA Option ROM> at iomem 0xc0000-0xcbfff on isa0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
ppc0: cannot reserve I/O port range
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Timecounters tick every 10.000 msec
usb_needs_explore_all: no devclass
vmbus0: version 3.0
hvet0: <Hyper-V event timer> on vmbus0
Event timer "Hyper-V" frequency 10000000 Hz quality 1000
storvsc0: <Hyper-V IDE> on vmbus0
storvsc1: <Hyper-V IDE> on vmbus0
hvkbd0: <Hyper-V KBD> on vmbus0
hvheartbeat0: <Hyper-V Heartbeat> on vmbus0
hvkvp0: <Hyper-V KVP> on vmbus0
hvshutdown0: <Hyper-V Shutdown> on vmbus0
hvtimesync0: <Hyper-V Timesync> on vmbus0
hvvss0: <Hyper-V VSS> on vmbus0
storvsc2: <Hyper-V SCSI> on vmbus0
da0 at blkvsc0 bus 0 scbus2 target 0 lun 0
da0: <Msft Virtual Disk 1.0> Fixed Direct Access SPC-3 SCSI device
da1 at blkvsc1 bus 0 scbus3 target 1 lun 0
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da1: <Msft Virtual Disk 1.0> Fixed Direct Access SPC-3 SCSI device
da0: 30720MB (62914560 512 byte sectors)
storvsc3: da1: 300.000MB/s transfers
da1: Command Queueing enabled
da1: 32768MB (67108864 512 byte sectors)
<Hyper-V SCSI> on vmbus0
hn0: <Hyper-V Network Interface> on vmbus0
hn0: got notify, nvs type 128
hn0: Ethernet address: 00:0d:3a:20:9c:a0
hn0: link state changed to UP
pcib1: <Hyper-V PCI Express Pass Through> on vmbus0
Trying to mount root from zfs:test-zroot/ROOT/default []...
pci1: <PCI bus> on pcib1
mlx4_core0: <mlx4_core> at device 2.0 on pci1
<6>mlx4_core: Mellanox ConnectX core driver v3.4.1 (October 2017)
mlx4_core: Initializing mlx4_core
mlx4_core0: Detected virtual function - running in slave mode
mlx4_core0: Sending reset
mlx4_core0: Sending vhcr0
mlx4_core0: HCA minimum page size:512
mlx4_core0: Timestamping is not supported in slave mode


Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address   = 0x1d4
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80bb3474
stack pointer           = 0x28:0xfffffe03e27be9d0
frame pointer           = 0x28:0xfffffe03e27be9d0
code segment            = base rx0, limit 0xfffff, type 0x1b
Comment 8 Pete French 2018-06-22 12:58:15 UTC
One thing I did notice - if I boot the machine and load the module by hand, it doesn't crash, but it also no longer detects the Mellanox card, unlike what I was seeing in STABLE.

Am sorry this didn't work - I hope the above trace is useful, as you can see I did it on a clean build of 11.2 RELEASE. I won't have time to look at anything more until after the weekend I suspect (family & kids).

What I can try next week is to create a machine from one of your images, and then upgrade it to 11.2 from sources and see if that works, to eliminate any possibility of this being affected by my ZFS environment and the fact I load those modules at boot too. Thanks for your attention to this so far - I really would like to get it working, but I also need ZFS.
Comment 9 Pete French 2018-06-22 19:11:52 UTC
Addendum - I created a complete new machine using the UFS image "MicrosoftOSTC:FreeBSD:11.1:latest" and did a source upgrade to 11.2-RELEASE using the version I compiled earlier.

This also panics if you load the module on boot and accelerated networking is enabled.

I do have the tuning parameter "CPUTYPE?=core2" in src.conf - is there the remotest possibility that this is a compiler bug which only manifests when compiling for that architecture, and if I did a vanilla build it would work OK? I don't really have the time to actually do another build now unfortunately, and I suspect I am just clutching at straws here.
Comment 10 Dexuan Cui 2018-06-23 00:18:20 UTC
(In reply to pete from comment #6)
> By the way, when I am using this, I should only have config in rc.conf
>  for the mlxen0 device, and not simultaneously for hn0 if I understand
> this right ?

I think you're talking about the network configuration in /etc/rc.conf.

What we should use is the opposite. :-) In rc.conf, we should only use hn0 rather than mlxen0:

ifconfig_hn0="DHCP"

As I mentioned, thanks to sephe's work, mlxen0 can automatically work in "bond mode" with hn0, and we don't need to use lagg.

Please let us know if this doesn't work. At least it's supposed to work this way, but obviously we didn't have enough resources to fully test it.
Comment 11 Dexuan Cui 2018-06-23 00:25:31 UTC
That first bad commit was reported by our test engineer. It turns out the bisection is unreliable, but I do suspect something outside of the Hyper-V drivers causes the panic, probably in the mlx driver or the generic memory management code.

@pete, thanks very much for your help! It turns out both 11.2 and the HEAD are broken, but in my environment somehow I can't always reproduce it. I'll try to look into this next week.
Comment 12 Pete French 2018-06-23 08:00:44 UTC
(In reply to Dexuan Cui from comment #10)

Heh, yes, I suddenly realised how it was designed to work last night after building the UFS machine and seeing how that was configured. Obvious when I thought about it - nice piece of design by the way.

Is there anything else I can do to help debug this? Obviously it's not going to be fixed for 11.2, but it's a bit worrying if it's something in the generic memory management code! I could give you access to one of my images which shows the problem...
Comment 13 Dexuan Cui 2018-06-23 15:12:12 UTC
(In reply to pete from comment #12)
Our test team can reproduce the issue. I'm going to dig into it.
Comment 14 commit-hook 2018-07-07 00:41:17 UTC
A commit references this bug:

Author: dexuan
Date: Sat Jul  7 00:41:05 UTC 2018
New revision: 336054
URL: https://svnweb.freebsd.org/changeset/base/336054

Log:
  hyperv: Fix boot-up after malloc() returns memory of NX by default now

  FreeBSD VM can't boot up on Hyper-V after the recent malloc change in
  r335068: Make UMA and malloc(9) return non-executable memory in most cases.

  The hypercall page here must be executable.
  Fix the boot-up issue by adding M_EXEC.

  PR:		229167
  Sponsored by:	Microsoft

Changes:
  head/sys/dev/hyperv/vmbus/hyperv.c
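
For illustration, the shape of this change is roughly as follows. This is a hedged sketch only, not the literal diff: the names hypercall_alloc_sketch and hypercall_page are made up here, and the real allocation in hyperv.c may look different. The point is just that, after r335068, malloc(9) hands back non-executable memory unless M_EXEC is passed, and the Hyper-V hypercall page has to be executable.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>

static void *hypercall_page;

static void
hypercall_alloc_sketch(void)
{
        /*
         * The hypercall page is executed, not just read, so request an
         * executable mapping explicitly now that malloc(9) defaults to
         * returning NX memory.
         */
        hypercall_page = malloc(PAGE_SIZE, M_DEVBUF,
            M_WAITOK | M_ZERO | M_EXEC);
}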
Comment 15 Dexuan Cui 2018-07-11 00:47:17 UTC
(In reply to pete from comment #12)
I installed a 11.2 VM on my local Hyper-V host with http://ftp.freebsd.org/pub/FreeBSD/releases/ISO-IMAGES/11.2/FreeBSD-11.2-RELEASE-amd64-dvd1.iso, and can't reproduce the issue with MLX VF.

I also installed a 12-CURRENT VM on my local Hyper-V host with
http://ftp.freebsd.org/pub/FreeBSD/snapshots/ISO-IMAGES/12.0/FreeBSD-12.0-CURRENT-amd64-20180709-r336134-disc1.iso, and can't repro any panic issue either ("#2: panic in vmbus_msghc_exec()" has been fixed by r336054)

I used mlx4en_load="YES" in "/boot/loader.conf".

Need to figure out how to reproduce the issue...
Comment 16 Dexuan Cui 2018-07-12 06:08:30 UTC
(In reply to Dexuan Cui from comment #15)
I also created a 11.1 VM from the MicrosoftOSTC:FreeBSD:11.1:latest image, and updated it to 11.2 (11.2-RELEASE FreeBSD 11.2-RELEASE #0 r335510: Fri Jun 22 04:32:14 UTC 2018     root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64) and can't reproduce the issue: I have mlx4en_load="YES" in /boot/loader.conf and the mlxen0 interface appears in "ifconfig" as expected.

According to the call-trace, the panic we previously saw happens in namei(), which is in VFS code. So my feeling is: the file system may be somehow corrupt or there is a race condition bug in the VFS code?
Comment 17 Pete French 2018-07-12 10:21:43 UTC
(In reply to Dexuan Cui from comment #16)

That's interesting, as I did the same thing I thought (built from 11.1, upgraded to 11.2) and got the panic.

How did you do your upgrade? I built mine from source, did you use the binary upgrades?

I don't have time to try this again today, but might get a chance tomorrow. Will try a binary upgrade to 11.2 if that's what you did, and if that works will then try a recompile from the source.
Comment 18 Dexuan Cui 2018-07-12 16:00:38 UTC
(In reply to pete from comment #17)
I did both source code update and binary update, and can't repro the issue either way.

For binary update, I referred to https://www.freebsd.org/doc/handbook/updating-upgrading-freebsdupdate.html:

freebsd-update -r 11.2-RELEASE upgrade
freebsd-update install
reboot
freebsd-update install
Comment 19 Pete French 2018-07-13 21:19:26 UTC
OK, I just did this and (to my surprise) I have a running machine, with accelerated networking, on 11.2. Will try building from source on this version, and then the version I got the panic on, to see if we can narrow it down a bit. Odd that when I upgraded from source it crashed, but hopefully we can narrow this down a bit now!
Comment 20 Pete French 2018-07-14 20:53:29 UTC
(In reply to Dexuan Cui from comment #18)

It's ZFS.

I built a UFS machine the way you did, then did a binary upgrade to 11.2, then a source upgrade to STABLE. All fine. Then I copied the data to another disc, formatted as a pool, so I have two identical drives, differing only in filesystem.

The UFS drive boots fine, the ZFS one panics when it loads the mlx4en driver as we have seen before. The filesystem is not corrupt, so there's some odd interaction somewhere. Kind of worrying!

By the way, I do have real hardware, with real Mellanox cards, which boots fine from ZFS - but they are only ConnectX-2 cards.
Comment 21 Dexuan Cui 2018-07-15 02:45:02 UTC
(In reply to pete from comment #20)
I never used ZFS -- I don't even know how to format a ZFS partition :-), but previously I did reproduce the panic when the mlx VF driver loads, i.e. the panic in namei(), and now I can't reproduce it any more...

BTW, can you reproduce the same panic with the latest 12-CURRENT + ZFS? I'm wondering if the issue also exists in 12-CURRENT.

I'll try to reproduce the issue with ZFS, as you did.
Comment 22 Pete French 2018-07-15 08:58:47 UTC
(In reply to Dexuan Cui from comment #21)

It's interesting that you got the panic on UFS previously - so did I, but I can't reproduce it anymore either. However it is there on ZFS.

ZFS is superb - you have missed out if you haven't used it - but this is not the place for ZFS advocacy :-) A quick guide would be to simply attach a drive and partition it as you would for UFS, except using freebsd-zfs for the type of the main partition. Use /boot/gptzfsboot instead of /boot/gptboot in the boot partition though. Then make a pool, and I usually create a separate filesystem for booting, like this:

   zpool create tank da2p2
   zfs create tank/ROOT
   zfs create tank/ROOT/default

Copy the existing system to it using cpdup:

   cpdup -v -i0 / /tank/ROOT/default

Finally set that filesystem as the one to be booted on the pool:

   zpool set bootfs=tank/ROOT/default tank

And you are good to go... (typed from memory BTW! hopefully all correct...)

Am just doing a rebuild with CURRENT to see if that has the same issue. Will build and boot on UFS first, then copy to a ZFS filesystem using the above method for testing. I'll let you know how it goes.
Comment 23 Pete French 2018-07-15 19:16:28 UTC
(In reply to Dexuan Cui from comment #21)

So, I rebuilt with CURRENT, and it booted fine on UFS. Then I rebooted it a couple of times, and the third time it did this:

mlxen0: Ethernet address: 00:0d:3a:2f:f9:61
<4>mlx4_en: mlx4_core0: Port 1: Using 4 TX rings
mlxen0: link state changed to DOWN
<4>mlx4_en: mlx4_core0: Port 1: Using 4 RX rings
<4>mlx4_en: mlxen0: Using 4 TX rings
hn0: link state changed to DOWN
<4>mlx4_en: mlxen0: Using 4 RX rings
<4>mlx4_en: mlxen0: Initializing port


Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address   = 0x1d8
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80c6a3ea
stack pointer           = 0x28:0xfffffe00004bb950
frame pointer           = 0x28:0xfffffe00004bb960
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (vmbusdev)
[ thread pid 0 tid 100105 ]
Stopped at      vrefact+0xa:    cmpl    $0x4,ll+0x1b7(%r14)
db>

Hmmmm... that's the same (UFS based) machine being rebooted, no changes, so it looks unpredictable :-( Am going to try it again and also try with ZFS if I can, but it does look like it shows up in CURRENT, just not every time.
Comment 24 Pete French 2018-07-15 19:18:38 UTC
Also worth noting that the 3rd reboot was after a size change. Even on UFS on 11.2 I did find that when it booted sometimes the networking didn't work and I had to redeploy a couple of times to get it working. Maybe there is some dependency on the physical host somehow? That would be very nasty :-(
Comment 25 Pete French 2018-07-15 19:40:45 UTC
OK, this is interesting. With CURRENT booting from UFS:

Works fine if machine size is D16s_v3

Panics, as above, if machine size is D4s_v3
Comment 26 Pete French 2018-07-16 15:13:13 UTC
More tests - I built a ZFS machine as D16s_v3 - that boots fine. Then I resized both of them back to D4s_v3. The UFS machine now boots (which it didn't at that size before) but the ZFS machine doesn't.

So it's looking like either a race condition (so the number of CPUs makes a difference) or some dependency on the underlying hardware (as the resize may move the VM to a different host, presumably?).

Either way it's depressingly inconsistent :-( Once it happens it's reproducible though.

Do you need me to do any more testing?
Comment 27 Pete French 2018-07-16 15:24:13 UTC
Here is a 'bt' from the last ZFS boot attempt, in case it helps.

ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Timecounters tick every 10.000 msec
usb_needs_explore_all: no devclass
vmbus0: version 3.0
hvet0: <Hyper-V event timer> on vmbus0
Event timer "Hyper-V" frequency 10000000 Hz quality 1000
storvsc0: <Hyper-V IDE> on vmbus0
storvsc1: <Hyper-V IDE> on vmbus0
hvkbd0: <Hyper-V KBD> on vmbus0
hvheartbeat0: <Hyper-V Heartbeat> on vmbus0
hvkvp0: <Hyper-V KVP> on vmbus0
da0 at blkvsc0 bus 0 scbus2 target 0 lun 0
da0: <Msft Virtual Disk 1.0> Fixed Direct Access SPC-3 SCSI device
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da0: 30720MB (62914560 512 byte sectors)
hvshutdown0: da1 at blkvsc1 bus 0 scbus3 target 1 lun 0
da1: <Msft Virtual Disk 1.0> Fixed Direct Access SPC-3 SCSI device
da1: 300.000MB/s transfers
da1: Command Queueing enabled
da1: 32768MB (67108864 512 byte sectors)
<Hyper-V Shutdown> on vmbus0
hvtimesync0: <Hyper-V Timesync> on vmbus0
hvtimesync0: RTT
hvvss0: <Hyper-V VSS> on vmbus0
hn0: <Hyper-V Network Interface> on vmbus0
hn0: got notify, nvs type 128
hn0: Ethernet address: 00:0d:3a:2e:a3:c7
hn0: link state changed to UP
storvsc2: <Hyper-V SCSI> on vmbus0
storvsc3: <Hyper-V SCSI> on vmbus0
pcib1: <Hyper-V PCI Express Pass Through> on vmbus0
WARNING: WITNESS option enabled, expect reduced performance.
Trying to mount root from zfs:zroot/ROOT/default []...
pci1: <PCI bus> on pcib1
mlx4_core0: <mlx4_core> at device 2.0 on pci1
<6>mlx4_core: Mellanox ConnectX core driver v3.4.1 (October 2017)
mlx4_core: Initializing mlx4_core
mlx4_core0: Detected virtual function - running in slave mode
mlx4_core0: Sending reset
mlx4_core0: Sending vhcr0
mlx4_core0: HCA minimum page size:512
mlx4_core0: Timestamping is not supported in slave mode
mlx4_en mlx4_core0: Activating port:1
mlxen0: Ethernet address: 00:0d:3a:2e:a3:c7
<4>mlx4_en: mlx4_core0: Port 1: Using 4 TX rings
mlxen0: link state changed to DOWN
<4>mlx4_en: mlx4_core0: Port 1: Using 4 RX rings
hn0: link state changed to DOWN
<4>mlx4_en: mlxen0: Using 4 TX rings
<4>mlx4_en: mlxen0: Using 4 RX rings
<4>mlx4_en: mlxen0: Initializing port


Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address   = 0x1d8
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80c6a3ea
stack pointer           = 0x28:0xfffffe00004bb950
frame pointer           = 0x28:0xfffffe00004bb960
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (vmbusdev)
[ thread pid 0 tid 100105 ]
Stopped at      vrefact+0xa:    cmpl    $0x4,ll+0x1b7(%r14)
db> bt
Tracing pid 0 tid 100105 td 0xfffff80004a02580
vrefact() at vrefact+0xa/frame 0xfffffe00004bb960
namei() at namei+0x20d/frame 0xfffffe00004bba20
vn_open_cred() at vn_open_cred+0x213/frame 0xfffffe00004bbb60
linker_load_module() at linker_load_module+0x4cf/frame 0xfffffe00004bbe70
kern_kldload() at kern_kldload+0xf1/frame 0xfffffe00004bbec0
mlx4_request_modules() at mlx4_request_modules+0x93/frame 0xfffffe00004bbf80
mlx4_load_one() at mlx4_load_one+0x3196/frame 0xfffffe00004bc630
mlx4_init_one() at mlx4_init_one+0x410/frame 0xfffffe00004bc690
linux_pci_attach() at linux_pci_attach+0x411/frame 0xfffffe00004bc6f0
device_attach() at device_attach+0x3f3/frame 0xfffffe00004bc730
device_probe_and_attach() at device_probe_and_attach+0x71/frame 0xfffffe00004bc760
bus_generic_attach() at bus_generic_attach+0x18/frame 0xfffffe00004bc780
pci_attach() at pci_attach+0xd5/frame 0xfffffe00004bc7c0
device_attach() at device_attach+0x3f3/frame 0xfffffe00004bc800
device_probe_and_attach() at device_probe_and_attach+0x71/frame 0xfffffe00004bc830
bus_generic_attach() at bus_generic_attach+0x18/frame 0xfffffe00004bc850
vmbus_pcib_attach() at vmbus_pcib_attach+0x95e/frame 0xfffffe00004bc940
device_attach() at device_attach+0x3f3/frame 0xfffffe00004bc980
device_probe_and_attach() at device_probe_and_attach+0x71/frame 0xfffffe00004bc9b0
vmbus_add_child() at vmbus_add_child+0x6a/frame 0xfffffe00004bc9e0
taskqueue_run_locked() at taskqueue_run_locked+0x14c/frame 0xfffffe00004bca40
taskqueue_thread_loop() at taskqueue_thread_loop+0x88/frame 0xfffffe00004bca70
fork_exit() at fork_exit+0x84/frame 0xfffffe00004bcab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00004bcab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db>
Comment 28 Andriy Gapon 2018-07-16 19:06:29 UTC
I do not know anything about this code and this problem, but I couldn't help but notice that the driver attempts to load a module.
Does it ensure that the root filesystem is mounted?
Comment 29 Dexuan Cui 2018-07-16 20:06:38 UTC
(In reply to pete from comment #27)
@pete: You have done good tests! We know the issue exists in 12-CURRENT as well. Thanks! I'm doing more tests too, trying to find out which exact line of code causes the panic in namei(), or namei() -> vrefact().
Comment 30 Dexuan Cui 2018-07-16 20:10:41 UTC
(In reply to Andriy Gapon from comment #28)
Even if the root filesystem is not mounted when the mlx4_core driver tries to load mlx4en, I believe we should not get a panic like this. :-(
Comment 31 Andriy Gapon 2018-07-17 08:04:07 UTC
(In reply to Dexuan Cui from comment #30)
Perhaps.  It depends on the documented contract of kern_kldload, if it exists.
In any case, kern_kldload is used very rarely and, except for mlx4 -- or rather linuxkpi, always in user-initiated actions. So, I wouldn't be surprised if it was designed to work only after the root is mounted.
Comment 32 Dexuan Cui 2018-08-27 21:33:37 UTC
(In reply to Andriy Gapon from comment #31)

Unfortunately I still can't reproduce the issue recently. :-(

I created the Marketplace 11.1 VM on Azure, and can't reproduce the issue:

FreeBSD decui-bsd111 11.1-RELEASE-p6 FreeBSD 11.1-RELEASE-p6 #0 r313908+f9756700e2c(releng/11.1)-dirty: Thu Jan 11 23:23:54 CST 2018     root@bsd:/usr/obj/usr/xhx/freebsd/sys/GENERIC  amd64

Next, I upgraded it to the latest 11.1, and still no luck:

FreeBSD decui-bsd111 11.1-RELEASE-p13 FreeBSD 11.1-RELEASE-p13 #0: Tue Aug 14 19:34:21 UTC 2018     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

Finally, I upgraded it to the latest 11.2, and still no luck:

FreeBSD decui-bsd111 11.2-RELEASE-p2 FreeBSD 11.2-RELEASE-p2 #0: Tue Aug 14 21:45:40 UTC 2018     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

Please let us know if you can still reproduce the issue, especially with the same latest 11.1 and 11.2 releases.
Comment 33 Pete French 2018-09-04 18:41:13 UTC
(In reply to Dexuan Cui from comment #32)

You are trying to boot from ZFS, yes ? I get the panic when I have my root filesystem as ZFS. It seems to work fine on UFS.

I will try and reproduce when I get a moment though....
Comment 34 Dexuan Cui 2018-09-04 18:58:14 UTC
(In reply to pete from comment #33)
Last Friday, I happened to reproduce the issue again with the latest stable/11. UFS was used here. 

I'm debugging it now.
Comment 35 Pete French 2018-12-13 13:57:54 UTC
(In reply to Dexuan Cui from comment #34)

So, I just saw a new version of the Mellanox driver was imported into 12.0 yesterday - has anyone tried this yet? I am rebuilding on one of my Azure machines and can potentially try it later today.
Comment 36 Dexuan Cui 2018-12-13 20:03:57 UTC
(In reply to pete from comment #35)
I guess nobody has tried this yet, and I doubt the issue will disappear automatically with the new MLX driver, as the panic happened in namei() rather than in a MLX driver function.

Cc'd Wei Hu, who can help to further investigate the issue.
Comment 37 Wei Hu 2020-09-14 08:12:46 UTC
I hit this bug on a FreeBSD 12.1 image in Azure with a Mellanox CX3 VF. It looks like the root file system is not available at the time the mlx4 driver is trying to load the mlx4en kernel module. Adding a one-second sleep in mlx4_request_modules makes the problem go away, but it doesn't look like a proper fix to me. At least namei() or vrefact() should check if the vnode is NULL to avoid the panic.
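
For illustration, the one-second sleep amounts to something like the sketch below. This is not the real mlx4_request_modules() (the _sketch name is invented, and the actual driver code in the mlx4 core differs); it only shows the idea of giving the root mount a head start before kern_kldload() is called.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>       /* hz */
#include <sys/proc.h>         /* curthread */
#include <sys/syscallsubr.h>  /* kern_kldload() */

static void
mlx4_request_modules_sketch(void)
{
        /* Sleep roughly one second so the root file system can be mounted. */
        pause("mlx4en", hz);

        /* Then ask the kernel linker to pull in the mlx4en module. */
        (void)kern_kldload(curthread, "mlx4en", NULL);
}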

Here is the detailed troubleshooting in debugger I did when the crash happened.

The panic on console:
----------------------
pci1: <PCI bus> on pcib1
mlx4_core0: <mlx4_core> at device 2.0 on pci1
<6>mlx4_core: Mellanox ConnectX core driver v3.5.1 (April 2019)
mlx4_core: Initializing mlx4_core
mlx4_core0: Detected virtual function - running in slave mode
mlx4_core0: Sending reset
mlx4_core0: Sending vhcr0
mlx4_core0: HCA minimum page size:512
mlx4_core0: Timestamping is not supported in slave mode
mlx4_en mlx4_core0: Activating port:1
mlxen0: Ethernet address: 00:0d:3a:e8:16:18
<4>mlx4_en: mlx4_core0: Port 1: Using 4 TX rings
mlxen0: link state changed to DOWN
<4>mlx4_en: mlx4_core0: Port 1: Using 4 RX rings
<4>mlx4_en: mlxen0: Using 4 TX rings
hn0: link state changed to DOWN
<4>mlx4_en: mlxen0: Using 4 RX rings
<4>mlx4_en: mlxen0: Initializing port
mlx4_core0: About to load mlx4_en


Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address   = 0x1d8   <- 0x1d8 is the offset of (struct vnode *)->v_type 
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80cb5c34
stack pointer           = 0x28:0xfffffe00004f4960
frame pointer           = 0x28:0xfffffe00004f4960
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (vmbusdev)
trap number             = 12
panic: page fault
cpuid = 2
time = 1599838711
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00004f4610
vpanic() at vpanic+0x19d/frame 0xfffffe00004f4660
panic() at panic+0x43/frame 0xfffffe00004f46c0
trap_fatal() at trap_fatal+0x39c/frame 0xfffffe00004f4720
trap_pfault() at trap_pfault+0x49/frame 0xfffffe00004f4780
trap() at trap+0x29f/frame 0xfffffe00004f4890
calltrap() at calltrap+0x8/frame 0xfffffe00004f4890
--- trap 0xc, rip = 0xffffffff80cb5c34, rsp = 0xfffffe00004f4960, rbp = 0xfffffe00004f4960 ---
vrefact() at vrefact+0x4/frame 0xfffffe00004f4960
namei() at namei+0x172/frame 0xfffffe00004f4a20
vn_open_cred() at vn_open_cred+0x221/frame 0xfffffe00004f4b70
linker_load_module() at linker_load_module+0x480/frame 0xfffffe00004f4e90
kern_kldload() at kern_kldload+0xc3/frame 0xfffffe00004f4ee0
mlx4_request_modules() at mlx4_request_modules+0xc2/frame 0xfffffe00004f4fa0
mlx4_load_one() at mlx4_load_one+0x349c/frame 0xfffffe00004f5660
mlx4_init_one() at mlx4_init_one+0x3f0/frame 0xfffffe00004f56b0
linux_pci_attach() at linux_pci_attach+0x432/frame 0xfffffe00004f5710
device_attach() at device_attach+0x3e1/frame 0xfffffe00004f5760
bus_generic_attach() at bus_generic_attach+0x5c/frame 0xfffffe00004f5790
pci_attach() at pci_attach+0xd5/frame 0xfffffe00004f57d0
device_attach() at device_attach+0x3e1/frame 0xfffffe00004f5820
bus_generic_attach() at bus_generic_attach+0x5c/frame 0xfffffe00004f5850
vmbus_pcib_attach() at vmbus_pcib_attach+0x75e/frame 0xfffffe00004f5930
device_attach() at device_attach+0x3e1/frame 0xfffffe00004f5980
device_probe_and_attach() at device_probe_and_attach+0x42/frame 0xfffffe00004f59b0
vmbus_add_child() at vmbus_add_child+0x7b/frame 0xfffffe00004f59e0
taskqueue_run_locked() at taskqueue_run_locked+0x154/frame 0xfffffe00004f5a40
taskqueue_thread_loop() at taskqueue_thread_loop+0x98/frame 0xfffffe00004f5a70
fork_exit() at fork_exit+0x83/frame 0xfffffe00004f5ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00004f5ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 0 tid 100080 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why

db> x/i vrefact,10
vrefact:        pushq   %rbp
vrefact+0x1:    movq    %rsp,%rbp
vrefact+0x4:    cmpl    $0x4,ll+0x1b7(%rdi)  <-- if (__predict_false(vp->v_type == VCHR)) in vrefact()
vrefact+0xb:    jz      vrefact+0x1f
vrefact+0xd:    lock addl       $0x1,ll+0x19b(%rdi)
vrefact+0x15:   lock addl       $0x1,ll+0x19f(%rdi)
vrefact+0x1d:   popq    %rbp
vrefact+0x1e:   ret

db> x/i namei+0x160,10
namei+0x160:    call    _sx_slock_int
namei+0x165:    movq    0x10(%r13),%rdi
namei+0x169:    movq    %rdi,ll+0x7(%rbx)
namei+0x16d:    call    vrefact          <--- Place in namei() calling vrefact()
namei+0x172:    movq    0x18(%r13),%rax
namei+0x176:    movq    %rax,ll+0xf(%rbx)
namei+0x17a:    movq    ll+0x77(%rbx),%rax

This is the code in namei():

        /*
         * Get starting point for the translation.
         */
        FILEDESC_SLOCK(fdp);
        ndp->ni_rootdir = fdp->fd_rdir;
        vrefact(ndp->ni_rootdir);         <--- here
        ndp->ni_topdir = fdp->fd_jdir;

And fdp is (struct filedesc *) and got assigned earlier to the current proc's p_fd:

        p = td->td_proc;
        ...
        fdp = p->p_fd;


db> show thread
Thread 100080 at 0xfffff80004a95000:
 proc (pid 0): 0xffffffff81ff2060   <--- pointer to proc
 name: vmbusdev
 stack: 0xfffffe00004f2000-0xfffffe00004f5fff
 flags: 0x4  pflags: 0x200000
 state: RUNNING (CPU 2)
 priority: 8
 container lock: sched lock 2 (0xffffffff81eb3540)
 last voluntary switch: 50 ms ago
db> x/gx 0xffffffff81ff2060,20 (struct proc)
proc0:  0                               fffff80003609a60                ffffffff81ff25a0
proc0+0x18:     fffff80009866010                ffffffff81332c23
proc0+0x28:     30000                           0
proc0+0x38:     0                               fffff8000308ad00
proc0+0x48:     fffff800035168a0 (p_fd)               0
proc0+0x58:     fffff80003084e00                fffff8000308ac00

db> x/gx 0xfffff800035168a0, 10 (struct filedesc)
0xfffff800035168a0:     fffff80003516920                0
0xfffff800035168b0:     0  (fd_rdir)                    0
0xfffff800035168c0:     fffff80003516ce8                ffffffff
0xfffff800035168d0:     100000012                       1
0xfffff800035168e0:     ffffffff812c4e35                2330000
0xfffff800035168f0:     0                               21
0xfffff80003516900:     0                               0
0xfffff80003516910:     0                               0

So at the moment the fd_rdir (root directory) is still NULL.
Comment 38 mammootty 2021-02-26 18:02:49 UTC
I am hitting this panic often while creating VMs on Azure with Mellanox VFs on larger-size VMs. Even with the one-second sleep Wei Hu mentioned, the frequency has been reduced but I still hit the panic. The stack trace is exactly the same as the one given in the earlier comment.
Comment 39 Wei Hu 2021-03-01 09:30:03 UTC
The cause of this problem is that the mlx4 driver is loaded before the root filesystem is mounted. On larger VMs the odds of hitting this increase, as more things happen in parallel.

Instead of adding a 1-second delay in the driver, I think you can move the driver loading from /boot/loader.conf to /etc/rc.conf. Remove lines such as mlx4en_load="YES" from /boot/loader.conf and add the following line to /etc/rc.conf:

kld_list="mlx4en mlx4ib"

This way the mlx4 driver is loaded after the local disks are mounted, making this problem go away.
Comment 40 mammootty 2021-03-01 19:25:17 UTC
Thanks Wei Hu. We have built the mlx4en module into the kernel and temporarily commented out the call to mlx4_request_modules().
However, we are still looking for a change in namei() or vrefact() to handle the NULL vnode case gracefully and avoid the panic.
Comment 41 Dexuan Cui 2021-03-01 19:33:52 UTC
(In reply to mammootty from comment #40)
This sounds like a bug in the kernel function kern_kldload(): if it can't find a mounted root file system, it should return some error rather than panic.
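
For what it's worth, the kind of guard being asked for would look roughly like this. It is only a sketch built from the fields shown in comment 37 (td->td_proc->p_fd->fd_rdir was NULL when the panic hit); the helper name is invented, and where exactly such a check belongs (kern_kldload(), linker_load_module(), or the VFS layer) is the open question.

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/proc.h>       /* struct thread, td_proc, p_fd */
#include <sys/filedesc.h>   /* struct filedesc, fd_rdir */

/*
 * Sketch only: refuse to go down the vn_open_cred()/namei() path when the
 * calling process has no root directory yet (root file system not mounted),
 * returning an error instead of letting vrefact() dereference a NULL vnode.
 */
static int
kldload_rootdir_guard_sketch(struct thread *td)
{
        struct filedesc *fdp;

        fdp = td->td_proc->p_fd;
        if (fdp == NULL || fdp->fd_rdir == NULL)
                return (ENOENT);
        return (0);
}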