Now that bug 216493 is closed, let's focus on this live migration bug. With the latest HEAD or any stable/ branch, live migration still does not work and we get the messages below (please refer to bug 216493 for possible patches we need to port from Linux):

Jan 11 19:16:43 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:16:43 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode FREE_RES (0xf01)
Jan 11 19:16:43 decui-b11 kernel: mlx4_core0: Failed to free mtt range at:5937 order:0
Jan 11 19:16:54 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:16:54 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode CLOSE_PORT (0xa)
Jan 11 19:18:04 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:18:04 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode FREE_RES (0xf01)
Jan 11 19:19:14 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:19:14 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode QP_FLOW_STEERING_DETACH (0x66)
Jan 11 19:19:14 decui-b11 kernel: mlx4_core0: Fail to detach network rule. registration id = 0x9000000000002
Jan 11 19:20:24 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:20:24 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode QP_FLOW_STEERING_DETACH (0x66)
Jan 11 19:20:24 decui-b11 kernel: mlx4_core0: Fail to detach network rule. registration id = 0x9000000000003
Jan 11 19:21:34 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:21:34 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode QP_FLOW_STEERING_DETACH (0x66)
Jan 11 19:21:34 decui-b11 kernel: mlx4_core0: Fail to detach network rule. registration id = 0x9000000000004
Jan 11 19:22:46 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:22:46 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode QP_FLOW_STEERING_DETACH (0x66)
Jan 11 19:22:46 decui-b11 kernel: mlx4_core0: Fail to detach network rule. registration id = 0x9000000000005
Jan 11 19:23:56 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:23:56 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode QP_FLOW_STEERING_DETACH (0x66)
Jan 11 19:23:56 decui-b11 kernel: mlx4_core0: Fail to detach network rule. registration id = 0x9000000000006
Jan 11 19:25:06 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:25:06 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode QP_FLOW_STEERING_DETACH (0x66)
Jan 11 19:25:06 decui-b11 kernel: mlx4_core0: Fail to detach network rule. registration id = 0x9000000000007
Jan 11 19:26:16 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:26:16 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode SET_MCAST_FLTR (0x48)
Jan 11 19:27:26 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:27:26 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode FREE_RES (0xf01)
Jan 11 19:27:26 decui-b11 kernel: mlx4_core0: Failed to free icm of qp:2279
Jan 11 19:28:36 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:28:36 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode FREE_RES (0xf01)
Jan 11 19:28:36 decui-b11 kernel: mlx4_core0: Failed to release qp range base:2279 cnt:1
Jan 11 19:29:46 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:29:46 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode 2RST_QP (0x21)
Jan 11 19:30:56 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:30:56 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode HW2SW_CQ (0x17)
Jan 11 19:30:56 decui-b11 kernel: mlx4_core0: HW2SW_CQ failed (-35) for CQN 0000b5
Jan 11 19:32:06 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:32:06 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode FREE_RES (0xf01)
Jan 11 19:32:06 decui-b11 kernel: mlx4_core0: Failed freeing cq:181
Hi, has there been any progress on this? I am interested in enabling high-performance networking on my Azure machines, which I believe use this driver, but I do not want to stop live migration from working, for obvious reasons!
(In reply to pete from comment #1) When I reported the bug, I was testing the case on my local Hyper-V hosts. I have not tested this case recently, but I suspect it may have been fixed already; hselasky (hps) knows this better than I do. So far it looks like Azure doesn't live migrate a VM from one host to another, so I suppose this bug should not block you from using SR-IOV (i.e. Accelerated Networking) on Azure. If you use FreeBSD 10.4, the coming 11.2, or the latest CURRENT code, I think SR-IOV should work out of the box.
Ok, thanks, I shall give it a try. I am using 11-STABLE (so 11.2, basically). You say "out of the box" - does that mean I don't need to recompile with OFED enabled? I usually do this when running Mellanox adapters.
(In reply to pete from comment #3) I didn't check it myself, but I remember it being mentioned that the Mellanox drivers are built by default now: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211528#c10
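If you want to double-check your own 11-STABLE build before enabling anything, something like the following should show whether the modules were built with the kernel (just a rough sketch; I'm assuming the default /boot/kernel module path and the mlx4/mlx4en module names):

# verify the mlx4 modules were installed alongside the kernel
ls /boot/kernel/mlx4*.ko
# load the Ethernet driver by hand (it should pull in its mlx4 dependency)
kldload mlx4en
kldstat | grep mlx4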
Using what is about to become 11.2, I get a panic when I have the Mellanox drivers loaded and accelerated networking enabled, unfortunately:

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0x1d4
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80bb34b4
stack pointer = 0x28:0xfffffe03e27be9d0
frame pointer = 0x28:0xfffffe03e27be9d0
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (vmbusdev)
trap number = 12
timeout stopping cpus
panic: page fault
cpuid = 2
KDB: stack backtrace:
#0 0xffffffff80b36697 at kdb_backtrace+0x67
#1 0xffffffff80af00f7 at vpanic+0x177
#2 0xffffffff80aeff73 at panic+0x43
#3 0xffffffff80f6ee0f at trap_fatal+0x35f
#4 0xffffffff80f6ee69 at trap_pfault+0x49
#5 0xffffffff80f6e636 at trap+0x2c6
#6 0xffffffff80f4e5ac at calltrap+0x8
#7 0xffffffff80ba6494 at namei+0x1b4
#8 0xffffffff80bc1d53 at vn_open_cred+0x233
#9 0xffffffff80ac31de at linker_load_module+0x47e
#10 0xffffffff80ac5061 at kern_kldload+0xc1
#11 0xffffffff8228b6f2 at mlx4_request_modules+0x92
#12 0xffffffff8228f986 at mlx4_load_one+0x3056
#13 0xffffffff82292c70 at mlx4_init_one+0x3c0
#14 0xffffffff82259ed5 at linux_pci_attach+0x405
#15 0xffffffff80b28e68 at device_attach+0x3b8
#16 0xffffffff80b2a0fd at bus_generic_attach+0x3d
#17 0xffffffff8076fac5 at pci_attach+0xd5
(In reply to pete from comment #5) Thanks for reporting this! Let us have a look.
Using today's HEAD (3309c975db94bf91f18da6a0285649a8903e56c1), I got this:

vmbus0: vmbus IDT vector 251
vmbus0: smp_started = 1
panic: vm_fault_hold: fault on nofault entry, addr: 0xfffffe0000dff000
cpuid = 15
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff826d5520
vpanic() at vpanic+0x1a3/frame 0xffffffff826d5580
panic() at panic+0x43/frame 0xffffffff826d55e0
vm_fault_hold() at vm_fault_hold+0x2373/frame 0xffffffff826d5720
vm_fault() at vm_fault+0x60/frame 0xffffffff826d5760
trap_pfault() at trap_pfault+0x188/frame 0xffffffff826d57b0
trap() at trap+0x2ba/frame 0xffffffff826d58c0
calltrap() at calltrap+0x8/frame 0xffffffff826d58c0
--- trap 0xc, rip = 0xfffffe0000dff000, rsp = 0xffffffff826d5990, rbp = 0xffffffff826d59a0 ---
??() at 0xfffffe0000dff000/frame 0xffffffff826d59a0
vmbus_msghc_exec() at vmbus_msghc_exec+0x58/frame 0xffffffff826d59e0
vmbus_intrhook() at vmbus_intrhook+0x633/frame 0xffffffff826d5aa0
run_interrupt_driven_config_hooks() at run_interrupt_driven_config_hooks+0x7c/frame 0xffffffff826d5ac0
boot_run_interrupt_driven_config_hooks() at boot_run_interrupt_driven_config_hooks+0x20/frame 0xffffffff826d5b50
mi_startup() at mi_startup+0x118/frame 0xffffffff826d5b70
btext() at btext+0x2c
KDB: enter: panic
[ thread pid 0 tid 100000 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why
db>
(In reply to Dexuan Cui from comment #6) With today's releng/11.2 (f55f63f9f9c29dae38ac323adfd253cec627873c), the mlx4 driver works fine for me. What's the exact version you're using?
(In reply to Dexuan Cui from comment #8) Hmmm, I can't reproduce the panic in comment #7 any more, after I built a kernel from scratch (i.e. git clone the repo into a new directory, make and install the kernel, and reboot) with the same version (3309c975db94bf91f18da6a0285649a8903e56c1). @pete, can you consistently reproduce your panic every time you reboot the VM? What if you also build a kernel from scratch?
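For reference, this is roughly what I mean by "from scratch" (only a sketch; substitute whatever repo URL and branch you normally track, and your own kernel config name instead of GENERIC):

# clone into a brand-new directory so no stale objects are reused
git clone https://github.com/freebsd/freebsd.git /usr/src-new
cd /usr/src-new
# build and install a fresh kernel, then reboot into it
make -j$(sysctl -n hw.ncpu) buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now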
Just a quick one; I will try and add more detail tomorrow. I am using r334458 on STABLE-11, which was more or less the point where 11_2 branched, I believe. I can't see anything in 11_2 after that point which is mlx4en- or linuxkpi-related. I can try doing an update tomorrow to the latest STABLE and see, though. Will try to prepare a box just for testing this.

The compile has the following in src.conf:

# All our Intel machines are post-Core2
CPUTYPE?=core2
# We are using exim and cups
WITHOUT_SENDMAIL=true
WITHOUT_LPR=true

and the following in make.conf:

# Build ports in local
WRKDIRPREFIX=/usr/local/port-build
# Use new format packages
WITH_PKGNG=yes
DISABLE_VULNERABILITIES=yes
# If we install cups it overwrites the base
CUPS_OVERWRITE_BASE=yes

I switched the interface over to accelerated networking and rebooted, but it still came up using hn0. So I added mlx4en_load="YES" to loader.conf and rebooted - which gave me the panic. I tried many times; it panicked every time.

The only oddity in the setup is that I am booting off ZFS instead of UFS. The drive is GPT partitioned, and I have the following in loader.conf:

boot_serial="YES"
comconsole_speed="115200"
console="comconsole"
vfs.mountroot.timeout="300"
kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="0"
ahci_load="YES"
aesni_load="YES"
cryptodev_load="YES"
zfs_load="YES"
vfs.zfs.arc_max="1G"
vfs.zfs.prefetch_disable="1"
vfs.zfs.txg.timeout="5"
vfs.zfs.vdev.cache.size="10M"
vfs.zfs.vdev.cache.max="10M"

I don't know if any of that is significant at all - I assume adding the line to loader.conf is the right thing to do, yes?
(In reply to pete from comment #10) Yes, I also have mlx4en_load="YES" in my /boot/loader.conf.
So, I cloned the machine to give me something to experiment with, and the clone panics in the same way. Watching it boot, it only does this when it gets to the networking part of the boot - and the message before it is that hn0 has a status of DOWN. I am surprised that it is also finding the hn0 interface - is that expected? I am trying to recover the box now, and will remove all networking references from rc.conf to see what it does. Will then try a rebuild from today's STABLE and see if that works.
(In reply to pete from comment #12) Yes, it's expected that you see a hn interface. Hyper-V is a little different from other hypervisors with respect to SR-IOV support for NICs. Hyper-V provides a pair of NICs to the VM: one is the para-virtualized NIC (hn) and the other is the hardware VF NIC, and both NICs share the same MAC address. Usually almost all of the network traffic goes through the VF NIC, so we can take advantage of the benefits of the hardware VF (lower latency and CPU utilization, and higher throughput), but if necessary the traffic can dynamically switch to the PV NIC, which facilitates things like live migration of the VM. To enable SR-IOV in FreeBSD on Hyper-V, in 2016 we updated the PV NIC driver (i.e. the netvsc driver) a little and added a PCI front-end driver (i.e. the pcib driver) to discover the VF device, and in Aug 2017 sephe implemented an automatic "bond mode", with which we don't need to manually use the lagg driver any more; the configuration work is done automatically for the PV NIC, and we don't (and shouldn't) directly touch the VF interface.
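In practice this means all the usual interface configuration stays on the hn interface. A minimal /etc/rc.conf sketch (assuming DHCP and that the PV NIC shows up as hn0; the mlxen VF interface gets no entry of its own):

# configure only the PV NIC; the VF (mlxen0) is bonded to it automatically
ifconfig_hn0="SYNCDHCP"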
I got a capture of what it does right before the panic... so, this takes place directly after root is mounted:

Trying to mount root from zfs:zroot/ROOT/default []...
pci1: <PCI bus> on pcib1
mlx4_core0: <mlx4_core> at device 2.0 on pci1
<6>mlx4_core: Mellanox ConnectX core driver v3.4.1 (October 2017)
mlx4_core: Initializing mlx4_core
mlx4_core0: Detected virtual function - running in slave mode
mlx4_core0: Sending reset
mlx4_core0: Sending vhcr0
mlx4_core0: HCA minimum page size:512
mlx4_core0: Timestamping is not supported in slave mode
mlx4_en mlx4_core0: Activating port:1
mlxen0: Ethernet address: 00:0d:3a:20:9c:a0
<4>mlx4_en: mlx4_core0: Port 1: Using 4 TX rings
mlxen0: link state changed to DOWN
<4>mlx4_en: mlx4_core0: Port 1: Using 4 RX rings
hn0: link state changed to DOWN
<4>mlx4_en: mlxen0: Using 4 TX rings
<4>mlx4_en: mlxen0: Using 4 RX rings
<4>mlx4_en: mlxen0: Initializing port
(In reply to pete from comment #14) Is it easy for you to test UFS? I haven't really used ZFS before. :-)
Thanks for the explanation - interesting! I don't have any UFS machines I could clone to test with, but will do so if we don't get anywhere with this. Meanwhile, however, I did the following test: comment out the line in loader.conf and instead load the module after the machine has booted. That works fine and the device appears with no panic (though of course it's not in use). So it's something to do with it being loaded during boot. What version would you like me to try compiling and testing again? The latest STABLE, or the latest 11_2?
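For completeness, this is roughly what I did after boot to load it by hand (just a sketch; device names taken from the boot log above):

# with mlx4en_load="YES" commented out in /boot/loader.conf, after boot:
kldload mlx4en
kldstat | grep mlx4
# the VF interface shows up but is not configured or in use
ifconfig mlxen0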
(In reply to pete from comment #16) I opened a new bug for the panic issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229167 Let's use it to discuss the panic issue, as this one (Bug 220453) is supposed to focus on live migration with mlx VF. :-)
(In reply to Dexuan Cui from comment #7) Hi - this also happens when booting off the latest snapshot of the CURRENT ISO, and it also gave the same panic after the latest compile of CURRENT. I am running on Hyper-V.
(In reply to scorpionmage from comment #18) Let's track the vmbus_msghc_exec() panic in bug 229167. I'm going to commit a fix soon.