Bug 216493 - [Hyper-V] Mellanox ConnectX-3 VF driver can't work when FreeBSD runs on Hyper-V 2016
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-virtualization
Reported: 2017-01-26 13:08 UTC by Dexuan Cui
Modified: 2017-12-17 07:14 UTC
Description Dexuan Cui 2017-01-26 13:08:06 UTC
Windows Server 2016 (Hyper-V 2016) supports PCIe pass-through and NIC SR-IOV for non-Windows virtual machines (VMs) such as Linux and FreeBSD VMs. A few months ago, we enabled PCIe pass-through for a FreeBSD VM running on Hyper-V, successfully assigned a Mellanox ConnectX-3 PF device to the VM, and the device worked fine in the VM.

Now we have added code to support NIC SR-IOV (which is based on PCIe pass-through) in the Hyper-V hv_netvsc driver, but it turned out the VF driver failed to load, so I ported two patches from Linux:
https://reviews.freebsd.org/D8867
https://reviews.freebsd.org/D8868

(Note: I tested the PF/VF drivers only in a FreeBSD VM running on Hyper-V. I have not tested the patches on a bare-metal FreeBSD machine, because it is not easy to set one up in our lab right now, so it would be very helpful and important if people could review the patches and help test on bare metal.)

With the 2 patches, the VF driver worked in my limited test.

BTW, this link (https://community.mellanox.com/docs/DOC-2242) shows how to enable a Mellanox ConnectX-3 VF for a Windows VM running on Hyper-V 2012 R2. What I did for the FreeBSD VM on Hyper-V 2016 is quite similar.


Next, I did more testing and identified 4 issues we need to address:
1. When the VF is hot-removed, I see the error below, but it looks non-fatal, because the VF still works when it is hot-added again later.

mlx4_core0: Failed to free mtt range at:20769 order:0
mlx4_core0: detached


2. The VF works fine when the VM has <=12 virtual CPUs, but if the VM has >=13 vCPUs, the VF driver fails to load:
  
  mlx4_core0: <mlx4_core> at device 2.0 on pci1
  mlx4_core: Initializing mlx4_core: Mellanox ConnectX VPI driver v2.1.6
  vmbus0: allocated type 3 (0xfe0800000-0xfe0ffffff) for rid 18 of mlx4_core0
  mlx4_core0: Lazy allocation of 0x800000 bytes rid 0x18 type 3 at 0xfe0800000
  mlx4_core0: Detected virtual function - running in slave mode
  mlx4_core0: Sending reset
  mlx4_core0: Sending vhcr0
  mlx4_core0: HCA minimum page size:512
  mlx4_core0: Timestamping is not supported in slave mode.
  mlx4_core0: attempting to allocate 20 MSI-X vectors (52 supported)
  mlx4_core0: using IRQs 256-275 for MSI-X
  mlx4_core0: Failed to allocate mtts for 1024 pages(order 10)
  mlx4_core0: Failed to initialize event queue table (err=-12), aborting.
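
For readers decoding the two failure lines above, here is a minimal Python sketch. It assumes the usual convention that an allocation of "order" n covers 2^n pages, and that the negative `err` value is a standard errno number (12 is ENOMEM on both FreeBSD and Linux). The helper names are made up for illustration and are not part of the mlx4 driver.

```python
import errno

def pages_for_order(order: int) -> int:
    """Number of pages in an allocation of the given order (2**order)."""
    return 1 << order

def errno_name(err: int) -> str:
    """Symbolic name for a negative kernel-style error code such as -12."""
    return errno.errorcode.get(-err, "UNKNOWN")

# "Failed to allocate mtts for 1024 pages(order 10)"
print(pages_for_order(10))   # 1024
# "Failed to initialize event queue table (err=-12)"
print(errno_name(-12))       # ENOMEM
```

Read this way, the event queue setup fails because an MTT allocation request returns an out-of-memory error; the sketch does not say why the allocation fails with 13 or more vCPUs.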
  

3. The VF can't ping another VM's VF on the same host, and it can't ping the PF on the same host either.

On the same host,
    Windows VM <-> Windows VM
and 
    Windows VM <-> Linux VM
are both OK.

Only FreeBSD VM <-> Windows/Linux VM doesn't work.

I suspect something is wrong or missing in the mlx4 VF driver in FreeBSD.


4. I got the messages below when Live Migration failed. It seems the VF’s detach method couldn’t finish successfully.

Jan 11 19:16:43 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:16:43 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode FREE_RES (0xf01)
Jan 11 19:16:43 decui-b11 kernel: mlx4_core0: Failed to free mtt range at:5937 order:0
Jan 11 19:16:54 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:16:54 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode CLOSE_PORT (0xa)
Jan 11 19:18:04 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:18:04 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode FREE_RES (0xf01)
Jan 11 19:19:14 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:19:14 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode QP_FLOW_STEERING_DETACH (0x66)
Jan 11 19:19:14 decui-b11 kernel: mlx4_core0: Fail to detach network rule. registration id = 0x9000000000002
Jan 11 19:20:24 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:20:24 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode QP_FLOW_STEERING_DETACH (0x66)
Jan 11 19:20:24 decui-b11 kernel: mlx4_core0: Fail to detach network rule. registration id = 0x9000000000003
Jan 11 19:21:34 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:21:34 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode QP_FLOW_STEERING_DETACH (0x66)
Jan 11 19:21:34 decui-b11 kernel: mlx4_core0: Fail to detach network rule. registration id = 0x9000000000004
Jan 11 19:22:46 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:22:46 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode QP_FLOW_STEERING_DETACH (0x66)
Jan 11 19:22:46 decui-b11 kernel: mlx4_core0: Fail to detach network rule. registration id = 0x9000000000005
Jan 11 19:23:56 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:23:56 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode QP_FLOW_STEERING_DETACH (0x66)
Jan 11 19:23:56 decui-b11 kernel: mlx4_core0: Fail to detach network rule. registration id = 0x9000000000006
Jan 11 19:25:06 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:25:06 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode QP_FLOW_STEERING_DETACH (0x66)
Jan 11 19:25:06 decui-b11 kernel: mlx4_core0: Fail to detach network rule. registration id = 0x9000000000007
Jan 11 19:26:16 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:26:16 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode SET_MCAST_FLTR (0x48)
Jan 11 19:27:26 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:27:26 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode FREE_RES (0xf01)
Jan 11 19:27:26 decui-b11 kernel: mlx4_core0: Failed to free icm of qp:2279
Jan 11 19:28:36 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:28:36 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode FREE_RES (0xf01)
Jan 11 19:28:36 decui-b11 kernel: mlx4_core0: Failed to release qp range base:2279 cnt:1
Jan 11 19:29:46 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:29:46 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode 2RST_QP (0x21)
Jan 11 19:30:56 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:30:56 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode HW2SW_CQ (0x17)
Jan 11 19:30:56 decui-b11 kernel: mlx4_core0: HW2SW_CQ failed (-35) for CQN 0000b5
Jan 11 19:32:06 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)
Jan 11 19:32:06 decui-b11 kernel: mlx4_core0: failed execution of VHCR_POST commandopcode FREE_RES (0xf01)
Jan 11 19:32:06 decui-b11 kernel: mlx4_core0: Failed freeing cq:181

More info about issue 4:

In the case of Live Migration, it looks like the host forcibly rescinds the VF without sending the PCI_EJECT message to the VM. The current Mellanox VF driver in FreeBSD apparently can’t handle this case (i.e. the VF device disappearing suddenly) and always hangs on command timeouts, because at that point the host denies the VM’s access to the VF.
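
The timeout pattern can be read directly off the FreeBSD log above. The following is a rough Python sketch, not part of any driver or tool: it measures the gap between consecutive "Comm channel is not idle" messages. The function name and the `year` parameter are assumptions for illustration (syslog lines omit the year; 2017 matches the date of this report).

```python
from datetime import datetime

def timeout_gaps(log_lines, marker="Comm channel is not idle", year=2017):
    """Seconds between consecutive syslog lines that contain marker."""
    stamps = [
        # The first 15 characters are the syslog timestamp, e.g. "Jan 11 19:18:04".
        datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        for line in log_lines
        if marker in line
    ]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]

# Three lines copied from the log above.
sample = [
    "Jan 11 19:18:04 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)",
    "Jan 11 19:19:14 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)",
    "Jan 11 19:20:24 decui-b11 kernel: mlx4_core0: mlx4_comm_cmd_wait: Comm channel is not idle. My toggle is 0 (op: 0x5)",
]
print(timeout_gaps(sample))  # [70, 70]
```

Most of the failed commands in the log are spaced about 70 seconds apart, which is consistent with each command waiting for a timeout before the detach path moves on to the next one.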

BTW, the VF driver in a Linux VM doesn’t hang and Live Migration appears to work, but the driver also prints these scary messages:

Jan 26 02:40:06 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: Internal error detected on the communication channel
Jan 26 02:40:06 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: device is going to be reset
Jan 26 02:40:06 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: VF reset is not needed
Jan 26 02:40:06 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: device was reset successfully
Jan 26 02:40:06 decui-lin-vm kernel: mlx4_en 99bb:00:02.0: Internal error detected, restarting device
Jan 26 02:40:06 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: command 0x5 failed: fw status = 0x1
Jan 26 02:40:06 decui-lin-vm kernel: hv_netvsc vmbus_16 eth1: VF down: enP39355p0s2
Jan 26 02:40:06 decui-lin-vm kernel: hv_netvsc vmbus_16 eth1: Data path switched from VF: enP39355p0s2
Jan 26 02:40:06 decui-lin-vm kernel: hv_netvsc vmbus_16 eth1: VF unregistering: enP39355p0s2

Jan 26 02:40:07 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: Failed to close slave function
Jan 26 02:40:07 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: Detected virtual function - running in slave mode
Jan 26 02:40:37 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: recovering from previously mis-behaved VM
Jan 26 02:41:07 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: Communication channel is offline.
Jan 26 02:41:07 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: PF is not responsive, skipping initialization
Jan 26 02:41:07 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: Failed to initialize slave
Jan 26 02:41:07 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: mlx4_restart_one: ERROR: mlx4_load_one failed, pci_name=99bb:00:02.0, err=-5
Jan 26 02:41:07 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: mlx4_restart_one was ended, ret=-5
Jan 26 02:41:07 decui-lin-vm kernel: mlx4_core 99bb:00:02.0: mlx4_remove_one: interface is down

I think we need to port at least this patch from Linux to FreeBSD:
“net/mlx4_core: Enable device recovery flow with SRIOV” (https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=55ad359225b2232b9b8f04a0dfa169bd3a7d86d2)
Comment 1 Dexuan Cui 2017-01-26 13:13:45 UTC
To reproduce the 4 issues, we need to use today's HEAD code or newer:

r312690 | dexuan | 2017-01-24 17:27:13 +0800 (Tue, 24 Jan 2017) | 8 lines
hyperv/hn: add devctl_notify for VF_UP/DOWN events

and manually apply the 2 patches mentioned in the bug report:
https://reviews.freebsd.org/D8867
https://reviews.freebsd.org/D8868
Comment 2 Dexuan Cui 2017-02-13 02:33:38 UTC
Updates:

hselasky committed the 2 patches (D8867, D8868) to HEAD last Friday.

Issue 1: No update.

Issue 2: 
Actually, the Linux version of the driver has the same (or a similar) issue, and I reported it here:
https://www.spinics.net/lists/netdev/msg420136.html

Some people suspected that it failed to allocate a UAR, but after we increased LOG_BAR_SIZE with mlxconfig from 3 (8MB) to 5 (32MB), the issue was still there.
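
As a quick check of the two values quoted above, here is a small Python sketch. It assumes that the BAR size in MB is 2 raised to LOG_BAR_SIZE, a relationship inferred only from the two data points in this comment (3 gives 8MB, 5 gives 32MB); the function name is hypothetical and the mlxconfig documentation remains the authority.

```python
def bar_size_mb(log_bar_size: int) -> int:
    """BAR size in MB, assuming size = 2**LOG_BAR_SIZE MB (inferred, not documented here)."""
    return 1 << log_bar_size

for value in (3, 5):
    print(f"LOG_BAR_SIZE={value} -> {bar_size_mb(value)}MB")
```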

We'll continue to work on this.

Issue 3: No update.

Issue 4: 
It looks like this may be a host-side issue: on Live Migration, the host forcibly and suddenly removes the VF from the guest. Working on this.
Meanwhile, the VF driver should be made more robust so that it can cope with this scenario.
Comment 3 Dexuan Cui 2017-02-17 05:33:05 UTC
(In reply to Dexuan Cui from comment #2)
Updates:
Issue 2:
	A Linux patch has been prepared (not posted yet), and we'll need to port it:
	https://www.spinics.net/lists/netdev/msg421306.html

Issue 4:
	We may need to port more patches that were mentioned here:
	https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1650058/comments/7

"
Mellanox has told me that the following three commits are needed for SR-IOV in Azure:
1.	d585df1c5ccf net/mlx4_core: Avoid command timeouts during VF driver device shutdown
2.	7c3945bc2073 net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
3.	291c566a2891 net/mlx4_core: Fix racy CQ (Completion Queue) free
"

and the 4th patch:
commit 0cd9302734111abc0b5912b695336f2ee63cb22b
net/mlx4_core: Reset flow activation upon SRIOV fatal command cases
Comment 4 Dexuan Cui 2017-02-22 02:00:51 UTC
(In reply to Dexuan Cui from comment #3)
One more patch is needed:
6496bbf0ec48 net/mlx4_en: Fix bad WQE issue
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1650058/comments/8)
Comment 6 Hans Petter Selasky freebsd_committer 2017-05-05 12:05:14 UTC
Ping - any news on this issue? Have you ported more patches from Linux to FreeBSD?
Comment 7 Dexuan Cui 2017-05-05 12:46:13 UTC
(In reply to Hans Petter Selasky from comment #6)
A bunch of Linux patches need to be ported to FreeBSD, and full testing is required.
Comment 8 Hans Petter Selasky freebsd_committer 2017-05-05 12:59:26 UTC
Do you know which release of Linux has all the needed MLX patches?
Comment 9 Dexuan Cui 2017-05-05 13:02:12 UTC
(In reply to Hans Petter Selasky from comment #8)
I would check https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
Comment 10 commit-hook freebsd_committer 2017-06-01 10:39:34 UTC
A commit references this bug:

Author: hselasky
Date: Thu Jun  1 10:39:00 UTC 2017
New revision: 319413
URL: https://svnweb.freebsd.org/changeset/base/319413

Log:
  Free hardware queue resource after port is stopped in the mlx4en(4)
  driver. Else if the port is up the resource might still be busy and
  the MTT free will fail.

  PR:			216493
  MFC after:		3 days
  Sponsored by:		Mellanox Technologies

Changes:
  head/sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c
Comment 11 commit-hook freebsd_committer 2017-06-01 10:45:41 UTC
A commit references this bug:

Author: hselasky
Date: Thu Jun  1 10:44:48 UTC 2017
New revision: 319414
URL: https://svnweb.freebsd.org/changeset/base/319414

Log:
  Allow communication between functions on the same host when using the
  mlx4en(4) driver in SRIOV mode.

  Place a copy of the destination MAC address in the send WQE only under
  SRIOV/eSwitch configuration or when the device is in selftest. This
  allows communication between functions on the same host.

  PR:			216493
  MFC after:		3 days
  Sponsored by:		Mellanox Technologies

Changes:
  head/sys/dev/mlx4/mlx4_en/mlx4_en_tx.c
Comment 12 Hans Petter Selasky freebsd_committer 2017-06-01 10:46:26 UTC
Hi,

There has been a series of fixes to mlx4en(4) in 12-current regarding SRIOV support.

Can you re-test and report which issues are still remaining?

--HPS
Comment 13 Dexuan Cui 2017-06-01 10:56:12 UTC
(In reply to Hans Petter Selasky from comment #12)
Thanks, HPS! Will do.
Comment 14 commit-hook freebsd_committer 2017-06-04 08:25:53 UTC
A commit references this bug:

Author: hselasky
Date: Sun Jun  4 08:25:29 UTC 2017
New revision: 319563
URL: https://svnweb.freebsd.org/changeset/base/319563

Log:
  MFC r319414:
  Allow communication between functions on the same host when using the
  mlx4en(4) driver in SRIOV mode.

  Place a copy of the destination MAC address in the send WQE only under
  SRIOV/eSwitch configuration or when the device is in selftest. This
  allows communication between functions on the same host.

  PR:		216493
  Approved by:	re (kib)
  Sponsored by:	Mellanox Technologies

Changes:
_U  stable/11/
  stable/11/sys/ofed/drivers/net/mlx4/en_tx.c
Comment 15 commit-hook freebsd_committer 2017-06-04 08:29:59 UTC
A commit references this bug:

Author: hselasky
Date: Sun Jun  4 08:29:17 UTC 2017
New revision: 319564
URL: https://svnweb.freebsd.org/changeset/base/319564

Log:
  MFC r319414:
  Allow communication between functions on the same host when using the
  mlx4en(4) driver in SRIOV mode.

  Place a copy of the destination MAC address in the send WQE only under
  SRIOV/eSwitch configuration or when the device is in selftest. This
  allows communication between functions on the same host.

  PR:			216493
  Sponsored by:		Mellanox Technologies

Changes:
_U  stable/10/
  stable/10/sys/ofed/drivers/net/mlx4/en_tx.c
Comment 16 commit-hook freebsd_committer 2017-06-04 08:31:01 UTC
A commit references this bug:

Author: hselasky
Date: Sun Jun  4 08:30:55 UTC 2017
New revision: 319565
URL: https://svnweb.freebsd.org/changeset/base/319565

Log:
  MFC r319414:
  Allow communication between functions on the same host when using the
  mlx4en(4) driver in SRIOV mode.

  Place a copy of the destination MAC address in the send WQE only under
  SRIOV/eSwitch configuration or when the device is in selftest. This
  allows communication between functions on the same host.

  PR:			216493
  Sponsored by:		Mellanox Technologies

Changes:
_U  stable/9/sys/
  stable/9/sys/ofed/drivers/net/mlx4/en_tx.c
Comment 17 commit-hook freebsd_committer 2017-06-04 08:46:15 UTC
A commit references this bug:

Author: hselasky
Date: Sun Jun  4 08:45:14 UTC 2017
New revision: 319566
URL: https://svnweb.freebsd.org/changeset/base/319566

Log:
  MFC r319413:
  Free hardware queue resource after port is stopped in the mlx4en(4)
  driver. Else if the port is up the resource might still be busy and
  the MTT free will fail.

  PR:		216493
  Approved by:	re (kib)
  Sponsored by:	Mellanox Technologies

Changes:
_U  stable/11/
  stable/11/sys/ofed/drivers/net/mlx4/en_netdev.c
Comment 18 commit-hook freebsd_committer 2017-06-04 08:47:18 UTC
A commit references this bug:

Author: hselasky
Date: Sun Jun  4 08:47:09 UTC 2017
New revision: 319567
URL: https://svnweb.freebsd.org/changeset/base/319567

Log:
  MFC r319413:
  Free hardware queue resource after port is stopped in the mlx4en(4)
  driver. Else if the port is up the resource might still be busy and
  the MTT free will fail.

  PR:			216493
  Sponsored by:		Mellanox Technologies

Changes:
_U  stable/10/
  stable/10/sys/ofed/drivers/net/mlx4/en_netdev.c
Comment 19 commit-hook freebsd_committer 2017-06-04 08:49:22 UTC
A commit references this bug:

Author: hselasky
Date: Sun Jun  4 08:48:27 UTC 2017
New revision: 319568
URL: https://svnweb.freebsd.org/changeset/base/319568

Log:
  MFC r319413:
  Free hardware queue resource after port is stopped in the mlx4en(4)
  driver. Else if the port is up the resource might still be busy and
  the MTT free will fail.

  PR:			216493
  Sponsored by:		Mellanox Technologies

Changes:
_U  stable/9/sys/
  stable/9/sys/ofed/drivers/net/mlx4/en_netdev.c
Comment 20 Dexuan Cui 2017-07-03 17:58:31 UTC
(In reply to commit-hook from comment #19)
The first 3 issues have all been fixed, and the patches are in HEAD, stable/11 and stable/10.

Let's close this bug and open a new bug for the live migration issue only.