Summary: | FreeBSD VM on Hyper-V/Azure can't properly detect SCSI disk da2 | ||||||
---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Dexuan Cui <decui> | ||||
Component: | kern | Assignee: | Dexuan Cui <dexuan> | ||||
Status: | Closed FIXED | ||||||
Severity: | Affects Some People | CC: | decui, koobs, preetpal.sohal, virtualization | ||||
Priority: | --- | Keywords: | patch | ||||
Version: | CURRENT | Flags: | koobs: mfc-stable11+, koobs: mfc-stable10+ | ||||
Hardware: | Any | ||||||
OS: | Any | ||||||
URL: | https://reviews.freebsd.org/D14690 | ||||||
Attachments: |
|
Description
Dexuan Cui
2018-03-13 22:38:47 UTC
When the issue happens, the Azure host reports da2 as an SPC-2 device (in the VM, 'dmesg | grep "Hyper-V Version"' shows "Hyper-V Version: 10.0.14393 [SP0]"). I tried to reproduce the bug on local Hyper-V hosts, which always report da2 as an SPC-3 device, but I could not.

So far, I can only reproduce the bug with some specific VM sizes, e.g. Standard D4 v2 (8 vcpus, 28 GB memory) in the West Central US region. Note: some people have reported that other VM sizes can also reproduce the bug.

It looks like all the affected VM sizes use HDD-backed disks. If we create a VM with SSD-backed disks, the bug does not reproduce. And with an HDD-backed disk, if we avoid attaching the disk to LUN0, the bug does not seem to reproduce either.

So, two possible mitigations are:
a) use a VM size that supports SSD-backed disks;
b) attach any extra data disk to a LUN whose number is not zero.

Created attachment 191475
The patch fixes the bug for me. I am working on testing it fully.
Posted a patch for review: https://reviews.freebsd.org/D14690

(In reply to Dexuan Cui from comment #3)
I manually tested the patch with releng/{10.3, 10.4, 11.0, 11.1}, stable/{10, 11}, and master on local Windows Server 2012 R2, 2016 and Azure, and I did not hit any regression.

For Azure testing, I created multiple VMs (Standard D4 v2 (8 vcpus, 28 GB memory)) in the West Central US region and ran a lot of "hot add/remove disks" tests, e.g.
a) hot add and remove a disk on LUN0
b) hot add and remove a disk on LUN1
c) hot add and remove a disk on LUN9
d) hot add and remove a disk on LUN0, and LUN9
e) hot add and remove a disk on LUN1, and LUN9
f) hot add and remove a disk on LUN0~7
g) hot add and remove a disk on LUN0~6, LUN8
h) hot add and remove a disk on LUN1, 9, 10, 11, 12, 13, 14, 15
i) hot add and remove a disk on LUN0, 9, 10, 11, 12, 15, 18, 63
I didn't find any issue.

Hongxiong also helped to test it on 2012 R2 and 2016. No issue. Hongxiong is going to run more test cases on Azure.

I encountered this bug using an A1 sized VM in Azure (you can probably change the importance level in the bug tracker). This bug seems to be affecting other users on Azure as well (see https://forums.freebsd.org/threads/disk-gets-detached-then-periph-destroyed-according-to-var-run-dmesg-boot-azure-data-disks.61266/#post-381342).

A commit references this bug:

Author: dexuan
Date: Tue Apr 10 18:05:03 UTC 2018
New revision: 332385
URL: https://svnweb.freebsd.org/changeset/base/332385

Log:
hyperv/storvsc: storvsc_io_done(): do not use CAM_SEL_TIMEOUT

CAM_SEL_TIMEOUT was introduced in https://reviews.freebsd.org/D7521 (r304251), which claimed:

"VM shall response to CAM layer with CAM_SEL_TIMEOUT to filter those invalid LUNs. Never use CAM_DEV_NOT_THERE which will block LUN scan for LUN number higher than 7."
But it turns out this is not correct: I think what really filters the invalid LUNs in r304251 is this: before r304251, we could set CAM_REQ_CMP without checking vm_srb->srb_status at all: ccb->ccb_h.status |= CAM_REQ_CMP. r304251 checks vm_srb->srb_status and sets ccb->ccb_h.status properly, so the invalid LUNs are filtered.

I changed my code version to r304251 but replaced CAM_SEL_TIMEOUT with CAM_DEV_NOT_THERE, and I confirmed the invalid LUNs can also be filtered, and I successfully hot-added and hot-removed 8 disks to/from the VM without any issue.

CAM_SEL_TIMEOUT has an unwanted side effect -- see cam_periph_error():
For a selection timeout, we consider all of the LUNs on the target to be gone. If the status is CAM_DEV_NOT_THERE, then we only get rid of the device(s) specified by the path in the original CCB.

This means: for a VM with a valid LUN on 3:0:0:0, when the VM inquires 3:0:0:1 and the host reports 3:0:0:1 doesn't exist and storvsc returns CAM_SEL_TIMEOUT to the CAM layer, CAM will detach 3:0:0:0 as well: this is the bug I reported recently: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226583

PR: 226583
Reviewed by: mav
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D14690

Changes:
head/sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c

A commit references this bug:

Author: dexuan
Date: Tue Apr 24 03:06:05 UTC 2018
New revision: 332903
URL: https://svnweb.freebsd.org/changeset/base/332903

Log:
MFC: 332385

r332385: hyperv/storvsc: storvsc_io_done(): do not use CAM_SEL_TIMEOUT

PR: 226583
Reviewed by: mav
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D14690

Changes:
_U stable/11/
stable/11/sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c

A commit references this bug:

Author: dexuan
Date: Tue Apr 24 03:07:49 UTC 2018
New revision: 332904
URL: https://svnweb.freebsd.org/changeset/base/332904

Log:
MFC: 332385

r332385: hyperv/storvsc: storvsc_io_done(): do not use CAM_SEL_TIMEOUT

PR: 226583
Reviewed by: mav
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D14690

Changes:
_U stable/10/
stable/10/sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c |
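
For readers who want to see the side effect from the r332385 log in isolation, here is a rough user-space sketch. It is not the storvsc or CAM code: every name in it (toy_status, toy_target, toy_map_srb, toy_handle_status, toy_count, TOY_MAX_LUN) is hypothetical, and it only models the two behaviours the log describes, assuming a single target with a handful of LUNs. A selection timeout drops every LUN on the target, while "device not there" drops only the LUN named in the failing request.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the CAM status codes named in the log. */
enum toy_status { TOY_REQ_CMP, TOY_DEV_NOT_THERE, TOY_SEL_TIMEOUT };

#define TOY_MAX_LUN 8

/* One SCSI target; present[i] is true while LUN i is attached. */
struct toy_target {
	bool present[TOY_MAX_LUN];
};

/*
 * Simplified model of the driver's completion path: a request that the
 * host completed successfully maps to TOY_REQ_CMP, anything else maps to
 * whichever "miss" status the driver chooses to report upward.
 */
static enum toy_status
toy_map_srb(bool srb_ok, enum toy_status miss)
{
	return srb_ok ? TOY_REQ_CMP : miss;
}

/*
 * Simplified model of the error handling the log attributes to
 * cam_periph_error(): a selection timeout removes all LUNs on the target,
 * "device not there" removes only the LUN the request was addressed to.
 */
static void
toy_handle_status(struct toy_target *t, int lun, enum toy_status st)
{
	switch (st) {
	case TOY_SEL_TIMEOUT:
		for (int i = 0; i < TOY_MAX_LUN; i++)
			t->present[i] = false;
		break;
	case TOY_DEV_NOT_THERE:
		t->present[lun] = false;
		break;
	case TOY_REQ_CMP:
		break;
	}
}

/* Number of LUNs still attached to the target. */
static int
toy_count(const struct toy_target *t)
{
	int n = 0;

	for (int i = 0; i < TOY_MAX_LUN; i++)
		if (t->present[i])
			n++;
	return n;
}
```

Starting from a target with only LUN0 attached and probing the nonexistent LUN1, toy_handle_status(&t, 1, toy_map_srb(false, TOY_SEL_TIMEOUT)) leaves toy_count(&t) at 0, so the valid disk on LUN0 is lost. The same call with TOY_DEV_NOT_THERE leaves it at 1, which matches the behaviour the patch is after.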