(The bug was previously reported at https://www.mail-archive.com/freebsd-virtualization@freebsd.org/msg05792.html)

Since Jan 2018, some FreeBSD VMs can't properly detect HDD-backed disks. The issue also happens on the latest 12-CURRENT.

da0 (0:0:0:0) is also backed by HDD, but somehow is not affected. da1 (1:0:1:0) is not affected either; only the extra disks, e.g. da2 (3:0:0:0), are affected.

These are the logs of my VM, which has da0, da1 and da2. These 3 disks are on 3 different disk controllers.

[root@decui-d4-repro ~]# dmesg -c > /dev/null
[root@decui-d4-repro ~]# camcontrol rescan all
Re-scan of bus 0 was successful
Re-scan of bus 1 was successful
Re-scan of bus 2 was successful
Re-scan of bus 3 was successful
Re-scan of bus 4 was successful
Re-scan of bus 5 was successful
[root@decui-d4-repro ~]# dmesg
(aprobe0:ata0:0:0:0): Disable ATA disks on simulated ATA controller (0x71118086)
(aprobe0:ata0:0:1:0): Disable ATA disks on simulated ATA controller (0x71118086)
(probe1:(probe0:blkvsc0:0:1:0): Down reving Protocol Version from 4 to 2?
blkvsc0:0:0:0): storvsc inquiry (36) [0 0 4 2 1f ... ]
(probe1:blkvsc0:0:1:0): storvsc inquiry (36) [7f 0 5 2 1f ... ]
(probe0:blkvsc0:0:0:0): storvsc scsi_status = 2, srb_status = 6
(probe0:(probe1:blkvsc0:0:0:0): storvsc inquiry (5) [0 0 0 5 0 ... ]
(probe0:blkvsc0:0:1:1): Down reving Protocol Version from 4 to 2?
blkvsc0:0:0:0): storvsc inquiry (6) [0 83 0 1c 1 ... ]
(probe1:blkvsc0:0:1:1): invalid LUN 1 for op: INQUIRY
(probe0:blkvsc0:0:0:1): invalid LUN 1 for op: INQUIRY
(probe0:blkvsc1:0:0:0): Down reving Protocol Version from 4 to 2?
(probe0:blkvsc1:0:0:0): storvsc inquiry (36) [7f 0 5 2 1f ... ]
(probe1:(probe0:blkvsc1:0:1:0): storvsc inquiry (36) [0 0 5 2 1f ... ]
(probe1:blkvsc1:0:0:1): Down reving Protocol Version from 4 to 2?
blkvsc1:0:1:0): storvsc scsi_status = 2, srb_status = 6
(probe1:blkvsc1:0:1:0): storvsc inquiry (5) [0 0 0 8 0 ... ]
(probe0:blkvsc1:0:0:1): invalid LUN 1 for op: INQUIRY
(probe1:blkvsc1:0:1:0): storvsc inquiry (6) [0 83 0 30 1 ... ]
(probe0:storvsc2:0:0:0): Down reving Protocol Version from 4 to 2?
(probe1:(probe0:storvsc2:0:1:0): Down reving Protocol Version from 4 to 2?
storvsc2:0:0:0): storvsc inquiry (36) [7f 0 5 2 1f ... ]
(probe1:(probe0:storvsc2:0:1:0): storvsc inquiry (36) [7f 0 5 2 1f ... ]
(probe1:storvsc2:0:0:1): Down reving Protocol Version from 4 to 2?
(probe0:storvsc2:0:1:1): Down reving Protocol Version from 4 to 2?
(probe1:storvsc2:0:0:1): invalid LUN 1 for op: INQUIRY
storvsc2:0:1:1): invalid LUN 1 for op: INQUIRY
(probe0:storvsc3:0:0:0): Down reving Protocol Version from 4 to 2?
(probe1:(probe0:storvsc3:0:1:0): Down reving Protocol Version from 4 to 2?
storvsc3:0:0:0): storvsc inquiry (36) [0 0 4 2 1f ... ]
(probe1:(probe0:storvsc3:0:1:0): storvsc inquiry (36) [7f 0 5 2 1f ... ]
storvsc3:0:0:0): storvsc scsi_status = 2, srb_status = 6
(probe0:(probe1:storvsc3:0:0:0): storvsc inquiry (5) [0 0 0 5 0 ... ]
(probe0:storvsc3:0:1:1): Down reving Protocol Version from 4 to 2?
storvsc3:0:0:0): storvsc inquiry (6) [0 83 0 1c 1 ... ]
(probe1:storvsc3:0:1:1): invalid LUN 1 for op: INQUIRY
GEOM: new disk da2
(probe0:pass3 at storvsc3 bus 0 scbus5 target 0 lun 0
storvsc3:0:0:1): invalid LUN 1 for op: INQUIRY
pass3: <Msft Virtual Disk 1.0> Fixed Direct Access SPC-2 SCSI device
pass3: 300.000MB/s transfers
pass3: Command Queueing enabled
pass3 at storvsc3 bus 0 scbus5 target 0 lun 0
pass3: <Msft Virtual Disk 1.0> detached
(da2:storvsc3:0:0:0): fatal error, could not acquire reference count
(pass3:(da2:storvsc3:0:storvsc3:0:0:0): storvsc inquiry (5) [0 b1 0 3c 0 ... ]
0:0): Periph destroyed
da2: Delete methods: <NONE(*),ZERO>
Opened disk da2 -> 6
g_access(918): provider da2 has error
g_access(918): provider da2 has error
g_access(918): provider da2 has error
g_access(918): provider da2 has error

According to the log, it looks like da2 is indeed detected at first, but then it's detached immediately:

pass3: <Msft Virtual Disk 1.0> detached
When the issue happens, the Azure host (in the VM, dmesg | grep "Hyper-V Version" shows "Hyper-V Version: 10.0.14393 [SP0]") reports da2 as an SPC-2 device. I tried to reproduce the bug on local Hyper-V hosts (which always report da2 as an SPC-3 device), but I couldn't repro it.

So far, I can only reproduce the bug with some specific VM sizes, e.g. Standard D4 v2 (8 vcpus, 28 GB memory) in the West Central US region. Note: some people reported that other VM sizes can also reproduce the bug.

It looks like all the affected VM sizes use HDD-backed disks. If we create a VM with SSD-backed disks, the bug doesn't repro. And with an HDD-backed disk, if we avoid attaching the disk to LUN 0, it looks like the bug doesn't repro either. So, two possible mitigations are:
a) use a VM size that supports SSD-backed disks;
b) attach any extra data disk to a LUN whose number is not zero.
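For reference, the SPC level mentioned above can be read directly from the "storvsc inquiry (36) [...]" bytes in the dmesg log: byte 2 of standard INQUIRY data is the VERSION field (0x04 = SPC-2, 0x05 = SPC-3), and byte 0 is the peripheral qualifier/device type (0x00 = direct-access device present, 0x7f = no device at that nexus). Below is a minimal decoding sketch; it is illustrative only, and the helper names are made up for this example, not taken from the driver.

/*
 * Illustrative sketch: decode the first bytes of the INQUIRY responses
 * that appear in the dmesg log above. Byte 2 is the SCSI VERSION field,
 * byte 0 is the peripheral qualifier/device type.
 */
#include <stdio.h>

static const char *spc_level(unsigned char version)
{
    switch (version) {
    case 0x04: return "SPC-2";
    case 0x05: return "SPC-3";
    case 0x06: return "SPC-4";
    default:   return "unknown/other";
    }
}

int main(void)
{
    /* First bytes of two INQUIRY responses copied from the log above. */
    unsigned char da2_on_azure[] = { 0x00, 0x00, 0x04, 0x02, 0x1f }; /* real disk at LUN 0 */
    unsigned char empty_nexus[]  = { 0x7f, 0x00, 0x05, 0x02, 0x1f }; /* no disk behind this target/LUN */

    printf("da2 on Azure: VERSION=0x%02x -> %s\n",
        (unsigned)da2_on_azure[2], spc_level(da2_on_azure[2]));
    printf("empty nexus:  qualifier/type=0x%02x (no device), VERSION=0x%02x -> %s\n",
        (unsigned)empty_nexus[0], (unsigned)empty_nexus[2], spc_level(empty_nexus[2]));
    return 0;
}

With the bytes logged on the affected Azure VM, da2 decodes to SPC-2, matching what the host reports; on local Hyper-V the same disk is reported as SPC-3.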
Created attachment 191475
The attached patch fixes the bug for me. I'm working to test it fully.
Posted a patch for review: https://reviews.freebsd.org/D14690
(In reply to Dexuan Cui from comment #3)
I manually tested the patch with releng/{10.3, 10.4, 11.0, 11.1}, stable/{10, 11}, and master on local Windows Server 2012 R2, 2016 and Azure, and I didn't hit any regression.

For Azure testing, I created multiple VMs (Standard D4 v2, 8 vcpus, 28 GB memory) in the West Central US region and ran a lot of "hot add/remove disks" tests, e.g.:
a) hot add and remove a disk on LUN0
b) hot add and remove a disk on LUN1
c) hot add and remove a disk on LUN9
d) hot add and remove a disk on LUN0 and LUN9
e) hot add and remove a disk on LUN1 and LUN9
f) hot add and remove a disk on LUN0~7
g) hot add and remove a disk on LUN0~6 and LUN8
h) hot add and remove a disk on LUN 1, 9, 10, 11, 12, 13, 14, 15
i) hot add and remove a disk on LUN 0, 9, 10, 11, 12, 15, 18, 63

I didn't find any issue. Hongxiong also helped to test it on 2012 R2 and 2016; no issue. Hongxiong is going to run more test cases on Azure.
I encountered this bug using an A1-sized VM in Azure (you can probably change the importance level in the bug tracker). This bug seems to be affecting other users on Azure as well (see https://forums.freebsd.org/threads/disk-gets-detached-then-periph-destroyed-according-to-var-run-dmesg-boot-azure-data-disks.61266/#post-381342).
A commit references this bug:

Author: dexuan
Date: Tue Apr 10 18:05:03 UTC 2018
New revision: 332385
URL: https://svnweb.freebsd.org/changeset/base/332385

Log:
  hyperv/storvsc: storvsc_io_done(): do not use CAM_SEL_TIMEOUT

  CAM_SEL_TIMEOUT was introduced in https://reviews.freebsd.org/D7521
  (r304251), which claimed: "VM shall response to CAM layer with
  CAM_SEL_TIMEOUT to filter those invalid LUNs. Never use CAM_DEV_NOT_THERE
  which will block LUN scan for LUN number higher than 7."

  But it turns out this is not correct: I think what really filters the
  invalid LUNs in r304251 is that, before r304251, we could set CAM_REQ_CMP
  without checking vm_srb->srb_status at all: ccb->ccb_h.status |= CAM_REQ_CMP.
  r304251 checks vm_srb->srb_status and sets ccb->ccb_h.status properly, so
  the invalid LUNs are filtered.

  I changed my code version to r304251 but replaced the CAM_SEL_TIMEOUT with
  CAM_DEV_NOT_THERE, and I confirmed the invalid LUNs can also be filtered,
  and I successfully hot-added and hot-removed 8 disks to/from the VM without
  any issue.

  CAM_SEL_TIMEOUT has an unwanted side effect -- see cam_periph_error():
  for a selection timeout, we consider all of the LUNs on the target to be
  gone. If the status is CAM_DEV_NOT_THERE, then we only get rid of the
  device(s) specified by the path in the original CCB.

  This means: for a VM with a valid LUN on 3:0:0:0, when the VM inquires
  3:0:0:1 and the host reports 3:0:0:1 doesn't exist and storvsc returns
  CAM_SEL_TIMEOUT to the CAM layer, CAM will detach 3:0:0:0 as well: this is
  the bug I reported recently:
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226583

  PR:             226583
  Reviewed by:    mav
  MFC after:      1 week
  Sponsored by:   Microsoft
  Differential Revision:  https://reviews.freebsd.org/D14690

Changes:
  head/sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
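To make the cam_periph_error() point above concrete, here is a tiny self-contained model of the two policies. This is only a sketch of the behavior described in the commit message; the names MODEL_SEL_TIMEOUT, MODEL_DEV_NOT_THERE and handle_probe_failure() are invented for illustration and are not CAM or storvsc identifiers.

/*
 * Model (not the actual driver/CAM code) of the side effect described above:
 * when the host says a probed LUN doesn't exist, completing the CCB with
 * CAM_SEL_TIMEOUT makes the error-recovery path treat every LUN on that
 * target as gone, while CAM_DEV_NOT_THERE only invalidates the single LUN
 * named in the CCB's path.
 */
#include <stdbool.h>
#include <stdio.h>

enum cam_status_model { MODEL_SEL_TIMEOUT, MODEL_DEV_NOT_THERE };

#define NLUN 2

/* Invalidate LUNs on a target the way cam_periph_error() is described above. */
static void handle_probe_failure(bool lun_present[NLUN], int probed_lun,
    enum cam_status_model status)
{
    if (status == MODEL_SEL_TIMEOUT) {
        /* Selection timeout: all LUNs on the target are considered gone. */
        for (int l = 0; l < NLUN; l++)
            lun_present[l] = false;
    } else {
        /* CAM_DEV_NOT_THERE: only the LUN addressed by the CCB goes away. */
        lun_present[probed_lun] = false;
    }
}

int main(void)
{
    /* Target 3:0:0: LUN 0 has a real disk (da2), LUN 1 does not exist. */
    bool with_sel_timeout[NLUN]   = { true, false };
    bool with_dev_not_there[NLUN] = { true, false };

    /* The probe of the nonexistent LUN 1 fails under each policy. */
    handle_probe_failure(with_sel_timeout, 1, MODEL_SEL_TIMEOUT);
    handle_probe_failure(with_dev_not_there, 1, MODEL_DEV_NOT_THERE);

    printf("CAM_SEL_TIMEOUT:   LUN 0 %s (the reported bug: da2 detached)\n",
        with_sel_timeout[0] ? "survives" : "is gone");
    printf("CAM_DEV_NOT_THERE: LUN 0 %s (the fix)\n",
        with_dev_not_there[0] ? "survives" : "is gone");
    return 0;
}

The actual change is in storvsc_io_done(): when the host reports that the probed LUN doesn't exist, the CCB is now completed with CAM_DEV_NOT_THERE instead of CAM_SEL_TIMEOUT, so only the nonexistent LUN is removed and the disk on LUN 0 survives.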
A commit references this bug:

Author: dexuan
Date: Tue Apr 24 03:06:05 UTC 2018
New revision: 332903
URL: https://svnweb.freebsd.org/changeset/base/332903

Log:
  MFC: 332385

  r332385:
  hyperv/storvsc: storvsc_io_done(): do not use CAM_SEL_TIMEOUT

  CAM_SEL_TIMEOUT was introduced in https://reviews.freebsd.org/D7521
  (r304251), which claimed: "VM shall response to CAM layer with
  CAM_SEL_TIMEOUT to filter those invalid LUNs. Never use CAM_DEV_NOT_THERE
  which will block LUN scan for LUN number higher than 7."

  But it turns out this is not correct: I think what really filters the
  invalid LUNs in r304251 is that, before r304251, we could set CAM_REQ_CMP
  without checking vm_srb->srb_status at all: ccb->ccb_h.status |= CAM_REQ_CMP.
  r304251 checks vm_srb->srb_status and sets ccb->ccb_h.status properly, so
  the invalid LUNs are filtered.

  I changed my code version to r304251 but replaced the CAM_SEL_TIMEOUT with
  CAM_DEV_NOT_THERE, and I confirmed the invalid LUNs can also be filtered,
  and I successfully hot-added and hot-removed 8 disks to/from the VM without
  any issue.

  CAM_SEL_TIMEOUT has an unwanted side effect -- see cam_periph_error():
  for a selection timeout, we consider all of the LUNs on the target to be
  gone. If the status is CAM_DEV_NOT_THERE, then we only get rid of the
  device(s) specified by the path in the original CCB.

  This means: for a VM with a valid LUN on 3:0:0:0, when the VM inquires
  3:0:0:1 and the host reports 3:0:0:1 doesn't exist and storvsc returns
  CAM_SEL_TIMEOUT to the CAM layer, CAM will detach 3:0:0:0 as well: this is
  the bug I reported recently:
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226583

  PR:             226583
  Reviewed by:    mav
  Sponsored by:   Microsoft
  Differential Revision:  https://reviews.freebsd.org/D14690

Changes:
  _U  stable/11/
  stable/11/sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c
A commit references this bug:

Author: dexuan
Date: Tue Apr 24 03:07:49 UTC 2018
New revision: 332904
URL: https://svnweb.freebsd.org/changeset/base/332904

Log:
  MFC: 332385

  r332385:
  hyperv/storvsc: storvsc_io_done(): do not use CAM_SEL_TIMEOUT

  CAM_SEL_TIMEOUT was introduced in https://reviews.freebsd.org/D7521
  (r304251), which claimed: "VM shall response to CAM layer with
  CAM_SEL_TIMEOUT to filter those invalid LUNs. Never use CAM_DEV_NOT_THERE
  which will block LUN scan for LUN number higher than 7."

  But it turns out this is not correct: I think what really filters the
  invalid LUNs in r304251 is that, before r304251, we could set CAM_REQ_CMP
  without checking vm_srb->srb_status at all: ccb->ccb_h.status |= CAM_REQ_CMP.
  r304251 checks vm_srb->srb_status and sets ccb->ccb_h.status properly, so
  the invalid LUNs are filtered.

  I changed my code version to r304251 but replaced the CAM_SEL_TIMEOUT with
  CAM_DEV_NOT_THERE, and I confirmed the invalid LUNs can also be filtered,
  and I successfully hot-added and hot-removed 8 disks to/from the VM without
  any issue.

  CAM_SEL_TIMEOUT has an unwanted side effect -- see cam_periph_error():
  for a selection timeout, we consider all of the LUNs on the target to be
  gone. If the status is CAM_DEV_NOT_THERE, then we only get rid of the
  device(s) specified by the path in the original CCB.

  This means: for a VM with a valid LUN on 3:0:0:0, when the VM inquires
  3:0:0:1 and the host reports 3:0:0:1 doesn't exist and storvsc returns
  CAM_SEL_TIMEOUT to the CAM layer, CAM will detach 3:0:0:0 as well: this is
  the bug I reported recently:
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226583

  PR:             226583
  Reviewed by:    mav
  Sponsored by:   Microsoft
  Differential Revision:  https://reviews.freebsd.org/D14690

Changes:
  _U  stable/10/
  stable/10/sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c