Bug 278289 - nvme_opc_delete_io_sq NOT PERMITTED queue id 0
Summary: nvme_opc_delete_io_sq NOT PERMITTED queue id 0
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: bhyve
Version: 14.0-RELEASE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-virtualization (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-04-10 09:24 UTC by Markus Wild
Modified: 2024-04-14 21:36 UTC

See Also:


Description Markus Wild 2024-04-10 09:24:46 UTC
We're running a Windows Server 2019 VM within bhyve, using the NVMe drive emulation like so:

  bhyve -c 4 -m 16G -H -w \
  -s 0,hostbridge \
  -s 4,nvme,/dev/zvol/data/volumes/zvol2 \
  -s 5,virtio-net,tap11 \
  -s 7,virtio-net,tap21 \
  -s 6,nvme,/dev/zvol/data/volumes/zvol-bk01.r \
  -s 29,fbuf,tcp=0.0.0.0:5901 \
  -s 30,xhci,tablet \
  -s 31,lpc \
  -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd \
  winserv2 &

The VM implements a Veeam backup repository with ReFS on the second disk. We just had
an incident where this VM took the ReFS volume offline due to these events:

stornvme: Reset to device, \Device\RaidPort1, was issued
Disk: An error was detected on device \Device\Harddisk1\DR1 during a paging operation
stornvme: the driver detected a controller error on \Device\RaidPort1
Disk: An error was detected on device \Device\Harddisk1\DR1 during a paging operation
ReFS: The file system was unable to write metadata to the media backing volume R:. A write failed with status "The specified request is not a valid operation for the target device." ReFS will take the volume offline. It may be mounted again automatically.
 
On the FreeBSD side, I see error messages like these:
daemon[70039]: nvme_opc_delete_io_sq NOT PERMITTED queue id 0 / num_squeues 4
syslogd: last message repeated 5 times

Checking the source, a queue_id of 0 is invalid, so why would Windows attempt this? Could this
be a consequence of issuing a "Reset device" to the NVMe controller, and if so, is there anything
the bhyve NVMe emulation could do to recover from this without failing the request as it does at the moment?
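
For reference, here is a simplified, self-contained stand-in for that check
(not the verbatim bhyve source; the types and names are reduced for
illustration):

  #include <stdint.h>
  #include <stdio.h>

  /* Simplified stand-in for the emulation's per-controller state. */
  struct nvme_sc {
      uint16_t num_squeues;   /* number of I/O submission queues */
  };

  /* Returns 0 if the deletion is permitted, or the NVMe command-specific
   * status code if not. */
  static int
  delete_io_sq_check(const struct nvme_sc *sc, uint16_t qid)
  {
      /* A qid of 0 is invalid for Delete I/O SQ, and IDs above
       * num_squeues don't exist. */
      if (qid == 0 || qid > sc->num_squeues) {
          fprintf(stderr, "nvme_opc_delete_io_sq NOT PERMITTED "
              "queue id %u / num_squeues %u\n", qid, sc->num_squeues);
          return 0x1;         /* Invalid Queue Identifier */
      }
      return 0;
  }

  int main(void)
  {
      struct nvme_sc sc = { .num_squeues = 4 };
      (void)delete_io_sq_check(&sc, 0);   /* reproduces the message above */
      return 0;
  }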

Note that this system is rather underpowered for the task, so timeouts are to be expected.
Comment 1 Chuck Tuffli 2024-04-14 21:36:42 UTC
Queue ID 0 is the Admin queue, and per the specification (e.g. section 5.6 in the 1.4 version):

    Note: It is not possible to delete the Admin Submission Queue.

The specification (section 5.6.1 in the 1.4 version) also defines the command-specific error status 0x1 for Delete I/O Queue:

    Invalid Queue Identifier: The Queue Identifier specified in the command
    is invalid. This error is also indicated if the Admin Submission Queue
    identifier is specified.
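
To make that concrete, here is a small sketch of how that error would be
encoded in the 16-bit completion Status Field (Phase Tag at bit 0, Status
Code at bits 8:1, Status Code Type at bits 11:9, per the 1.4 layout; the
macro names here are illustrative, not quoted from any header):

  #include <stdint.h>
  #include <stdio.h>

  #define NVME_STATUS_SC_SHIFT   1    /* Status Code, 8 bits */
  #define NVME_STATUS_SCT_SHIFT  9    /* Status Code Type, 3 bits */

  #define NVME_SCT_COMMAND_SPECIFIC         0x1
  #define NVME_SC_INVALID_QUEUE_IDENTIFIER  0x1

  int main(void)
  {
      uint16_t status =
          (NVME_SCT_COMMAND_SPECIFIC << NVME_STATUS_SCT_SHIFT) |
          (NVME_SC_INVALID_QUEUE_IDENTIFIER << NVME_STATUS_SC_SHIFT);

      /* prints 0x0202 */
      printf("Delete I/O SQ error status: 0x%04x\n", status);
      return 0;
  }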

Not only is the NVMe emulation's behavior correct, but the UNH IOL [1] Command Set Conformance tests check for this behavior in tests 1.4.2-5,7-8. Removing this check to accommodate Windows would make the conformance tests fail.

As to why Windows is doing this, I don't know, and I would be curious to find out.

But as to the failure you observe, I suspect the I/O queue deletion error isn't the cause; the underlying failure would have occurred prior to the device reset (all the I/O queue delete commands would have failed). It would be interesting to see whether any Write commands to the device returned a non-good status, as that would more closely match "ReFS: The file system was unable to write metadata".
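
If you want to check for that, one option is instrumentation along these
lines in the completion path (purely illustrative; this helper is an
assumption, not existing bhyve code):

  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical hook: log any completion whose Status Field is not
   * Successful Completion (SCT 0 / SC 0). */
  static void
  log_bad_completion(uint8_t opc, uint16_t status)
  {
      uint8_t sc  = (status >> 1) & 0xff;  /* Status Code */
      uint8_t sct = (status >> 9) & 0x7;   /* Status Code Type */

      if (sct != 0 || sc != 0)
          fprintf(stderr, "opc 0x%02x failed: sct %u sc 0x%02x\n",
              opc, sct, sc);
  }

  int main(void)
  {
      /* e.g. a Write (opcode 0x01) completing with generic status
       * 0x06, Internal Error */
      log_bad_completion(0x01, 0x06 << 1);
      return 0;
  }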

[1] https://www.iol.unh.edu/testing/storage/nvme