Bug 236042

Summary: Windows Server 2016 Hyper-V snapshot triggers SCSI errors
Product: Base System
Reporter: Alex G <alex.gacovski>
Component: kern
Assignee: freebsd-virtualization (Nobody) <virtualization>
Status: New ---
Severity: Affects Many People
CC: avg, bsdic, decui, glukken, honzhan, j.morillo, juraj, lauer, mattbju2013, michael.adm, njc, sepherosa, whu
Priority: ---    
Version: 12.0-RELEASE   
Hardware: amd64   
OS: Any   
Attachments:
Description Flags
SCSI error during snapshot none
proposed patch, ported from Linux none

Description Alex G 2019-02-25 23:29:38 UTC
Hi,

I am currently running FreeBSD 12.0-RELEASE in a Hyper-V Gen 2 VM with a SCSI virtual disk.

When Veeam backup takes a Hyper-V snapshot of the running VM, the following is printed on the console:

Feb 21 08:51:40 8797web01 kernel: hvtimesync0: RTT
Feb 21 08:51:40 8797web01 kernel: (da0:storvsc0:0:0:0): WRITE(10). CDB: 2a 00 0a 6e 5a b0 00 00 08 00 
Feb 21 08:51:40 8797web01 kernel: (da0:storvsc0:0:0:0): CAM status: SCSI Status Error
Feb 21 08:51:40 8797web01 kernel: (da0:storvsc0:0:0:0): SCSI status: Check Condition
Feb 21 08:51:40 8797web01 kernel: (da0:storvsc0:0:0:0): SCSI sense: UNIT ATTENTION asc:3f,2 (Changed operating definition)
Feb 21 08:51:40 8797web01 kernel: (da0:storvsc0:0:0:0): Retrying command (per sense data)
Feb 21 08:54:33 8797web01 kernel: (da0:storvsc0:0:0:0): WRITE(10). CDB: 2a 00 03 b3 14 68 00 00 40 00 
Feb 21 08:54:33 8797web01 kernel: (da0:storvsc0:0:0:0): CAM status: SCSI Status Error
Feb 21 08:54:33 8797web01 kernel: (da0:storvsc0:0:0:0): SCSI status: Check Condition
Feb 21 08:54:33 8797web01 kernel: (da0:storvsc0:0:0:0): SCSI sense: UNIT ATTENTION asc:3f,2 (Changed operating definition)
Feb 21 08:54:33 8797web01 kernel: (da0:storvsc0:0:0:0): Retrying command (per sense data)
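
For reference, the CDB bytes in the log lines above can be decoded to see which blocks the failed writes targeted. The following is a minimal sketch, assuming the standard 10-byte WRITE(10) layout (opcode 0x2a, 4-byte big-endian LBA, 2-byte transfer length in blocks); the function name `decode_write10` is hypothetical and not part of any FreeBSD tool.

```python
import struct


def decode_write10(cdb_hex):
    """Decode a WRITE(10) CDB given as space-separated hex bytes.

    Hypothetical helper for reading the console output; it assumes the
    standard 10-byte WRITE(10) layout.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("a WRITE(10) CDB must be exactly 10 bytes")
    # Byte 0: opcode, byte 1: flags, bytes 2-5: LBA (big-endian),
    # byte 6: group number, bytes 7-8: transfer length, byte 9: control.
    opcode, _flags, lba, _group, blocks, _control = struct.unpack(">BBIBHB", cdb)
    if opcode != 0x2A:
        raise ValueError("not a WRITE(10) CDB (opcode is not 0x2a)")
    return {"lba": lba, "blocks": blocks}


first = decode_write10("2a 00 0a 6e 5a b0 00 00 08 00")
second = decode_write10("2a 00 03 b3 14 68 00 00 40 00")
print(first, second)
```

The first write covers 8 blocks starting at LBA 0x0a6e5ab0 and the second covers 64 blocks starting at LBA 0x03b31468. The size in bytes depends on the logical block size of the virtual disk, which the log does not state.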

Could it be that something is still trying to write to the disk when Hyper-V signals the HV VSS driver to freeze the file system?

The system has 1 virtual disk with 3 partitions.
* EFI boot
* swap
* UFS root file system

Thank you.
Cheers,
Alex.
Comment 1 Alex G 2019-02-25 23:34:14 UTC
I tried to email the authors of the Hyper-V integration drivers, but I got a bounce back from the Microsoft email server.

The response from the remote server was:
550 5.4.1 [bsdic@microsoft.com]: Recipient address rejected: Access denied [BL2NAM06FT011.Eop-nam06.prod.protection.outlook.com]
Comment 2 Yold 2019-03-19 13:08:09 UTC
I'm also having the same problem with a different setup:
- HyperV 2016 with OPNSense as guest (FreeBSD 11.1-RELEASE-p17)
This VM has a replica on another HyperV host, and replication often fails (I need to force a sync again).
Here is the FreeBSD log:

(da0:storvsc0:0:0:0): WRITE(10). CDB: 2a 00 00 cd ac a8 00 01 00 00
(da0:storvsc0:0:0:0): CAM status: SCSI Status Error
(da0:storvsc0:0:0:0): SCSI status: Check Condition
(da0:storvsc0:0:0:0): SCSI sense: UNIT ATTENTION asc:3f,2 (Changed operating definition)
(da0:storvsc0:0:0:0): Retrying command (per sense data)
(da0:storvsc0:0:0:0): WRITE(10). CDB: 2a 00 00 ca 90 68 00 00 40 00
(da0:storvsc0:0:0:0): CAM status: SCSI Status Error
(da0:storvsc0:0:0:0): SCSI status: Check Condition
(da0:storvsc0:0:0:0): SCSI sense: UNIT ATTENTION asc:3f,2 (Changed operating definition)
(da0:storvsc0:0:0:0): Retrying command (per sense data)
Comment 3 Gesture 2019-04-19 07:25:57 UTC
Created attachment 203786 [details]
SCSI error during snapshot
Comment 4 Gesture 2019-04-19 07:27:35 UTC
Hi,

I can confirm that I have the same SCSI errors during HyperV snapshots.

Greetings
Gesture
Comment 5 Gesture 2019-04-19 07:28:41 UTC
I'm using FreeBSD 11.2 and a pfSense 2.4.4 HyperV Virtual machine
Comment 6 thomaslauer 2019-05-20 19:05:15 UTC
Hi, I have 120 PFSense VMs from 2.3.4 to 2.4.4-2, all Hyper-V VMs with GEN2 and UFS,
and some other VMs with Hyper-V GEN2 and UFS. All of these VMs have the same issue.

I have only one VM with GEN2 and ZFS. This VM has no SCSI Errors during the snapshot.
Comment 7 Nick 2019-05-31 16:53:28 UTC
I am having this problem with PFSense (2.4.4-RELEASE-p3) running on Hyper-V (Windows 2012 R2).  In my case, replication might run fine for a while (hours, days) but at some point there is a SCSI Status Error during a WRITE operation and PFSense/FreeBSD will become locked up or partially working but eventually will not respond to network or UI requests.  I would love a resolution to this.  For the moment, I've disabled replication and it's been fine.

Perhaps useful, perhaps not:  I've been running PFSense for years as a replicating VM on Hyper-V W2K2012R2 without issues.  It was just this week that I started having problems.  PFSense was previously running many different versions (2.2, 2.3, 2.4.3).  When I started having problems this week I had not upgraded PFSense or the hypervisor.  As far as I can tell "nothing changed".

Nick
Comment 8 Michael 2019-11-01 11:07:10 UTC
I can confirm that I have the same SCSI errors during HyperV snapshots.

FreeBSD versions from 11.2 through the current 13.0 have this problem.
The problem occurs on ZFS too.
Comment 9 Michael 2019-11-01 11:44:42 UTC
Linux seems to have the same problem:
https://bugzilla.redhat.com/show_bug.cgi?id=1502601

I wonder whether FreeBSD has solved it or not.
The SCSI Status Error messages in FreeBSD are very similar to those from the already resolved problem in Linux.
Comment 10 Andriy Gapon freebsd_committer 2019-11-13 07:49:30 UTC
Created attachment 209124 [details]
proposed patch, ported from Linux

Could anyone observing the problem please test this patch?
Thanks!

P.S.
There is a review request for it as well:
https://reviews.freebsd.org/D22313
Comment 11 Michael 2019-11-14 20:15:03 UTC
(In reply to Andriy Gapon from comment #10)

It did not help. The messages are exactly the same after applying this patch and rebuilding with:
make cleanworld && make cleandir && make -j8 buildworld && make -j8 buildkernel KERNCONF=GENERIC
Comment 12 meichthys 2020-04-30 14:52:58 UTC
I am also seeing this when Veeam takes a snapshot of my HyperV VM running FreeNAS. FreeNAS seems to lock up for a short amount of time and then seems to recover after a few minutes.


(da1:storvsc1:0:0:0): WRITE(10). CDB: 2a 00 2c 10 8e 88 00 00 08 00
(da1:storvsc1:0:0:0): CAM status: SCSI Status Error
(da1:storvsc1:0:0:0): SCSI status: Check Condition
(da1:storvsc1:0:0:0): SCSI sense: UNIT ATTENTION asc:3f,2 (Changed operating definition)
(da1:storvsc1:0:0:0): Retrying command (per sense data)


Hyperv-2019
FreeNAS-11.3-U1
Veeam Backup & Replication 10 (For HyperV VM replication)
Comment 13 Dexuan Cui 2020-04-30 19:06:01 UTC
(In reply to Michael from comment #11)
When you see the SCSI errors, does the live backup procedure succeed? Do you notice any instability issues (e.g. hang/panic)? Do you notice any data corruption (e.g. the backup procedure says it succeeded, but later you find that the data was not really backed up consistently, i.e. 'fsck' runs when you think it should not)?

If the backup procedure still succeeds, and you never notice any real issue, then I think it should be safe to ignore the SCSI error messages. If we want to get rid of the messages, it looks like sys/dev/hyperv/utilities/hv_snapshot.c is not the issue -- it looks like we need to improve sys/dev/hyperv/storvsc/hv_storvsc_drv_freebsd.c and/or other parts of the SCSI subsystem.

A Linux VM on Hyper-V shows the same or similar messages when a Hyper-V live backup is being performed (at least this was the case in June 2017):

sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.

These messages can be seen in a Linux VM even if the backup is successful, so it looks like people just ignore the messages in Linux.

The messages are a result of how Hyper-V live backup works. Usually the VM's .vhdx file has a block size of 32MB, and during the live backup procedure, IIRC, a temporary .avhdx with a 2MB block size is generated. The host then returns sense key "Unit Attention" with asc 0x3f and ascq 0x2 (Changed Operating Definition). The host sends "Unit Attention" because the backing VHD block size has changed after the checkpoint or backup, which changes the UNMAP granularity.
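
The sense key / asc / ascq triple described above is what the FreeBSD console lines in this report show as "UNIT ATTENTION asc:3f,2". A minimal sketch for picking this specific condition out of a log, assuming the line format shown in the earlier comments (hex asc and ascq after "asc:"); the names `parse_sense` and `is_changed_operating_definition` are hypothetical helpers, not part of CAM or any existing tool.

```python
import re

# Matches e.g. "SCSI sense: UNIT ATTENTION asc:3f,2 (Changed operating definition)".
# The format is an assumption based on the console lines quoted in this report.
SENSE_RE = re.compile(
    r"SCSI sense: (?P<key>[A-Z ]+?) asc:(?P<asc>[0-9a-f]+),(?P<ascq>[0-9a-f]+)"
)


def parse_sense(line):
    """Return (sense_key_name, asc, ascq) from a CAM log line, or None."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    return match.group("key"), int(match.group("asc"), 16), int(match.group("ascq"), 16)


def is_changed_operating_definition(line):
    """True if the line reports Unit Attention with asc 0x3f, ascq 0x2."""
    return parse_sense(line) == ("UNIT ATTENTION", 0x3F, 0x02)


sample = ("(da0:storvsc0:0:0:0): SCSI sense: UNIT ATTENTION asc:3f,2 "
          "(Changed operating definition)")
print(parse_sense(sample), is_changed_operating_definition(sample))
```

Counting how many matching lines fall inside a backup window could help answer whether the messages only appear while a snapshot is being taken.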

The Linux SCSI layer can handle the following ascq values for asc 0x3f on Unit Attention:
ascq 0x3 (Inquiry Data Has Changed)
ascq 0xe (Reported Luns Data Has Changed)

However, ascq 0x2 is ignored. The SCSI layer won't know about the UNMAP granularity change; UNMAP will probably run slower, but other parts won't be affected.
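
The handling described above can be summarised as a small decision table. This is only an illustrative sketch of the behaviour as described in this comment, not actual Linux or FreeBSD code; the function name `classify_unit_attention` and the returned labels are hypothetical. The only outside fact used is that the Unit Attention sense key has the value 0x6.

```python
UNIT_ATTENTION = 0x06  # SCSI sense key value for Unit Attention

# asc 0x3f ascq values that, per the description above, the Linux SCSI
# layer acts on.
HANDLED_ASCQ = {
    0x03: "Inquiry Data Has Changed",
    0x0E: "Reported Luns Data Has Changed",
}


def classify_unit_attention(sense_key, asc, ascq):
    """Classify a sense triple as 'handled', 'ignored', or 'other'.

    Hypothetical illustration of the behaviour described in this comment.
    """
    if sense_key != UNIT_ATTENTION or asc != 0x3F:
        return "other"
    if ascq in HANDLED_ASCQ:
        return "handled"
    if ascq == 0x02:
        # Changed Operating Definition: the case reported during live backup.
        return "ignored"
    return "other"


print(classify_unit_attention(0x06, 0x3F, 0x02))
```

Whether FreeBSD should treat ascq 0x2 the same way is exactly the open question in this report; the sketch only restates the Linux-side description.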

Note: I excerpted the above details from a 2017 email discussing the SCSI errors in a Linux VM on Hyper-V. My understanding is that the messages can be safely ignored in Linux, but I'm not sure about FreeBSD, as the SCSI subsystem in FreeBSD may handle the sense info in a different manner. 

I have moved on to different projects, so I am sorry I cannot follow up on this... I hope someone finds the info I shared here useful, in case something has to be done for FreeBSD VMs.
Comment 14 Juraj Lutter 2020-05-03 20:40:05 UTC
FWIW, I'm seeing this with recent 12-STABLE as well. I will review Andriy's patch.