Bug 193570 - File corruption during reboot on Virtio-based FreeBSD guest
Status: Closed Not A Bug
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 9.3-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Bryan Venteicher
Reported: 2014-09-11 18:48 UTC by Jan Siero
Modified: 2014-09-29 11:25 UTC

Description Jan Siero 2014-09-11 18:48:27 UTC
Host system: CentOS 6 with included KVM
Guest system: FreeBSD on Virtio drivers

System contains 4K / GPT disks with SSD Caching

The system ran stably for about a month. After updating the system with freebsd-update install, or after upgrading a port using "portmaster", a reboot was issued.

After reboot, multiple library files in /usr/lib were corrupted.

The corruption manifests itself by messages like: 
/usr/lib/libpam.so.5 Shared object has no run-time symbol table

The MD5 hashes of the files differ with the version on the distribution DVD.

Comparing file contents with their originals using hexdiff, the corruption manifests itself with the following patterns:
 - Corrupted parts start and end exactly at 4K boundaries
 - A file can have multiple corrupted blocks
 - The corrupted 4K blocks are always exact copies of other (non-corrupted) 4K blocks of the same file.
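The pattern above can be checked mechanically. The following is a minimal sketch, not the tool used in this report (that was hexdiff): it compares a known-good copy of a file with the suspect copy block by block and, for each differing block, looks up which blocks of the good copy it duplicates. The function names and the command-line usage are hypothetical, and the 4096-byte block size is taken from the observation above.

```python
import sys

BLOCK = 4096  # 4K block size, matching the corruption granularity described above


def split_blocks(data, size=BLOCK):
    """Split bytes into consecutive blocks of `size` bytes (the last may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def find_duplicated_blocks(good, bad, size=BLOCK):
    """Return (block_index, source_indices) for every block of `bad` that
    differs from the same block of `good`. source_indices lists the blocks of
    `good` with identical content (empty if the corrupted block matches none).
    Block indices are zero-based."""
    good_blocks = split_blocks(good, size)
    bad_blocks = split_blocks(bad, size)

    # Map block content -> indices where that content occurs in the good file.
    where = {}
    for idx, blk in enumerate(good_blocks):
        where.setdefault(blk, []).append(idx)

    report = []
    for idx, (g, b) in enumerate(zip(good_blocks, bad_blocks)):
        if g != b:
            report.append((idx, where.get(b, [])))
    return report


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python blockdiff.py /path/to/original /path/to/suspect
    with open(sys.argv[1], "rb") as f_good, open(sys.argv[2], "rb") as f_bad:
        for index, sources in find_duplicated_blocks(f_good.read(), f_bad.read()):
            print(f"block {index}: copy of block(s) {sources or 'none'}")
```

A differing block with a non-empty source list matches the "exact copy of another block" pattern; an empty list would indicate a different kind of damage.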

The corruption was observed on one system three times over the last 4 months, on FreeBSD 10.0-RELEASE and on FreeBSD 9.3-RELEASE. After each corruption the system was wiped clean and reinstalled.

/var/log/messages shows no entries that explain any system issues.

Running fsck on the affected partition reports the file system as clean.

File systems were created using the following commands (aligning them to 4K):

gpart create -s gpt vtbd0
gpart add -a 1M -s 512k -t freebsd-boot -l exboot vtbd0
gpart add -a 1M -s 5G -t freebsd-ufs -l exrootfs vtbd0
gpart add -a 1M -s 5G -t freebsd-swap -l exswap vtbd0
gpart add -a 1M -s 30G -t freebsd-ufs -l exvarfs vtbd0
gpart add -a 1M -s 4G -t freebsd-ufs -l extmpfs vtbd0
gpart add -a 1M -t freebsd-ufs -l exusrfs vtbd0

gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 vtbd0

newfs -U /dev/gpt/exrootfs
newfs -U /dev/gpt/exvarfs
newfs -U /dev/gpt/extmpfs
newfs -U /dev/gpt/exusrfs
Comment 1 Jan Siero 2014-09-11 19:04:28 UTC
The last time the corruption occurred, MD5 hashes of library files in /usr/lib were saved shortly before reboot (after running freebsd-update install). 

After reboot, MD5 hashes for some files were different compared to the ones before reboot. So the file corruption must have taken place during reboot.
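The before/after comparison described here can be reproduced with a small script. This is a sketch under assumptions, not the procedure actually used: the function names are hypothetical, and it only hashes regular files directly inside one directory (for example /usr/lib), skipping symlinks.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(directory):
    """Map file name -> MD5 for every regular, non-symlink file in `directory`."""
    result = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and not os.path.islink(path):
            result[name] = md5_of_file(path)
    return result


def changed_files(before, after):
    """Names whose hash differs, or that exist in only one of the snapshots."""
    names = set(before) | set(after)
    return sorted(n for n in names if before.get(n) != after.get(n))
```

To compare across a reboot, the first snapshot has to be stored somewhere other than the affected file system (for instance serialized with json.dump to another disk or host) and loaded again afterwards before calling changed_files.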
Comment 2 Jan Siero 2014-09-16 19:01:16 UTC
Further Analysis of the last occurrence:

On the host system, backtraces were found that relate to the Broadcom (bnx2) driver of the host. The backtraces were bound to the process ID of the FreeBSD guest. Backtrace entries started about 50 minutes before reboot and stopped at the time of reboot.

The KVM FreeBSD guest uses the Virtio NIC network model. The Linux KVM website, however, suggests using the e1000 network model:

http://www.linux-kvm.org/page/Guest_Support_Status#FreeBSD
Comment 3 Jan Siero 2014-09-21 18:40:51 UTC
The VPS guest had been shut down for 4 days. After mounting a DVD and starting the FreeBSD VPS guest (from hard disk), the corruption had disappeared: no errors, and the MD5 hashes of the previously corrupted files were OK. The host system was not restarted in the meantime.

Temporary corruption must have occurred on the KVM host system, possibly triggered by the virtio NIC driver.
Comment 4 Bryan Venteicher freebsd_committer 2014-09-28 00:50:36 UTC
That makes me suspect some HW problem, likely RAM. I don't see how the network driver could have caused some transient error like this either.
Comment 5 Jan Siero 2014-09-29 11:25:36 UTC
Corruption reoccurred after reinstallation of the FreeBSD guest system with the e1000 NIC driver. The Virtio NIC driver can therefore be excluded as the cause of the corruption.

Corruption followed the same pattern as the previous occurrences: in a single file, 4K blocks nr. 160 - 164 were corrupted during reboot and became exact copies of the following 4K block (nr. 165).
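For anyone who wants to inspect the affected region with dd or a hex viewer, the block numbers translate to byte offsets as sketched below. The helper name is hypothetical, and the sketch assumes the block numbers are zero-based, which this report does not state; with one-based numbering every offset shifts down by 4096 bytes.

```python
BLOCK = 4096  # 4K block size


def block_byte_range(first, last, size=BLOCK):
    """Byte range [start, end) covered by blocks first..last inclusive.
    Assumes zero-based block numbering (an assumption, see above)."""
    return first * size, (last + 1) * size


start, end = block_byte_range(160, 164)
print(start, end, end - start)  # 655360 675840 20480
```

Under that assumption, the five corrupted blocks span 20480 bytes, and the source block 165 begins at byte 675840, directly after them.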

Corruption is likely to be HW related.