Host system: CentOS 6 with included KVM
Guest system: FreeBSD on Virtio drivers
System contains 4K / GPT disks with SSD Caching
The system ran stably for about a month. After updating the system with freebsd-update install, or after upgrading a port using "portmaster", a reboot was issued.
After reboot, multiple library files in /usr/lib were corrupted.
The corruption manifests itself by messages like:
/usr/lib/libpam.so.5 Shared object has no run-time symbol table
The MD5 hashes of the files differ from those of the versions on the distribution DVD.
Comparing the file contents with their originals using hexdiff shows the following corruption patterns:
- Corrupted parts start and end exactly at 4K boundaries
- A file can have multiple corrupted blocks
- The corrupted 4K blocks are always exact copies of other (non-corrupted) 4K blocks of the same file.
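The block-level comparison described above can also be scripted instead of being read off hexdiff output. The following is only a minimal sketch: diff_blocks is a hypothetical helper name, and it assumes an intact copy of the file (for example from the distribution DVD) is available next to the suspect one. It relies on cmp -l, which prints one line per differing byte, beginning with the 1-based byte offset.

```shell
# Hypothetical helper: list the 0-based 4K block numbers in which two files differ.
# $1 = known-good original, $2 = suspect file
diff_blocks() {
    # cmp -l prints "<1-based offset> <octal byte in $1> <octal byte in $2>" per differing byte.
    # stderr is discarded so an "EOF on ..." message for files of unequal length stays quiet.
    cmp -l "$1" "$2" 2>/dev/null |
        awk '{
            b = int(($1 - 1) / 4096)          # map byte offset to 4K block number
            if (!(b in seen)) { seen[b] = 1; print b }
        }'
}
```

Called as diff_blocks /path/to/original/libpam.so.5 /usr/lib/libpam.so.5 (the first path is a placeholder), it prints one block number per line. If the observations above hold, the numbers should form runs of whole blocks.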
The corruption was observed on one system three times over the last 4 months, on FreeBSD 10.0 RELEASE and on FreeBSD 9.3 RELEASE. After each corruption the system was wiped clean and reinstalled.
/var/log/messages shows no entries that explain the problem.
Running fsck on the affected partition reports the file system as clean.
File systems were created using the following commands (aligning them to 4K):
gpart create -s gpt vtbd0
gpart add -a 1M -s 512k -t freebsd-boot -l exboot vtbd0
gpart add -a 1M -s 5G -t freebsd-ufs -l exrootfs vtbd0
gpart add -a 1M -s 5G -t freebsd-swap -l exswap vtbd0
gpart add -a 1M -s 30G -t freebsd-ufs -l exvarfs vtbd0
gpart add -a 1M -s 4G -t freebsd-ufs -l extmpfs vtbd0
gpart add -a 1M -t freebsd-ufs -l exusrfs vtbd0
gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 vtbd0
newfs -U /dev/gpt/exrootfs
newfs -U /dev/gpt/exvarfs
newfs -U /dev/gpt/extmpfs
newfs -U /dev/gpt/exusrfs
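To double-check the 4K alignment claimed above, the partition start offsets (as listed by gpart show vtbd0) can be tested with simple arithmetic. This is a sketch only: is_4k_aligned is a hypothetical helper, and it assumes the start value is given in sectors of 512 bytes unless a different sector size is passed as second argument.

```shell
# Hypothetical helper: succeed if a partition start offset falls on a 4K boundary.
# $1 = start offset in sectors, $2 = sector size in bytes (assumed 512 if omitted)
is_4k_aligned() {
    [ $(( $1 * ${2:-512} % 4096 )) -eq 0 ]
}
```

For example, is_4k_aligned 2048 succeeds (2048 * 512 bytes = 1 MiB), while is_4k_aligned 34 fails (34 * 512 = 17408 bytes, not a multiple of 4096).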
The last time the corruption occurred, MD5 hashes of library files in /usr/lib were saved shortly before reboot (after running freebsd-update install).
After reboot, the MD5 hashes of some files differed from those saved before reboot, so the file corruption must have taken place during the reboot.
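The before/after comparison can be repeated along the following lines. This is a rough sketch, not the exact commands used at the time: hash_files and changed_files are hypothetical helper names, and the snapshot file locations are up to the reader. The sketch uses md5sum where it exists and otherwise falls back to FreeBSD's md5 -r, which prints the hash before the path. It assumes file names without whitespace.

```shell
# Hypothetical helper: print one "<md5> <path>" line per regular file directly in directory $1.
hash_files() {
    if command -v md5sum >/dev/null 2>&1; then
        find "$1" -maxdepth 1 -type f -exec md5sum {} +
    else
        find "$1" -maxdepth 1 -type f -exec md5 -r {} +   # FreeBSD md5(1)
    fi | sort -k 2
}

# Hypothetical helper: print paths whose hash line differs between two snapshots.
# $1 = snapshot taken before reboot, $2 = snapshot taken after reboot
changed_files() {
    # Lines present in both snapshots occur twice and are dropped by uniq -u.
    cat "$1" "$2" | sort | uniq -u | awk '{ print $2 }' | sort -u
}
```

Typical use would be hash_files /usr/lib > before.md5 shortly before the reboot, hash_files /usr/lib > after.md5 afterwards, and then changed_files before.md5 after.md5. The snapshot files are best kept on a file system other than the affected one.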
Further Analysis of the last occurrence:
On the host system, backtraces were found that relate to the Broadcom (bnx2) driver of the host. The backtraces were bound to the process ID of the FreeBSD guest. Backtrace entries started about 50 minutes before reboot and stopped at the time of reboot.
The KVM FreeBSD guest uses the Virtio NIC network model. The Linux KVM website, however, suggests using the e1000 network model:
The VPS guest had been shut down for 4 days. After mounting a DVD and starting the FreeBSD VPS guest (from hard disk), the corruption had magically disappeared: there were no errors, and the MD5 hashes of the previously corrupted files were OK. The host system was not restarted in the meantime.
Temporary corruption must have occurred on the KVM host system, possibly triggered by the virtio NIC driver.
That makes me suspect some HW problem, likely RAM. I don't see how the network driver could have caused some transient error like this either.
Corruption reoccurred after the FreeBSD guest system was reinstalled with the e1000 NIC driver. The Virtio NIC driver can therefore be excluded as the cause of the corruption.
The corruption followed the same pattern as previous occurrences: in a single file, 4K blocks nr. 160 - 164 were corrupted during reboot and became exact copies of the following 4K block (nr. 165).
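When no intact original is at hand, the "block is a copy of another block of the same file" pattern can be searched for directly. The sketch below is an illustration only: dup_blocks is a hypothetical helper that checksums each 4K block with cksum and prints the 0-based numbers of blocks sharing a checksum with another block. A CRC match merely flags candidates, which should be confirmed byte by byte, and an intact file can legitimately contain identical blocks (zero-filled regions, for instance).

```shell
# Hypothetical helper: print every 0-based 4K block number whose checksum
# matches that of at least one other 4K block in the same file ($1).
dup_blocks() {
    f=$1
    size=$(( $(wc -c < "$f") ))
    nblocks=$(( (size + 4095) / 4096 ))
    i=0
    while [ "$i" -lt "$nblocks" ]; do
        # cksum on stdin prints "<crc> <byte count>"
        sum=$(dd if="$f" bs=4096 skip="$i" count=1 2>/dev/null | cksum)
        echo "$i $sum"
        i=$((i + 1))
    done | awk '{ key = $2 " " $3; n[key]++; blk[NR] = $1; k[NR] = key }
                END { for (j = 1; j <= NR; j++) if (n[k[j]] > 1) print blk[j] }'
}
```

For a file damaged as described above, the output would be expected to contain the corrupted blocks together with the block they were copied from.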
Corruption is likely to be HW related.