Bug 227906

Summary: emulators/open-vm-tools: vmguestd stops communicating with host randomly
Product: Ports & Packages
Reporter: Phillip R. Jaenke <prj>
Component: Individual Port(s)
Assignee: Josh Paetzel <jpaetzel>
Status: New
Severity: Affects Only Me
CC: jwolfe, ler, prj
Priority: ---
Flags: bugzilla: maintainer-feedback? (jpaetzel)
Version: Latest   
Hardware: Any   
OS: Any   
Attachments:
vmware.log from ghanir.rootwyrm.pvt (flags: none)
vmware.log from runetotem.rootwyrm.com (flags: none)

Description Phillip R. Jaenke 2018-05-01 18:57:14 UTC
Separate bug requested by John Wolfe. This build does not appear to fully address a repeating issue I have been running into. No error is reported by host or guest, but guestd appears to stop communicating, resulting in the classic 'VMware Tools Not Installed' message in vCenter. There isn't any clear trigger, and restarting guestd seems to fix it. 

Reproduces on three guests now, with the hosts running 6.5.0 7388607. This problem began with open-vm-tools-nox11-10.2.0 from portsnap and SVN, packages built on my in-house poudriere cluster and straight from ports. Since no clear trigger has been found, it's VERY difficult to test port-build vs poudriere built package, but it has reproduced with both. Versions prior to 10.2.0 did not exhibit this behavior.

This has occurred from about FreeBSD 11.1-RELEASE through -p11, and only on amd64 guests (I have not been able to reproduce it on i386). I had hoped to find a commonality between guests, such as a high VMDK count or routinely high CPU or RAM usage, but no such luck. It shows up on 1vC/1GB/2 VMDK as well as 4vC/4GB/10 VMDK.

In all guests, bouncing vmguestd with /usr/local/etc/rc.d/vmware-guestd restart will resolve it temporarily. Sometimes vmware-kmod also needs a restart, even though kernel modules are already loaded. 10.2.5 does seem to resolve it a bit longer - about 24-36 hours - but 10.2.0_2,2 will lose communications again usually after a few hours (generally less than 4.) 

All guests are using the same boot start in rc.conf:
vmware_guest_vmblock_enable="YES"
vmware_guest_vmhgfs_enable="YES"
vmware_guest_vmmemctl_enable="YES"
vmware_guest_vmxnet_enable="YES"
vmware_guestd_enable="YES"
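
For comparing these settings across guests, a minimal sketch follows. The function name list_vmware_knobs and the default path /etc/rc.conf are assumptions for illustration, not part of the port; it only prints the uncommented vmware_* assignments so the output from two guests can be diffed.

```sh
#!/bin/sh
# Hypothetical helper: list the vmware_* knobs set in an rc.conf-style file.
# The default path is an assumption; pass another file as the first argument.
list_vmware_knobs() {
    rc_conf="${1:-/etc/rc.conf}"
    # Keep only uncommented vmware_* assignments, strip leading whitespace,
    # and sort bytewise so the output is stable across locales.
    grep -E '^[[:space:]]*vmware_[A-Za-z0-9_]+=' "$rc_conf" \
        | sed 's/^[[:space:]]*//' \
        | LC_ALL=C sort
}

# Example (not run automatically):
#   list_vmware_knobs /etc/rc.conf > /tmp/$(hostname).vmware-knobs
```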

This does also occur with both open-vm-tools and open-vm-tools-nox11 so it's not related to any X components. Since there's no clear trigger and I can't reproduce at will, this has proven extremely difficult to debug. There are no indications of problems on either guest or host side in the logs either. It just stops communicating.
Comment 1 John Wolfe 2018-05-03 19:26:27 UTC
Hi Phillip,

Can you provide a little more information to help select an approach to investigating the issue(s) here?

1. What is the workload on your FBSD 11 system, and does the vmtoolsd stoppage correlate with any fluctuations in that workload?

2. Are there any core files left on your system from the time of the vmtoolsd stoppage?

3. What are the characteristics of your virtual machine; what virtual network interface is in use - e1000, vmxnet2 or vmxnet3?

4. What is the version and build # of the host being used - ESXi, Workstation, or Fusion?

Along with this info, one of the first things we can look at is the VMX (vmware.log) file that covers the time from when vmtoolsd stopped running to past the point where you do a "/usr/local/etc/rc.d/vmware-guestd start". Please note the local time on the FBSD system when you notice the stoppage, and when you do the "start".

The vmware.log file will be in the same directory containing the VM's .vmx file.

If you can provide the information and collect the vmware.log file from the host and any new core file from the FreeBSD system, that will help in investigating the issue.
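
The collection steps above can be sketched as a small shell helper. Everything named here (note_event, timed_restart, find_cores, and the example log path) is hypothetical and only illustrates one way to capture the local timestamps and look for core files; it assumes core files follow the default FreeBSD kern.corefile pattern of %N.core.

```sh
#!/bin/sh
# Hypothetical helpers; the function names and example paths are assumptions.

# Append a note stamped with local time and UTC offset to a log file.
note_event() {
    log_file="$1"; shift
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S %z')" "$*" >> "$log_file"
}

# Record when the stoppage was noticed, run the given restart command,
# then record when it finished together with its exit status.
timed_restart() {
    log_file="$1"; shift
    note_event "$log_file" "stoppage noticed"
    status=0
    "$@" || status=$?
    note_event "$log_file" "restart finished (exit $status): $*"
    return "$status"
}

# List regular files named *.core under the given directories.
find_cores() {
    find "$@" -type f -name '*.core' 2>/dev/null
}

# Example (not run automatically):
#   timed_restart /var/tmp/guestd-events.log /usr/local/etc/rc.d/vmware-guestd restart
#   find_cores / /var/tmp
```

The resulting log lines can be pasted alongside the vmware.log attachment so the guest-side times can be matched against host-side entries.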
Comment 2 Phillip R. Jaenke 2018-05-03 20:17:51 UTC
Created attachment 193030 [details]
vmware.log from ghanir.rootwyrm.pvt
Comment 3 Phillip R. Jaenke 2018-05-03 20:18:12 UTC
Created attachment 193031 [details]
vmware.log from runetotem.rootwyrm.com
Comment 4 Phillip R. Jaenke 2018-05-03 20:18:24 UTC
OK, for ease of tracking (since it obviously is a bit of a convoluted mess) I'm going to break the response down per-guest since I have two that repeat the problem. And for ease of telling them apart, I'll use the FQDN.

ghanir.rootwyrm.pvt - using 10.2.5 from jpaetzel@'s repository
1) This is a poudriere and release building system, so frequent high CPU, disk, and memory workloads. With 10.2.5, the communications halt appears tied to periods of high memory activity. (That correlates to certain port builds.) High CPU activity and high disk activity don't appear to have any impact. The problem is that it doesn't consistently happen under high memory load either. Sometimes it just goes poof.

2) No core files at all. In fact, guestd doesn't actually stop at any point. It's still running, no errors, just isn't communicating with the host as far as I can tell.

3) Allocation is non-reserved, 4 vCPU, 4GB, 7 disks (LSI Parallel), single VMXNET3 with direct path enabled. I have tried with LSI SAS, no impact there. The VMDKs are on a mix of SSD (DRS), 10k (DRS), and NFS. VM version is 13.

4) Hosts are ESXi 6.5.0 7388607 under vCenter using HA and VUM. All hosts are in sync on version and identical hardware and BIOS. This behavior started around 10.2.0 and somewhere after 6765664. ghanir is on taunka.

vmware.log attached as ghanir_vmware.log

-------------------------------------------------------
runetotem.rootwyrm.com - using 10.2.0_2,5 from local poudriere (ghanir)
1) This is a database server running PostgreSQL and MariaDB. Load is extremely low as almost all work is directed to replicas. Communications halt here seems to be entirely random or possibly time based. This one still frequently drops out; usually I can't keep it happy more than a few hours.

2) No core files here either, same issue with guestd still running with no errors or crashes. Just no communications with the host.

3) Allocation is non-reserved, 2 vCPU, 4GB, 6 disks, 2x LSI SAS (no bus share), 2x VMXNET3 with direct path enabled. This one is actually the rare case where multiple controllers really are required. Every few months, I reload a 300GB database. The transaction log disk is on one controller, and that database tablespace resides on the other controller. 

4) Hosts are ESXi 6.5.0 7388607 under vCenter using HA and VUM. All hosts are in sync on version and identical hardware and BIOS. This behavior started around 10.2.0 and somewhere after 6765664. runetotem is on shuhalo.

runetotem is especially odd, as I have restarted vmware guestd and kmod multiple times and there's no evidence of it in vmware.log at all. It's like it never even manages to attempt host communications. Case in point, I restarted vmware tools on April 27, vCenter showed it as working, a few hours later it stopped communicating again, nothing in the logs at all on guest or host.

vmware.log attached as runetotem_vmware.log