Bug 219556

Summary: iSCSI subsystem blindly announces non-existing file (was: emulators/virtualbox-ose: VBoxSVC consumes 100% CPU and does not exit)
Product: Base System Reporter: Martin Birgmeier <d8zNeCFG>
Component: kernAssignee: Virtualbox Team (Nobody) <vbox>
Status: Closed Not A Bug    
Severity: Affects Only Me CC: emaste, trasz
Priority: --- Flags: bugzilla: maintainer-feedback? (vbox)
Version: 11.0-STABLE   
Hardware: Any   
OS: Any   

Description Martin Birgmeier 2017-05-25 20:54:26 UTC
Since a few days (I believe since the upgrade of mesa to 17.1.0) VBoxSVC continuously consumes 100% CPU and does not shut down after all sessions have closed.

This happens if any graphical window is used (VirtualBox). If using only VBoxHeadless, VBoxSvc behaves normally.

Also, this happens on one of two systems only: The affected one has an Intel CPU and uses an Nvidia graphics card + proprietary driver, whereas the normal one uses an AMD CPU and displays graphical output over the network (X11 connection).

# top -Utest
last pid:  9924;  load averages:  0.88,  0.90,  1.40    up 0+05:30:35  22:43:46
122 processes: 2 running, 119 sleeping, 1 zombie
CPU:  0.5% user, 12.5% nice,  0.4% system,  0.0% interrupt, 86.5% idle
Mem: 820M Active, 2747M Inact, 5324M Wired, 109M Buf, 15G Free
ARC: 2998M Total, 618M MFU, 2329M MRU, 444K Anon, 11M Header, 38M Other
Swap: 9216M Total, 9216M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 9916 test         14  25    5   172M 50820K select  5   0:15 103.65% VBoxSVC
 9911 test          9  26    5   394M   107M select  2   0:01   0.08% VirtualBo
 9917 test          1  20    0 30364K  7076K CPU5    5   0:00   0.05% top
 9914 test          1  25    5   123M 43400K select  0   0:00   0.02% VBoxXPCOM
 9898 test          1  20    0 75696K 13228K select  0   0:00   0.01% xterm
 9899 test          1  20    0 35068K  8796K pause   4   0:00   0.00% zsh

When exiting VirtualBox via the "exit" menu command, the window vanishes, but VBoxSVC continues to run at 100% CPU, and the VirtualBox process remains (at 0%). In this case, VBoxSVC can be killed -1, which will then also terminate VirtualBox and VBoxXPCOM.

All ports up-to-date as of now.
Comment 1 Martin Birgmeier 2017-05-26 08:12:10 UTC
The problem could lie with the preview in VirtualBox - it stays blank (black) on the machine where VBoxSVC consumes 100% CPU.
Comment 2 Martin Birgmeier 2017-05-26 14:00:53 UTC
After some experimentation, the problem turned out to have a quite different cause.

On both hosts, the same emulated harddisk is available: on the AMD machine directly via a vmdk file pointing, on the Intel machine via iSCSI.

However, the file holding the emulated harddisk was not available. On the AMD machine, this only caused VirtualBox to note that the machine cannot be started. On the Intel machine however, the file seems to get exported as a zero-sized disk; this in turn seems to trip up VBoxSVC.

So there are two issues: One, VBoxSVC gets confused about the announced but not really existing iSCSI target, and second, the iSCSI daemon blindly announces to export a file (with zero size) which does not exist (and cannot be created, its path not being available).

I think the bigger problem lies with the iSCSI subsystem announcing a target which does not exist. I'll try to reassing the issue to the base system.
Comment 3 Edward Tomasz Napierala freebsd_committer freebsd_triage 2017-05-27 07:09:36 UTC
I think it's a feature, not a bug.  A nonexisting file is a valid LUN, but without backing media (think an empty DVD drive).  SCSI compliant clients are supposed to know how to handle this.
Comment 4 Martin Birgmeier 2017-05-27 11:31:35 UTC
Hmmm... but if I then make the backing file available (by mounting the filesystem on which it resides), this fact is not reflected when I run 'ctladm devlist'. The same goes for the client: Attaching the disk using 'iscsictl -A' indeed first shows that the disk is not ready:

da0: Attempt to query device size failed: NOT READY, Logical unit not ready, manual

But after mounting the backing file, this does not change.

-- Martin
Comment 5 Edward Tomasz Napierala freebsd_committer freebsd_triage 2017-05-30 09:21:51 UTC
You need to notify CTL that something has changed by doing "service ctld reload".  Could you check whether it works and report back?  Thanks!
Comment 6 Martin Birgmeier 2017-05-31 17:54:01 UTC
This indeed works, I do not even need to rescan the bus on the client. The following message is logged when trying "dd if=/dev/da0 of=/dev/null bs=1m count=1" on the client after making the target available:

May 31 19:43:46 mizar kernel: (da0:iscsi1:0:0:0): Capacity data has changed

However, I then proceeded to remove the target from the configuration. On the client this leads to the message

May 31 17:44:41 mizar iscsid[4380]: hal.xyzzy (iqn.1995-06.xyzzy.hal:disk1): target returned error: Not found
May 31 19:44:41 mizar kernel: WARNING: hal.xyzzy (iqn.1995-06.xyzzy.hal:disk1): connection error; reconnecting
May 31 19:44:41 mizar iscsid[4220]: child process 4380 terminated with exit status 1

But device da0 on the client still exists. If I now do dd if=/dev/da0 of=/dev/null bs=1m count=1" on the client again, I get a non-interruptible hang until I make the target available again, and even then the hang only terminates after a timeout of 60 seconds:

May 31 19:45:42 mizar iscsid[4387]: hal.xyzzy (iqn.1995-06.xyzzy.hal:disk1): target returned error: Not found
May 31 19:45:42 mizar kernel: WARNING: hal.xyzzy (iqn.1995-06.xyzzy.hal:disk1): login timed out after 61 seconds; reconnecting
May 31 19:45:42 mizar iscsid[4220]: child process 4387 terminated with exit status 1
May 31 19:46:43 mizar kernel: WARNING: hal.xyzzy (iqn.1995-06.xyzzy.hal:disk1): login timed out after 61 seconds; reconnecting

The dd then succeeds.
Comment 7 Edward Tomasz Napierala freebsd_committer freebsd_triage 2017-05-31 19:37:06 UTC
This too is expected: by default, the initiator will handle errors by just reconnecting until it succeeds.  You don't want your iSCSI disks to disappear because of transient network problems, right?  ;-)

You can disable this behaviour by changing the kern.iscsi.fail_on_disconnection sysctl to 1, as described in man.freebsd.org/iscsi.
Comment 8 Martin Birgmeier 2017-06-01 16:30:38 UTC
Thanks for all the pointers!

-- Martin