219556 – iSCSI subsystem blindly announces non-existing file (was: emulators/virtualbox-ose: VBoxSVC consumes 100% CPU and does not exit)

Bug 219556 - iSCSI subsystem blindly announces non-existing file (was: emulators/virtualbox-ose: VBoxSVC consumes 100% CPU and does not exit)

Summary: iSCSI subsystem blindly announces non-existing file (was: emulators/virtualbo...

Status:	Closed Not A Bug

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern (show other bugs)
Version:	11.0-STABLE
Hardware:	Any Any

Importance:	--- Affects Only Me
Assignee:	Virtualbox Team (Nobody)

URL:
Keywords:

Depends on:
Blocks:

Reported:	2017-05-25 20:54 UTC by Martin Birgmeier
Modified:	2017-06-01 16:30 UTC (History)
CC List:	2 users (show)

See Also:

Flags:	bugzilla: maintainer-feedback? (vbox)

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Martin Birgmeier 2017-05-25 20:54:26 UTC

Since a few days (I believe since the upgrade of mesa to 17.1.0) VBoxSVC continuously consumes 100% CPU and does not shut down after all sessions have closed.

This happens if any graphical window is used (VirtualBox). If using only VBoxHeadless, VBoxSvc behaves normally.

Also, this happens on one of two systems only: The affected one has an Intel CPU and uses an Nvidia graphics card + proprietary driver, whereas the normal one uses an AMD CPU and displays graphical output over the network (X11 connection).

# top -Utest
last pid:  9924;  load averages:  0.88,  0.90,  1.40    up 0+05:30:35  22:43:46
122 processes: 2 running, 119 sleeping, 1 zombie
CPU:  0.5% user, 12.5% nice,  0.4% system,  0.0% interrupt, 86.5% idle
Mem: 820M Active, 2747M Inact, 5324M Wired, 109M Buf, 15G Free
ARC: 2998M Total, 618M MFU, 2329M MRU, 444K Anon, 11M Header, 38M Other
Swap: 9216M Total, 9216M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 9916 test         14  25    5   172M 50820K select  5   0:15 103.65% VBoxSVC
 9911 test          9  26    5   394M   107M select  2   0:01   0.08% VirtualBo
 9917 test          1  20    0 30364K  7076K CPU5    5   0:00   0.05% top
 9914 test          1  25    5   123M 43400K select  0   0:00   0.02% VBoxXPCOM
 9898 test          1  20    0 75696K 13228K select  0   0:00   0.01% xterm
 9899 test          1  20    0 35068K  8796K pause   4   0:00   0.00% zsh

When exiting VirtualBox via the "exit" menu command, the window vanishes, but VBoxSVC continues to run at 100% CPU, and the VirtualBox process remains (at 0%). In this case, VBoxSVC can be killed -1, which will then also terminate VirtualBox and VBoxXPCOM.

All ports up-to-date as of now.

Comment 1 Martin Birgmeier 2017-05-26 08:12:10 UTC

The problem could lie with the preview in VirtualBox - it stays blank (black) on the machine where VBoxSVC consumes 100% CPU.

Comment 2 Martin Birgmeier 2017-05-26 14:00:53 UTC

After some experimentation, the problem turned out to have a quite different cause.

On both hosts, the same emulated harddisk is available: on the AMD machine directly via a vmdk file pointing, on the Intel machine via iSCSI.

However, the file holding the emulated harddisk was not available. On the AMD machine, this only caused VirtualBox to note that the machine cannot be started. On the Intel machine however, the file seems to get exported as a zero-sized disk; this in turn seems to trip up VBoxSVC.

So there are two issues: One, VBoxSVC gets confused about the announced but not really existing iSCSI target, and second, the iSCSI daemon blindly announces to export a file (with zero size) which does not exist (and cannot be created, its path not being available).

I think the bigger problem lies with the iSCSI subsystem announcing a target which does not exist. I'll try to reassing the issue to the base system.

Comment 3 Edward Tomasz Napierala freebsd_committer

2017-05-27 07:09:36 UTC

I think it's a feature, not a bug.  A nonexisting file is a valid LUN, but without backing media (think an empty DVD drive).  SCSI compliant clients are supposed to know how to handle this.

Comment 4 Martin Birgmeier 2017-05-27 11:31:35 UTC

Hmmm... but if I then make the backing file available (by mounting the filesystem on which it resides), this fact is not reflected when I run 'ctladm devlist'. The same goes for the client: Attaching the disk using 'iscsictl -A' indeed first shows that the disk is not ready:

da0: Attempt to query device size failed: NOT READY, Logical unit not ready, manual

But after mounting the backing file, this does not change.

-- Martin

Comment 5 Edward Tomasz Napierala freebsd_committer

2017-05-30 09:21:51 UTC

You need to notify CTL that something has changed by doing "service ctld reload".  Could you check whether it works and report back?  Thanks!

Comment 6 Martin Birgmeier 2017-05-31 17:54:01 UTC

This indeed works, I do not even need to rescan the bus on the client. The following message is logged when trying "dd if=/dev/da0 of=/dev/null bs=1m count=1" on the client after making the target available:

May 31 19:43:46 mizar kernel: (da0:iscsi1:0:0:0): Capacity data has changed

However, I then proceeded to remove the target from the configuration. On the client this leads to the message

May 31 17:44:41 mizar iscsid[4380]: hal.xyzzy (iqn.1995-06.xyzzy.hal:disk1): target returned error: Not found
May 31 19:44:41 mizar kernel: WARNING: hal.xyzzy (iqn.1995-06.xyzzy.hal:disk1): connection error; reconnecting
May 31 19:44:41 mizar iscsid[4220]: child process 4380 terminated with exit status 1

But device da0 on the client still exists. If I now do dd if=/dev/da0 of=/dev/null bs=1m count=1" on the client again, I get a non-interruptible hang until I make the target available again, and even then the hang only terminates after a timeout of 60 seconds:

May 31 19:45:42 mizar iscsid[4387]: hal.xyzzy (iqn.1995-06.xyzzy.hal:disk1): target returned error: Not found
May 31 19:45:42 mizar kernel: WARNING: hal.xyzzy (iqn.1995-06.xyzzy.hal:disk1): login timed out after 61 seconds; reconnecting
May 31 19:45:42 mizar iscsid[4220]: child process 4387 terminated with exit status 1
May 31 19:46:43 mizar kernel: WARNING: hal.xyzzy (iqn.1995-06.xyzzy.hal:disk1): login timed out after 61 seconds; reconnecting

The dd then succeeds.

Comment 7 Edward Tomasz Napierala freebsd_committer

2017-05-31 19:37:06 UTC

This too is expected: by default, the initiator will handle errors by just reconnecting until it succeeds.  You don't want your iSCSI disks to disappear because of transient network problems, right?  ;-)

You can disable this behaviour by changing the kern.iscsi.fail_on_disconnection sysctl to 1, as described in man.freebsd.org/iscsi.

Comment 8 Martin Birgmeier 2017-06-01 16:30:38 UTC

Thanks for all the pointers!

-- Martin