Bug 229986 - Terminating virtualbox or ctld + zfs scrub + gstat leads to deadlock
Summary: Terminating virtualbox or ctld + zfs scrub + gstat leads to deadlock
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 11.2-RELEASE
Hardware: Any
OS: Any
Importance: --- Affects Only Me
Assignee: freebsd-fs (Nobody)
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2018-07-23 18:08 UTC by Martin Birgmeier
Modified: 2020-05-30 11:07 UTC
CC List: 3 users

See Also:


Attachments
output of running "procstat -kk -a" (178.33 KB, text/plain)
2018-07-23 18:36 UTC, Martin Birgmeier

Description Martin Birgmeier 2018-07-23 18:08:07 UTC
Scenario:
- Machine recently upgraded to 11.2, but with https://reviews.freebsd.org/D7538 applied because of excessive swapping
- Machine has a raidz2 pool consisting of six 1.5 TB partitions
- Machine acts as VirtualBox host to a FreeBSD 11.2 client
- Run "zpool scrub" on the pool
- Run vbox with 11.2 client
- Shut down the client with "shutdown -p now" (from within the client)
- In another terminal window, gstat is running

Result:
- Gstat hangs
- vbox client hangs at the very end (power down not executed)

Expected result:
- Gstat should continue running
- Vbox client should stop executing

Scenario continued:
- Pause the scrub with "zpool scrub -p"

Result:
- After a few seconds, gstat continues and the vbox client terminates

This behavior is 100% reproducible and did not happen with 11.1.
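
In shell terms, the whole sequence is roughly the following (pool and VM names are placeholders):

# terminal 1: keep gstat running the whole time
gstat

# terminal 2: start the scrub and the client VM
zpool scrub tank
VBoxManage startvm "fbsd-11.2-client" --type headless

# inside the client: shut it down
shutdown -p now
# -> gstat hangs and the client never finishes powering off

# workaround: pause the scrub; after a few seconds both unblock
zpool scrub -p tank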
Comment 1 Martin Birgmeier 2018-07-23 18:36:18 UTC
Created attachment 195395 [details]
output of running "procstat -kk -a"

Just ran another VBox client with FreeBSD head@r334442. This time grabbed "procstat -kk -a".

I hope a knowledgeable person can spot the issue.

-- Martin
Comment 2 Martin Birgmeier 2018-07-23 19:03:33 UTC
I just noticed that the same hang happens when trying to stop the ctld.

Scenario:
- Machine recently upgraded to 11.2, but with https://reviews.freebsd.org/D7538 applied because of excessive swapping
- Machine has a raidz2 pool consisting of six 1.5 TB partitions
- Machine acts as iSCSI server to remote VirtualBox instances
- Run "zpool scrub" on the pool
- In another terminal window, gstat is running
- Stop the iSCSI server using "service ctld stop"

Result:
- gstat hangs
- "service ctld stop" does not terminate

Expected result:
- gstat should continue running
- "service ctld stop" should return

Scenario continued:
- Pause the scrub with "zpool scrub -p"

Result:
- After a few seconds, gstat continues and "service ctld stop" terminates

Remark:
- Both the VBox client running directly on the machine and ctld reference zvols, so maybe the issue is related to that.
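
In shell terms (pool name again a placeholder), the only difference to the first scenario is the last step:

# with "zpool scrub tank" running and gstat open in another terminal:
service ctld stop    # hangs, together with gstat
zpool scrub -p tank  # pausing the scrub unblocks both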
Comment 3 Andriy Gapon freebsd_committer 2018-07-23 19:22:23 UTC
There are some interesting threads in the procstat output, but I do not see anything like a deadlock.

    0 100255 kernel              zio_write_issue_2   mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e zio_wait+0x8b dmu_buf_hold_array_by_dnode+0x232 dmu_read_impl+0xa6 dmu_read+0x45 space_map_iterate+0x103 space_map_load+0x91 metaslab_load+0x3f metaslab_alloc_dva+0x91b metaslab_alloc+0x98 zio_dva_allocate+0x82 zio_execute+0xac taskqueue_run_locked+0x154 taskqueue_thread_loop+0x98 fork_exit+0x83 fork_trampoline+0xe
    9 100461 zfskern             txg_thread_enter    mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e zio_wait+0x8b dsl_pool_sync+0x3a4 spa_sync+0x9c5 txg_sync_thread+0x280 fork_exit+0x83 fork_trampoline+0xe

   12 100018 geom                g_event             mi_switch+0xe6 sleepq_wait+0x2c _cv_wait+0x16e txg_wait_synced+0xa5 zil_close+0xbf zvol_last_close+0x15 zvol_geom_access+0x17a g_access+0x1fd g_label_taste+0x415 g_new_provider_event+0xca g_run_events+0x13e fork_exit+0x83 fork_trampoline+0xe

49368 101268 VBoxHeadless        -                   mi_switch+0xe6 sleepq_timedwait+0x2f _sleep+0x21e g_waitidle+0xa1 userret+0x50 amd64_syscall+0x23a fast_syscall_common+0x101

 8922 101185 gstat               -                   mi_switch+0xe6 sleepq_timedwait+0x2f _sleep+0x21e g_waitfor_event+0xc3 sysctl_kern_geom_confxml+0x39 sysctl_root_handler_locked+0x8b sysctl_root+0x1c3 userland_sysctl+0x136 sys___sysctl+0x5f amd64_syscall+0xa1f fast_syscall_common+0x101

Thread 100018 is the most interesting.
We can see that there is a GEOM new provider event. It naturally leads to the GEOM tasting.  We see that g_label_taste has finished its tasting and closes the zvol, so zvol_last_close is called.  That leads to zil_close and txg_wait_synced.  I am not sure why the last call is made as I would not expect any writes to the volume after it's been closed by VBox.
Anyway, the thread waits on the sync thread to sync out the current TXG.

The sync thread, 100461, is waiting for an I/O request to complete.

gstat and VBoxHeadless are waiting for the GEOM event to be processed (by thread 100018).

So it looks like scrub I/O introduces a significant delay into GEOM event handling via the ZFS sync thread.
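
For anyone looking at this on a live system, the interesting threads can be picked out by filtering the same procstat output for the wait points seen in the stacks above, e.g.:

procstat -kk -a | egrep 'txg_wait_synced|zio_wait|g_waitidle|g_waitfor_event|g_run_events'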
Comment 4 Martin Birgmeier 2018-07-24 16:28:51 UTC
Thank you Andriy for this first analysis.

A few days ago I must have had a similar situation which, however, I did not notice quickly enough. After a while the complete system became unresponsive and I eventually had to hard reset it.

This is not funny, because right from booting the interrupted scrub resumes, which increases the startup time to around 20 minutes - see bug #229829.

I have the impression that something is very fishy with 11.2, zfs, and memory management.

What should I make of the information posted in https://lists.freebsd.org/pipermail/freebsd-stable/2018-June/089151.html?

-- Martin
Comment 5 Martin Birgmeier 2018-07-24 16:34:58 UTC
I have one more question, not being an expert in kernel matters: As far as I understand, several consumers compete for real memory, among them certainly the running processes themselves, the unified buffer cache, the ZFS ARC (which is not part of the unified cache, if I understand correctly), and processes with special demands like VirtualBox which wire real memory.

Shouldn't there be a central arbiter which apportions this real memory according to need? It certainly leads nowhere if every consumer computes its own view of its memory needs, views which ultimately don't fit together, leading for example to the ARC consuming so much memory that other consumers have to start swapping, or worse, to the ARC itself being swapped out (I guess this cannot really happen).

I'd really like to be enlightened about how this area of the kernel works.

-- Martin
Comment 6 Rodney W. Grimes freebsd_committer 2018-08-02 16:30:56 UTC
If you have not done so, I would highly encourage you to reduce the maximum size of the ARC and see if your problems go away. Reduce it by the size of any VMs you are running in bhyve or VirtualBox.

Add something like this to /boot/loader.conf:
vfs.zfs.arc_max=6442450944
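
To check where the ARC currently stands, and to apply a limit without a reboot (6 GB here is just the example value above), something like this should work:

# current ARC size vs. configured maximum
sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_max
# lower the limit at run time; the ARC shrinks towards it over time
sysctl vfs.zfs.arc_max=6442450944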
Comment 7 Martin Birgmeier 2018-08-02 16:37:26 UTC
I am already doing that, but the deadlock described here happens nonetheless.

What does get better is that the machine no longer locks up completely (see comment #4).

-- Martin
Comment 8 Rodney W. Grimes freebsd_committer 2018-08-02 19:10:39 UTC
(In reply to Martin Birgmeier from comment #7)
Are you sure that you are not in a memory deadlock?
You say you applied an ARC limit; by how much did you reduce it? How big are your VMs? If you run top, how much free memory do you have? Is your system using any swap at all?
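
Generic commands that would answer these questions (nothing specific to this bug):

top -b | head -n 12                  # free/wired memory and the ARC line
swapinfo -h                          # swap usage
sysctl kstat.zfs.misc.arcstats.size  # actual ARC size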
Comment 9 Martin Birgmeier 2018-08-03 08:57:05 UTC
Actually I am quite sure that the system *is* in a kind of memory deadlock, at least in the case where it ultimately gets stuck completely (comment #4). In that case, if I remember correctly, it also used quite some swap (around 5% of the 20G configured).

But the issue of scrubbing interacting badly with shutting down VMs and stopping ctld happens without much memory pressure.

Specifically, I have:
- 16G main memory; this initially gives vfs.zfs.arc_max: 15371956224, which is high, considering that dmesg.boot shows
real memory  = 17179869184 (16384 MB)
avail memory = 16344322048 (15587 MB)
- I then set vfs.zfs.arc_max=9000000000 via sysctl
- The VM running directly on this machine has 3 GB of simulated physical memory

The scrub has completed in the meantime, and I don't really want to restart it, as it took 360 hours in total and was very taxing on the disks. So it is difficult to replay the scenario now.

However, something that is still happening, and may again be related to memory pressure: when I stop ctld using 'service ctld onestop' (see comment #2) after many hours of operation of this machine, the operation may take several minutes, in contrast to 11.1. I believe this happens especially if I forget to reduce arc_max.

Again this looks to me like memory pressure, with the system taking a long time to free up the memory that seems to be required for this operation.

Maybe it is related to zvols; both the VM running directly on the machine and ctld reference zvols.

-- Martin
Comment 10 Andriy Gapon freebsd_committer 2018-08-03 09:10:57 UTC
Martin,
by the way, do you have any non-default settings related to scrub performance?
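
For reference, the scrub/scan related tunables (exact names vary between FreeBSD versions) can be listed with something like:

sysctl vfs.zfs | grep -iE 'scrub|scan|resilver'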
Comment 11 Martin Birgmeier 2018-08-03 10:28:41 UTC
Andriy,

Not really. sysctl.conf is basically empty. Maybe these lines are of interest:

# PostgreSQL
# (see http://people.freebsd.org/~seanc/postgresql/postgresql-kern-autotune.sh)
# tuned to 8 GB for shared mem
kern.ipc.shmall=2097152
kern.ipc.shmmax=8589934592

But postgres is idle on this machine, and ipcs -a shows the following for shared memory:

Shared Memory:
T           ID          KEY MODE        OWNER    GROUP    CREATOR  CGROUP         NATTCH        SEGSZ         CPID         LPID ATIME    DTIME    CTIME   
m        65536      5432001 --rw------- postgres postgres postgres postgres            5           48          727          727  8:00:08 12:27:13  8:00:08
m       196609            0 --rwa------ kdm      kdm      kdm      kdm                 2      7225344         2224         2196  8:00:30 no-entry  8:00:30
m        65538            0 --rwa------ kdm      kdm      kdm      kdm                 2      5242880         2250         2222  8:00:30 no-entry  8:00:30

-- Martin