Bug 194513 - zfs recv hangs in state kmem arena
Summary: zfs recv hangs in state kmem arena
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 10.0-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: freebsd-fs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-21 12:21 UTC by James Van Artsdalen
Modified: 2017-03-17 12:40 UTC
12 users

See Also:


Attachments
program to allocate 24GB RAM temporarily (355 bytes, text/x-csrc)
2014-10-27 01:57 UTC, James Van Artsdalen
procstat -kk -a (119.99 KB, text/plain)
2014-11-05 17:14 UTC, James Van Artsdalen
lowmem patch I attempted (566 bytes, patch)
2014-11-23 19:30 UTC, James Van Artsdalen
zpool history hang kgdb bt (2.31 KB, text/plain)
2014-11-23 19:48 UTC, James Van Artsdalen
utility to allocate memory to force defragmentation (419 bytes, text/plain)
2015-12-01 11:29 UTC, Steven Hartland
reboot doesn't get past this (205.36 KB, image/png)
2016-06-23 10:46 UTC, Twingly

Description James Van Artsdalen 2014-10-21 12:21:11 UTC
Every attempt to replicate my ZFS pool fails due to "zfs recv" blocking on "kmem arena".

This happens on two systems, with both 10-STABLE and -CURRENT.  It hangs within the first 5 TB, usually by 2 TB; last night it failed after 4.35 MB.

I tried using a memstick image of -CURRENT made from the release/
process and this also hangs in "kmem arena".

The two tested systems have 16GB and 32GB of ECC RAM.

An uninvolved server of mine hung Friday night in state "kmem arena"
during periodic's "zpool history".  After a reboot it did not hang
Saturday night.

The command that fails is

# zfs send -R BIGTEX/UNIX@snap | zfs recv -duvF BIGTOX
Comment 1 James Van Artsdalen 2014-10-27 01:57:07 UTC
Created attachment 148687 [details]
program to allocate 24GB RAM temporarily
Comment 2 James Van Artsdalen 2014-10-27 01:58:09 UTC
I was able to complete a ZFS replication by manually intervening each time "zfs recv" blocked on "kmem arena": running the attached program "eatmem" was sufficient to unblock zfs each of the 17 times zfs stalled.

Eatmem is intended to consume about 24GB RAM out of 32GB physical RAM, thereby pressuring the ARC and kernel cache to shrink: when the program exits it would leave plenty of free RAM for zfs or whatever else.  What actually happened is that every time, zfs unblocked as eatmem was growing: it was never necessary to wait for eatmem to exit and free memory before zfs unblocked.
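
For reference, a minimal sketch of what such a program can look like (illustrative only; the attached 355-byte eatmem may differ in details such as chunk size and output):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK	(1024UL * 1024 * 1024)	/* allocate in 1 GB steps */
#define TARGET	(24UL * CHUNK)		/* ~24 GB of 32 GB physical RAM */

int
main(void)
{
	size_t total;
	char *p;

	for (total = 0; total < TARGET; total += CHUNK) {
		if ((p = malloc(CHUNK)) == NULL) {
			perror("malloc");
			break;
		}
		/* Touch every page so the allocation is actually backed by RAM. */
		memset(p, 1, CHUNK);
		fprintf(stderr, "allocated %zu GB\n", (total + CHUNK) >> 30);
	}
	/* Exiting releases everything, leaving plenty of free RAM again. */
	return (0);
}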
Comment 3 James Van Artsdalen 2014-11-05 17:14:05 UTC
Created attachment 149075 [details]
procstat -kk -a


The same systems hang the same way on "zpool history", although the attached EATMEM does not seem to help.  Attached is the full "procstat -kk -a" output with a "zpool history" stuck.

SUPERTEX:/root# uname -a
FreeBSD SUPERTEX.housenet.jrv 10.1-PRERELEASE FreeBSD 10.1-PRERELEASE #3
r273476M: Wed Oct 22 15:05:29 CDT 2014    
root@SUPERTEX.housenet.jrv:/usr/obj/usr/src/sys/GENERIC  amd64
SUPERTEX:/root# procstat -kk -a > kk
SUPERTEX:/root# grep zio_wait kk
SUPERTEX:/root# grep zio_done kk
SUPERTEX:/root# grep zio_int kk
SUPERTEX:/root# grep zfs_ kk
    0 100475 kernel           zfs_vn_rele_task mi_switch+0xe1
sleepq_wait+0x3a _sleep+0x287 taskqueue_thread_loop+0xd5 fork_exit+0x9a
fork_trampoline+0xe
    0 101018 kernel           zfs_vn_rele_task mi_switch+0xe1
sleepq_wait+0x3a _sleep+0x287 taskqueue_thread_loop+0xd5 fork_exit+0x9a
fork_trampoline+0xe
    0 101522 kernel           zfs_vn_rele_task mi_switch+0xe1
sleepq_wait+0x3a _sleep+0x287 taskqueue_thread_loop+0xd5 fork_exit+0x9a
fork_trampoline+0xe
12105 101599 zpool            -                mi_switch+0xe1
sleepq_wait+0x3a _cv_wait+0x16d vmem_xalloc+0x568 vmem_alloc+0x3d
kmem_malloc+0x33 uma_large_malloc+0x49 malloc+0x43
zfs_ioc_pool_get_history+0xbd zfsdev_ioctl+0x5ca devfs_ioctl_f+0x139
kern_ioctl+0x255 sys_ioctl+0x13c amd64_syscall+0x351 Xfast_syscall+0xfb
SUPERTEX:/root# grep dmu_ kk
SUPERTEX:/root# grep arc_ kk
    3 100125 zfskern          arc_reclaim_thre mi_switch+0xe1
sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18b arc_reclaim_thread+0x288
fork_exit+0x9a fork_trampoline+0xe
    3 100126 zfskern          l2arc_feed_threa mi_switch+0xe1
sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18b l2arc_feed_thread+0x19f
fork_exit+0x9a fork_trampoline+0xe
SUPERTEX:/root# grep buf_ kk
   20 100139 bufdaemon        -                mi_switch+0xe1
sleepq_timedwait+0x3a _sleep+0x26e buf_daemon+0x78 fork_exit+0x9a
fork_trampoline+0xe
SUPERTEX:/root# ps lp 12105
UID   PID  PPID CPU PRI NI    VSZ   RSS MWCHAN   STAT TT     TIME COMMAND
  0 12105 12104   0  20  0 105692 35984 kmem are D     -  0:03.76 zpool
history BI
SUPERTEX:/root#
Comment 4 Andriy Gapon freebsd_committer freebsd_triage 2014-11-05 19:08:38 UTC
My personal opinion is that the problem is caused by a bug in the combination of the new vmem-based code and the changes in the page daemon code.

When there is not enough KVA, the code wakes up the page daemon with the expectation that it will make some more KVA available, but the page daemon may not actually take any action.

Previously the page daemon code checked the return value from msleep() and made a page-out pass if it was woken up.  Now the page daemon performs a pass only when it is woken up *and* vm_pages_needed is set.  As the comment before pagedaemon_wakeup() explains, that function is not guaranteed to actually wake up the page daemon unless vm_page_queue_free_mtx is held.  And kmem_reclaim() does not hold vm_page_queue_free_mtx when it calls pagedaemon_wakeup().

Additionally, before the switch to vmem, kmem_malloc() used to directly invoke the vm_lowmem hook and uma_reclaim() instead of trying to wake up the page daemon.

So the old code would reliably free some KVA if there was any that could be freed by the vm_lowmem hook and uma_reclaim(),
whereas the new code merely makes a lame attempt to wake up the page daemon.

I believe that the above explains why you sometimes see processes stuck in vmem_xalloc() and why your workaround works: when you really force the page daemon to make a page-out pass, it finally frees some KVA by invoking the vm_lowmem hook and uma_reclaim().

I see two trivial possible solutions:
- hold vm_page_queue_free_mtx in kmem_reclaim() around the pagedaemon_wakeup() call
- directly call the vm_lowmem hook and uma_reclaim() instead of pagedaemon_wakeup() in kmem_reclaim()

Not sure which one would be better.  Maybe there is an even better solution.
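
For illustration, a minimal sketch of the first option, assuming kmem_reclaim() in sys/vm/vm_kern.c is the vmem reclaim callback that currently just calls pagedaemon_wakeup() (not a tested patch):

static void
kmem_reclaim(vmem_t *vm, int flags)
{

	/*
	 * Holding the free page queue lock makes pagedaemon_wakeup()
	 * reliable, per the comment before that function (option 1 above).
	 */
	mtx_lock(&vm_page_queue_free_mtx);
	pagedaemon_wakeup();
	mtx_unlock(&vm_page_queue_free_mtx);
}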
Comment 5 James Van Artsdalen 2014-11-23 19:30:42 UTC
Created attachment 149752 [details]
lowmem patch I attempted

I attempted a naive patch based on avg's comments (attached).  I haven't seen any "zfs recv" issues since, but I have seen a hang in "zfs send", and the "zpool history" hang still happens.
Comment 6 James Van Artsdalen 2014-11-23 19:48:09 UTC
Created attachment 149753 [details]
zpool history hang kgdb bt

Attached is a kgdb bt of the "zpool history" thread.  The allocation that hung is for just over 10 MB.  The system (32 GB RAM) had been idle, but the daily periodic run was in progress when it hung on "kmem arena".
Comment 7 nagy.attila 2014-12-02 09:23:49 UTC
I've just found this entry. We're having the same issue, running stable/10@r274502.

Any chance for a fix?
Comment 8 Alan Cox freebsd_committer freebsd_triage 2015-01-02 19:58:28 UTC
I suspect that there are two distinct issues at play here.  I believe that Andriy has correctly identified a synchronization error in kmem_reclaim().  However, the size of the failed allocations, 10 MB, is also interesting.  While vmem's allocation strategy is better at reducing external fragmentation than the simplistic "first-fit" strategy that we previously used, it is not immune to external fragmentation.  Thus, kmem arena may not have 10 MB of contiguous virtual address space even though there is plenty of free space.

I suggest that you try increasing the kmem arena size to 1.25 or even 1.5 times the physical memory size and report back on the effect.
Comment 9 Palle Girgensohn freebsd_committer freebsd_triage 2015-02-09 13:48:10 UTC
Hi, we're seeing this problem as well. Is it still a problem in CURRENT?

Regarding Alan's suggestion to "try increasing the kmem arena size": is that vm.kmem_size?

$ sysctl -h hw.physmem  vm.kmem_size vm.kmem_size_max
hw.physmem: 8 541 974 528
vm.kmem_size: 8 295 661 568
vm.kmem_size_max: 1 319 413 950 874
$

Does that mean setting vm.kmem_size = 1.5 * hw.physmem ?

Palle
Comment 10 Steven Hartland freebsd_committer freebsd_triage 2015-04-03 15:38:30 UTC
The following commit may have a positive effect on this issue:
https://svnweb.freebsd.org/base?view=revision&revision=281026
Comment 11 nagy.attila 2015-04-03 20:27:07 UTC
(In reply to Palle Girgensohn from comment #9)
BTW, yes, vm.kmem_size. Raising that seemed to help here.
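
For example, on a 32 GB machine the 1.25-1.5x sizing Alan suggested would amount to something like the following in /boot/loader.conf (illustrative value; it is a loader tunable, so a reboot is needed for it to take effect):

vm.kmem_size="48G"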
Comment 12 Steven Hartland freebsd_committer freebsd_triage 2015-11-30 12:33:53 UTC
It is also likely that the following commit helped prevent this issue:
https://svnweb.freebsd.org/base?view=revision&revision=282690

I'm closing this for now; if anyone sees this on stable/10 > r283310, 10.2-RELEASE, or above, feel free to reopen.
Comment 13 mgoroff 2015-12-01 01:24:30 UTC
We are encountering what appears to be the same issue on a large production server running 10.2-RELEASE. The machine has 128G of RAM and vfs.zfs.arc_max is set to 100G. I have a zfs recv currently stuck waiting on kmem arena. When this problem occurs, the ARC starts rapidly shrinking down to vfs.zfs.arc_min (13G in our case) but the process will remain blocked for a long period of time (hours) even though top shows free memory at 104G.

ps -l for the process shows:

UID   PID  PPID CPU PRI NI   VSZ  RSS MWCHAN   STAT TT     TIME COMMAND
  0 37892 37814   0  20  0 42248 3428 kmem are D     1  0:04.16 zfs recv -u -d ezdata2

while procstat -kk -p 37892 shows:

  PID    TID COMM             TDNAME           KSTACK                       
37892 101149 zfs              -                mi_switch+0xe1 sleepq_wait+0x3a _cv_wait+0x16d vmem_xalloc+0x568 vmem_alloc+0x3d kmem_malloc+0x33 uma_large_malloc+0x49 malloc+0x43 dmu_recv_stream+0x114 zfs_ioc_recv+0x955 zfsdev_ioctl+0x5ca devfs_ioctl_f+0x139 kern_ioctl+0x255 sys_ioctl+0x140 amd64_syscall+0x357 Xfast_syscall+0xfb
Comment 14 Steven Hartland freebsd_committer freebsd_triage 2015-12-01 08:59:31 UTC
Thanks for the update Marc, reopened.
Comment 15 Steven Hartland freebsd_committer freebsd_triage 2015-12-01 11:29:40 UTC
Created attachment 163702 [details]
utility to allocate memory to force defragmentation

This is a little utility that allocates memory, creating memory pressure in an attempt to force de-fragmentation as the ARC frees up aggressively.

It's based on James Van Artsdalen's initial utility, with fixes for compilation as well as added progress output and an unbounded allocation, which is now done in 1 GB chunks.
Comment 16 Palle Girgensohn freebsd_committer freebsd_triage 2016-05-31 10:07:14 UTC
We are seeing this regularly on a machine with
34 GB RAM
34 TB storage JBOD large array (ZFS)
*no* SSD ZIL or L2Arc.

ssh othermachine zfs send fs | zfs receive fs

sometimes hangs. Not often, perhaps once a month or so.

Always in kmem arena just like mgoroff@quorum.net describes above.

I tried Steven's program to allocate memory, but it stayed stuck.

This is a backup machine, so I can always just reboot, unless someone has an idea of something to try first that could perhaps explain more about what's happening.

Palle
Comment 17 Twingly 2016-06-23 10:09:40 UTC
We are also seeing this on our backup server
64 GB RAM
4x6TB (zpool with 2 mirrors)
*no* SSD ZIL or L2Arc

zfs send ... | ssh backupserver zfs receive -u ...

often hangs, and I have to reboot the server to get rid of the hung process and be able to receive datasets again

no problem with 'zpool history'; it runs fine


$ uname -a
FreeBSD hodor.live.lkp.primelabs.se 10.2-RELEASE-p18 FreeBSD 10.2-RELEASE-p18 #0: Sat May 28 08:53:43 UTC 2016     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

$ freebsd-version
10.2-RELEASE-p19

$ ps -l -U backup
 UID   PID PPID CPU PRI NI   VSZ  RSS MWCHAN   STAT TT     TIME COMMAND
1006 59745    1   0  20  0 42248 3164 kmem are Ds    -  0:03.81 zfs receive -u storage/backup/mysql/leaf/data

$ procstat 59745
  PID  PPID  PGID   SID  TSID THR LOGIN    WCHAN     EMUL          COMM
59745     1 59745 59745     0   1 backup   kmem are  FreeBSD ELF64 zfs

$ sudo procstat -kk 59745
Password:
  PID    TID COMM             TDNAME           KSTACK
59745 100809 zfs              -                mi_switch+0xe1 sleepq_wait+0x3a _cv_wait+0x16d vmem_xalloc+0x568 vmem_alloc+0x3d kmem_malloc+0x33 uma_large_malloc+0x49 malloc+0x43 dmu_recv_stream+0x114 zfs_ioc_recv+0x955 zfsdev_ioctl+0x5ca devfs_ioctl_f+0x139 kern_ioctl+0x255 sys_ioctl+0x140 amd64_syscall+0x357 Xfast_syscall+0xfb

$ sysctl -h hw.physmem  vm.kmem_size vm.kmem_size_max
hw.physmem: 68,578,963,456
vm.kmem_size: 66,774,925,312
vm.kmem_size_max: 1,319,413,950,874


I will look into raising vm.kmem_size, but right now I have to reboot and hopefully get some backups going...
Comment 18 Twingly 2016-06-23 10:46:33 UTC
Created attachment 171704 [details]
reboot doesn't get past this

When zfs receive has hung, 'reboot' won't reboot the machine; it just hangs, and I have to reset the machine via remote management.

'reboot' works fine on this machine when there are no hung zfs receive processes.
Comment 19 Twingly 2016-07-06 15:01:19 UTC
Some additional information: I left 4 hung zfs receive processes overnight, and in the morning they had completed. My shell reported two of them running for 10h 35m 26s and 10h 38m 50s. Note that this was for newly created datasets without any data on them, so shouldn't they have completed instantly?
Comment 20 Palle Girgensohn freebsd_committer freebsd_triage 2016-09-05 14:45:19 UTC
Hi,

I just saw this problem again: zfs recv hung in kmem arena.

This was a very common problem for us in the past, but after upgrading to 10.3 and getting SSDs for ZIL and L2ARC, the problem has drastically diminished. It still happens occasionally, but it is rare.

FreeBSD 10.3
Comment 21 Dave Baukus 2017-03-02 00:09:38 UTC
(In reply to Steven Hartland from comment #10)

Is it possible that the referenced change does not have enough of a positive effect on zfs receive hanging in kmem arena, because zfs receive is attempting to allocate SPA_MAXBLOCKSIZE (16M) while the fragmentation check is looking for fragmentation below zfs_max_recordsize (1M)?

/*
 * Above limits know nothing about real level of KVA fragmentation.
 * Start aggressive reclamation if too little sequential KVA left.
 */
if (vmem_size(heap_arena, VMEM_MAXFREE) < zfs_max_recordsize)
       return (1);
Comment 22 Steven Hartland freebsd_committer freebsd_triage 2017-03-09 16:23:09 UTC
(In reply to Dave Baukus from comment #21)
This is possible. If you are seeing this issue, do try increasing it to SPA_MAXBLOCKSIZE to see if it helps.
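
That is, presumably changing the check quoted in comment #21 to compare against the larger constant, roughly:

if (vmem_size(heap_arena, VMEM_MAXFREE) < SPA_MAXBLOCKSIZE)
       return (1);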
Comment 23 commit-hook freebsd_committer freebsd_triage 2017-03-17 12:36:06 UTC
A commit references this bug:

Author: smh
Date: Fri Mar 17 12:34:57 UTC 2017
New revision: 315449
URL: https://svnweb.freebsd.org/changeset/base/315449

Log:
  Reduce ARC fragmentation threshold

  As ZFS can request up to SPA_MAXBLOCKSIZE memory block e.g. during zfs recv,
  update the threshold at which we start agressive reclamation to use
  SPA_MAXBLOCKSIZE (16M) instead of the lower zfs_max_recordsize which
  defaults to 1M.

  PR:		194513
  Reviewed by:	avg, mav
  MFC after:	1 month
  Sponsored by:	Multiplay
  Differential Revision:	https://reviews.freebsd.org/D10012

Changes:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
Comment 24 Steven Hartland freebsd_committer freebsd_triage 2017-03-17 12:40:09 UTC
Closing for now, as I've committed the change to reduce the ARC fragmentation threshold.

If this is still reproducible with this change, we can reopen.