Summary:   zfs recv hangs in state kmem arena
Product:   Base System
Reporter:  James Van Artsdalen <james-freebug>
Component: kern
Assignee:  freebsd-fs (Nobody) <fs>
Status:    Closed FIXED
Severity:  Affects Some People
CC:        alc, avg, daveb, delphij, girgen, jeffr, mgoroff, nagy.attila, ota, pi, shared+bugs.freebsd.org, smh
Priority:  ---
Version:   10.0-STABLE
Hardware:  Any
OS:        Any
Description
James Van Artsdalen
2014-10-21 12:21:11 UTC
Created attachment 148687 [details]
program to allocate 24GB RAM temporarily
I was able to complete a ZFS replication by manually intervening each time "zfs recv" blocked on "kmem arena": running the attached program "eatmem" was sufficient to unblock zfs each of the 17 times zfs stalled. Eatmem is intended to consume about 24GB RAM out of 32GB physical RAM, thereby pressuring the ARC and kernel caches to shrink; when the program exits it leaves plenty of free RAM for zfs or whatever else. What actually happened is that every time, zfs unblocked while eatmem was still growing: it was never necessary to wait for eatmem to exit and free memory before zfs unblocked.

Created attachment 149075 [details]
procstat -kk -a

The same systems hang the same way on "zpool history", although the attached EATMEM does not seem to help. Attached is the full "procstat -kk -a" output with a "zpool history" stuck.

SUPERTEX:/root# uname -a
FreeBSD SUPERTEX.housenet.jrv 10.1-PRERELEASE FreeBSD 10.1-PRERELEASE #3 r273476M: Wed Oct 22 15:05:29 CDT 2014     root@SUPERTEX.housenet.jrv:/usr/obj/usr/src/sys/GENERIC  amd64
SUPERTEX:/root# procstat -kk -a > kk
SUPERTEX:/root# grep zio_wait kk
SUPERTEX:/root# grep zio_done kk
SUPERTEX:/root# grep zio_int kk
SUPERTEX:/root# grep zfs_ kk
    0 100475 kernel           zfs_vn_rele_task mi_switch+0xe1 sleepq_wait+0x3a _sleep+0x287 taskqueue_thread_loop+0xd5 fork_exit+0x9a fork_trampoline+0xe
    0 101018 kernel           zfs_vn_rele_task mi_switch+0xe1 sleepq_wait+0x3a _sleep+0x287 taskqueue_thread_loop+0xd5 fork_exit+0x9a fork_trampoline+0xe
    0 101522 kernel           zfs_vn_rele_task mi_switch+0xe1 sleepq_wait+0x3a _sleep+0x287 taskqueue_thread_loop+0xd5 fork_exit+0x9a fork_trampoline+0xe
12105 101599 zpool            -                mi_switch+0xe1 sleepq_wait+0x3a _cv_wait+0x16d vmem_xalloc+0x568 vmem_alloc+0x3d kmem_malloc+0x33 uma_large_malloc+0x49 malloc+0x43 zfs_ioc_pool_get_history+0xbd zfsdev_ioctl+0x5ca devfs_ioctl_f+0x139 kern_ioctl+0x255 sys_ioctl+0x13c amd64_syscall+0x351 Xfast_syscall+0xfb
SUPERTEX:/root# grep dmu_ kk
SUPERTEX:/root# grep arc_ kk
    3 100125 zfskern          arc_reclaim_thre mi_switch+0xe1 sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18b arc_reclaim_thread+0x288 fork_exit+0x9a fork_trampoline+0xe
    3 100126 zfskern          l2arc_feed_threa mi_switch+0xe1 sleepq_timedwait+0x3a _cv_timedwait_sbt+0x18b l2arc_feed_thread+0x19f fork_exit+0x9a fork_trampoline+0xe
SUPERTEX:/root# grep buf_ kk
   20 100139 bufdaemon        -                mi_switch+0xe1 sleepq_timedwait+0x3a _sleep+0x26e buf_daemon+0x78 fork_exit+0x9a fork_trampoline+0xe
SUPERTEX:/root# ps lp 12105
UID   PID  PPID CPU PRI NI    VSZ   RSS MWCHAN   STAT TT     TIME COMMAND
  0 12105 12104   0  20  0 105692 35984 kmem are D     -  0:03.76 zpool history BI
SUPERTEX:/root#

My personal opinion is that the problem is caused by a bug in the combination of the new vmem-based code and the changes in the page daemon code. When there is not enough KVA, the code wakes up the page daemon with the expectation that it will make some more KVA available, but the page daemon may not actually take any action. Previously the page daemon code checked the return value from msleep and made a page-out pass whenever it was woken up. Now the page daemon performs a pass only when it is woken up *and* vm_pages_needed is set. As the comment before pagedaemon_wakeup() explains, that function is not guaranteed to actually wake up the page daemon unless vm_page_queue_free_mtx is held, and kmem_reclaim() does not hold vm_page_queue_free_mtx when it calls pagedaemon_wakeup(). Additionally, before the switch to vmem, kmem_malloc() used to invoke the vm_lowmem hook and uma_reclaim() directly rather than trying to wake up the page daemon.
So, the old code would reliably free some KVA if there was any that could be freed by the vm_lowmem hook and uma_reclaim(), whereas the new code makes only a lame attempt to wake up the page daemon. I believe that the above explains why you sometimes see processes stuck in vmem_xalloc() and why your workaround works: when you really force the page daemon to make a page-out pass, it finally frees some KVA by invoking the vm_lowmem hook and uma_reclaim().

I see two trivial possible solutions:
- hold vm_page_queue_free_mtx in kmem_reclaim() around the pagedaemon_wakeup() call
- directly call the vm_lowmem hook and uma_reclaim() instead of pagedaemon_wakeup() in kmem_reclaim()

Not sure which one would be better. Maybe there is an even better solution.

Created attachment 149752 [details]
lowmem patch I attempted
I attempted a naive patch based on avg's comments: patch attached. I haven't seen any "zfs recv" issues since but I have seen a hang in "zfs send", and the "zfs history" hang still happens.
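For readers following along, avg's second suggested solution amounts to something like the sketch below. This is only an illustration of the idea against stable/10's sys/vm/vm_kern.c, not the attached patch; the include list and the exact kmem_reclaim() signature are assumptions and should be checked against the real source.

    /*
     * Hypothetical sketch, not the attached patch: have kmem_reclaim()
     * run the low-memory handlers directly instead of relying on a
     * page daemon wakeup that may never arrive.
     */
    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/eventhandler.h>
    #include <sys/vmem.h>
    #include <vm/uma.h>

    static void
    kmem_reclaim(vmem_t *vmem, int flags)
    {

            /* Fire the vm_lowmem event handlers (the ZFS ARC registers one). */
            EVENTHANDLER_INVOKE(vm_lowmem, 0);
            /* Drain cached UMA items so their KVA returns to the kmem arena. */
            uma_reclaim();
    }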
Created attachment 149753 [details]
zpool history hang kgdb bt
Attached is a kgdb bt of the "zpool history" thread. The allocation that hung is for just over 10 MB. The system (32 GB RAM) had been idle, but the daily periodic run was in progress when it hung on "kmem arena".
I've just found this entry. We're having the same issue, running stable/10@r274502. Any chance for a fix?

I suspect that there are two distinct issues at play here. I believe that Andriy has correctly identified a synchronization error in kmem_reclaim(). However, the size of the failed allocations, 10 MB, is also interesting. While vmem's allocation strategy is better at reducing external fragmentation than the simplistic "first-fit" strategy that we previously used, it is not immune to external fragmentation. Thus, the kmem arena may not have 10 MB of contiguous virtual address space even though there is plenty of free space. I suggest that you try increasing the kmem arena size to 1.25 or even 1.5 times the physical memory size and report back on the effect.

Hi, we're seeing this problem as well. Is it still a problem in CURRENT? Regarding Alan's suggestion to "try increasing the kmem arena size": is that vm.kmem_size?

$ sysctl -h hw.physmem vm.kmem_size vm.kmem_size_max
hw.physmem: 8 541 974 528
vm.kmem_size: 8 295 661 568
vm.kmem_size_max: 1 319 413 950 874
$

Does that mean setting vm.kmem_size = 1.5 * hw.physmem? (A loader.conf sketch of this appears a few comments below.)

Palle

The following commit may have a positive effect for this issue: https://svnweb.freebsd.org/base?view=revision&revision=281026

(In reply to Palle Girgensohn from comment #9) BTW, yes, vm.kmem_size. Raising that seemed to help here.

Also likely that the following commit should have helped prevent this issue: https://svnweb.freebsd.org/base?view=revision&revision=282690

I'm closing this for now; if anyone sees this on stable/10 newer than r283310, or on 10.2-RELEASE or above, feel free to reopen.

We are encountering what appears to be the same issue on a large production server running 10.2-RELEASE. The machine has 128G of RAM and vfs.zfs.arc_max is set to 100G. I have a zfs recv currently stuck waiting on kmem arena. When this problem occurs, the ARC rapidly shrinks down to vfs.zfs.arc_min (13G in our case), but the process remains blocked for a long period of time (hours) even though top shows free memory at 104G.

ps -l for the process shows:

UID   PID  PPID CPU PRI NI   VSZ  RSS MWCHAN   STAT TT     TIME COMMAND
  0 37892 37814   0  20  0 42248 3428 kmem are D     1  0:04.16 zfs recv -u -d ezdata2

while procstat -kk -p 37892 shows:

  PID    TID COMM             TDNAME           KSTACK
37892 101149 zfs              -                mi_switch+0xe1 sleepq_wait+0x3a _cv_wait+0x16d vmem_xalloc+0x568 vmem_alloc+0x3d kmem_malloc+0x33 uma_large_malloc+0x49 malloc+0x43 dmu_recv_stream+0x114 zfs_ioc_recv+0x955 zfsdev_ioctl+0x5ca devfs_ioctl_f+0x139 kern_ioctl+0x255 sys_ioctl+0x140 amd64_syscall+0x357 Xfast_syscall+0xfb

Thanks for the update Marc, reopened.
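For anyone wanting to try Alan's suggestion above: vm.kmem_size is a boot-time loader tunable, so it goes in /boot/loader.conf. A purely hypothetical entry for the 32 GB machine from the original report, using the suggested 1.5 x physmem factor, would look something like this (the value is an illustration, not a tested recommendation):

    # /boot/loader.conf (hypothetical example, not taken from this thread)
    # size the kmem arena at roughly 1.5 times physical RAM on a 32 GB box
    vm.kmem_size="48G"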
Created attachment 163702 [details]
utility to allocate memory to force defragmentation
This is a little utility that allocates memory, creating memory pressure in an attempt to force defragmentation as the ARC frees up aggressively.
It is based on James Van Artsdalen's initial utility, with fixes for compilation as well as added progress output, and the allocation is now unbounded and performed in 1 GB chunks.
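Purely as an illustration of the approach described above (allocate in 1 GB chunks, touch every page, report progress), a minimal stand-alone sketch might look like the following; it is not the attached utility.

    /*
     * Illustrative sketch only, not the attached utility.  Allocate RAM
     * in 1 GB chunks, touching every page so the allocations are actually
     * backed, and print progress until an allocation fails.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK (1024UL * 1024UL * 1024UL)    /* 1 GB per step */

    int
    main(void)
    {
            unsigned long gb = 0;
            char *p;

            for (;;) {
                    if ((p = malloc(CHUNK)) == NULL) {
                            fprintf(stderr, "malloc failed after %lu GB\n", gb);
                            break;
                    }
                    /* Touch the memory so the kernel must supply real pages. */
                    memset(p, 0xa5, CHUNK);
                    printf("allocated %lu GB\n", ++gb);
                    fflush(stdout);
            }
            /* Everything is released when the process exits. */
            return (0);
    }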
We are seeing this regularly on a machine with 34 GB RAM and a large 34 TB JBOD storage array (ZFS), with *no* SSD ZIL or L2ARC.

ssh othermachine zfs send fs | zfs receive fs

sometimes hangs. Not often, perhaps once a month or so. Always in kmem arena, just like mgoroff@quorum.net describes above. I tried Steven's program to allocate memory, but it is still stuck. This is a backup machine, so I can always just reboot, unless someone has an idea about something to try first that could perhaps explain more about what's happening?

Palle

We are also seeing this on our backup server: 64 GB RAM, 4x6TB (zpool with 2 mirrors), *no* SSD ZIL or L2ARC.

zfs send ... | ssh backupserver zfs receive

zfs receive -u ... often hangs, and I have to reboot the server to get rid of the hung process and to be able to receive datasets again. There is no problem with 'zpool history'; it runs fine.

$ uname -a
FreeBSD hodor.live.lkp.primelabs.se 10.2-RELEASE-p18 FreeBSD 10.2-RELEASE-p18 #0: Sat May 28 08:53:43 UTC 2016     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
$ freebsd-version
10.2-RELEASE-p19
$ ps -l -U backup
 UID   PID PPID CPU PRI NI   VSZ  RSS MWCHAN   STAT TT     TIME COMMAND
1006 59745    1   0  20  0 42248 3164 kmem are Ds    -  0:03.81 zfs receive -u storage/backup/mysql/leaf/data
$ procstat 59745
  PID  PPID  PGID   SID  TSID THR LOGIN    WCHAN    EMUL          COMM
59745     1 59745 59745     0   1 backup   kmem are FreeBSD ELF64 zfs
$ sudo procstat -kk 59745
Password:
  PID    TID COMM             TDNAME           KSTACK
59745 100809 zfs              -                mi_switch+0xe1 sleepq_wait+0x3a _cv_wait+0x16d vmem_xalloc+0x568 vmem_alloc+0x3d kmem_malloc+0x33 uma_large_malloc+0x49 malloc+0x43 dmu_recv_stream+0x114 zfs_ioc_recv+0x955 zfsdev_ioctl+0x5ca devfs_ioctl_f+0x139 kern_ioctl+0x255 sys_ioctl+0x140 amd64_syscall+0x357 Xfast_syscall+0xfb
$ sysctl -h hw.physmem vm.kmem_size vm.kmem_size_max
hw.physmem: 68,578,963,456
vm.kmem_size: 66,774,925,312
vm.kmem_size_max: 1,319,413,950,874

I will look into raising vm.kmem_size, but right now I have to reboot and hopefully get some backups going...

Created attachment 171704 [details]
reboot doesn't get past this
When zfs receive has hung, 'reboot' won't reboot the machine; it just hangs, and I have to reset the machine via remote management.
'reboot' works fine on this machine when there are no hung zfs receive processes.
Some additional information: I left 4 hung zfs receive processes overnight and in the morning they had completed. My shell reported two of them running for 10h 35m 26s and 10h 38m 50s. Note that this was for newly created datasets without any data on them, so shouldn't they have completed instantly?

Hi, I just saw this problem again: zfs recv hung in kmem arena. This was a very common problem for us in the past, but after upgrading to 10.3 and getting SSDs for ZIL and L2ARC, this problem has drastically diminished. Apparently it still happens occasionally, but it is rare. FreeBSD 10.3.

(In reply to Steven Hartland from comment #10) Is it possible that the referenced change does not have enough of a positive effect with respect to zfs receive pending in kmem_arena, because zfs receive is attempting to allocate SPA_MAXBLOCKSIZE (16M) while the fragmentation check is only looking for fragmentation below zfs_max_recordsize (1M)?

    /*
     * Above limits know nothing about real level of KVA fragmentation.
     * Start aggressive reclamation if too little sequential KVA left.
     */
    if (vmem_size(heap_arena, VMEM_MAXFREE) < zfs_max_recordsize)
            return (1);

(In reply to Dave Baukus from comment #21) This is possible. If you are seeing this issue, do try increasing it to SPA_MAXBLOCKSIZE to see if it helps.

A commit references this bug:

Author: smh
Date: Fri Mar 17 12:34:57 UTC 2017
New revision: 315449
URL: https://svnweb.freebsd.org/changeset/base/315449

Log:
  Reduce ARC fragmentation threshold

  As ZFS can request up to SPA_MAXBLOCKSIZE memory block e.g. during zfs recv,
  update the threshold at which we start agressive reclamation to use
  SPA_MAXBLOCKSIZE (16M) instead of the lower zfs_max_recordsize which
  defaults to 1M.

  PR:            194513
  Reviewed by:   avg, mav
  MFC after:     1 month
  Sponsored by:  Multiplay
  Differential Revision: https://reviews.freebsd.org/D10012

Changes:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c

Closing for now as I've committed the change to reduce the ARC fragmentation threshold. If this is still reproducible with that change in place, we can reopen.
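For context, the committed change presumably comes down to raising the threshold in the arc.c check quoted above, roughly as in the following sketch (an illustration, not a verbatim copy of r315449):

    /*
     * Above limits know nothing about real level of KVA fragmentation.
     * Start aggressive reclamation if too little sequential KVA left.
     * (Sketch of the post-r315449 form: compare against SPA_MAXBLOCKSIZE
     * instead of zfs_max_recordsize.)
     */
    if (vmem_size(heap_arena, VMEM_MAXFREE) < SPA_MAXBLOCKSIZE)
            return (1);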