I think the patch is not complete. It looks like the problem is that for systems that do not have UMA_MD_SMALL_ALLOC, we do uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc); but we haven't set an appropriate free function. This is probably why UMA_ZONE_NOFREE was originally there. When NOFREE was removed, it was appropriate for systems with uma_small_alloc. So by default we get page_free as our free function. That calls kmem_free, which calls vmem_free ... but we do our allocs with vmem_xalloc. I'm not positive, but I think the problem is that in effect we vmem_xalloc -> vmem_free, not vmem_xfree.

Three possible fixes:
1. The one you tested, but this is not best for systems with uma_small_alloc.
2. Pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC.
3. Actually provide an appropriate vmem_bt_free function.

I think we should just do option 2 with a comment; it's simple and it's what we used to do. I'm not sure how much benefit we would see from option 3, but it's more work.
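For illustration, option 2 might look roughly like this in vmem_startup(). This is only a sketch assuming the zone-creation code in sys/kern/subr_vmem.c, not the actual diff:

    /* Sketch only; mirrors the zone setup in vmem_startup(). */
    int flags = UMA_ZONE_VM;

    #ifndef UMA_MD_SMALL_ALLOC
    /*
     * Without uma_small_alloc, vmem_bt_alloc() gets pages via
     * vmem_xalloc(), but the default free path (page_free ->
     * kmem_free -> vmem_free) does not pair with it, so never
     * free slabs from this zone on such platforms.
     */
    flags |= UMA_ZONE_NOFREE;
    #endif
    vmem_bt_zone = uma_zcreate("vmem btag", sizeof(struct vmem_btag),
        NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, flags);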
Created attachment 216478 [details]
Ryan Libby's suggestion #2 as a way to fix this

When doing a kernel build over NFS, I could frequently get the system hung with many processes sleeping on "btalloc". Ryan Libby made the suggestions quoted above; this patch implements his option 2 (pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC) and seems to fix the problem. The problem was not reproducible on an amd64 system with memory set to 1 Gbyte.
Sorry, the comments got messed up. I guess I shouldn't have done the attachment before submitting, or something. Anyhow, here is what has gone on.

Post r358097 (it took a while to bisect to this commit), running a kernel build over NFS on an i386 system (I also got it once doing the build on UFS), I've gotten hangs with many processes sleeping on "btalloc". Ryan Libby made the suggestions quoted in the first comment above, and I implemented #2, which fixed the hangs.
^Triage: Request feedback from base r358097 committer
Just FYI, the email discussion that preceded this PR is here: http://docs.FreeBSD.org/cgi/mid.cgi?QB1PR01MB3364709D9D16CD0E550DAC7CDD6A0
I think the suggested patch is ok, but not for the reason stated.

On platforms without a direct map the problem is: to allocate btags we need a slab, and to allocate a slab we need to map a page, and to map a page we need to allocate btags. We handle this recursion using a custom slab allocator which specifies M_USE_RESERVE, allowing it to dip into a reserve of free btags. Because the returned slab can be used to keep the reserve populated, this ensures that there are always enough free btags available to handle the recursion. UMA_ZONE_NOFREE ensures that we never reclaim free slabs from the zone.

However, when it was removed, an apparent bug in UMA was exposed: keg_drain() ignores the reservation set by uma_zone_reserve() in vmem_startup(). So under memory pressure we reclaim the free btags that are needed to break the recursion. That's why adding _NOFREE back fixes the problem: it disables the reclamation.

We could perhaps fix it more cleverly, by modifying keg_drain() to always leave uk_reserve slabs available.
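For illustration, the cleverer fix could take roughly this shape. uk_reserve and uk_ipers are real fields of struct uma_keg, but keg_free_slab_count() and keg_free_one_slab() are hypothetical stand-ins for UMA's internal slab-list handling, so this is a sketch of the idea, not the patch:

    /*
     * Hypothetical sketch of a reserve-aware keg_drain(): reclaim
     * free slabs only while more remain than are needed to back the
     * item reserve established by uma_zone_reserve().
     */
    static void
    keg_drain_sketch(uma_keg_t keg)
    {
            int reserve_slabs;

            /* Convert the item reserve to a slab count, rounding up. */
            reserve_slabs = howmany(keg->uk_reserve, keg->uk_ipers);

            while (keg_free_slab_count(keg) > reserve_slabs)
                    keg_free_one_slab(keg);
    }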
Created attachment 217173 [details]
keg_drain patch

Here is a lightly-tested patch that will fix the hangs without adding _NOFREE back, assuming I did not misunderstand the problem. Rick, if you are interested in verifying that it fixes the bug, please give it a try. If you prefer to simply re-add _NOFREE on !UMA_MD_SMALL_ALLOC platforms, that is ok with me too (i.e., feel free to add me to Reviewed by:).
I tested the keg_drain patch and got a panic. The panic string appears to be just "page fault"? It happens in:

zone_reclaim(0) at zone_reclaim+0x1ee/frame 0xe596a9c

Does this mean that the "struct uma_zone *" argument was NULL? It took a couple of hours of kernel building to cause the panic.
Created attachment 217196 [details]
keg_drain patch

Fix a couple of bugs in the initial patch.
The updated keg_drain patch hangs in the same way as an unpatched kernel from head. Lots of processes stuck in btalloc.
(In reply to Rick Macklem from comment #9)

My apologies, I see another bug: I'm decrementing by uk_ppera, not uk_ipers, so the logic isn't correct and keg_drain() will free more items than it should.
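In other words, the drain accounting has to be decremented in items per slab, not pages per allocation. An illustrative one-liner, where "avail" is a hypothetical stand-in for the patch's running count of free items:

    /*
     * Hypothetical sketch: freeing one slab releases uk_ipers items
     * (items per slab); uk_ppera is pages per slab allocation and is
     * the wrong unit here.
     */
    avail -= keg->uk_ipers;     /* was: avail -= keg->uk_ppera */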
Created attachment 217268 [details]
keg_drain patch

Updated patch.
It looks like the keg_drain patch is what was committed to head as r364302? If so, it does not fix the problem. A post-r364302 kernel still hangs.
(In reply to Rick Macklem from comment #12)

No, that revision is unrelated. I'm not sure why the patch isn't working; I have to think about it some more. If you wish to commit the original patch in the meantime, please go ahead.
Well, I know nothing about the VM. However, if you think that putting back the _NOFREE is safe to do for the non-UMA_MD_SMALL_ALLOC case, I will retest the patch on the current head and then commit it with "Reviewed by: markj@".
You can add me to the reviewed-by too. _NOFREE for non-UMA_MD_SMALL_ALLOC should be safe, and no worse than before r358097. Please reference that revision in the commit log for archaeology.
A commit references this bug:

Author: rmacklem
Date: Tue Aug 25 00:58:14 UTC 2020
New revision: 364744
URL: https://svnweb.freebsd.org/changeset/base/364744

Log:
Fix hangs with processes stuck sleeping on btalloc on i386.

r358097 introduced a problem for i386, where kernel builds will intermittently get hung, typically with many processes sleeping on "btalloc". I know nothing about VM, but received assistance from rlibby@ and markj@.

rlibby@ stated the following:

  It looks like the problem is that for systems that do not have UMA_MD_SMALL_ALLOC, we do uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc); but we haven't set an appropriate free function. This is probably why UMA_ZONE_NOFREE was originally there. When NOFREE was removed, it was appropriate for systems with uma_small_alloc. So by default we get page_free as our free function. That calls kmem_free, which calls vmem_free ... but we do our allocs with vmem_xalloc. I'm not positive, but I think the problem is that in effect we vmem_xalloc -> vmem_free, not vmem_xfree.

  Three possible fixes:
  1. The one you tested, but this is not best for systems with uma_small_alloc.
  2. Pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC.
  3. Actually provide an appropriate vmem_bt_free function.

  I think we should just do option 2 with a comment, it's simple and it's what we used to do. I'm not sure how much benefit we would see from option 3, but it's more work.

This patch implements #2. I haven't done a comment, since I don't know what the problem is.

markj@ noted the following:

  I think the suggested patch is ok, but not for the reason stated. On platforms without a direct map the problem is: to allocate btags we need a slab, and to allocate a slab we need to map a page, and to map a page we need to allocate btags. We handle this recursion using a custom slab allocator which specifies M_USE_RESERVE, allowing it to dip into a reserve of free btags. Because the returned slab can be used to keep the reserve populated, this ensures that there are always enough free btags available to handle the recursion. UMA_ZONE_NOFREE ensures that we never reclaim free slabs from the zone.

  However, when it was removed, an apparent bug in UMA was exposed: keg_drain() ignores the reservation set by uma_zone_reserve() in vmem_startup(). So under memory pressure we reclaim the free btags that are needed to break the recursion. That's why adding _NOFREE back fixes the problem: it disables the reclamation.

  We could perhaps fix it more cleverly, by modifying keg_drain() to always leave uk_reserve slabs available.

markj@'s initial patch failed testing, so committing this patch was agreed upon as the interim solution. Either rlibby@ or markj@ might choose to add a comment to it.

PR: 248008
Reviewed by: rlibby, markj

Changes:
head/sys/kern/subr_vmem.c
I have committed the first attachment patch to head. I will leave this PR open for a while, in case markj@ chooses to commit a different patch.
(In reply to Mark Johnston from comment #13)

I see now that I had uploaded the wrong patch. :(

I spent some time debugging a similar hang on arm and posted patches for a couple of other bugs found by that process:

https://reviews.freebsd.org/D26769
https://reviews.freebsd.org/D26770
https://reviews.freebsd.org/D26771
https://reviews.freebsd.org/D26772

Once those have been committed, perhaps we can try reverting r364744.
(In reply to Rick Macklem from comment #17)

I've committed a number of fixes for the boundary tag allocator, mentioned in comment 18. Rick, would you be willing to try reverting r364744 and seeing if any hangs still occur?
Sure, I can test it. It will take a few days. I saw the first two patches get committed, but missed the others. I'll comment here after testing.
I did about ten test cycles with r364744 reverted, and no hangs occurred, so I think it can be reverted. (The hangs normally occurred within the first 3 cycles.)

markj@, do you want to do the revert, or would you prefer I do it?
(In reply to Rick Macklem from comment #21)

I can do the revert if you don't mind. I'll try to explain in the commit log what the recent commits did to address the hangs you were seeing.
Sounds good. Thanks for doing the revert.
A commit references this bug:

Author: markj
Date: Tue Dec 1 16:06:31 UTC 2020
New revision: 368236
URL: https://svnweb.freebsd.org/changeset/base/368236

Log:
vmem: Revert r364744

A pair of bugs are believed to have caused the hangs described in the commit log message for r364744:

1. uma_reclaim() could trigger reclamation of the reserve of boundary tags used to avoid deadlock. This was fixed by r366840.

2. The loop in vmem_xalloc() would in some cases try to allocate more boundary tags than the expected upper bound of BT_MAXALLOC. The reserve is sized based on the value BT_MAXALLOC, so this behaviour could deplete the reserve without guaranteeing a successful allocation, resulting in a hang. This was fixed by r366838.

PR: 248008
Tested by: rmacklem

Changes:
head/sys/kern/subr_vmem.c
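For reference, a simplified sketch of the invariant behind bug 2. bt_fill() and BT_MAXALLOC are real names in sys/kern/subr_vmem.c, but vmem_attempt_alloc() is an invented stand-in for the actual allocation attempt, so treat this as an illustration of the idea rather than the committed code:

    /*
     * Each allocation attempt may consume at most BT_MAXALLOC boundary
     * tags, so the arena's private tag list must be refilled to that
     * level before every attempt, not just once before the loop.
     */
    for (;;) {
            if (bt_fill(vm, flags) != 0)
                    return (ENOMEM);  /* cannot guarantee BT_MAXALLOC tags */
            if (vmem_attempt_alloc(vm, size, align, flags))
                    break;            /* stand-in: nonzero on success */
            /* Failed: wait for memory, then refill and retry. */
    }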