Bug 248008 - i386 system can hang with many processes sleeping on btalloc post base r358097
Summary: i386 system can hang with many processes sleeping on btalloc post base r358097
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: i386 Any
: --- Affects Some People
Assignee: Rick Macklem
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2020-07-16 02:30 UTC by Rick Macklem
Modified: 2020-10-13 21:02 UTC (History)
5 users (show)

See Also:
koobs: maintainer-feedback? (jeff)


Attachments
Ryan Libby's suggestion #2 way to fix this (718 bytes, patch)
2020-07-16 02:34 UTC, Rick Macklem
no flags Details | Diff
keg_drain patch (3.50 KB, patch)
2020-08-12 17:23 UTC, Mark Johnston
no flags Details | Diff
keg_drain patch (3.36 KB, patch)
2020-08-13 15:19 UTC, Mark Johnston
no flags Details | Diff
keg_drain patch (10.12 KB, patch)
2020-08-16 21:15 UTC, Mark Johnston
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Rick Macklem freebsd_committer 2020-07-16 02:30:08 UTC
I think the patch is not complete.  It looks like the problem is that
for systems that do not have UMA_MD_SMALL_ALLOC, we do
        uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc);
but we haven't set an appropriate free function.  This is probably why
UMA_ZONE_NOFREE was originally there.  When NOFREE was removed, it was
appropriate for systems with uma_small_alloc.

So by default we get page_free as our free function.  That calls
kmem_free, which calls vmem_free ... but we do our allocs with
vmem_xalloc.  I'm not positive, but I think the problem is that in
effect we vmem_xalloc -> vmem_free, not vmem_xfree.

Three possible fixes:
 1: The one you tested, but this is not best for systems with
    uma_small_alloc.
 2: Pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC.
 3: Actually provide an appropriate vmem_bt_free function.

I think we should just do option 2 with a comment, it's simple and it's
what we used to do.  I'm not sure how much benefit we would see from
option 3, but it's more work.
Comment 1 Rick Macklem freebsd_committer 2020-07-16 02:34:39 UTC
Created attachment 216478 [details]
Ryan Libby's suggestion #2 way to fix this

When doing a kernel build over NFS, I could frequently get the system
hung with many processes sleeping on btalloc.
Ryan Libby made the following suggestions:
I think the patch is not complete.  It looks like the problem is that
for systems that do not have UMA_MD_SMALL_ALLOC, we do
        uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc);
but we haven't set an appropriate free function.  This is probably why
UMA_ZONE_NOFREE was originally there.  When NOFREE was removed, it was
appropriate for systems with uma_small_alloc.

So by default we get page_free as our free function.  That calls
kmem_free, which calls vmem_free ... but we do our allocs with
vmem_xalloc.  I'm not positive, but I think the problem is that in
effect we vmem_xalloc -> vmem_free, not vmem_xfree.

Three possible fixes:
 1: The one you tested, but this is not best for systems with
    uma_small_alloc.
 2: Pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC.
 3: Actually provide an appropriate vmem_bt_free function.

I think we should just do option 2 with a comment, it's simple and it's
what we used to do.  I'm not sure how much benefit we would see from
option 3, but it's more work.

This patch implements #2 and seems to fix the problem. The problem
was not reproducible on an amd64 system with memory set to 1Gbyte.
Comment 2 Rick Macklem freebsd_committer 2020-07-16 02:43:37 UTC
Sorry, the comments got messed up. I guess I shouldn't have done the
attachment before submitting or something..

Anyhow, here is what has gone on.
Post r358097 (it took a while to bisect to this commit), running a kernel
build over NFS on an i386 system (I also got it once doing the build on
UFS), I've gotten hangs with many processes sleeping on "btalloc".

Ryan Libby made the following suggestions and I implemented #2, which
fixed the hangs.

   I think the patch is not complete.  It looks like the problem is that
   for systems that do not have UMA_MD_SMALL_ALLOC, we do
        uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc);
   but we haven't set an appropriate free function.  This is probably why
   UMA_ZONE_NOFREE was originally there.  When NOFREE was removed, it was
   appropriate for systems with uma_small_alloc.

   So by default we get page_free as our free function.  That calls
   kmem_free, which calls vmem_free ... but we do our allocs with
   vmem_xalloc.  I'm not positive, but I think the problem is that in
   effect we vmem_xalloc -> vmem_free, not vmem_xfree.

   Three possible fixes:
    1: The one you tested, but this is not best for systems with
    uma_small_alloc.
    2: Pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC.
    3: Actually provide an appropriate vmem_bt_free function.

   I think we should just do option 2 with a comment, it's simple and it's
   what we used to do.  I'm not sure how much benefit we would see from
   option 3, but it's more work.
Comment 3 Kubilay Kocak freebsd_committer freebsd_triage 2020-07-16 03:55:46 UTC
^Triage: Request feedback from base r358097 committer
Comment 4 Rick Macklem freebsd_committer 2020-07-26 22:46:41 UTC
Just fyi, the email discussion that preceeded this PR is here:
http://docs.FreeBSD.org/cgi/mid.cgi?QB1PR01MB3364709D9D16CD0E550DAC7CDD6A0
Comment 5 Mark Johnston freebsd_committer 2020-08-12 17:14:38 UTC
I think the suggested patch is ok, but not for the reason stated.  On platforms without a direct map the problem is: to allocate btags we need a slab, and to allocate a slab we need to map a page, and to map a page we need to allocate btags.

We handle this recursion using a custom slab allocator which specifies M_USE_RESERVE, allowing it to dip into a reserve of free btags.  Because the returned slab can be used to keep the reserve populated, this ensures that there are always enough free btags available to handle the recursion.

UMA_ZONE_NOFREE ensures that we never reclaim free slabs from the zone.  However, when it was removed, an apparent bug in UMA was exposed: keg_drain() ignores the reservation set by uma_zone_reserve() in vmem_startup().  So under memory pressure we reclaim the free btags that are needed to break the recursion.  That's why adding _NOFREE back fixes the problem: it disables the reclamation.

We could perhaps fix it more cleverly, by modifying keg_drain() to always leave uk_reserve slabs available.
Comment 6 Mark Johnston freebsd_committer 2020-08-12 17:23:24 UTC
Created attachment 217173 [details]
keg_drain patch

Here is a lightly-tested patch that will fix the hangs without adding _NOFREE back, assuming I did not misunderstand the problem.

Rick, if you are interested in verifying that it fixes the bug, please give it a try.  If you prefer to simply re-add _NOFREE on !UMA_MD_SMALL_ALLOC platforms, that is ok with me too.  (i.e., feel free to add me to Reviewed by:).
Comment 7 Rick Macklem freebsd_committer 2020-08-13 02:53:47 UTC
I tested the keg_drain patch and got a panic.
The panic string appears to be just "page fault"?
It happens in:
zone_reclaim(0) at zone_reclaim+0x1ee/frame 0xe596a9c
Does this mean that the "struct uma_zone *" argument was NULL?

It took a couple of hours of kernel building to cause the panic.
Comment 8 Mark Johnston freebsd_committer 2020-08-13 15:19:29 UTC
Created attachment 217196 [details]
keg_drain patch

Fix a couple of bugs in the initial patch.
Comment 9 Rick Macklem freebsd_committer 2020-08-16 20:59:02 UTC
The updated keg_drain patch hangs in the same way as an unpatched
kernel from head.
Lots of processes stuck in btalloc.
Comment 10 Mark Johnston freebsd_committer 2020-08-16 21:14:37 UTC
(In reply to Rick Macklem from comment #9)
My apologies, I see another bug: I'm decrementing by uk_ppera, not uk_ipers, so the logic isn't correct and keg_drain() will free more items than it should.
Comment 11 Mark Johnston freebsd_committer 2020-08-16 21:15:01 UTC
Created attachment 217268 [details]
keg_drain patch

Updated patch.
Comment 12 Rick Macklem freebsd_committer 2020-08-19 20:27:04 UTC
It looks like the keg_drain patch is what was committed
to head as r364302?

If so, it does not fix the problem. A post-r364302 kernel
still hangs.
Comment 13 Mark Johnston freebsd_committer 2020-08-19 20:35:45 UTC
(In reply to Rick Macklem from comment #12)
No, that revision is unrelated.

I'm not sure why the patch isn't working, I have to think about it some more.  If you wish to commit the original patch in the meantime, please go ahead.
Comment 14 Rick Macklem freebsd_committer 2020-08-19 20:46:08 UTC
Well, I know nothing about the VM. However, if you think that putting
back the _NOFREE is safe to do for the non-UMA_MD_SMALL_ALLOC case,
I will retest the patch on the current head and then commit it
with reviewed by markj@.
Comment 15 Ryan Libby freebsd_committer 2020-08-19 20:54:39 UTC
You can add me to the reviewed-by too.  _NOFREE for
non-UMA_MD_SMALL_ALLOC should be safe, and no worse than before
r358097.  Please reference that revision in the commit log for
archaeology.
Comment 16 commit-hook freebsd_committer 2020-08-25 00:58:51 UTC
A commit references this bug:

Author: rmacklem
Date: Tue Aug 25 00:58:14 UTC 2020
New revision: 364744
URL: https://svnweb.freebsd.org/changeset/base/364744

Log:
  Fix hangs with processes stuck sleeping on btalloc on i386.

  r358097 introduced a problem for i386, where kernel builds will intermittently
  get hung, typically with many processes sleeping on "btalloc".
  I know nothing about VM, but received assistance from rlibby@ and markj@.

  rlibby@ stated the following:
     It looks like the problem is that
     for systems that do not have UMA_MD_SMALL_ALLOC, we do
             uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc);
     but we haven't set an appropriate free function.  This is probably why
     UMA_ZONE_NOFREE was originally there.  When NOFREE was removed, it was
     appropriate for systems with uma_small_alloc.

     So by default we get page_free as our free function.  That calls
     kmem_free, which calls vmem_free ... but we do our allocs with
     vmem_xalloc.  I'm not positive, but I think the problem is that in
     effect we vmem_xalloc -> vmem_free, not vmem_xfree.

     Three possible fixes:
      1: The one you tested, but this is not best for systems with
         uma_small_alloc.
      2: Pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC.
      3: Actually provide an appropriate vmem_bt_free function.

     I think we should just do option 2 with a comment, it's simple and it's
     what we used to do.  I'm not sure how much benefit we would see from
     option 3, but it's more work.

  This patch implements #2. I haven't done a comment, since I don't know
  what the problem is.

  markj@ noted the following:
     I think the suggested patch is ok, but not for the reason stated.
     On platforms without a direct map the problem is:
     to allocate btags we need a slab,
     and to allocate a slab we need to map a page, and to map a page we need
     to allocate btags.

     We handle this recursion using a custom slab allocator which specifies
     M_USE_RESERVE, allowing it to dip into a reserve of free btags.
     Because the returned slab can be used to keep the reserve populated,
     this ensures that there are always enough free btags available to
     handle the recursion.

     UMA_ZONE_NOFREE ensures that we never reclaim free slabs from the zone.
     However, when it was removed, an apparent bug in UMA was exposed:
     keg_drain() ignores the reservation set by uma_zone_reserve()
     in vmem_startup().
     So under memory pressure we reclaim the free btags that are needed to
     break the recursion.
     That's why adding _NOFREE back fixes the problem: it disables the
     reclamation.

     We could perhaps fix it more cleverly, by modifying keg_drain() to always
     leave uk_reserve slabs available.

  markj@'s initial patch failed testing, so committing this patch was agreed
  upon as the interim solution.
  Either rlibby@ or markj@ might choose to add a comment to it.

  PR:		248008
  Reviewed by:	rlibby, markj

Changes:
  head/sys/kern/subr_vmem.c
Comment 17 Rick Macklem freebsd_committer 2020-08-25 01:46:37 UTC
I have committed the first attachment patch to head.
I will leave this PR open for a while, in case markj@ chooses
to commit a different patch.
Comment 18 Mark Johnston freebsd_committer 2020-10-13 21:02:09 UTC
(In reply to Mark Johnston from comment #13)
I see now that I had uploaded the wrong patch. :(

I spent some time debugging a similar hang on arm and posted patches for a couple of other bugs found by that process:

https://reviews.freebsd.org/D26769
https://reviews.freebsd.org/D26770
https://reviews.freebsd.org/D26771
https://reviews.freebsd.org/D26772

Once those have been committed perhaps we can try reverting r364744.