Summary: | ufsdirhash_build() sleeps with directory vnode locked due to M_WAITOK | ||
---|---|---|---|
Product: | Base System | Reporter: | dgmorris <dgmorris> |
Component: | kern | Assignee: | Mark Johnston <markj> |
Status: | Closed FIXED | ||
Severity: | Affects Only Me | CC: | chris, kib, markj |
Priority: | --- | ||
Version: | 12.1-RELEASE | ||
Hardware: | Any | ||
OS: | Any |
Description
dgmorris@earthlink.net
2021-03-03 18:18:46 UTC
Generally, I agree that we should not do malloc(M_WAITOK) under vnode lock. The main reason for this is that we might own vnode lock for the vnode which owns too many pageable pages and that prevents pagedaemon from making progress, or at least makes it life significantly harder. That said, in your case there are two things that are questionable: - we default to round-robin policy for zones which do not specify policy explicitly (what is isi_kern_openat, is it for isilon?) - this is not real OOM, it is due to partitioning and strange policy that allocator has to wait for pagedaemon. And still, pagedaemon must be able to make some progress, for overloaded domain. I.e. I am leery of the necessity of the proposed change. Actually, we don't default to RR. From an email with Mark Johnston (and looking over the code paths): > More to the point, I believe the zones are marked as FirstTouch because > they pass "0" as the flags argument to uma_zcreate() -- and the > keg_ctor() ends up setting the keg flags based on: > > /* > * Use a first-touch NUMA policy for kegs that pmap_extract() will > * work on. Use round-robin for everything else. > * > * Zones may override the default by specifying either. > */ > #ifdef NUMA > if ((keg->uk_flags & > (UMA_ZONE_ROUNDROBIN | UMA_ZFLAG_CACHE | UMA_ZONE_NOTPAGE)) == 0) > keg->uk_flags |= UMA_ZONE_FIRSTTOUCH; > else if ((keg->uk_flags & UMA_ZONE_FIRSTTOUCH) == 0) > keg->uk_flags |= UMA_ZONE_ROUNDROBIN; > #endif > > Then the zone flags inherit from the keg flags and there we are. Yes, FIRSTTOUCH is the default policy for regular zones. -- End email snippet -- And yes, sorry -- didn't realize I had an isi* function still in there. Yes, this is Isilon. I tried to stay general about things. Waiting on a domain specific scan in this case (again, because of First Touch) may help -- but there may not be much the scan can do. In this case, there was a separate kernel memory leak that had most of the domain tied up as "in use", obviating the ability for the pagedaemon to free pages for the domain, but one can speculate a spike in load localized within one domain that provides a similar situation for long enough for the directory lock waiters to be noticed. Right, UMA has defaulted to the FIRSTTOUCH policy since https://reviews.freebsd.org/D22831 . Indeed, the policy is kind of questionable. It is strict first-touch, i.e., always returns local memory. First-touch policies implemented using domainsets do not have this property unless explicitly requested, so if you allocate a page with e.g., DOMAINSET_PREF(0), we will return memory from other domains if necessary to avoid sleeping. I am not sure why FIRSTTOUCH has this strict implementation. Originally it was used for bucket zones, which are always NOWAIT. And until recently UMA did not have special handling for cross-domain frees. We could consider adding a new (default) policy which is not strict. It is likely necessary for ZFS to work well on NUMA systems, since its buffer cache memory is allocated from UMA. That is a larger problem though. In this case though the pressure is apparently caused by a kernel memory leak. https://reviews.freebsd.org/D29046 Untested so far. Jeff said there is no particular reason not to implement the policy this way, so I will go ahead and keep working on the patch. But I suspect that the dirhash allocation policy should be changed as well. I opened https://reviews.freebsd.org/D29045 for the proposed dirhash policy change, btw. I'm certainly cool with the NUMA policy changes as well, just putting this out there. Split the proposed diff into: https://reviews.freebsd.org/D29104 https://reviews.freebsd.org/D29105 A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=f17a5900850b4d449a91cbe9c61dfb62373c3cd1 commit f17a5900850b4d449a91cbe9c61dfb62373c3cd1 Author: Don Morris <dgmorris@earthlink.net> AuthorDate: 2021-05-20 14:54:38 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2021-05-20 15:25:45 +0000 ufs: Avoid M_WAITOK allocations when building a dirhash At this point the directory's vnode lock is held, so blocking while waiting for free pages makes the system more susceptible to deadlock in low memory conditions. This is particularly problematic on NUMA systems as UMA currently implements a strict first-touch policy. ufsdirhash_build() already uses M_NOWAIT for other allocations and already handled failures for the block array allocation, so just convert to M_NOWAIT. PR: 253992 Reviewed by: markj, mckusick, vangyzen MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29045 sys/ufs/ufs/ufs_dirhash.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=640c2ff8ae9bdbad91aa25094123bb97e5a92bbf commit 640c2ff8ae9bdbad91aa25094123bb97e5a92bbf Author: Don Morris <dgmorris@earthlink.net> AuthorDate: 2021-05-20 14:54:38 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2021-05-27 13:05:50 +0000 ufs: Avoid M_WAITOK allocations when building a dirhash At this point the directory's vnode lock is held, so blocking while waiting for free pages makes the system more susceptible to deadlock in low memory conditions. This is particularly problematic on NUMA systems as UMA currently implements a strict first-touch policy. ufsdirhash_build() already uses M_NOWAIT for other allocations and already handled failures for the block array allocation, so just convert to M_NOWAIT. PR: 253992 Reviewed by: markj, mckusick, vangyzen (cherry picked from commit f17a5900850b4d449a91cbe9c61dfb62373c3cd1) sys/ufs/ufs/ufs_dirhash.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) A commit in branch stable/12 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=8340548bc84bf1cbfa678febb08e0aa0293fd935 commit 8340548bc84bf1cbfa678febb08e0aa0293fd935 Author: Don Morris <dgmorris@earthlink.net> AuthorDate: 2021-05-20 14:54:38 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2021-05-27 13:06:15 +0000 ufs: Avoid M_WAITOK allocations when building a dirhash At this point the directory's vnode lock is held, so blocking while waiting for free pages makes the system more susceptible to deadlock in low memory conditions. This is particularly problematic on NUMA systems as UMA currently implements a strict first-touch policy. ufsdirhash_build() already uses M_NOWAIT for other allocations and already handled failures for the block array allocation, so just convert to M_NOWAIT. PR: 253992 Reviewed by: markj, mckusick, vangyzen (cherry picked from commit f17a5900850b4d449a91cbe9c61dfb62373c3cd1) sys/ufs/ufs/ufs_dirhash.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) |