On a system where a timeout panic has been added for attempts to acquire a UFS root vnode lock, such a panic was observed. The holder of the lock was attempting to do:

sched_switch() at sched_switch+0xbcc/frame 0xfffffe885e4e5e10
mi_switch() at mi_switch+0x29b/frame 0xfffffe885e4e5e60
sleepq_wait() at sleepq_wait+0x2c/frame 0xfffffe885e4e5e90
_sleep() at _sleep+0x265/frame 0xfffffe885e4e5f10
vm_wait_doms() at vm_wait_doms+0xfd/frame 0xfffffe885e4e5f40
vm_wait_domain() at vm_wait_domain+0x50/frame 0xfffffe885e4e5f70
vm_domain_alloc_fail() at vm_domain_alloc_fail+0x91/frame 0xfffffe885e4e5fb0
vm_page_alloc_domain_after() at vm_page_alloc_domain_after+0x243/frame 0xfffffe885e4e6030
uma_small_alloc() at uma_small_alloc+0x9b/frame 0xfffffe885e4e6080
keg_alloc_slab() at keg_alloc_slab+0x103/frame 0xfffffe885e4e60d0
zone_import() at zone_import+0x106/frame 0xfffffe885e4e6170
zone_alloc_item() at zone_alloc_item+0x6e/frame 0xfffffe885e4e61b0
cache_alloc_retry() at cache_alloc_retry+0x13c/frame 0xfffffe885e4e6210
uma_zalloc_arg() at uma_zalloc_arg+0x114/frame 0xfffffe885e4e6270
ufsdirhash_build() at ufsdirhash_build+0x850/frame 0xfffffe885e4e6310
ufs_lookup_ino() at ufs_lookup_ino+0x1a2/frame 0xfffffe885e4e6420
VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0x8a/frame 0xfffffe885e4e6450
vfs_cache_lookup() at vfs_cache_lookup+0xd6/frame 0xfffffe885e4e64b0
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x8a/frame 0xfffffe885e4e64e0
lookup() at lookup+0x756/frame 0xfffffe885e4e6590
namei() at namei+0x46c/frame 0xfffffe885e4e6660
_vn_open_cred() at _vn_open_cred+0x1e1/frame 0xfffffe885e4e67f0
_vn_open() at _vn_open+0x1f/frame 0xfffffe885e4e6820
isi_kern_openat() at isi_kern_openat+0x3b4/frame 0xfffffe885e4e6a10
sys_openat() at sys_openat+0x5a/frame 0xfffffe885e4e6a70
amd64_syscall() at amd64_syscall+0x381/frame 0xfffffe885e4e6bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe885e4e6bf0
--- syscall (499, FreeBSD ELF64, sys_openat), rip = 0x8005b0d3a, rsp = 0x7fffffffe368, rbp = 0x7fffffffe3e0 ---

The system is ccNUMA, with 2 domains. One domain is OOM, the other is not:

db> show pageq
pq_free 1887153
dom 0 page_cnt 8102741 free 2 pq_act 57 pq_inact 1 pq_laund 0 pq_unsw 311370
dom 1 page_cnt 8133281 free 1887151 pq_act 17391 pq_inact 4868531 pq_laund 0 pq_unsw 186545

The DIRHASH uma zone is created with a flags argument of 0, so it acquires the default NUMA policy (FIRSTTOUCH). The problem allocation is on line 437:

	dh->dh_hash = malloc(narrays * sizeof(dh->dh_hash[0]),
	    M_DIRHASH, M_NOWAIT | M_ZERO);
	if (dh->dh_hash == NULL)
		goto fail;
	dh->dh_blkfree = malloc(nblocks * sizeof(dh->dh_blkfree[0]),
	    M_DIRHASH, M_NOWAIT);
	if (dh->dh_blkfree == NULL)
		goto fail;
	for (i = 0; i < narrays; i++) {
		if ((dh->dh_hash[i] = DIRHASH_BLKALLOC_WAITOK()) == NULL)
			goto fail;
		for (j = 0; j < DH_NBLKOFF; j++)
			dh->dh_hash[i][j] = DIRHASH_EMPTY;
	}

where, from line 73:

#define	DIRHASH_BLKALLOC_WAITOK()	uma_zalloc(ufsdirhash_zone, M_WAITOK)

As can be seen, the code already assumes the allocation can fail, as if it were M_NOWAIT, but it instead uses M_WAITOK while holding the directory vnode lock per sys/ufs/ufs/ufs_lookup.c line 254:

#ifdef DEBUG_VFS_LOCKS
	/*
	 * Assert that the directory vnode is locked, and locked
	 * exclusively for the last component lookup for modifying
	 * operations.
	 *
	 * The directory-modifying operations need to save
	 * intermediate state in the inode between namei() call and
	 * actual directory manipulations.  See fields in the struct
	 * inode marked as 'used during directory lookup'.  We must
	 * ensure that upgrade in namei() does not happen, since
	 * upgrade might need to unlock vdp.  If quotas are enabled,
	 * getinoquota() also requires exclusive lock to modify inode.
	 */
	ASSERT_VOP_LOCKED(vdp, "ufs_lookup1");
	if ((nameiop == CREATE || nameiop == DELETE || nameiop == RENAME) &&
	    (flags & (LOCKPARENT | ISLASTCN)) == (LOCKPARENT | ISLASTCN))
		ASSERT_VOP_ELOCKED(vdp, "ufs_lookup2");
#endif
restart:
	bp = NULL;
	slotoffset = -1;

	/*
	 * We now have a segment name to search for, and a directory to search.
	 *
	 * Suppress search for slots unless creating
	 * file and at end of pathname, in which case
	 * we watch for a place to put the new file in
	 * case it doesn't already exist.
	 */
	ino = 0;
	i_diroff = dp->i_diroff;
	slotstatus = FOUND;
	slotfreespace = slotsize = slotneeded = 0;
	if ((nameiop == CREATE || nameiop == RENAME) && (flags & ISLASTCN)) {
		slotstatus = NONE;
		slotneeded = DIRECTSIZ(cnp->cn_namelen);
	}
#ifdef UFS_DIRHASH
	/*
	 * Use dirhash for fast operations on large directories.  The logic
	 * to determine whether to hash the directory is contained within
	 * ufsdirhash_build(); a zero return means that it decided to hash
	 * this directory and it successfully built up the hash table.
	 */
	if (ufsdirhash_build(dp) == 0) {

A NUMA system may be more likely to hit this under imbalanced load (the system as a whole may not be OOM, just the one domain), but the risk exists on any system.
I propose the following diff, as the WAITOK allocation is only used in one place:

dmorris@wkstn-dmorris2:freebsd_ufs $ git diff -p origin/main
diff --git a/sys/ufs/ufs/ufs_dirhash.c b/sys/ufs/ufs/ufs_dirhash.c
index 13094b5a1c97..72f38be5eea4 100644
--- a/sys/ufs/ufs/ufs_dirhash.c
+++ b/sys/ufs/ufs/ufs_dirhash.c
@@ -108,7 +108,7 @@ static uma_zone_t ufsdirhash_zone;
 #define	DIRHASHLIST_LOCK()		mtx_lock(&ufsdirhash_mtx)
 #define	DIRHASHLIST_UNLOCK()		mtx_unlock(&ufsdirhash_mtx)
-#define	DIRHASH_BLKALLOC_WAITOK()	uma_zalloc(ufsdirhash_zone, M_WAITOK)
+#define	DIRHASH_BLKALLOC_NOWAIT()	uma_zalloc(ufsdirhash_zone, M_NOWAIT)
 #define	DIRHASH_BLKFREE(ptr)		uma_zfree(ufsdirhash_zone, (ptr))
 #define	DIRHASH_ASSERT_LOCKED(dh)	\
     sx_assert(&(dh)->dh_lock, SA_LOCKED)
@@ -425,7 +425,7 @@ ufsdirhash_build(struct inode *ip)
 	if (dh->dh_blkfree == NULL)
 		goto fail;
 	for (i = 0; i < narrays; i++) {
-		if ((dh->dh_hash[i] = DIRHASH_BLKALLOC_WAITOK()) == NULL)
+		if ((dh->dh_hash[i] = DIRHASH_BLKALLOC_NOWAIT()) == NULL)
 			goto fail;
 		for (j = 0; j < DH_NBLKOFF; j++)
 			dh->dh_hash[i][j] = DIRHASH_EMPTY;
Generally, I agree that we should not do malloc(M_WAITOK) under the vnode lock. The main reason is that we might hold the vnode lock for a vnode that owns too many pageable pages, which prevents the pagedaemon from making progress, or at least makes its life significantly harder.

That said, in your case two things are questionable:
- we default to the round-robin policy for zones which do not specify a policy explicitly (what is isi_kern_openat, is it for Isilon?)
- this is not a real OOM; it is due to partitioning, and it is a strange policy that the allocator has to wait for the pagedaemon. And still, the pagedaemon must be able to make some progress for the overloaded domain.

I.e., I am leery of the necessity of the proposed change.
Actually, we don't default to RR. From an email with Mark Johnston (and looking over the code paths):

> More to the point, I believe the zones are marked as FirstTouch because
> they pass "0" as the flags argument to uma_zcreate() -- and
> keg_ctor() ends up setting the keg flags based on:
>
> 	/*
> 	 * Use a first-touch NUMA policy for kegs that pmap_extract() will
> 	 * work on.  Use round-robin for everything else.
> 	 *
> 	 * Zones may override the default by specifying either.
> 	 */
> #ifdef NUMA
> 	if ((keg->uk_flags &
> 	    (UMA_ZONE_ROUNDROBIN | UMA_ZFLAG_CACHE | UMA_ZONE_NOTPAGE)) == 0)
> 		keg->uk_flags |= UMA_ZONE_FIRSTTOUCH;
> 	else if ((keg->uk_flags & UMA_ZONE_FIRSTTOUCH) == 0)
> 		keg->uk_flags |= UMA_ZONE_ROUNDROBIN;
> #endif
>
> Then the zone flags inherit from the keg flags and there we are.

Yes, FIRSTTOUCH is the default policy for regular zones.

-- End email snippet --

And yes, sorry -- I didn't realize I still had an isi* function in there. Yes, this is Isilon; I tried to stay general about things.

Waiting on a domain-specific scan in this case (again, because of first-touch) may help -- but there may not be much the scan can do. Here, a separate kernel memory leak had most of the domain tied up as "in use", leaving the pagedaemon unable to free pages for the domain; but one can imagine a load spike localized within one domain producing a similar situation for long enough that the directory lock waiters are noticed.
Right, UMA has defaulted to the FIRSTTOUCH policy since https://reviews.freebsd.org/D22831 .

Indeed, the policy is kind of questionable. It is strict first-touch, i.e., it always returns local memory. First-touch policies implemented using domainsets do not have this property unless it is explicitly requested, so if you allocate a page with, e.g., DOMAINSET_PREF(0), we will return memory from other domains if necessary to avoid sleeping. I am not sure why FIRSTTOUCH has this strict implementation. Originally it was used for bucket zones, which are always NOWAIT, and until recently UMA did not have special handling for cross-domain frees.

We could consider adding a new (default) policy which is not strict. That is likely necessary for ZFS to work well on NUMA systems, since its buffer cache memory is allocated from UMA. That is a larger problem, though; in this case the pressure is apparently caused by a kernel memory leak.
https://reviews.freebsd.org/D29046 Untested so far.
Jeff said there is no particular reason not to implement the policy this way, so I will go ahead and keep working on the patch. But I suspect that the dirhash allocation policy should be changed as well.
I opened https://reviews.freebsd.org/D29045 for the proposed dirhash policy change, btw. I'm certainly cool with the NUMA policy changes as well, just putting this out there.
Split the proposed diff into: https://reviews.freebsd.org/D29104 https://reviews.freebsd.org/D29105
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=f17a5900850b4d449a91cbe9c61dfb62373c3cd1

commit f17a5900850b4d449a91cbe9c61dfb62373c3cd1
Author:     Don Morris <dgmorris@earthlink.net>
AuthorDate: 2021-05-20 14:54:38 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-05-20 15:25:45 +0000

    ufs: Avoid M_WAITOK allocations when building a dirhash

    At this point the directory's vnode lock is held, so blocking while
    waiting for free pages makes the system more susceptible to deadlock
    in low memory conditions.  This is particularly problematic on NUMA
    systems as UMA currently implements a strict first-touch policy.

    ufsdirhash_build() already uses M_NOWAIT for other allocations and
    already handled failures for the block array allocation, so just
    convert to M_NOWAIT.

    PR:		253992
    Reviewed by:	markj, mckusick, vangyzen
    MFC after:	1 week
    Differential Revision:	https://reviews.freebsd.org/D29045

 sys/ufs/ufs/ufs_dirhash.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=640c2ff8ae9bdbad91aa25094123bb97e5a92bbf

commit 640c2ff8ae9bdbad91aa25094123bb97e5a92bbf
Author:     Don Morris <dgmorris@earthlink.net>
AuthorDate: 2021-05-20 14:54:38 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-05-27 13:05:50 +0000

    ufs: Avoid M_WAITOK allocations when building a dirhash

    At this point the directory's vnode lock is held, so blocking while
    waiting for free pages makes the system more susceptible to deadlock
    in low memory conditions.  This is particularly problematic on NUMA
    systems as UMA currently implements a strict first-touch policy.

    ufsdirhash_build() already uses M_NOWAIT for other allocations and
    already handled failures for the block array allocation, so just
    convert to M_NOWAIT.

    PR:		253992
    Reviewed by:	markj, mckusick, vangyzen

    (cherry picked from commit f17a5900850b4d449a91cbe9c61dfb62373c3cd1)

 sys/ufs/ufs/ufs_dirhash.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=8340548bc84bf1cbfa678febb08e0aa0293fd935

commit 8340548bc84bf1cbfa678febb08e0aa0293fd935
Author:     Don Morris <dgmorris@earthlink.net>
AuthorDate: 2021-05-20 14:54:38 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-05-27 13:06:15 +0000

    ufs: Avoid M_WAITOK allocations when building a dirhash

    At this point the directory's vnode lock is held, so blocking while
    waiting for free pages makes the system more susceptible to deadlock
    in low memory conditions.  This is particularly problematic on NUMA
    systems as UMA currently implements a strict first-touch policy.

    ufsdirhash_build() already uses M_NOWAIT for other allocations and
    already handled failures for the block array allocation, so just
    convert to M_NOWAIT.

    PR:		253992
    Reviewed by:	markj, mckusick, vangyzen

    (cherry picked from commit f17a5900850b4d449a91cbe9c61dfb62373c3cd1)

 sys/ufs/ufs/ufs_dirhash.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)