Bug 253992 - ufsdirhash_build() sleeps with directory vnode locked due to M_WAITOK
Summary: ufsdirhash_build() sleeps with directory vnode locked due to M_WAITOK
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 12.1-RELEASE
Hardware: Any Any
: --- Affects Only Me
Assignee: Mark Johnston
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-03-03 18:18 UTC by dgmorris@earthlink.net
Modified: 2021-05-27 13:07 UTC (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description dgmorris@earthlink.net 2021-03-03 18:18:46 UTC
On a system where a timeout panic has been added for attempts to acquire a UFS root vnode lock, such a panic was observed. The holder of the lock was attempting to do:

ched_switch() at sched_switch+0xbcc/frame 0xfffffe885e4e5e10
mi_switch() at mi_switch+0x29b/frame 0xfffffe885e4e5e60
sleepq_wait() at sleepq_wait+0x2c/frame 0xfffffe885e4e5e90
_sleep() at _sleep+0x265/frame 0xfffffe885e4e5f10
vm_wait_doms() at vm_wait_doms+0xfd/frame 0xfffffe885e4e5f40
vm_wait_domain() at vm_wait_domain+0x50/frame 0xfffffe885e4e5f70
vm_domain_alloc_fail() at vm_domain_alloc_fail+0x91/frame 0xfffffe885e4e5fb0
vm_page_alloc_domain_after() at vm_page_alloc_domain_after+0x243/frame 
0xfffffe885e4e6030
uma_small_alloc() at uma_small_alloc+0x9b/frame 0xfffffe885e4e6080
keg_alloc_slab() at keg_alloc_slab+0x103/frame 0xfffffe885e4e60d0
zone_import() at zone_import+0x106/frame 0xfffffe885e4e6170
zone_alloc_item() at zone_alloc_item+0x6e/frame 0xfffffe885e4e61b0
cache_alloc_retry() at cache_alloc_retry+0x13c/frame 0xfffffe885e4e6210
uma_zalloc_arg() at uma_zalloc_arg+0x114/frame 0xfffffe885e4e6270
ufsdirhash_build() at ufsdirhash_build+0x850/frame 0xfffffe885e4e6310
ufs_lookup_ino() at ufs_lookup_ino+0x1a2/frame 0xfffffe885e4e6420
VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0x8a/frame 0xfffffe885e4e6450
vfs_cache_lookup() at vfs_cache_lookup+0xd6/frame 0xfffffe885e4e64b0
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x8a/frame 0xfffffe885e4e64e0
lookup() at lookup+0x756/frame 0xfffffe885e4e6590
namei() at namei+0x46c/frame 0xfffffe885e4e6660
_vn_open_cred() at _vn_open_cred+0x1e1/frame 0xfffffe885e4e67f0
_vn_open() at _vn_open+0x1f/frame 0xfffffe885e4e6820
isi_kern_openat() at isi_kern_openat+0x3b4/frame 0xfffffe885e4e6a10
sys_openat() at sys_openat+0x5a/frame 0xfffffe885e4e6a70
amd64_syscall() at amd64_syscall+0x381/frame 0xfffffe885e4e6bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe885e4e6bf0
— syscall (499, FreeBSD ELF64, sys_openat), rip = 0x8005b0d3a, rsp = 
0x7fffffffe368, rbp = 0x7fffffffe3e0 —

The system is ccNUMA, with 2 domains. One domain is OOM, the other is not:
db> show pageq
pq_free 1887153
dom 0 page_cnt 8102741 free 2 pq_act 57 pq_inact 1 pq_laund 0 pq_unsw 311370
dom 1 page_cnt 8133281 free 1887151 pq_act 17391 pq_inact 4868531 
pq_laund 0 pq_unsw 186545

The DIRHASH uma zone is created with flags argument of 0, so acquires the default NUMA policy (FIRSTTOUCH). The problem allocation is on line 437:

	dh->dh_hash = malloc(narrays * sizeof(dh->dh_hash[0]),
	    M_DIRHASH, M_NOWAIT | M_ZERO);
	if (dh->dh_hash == NULL)
		goto fail;
	dh->dh_blkfree = malloc(nblocks * sizeof(dh->dh_blkfree[0]),
	    M_DIRHASH, M_NOWAIT);
	if (dh->dh_blkfree == NULL)
		goto fail;
	for (i = 0; i < narrays; i++) {
		if ((dh->dh_hash[i] = DIRHASH_BLKALLOC_WAITOK()) == NULL)
			goto fail;
		for (j = 0; j < DH_NBLKOFF; j++)
			dh->dh_hash[i][j] = DIRHASH_EMPTY;
	}

Where:
#define DIRHASH_BLKALLOC_WAITOK() 	uma_zalloc(ufsdirhash_zone, M_WAITOK)
 from line 73.

As can be seen, the code is even assuming the allocation can fail as if it were NOWAIT, but is instead doing M_WAITOK while holding the directory vnode lock per sys/ufs/ufs/ufs_lookup.c line 254:
#ifdef DEBUG_VFS_LOCKS
	/*
	 * Assert that the directory vnode is locked, and locked
	 * exclusively for the last component lookup for modifying
	 * operations.
	 *
	 * The directory-modifying operations need to save
	 * intermediate state in the inode between namei() call and
	 * actual directory manipulations.  See fields in the struct
	 * inode marked as 'used during directory lookup'.  We must
	 * ensure that upgrade in namei() does not happen, since
	 * upgrade might need to unlock vdp.  If quotas are enabled,
	 * getinoquota() also requires exclusive lock to modify inode.
	 */
	ASSERT_VOP_LOCKED(vdp, "ufs_lookup1");
	if ((nameiop == CREATE || nameiop == DELETE || nameiop == RENAME) &&
	    (flags & (LOCKPARENT | ISLASTCN)) == (LOCKPARENT | ISLASTCN))
		ASSERT_VOP_ELOCKED(vdp, "ufs_lookup2");
#endif

restart:
	bp = NULL;
	slotoffset = -1;

	/*
	 * We now have a segment name to search for, and a directory to search.
	 *
	 * Suppress search for slots unless creating
	 * file and at end of pathname, in which case
	 * we watch for a place to put the new file in
	 * case it doesn't already exist.
	 */
	ino = 0;
	i_diroff = dp->i_diroff;
	slotstatus = FOUND;
	slotfreespace = slotsize = slotneeded = 0;
	if ((nameiop == CREATE || nameiop == RENAME) &&
	    (flags & ISLASTCN)) {
		slotstatus = NONE;
		slotneeded = DIRECTSIZ(cnp->cn_namelen);
	}

#ifdef UFS_DIRHASH
	/*
	 * Use dirhash for fast operations on large directories. The logic
	 * to determine whether to hash the directory is contained within
	 * ufsdirhash_build(); a zero return means that it decided to hash
	 * this directory and it successfully built up the hash table.
	 */
	if (ufsdirhash_build(dp) == 0) {

The NUMA system may be more likely to hit this in the case of imbalanced load (as the system as a whole may not be OOM, just the domain), but the risk is there on any system.

I propose the following diff as the WAITOK allocation is only used in one place:

dmorris@wkstn-dmorris2:freebsd_ufs $ git diff -p origin/main
diff --git a/sys/ufs/ufs/ufs_dirhash.c b/sys/ufs/ufs/ufs_dirhash.c
index 13094b5a1c97..72f38be5eea4 100644
--- a/sys/ufs/ufs/ufs_dirhash.c
+++ b/sys/ufs/ufs/ufs_dirhash.c
@@ -108,7 +108,7 @@ static uma_zone_t   ufsdirhash_zone;
 
 #define DIRHASHLIST_LOCK()             mtx_lock(&ufsdirhash_mtx)
 #define DIRHASHLIST_UNLOCK()           mtx_unlock(&ufsdirhash_mtx)
-#define DIRHASH_BLKALLOC_WAITOK()      uma_zalloc(ufsdirhash_zone, M_WAITOK)
+#define DIRHASH_BLKALLOC_NOWAIT()      uma_zalloc(ufsdirhash_zone, M_NOWAIT)
 #define DIRHASH_BLKFREE(ptr)           uma_zfree(ufsdirhash_zone, (ptr))
 #define        DIRHASH_ASSERT_LOCKED(dh)                                       \
     sx_assert(&(dh)->dh_lock, SA_LOCKED)
@@ -425,7 +425,7 @@ ufsdirhash_build(struct inode *ip)
        if (dh->dh_blkfree == NULL)
                goto fail;
        for (i = 0; i < narrays; i++) {
-               if ((dh->dh_hash[i] = DIRHASH_BLKALLOC_WAITOK()) == NULL)
+               if ((dh->dh_hash[i] = DIRHASH_BLKALLOC_NOWAIT()) == NULL)
                        goto fail;
                for (j = 0; j < DH_NBLKOFF; j++)
                        dh->dh_hash[i][j] = DIRHASH_EMPTY;
Comment 1 Konstantin Belousov freebsd_committer 2021-03-03 18:34:43 UTC
Generally, I agree that we should not do malloc(M_WAITOK) under vnode lock.
The main reason for this is that we might own vnode lock for the vnode which
owns too many pageable pages and that prevents pagedaemon from making progress,
or at least makes it life significantly harder.

That said, in your case there are two things that are questionable:
- we default to round-robin policy for zones which do not specify policy
  explicitly (what is isi_kern_openat, is it for isilon?)
- this is not real OOM, it is due to partitioning and strange policy that
  allocator has to wait for pagedaemon.  And still, pagedaemon must be able
  to make some progress, for overloaded domain.

I.e. I am leery of the necessity of the proposed change.
Comment 2 dgmorris@earthlink.net 2021-03-03 18:44:02 UTC
Actually, we don't default to RR.

From an email with Mark Johnston (and looking over the code paths):
> More to the point, I believe the zones are marked as FirstTouch because
> they pass "0" as the flags argument to uma_zcreate() -- and the
> keg_ctor() ends up setting the keg flags based on:
>
> 	/*
> 	 * Use a first-touch NUMA policy for kegs that pmap_extract() will
> 	 * work on.  Use round-robin for everything else.
> 	 *
> 	 * Zones may override the default by specifying either.
> 	 */
> #ifdef NUMA
> 	if ((keg->uk_flags &
> 	    (UMA_ZONE_ROUNDROBIN | UMA_ZFLAG_CACHE | UMA_ZONE_NOTPAGE)) == 0)
> 		keg->uk_flags |= UMA_ZONE_FIRSTTOUCH;
> 	else if ((keg->uk_flags & UMA_ZONE_FIRSTTOUCH) == 0)
> 		keg->uk_flags |= UMA_ZONE_ROUNDROBIN;
> #endif
>
> Then the zone flags inherit from the keg flags and there we are.

Yes, FIRSTTOUCH is the default policy for regular zones.

-- End email snippet --

And yes, sorry -- didn't realize I had an isi* function still in there. Yes, this is Isilon. I tried to stay general about things.

Waiting on a domain specific scan in this case (again, because of First Touch) may help -- but there may not be much the scan can do. In this case, there was a separate kernel memory leak that had most of the domain tied up as "in use", obviating the ability for the pagedaemon to free pages for the domain, but one can speculate a spike in load localized within one domain that provides a similar situation for long enough for the directory lock waiters to be noticed.
Comment 3 Mark Johnston freebsd_committer 2021-03-03 19:11:03 UTC
Right, UMA has defaulted to the FIRSTTOUCH policy since https://reviews.freebsd.org/D22831 .  Indeed, the policy is kind of questionable.  It is strict first-touch, i.e., always returns local memory.  First-touch policies implemented using domainsets do not have this property unless explicitly requested, so if you allocate a page with e.g., DOMAINSET_PREF(0), we will return memory from other domains if necessary to avoid sleeping.

I am not sure why FIRSTTOUCH has this strict implementation.  Originally it was used for bucket zones, which are always NOWAIT.  And until recently UMA did not have special handling for cross-domain frees.

We could consider adding a new (default) policy which is not strict.  It is likely necessary for ZFS to work well on NUMA systems, since its buffer cache memory is allocated from UMA.  That is a larger problem though.  In this case though the pressure is apparently caused by a kernel memory leak.
Comment 4 Mark Johnston freebsd_committer 2021-03-03 19:33:13 UTC
https://reviews.freebsd.org/D29046

Untested so far.
Comment 5 Mark Johnston freebsd_committer 2021-03-03 19:41:26 UTC
Jeff said there is no particular reason not to implement the policy this way, so I will go ahead and keep working on the patch.  But I suspect that the dirhash allocation policy should be changed as well.
Comment 6 dgmorris@earthlink.net 2021-03-03 19:48:56 UTC
I opened https://reviews.freebsd.org/D29045 for the proposed dirhash policy change, btw. I'm certainly cool with the NUMA policy changes as well, just putting this out there.
Comment 7 Mark Johnston freebsd_committer 2021-03-05 22:54:13 UTC
Split the proposed diff into:
https://reviews.freebsd.org/D29104
https://reviews.freebsd.org/D29105
Comment 8 commit-hook freebsd_committer 2021-05-20 15:28:18 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=f17a5900850b4d449a91cbe9c61dfb62373c3cd1

commit f17a5900850b4d449a91cbe9c61dfb62373c3cd1
Author:     Don Morris <dgmorris@earthlink.net>
AuthorDate: 2021-05-20 14:54:38 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-05-20 15:25:45 +0000

    ufs: Avoid M_WAITOK allocations when building a dirhash

    At this point the directory's vnode lock is held, so blocking while
    waiting for free pages makes the system more susceptible to deadlock in
    low memory conditions.  This is particularly problematic on NUMA systems
    as UMA currently implements a strict first-touch policy.

    ufsdirhash_build() already uses M_NOWAIT for other allocations and
    already handled failures for the block array allocation, so just convert
    to M_NOWAIT.

    PR:             253992
    Reviewed by:    markj, mckusick, vangyzen
    MFC after:      1 week
    Differential Revision:  https://reviews.freebsd.org/D29045

 sys/ufs/ufs/ufs_dirhash.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
Comment 9 commit-hook freebsd_committer 2021-05-27 13:06:50 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=640c2ff8ae9bdbad91aa25094123bb97e5a92bbf

commit 640c2ff8ae9bdbad91aa25094123bb97e5a92bbf
Author:     Don Morris <dgmorris@earthlink.net>
AuthorDate: 2021-05-20 14:54:38 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-05-27 13:05:50 +0000

    ufs: Avoid M_WAITOK allocations when building a dirhash

    At this point the directory's vnode lock is held, so blocking while
    waiting for free pages makes the system more susceptible to deadlock in
    low memory conditions.  This is particularly problematic on NUMA systems
    as UMA currently implements a strict first-touch policy.

    ufsdirhash_build() already uses M_NOWAIT for other allocations and
    already handled failures for the block array allocation, so just convert
    to M_NOWAIT.

    PR:             253992
    Reviewed by:    markj, mckusick, vangyzen

    (cherry picked from commit f17a5900850b4d449a91cbe9c61dfb62373c3cd1)

 sys/ufs/ufs/ufs_dirhash.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
Comment 10 commit-hook freebsd_committer 2021-05-27 13:06:51 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=8340548bc84bf1cbfa678febb08e0aa0293fd935

commit 8340548bc84bf1cbfa678febb08e0aa0293fd935
Author:     Don Morris <dgmorris@earthlink.net>
AuthorDate: 2021-05-20 14:54:38 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-05-27 13:06:15 +0000

    ufs: Avoid M_WAITOK allocations when building a dirhash

    At this point the directory's vnode lock is held, so blocking while
    waiting for free pages makes the system more susceptible to deadlock in
    low memory conditions.  This is particularly problematic on NUMA systems
    as UMA currently implements a strict first-touch policy.

    ufsdirhash_build() already uses M_NOWAIT for other allocations and
    already handled failures for the block array allocation, so just convert
    to M_NOWAIT.

    PR:             253992
    Reviewed by:    markj, mckusick, vangyzen

    (cherry picked from commit f17a5900850b4d449a91cbe9c61dfb62373c3cd1)

 sys/ufs/ufs/ufs_dirhash.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)