Bug 274252

Summary: sys/vm: less-than-ideal handling of memory requests that cannot be fulfilled
Product: Base System
Reporter: Kyle Evans <kevans>
Component: kern
Assignee: Jason A. Harmening <jah>
Status: Closed FIXED
Severity: Affects Only Me
CC: emaste, jah
Priority: ---
Version: Unspecified
Hardware: Any
OS: Any
Attachments:
  Avoid page waits or rescans of domains that can't satisfy an allocation request (flags: none)

Description Kyle Evans freebsd_committer freebsd_triage 2023-10-03 20:28:34 UTC
Splitting off from PR 274237, because I haven't actually created a PR for this previously, but it'd be nice to track.

With at least some ARM machines, it's possible to get stuck in a nice loop in xhci attach because the VM bits don't gracefully handle a class of requests that cannot be satisfied. In particular, consider this system:

Physical memory chunk(s):                                                                                     
0x000008010a8000 - 0x00000802313fff, 19316736 bytes (1179 pages)                                              
0x000008023d8000 - 0x0000080389bfff, 21774336 bytes (1329 pages)                                              
0x000008038b8000 - 0x00000808f97fff, 91095040 bytes (5560 pages)                                              
0x00000808fb8000 - 0x0000080ba03fff, 44351488 bytes (2707 pages)                                              
0x0000080c12c000 - 0x000009d036ffff, 7585677312 bytes (462993 pages)                                          
0x000009d4f68000 - 0x000009db93bfff, 110968832 bytes (6773 pages)                                             
0x000009db944000 - 0x000009e096ffff, 84066304 bytes (5131 pages)                                              
0x000009e0980000 - 0x000009e0a37fff, 753664 bytes (46 pages)                                                  
avail memory = 7955300352 (7586 MB)                                                                           

Note that there's absolutely no RAM in the lower 4G of the address space. There's an XHCI controller that can only do 32-bit DMA (allegedly) and it has an associated IOMMU that isn't hooked up just yet.

Right now, busdma will request some pages below 4G (IIRC, it's with kmem_alloc_contig here[0]), but that request cannot be satisfied -- there's absolutely no memory there. Instead, it ends up hanging in the VM layer trying to fulfill an allocation that isn't physically possible.

I think it'd be better to fail the request and let busdma kick back an ENOMEM. The XHCI controller will not be functional, but that's both expected and not a deal-breaker for getting the machine into a usable state.
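
A minimal sketch of the feasibility check implied above, assuming the chunk list printed in the boot log is the complete physical memory map. The names `phys_chunk`, `report_chunks`, `NCHUNKS` and `range_can_hold` are hypothetical and are not FreeBSD kernel APIs; alignment and boundary constraints are ignored. It only illustrates why a below-4G request on this machine can never succeed, no matter how long the caller waits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One physical memory chunk with inclusive bounds, as in the boot log. */
struct phys_chunk {
	uint64_t start;
	uint64_t end;
};

/* Values copied from the "Physical memory chunk(s)" listing in this report. */
static const struct phys_chunk report_chunks[] = {
	{ 0x000008010a8000ULL, 0x00000802313fffULL },
	{ 0x000008023d8000ULL, 0x0000080389bfffULL },
	{ 0x000008038b8000ULL, 0x00000808f97fffULL },
	{ 0x00000808fb8000ULL, 0x0000080ba03fffULL },
	{ 0x0000080c12c000ULL, 0x000009d036ffffULL },
	{ 0x000009d4f68000ULL, 0x000009db93bfffULL },
	{ 0x000009db944000ULL, 0x000009e096ffffULL },
	{ 0x000009e0980000ULL, 0x000009e0a37fffULL },
};
#define NCHUNKS (sizeof(report_chunks) / sizeof(report_chunks[0]))

/*
 * Hypothetical helper: true if some chunk has at least `size` contiguous
 * bytes inside the inclusive window [low, high].
 */
static bool
range_can_hold(const struct phys_chunk *chunks, size_t nchunks,
    uint64_t low, uint64_t high, uint64_t size)
{
	assert(low <= high && size > 0);
	for (size_t i = 0; i < nchunks; i++) {
		/* Clip the chunk to the requested window. */
		uint64_t lo = chunks[i].start > low ? chunks[i].start : low;
		uint64_t hi = chunks[i].end < high ? chunks[i].end : high;

		/* Compare against size - 1 to avoid overflow in hi - lo + 1. */
		if (lo <= hi && hi - lo >= size - 1)
			return (true);
	}
	return (false);
}
```

With these chunks, `range_can_hold(report_chunks, NCHUNKS, 0, 0xffffffffULL, 4096)` is false because the lowest chunk starts well above 4G, so failing the request early would lose nothing.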

[0] https://cgit.freebsd.org/src/tree/sys/arm64/arm64/busdma_bounce.c#n572
Comment 1 Jason A. Harmening freebsd_committer freebsd_triage 2023-10-06 22:19:09 UTC
Where exactly is the contig alloc attempt hanging?
Based on code inspection I might guess the vm_wait_domain() call from kmem_alloc_contig_pages(), but it'd be good to get a backtrace.
Comment 2 Kyle Evans freebsd_committer freebsd_triage 2023-10-06 23:46:42 UTC
(In reply to Jason A. Harmening from comment #1)

IIRC from the last time I debugged this, we actually get stuck just inside kmem_alloc_contig_domainset(). kmem_alloc_contig_domain() does fail, but it's an M_WAITOK allocation so vm_domainset_iter_policy() just keeps restarting the search and we never break out. There's currently no way for, e.g., kmem_alloc_contig_domain() -> kmem_alloc_contig_pages() -> vm_page_alloc_contig_domain() to differentiate between a transient failure condition and an impossible request.
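
The behavior described above can be modeled with a toy loop. This is only a sketch under stated assumptions: `toy_domain`, `toy_alloc` and `toy_waitok_loop` are hypothetical stand-ins, not the kernel functions named in this comment, and the pass cap exists purely so the demonstration terminates. The point is that a plain success/failure result looks the same for "no free pages yet" and "no RAM in that range at all", so a caller that retries on every failure never gives up.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-domain state; all fields are illustrative. */
struct toy_domain {
	bool in_range;		/* is there any RAM in the requested window? */
	int free_after_passes;	/* passes until pages become free */
};

/* Reports only success or failure, like the pre-fix call chain. */
static bool
toy_alloc(const struct toy_domain *d, int pass)
{
	if (!d->in_range)
		return (false);		/* impossible request */
	return (pass >= d->free_after_passes);	/* transient failure until then */
}

/*
 * Models a WAITOK caller that rescans all domains after every failure.
 * Returns the pass on which the allocation succeeded, or -1 if the
 * demonstration cap was hit (the real loop would have no such cap).
 */
static int
toy_waitok_loop(const struct toy_domain *doms, int ndoms, int max_passes)
{
	assert(ndoms > 0 && max_passes > 0);
	for (int pass = 0; pass < max_passes; pass++) {
		for (int i = 0; i < ndoms; i++) {
			if (toy_alloc(&doms[i], pass))
				return (pass);
		}
		/* A WAITOK caller would sleep here and then restart the scan. */
	}
	return (-1);
}
```

A domain with `in_range == false` makes the loop run until the cap regardless of how many passes are allowed, which is the hang reported here.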
Comment 3 Jason A. Harmening freebsd_committer freebsd_triage 2023-10-07 03:09:44 UTC
(In reply to Kyle Evans from comment #2)

That makes more sense actually.  The reclaim wait in kmem_alloc_contig_pages() might block unnecessarily for some time, but probably not indefinitely.

Do you mind if I take this one?  I've been wanting to get more familiar with the various bits of the VM subsystem, and this seems like as good a place to start as any.
Comment 4 Kyle Evans freebsd_committer freebsd_triage 2023-10-07 04:57:44 UTC
(In reply to Jason A. Harmening from comment #3)

> Do you mind if I take this one?  I've been wanting to get more familiar with the various bits of the VM subsystem, and this seems like as good a place to start as any.

Feel free... I'm too far in the weeds on many other projects to stop and take a look, though I'm more than happy to try patches or probe around a bit more on this other affected system that I have if it'd help.
Comment 5 Jason A. Harmening freebsd_committer freebsd_triage 2023-10-16 04:01:52 UTC
Created attachment 245664
Avoid page waits or rescans of domains that can't satisfy an allocation request

Here's a somewhat naive first take on the problem; in local testing it eliminates the hang for a kmod rigged to attempt an impossible contigmalloc.  Can you test it out on -current?
Comment 6 Jason A. Harmening freebsd_committer freebsd_triage 2023-11-19 23:35:35 UTC
ping - Kyle, will you be able to test this sometime soon-ish?
Comment 7 Kyle Evans freebsd_committer freebsd_triage 2023-11-19 23:56:11 UTC
(In reply to Jason A. Harmening from comment #6)

Sorry, missed the first e-mail... I'll get my m1 branch updated and take it for a test spin sometime this week. Thanks!
Comment 8 Kyle Evans freebsd_committer freebsd_triage 2023-11-20 06:44:09 UTC
(In reply to Jason A. Harmening from comment #6)

Yup, that's a major quality of life improvement:

snps_dwc3_fdt0: <Synopsys Designware DWC3> mem 0x382280000-0x38237ffff irq 51 on simplebus0                   
snps_dwc3_fdt0: SNPS Version: DWC3.1 (3331 3139302a 736f3035)                                                 
snps_dwc3_fdt0: enabling power domain                                                                         
snps_dwc3_fdt0: 64 bytes context size, 32-bit DMA                                                             
snps_dwc3_fdt0: Failed to init XHCI, with error 12                                                            
device_attach: snps_dwc3_fdt0 attach returned 6                                                               
simplebus0: <iommu@382f00000> mem 0x382f00000-0x382f03fff irq 52 compat apple,t8103-dart (no driver attached) 
simplebus0: <iommu@382f80000> mem 0x382f80000-0x382f83fff irq 53 compat apple,t8103-dart (no driver attached) 
snps_dwc3_fdt0: <Synopsys Designware DWC3> mem 0x502280000-0x50237ffff irq 54 on simplebus0                   
snps_dwc3_fdt0: SNPS Version: DWC3.1 (3331 3139302a 736f3035)                                                 
snps_dwc3_fdt0: enabling power domain                                                                         
snps_dwc3_fdt0: 64 bytes context size, 32-bit DMA                                                             
snps_dwc3_fdt0: Failed to init XHCI, with error 12                                                            
device_attach: snps_dwc3_fdt0 attach returned 6                                                               
<boot continues>

Thanks! The patch looks generally sane, though I only gave it a cursory read-through.
Comment 9 commit-hook freebsd_committer freebsd_triage 2023-12-24 05:40:18 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=2619c5ccfe1f7889f0241916bd17d06340142b05

commit 2619c5ccfe1f7889f0241916bd17d06340142b05
Author:     Jason A. Harmening <jah@FreeBSD.org>
AuthorDate: 2023-11-20 23:23:58 +0000
Commit:     Jason A. Harmening <jah@FreeBSD.org>
CommitDate: 2023-12-24 05:01:40 +0000

    Avoid waiting on physical allocations that can't possibly be satisfied

    - Change vm_page_reclaim_contig[_domain] to return an errno instead
      of a boolean.  0 indicates a successful reclaim, ENOMEM indicates
      lack of available memory to reclaim, with any other error (currently
      only ERANGE) indicating that reclamation is impossible for the
      specified address range.  Change all callers to only follow
      up with vm_page_wait* in the ENOMEM case.

    - Introduce vm_domainset_iter_ignore(), which marks the specified
      domain as unavailable for further use by the iterator.  Use this
      function to ignore domains that can't possibly satisfy a physical
      allocation request.  Since WAITOK allocations run the iterators
      repeatedly, this avoids the possibility of infinitely spinning
      in domain iteration if no available domain can satisfy the
      allocation request.

    PR:             274252
    Reported by:    kevans
    Tested by:      kevans
    Reviewed by:    markj
    Differential Revision: https://reviews.freebsd.org/D42706

 sys/arm/nvidia/drm2/tegra_bo.c              |  9 +++--
 sys/compat/linuxkpi/common/src/linux_page.c |  8 ++--
 sys/dev/drm2/ttm/ttm_bo.c                   |  4 +-
 sys/dev/drm2/ttm/ttm_page_alloc.c           |  9 +++--
 sys/kern/uipc_ktls.c                        |  5 ++-
 sys/kern/uipc_shm.c                         |  5 ++-
 sys/vm/vm_domainset.c                       | 32 +++++++++++++---
 sys/vm/vm_domainset.h                       |  2 +
 sys/vm/vm_kern.c                            | 24 +++++++++++-
 sys/vm/vm_page.c                            | 58 ++++++++++++++++++++++-------
 sys/vm/vm_page.h                            |  6 +--
 11 files changed, 123 insertions(+), 39 deletions(-)
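
The two changes described in the commit message can be sketched with a small user-space model. Everything here is an assumption made for illustration: `toy_domain`, `toy_reclaim` and `toy_alloc_waitok` are hypothetical names, the bitmask stands in for whatever state vm_domainset_iter_ignore() keeps, and the final ENOMEM and the EAGAIN cap are choices of this sketch rather than documented kernel behavior. Only the errno convention (0, ENOMEM, anything else such as ERANGE) is taken from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy per-domain state; all fields are illustrative. */
struct toy_domain {
	bool in_range;		/* is there any RAM in the requested window? */
	int waits_needed;	/* waits until a reclaim would succeed */
};

/* Returns 0 on success, ENOMEM if waiting may help, ERANGE if it cannot. */
static int
toy_reclaim(const struct toy_domain *d, int waits_done)
{
	if (!d->in_range)
		return (ERANGE);
	return (waits_done >= d->waits_needed ? 0 : ENOMEM);
}

/*
 * Models a WAITOK allocation: wait and rescan only after ENOMEM, and
 * permanently skip any domain that reported another error.
 */
static int
toy_alloc_waitok(const struct toy_domain *doms, int ndoms, int max_waits)
{
	unsigned int ignored = 0;	/* one bit per ignored domain */
	int waits = 0;

	assert(ndoms > 0 && ndoms <= 32);
	for (;;) {
		bool retryable = false;

		for (int i = 0; i < ndoms; i++) {
			if (ignored & (1u << i))
				continue;
			int error = toy_reclaim(&doms[i], waits);
			if (error == 0)
				return (0);
			if (error == ENOMEM)
				retryable = true;
			else
				ignored |= 1u << i;	/* never rescan this one */
		}
		if (!retryable)
			return (ENOMEM);	/* sketch's choice: fail, don't spin */
		if (waits >= max_waits)
			return (EAGAIN);	/* demonstration cap only */
		waits++;	/* stands in for sleeping until pages free up */
	}
}
```

When every domain reports ERANGE, the model returns on the first pass instead of spinning, which matches the "Failed to init XHCI, with error 12" outcome seen in comment 8 (12 is ENOMEM on FreeBSD).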