Bug 277476 - graphics/drm-515-kmod: amdgpu periodic hangs due to phys contig allocations
Summary: graphics/drm-515-kmod: amdgpu periodic hangs due to phys contig allocations
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any
OS: Any
Importance: --- Affects Only Me
Assignee: Olivier Certner
URL: https://github.com/freebsd/drm-kmod/i...
Keywords:
Depends on:
Blocks:
 
Reported: 2024-03-04 14:12 UTC by Josef 'Jeff' Sipek
Modified: 2025-03-26 16:17 UTC
CC: 15 users

See Also:
Flags: linimon: maintainer-feedback? (x11)
       olce: mfc-stable14+


Attachments
PR277476 fix (2.87 KB, patch), 2024-11-14 00:43 UTC, sigsys
dtrace profile (215.94 KB, text/plain), 2025-03-12 20:09 UTC, Ivan Rozhuk

Description Josef 'Jeff' Sipek 2024-03-04 14:12:20 UTC
Two weeks ago I replaced an ancient nvidia graphics card with an AMD RX580 card to run open source drivers. Everything works fine most of the time, but occasionally the system hangs for a few seconds (5-10, usually).  The longer the system has been up, the worse it gets.

Digging into it a bit, the cause is that userspace (it always looks like X) does an ioctl into drm, which then tries to allocate a large-ish piece of physically contiguous memory.  This explains why it gets worse as uptime increases (free physical memory fragments) and when running firefox (the most memory-hungry application I use).  I know nothing about graphics cards, the software stack supporting them, or the linux kernel API compatibility layer, but clearly it'd be beneficial if amdgpu/drm/whatever could make use of *virtually* contiguous pages, or some kind of allocation caching/reuse, to avoid repeatedly asking the vm code for physically contiguous ranges.

To arrive at the above conclusion, I did a handful of dtrace-based experiments.

While one of the "temporary hangs" was happening, the following was the most common (non-idle) profiler stack:

# dtrace -n 'profile-97{@[stack()]=count()}'
...
              kernel`vm_phys_alloc_contig+0x11d
              kernel`linux_alloc_pages+0x8f
              ttm.ko`ttm_pool_alloc+0x2cb
              ttm.ko`ttm_tt_populate+0xc5
              ttm.ko`ttm_bo_handle_move_mem+0xc3
              ttm.ko`ttm_bo_validate+0xb4
              ttm.ko`ttm_bo_init_reserved+0x199
              amdgpu.ko`amdgpu_bo_create+0x1eb
              amdgpu.ko`amdgpu_bo_create_user+0x21
              amdgpu.ko`amdgpu_gem_create_ioctl+0x1e2
              drm.ko`drm_ioctl_kernel+0xc6
              drm.ko`drm_ioctl+0x2b5
              kernel`linux_file_ioctl+0x312
              kernel`kern_ioctl+0x255
              kernel`sys_ioctl+0x123
              kernel`amd64_syscall+0x109
              kernel`0xffffffff80fe43eb

The latency of vm_phys_alloc_contig (entry to return) is bimodal, with the slow mode in the single-digit *milli*seconds during the "temporary hangs" (the quantize values below are in nanoseconds):

# dtrace -n 'fbt::vm_phys_alloc_contig:entry{self->ts=timestamp}' -n 'fbt::vm_phys_alloc_contig:return/self->ts/{this->delta=timestamp-self->ts; @=quantize(this->delta);}' -n 'tick-1sec{printa(@)}'
...
           value  ------------- Distribution ------------- count    
             256 |                                         0        
             512 |@                                        2606     
            1024 |@@@@@@@@@                                18207    
            2048 |@                                        2534     
            4096 |                                         894      
            8192 |                                         34       
           16384 |                                         78       
           32768 |                                         58       
           65536 |                                         219      
          131072 |                                         306      
          262144 |                                         310      
          524288 |                                         735      
         1048576 |                                         174      
         2097152 |@@                                       4364     
         4194304 |@@@@@@@@@@@@@@@@@@@@@@@@                 47475    
         8388608 |@                                        1546     
        16777216 |                                         2        
        33554432 |                                         0     

The number of pages requested per call (only counting calls asking for more than one page):

# dtrace -n 'fbt::vm_phys_alloc_contig:entry/arg1>1/{@=quantize(arg1)}' -n 'tick-1sec{printa(@)}'
...
           value  ------------- Distribution ------------- count    
               1 |                                         0        
               2 |@@@                                      15       
               4 |@                                        7        
               8 |@@@                                      16       
              16 |@@                                       10       
              32 |@@                                       10       
              64 |@                                        7        
             128 |@@@                                      12       
             256 |@@@                                      12       
             512 |@@@@@@@@@@@@@@                           68       
            1024 |@@@@@@@                                  32       
            2048 |                                         0      

I did a few more dtrace experiments, but they all point to the same thing: a drm/amdgpu-related ioctl wants 4MB of physically contiguous memory (the 1024-page mode above, at 4KB per page) often enough to become a headache.  4MB isn't too much given that the system has 32GB of RAM, but a physically contiguous allocation sometimes takes a while to fulfill.


The card:

vgapci0@pci0:1:0:0:     class=0x030000 rev=0xe7 hdr=0x00 vendor=0x1002 device=0x67df subvendor=0x1da2 subdevice=0xe353
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]'
    class      = display
    subclass   = VGA

$ pkg info|grep -i amd             
gpu-firmware-amd-kmod-aldebaran-20230625 Firmware modules for aldebaran AMD GPUs
gpu-firmware-amd-kmod-arcturus-20230625 Firmware modules for arcturus AMD GPUs
gpu-firmware-amd-kmod-banks-20230625 Firmware modules for banks AMD GPUs
gpu-firmware-amd-kmod-beige-goby-20230625 Firmware modules for beige_goby AMD GPUs
gpu-firmware-amd-kmod-bonaire-20230625 Firmware modules for bonaire AMD GPUs
gpu-firmware-amd-kmod-carrizo-20230625 Firmware modules for carrizo AMD GPUs
gpu-firmware-amd-kmod-cyan-skillfish2-20230625 Firmware modules for cyan_skillfish2 AMD GPUs
gpu-firmware-amd-kmod-dimgrey-cavefish-20230625 Firmware modules for dimgrey_cavefish AMD GPUs
gpu-firmware-amd-kmod-fiji-20230625 Firmware modules for fiji AMD GPUs
gpu-firmware-amd-kmod-green-sardine-20230625 Firmware modules for green_sardine AMD GPUs
gpu-firmware-amd-kmod-hainan-20230625 Firmware modules for hainan AMD GPUs
gpu-firmware-amd-kmod-hawaii-20230625 Firmware modules for hawaii AMD GPUs
gpu-firmware-amd-kmod-kabini-20230625 Firmware modules for kabini AMD GPUs
gpu-firmware-amd-kmod-kaveri-20230625 Firmware modules for kaveri AMD GPUs
gpu-firmware-amd-kmod-mullins-20230625 Firmware modules for mullins AMD GPUs
gpu-firmware-amd-kmod-navi10-20230625 Firmware modules for navi10 AMD GPUs
gpu-firmware-amd-kmod-navi12-20230625 Firmware modules for navi12 AMD GPUs
gpu-firmware-amd-kmod-navi14-20230625 Firmware modules for navi14 AMD GPUs
gpu-firmware-amd-kmod-navy-flounder-20230625 Firmware modules for navy_flounder AMD GPUs
gpu-firmware-amd-kmod-oland-20230625 Firmware modules for oland AMD GPUs
gpu-firmware-amd-kmod-picasso-20230625 Firmware modules for picasso AMD GPUs
gpu-firmware-amd-kmod-pitcairn-20230625 Firmware modules for pitcairn AMD GPUs
gpu-firmware-amd-kmod-polaris10-20230625 Firmware modules for polaris10 AMD GPUs
gpu-firmware-amd-kmod-polaris11-20230625 Firmware modules for polaris11 AMD GPUs
gpu-firmware-amd-kmod-polaris12-20230625 Firmware modules for polaris12 AMD GPUs
gpu-firmware-amd-kmod-raven-20230625 Firmware modules for raven AMD GPUs
gpu-firmware-amd-kmod-raven2-20230625 Firmware modules for raven2 AMD GPUs
gpu-firmware-amd-kmod-renoir-20230625 Firmware modules for renoir AMD GPUs
gpu-firmware-amd-kmod-si58-20230625 Firmware modules for si58 AMD GPUs
gpu-firmware-amd-kmod-sienna-cichlid-20230625 Firmware modules for sienna_cichlid AMD GPUs
gpu-firmware-amd-kmod-stoney-20230625 Firmware modules for stoney AMD GPUs
gpu-firmware-amd-kmod-tahiti-20230625 Firmware modules for tahiti AMD GPUs
gpu-firmware-amd-kmod-tonga-20230625 Firmware modules for tonga AMD GPUs
gpu-firmware-amd-kmod-topaz-20230625 Firmware modules for topaz AMD GPUs
gpu-firmware-amd-kmod-vangogh-20230625 Firmware modules for vangogh AMD GPUs
gpu-firmware-amd-kmod-vega10-20230625 Firmware modules for vega10 AMD GPUs
gpu-firmware-amd-kmod-vega12-20230625 Firmware modules for vega12 AMD GPUs
gpu-firmware-amd-kmod-vega20-20230625 Firmware modules for vega20 AMD GPUs
gpu-firmware-amd-kmod-vegam-20230625 Firmware modules for vegam AMD GPUs
gpu-firmware-amd-kmod-verde-20230625 Firmware modules for verde AMD GPUs
gpu-firmware-amd-kmod-yellow-carp-20230625 Firmware modules for yellow_carp AMD GPUs
suitesparse-amd-3.3.0          Symmetric approximate minimum degree
suitesparse-camd-3.3.0         Symmetric approximate minimum degree
suitesparse-ccolamd-3.3.0      Constrained column approximate minimum degree ordering
suitesparse-colamd-3.3.0       Column approximate minimum degree ordering algorithm
webcamd-5.17.1.2_1             Port of Linux USB webcam and DVB drivers into userspace
xf86-video-amdgpu-22.0.0_1     X.Org amdgpu display driver
$ pkg info|grep -i drm
drm-515-kmod-5.15.118_3        DRM drivers modules
drm-kmod-20220907_1            Metaport of DRM modules for the linuxkpi-based KMS components
gpu-firmware-kmod-20230210_1,1 Firmware modules for the drm-kmod drivers
libdrm-2.4.120_1,1             Direct Rendering Manager library and headers
Comment 1 Josef 'Jeff' Sipek 2024-03-04 14:28:43 UTC
While gathering all the dtrace data, I was so distracted I forgot to mention:

$ freebsd-version -kru
14.0-RELEASE-p5
14.0-RELEASE-p5
14.0-RELEASE-p5
Comment 2 Josef 'Jeff' Sipek 2024-04-06 14:22:49 UTC
I dug a bit more into this.  It looks like the drm code has provisions for
allocating memory via dma APIs.  The FreeBSD port doesn't implement those.

Specifically, looking at drm-kmod-drm_v5.15.25_5 source:

drivers/gpu/drm/amd/amdgpu/gmc_v*.c sets adev->need_swiotlb to
drm_need_swiotlb(...).  drm_need_swiotlb is implemented in
drivers/gpu/drm/drm_cache.c as a 'return false' on FreeBSD.

Later on, amdgpu_ttm_init calls ttm_device_init with the use_dma_alloc
argument equal to adev->need_swiotlb (IOW, false).
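
Condensed, the chain amounts to this (a sketch assembled from the files cited above, with argument lists trimmed; dma_bits stands in for the real argument):

	/* drivers/gpu/drm/amd/amdgpu/gmc_v*.c */
	adev->need_swiotlb = drm_need_swiotlb(dma_bits); /* always false on FreeBSD */

	/* later, in amdgpu_ttm_init() */
	ttm_device_init(..., /* use_dma_alloc = */ adev->need_swiotlb, ...);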

Much later on, ttm_pool_alloc is called to allocate a buffer.  That in turn
calls ttm_pool_alloc_page which amounts to:

	if (!use_dma_alloc)
		/* FreeBSD path: physically contiguous allocation. */
		return alloc_pages(...);

	/* The dma-alloc path is not implemented in the FreeBSD port. */
	panic("ttm_pool.c: use_dma_alloc not implemented");

So, because of the 'return false' during initialization, we always call
alloc_pages (a.k.a. linux_alloc_pages), which tries to allocate physically
contiguous memory.

As I said before, I don't know anything about the graphics stack, so it is
possible that this dma API is completely irrelevant.


Looking at ttm_pool_alloc some more, it immediately turns the physically
contiguous allocation into an array of struct page pointers (tt->pages).
So, depending on how the rest of the module uses the buffer & pages, it
may be relatively easy to switch to a virtually-contiguous allocation.
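
To illustrate that suggestion (a purely hypothetical sketch, not existing drm-kmod code), the populate path could fill the page array with order-0 pages, which never requires physical contiguity:

	/* Hypothetical per-page population of the ttm page array. */
	for (i = 0; i < tt->num_pages; i++) {
		tt->pages[i] = alloc_page(GFP_KERNEL);	/* order 0: no contiguity */
		if (tt->pages[i] == NULL)
			goto error;	/* unwind already-allocated pages (not shown) */
	}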
Comment 3 Tomasz "CeDeROM" CEDRO 2024-05-15 23:23:56 UTC
I also have an RX580. After upgrading from 13.2 to 14.0 I got frequent kernel panics.

Related reports:
* https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276985
* https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=278212

I noticed that manually setting DRI=2 for amdgpu in xorg.conf made kernel panics less frequent, but instead the system got slower and slower until it became unresponsive; if I managed to kill xorg in time, it would work for a while again, until I had to kill xorg again.

I could not work out a safe, non-accelerated fallback xorg.conf using scfb that would allow a dual-screen setup with one screen rotated. A dual-monitor setup is only possible with amdgpu loaded. It was also not possible to disable acceleration in amdgpu and keep xrandr (secondary screen rotation).

Rolled back to 13.2. DRM 5.15 / AMDGPU / LinuxKPI makes 14.0 unreliable.
Comment 4 Olivier Certner 2024-09-02 19:04:09 UTC
(In reply to Josef 'Jeff' Sipek from comment #0)

I reported this independently in the drm-kmod GitHub project as https://github.com/freebsd/drm-kmod/issues/302.  Going to also cross-link this PR there.

I'd really like to get to the bottom of this.  However, I don't expect to have time to do so before the end of the month at the earliest.

drm-61-kmod exhibits the same problem.  However, drm-510-kmod works fine for me.

(In reply to Tomasz "CeDeROM" CEDRO from comment #3)

Please see my comment 8 in bug #278212.
Comment 5 sigsys 2024-11-08 09:04:51 UTC
Yeah, so this problem was super annoying.  But thanks to the information already posted here, it seems it wasn't too hard to fix.

IIUC, the drm code (ttm_pool_alloc()) asking for contiguous pages doesn't actually need contiguous pages; it's just an opportunistic optimization.  When an allocation fails, it falls back to asking for smaller and smaller contiguous runs (eventually asking for only one page at a time).  When ttm_pool_alloc_page() asks for more than one page, it passes alloc_pages() some extra flags (__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM).
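
Roughly, the fallback loop amounts to the following (a simplified sketch of the behaviour just described, using Linux-side names; not the literal ttm_pool.c source, and the order-capping math is approximate):

	unsigned int order = MAX_ORDER - 1;
	gfp_t gfp;
	struct page *p;

	while (num_pages > 0) {
		/* Never ask for more pages than remain. */
		order = min(order, (unsigned int)fls(num_pages) - 1);
		gfp = GFP_USER;
		if (order > 0)
			/*
			 * Contiguity is opportunistic: fail fast rather
			 * than reclaim and retry.
			 */
			gfp |= __GFP_NOMEMALLOC | __GFP_NORETRY |
			    __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
		p = alloc_pages(gfp, order);
		if (p != NULL) {
			num_pages -= 1 << order;	/* got a chunk */
			continue;
		}
		if (order == 0)
			return (-ENOMEM);	/* single pages must succeed */
		order--;	/* halve the request and retry */
	}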

What's expensive is the vm_page_reclaim_contig() call in linux_alloc_pages().  That function tries too hard to find contiguous memory (which the drm code doesn't even require), and as physical memory gets more fragmented it becomes very slow.

So, a very simple fix: make linux_alloc_pages() react to one of the flags passed by the drm code, and skip the reclaim when __GFP_NORETRY is set:

diff --git a/sys/compat/linuxkpi/common/include/linux/gfp.h b/sys/compat/linuxkpi/common/include/linux/gfp.h
index 2fcc0dc05f29..58a021086c98 100644
--- a/sys/compat/linuxkpi/common/include/linux/gfp.h
+++ b/sys/compat/linuxkpi/common/include/linux/gfp.h
@@ -44,7 +44,6 @@
 #define	__GFP_NOWARN	0
 #define	__GFP_HIGHMEM	0
 #define	__GFP_ZERO	M_ZERO
-#define	__GFP_NORETRY	0
 #define	__GFP_NOMEMALLOC 0
 #define	__GFP_RECLAIM   0
 #define	__GFP_RECLAIMABLE   0
@@ -58,7 +57,8 @@
 #define	__GFP_KSWAPD_RECLAIM	0
 #define	__GFP_WAIT	M_WAITOK
 #define	__GFP_DMA32	(1U << 24) /* LinuxKPI only */
-#define	__GFP_BITS_SHIFT 25
+#define	__GFP_NORETRY	(1U << 25) /* LinuxKPI only */
+#define	__GFP_BITS_SHIFT 26
 #define	__GFP_BITS_MASK	((1 << __GFP_BITS_SHIFT) - 1)
 #define	__GFP_NOFAIL	M_WAITOK
 
diff --git a/sys/compat/linuxkpi/common/src/linux_page.c b/sys/compat/linuxkpi/common/src/linux_page.c
index 18b90b5e3d73..71a6890a3795 100644
--- a/sys/compat/linuxkpi/common/src/linux_page.c
+++ b/sys/compat/linuxkpi/common/src/linux_page.c
@@ -118,7 +118,7 @@ linux_alloc_pages(gfp_t flags, unsigned int order)
 			page = vm_page_alloc_noobj_contig(req, npages, 0, pmax,
 			    PAGE_SIZE, 0, VM_MEMATTR_DEFAULT);
 			if (page == NULL) {
-				if (flags & M_WAITOK) {
+				if ((flags & (M_WAITOK | __GFP_NORETRY)) == M_WAITOK) {
 					int err = vm_page_reclaim_contig(req,
 					    npages, 0, pmax, PAGE_SIZE, 0);
 					if (err == ENOMEM)
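
To spell out the new condition: vm_page_reclaim_contig() now runs only when the caller can sleep *and* did not opt out of retries:

	/*
	 *   M_WAITOK                  -> reclaim + retry (old behavior kept)
	 *   M_WAITOK | __GFP_NORETRY  -> fail fast; drm falls back to
	 *                                smaller allocations
	 *   !M_WAITOK                 -> fail fast (unchanged)
	 */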

Been working fine here with amdgpu for about 3 weeks.

(The drm modules need to be recompiled with the modified kernel header.)
Comment 6 Emmanuel Vadot 2024-11-10 11:00:44 UTC
Interesting find. I've never been able to reproduce this bug on my RX550. Olivier, can you test whether the code (which looks OK to me) fixes the issue for you?
Comment 7 rk 2024-11-10 19:55:42 UTC
(In reply to sigsys from comment #5)

I've been suffering from this issue on a Ryzen 9 4900H with embedded
Renoir graphics using drm-61-kmod-6.1.92_2 on stable/14-n268738-048132192698.
Playing videos using mpv easily triggered the slowdown after some time
(especially 4K videos).
I've implemented the suggested fix and now I cannot reproduce the behaviour
anymore (tested for 2 days now). Even playing multiple 4K videos in
parallel does not cause the problem.
Thanks for the fix.
Comment 8 Olivier Certner 2024-11-12 09:43:41 UTC
(In reply to sigsys from comment #5)

> IIUC the drm code (ttm_pool_alloc()) asking for contiguous pages doesn't actually need contiguous pages. It's just an opportunistic optimization.

That would be very good news (at least from the users' point of view).

I have not spent time on this issue since my last posts.  I had naively thought that the new DRM ports really needed contiguous allocations for whatever reason; I should probably have looked a bit further instead of assuming this would need some deep and highly time-consuming analysis.

(In reply to Emmanuel Vadot from comment #6)

Will test that soon and report.
Comment 9 Emmanuel Vadot 2024-11-12 12:23:19 UTC
(In reply to sigsys from comment #5)

I'm waiting for more people to test, but in the meantime could you attach a git format-patch to this bug, please? (That is, with a full commit message and correct authorship.)
Comment 10 Pierre Pronchery 2024-11-13 19:05:38 UTC
(In reply to Emmanuel Vadot from comment #6)
The patch also works well for me, no slowdowns to report after 24 hours.
Comment 11 Eirik Oeverby 2024-11-13 20:49:14 UTC
(In reply to sigsys from comment #5)
Has this patch landed already? I'm eager to test on my Threadripper with a Navi 24 [Radeon PRO W6400]; it's borderline useless after ~48h of uptime and needs frequent reboots to fix. At least it's better than before I clamped the ARC to 8GB to slow the process down.
Comment 12 sigsys 2024-11-14 00:43:11 UTC
Created attachment 255155
PR277476 fix
Comment 13 sigsys 2024-11-14 00:48:43 UTC
(In reply to Emmanuel Vadot from comment #9)
Alright, here it is.

Is it already too late to have this merged in 14.2?

I'm pretty sure this patch is safe. GFP_NORETRY isn't used in-tree at all right now, and this patch makes it do pretty much what it says: it doesn't retry. You'd hope that any code using this flag expects allocations to fail...

The problem doesn't always happen for everyone, but when it does, man, it's rough. After a week or two I was getting hangs that sometimes lasted 15 seconds. Restarting firefox would fix it for a while, but eventually the system becomes unusable.

Even if this made it into 14.2, IIUC it would take a while before the 14.X packages were compiled against the new kernel headers, but it would already be useful to have it in base so that you could get the fix by compiling drm-kmod from ports.
Comment 14 fullermd 2024-11-14 02:08:03 UTC
FWIW, I definitely ran into what sounds just like this (with several different cards, on both 515 and 61; 510 was always rock solid).  After a few days, I'd sometimes get freezes lasting a minute or more.

A workaround that seems to work for me has been switching from the amdgpu to the modesetting X driver; I still occasionally see little blips, but they resolve and don't seem to pile up the way they did on amdgpu, even after months of uptime.
Comment 15 Olivier Certner 2024-11-14 09:24:12 UTC
(In reply to sigsys from comment #13)

It seems manu@ is having a crash with the patch applied on 5.15.  So while it seems safe, we have to rule out some possible impacts in certain situations.

I'm afraid it is too late to have it merged in 14.2 anyway, so let's be sure we are not regressing anything while fixing the problem.
Comment 16 Ivan Rozhuk 2025-03-09 21:03:17 UTC
I never see this issue on an "AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx", but always on a
5950X + RX 5600 XT.
The software config is identical, FreeBSD 14/stable,
xorg + the amdgpu X driver.


To reduce freezes I use:
- picom
- cpuset to leave 8 cores free while building ports
- a script that creates a 16 GB file on tmpfs and removes it

The script frees up ~20 GB, and there are almost no freezes as long as free memory stays above 3 GB.
But after ~1 month of uptime I start seeing freezes even with free memory > 20 GB.
Even without debug tools, this looks like a memory fragmentation issue. :)


Is it possible to implement some memory defrag code in the pagedaemon?
vm_page_reclaim_contig() is also used by iommu, ktls and shm, so improving it would make FreeBSD better even for server roles.



(In reply to Josef 'Jeff' Sipek from comment #0)

Thanks for debugging!


(In reply to sigsys from comment #5)

Thanks for the patch; I will test it, but it will take at least 2 weeks to make sure the freezes go away.



(In reply to Emmanuel Vadot from comment #6)

Do you use xorg + the amdgpu X driver?
Comment 17 Olivier Certner 2025-03-10 10:33:28 UTC
(In reply to Ivan Rozhuk from comment #16)

> Thanks for the patch; I will test it, but it will take at least 2 weeks to make sure the freezes go away.

Yes, and please report about your experience.

I'll do an extra test on my side.

Unless something goes wrong, I'd like to move this forward soon, and most importantly before we start the release process for 14.3.
Comment 18 Evgenii Khramtsov 2025-03-10 13:45:22 UTC
(In reply to Olivier Certner from comment #17)

> Yes, and please report about your experience.

My 2 cents: I have used the patch for months, first with 6.1 back when it was the tip of drm-kmod, then with graphics/drm-66-kmod; the patch doesn't panic my desktop and resolves the issue for me. I have never used it with anything earlier than 6.1.
Comment 19 fullermd 2025-03-10 15:11:04 UTC
I've run with the patch (slightly massaged to fit stable/14) with amdgpu and 6.1 for a month without any hint of the freezes showing up, so it certainly feels like a fix here.
Comment 20 Ivan Rozhuk 2025-03-12 20:09:54 UTC
Created attachment 258607
dtrace profile

The patch did not help, at least in my case: xorg + the amdgpu X driver.
This is how it landed on my 14/stable: https://github.com/rozhuk-im/freebsd/commit/b739c10c50aa37e247dc95f7b93f6fe58d86016d


I have attached dtrace profile output captured while the freezes were happening.
I do not see vm_phys_alloc_contig() under ttm_pool_alloc() here; the -O2/-O3 optimization level probably "optimized" it out.

Here are a few new things that show increased latency during the freezes:
(I did not capture many freezes; in some tests only a few were collected)

              kernel`lock_delay+0x12
              amdgpu.ko`amdgpu_gem_fault+0x86
              kernel`linux_cdev_pager_populate+0x128
              kernel`vm_fault_allocate+0x185
              kernel`vm_fault+0x39c
              kernel`vm_fault_trap+0x4c
              kernel`trap_pfault+0x20a
              kernel`trap+0x4a8
              kernel`0xffffffff80a11ca8
               20
dtrace -n 'fbt::amdgpu_gem_fault:entry{self->ts=timestamp}' -n 'fbt::amdgpu_gem_fault:return/self->ts/{this->delta=timestamp-self->ts; @=quantize(this->delta);}' -n 'tick-1sec{printa(@)}'
  0  66064                       :tick-1sec 

           value  ------------- Distribution ------------- count    
             512 |                                         0        
            1024 |                                         1        
            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       1357     
            4096 |@@@                                      110      
            8192 |@@@                                      103      
           16384 |                                         4        
           32768 |                                         3        
           65536 |                                         0        
          131072 |                                         0        
          262144 |                                         0        
          524288 |                                         0        
         1048576 |                                         0        
         2097152 |                                         1        
         4194304 |                                         3        
         8388608 |                                         2        
        16777216 |                                         0        
        33554432 |                                         0        
        67108864 |                                         0        
       134217728 |                                         0        
       268435456 |                                         1        
       536870912 |                                         1        
      1073741824 |                                         1        
      2147483648 |                                         0        



              kernel`lock_delay+0x14
              kernel`malloc_large+0x2c
              kernel`lkpi_kmalloc_cb+0x44
              kernel`lkpi_kmalloc+0x27
              amdgpu.ko`dc_create_state+0x18
              amdgpu.ko`amdgpu_dm_atomic_commit_tail+0xd4
              drm.ko`commit_tail+0xa7
              kernel`linux_work_fn+0xed
              kernel`taskqueue_run_locked+0x187
              kernel`taskqueue_thread_loop+0xc2
              kernel`fork_exit+0x86
              kernel`0xffffffff80a12d0e
               88
dtrace -n 'fbt::dc_create_state:entry{self->ts=timestamp}' -n 'fbt::dc_create_state:return/self->ts/{this->delta=timestamp-self->ts; @=quantize(this->delta);}' -n 'tick-1sec{printa(@)}'
  0  66064                       :tick-1sec 

           value  ------------- Distribution ------------- count    
            4096 |                                         0        
            8192 |                                         2        
           16384 |@@@@@@@@@@@@@@@@@@@@@                    1271     
           32768 |@@@@@@@@@@@@@@@@@@                       1087     
           65536 |                                         30       
          131072 |                                         1        
          262144 |                                         0        
          524288 |                                         3        
         1048576 |                                         4        
         2097152 |                                         2        
         4194304 |                                         0        
         8388608 |                                         0        
        16777216 |                                         0        
        33554432 |                                         0        
        67108864 |                                         0        
       134217728 |                                         1        
       268435456 |                                         1        
       536870912 |                                         5        
      1073741824 |                                         4        
      2147483648 |                                         0        


              kernel`lock_delay+0x14
              kernel`free+0x9b
              amdgpu.ko`amdgpu_dm_atomic_commit_tail+0x2f9a
              drm.ko`commit_tail+0xa7
              kernel`linux_work_fn+0xed
              kernel`taskqueue_run_locked+0x187
              kernel`taskqueue_thread_loop+0xc2
              kernel`fork_exit+0x86
              kernel`0xffffffff809aaf6e
              399
dtrace -n 'fbt::amdgpu_dm_atomic_commit_tail:entry{self->ts=timestamp}' -n 'fbt::amdgpu_dm_atomic_commit_tail:return/self->ts/{this->delta=timestamp-self->ts; @=quantize(this->delta);}' -n 'tick-1sec{printa(@)}'
  0  66190                       :tick-1sec 

           value  ------------- Distribution ------------- count    
           16384 |                                         0        
           32768 |                                         4        
           65536 |                                         6        
          131072 |                                         2        
          262144 |                                         0        
          524288 |                                         6        
         1048576 |                                         15       
         2097152 |@                                        29       
         4194304 |@@@                                      106      
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       1323     
        16777216 |@                                        44       
        33554432 |                                         1        
        67108864 |                                         2        
       134217728 |                                         4        
       268435456 |                                         8        
       536870912 |                                         5        
      1073741824 |                                         5        
      2147483648 |                                         0        


              kernel`lock_delay+0x14
              kernel`zone_import+0xf2
              kernel`cache_alloc+0x309
              kernel`cache_alloc_retry+0x2c
              kernel`malloc+0x48
              ttm.ko`ttm_sg_tt_init+0x61
              amdgpu.ko`amdgpu_ttm_tt_create+0x4a
              ttm.ko`ttm_tt_create+0x4e
              ttm.ko`ttm_bo_validate+0x60
              ttm.ko`ttm_bo_init_reserved+0x194
              amdgpu.ko`amdgpu_bo_create+0x295
              amdgpu.ko`amdgpu_bo_create_user+0x21
              amdgpu.ko`amdgpu_gem_userptr_ioctl+0x82
              drm.ko`drm_ioctl_kernel+0xbc
              drm.ko`drm_ioctl+0x25e
              kernel`linux_file_ioctl+0x30f
              kernel`kern_ioctl+0x1b0
              kernel`sys_ioctl+0x117
              kernel`amd64_syscall+0xeb
              kernel`0xffffffff809aa81b
               46
dtrace -n 'fbt::amdgpu_ttm_tt_create:entry{self->ts=timestamp}' -n 'fbt::amdgpu_ttm_tt_create:return/self->ts/{this->delta=timestamp-self->ts; @=quantize(this->delta);}' -n 'tick-1sec{printa(@)}'
  0  66190                       :tick-1sec 

           value  ------------- Distribution ------------- count    
             128 |                                         0        
             256 |                                         4        
             512 |@@@@@@@@@@                               5764     
            1024 |@@@@@@@@@@@@                             6635     
            2048 |@@@@@@@@@                                5087     
            4096 |@@@@@@@@                                 4334     
            8192 |@@                                       875      
           16384 |                                         72       
           32768 |                                         9        
           65536 |                                         3        
          131072 |                                         0        
(this looks ok)


dtrace -n 'fbt::amdgpu_bo_create:entry{self->ts=timestamp}' -n 'fbt::amdgpu_bo_create:return/self->ts/{this->delta=timestamp-self->ts; @=quantize(this->delta);}' -n 'tick-1sec{printa(@)}'
  0  66190                       :tick-1sec 

           value  ------------- Distribution ------------- count    
             256 |                                         0        
             512 |                                         2        
            1024 |@@@@@@                                   2303     
            2048 |@@@@@@@@@@                               4190     
            4096 |@@@@@@@@@@@@                             4800     
            8192 |@@@@@@@@@@                               4002     
           16384 |@@                                       845      
           32768 |                                         124      
           65536 |                                         39       
          131072 |                                         20       
          262144 |                                         4        
          524288 |                                         9        
         1048576 |                                         2        
         2097152 |                                         3        
         4194304 |                                         8        
         8388608 |                                         5        
        16777216 |                                         0        
        33554432 |                                         1        
        67108864 |                                         2        
       134217728 |                                         1        
       268435456 |                                         0        
       536870912 |                                         2        
      1073741824 |                                         0        
      2147483648 |                                         1        
      4294967296 |                                         0        

dtrace -n 'fbt::add_hole:entry{self->ts=timestamp}' -n 'fbt::add_hole:return/self->ts/{this->delta=timestamp-self->ts; @=quantize(this->delta);}' -n 'tick-1sec{printa(@)}'
  0  66190                       :tick-1sec 

           value  ------------- Distribution ------------- count    
             128 |                                         0        
             256 |@@@@@@@@@@@@                             5762     
             512 |@@@@@@@@@@@@@                            6548     
            1024 |@@@@@@@                                  3287     
            2048 |@@@@@                                    2648     
            4096 |@@@                                      1508     
            8192 |                                         105      
           16384 |                                         11       
           32768 |                                         4        
           65536 |                                         1        
          131072 |                                         0        
(this looks ok)

dtrace -n 'fbt::ttm_pool_alloc:entry{self->ts=timestamp}' -n 'fbt::ttm_pool_alloc:return/self->ts/{this->delta=timestamp-self->ts; @=quantize(this->delta);}' -n 'tick-1sec{printa(@)}'
  0  66190                       :tick-1sec 

           value  ------------- Distribution ------------- count    
             128 |                                         0        
             256 |@@                                       29       
             512 |@@@@@@@@                                 96       
            1024 |@@@@@@                                   81       
            2048 |@@@@@                                    67       
            4096 |@@@@@@@@                                 106      
            8192 |@@@                                      33       
           16384 |                                         5        
           32768 |                                         1        
           65536 |@                                        17       
          131072 |@                                        12       
          262144 |                                         3        
          524288 |                                         6        
         1048576 |@                                        10       
         2097152 |@                                        13       
         4194304 |                                         2        
         8388608 |                                         2        
        16777216 |                                         3        
        33554432 |                                         5        
        67108864 |                                         3        
       134217728 |                                         4        
       268435456 |@                                        7        
       536870912 |                                         5        
      1073741824 |                                         1        
      2147483648 |                                         1        
      4294967296 |                                         0        


If someone has ideas, I can play more with dtrace and test other patches/settings.
Comment 21 Olivier Certner 2025-03-12 20:53:37 UTC
(In reply to Ivan Rozhuk from comment #20)

Very important: once the patch has been applied, you have to rebuild both your kernel *and* the drm-kmod modules with the patched 'gfp.h' header.

Given previous analysis, it seems unlikely at this stage that the patch wouldn't fix what you're observing, but let's see.
Comment 22 Olivier Certner 2025-03-12 20:59:26 UTC
(In reply to Evgenii Khramtsov from comment #18)
(In reply to fullermd from comment #19)

I hear you.  I have been running stable/14 with drm-61-kmod for months, and it works like a charm here.

I have no doubt that this fixes the main problem; the only pending thing was to be sure that the change causes no new crashes, as manu@ hinted on -CURRENT.  I still have to try with a recent -CURRENT (this is the "extra test" I mentioned above).
Comment 23 Ivan Rozhuk 2025-03-12 21:02:43 UTC
(In reply to Olivier Certner from comment #21)

It is done automatically by my build scripts.
I use PORTS_MODULES+= to make sure that kernel modules from ports are automatically rebuilt and installed with the system.


# ls /boot/kernel
...
-r--r--r--   1 root wheel   29K Mar 12 17:26:39 2025 dtaudit.ko
-r--r--r--   1 root wheel   19K Mar 12 17:26:39 2025 dtmalloc.ko
-r--r--r--   1 root wheel   30K Mar 12 17:26:39 2025 dtnfscl.ko
-r--r--r--   1 root wheel   27K Mar 12 17:26:39 2025 dtrace_test.ko
-r--r--r--   1 root wheel  374K Mar 12 17:26:39 2025 dtrace.ko
-r--r--r--   1 root wheel   16K Mar 12 17:26:39 2025 dtraceall.ko
...
-r--r--r--   1 root wheel   15M Mar 12 17:26:32 2025 kernel
-r--r--r--   1 root wheel   44K Mar 12 17:26:40 2025 kinst.ko
-r--r--r--   1 root wheel  206K Mar 12 17:26:39 2025 krpc.ko
-r--r--r--   1 root wheel   30K Mar 12 17:26:39 2025 ksyms.ko
-r--r--r--   1 root wheel   21K Mar 12 17:26:39 2025 libmchain.ko
-r--r--r--   1 root wheel   59K Mar 12 17:26:39 2025 lindebugfs.ko
-rw-r--r--   1 root wheel  125K Mar 12 17:26:41 2025 linker.hints
-r--r--r--   1 root wheel  165K Mar 12 17:26:39 2025 linux_common.ko
-r--r--r--   1 root wheel  449K Mar 12 17:26:39 2025 linux.ko
-r--r--r--   1 root wheel  414K Mar 12 17:26:39 2025 linux64.ko
-r--r--r--   1 root wheel   46K Mar 12 17:26:39 2025 linuxkpi_hdmi.ko
-r--r--r--   1 root wheel   57K Mar 12 17:26:39 2025 linuxkpi_video.ko
-r--r--r--   1 root wheel  335K Mar 12 17:26:39 2025 linuxkpi.ko
...


# ls /boot/modules/
...
-r--r--r--   1 root wheel  369K Mar 12 17:26:50 2025 amdgpu_raven_vcn_bin.ko
-r--r--r--   1 root wheel   10M Mar 12 17:26:44 2025 amdgpu.ko
...
-r--r--r--   1 root wheel  2.0M Mar 12 17:26:44 2025 radeonkms.ko
-r--r--r--   1 root wheel  100K Mar 12 17:26:44 2025 ttm.ko
Comment 24 Evgenii Khramtsov 2025-03-12 21:15:08 UTC
(In reply to Olivier Certner from comment #22)

> I still have to try with a recent -CURRENT (this is the "extra test" I mentioned above).

See bug 282605 to avoid unrelated crashes on main.

FWIW, mine "for months" means that I've been running main not older than a week with this patch, during all the mentioned time, e.g. my main now is as of base 717adecbbb52.
Comment 25 commit-hook 2025-03-25 09:19:59 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=718d1928f8748fe4429c011296f94f194d63c695

commit 718d1928f8748fe4429c011296f94f194d63c695
Author:     Mathieu <sigsys@gmail.com>
AuthorDate: 2024-11-14 00:24:02 +0000
Commit:     Olivier Certner <olce@FreeBSD.org>
CommitDate: 2025-03-25 08:41:44 +0000

    LinuxKPI: make linux_alloc_pages() honor __GFP_NORETRY

    This is to fix slowdowns with drm-kmod that get worse over time as
    physical memory becomes more fragmented (and probably also depending on
    other factors).

    Based on information posted in this bug report:
    https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277476

    By default, linux_alloc_pages() retries failed allocations by calling
    vm_page_reclaim_contig() to attempt to free contiguous physical memory
    pages. vm_page_reclaim_contig() does not always succeed and calling it
    can be very slow even when it fails. When physical memory is very
    fragmented, vm_page_reclaim_contig() can end up being called (and
    failing) after every allocation attempt. This could cause very
    noticeable graphical desktop hangs (which could last seconds).

    The drm-kmod code in question attempts to allocate multiple contiguous
    pages at once but does not actually require them to be contiguous. It
    can fallback to doing multiple smaller allocations when larger
    allocations fail. It passes alloc_pages() the __GFP_NORETRY flag in this
    case.

    This patch makes linux_alloc_pages() fail early (without retrying) when
    this flag is passed.

    [olce: The problem this patch fixes is longer and longer GUI freezes as
    a machine's memory gets filled and becomes fragmented, when using amdgpu
    from DRM kmod 5.15 and DRM kmod 6.1 (DRM kmod 5.10 is unaffected; newer
    Linux kernel introduced an "optimization" by which a pool of pages is
    filled preferentially with contiguous pages, which triggered the problem
    for us).  The original commit message above evokes freezes lasting
    seconds, but I occasionally witnessed some lasting tens of minutes,
    rendering a machine completely useless.

    The patch has been reviewed for its potential impacts to other LinuxKPI
    parts and our existing DRM kmods' code.  In particular, there is no
    other user of __GFP_NORETRY/GFP_NORETRY with Linux's alloc_pages*()
    functions in our tree or DRM kmod ports.

    It has also been tested extensively, by me for months against 14-STABLE
    and sporadically on -CURRENT on a RX580, and by several others as
    reported below and as is visible in more details in the quoted bugzilla
    PR and in the initial drm-kmod issue at
    https://github.com/freebsd/drm-kmod/issues/302, on a variety of other
    AMD GPUs (several RX580, RX570, Radeon Pro WX5100, Green Sardine 5600G,
    Ryzen 9 4900H with embedded Renoir).]

    PR:             277476
    Reported by:    Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
    Reviewed by:    olce
    Tested by:      many (olce, Pierre Pronchery, Evgenii Khramtsov, chaplina, rk)
    MFC after:      2 weeks
    Relnotes:       yes
    Sponsored by:   The FreeBSD Foundation (review and part of testing)

 sys/compat/linuxkpi/common/include/linux/gfp.h | 4 ++--
 sys/compat/linuxkpi/common/src/linux_page.c    | 3 ++-
 2 files changed, 4 insertions(+), 3 deletions(-)