Two weeks ago I replaced an ancient nvidia graphics card with an AMD RX580 card to run open source drivers. Everything works fine most of the time, but occasionally the system hangs for a few seconds (5-10, usually). The longer the system has been up, the worse it gets.

Digging into it a bit, it is because userspace (it always looks like X) does an ioctl into drm which then tries to allocate a large-ish piece of physically contiguous memory. This explains why it gets worse as uptime increases (free physical memory becomes more fragmented) and when running firefox (the most memory-hungry application I use).

I know nothing about graphics cards, the software stack supporting them, or the linux kernel API compatibility layer, but clearly it'd be beneficial if amdgpu/drm/whatever could make use of *virtually* contiguous pages or some kind of allocation caching/reuse to avoid repeatedly asking the vm code for physically contiguous ranges.

The conclusions above are based on a handful of dtrace experiments.

While one of the "temporary hangs" was happening, the following was the most common (non-idle) profiler stack:

# dtrace -n 'profile-97{@[stack()]=count()}'
...
              kernel`vm_phys_alloc_contig+0x11d
              kernel`linux_alloc_pages+0x8f
              ttm.ko`ttm_pool_alloc+0x2cb
              ttm.ko`ttm_tt_populate+0xc5
              ttm.ko`ttm_bo_handle_move_mem+0xc3
              ttm.ko`ttm_bo_validate+0xb4
              ttm.ko`ttm_bo_init_reserved+0x199
              amdgpu.ko`amdgpu_bo_create+0x1eb
              amdgpu.ko`amdgpu_bo_create_user+0x21
              amdgpu.ko`amdgpu_gem_create_ioctl+0x1e2
              drm.ko`drm_ioctl_kernel+0xc6
              drm.ko`drm_ioctl+0x2b5
              kernel`linux_file_ioctl+0x312
              kernel`kern_ioctl+0x255
              kernel`sys_ioctl+0x123
              kernel`amd64_syscall+0x109
              kernel`0xffffffff80fe43eb

The latency of vm_phys_alloc_contig (entry to return) is bimodal, with latencies in the single-digit *milli*seconds during the "temporary hangs":

# dtrace -n 'fbt::vm_phys_alloc_contig:entry{self->ts=timestamp}' \
         -n 'fbt::vm_phys_alloc_contig:return/self->ts/{this->delta=timestamp-self->ts; @=quantize(this->delta);}' \
         -n 'tick-1sec{printa(@)}'
...
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@                                        2606
            1024 |@@@@@@@@@                                18207
            2048 |@                                        2534
            4096 |                                         894
            8192 |                                         34
           16384 |                                         78
           32768 |                                         58
           65536 |                                         219
          131072 |                                         306
          262144 |                                         310
          524288 |                                         735
         1048576 |                                         174
         2097152 |@@                                       4364
         4194304 |@@@@@@@@@@@@@@@@@@@@@@@@                 47475
         8388608 |@                                        1546
        16777216 |                                         2
        33554432 |                                         0

The number of pages being allocated:

# dtrace -n 'fbt::vm_phys_alloc_contig:entry/arg1>1/{@=quantize(arg1)}' -n 'tick-1sec{printa(@)}'
...
           value  ------------- Distribution ------------- count
               1 |                                         0
               2 |@@@                                      15
               4 |@                                        7
               8 |@@@                                      16
              16 |@@                                       10
              32 |@@                                       10
              64 |@                                        7
             128 |@@@                                      12
             256 |@@@                                      12
             512 |@@@@@@@@@@@@@@                           68
            1024 |@@@@@@@                                  32
            2048 |                                         0

I did a few more dtrace experiments, but they all point to the same thing: a drm/amdgpu related ioctl wants 4MB of physically contiguous memory often enough to become a headache (4MB is 1024 4kB pages, which matches the largest buckets in the histogram above). 4MB isn't too much given that the system has 32GB of RAM, but a physically contiguous allocation sometimes takes a while to fulfill.

The card:

vgapci0@pci0:1:0:0:  class=0x030000 rev=0xe7 hdr=0x00 vendor=0x1002 device=0x67df subvendor=0x1da2 subdevice=0xe353
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]'
    class      = display
    subclass   = VGA

$ pkg info|grep -i amd
gpu-firmware-amd-kmod-aldebaran-20230625 Firmware modules for aldebaran AMD GPUs
gpu-firmware-amd-kmod-arcturus-20230625 Firmware modules for arcturus AMD GPUs
gpu-firmware-amd-kmod-banks-20230625 Firmware modules for banks AMD GPUs
gpu-firmware-amd-kmod-beige-goby-20230625 Firmware modules for beige_goby AMD GPUs
gpu-firmware-amd-kmod-bonaire-20230625 Firmware modules for bonaire AMD GPUs
gpu-firmware-amd-kmod-carrizo-20230625 Firmware modules for carrizo AMD GPUs
gpu-firmware-amd-kmod-cyan-skillfish2-20230625 Firmware modules for cyan_skillfish2 AMD GPUs
gpu-firmware-amd-kmod-dimgrey-cavefish-20230625 Firmware modules for dimgrey_cavefish AMD GPUs
gpu-firmware-amd-kmod-fiji-20230625 Firmware modules for fiji AMD GPUs
gpu-firmware-amd-kmod-green-sardine-20230625 Firmware modules for green_sardine AMD GPUs
gpu-firmware-amd-kmod-hainan-20230625 Firmware modules for hainan AMD GPUs
gpu-firmware-amd-kmod-hawaii-20230625 Firmware modules for hawaii AMD GPUs
gpu-firmware-amd-kmod-kabini-20230625 Firmware modules for kabini AMD GPUs
gpu-firmware-amd-kmod-kaveri-20230625 Firmware modules for kaveri AMD GPUs
gpu-firmware-amd-kmod-mullins-20230625 Firmware modules for mullins AMD GPUs
gpu-firmware-amd-kmod-navi10-20230625 Firmware modules for navi10 AMD GPUs
gpu-firmware-amd-kmod-navi12-20230625 Firmware modules for navi12 AMD GPUs
gpu-firmware-amd-kmod-navi14-20230625 Firmware modules for navi14 AMD GPUs
gpu-firmware-amd-kmod-navy-flounder-20230625 Firmware modules for navy_flounder AMD GPUs
gpu-firmware-amd-kmod-oland-20230625 Firmware modules for oland AMD GPUs
gpu-firmware-amd-kmod-picasso-20230625 Firmware modules for picasso AMD GPUs
gpu-firmware-amd-kmod-pitcairn-20230625 Firmware modules for pitcairn AMD GPUs
gpu-firmware-amd-kmod-polaris10-20230625 Firmware modules for polaris10 AMD GPUs
gpu-firmware-amd-kmod-polaris11-20230625 Firmware modules for polaris11 AMD GPUs
gpu-firmware-amd-kmod-polaris12-20230625 Firmware modules for polaris12 AMD GPUs
gpu-firmware-amd-kmod-raven-20230625 Firmware modules for raven AMD GPUs
gpu-firmware-amd-kmod-raven2-20230625 Firmware modules for raven2 AMD GPUs
gpu-firmware-amd-kmod-renoir-20230625 Firmware modules for renoir AMD GPUs
gpu-firmware-amd-kmod-si58-20230625 Firmware modules for si58 AMD GPUs
gpu-firmware-amd-kmod-sienna-cichlid-20230625 Firmware modules for sienna_cichlid AMD GPUs
gpu-firmware-amd-kmod-stoney-20230625 Firmware modules for stoney AMD GPUs
gpu-firmware-amd-kmod-tahiti-20230625 Firmware modules for tahiti AMD GPUs
gpu-firmware-amd-kmod-tonga-20230625 Firmware modules for tonga AMD GPUs
gpu-firmware-amd-kmod-topaz-20230625 Firmware modules for topaz AMD GPUs
gpu-firmware-amd-kmod-vangogh-20230625 Firmware modules for vangogh AMD GPUs
gpu-firmware-amd-kmod-vega10-20230625 Firmware modules for vega10 AMD GPUs
gpu-firmware-amd-kmod-vega12-20230625 Firmware modules for vega12 AMD GPUs
gpu-firmware-amd-kmod-vega20-20230625 Firmware modules for vega20 AMD GPUs
gpu-firmware-amd-kmod-vegam-20230625 Firmware modules for vegam AMD GPUs
gpu-firmware-amd-kmod-verde-20230625 Firmware modules for verde AMD GPUs
gpu-firmware-amd-kmod-yellow-carp-20230625 Firmware modules for yellow_carp AMD GPUs
suitesparse-amd-3.3.0 Symmetric approximate minimum degree
suitesparse-camd-3.3.0 Symmetric approximate minimum degree
suitesparse-ccolamd-3.3.0 Constrained column approximate minimum degree ordering
suitesparse-colamd-3.3.0 Column approximate minimum degree ordering algorithm
webcamd-5.17.1.2_1 Port of Linux USB webcam and DVB drivers into userspace
xf86-video-amdgpu-22.0.0_1 X.Org amdgpu display driver

$ pkg info|grep -i drm
drm-515-kmod-5.15.118_3 DRM drivers modules
drm-kmod-20220907_1 Metaport of DRM modules for the linuxkpi-based KMS components
gpu-firmware-kmod-20230210_1,1 Firmware modules for the drm-kmod drivers
libdrm-2.4.120_1,1 Direct Rendering Manager library and headers
While gathering all the dtrace data, I was so distracted I forgot to mention:

$ freebsd-version -kru
14.0-RELEASE-p5
14.0-RELEASE-p5
14.0-RELEASE-p5
I dug a bit more into this. It looks like the drm code has provisions for allocating memory via the dma APIs, but the FreeBSD port doesn't implement those. Specifically, looking at the drm-kmod-drm_v5.15.25_5 source:

drivers/gpu/drm/amd/amdgpu/gmc_v*.c sets adev->need_swiotlb to drm_need_swiotlb(...). drm_need_swiotlb is implemented in drivers/gpu/drm/drm_cache.c as a 'return false' on FreeBSD. Later on, amdgpu_ttm_init calls ttm_device_init with the use_dma_alloc argument equal to adev->need_swiotlb (IOW, false). Much later on, ttm_pool_alloc is called to allocate a buffer. That in turn calls ttm_pool_alloc_page, which amounts to:

    if (!use_dma_alloc)
        return alloc_pages(...);
    panic("ttm_pool.c: use_dma_alloc not implemented");

So, because of the 'return false' during initialization, we always call alloc_pages (aka linux_alloc_pages), which tries to allocate physically contiguous memory. As I said before, I don't know anything about the graphics stack, so it is possible that this dma API is completely irrelevant here.

Looking at ttm_pool_alloc some more, it immediately turns the physically contiguous allocation into an array of struct page pointers (tt->pages). So, depending on how the rest of the module uses the buffer & pages, it may be relatively easy to switch to a virtually contiguous allocation.
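For anyone who wants the whole chain in one place, here is a hand-written condensation in C. It is a paraphrase of what this comment describes, not the actual drm-kmod source, and the signatures are only approximate.

/*
 * Paraphrase of the initialization chain described above; not the actual
 * drm-kmod source, signatures approximate.
 */

/* drivers/gpu/drm/drm_cache.c (FreeBSD port): swiotlb is never requested. */
bool
drm_need_swiotlb(int dma_bits)
{
	return (false);
}

/*
 * gmc_v*.c:           adev->need_swiotlb = drm_need_swiotlb(...);   // always false here
 * amdgpu_ttm_init():  ttm_device_init(..., use_dma_alloc = adev->need_swiotlb, ...);
 * ttm_pool_alloc():   ttm_pool_alloc_page() -> alloc_pages()        (see above)
 *                     -> linux_alloc_pages() -> vm_phys_alloc_contig()
 *                        (the function at the top of the profiler stacks in comment #0)
 */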
I also have an RX580. After upgrading from 13.2 to 14.0 I got frequent kernel panics. Related reports:

* https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276985
* https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=278212

I noticed that manually setting DRI=2 for amdgpu in xorg.conf made the kernel panics less frequent, but instead the system got slower and slower until it became unresponsive; if I managed to kill xorg in time it would work for a while again, until I had to kill xorg again. I could not work out a safe, non-accelerated fallback xorg.conf using scfb that would allow a dual-screen setup with one screen rotated; dual monitors are only possible with amdgpu loaded. It was also not possible to disable acceleration in amdgpu and still have xrandr (secondary screen rotation).

I rolled back to 13.2. DRM 5.15 / AMDGPU / LinuxKPI makes 14.0 unreliable.
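For reference, the DRI=2 setting is a Device option for the amdgpu DDX; a minimal sketch of such a section (the Identifier is arbitrary and everything else is omitted):

Section "Device"
    Identifier "AMD"
    Driver     "amdgpu"
    Option     "DRI" "2"
EndSection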
(In reply to Josef 'Jeff' Sipek from comment #0)

I reported this independently in the drm-kmod GitHub project as https://github.com/freebsd/drm-kmod/issues/302. I am going to cross-link this PR there as well.

I'd really like to get to the bottom of this, but I don't expect to have time to do so before the end of the month at the earliest.

drm-61-kmod exhibits the same problem. However, drm-510-kmod works fine for me.

(In reply to Tomasz "CeDeROM" CEDRO from comment #3)

Please see my comment 8 in bug #278212.
Yeah, so this problem was super annoying. But thanks to the information already posted here, it turned out not to be too hard to fix.

IIUC, the drm code (ttm_pool_alloc()) asking for contiguous pages doesn't actually need contiguous pages; it's just an opportunistic optimization. When an allocation fails, it falls back to asking for fewer and fewer contiguous pages (eventually asking for only one page at a time). When ttm_pool_alloc_page() asks for more than one page, it passes alloc_pages() some extra flags (__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM).

What's expensive is the vm_page_reclaim_contig() call in linux_alloc_pages(). That function tries too hard to find contiguous memory (which the drm code doesn't even require), and as physical memory gets fragmented it becomes very slow.

So, very simple fix: make linux_alloc_pages() react to one of the flags passed by the drm code:

diff --git a/sys/compat/linuxkpi/common/include/linux/gfp.h b/sys/compat/linuxkpi/common/include/linux/gfp.h
index 2fcc0dc05f29..58a021086c98 100644
--- a/sys/compat/linuxkpi/common/include/linux/gfp.h
+++ b/sys/compat/linuxkpi/common/include/linux/gfp.h
@@ -44,7 +44,6 @@
 #define __GFP_NOWARN 0
 #define __GFP_HIGHMEM 0
 #define __GFP_ZERO M_ZERO
-#define __GFP_NORETRY 0
 #define __GFP_NOMEMALLOC 0
 #define __GFP_RECLAIM 0
 #define __GFP_RECLAIMABLE 0
@@ -58,7 +57,8 @@
 #define __GFP_KSWAPD_RECLAIM 0
 #define __GFP_WAIT M_WAITOK
 #define __GFP_DMA32 (1U << 24) /* LinuxKPI only */
-#define __GFP_BITS_SHIFT 25
+#define __GFP_NORETRY (1U << 25) /* LinuxKPI only */
+#define __GFP_BITS_SHIFT 26
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
 #define __GFP_NOFAIL M_WAITOK

diff --git a/sys/compat/linuxkpi/common/src/linux_page.c b/sys/compat/linuxkpi/common/src/linux_page.c
index 18b90b5e3d73..71a6890a3795 100644
--- a/sys/compat/linuxkpi/common/src/linux_page.c
+++ b/sys/compat/linuxkpi/common/src/linux_page.c
@@ -118,7 +118,7 @@ linux_alloc_pages(gfp_t flags, unsigned int order)
 			page = vm_page_alloc_noobj_contig(req, npages, 0, pmax,
 			    PAGE_SIZE, 0, VM_MEMATTR_DEFAULT);
 			if (page == NULL) {
-				if (flags & M_WAITOK) {
+				if ((flags & (M_WAITOK | __GFP_NORETRY)) == M_WAITOK) {
 					int err = vm_page_reclaim_contig(req,
 					    npages, 0, pmax, PAGE_SIZE, 0);
 					if (err == ENOMEM)

Been working fine here with amdgpu for about 3 weeks. (The drm modules need to be recompiled against the modified kernel header.)
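To spell out why failing fast is safe here: the caller is already built to cope with high-order failures. A rough paraphrase of the ttm_pool_alloc() loop, based on the behaviour described above rather than the literal ttm_pool.c source (the real code also clamps the order to the number of remaining pages):

/* Rough paraphrase of the ttm_pool_alloc() fallback; not the literal source. */
while (num_pages != 0) {
	gfp_t gfp = gfp_flags;

	if (order > 0)		/* the opportunistic, contiguous case */
		gfp |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN |
		    __GFP_KSWAPD_RECLAIM;

	p = alloc_pages(gfp, order);	/* LinuxKPI: linux_alloc_pages() */
	if (p == NULL) {
		if (order == 0)
			return (-ENOMEM);	/* genuinely out of memory */
		--order;			/* just ask for a smaller chunk */
		continue;
	}

	/* record the 1 << order pages in tt->pages and keep going */
	num_pages -= 1UL << order;
}

With the patch, the order > 0 attempts now return NULL quickly instead of spending seconds in vm_page_reclaim_contig(), and the loop degrades to smaller orders exactly as it was designed to.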
Interesting find. I could never reproduce this bug on my RX550. Olivier, can you test whether the code (which looks OK to me) fixes the issue for you?
(In reply to sigsys from comment #5)

I've been suffering from this issue on a Ryzen 9 4900H with embedded Renoir graphics, using drm-61-kmod-6.1.92_2 on stable/14-n268738-048132192698. Playing videos with mpv easily triggered the slowdown after some time (especially 4K videos). I've applied the suggested fix and can no longer reproduce the behaviour (tested for 2 days now). Even playing multiple 4K videos in parallel does not cause the problem. Thanks for the fix.
(In reply to sigsys from comment #5)
> IIUC the drm code (ttm_pool_alloc()) asking for contiguous pages doesn't actually need contiguous pages. It's just an opportunistic optimization.

That would be very good news (at least from the users' point of view). I have not spent time on this issue since my last posts. I had naively assumed that the new DRM ports really needed contiguous allocations for whatever reason; I probably should have looked a bit further instead of assuming this would require some deep and highly time-consuming analysis.

(In reply to Emmanuel Vadot from comment #6)

Will test that soon and report.
(In reply to sigsys from comment #5)

Waiting for more people to test, but in the meantime could you attach a git format-patch to this bug, please? (That is, with a full commit message and correct authorship.)
(In reply to Emmanuel Vadot from comment #6) The patch also works well for me, no slowdowns to report after 24 hours.
(In reply to sigsys from comment #5)

Has this patch landed already? I'm eager to test it on my Threadripper with a Navi 24 [Radeon PRO W6400]; the system is borderline useless after ~48h of uptime and needs frequent reboots. At least it's better than before I clamped the ARC to 8GB to slow the process down.
Created attachment 255155 [details] PR277476 fix
(In reply to Emmanuel Vadot from comment #9)

Alright, here it is. Is it already too late to have this merged into 14.2?

I'm pretty sure this patch is safe. __GFP_NORETRY isn't used in-tree at all right now, and this patch makes it do pretty much what it says: it doesn't retry. You'd hope that any code using this flag would expect allocations to fail...

The problem doesn't always happen for everyone, but when it does, man, it's rough. After a week or two I was getting hangs that lasted 15 seconds sometimes. Restarting firefox would fix it for a while, but eventually the system becomes unusable.

Even if this made it into 14.2, IIUC it would take a while before the 14.X packages were compiled against the new kernel headers, but it would already be useful to have it in base so that you could get the fix by compiling drm-kmod from ports.
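To spell out the contract mentioned above ("you'd hope that any code using this flag would expect allocations to fail"), an illustrative sketch, not code from the tree:

/*
 * With __GFP_NORETRY the allocator may fail fast instead of reclaiming and
 * defragmenting, so callers are expected to handle NULL, e.g. by settling
 * for smaller allocations.
 */
p = alloc_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY, order);
if (p == NULL && order > 0)
	p = alloc_pages(GFP_KERNEL, 0);	/* fall back to a single page */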
FWIW, I definitely ran into what sounds just like this (with several different cards, on both 515 and 61; 510 was always rock solid). After a few days, I'd sometimes get freezes lasting a minute or more. A workaround that seems to work for me has been switching from the amdgpu to the modesetting X driver; I still occasionally see little blips, but they resolve and don't seem to pile up the way they did on amdgpu, even after months of uptime.
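In case anyone wants to try the same workaround: the switch amounts to a Device section in xorg.conf selecting the generic modesetting driver (a minimal sketch; the Identifier is arbitrary, and the amdgpu kernel module still provides KMS underneath):

Section "Device"
    Identifier "Card0"
    Driver     "modesetting"
EndSection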
(In reply to sigsys from comment #13) It seems manu@ is having a crash with the patch applied on 5.15. So while it seems safe, we have to rule out some possible impacts in certain situations. I'm afraid it is too late to have it merged in 14.2 anyway, so let's be sure we are not regressing anything while fixing the problem.