Summary: | graphics/drm-515-kmod: amdgpu periodic hangs due to phys contig allocations | ||||||
---|---|---|---|---|---|---|---|
Product: | Ports & Packages | Reporter: | Josef 'Jeff' Sipek <jeffpc> | ||||
Component: | Individual Port(s) | Assignee: | freebsd-x11 (Nobody) <x11> | ||||
Status: | In Progress --- | ||||||
Severity: | Affects Only Me | CC: | agh, emaste, fullermd, khorben, ltning-freebsd, manu, olce, rk, sigsys, swills, tomek, x11 | ||||
Priority: | --- | Flags: | linimon: maintainer-feedback? (x11) | ||||
Version: | Latest | ||||||
Hardware: | Any | ||||||
OS: | Any | ||||||
URL: | https://github.com/freebsd/drm-kmod/issues/302 | ||||||
Description
Josef 'Jeff' Sipek
2024-03-04 14:12:20 UTC
While gathering all the dtrace data, I was so distracted I forgot to mention:

$ freebsd-version -kru
14.0-RELEASE-p5
14.0-RELEASE-p5
14.0-RELEASE-p5

I dug a bit more into this. It looks like the drm code has provisions for allocating memory via dma APIs. The FreeBSD port doesn't implement those.

Specifically, looking at drm-kmod-drm_v5.15.25_5 source:

drivers/gpu/drm/amd/amdgpu/gmc_v*.c sets adev->need_swiotlb to drm_need_swiotlb(...). drm_need_swiotlb is implemented in drivers/gpu/drm/drm_cache.c as a 'return false' on FreeBSD.

Later on, amdgpu_ttm_init calls ttm_device_init with the use_dma_alloc argument equal to adev->need_swiotlb (IOW, false).

Much later on, ttm_pool_alloc is called to allocate a buffer. That in turn calls ttm_pool_alloc_page which amounts to:

	if (!use_dma_alloc)
		return alloc_pages(...);
	panic("ttm_pool.c: use_dma_alloc not implemented");

So, because of the 'return false' during initialization, we always call alloc_pages (aka. linux_alloc_pages) which tries to allocate physically contiguous memory.

As I said before, I don't know anything about the graphics stack, so it is possible that this dma API is completely irrelevant.

Looking at ttm_pool_alloc some more, it immediately turns the physically contiguous allocation into an array of struct page pointers (tt->pages). So, depending on how the rest of the module uses the buffer & pages, it may be relatively easy to switch to a virtually-contiguous allocation.

I also have an RX580. After upgrading from 13.2 to 14.0 I got frequent kernel panics.

Related reports:
* https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276985
* https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=278212

I noticed that manually setting amdgpu DRI=2 in xorg.conf made kernel panics less frequent, but instead the system got slower and slower until it became unresponsive. If I managed to kill xorg in time it could work for a while again, until I had to kill xorg again.
I could not work out a fallback with a safe, non-accelerated xorg.conf using scfb that would allow a dual-screen setup with one screen rotated. Dual monitor is only possible with amdgpu loaded. It was also not possible to disable acceleration in amdgpu and still have xrandr (secondary screen rotation).

Rolled back to 13.2. DRM 5.15 / AMDGPU / LinuxKPI makes 14.0 unreliable.

(In reply to Josef 'Jeff' Sipek from comment #0)
I reported this independently in the drm-kmod GitHub project as https://github.com/freebsd/drm-kmod/issues/302. Going to also cross-link this PR there.

I'd really like to get to the bottom of this. However, I don't expect to have the time to do so before the end of the month at the very least.

drm-61-kmod exhibits the same problem. However, drm-510-kmod works fine for me.

(In reply to Tomasz "CeDeROM" CEDRO from comment #3)
Please see my comment 8 in bug #278212.

Yeah so this problem was super annoying. But thanks to the information already posted here, it seems like it wasn't too hard to fix.

IIUC the drm code (ttm_pool_alloc()) asking for contiguous pages doesn't actually need contiguous pages. It's just an opportunistic optimization. When allocation fails, it falls back to asking for fewer and fewer contiguous pages (eventually only asking for one page at a time). When ttm_pool_alloc_page() asks for more than one page, it passes alloc_pages() some extra flags (__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM).

What's expensive is the vm_page_reclaim_contig() in linux_alloc_pages(). The function tries too hard to find contiguous memory (that the drm code doesn't even require), and as physical memory gets too fragmented it becomes very slow.
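The fallback behaviour described above can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions, not the TTM code: every sketch_* name, the order cap, and the callback-based allocator are hypothetical and exist only to show the "try a large contiguous chunk, then drop to smaller orders" pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_ORDER 10u	/* hypothetical cap, not the value TTM uses */

/*
 * Hypothetical allocator callback: returns true if a physically contiguous
 * run of (1 << order) pages could be obtained without retrying.
 */
typedef bool (*sketch_alloc_fn)(unsigned int order, void *ctx);

/* Largest order with (1 << order) <= n, capped at SKETCH_MAX_ORDER. */
static unsigned int
sketch_fit_order(size_t n)
{
	unsigned int order = 0;

	assert(n > 0);
	while (order < SKETCH_MAX_ORDER && ((size_t)2 << order) <= n)
		order++;
	return (order);
}

/*
 * Satisfy a request for num_pages pages, preferring large contiguous chunks
 * and dropping to a smaller order whenever the allocator refuses.
 * Returns the number of pages obtained; this is less than num_pages only
 * if even a single-page (order 0) allocation fails.
 */
static size_t
sketch_pool_alloc(size_t num_pages, sketch_alloc_fn alloc, void *ctx,
    size_t *nchunks)
{
	size_t done = 0;
	unsigned int order, fit;

	*nchunks = 0;
	if (num_pages == 0)
		return (0);
	order = sketch_fit_order(num_pages);
	while (done < num_pages) {
		/* Never ask for more pages than are still needed. */
		fit = sketch_fit_order(num_pages - done);
		if (order > fit)
			order = fit;
		if (alloc(order, ctx)) {
			done += (size_t)1 << order;
			(*nchunks)++;
		} else if (order > 0) {
			order--;	/* fall back to a smaller chunk */
		} else {
			break;		/* even one page failed */
		}
	}
	return (done);
}

/* Toy allocator simulating fragmentation: orders above max_ok_order fail. */
struct sketch_frag {
	unsigned int max_ok_order;
	unsigned int failed_calls;
};

static bool
sketch_frag_alloc(unsigned int order, void *ctx)
{
	struct sketch_frag *f = ctx;

	if (order > f->max_ok_order) {
		f->failed_calls++;
		return (false);
	}
	return (true);
}
```

With this pattern, a failed high-order attempt costs only one cheap refusal per order, provided the allocator gives up quickly instead of trying hard to reclaim contiguous memory. The request is still satisfied in full from smaller chunks.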
So, very simple fix: make linux_alloc_pages() react to one of the flags passed by the drm code:

diff --git a/sys/compat/linuxkpi/common/include/linux/gfp.h b/sys/compat/linuxkpi/common/include/linux/gfp.h
index 2fcc0dc05f29..58a021086c98 100644
--- a/sys/compat/linuxkpi/common/include/linux/gfp.h
+++ b/sys/compat/linuxkpi/common/include/linux/gfp.h
@@ -44,7 +44,6 @@
 #define __GFP_NOWARN 0
 #define __GFP_HIGHMEM 0
 #define __GFP_ZERO M_ZERO
-#define __GFP_NORETRY 0
 #define __GFP_NOMEMALLOC 0
 #define __GFP_RECLAIM 0
 #define __GFP_RECLAIMABLE 0
@@ -58,7 +57,8 @@
 #define __GFP_KSWAPD_RECLAIM 0
 #define __GFP_WAIT M_WAITOK
 #define __GFP_DMA32 (1U << 24) /* LinuxKPI only */
-#define __GFP_BITS_SHIFT 25
+#define __GFP_NORETRY (1U << 25) /* LinuxKPI only */
+#define __GFP_BITS_SHIFT 26
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
 #define __GFP_NOFAIL M_WAITOK
 
diff --git a/sys/compat/linuxkpi/common/src/linux_page.c b/sys/compat/linuxkpi/common/src/linux_page.c
index 18b90b5e3d73..71a6890a3795 100644
--- a/sys/compat/linuxkpi/common/src/linux_page.c
+++ b/sys/compat/linuxkpi/common/src/linux_page.c
@@ -118,7 +118,7 @@ linux_alloc_pages(gfp_t flags, unsigned int order)
 			page = vm_page_alloc_noobj_contig(req, npages, 0, pmax,
 			    PAGE_SIZE, 0, VM_MEMATTR_DEFAULT);
 			if (page == NULL) {
-				if (flags & M_WAITOK) {
+				if ((flags & (M_WAITOK | __GFP_NORETRY)) == M_WAITOK) {
 					int err = vm_page_reclaim_contig(req,
 					    npages, 0, pmax, PAGE_SIZE, 0);
 					if (err == ENOMEM)

Been working fine here with amdgpu for about 3 weeks.

(The drm modules need to be recompiled with the modified kernel header.)

Interesting find. I never could reproduce this bug on my RX550. Olivier, can you test whether the code (which looks OK to me) fixes the issue for you?

(In reply to sigsys from comment #5)
I've been suffering from this issue on a Ryzen 9 4900H with embedded Renoir graphics using drm-61-kmod-6.1.92_2 on stable/14-n268738-048132192698.
Playing videos using mpv easily triggered the slowdown after some time (especially 4K videos).

I've implemented the suggested fix and now I cannot reproduce the behaviour anymore (tested for 2 days now). Even playing multiple 4K videos in parallel does not cause the problem.

Thanks for the fix.

(In reply to sigsys from comment #5)
> IIUC the drm code (ttm_pool_alloc()) asking for contiguous pages doesn't actually need contiguous pages. It's just an opportunistic optimization.

That would be very good news (at least from the users' point of view). I have not spent time on this issue since my last posts. I had naively thought that the new DRM ports really needed contiguous allocation for whatever reason, and should probably have looked a bit further instead of assuming this would need some deep and highly time-consuming analysis.

(In reply to Emmanuel Vadot from comment #6)
Will test that soon and report.

(In reply to sigsys from comment #5)
Waiting for more people to test, but in the meantime could you add a git-format patch to this bug please? (So with full commit message and correct authorship.)

(In reply to Emmanuel Vadot from comment #6)
The patch also works well for me, no slowdowns to report after 24 hours.

(In reply to sigsys from comment #5)
Has this patch landed already? I'm eager to test on my Threadripper with Navi 24 [Radeon PRO W6400]; it's borderline useless after ~48h uptime and needs frequent reboots to fix. At least it's better than before I clamped the ARC to 8GB to slow the process down.

Created attachment 255155
PR277476 fix
(In reply to Emmanuel Vadot from comment #9)
Alright, here it is.

Is it already too late to have this merged in 14.2? I'm pretty sure this patch is safe. GFP_NORETRY isn't used in-tree at all right now. And this patch makes it do pretty much what it says. It doesn't retry. You'd hope that any code using this flag would expect allocations to fail...

The problem doesn't always happen for everyone, but when it does, man, it's rough. After a week or two I was getting hangs that sometimes lasted 15 seconds. Restarting firefox would fix it for a while, but eventually it becomes unusable.

Even if this made it into 14.2, IIUC it would take a while before the 14.X packages would be compiled against the new kernel headers, but it would already be useful to have it in base so that you could get the fix by compiling drm-kmod from ports.

FWIW, I definitely ran into what sounds just like this (with several different cards, on both 515 and 61; 510 was always rock solid). After a few days, I'd sometimes get freezes lasting a minute or more.

A workaround that seems to work for me has been switching from the amdgpu to the modesetting X driver; I still occasionally see little blips, but they resolve and don't seem to pile up the way they did on amdgpu, even after months of uptime.

(In reply to sigsys from comment #13)
It seems manu@ is having a crash with the patch applied on 5.15. So while it seems safe, we have to rule out some possible impacts in certain situations. I'm afraid it is too late to have it merged in 14.2 anyway, so let's be sure we are not regressing anything while fixing the problem.
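For anyone reviewing the patch, the changed condition in linux_alloc_pages() can be examined in isolation. The sketch below is only an illustration: SKETCH_WAITOK and SKETCH_NORETRY are stand-in values chosen for this example (the patch itself defines __GFP_NORETRY as 1U << 25 and takes M_WAITOK from the kernel headers), and sketch_may_reclaim() is a hypothetical helper, not a LinuxKPI function.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag values for illustration only. */
#define SKETCH_WAITOK	(1u << 1)	/* plays the role of M_WAITOK */
#define SKETCH_NORETRY	(1u << 25)	/* plays the role of __GFP_NORETRY */

/*
 * Mirrors the patched test: the expensive contiguous reclaim is attempted
 * only when the caller may sleep AND did not ask for "no retry".
 * Masking both bits and comparing against WAITOK alone checks
 * "WAITOK set and NORETRY clear" in a single expression.
 */
static bool
sketch_may_reclaim(unsigned int flags)
{
	return ((flags & (SKETCH_WAITOK | SKETCH_NORETRY)) == SKETCH_WAITOK);
}
```

Callers that do not pass the no-retry flag keep the old behaviour (reclaim when sleeping is allowed). Only callers that opt in with the flag, as ttm_pool_alloc_page() does for multi-page requests, skip the reclaim and see the allocation fail immediately.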