Created attachment 253058 [details] Boot log with the agp_close() kernel panic and backtrace Poking /dev/agpgart is enough to trigger kernel panic via loaded agp(4) driver [0]. While this was originally discovered by running > env ZPOOL_IMPORT_PATH=/dev zpool import -Na which in turn tries to openat(2) everything under /dev tree including /dev/agpgart, which then triggers agp_close() [1] call in the driver, and then you get kernel panic. Attached there is a boot log including the moment where the system crashes, that is when the above zpool import is called.) I believe simply doing > echo > /dev/agpgart will do the same job - feel feel to try if you have some vintage hardware to crash. *** Note: I do NOT have any hardware to reproduce, test any fixes or whatever else here, provide core dumps or whatever similar. However, this was reproduced by multiple users [2] and caused completely unexpected fatal system crashes. *** Recompiling the kernel without the agp(4) driver avoids this issue. Also, using > set hint.agp.0.disabled=1 at loader prompt avoids the issue (also confirmed by multiple users). Suggestion: Perhaps follow the kernel warning > WARNING: Device "agp" is Giant locked and may be deleted before FreeBSD 15.0. and delete the driver from FreeBSD as soon as practicable. Having it in kernel in this state is doing more harm than good. (There is also another bug unsolved and open since 2014, which also results in unbootable systems [3]). [0] https://man.freebsd.org/cgi/man.cgi?query=agp&sektion=4&format=html [1] https://github.com/freebsd/freebsd-src/blob/5cbb98c8259c48ba22c8359f4c14f5438329ce58/sys/dev/agp/agp.c#L829 [2] https://forum.opnsense.org/index.php?topic=42373.0 [3] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187015
Got a vmcore from a user: (kgdb) bt #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57 #1 doadump (textdump=textdump@entry=0) at /usr/src/sys/kern/kern_shutdown.c:405 #2 0xffffffff8049c2ea in db_dump (dummy=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:591 #3 0xffffffff8049c0ed in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:504 #4 0xffffffff8049c236 in db_command_script (command=command@entry=0xffffffff81bbf6d3 <db_recursion_data+3> "dump") at /usr/src/sys/ddb/db_command.c:569 #5 0xffffffff804a14a8 in db_script_exec (scriptname=<optimized out>, warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:302 #6 0xffffffff804a13b5 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:325 #7 0xffffffff8049f471 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:267 #8 0xffffffff80c09108 in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe00895e0730) at /usr/src/sys/kern/subr_kdb.c:790 #9 0xffffffff810df419 in trap (frame=0xfffffe00895e0730) at /usr/src/sys/amd64/amd64/trap.c:608 #10 <signal handler called> #11 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556 #12 0xffffffff80bb8ad2 in vpanic (fmt=0xffffffff8126b7cf "%s", ap=ap@entry=0xfffffe00895e0960) at /usr/src/sys/kern/kern_shutdown.c:955 #13 0xffffffff80bb8b83 in panic (fmt=0xffffffff81d82c18 <cnputs_mtx+24> "") at /usr/src/sys/kern/kern_shutdown.c:891 #14 0xffffffff810dfeab in trap_fatal (frame=0xfffffe00895e0a40, eva=0) at /usr/src/sys/amd64/amd64/trap.c:952 #15 0xffffffff810dff07 in trap_pfault (frame=<optimized out>, usermode=<optimized out>, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:760 #16 <signal handler called> #17 0xffffffff804d8477 in AGP_UNBIND_MEMORY (dev=0xfffff800045d3000, handle=0xfffff800045d2600) at ./agp_if.h:156 #18 agp_close (kdev=<optimized out>, fflag=<optimized out>, devtype=<optimized out>, td=<optimized out>) at /usr/src/sys/dev/agp/agp.c:840 #19 0xffffffff80b4426f in giant_close (dev=0xfffff800045e7800, fflag=131077, devtype=8192, td=0xfffff80042b6d740) at /usr/src/sys/kern/kern_conf.c:389 #20 0xffffffff80a25191 in devfs_close (ap=0xfffffe00895e0c00) at /usr/src/sys/fs/devfs/devfs_vnops.c:769 #21 0xffffffff811bcc3f in VOP_CLOSE_APV (vop=0xffffffff81ab2888 <devfs_specops>, a=a@entry=0xfffffe00895e0c00) at vnode_if.c:496 #22 0xffffffff80ccbde0 in VOP_CLOSE (vp=0xfffff800425e3540, fflag=5, cred=0xfffff80085d7b500, td=0xfffff80042b6d740) at ./vnode_if.h:247 #23 vn_close1 (vp=vp@entry=0xfffff800425e3540, flags=5, file_cred=0xfffff80085d7b500, td=0xfffff80042b6d740, keep_ref=false) at /usr/src/sys/kern/vfs_vnops.c:543 #24 0xffffffff80cca1cd in vn_closefile (fp=0xfffff80018ba7a00, td=0xfffff80004892900) at /usr/src/sys/kern/vfs_vnops.c:1849 #25 0xffffffff80a25bfa in devfs_close_f (fp=0xfffff800045d3000, td=0x20005) at /usr/src/sys/fs/devfs/devfs_vnops.c:788 #26 0xffffffff80b5038b in fo_close (fp=0xfffff80018ba7a00, td=0x20005) at /usr/src/sys/sys/file.h:392 #27 _fdrop (fp=fp@entry=0xfffff80018ba7a00, td=0x20005, td@entry=0xfffff80042b6d740) at /usr/src/sys/kern/kern_descrip.c:3668 #28 0xffffffff80b53e53 in closef (fp=fp@entry=0xfffff80018ba7a00, td=td@entry=0xfffff80042b6d740) at /usr/src/sys/kern/kern_descrip.c:2841 #29 0xffffffff80b57f91 in closefp_impl (fdp=0xfffffe008977d860, fd=9, fp=0xfffff80018ba7a00, td=0xfffff80042b6d740, audit=true) at /usr/src/sys/kern/kern_descrip.c:1317 #30 0xffffffff810e0880 in syscallenter (td=0xfffff80042b6d740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:188 #31 amd64_syscall (td=0xfffff80042b6d740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1194 #32 <signal handler called> #33 0x000028a511a672ba in ?? () Backtrace stopped: Cannot access memory at address 0x28a517915d88
(kgdb) frame 18 #18 agp_close (kdev=<optimized out>, fflag=<optimized out>, devtype=<optimized out>, td=<optimized out>) at /usr/src/sys/dev/agp/agp.c:840 840 AGP_UNBIND_MEMORY(dev, mem); (kgdb) p dev $1 = (device_t) 0xfffff800045d3000 (kgdb) p mem $2 = (struct agp_memory *) 0xfffff800045d2600 (kgdb) p *dev $3 = {ops = 0xfffff80004892900, link = {tqe_next = 0xf5c0000000000018, tqe_prev = 0x0}, devlink = {tqe_next = 0x0, tqe_prev = 0xfffff800045d3018}, parent = 0x1, children = { tqh_first = 0xfffff800045e7800, tqh_last = 0xfffff800045e7400}, driver = 0xffffffff811ece77, devclass = 0x1030000, unit = 0, nameunit = 0x0, desc = 0x10000000 <error: Cannot access memory at address 0x10000000>, busy = 76113856, state = 4294965248, devflags = 0, flags = 2047, order = 8388608, ivars = 0x10000, softc = 0xfffff800045d2700, sysctl_ctx = {tqh_first = 0x0, tqh_last = 0xfffff80004892b00}, sysctl_tree = 0xfffff80004892880} (kgdb) p *mem $4 = {am_link = {tqe_next = 0xfffff80004768000, tqe_prev = 0xfffff800045d2500}, am_id = 73213704, am_size = 18446735277689742592, am_type = 73213720, am_obj = 0xfffff800045d2800, am_physical = 18446735277689745920, am_offset = 18446735277689745672, am_is_bound = -2119914208}
am_is_bound = -2119914208 this sure is sketchy :) if (mem->am_is_bound) AGP_UNBIND_MEMORY(dev, mem);
Reading the code a bit, it's hard to see where the memory corruption might be coming from. Is it possible to run a debug kernel on this hardware? Or even a KASAN kernel if possible.
Can share the vmcore from the user from a debug kernel or dump further info if you want. I don't want to put the user through more debug. For us disabling agp(4) in the kernel config is a viable option here. But I can ask the user to try a patch probably. I noticed there is no mutex locking in agp_close() as seems customary in the code around am_is_bound et al. But it's probably irrelevant due to Giant lock? Cheers, Franco
(In reply to Franco Fichtner from comment #5) I can take a look at the core. Indeed, Giant should be serializing file operations on /dev/agp0.
I'll sent it via mail. Thanks, Franco
Created attachment 253148 [details] proposed patch There is a regression from commit 437ea82ce7fc that is most likely the cause here. I attached a patch. Can anyone test it?
(In reply to Mark Johnston from comment #8) I was looking at the code too and this is definitely the immediate problem. Reviewed by: mhorne
Thanks. I've published a test kernel on our end. I'll report back with the results. What's the plan here: merge fix to stable branches? Remove agp from default kernel for 15.0 seeing it has no real mileage anymore? The people reporting this have cheap "network" boxes with agp somewhere on the hardware but not even usable since these appear to be serial devices or have otherwise means of VGA. Warner asked the same in 2020: https://cgit.freebsd.org/src/commit/?id=4f8959b9f4b > Compile tested only, but do we still need this driver? We may have an answer now. :) Cheers, Franco
Confirmed fixed by the user who provided the core dump. https://forum.opnsense.org/index.php?topic=42373.msg210575#msg210575 Cheers, Franco
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=12500c14281dc62ddeac4c5e1e6eabd1e380f11c commit 12500c14281dc62ddeac4c5e1e6eabd1e380f11c Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2024-08-29 13:12:19 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2024-08-29 13:12:19 +0000 agp: Set the driver-specific field correctly PR: 281035 Reviewed by: mhorne MFC after: 1 week Fixes: 437ea82ce7fc ("agp: Handle multiple devices more gracefully") sys/dev/agp/agp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
(In reply to Franco Fichtner from comment #11) Thank you. (In reply to Franco Fichtner from comment #10) Yes, I will start some discussion around this.
Just wanted to say thanks to Mark and Mitchell. Unlike some other bug reports here, this one has been really productive.
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=57e0d4b9a609cc2fe267489a3fe46857035ed3ae commit 57e0d4b9a609cc2fe267489a3fe46857035ed3ae Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2024-08-29 13:12:19 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2024-09-05 13:44:03 +0000 agp: Set the driver-specific field correctly PR: 281035 Reviewed by: mhorne MFC after: 1 week Fixes: 437ea82ce7fc ("agp: Handle multiple devices more gracefully") (cherry picked from commit 12500c14281dc62ddeac4c5e1e6eabd1e380f11c) sys/dev/agp/agp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
@Mark close this one?