Bug 281035 - agp driver kernel panic on 14.1 - agp_close(): page fault while in kernel mode
Summary: agp driver kernel panic on 14.1 - agp_close(): page fault while in kernel mode
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 14.1-RELEASE
Hardware: amd64 Any
: --- Affects Some People
Assignee: Mark Johnston
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2024-08-24 14:57 UTC by doktornotor
Modified: 2024-09-22 15:36 UTC (History)
3 users (show)

See Also:


Attachments
Boot log with the agp_close() kernel panic and backtrace (11.85 KB, text/plain)
2024-08-24 14:57 UTC, doktornotor
no flags Details
proposed patch (653 bytes, patch)
2024-08-28 14:19 UTC, Mark Johnston
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description doktornotor 2024-08-24 14:57:07 UTC
Created attachment 253058 [details]
Boot log with the agp_close() kernel panic and backtrace

Poking /dev/agpgart is enough to trigger kernel panic via loaded agp(4) driver [0]. While this was originally discovered by running

> env ZPOOL_IMPORT_PATH=/dev zpool import -Na

which in turn tries to openat(2) everything under /dev tree including /dev/agpgart, which then triggers agp_close() [1] call in the driver, and then you get kernel panic. Attached there is a boot log including the moment where the system crashes, that is when the above zpool import is called.)

I believe simply doing 

> echo > /dev/agpgart

will do the same job - feel feel to try if you have some vintage hardware to crash.

*** 
Note: I do NOT have any hardware to reproduce, test any fixes or whatever else here, provide core dumps or whatever similar. However, this was reproduced by multiple users [2] and caused completely unexpected fatal system crashes.
***

Recompiling the kernel without the agp(4) driver avoids this issue. Also, using 

> set hint.agp.0.disabled=1

at loader prompt avoids the issue (also confirmed by multiple users).

Suggestion: Perhaps follow the kernel warning

> WARNING: Device "agp" is Giant locked and may be deleted before FreeBSD 15.0.

and delete the driver from FreeBSD as soon as practicable. Having it in kernel in this state is doing more harm than good. (There is also another bug unsolved and open since 2014, which also results in unbootable systems [3]).

[0] https://man.freebsd.org/cgi/man.cgi?query=agp&sektion=4&format=html
[1] https://github.com/freebsd/freebsd-src/blob/5cbb98c8259c48ba22c8359f4c14f5438329ce58/sys/dev/agp/agp.c#L829
[2] https://forum.opnsense.org/index.php?topic=42373.0
[3] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187015
Comment 1 Franco Fichtner 2024-08-27 13:38:32 UTC
Got a vmcore from a user:

(kgdb) bt
#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=textdump@entry=0) at /usr/src/sys/kern/kern_shutdown.c:405
#2  0xffffffff8049c2ea in db_dump (dummy=<optimized out>, dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>) at /usr/src/sys/ddb/db_command.c:591
#3  0xffffffff8049c0ed in db_command (last_cmdp=<optimized out>, cmd_table=<optimized out>, dopager=false) at /usr/src/sys/ddb/db_command.c:504
#4  0xffffffff8049c236 in db_command_script (command=command@entry=0xffffffff81bbf6d3 <db_recursion_data+3> "dump") at /usr/src/sys/ddb/db_command.c:569
#5  0xffffffff804a14a8 in db_script_exec (scriptname=<optimized out>, warnifnotfound=warnifnotfound@entry=0) at /usr/src/sys/ddb/db_script.c:302
#6  0xffffffff804a13b5 in db_script_kdbenter (eventname=<optimized out>) at /usr/src/sys/ddb/db_script.c:325
#7  0xffffffff8049f471 in db_trap (type=<optimized out>, code=<optimized out>) at /usr/src/sys/ddb/db_main.c:267
#8  0xffffffff80c09108 in kdb_trap (type=type@entry=3, code=code@entry=0, tf=tf@entry=0xfffffe00895e0730) at /usr/src/sys/kern/subr_kdb.c:790
#9  0xffffffff810df419 in trap (frame=0xfffffe00895e0730) at /usr/src/sys/amd64/amd64/trap.c:608
#10 <signal handler called>
#11 kdb_enter (why=<optimized out>, msg=<optimized out>) at /usr/src/sys/kern/subr_kdb.c:556
#12 0xffffffff80bb8ad2 in vpanic (fmt=0xffffffff8126b7cf "%s", ap=ap@entry=0xfffffe00895e0960) at /usr/src/sys/kern/kern_shutdown.c:955
#13 0xffffffff80bb8b83 in panic (fmt=0xffffffff81d82c18 <cnputs_mtx+24> "") at /usr/src/sys/kern/kern_shutdown.c:891
#14 0xffffffff810dfeab in trap_fatal (frame=0xfffffe00895e0a40, eva=0) at /usr/src/sys/amd64/amd64/trap.c:952
#15 0xffffffff810dff07 in trap_pfault (frame=<optimized out>, usermode=<optimized out>, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:760
#16 <signal handler called>
#17 0xffffffff804d8477 in AGP_UNBIND_MEMORY (dev=0xfffff800045d3000, handle=0xfffff800045d2600) at ./agp_if.h:156
#18 agp_close (kdev=<optimized out>, fflag=<optimized out>, devtype=<optimized out>, td=<optimized out>) at /usr/src/sys/dev/agp/agp.c:840
#19 0xffffffff80b4426f in giant_close (dev=0xfffff800045e7800, fflag=131077, devtype=8192, td=0xfffff80042b6d740) at /usr/src/sys/kern/kern_conf.c:389
#20 0xffffffff80a25191 in devfs_close (ap=0xfffffe00895e0c00) at /usr/src/sys/fs/devfs/devfs_vnops.c:769
#21 0xffffffff811bcc3f in VOP_CLOSE_APV (vop=0xffffffff81ab2888 <devfs_specops>, a=a@entry=0xfffffe00895e0c00) at vnode_if.c:496
#22 0xffffffff80ccbde0 in VOP_CLOSE (vp=0xfffff800425e3540, fflag=5, cred=0xfffff80085d7b500, td=0xfffff80042b6d740) at ./vnode_if.h:247
#23 vn_close1 (vp=vp@entry=0xfffff800425e3540, flags=5, file_cred=0xfffff80085d7b500, td=0xfffff80042b6d740, keep_ref=false) at /usr/src/sys/kern/vfs_vnops.c:543
#24 0xffffffff80cca1cd in vn_closefile (fp=0xfffff80018ba7a00, td=0xfffff80004892900) at /usr/src/sys/kern/vfs_vnops.c:1849
#25 0xffffffff80a25bfa in devfs_close_f (fp=0xfffff800045d3000, td=0x20005) at /usr/src/sys/fs/devfs/devfs_vnops.c:788
#26 0xffffffff80b5038b in fo_close (fp=0xfffff80018ba7a00, td=0x20005) at /usr/src/sys/sys/file.h:392
#27 _fdrop (fp=fp@entry=0xfffff80018ba7a00, td=0x20005, td@entry=0xfffff80042b6d740) at /usr/src/sys/kern/kern_descrip.c:3668
#28 0xffffffff80b53e53 in closef (fp=fp@entry=0xfffff80018ba7a00, td=td@entry=0xfffff80042b6d740) at /usr/src/sys/kern/kern_descrip.c:2841
#29 0xffffffff80b57f91 in closefp_impl (fdp=0xfffffe008977d860, fd=9, fp=0xfffff80018ba7a00, td=0xfffff80042b6d740, audit=true) at /usr/src/sys/kern/kern_descrip.c:1317
#30 0xffffffff810e0880 in syscallenter (td=0xfffff80042b6d740) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:188
#31 amd64_syscall (td=0xfffff80042b6d740, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1194
#32 <signal handler called>
#33 0x000028a511a672ba in ?? ()
Backtrace stopped: Cannot access memory at address 0x28a517915d88
Comment 2 Franco Fichtner 2024-08-27 13:39:48 UTC
(kgdb) frame 18
#18 agp_close (kdev=<optimized out>, fflag=<optimized out>, devtype=<optimized out>, td=<optimized out>) at /usr/src/sys/dev/agp/agp.c:840
840				AGP_UNBIND_MEMORY(dev, mem);
(kgdb) p dev
$1 = (device_t) 0xfffff800045d3000
(kgdb) p mem
$2 = (struct agp_memory *) 0xfffff800045d2600
(kgdb) p *dev
$3 = {ops = 0xfffff80004892900, link = {tqe_next = 0xf5c0000000000018, tqe_prev = 0x0}, devlink = {tqe_next = 0x0, tqe_prev = 0xfffff800045d3018}, parent = 0x1, children = {
    tqh_first = 0xfffff800045e7800, tqh_last = 0xfffff800045e7400}, driver = 0xffffffff811ece77, devclass = 0x1030000, unit = 0, nameunit = 0x0, 
  desc = 0x10000000 <error: Cannot access memory at address 0x10000000>, busy = 76113856, state = 4294965248, devflags = 0, flags = 2047, order = 8388608, ivars = 0x10000, 
  softc = 0xfffff800045d2700, sysctl_ctx = {tqh_first = 0x0, tqh_last = 0xfffff80004892b00}, sysctl_tree = 0xfffff80004892880}
(kgdb) p *mem
$4 = {am_link = {tqe_next = 0xfffff80004768000, tqe_prev = 0xfffff800045d2500}, am_id = 73213704, am_size = 18446735277689742592, am_type = 73213720, am_obj = 0xfffff800045d2800, 
  am_physical = 18446735277689745920, am_offset = 18446735277689745672, am_is_bound = -2119914208}
Comment 3 Franco Fichtner 2024-08-27 13:43:25 UTC
am_is_bound = -2119914208

this sure is sketchy :)

if (mem->am_is_bound)
    AGP_UNBIND_MEMORY(dev, mem);
Comment 4 Mark Johnston freebsd_committer freebsd_triage 2024-08-27 14:22:51 UTC
Reading the code a bit, it's hard to see where the memory corruption might be coming from.  Is it possible to run a debug kernel on this hardware?  Or even a KASAN kernel if possible.
Comment 5 Franco Fichtner 2024-08-27 14:40:59 UTC
Can share the vmcore from the user from a debug kernel or dump further info if you want. I don't want to put the user through more debug. For us disabling agp(4) in the kernel config is a viable option here. But I can ask the user to try a patch probably.

I noticed there is no mutex locking in agp_close() as seems customary in the code around am_is_bound et al.  But it's probably irrelevant due to Giant lock?



Cheers,
Franco
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2024-08-28 13:27:17 UTC
(In reply to Franco Fichtner from comment #5)
I can take a look at the core.

Indeed, Giant should be serializing file operations on /dev/agp0.
Comment 7 Franco Fichtner 2024-08-28 13:45:50 UTC
I'll sent it via mail.


Thanks,
Franco
Comment 8 Mark Johnston freebsd_committer freebsd_triage 2024-08-28 14:19:18 UTC
Created attachment 253148 [details]
proposed patch

There is a regression from commit 437ea82ce7fc that is most likely the cause here.  I attached a patch.  Can anyone test it?
Comment 9 Mitchell Horne freebsd_committer freebsd_triage 2024-08-28 14:32:30 UTC
(In reply to Mark Johnston from comment #8)

I was looking at the code too and this is definitely the immediate problem.

Reviewed by: mhorne
Comment 10 Franco Fichtner 2024-08-29 06:01:24 UTC
Thanks. I've published a test kernel on our end. I'll report back with the results.

What's the plan here: merge fix to stable branches? Remove agp from default kernel for 15.0 seeing it has no real mileage anymore?

The people reporting this have cheap "network" boxes with agp somewhere on the hardware but not even usable since these appear to be serial devices or have otherwise means of VGA.

Warner asked the same in 2020:

https://cgit.freebsd.org/src/commit/?id=4f8959b9f4b

> Compile tested only, but do we still need this driver?

We may have an answer now.  :)


Cheers,
Franco
Comment 11 Franco Fichtner 2024-08-29 09:45:01 UTC
Confirmed fixed by the user who provided the core dump.

https://forum.opnsense.org/index.php?topic=42373.msg210575#msg210575


Cheers,
Franco
Comment 12 commit-hook freebsd_committer freebsd_triage 2024-08-29 13:52:08 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=12500c14281dc62ddeac4c5e1e6eabd1e380f11c

commit 12500c14281dc62ddeac4c5e1e6eabd1e380f11c
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2024-08-29 13:12:19 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2024-08-29 13:12:19 +0000

    agp: Set the driver-specific field correctly

    PR:             281035
    Reviewed by:    mhorne
    MFC after:      1 week
    Fixes:          437ea82ce7fc ("agp: Handle multiple devices more gracefully")

 sys/dev/agp/agp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 13 Mark Johnston freebsd_committer freebsd_triage 2024-08-29 13:56:07 UTC
(In reply to Franco Fichtner from comment #11)
Thank you.

(In reply to Franco Fichtner from comment #10)
Yes, I will start some discussion around this.
Comment 14 doktornotor 2024-09-04 20:21:51 UTC
Just wanted to say thanks to Mark and Mitchell. Unlike some other bug reports here, this one has been really productive.
Comment 15 commit-hook freebsd_committer freebsd_triage 2024-09-05 13:47:08 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=57e0d4b9a609cc2fe267489a3fe46857035ed3ae

commit 57e0d4b9a609cc2fe267489a3fe46857035ed3ae
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2024-08-29 13:12:19 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2024-09-05 13:44:03 +0000

    agp: Set the driver-specific field correctly

    PR:             281035
    Reviewed by:    mhorne
    MFC after:      1 week
    Fixes:          437ea82ce7fc ("agp: Handle multiple devices more gracefully")

    (cherry picked from commit 12500c14281dc62ddeac4c5e1e6eabd1e380f11c)

 sys/dev/agp/agp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 16 Franco Fichtner 2024-09-22 15:20:10 UTC
@Mark close this one?