Bug 267028 - kernel panics when booting with both (zfs.ko or vboxnetflt.ko or acpi_wmi.ko) and amdgpu.ko
Summary: kernel panics when booting with both (zfs.ko or vboxnetflt.ko or acpi_wmi.ko)...
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash, needs-qa
Duplicates: 268416
Depends on:
Blocks:
 
Reported: 2022-10-13 22:47 UTC by George Mitchell
Modified: 2024-12-21 21:41 UTC
CC List: 9 users

See Also:
grahamperrin: maintainer-feedback?


Attachments
/var/crash/core.txt.1 crash description (80.48 KB, text/plain)
2022-10-13 22:47 UTC, George Mitchell
no flags Details
/var/crash/core.txt.3 crash description from today (79.99 KB, text/plain)
2022-11-11 23:56 UTC, George Mitchell
no flags Details
Another crash (87.45 KB, text/plain)
2022-11-14 17:40 UTC, George Mitchell
no flags Details
Another crash summary; looks like all the earlier ones (79.99 KB, text/plain)
2022-12-09 17:35 UTC, George Mitchell
no flags Details
New core.txt (90.81 KB, text/plain)
2022-12-14 22:37 UTC, George Mitchell
no flags Details
A new crash (96.09 KB, text/plain)
2022-12-16 22:40 UTC, George Mitchell
no flags Details
A new instance of the same crash (96.09 KB, text/plain)
2022-12-16 22:42 UTC, George Mitchell
no flags Details
Crash after updating kernel/world to 13.1-RELEASE-p5 (109.74 KB, text/plain)
2022-12-18 02:16 UTC, George Mitchell
no flags Details
Crash dump (78.93 KB, text/plain)
2023-01-07 18:08 UTC, George Mitchell
no flags Details
Latest crash dump (87.44 KB, text/plain)
2023-01-28 00:13 UTC, George Mitchell
no flags Details
Crash after loading vboxnetflt early by hand (119.82 KB, text/plain)
2023-02-07 15:10 UTC, George Mitchell
no flags Details
New version of the crash, from acpi_wmi (88.80 KB, text/plain)
2023-02-26 17:35 UTC, George Mitchell
no flags Details
A new but related crash (I think) (82.82 KB, text/plain)
2023-03-05 03:21 UTC, George Mitchell
no flags Details
Another crash summary; looks like all the earlier ones (152.66 KB, text/plain)
2023-03-06 18:15 UTC, George Mitchell
no flags Details
Crash without any use of ZFS, with acpi_wmi (120.99 KB, text/plain)
2023-03-07 18:40 UTC, George Mitchell
no flags Details
Relevant part of /var/log/messages (52.16 KB, text/plain)
2023-03-07 18:43 UTC, George Mitchell
no flags Details
New instance (112.96 KB, text/plain)
2023-03-08 22:35 UTC, George Mitchell
no flags Details
Crashes 2 and 3 (196.92 KB, text/plain)
2023-03-08 22:37 UTC, George Mitchell
no flags Details
Another instance of attachment #240591 crash at shutdown time (82.84 KB, text/plain)
2023-03-10 18:28 UTC, George Mitchell
no flags Details
After upgrading to v5.10.163_2 (115.48 KB, text/plain)
2023-03-10 18:45 UTC, George Mitchell
no flags Details
Four boot-time crashes in a row (94.09 KB, application/octet-stream)
2023-03-20 22:17 UTC, George Mitchell
no flags Details
Another shutdown-time crash (83.02 KB, text/plain)
2023-03-21 00:07 UTC, George Mitchell
no flags Details
Crash at shutdown time (106.50 KB, text/plain)
2023-03-22 00:23 UTC, George Mitchell
no flags Details
Crash that happened neither at startup nor shutdown (279.11 KB, text/plain)
2023-04-16 02:02 UTC, George Mitchell
no flags Details
Shutdown crash with version 5.10.163_5 (92.45 KB, text/plain)
2023-04-25 18:12 UTC, George Mitchell
no flags Details
And another plain old boot time crash (105.33 KB, text/plain)
2023-04-25 22:46 UTC, George Mitchell
no flags Details
Latest crash dump/text (88.24 KB, text/plain)
2024-12-13 00:17 UTC, George Mitchell
no flags Details
core.txt.9 (88.28 KB, text/plain)
2024-12-21 17:54 UTC, George Mitchell
no flags Details
core.txt.0 (88.42 KB, text/plain)
2024-12-21 17:55 UTC, George Mitchell
no flags Details

Description George Mitchell 2022-10-13 22:47:29 UTC
Created attachment 237279 [details]
/var/crash/core.txt.1 crash description

It doesn't happen every time.  If I use kld_list="amdgpu" in /etc/rc.conf, it happens close to 50% of the time.  If instead I boot to single user mode and manually kldload amdgpu, it happens maybe 20% of the time.  If I have amdgpu_load="YES" in /boot/loader.conf, the module fails to load at all, without saying anything.

FreeBSD 13.1-RELEASE-p2, drm-510-kmod-5.10.113_7, AMD Ryzen 3 2200G with Radeon Vega Graphics.

Crashes are always general protection fault panics, replete with complaints about drm_modeset_is_locked being false.
Comment 1 George Mitchell 2022-10-13 22:50:06 UTC
This is marginally an improvement from FreeBSD 12, where kldload amdgpu would always immediately totally lock up the machine, with no recovery path short of powering down and back on.  And when this crash DOESN'T happen, everything works marvelously well (and considerably better than running in VESA mode), so thanks for the work so far!
Comment 2 George Mitchell 2022-10-13 22:52:23 UTC
I have four more of the /var/crash/core.txt files, and core dumps (very large, too big to attach here even compressed) for each of them.
Comment 3 Graham Perrin freebsd_committer freebsd_triage 2022-10-13 23:15:13 UTC
Thank you, and please note that issues for <https://github.com/freebsd/drm-kmod> are normally raised in GitHub.
Comment 4 George Mitchell 2022-10-13 23:22:33 UTC
Ugh, I don't have a GitHub account and I would rather not open one.  (Yes, that does seem selfish of me and I apologize.)
Comment 5 Andriy Gapon freebsd_committer freebsd_triage 2022-10-14 21:17:47 UTC
From a _very_ quick look, it does not appear that this is an amdgpu problem.
The crash is in the core kernel code and the stack trace has mentions of zfs.

#6  <signal handler called>
#7  strcmp (s1=<optimized out>, s2=<optimized out>)
    at /usr/src/sys/libkern/strcmp.c:46
#8  0xffffffff80be8c3d in modlist_lookup (name=0xfffff80004b71000 "zfs", 
    ver=0) at /usr/src/sys/kern/kern_linker.c:1487
#9  modlist_lookup2 (name=0xfffff80004b71000 "zfs", verinfo=0x0)
    at /usr/src/sys/kern/kern_linker.c:1501
#10 linker_load_module (kldname=kldname@entry=0x0, 
    modname=modname@entry=0xfffff80004b71000 "zfs", parent=parent@entry=0x0, 
    verinfo=<optimized out>, verinfo@entry=0x0, 
    lfpp=lfpp@entry=0xfffffe0075fddd90)
    at /usr/src/sys/kern/kern_linker.c:2165
#11 0xffffffff80beb17a in kern_kldload (td=td@entry=0xfffffe007f505a00, 
    file=<optimized out>, file@entry=0xfffff80004b71000 "zfs", 
    fileid=fileid@entry=0xfffffe0075fddde4)
    at /usr/src/sys/kern/kern_linker.c:1150
#12 0xffffffff80beb29b in sys_kldload (td=0xfffffe007f505a00, 
    uap=<optimized out>) at /usr/src/sys/kern/kern_linker.c:1173
#13 0xffffffff810ae6ec in syscallenter (td=0xfffffe007f505a00)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#14 amd64_syscall (td=0xfffffe007f505a00, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1185

To the reporter: do you by chance have zfs in kld_list ?
Comment 6 Graham Perrin freebsd_committer freebsd_triage 2022-10-14 22:08:49 UTC
Also, how is the root file system tuned? 

tunefs -p /

(In reply to George Mitchell from comment #1)

> … from FreeBSD 12 …

Did you run 13.0⋯ for a while, or did you upgrade from 12.⋯ direct to 13.1⋯?
Comment 7 Graham Perrin freebsd_committer freebsd_triage 2022-10-14 22:36:56 UTC
> … immediately after kldload amdgpu …

(In reply to George Mitchell from comment #0)

If I understand correctly, the attachment shows: 

1. kldload amdgpu whilst in single user mode

2. a subsequent, but non-immediate, exit ^D to multi-user mode

3. panic 


…
ugen0.4: <Logitech USB Optical Mouse> at usbus0
<118>Enter full pathname of shell or RETURN for /bin/sh: Cannot read termcap database;
<118>using dumb terminal settings.
<118>root@:/ # kldload amdgpu
<6>[drm] amdgpu kernel modesetting enabled.
…
<6>[drm] Initialized amdgpu 3.40.0 20150101 for drmn0 on minor 0
<118>root@:/ # ^D
<118>Setting hostuuid: 032e02b4-0499-0547-c106-430700080009.
<118>Setting hostid: 0x82f0750c.


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff80d17870
stack pointer	        = 0x28:0xfffffe0075fdda60
frame pointer	        = 0x28:0xfffffe0075fdda60
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 52 (kldload)
…
Comment 8 George Mitchell 2022-10-14 22:38:31 UTC
Thanks for the work so far.  "zfs" is not explicitly in the kld_list, but I do use ZFS and zfs_enable is set to "YES".

Also:

tunefs: POSIX.1e ACLs: (-a)                                disabled
tunefs: NFSv4 ACLs: (-N)                                   disabled
tunefs: MAC multilabel: (-l)                               disabled
tunefs: soft updates: (-n)                                 disabled
tunefs: soft update journaling: (-j)                       disabled
tunefs: gjournal: (-J)                                     disabled
tunefs: trim: (-t)                                         disabled
tunefs: maximum blocks per file in a cylinder group: (-e)  4096
tunefs: average file size: (-f)                            16384
tunefs: average number of files in a directory: (-s)       64
tunefs: minimum percentage of free space: (-m)             8%
tunefs: space to hold for metadata blocks: (-k)            6408
tunefs: optimization preference: (-o)                      time
tunefs: volume label: (-L)                                 

I never ran 13.0; I'm always leery of upgrading to x.0 from x-1.  (My upgrade was from 12.3-p6.)  Also, I still remember a collection of severe crashes from years back with soft updates plus journaling.  Are those problems known to be solved now?  (Sorry to be getting off the main topic.)
Comment 9 George Mitchell 2022-10-14 22:39:33 UTC
In this particular crash, I manually loaded amdgpu in single-user mode, and then immediately hit control-D.
Comment 10 Graham Perrin freebsd_committer freebsd_triage 2022-10-15 08:07:38 UTC
sysrc -f /etc/rc.conf kld_list

– is there amdgpu alone, or are other modules listed? 


(In reply to George Mitchell from comment #9)

Given the brief analysis by avg@ (comment #5), I'm inclined to: 

* view the load of amdgpu as successful

* give thought to other modules, ones that are (or should be) subsequently loaded. 

Do you use IRC, Matrix (e.g. Element) or Discord?
Comment 11 Graham Perrin freebsd_committer freebsd_triage 2022-10-15 08:10:54 UTC
(In reply to George Mitchell from comment #8)

> …crashes from years back with soft updates plus journaling.  
> Are those problems known to be solved now? …

For what's described: without a bug number, it might be impossible for me to tell. 


> … I never ran 13.0; …

13.1 fixed a bug that involved soft updates _without_ soft update journaling: <https://www.freebsd.org/releases/13.1R/relnotes/#storage-ufs>

<https://docs.freebsd.org/en/books/handbook/config/#soft-updates> recommends soft updates. If there's no explicit recommendation to also enable soft update journaling, this could be because (bug 261944) there's not yet, in the Handbook, a suitable explanation of the feature. 

tunefs(8) <https://www.freebsd.org/cgi/man.cgi?query=tunefs&sektion=8&manpath=FreeBSD> for FreeBSD 13.1-RELEASE lacks a recently added explanation, you can gain this by switching the online view of the manual page to FreeBSD 14.0-CURRENT.
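
For reference, if you do want to experiment, soft updates and soft update journaling can be enabled with tunefs on a file system that is not mounted read-write, e.g. from single user mode. A rough sketch only (the device name below is purely an illustration; substitute your actual root partition):

# in single user mode, with / still mounted read-only
tunefs -n enable /dev/ada0p2    # turn on soft updates
tunefs -j enable /dev/ada0p2    # add soft update journaling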
Comment 12 George Mitchell 2022-10-15 23:17:40 UTC
Other than amdgpu, kld_list is currently not defined at all.  Perhaps it's more helpful to show what gets loaded aside from amdgpu in the course of a normal boot:

kldstat
 1   64 0xffffffff80200000  1f300f0 kernel
 2    1 0xffffffff82132000     77e0 sem.ko
 3    3 0xffffffff8213a000    8cc90 vboxdrv.ko
 4    1 0xffffffff82600000   3df128 zfs.ko
 5    2 0xffffffff82518000     4240 vboxnetflt.ko
 6    2 0xffffffff8251d000     aac8 netgraph.ko
 7    1 0xffffffff82528000     31c8 ng_ether.ko
 8    1 0xffffffff8252c000     55e0 vboxnetadp.ko
 9    1 0xffffffff82532000     3378 acpi_wmi.ko
10    1 0xffffffff82536000     3218 intpm.ko
11    1 0xffffffff8253a000     2180 smbus.ko
12    1 0xffffffff8253d000     33c0 uslcom.ko
13    1 0xffffffff82541000     4d90 ucom.ko
14    1 0xffffffff82546000     2340 uhid.ko
15    1 0xffffffff82549000     3380 usbhid.ko
16    1 0xffffffff8254d000     31f8 hidbus.ko
17    1 0xffffffff82551000     3320 wmt.ko
18    1 0xffffffff82555000     4350 ums.ko
19    1 0xffffffff8255a000     5af8 autofs.ko
20    1 0xffffffff82560000     2a08 mac_ntpd.ko
21    1 0xffffffff82563000     20f0 green_saver.ko

The SU+J thing is totally anecdotal, based on what I used to see on freebsd-hackers.  Right now, I format my disks with UFS for root/var/tmp (no more than 8GB for fast fscking), and then a ZFS partition for /usr.

I don't use IRC, Matrix, or Element (not sure what those last two are) and on the rare occasions I use Discord, I use the web site.
Comment 13 George Mitchell 2022-11-06 18:15:41 UTC
As of today, with version drm-510-kmod-5.10.113_8:

1. I can reliably prevent a crash by booting to single user mode, manually kldloading amdgpu, and continuing (typing control-d).  dmesg then reports:

[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
[drm] initializing kernel modesetting (RAVEN 0x1002:0x15DD 0x1458:0xD000 0xC8).
drmn0: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[drm] register mmio base: 0xFE600000
[drm] register mmio size: 524288
[drm] add ip block number 0 <soc15_common>
[drm] add ip block number 1 <gmc_v9_0>
[drm] add ip block number 2 <vega10_ih>
[drm] add ip block number 3 <psp>
[drm] add ip block number 4 <gfx_v9_0>
[drm] add ip block number 5 <sdma_v4_0>
[drm] add ip block number 6 <powerplay>
[drm] add ip block number 7 <dm>
[drm] add ip block number 8 <vcn_v1_0>
drmn0: successfully loaded firmware image 'amdgpu/raven_gpu_info.bin'
[drm] BIOS signature incorrect 44 f
drmn0: Fetched VBIOS from ROM BAR
amdgpu: ATOM BIOS: 113-RAVEN-111
drmn0: successfully loaded firmware image 'amdgpu/raven_sdma.bin'
[drm] VCN decode is enabled in VM mode
[drm] VCN encode is enabled in VM mode
[drm] JPEG decode is enabled in VM mode
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
drmn0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
drmn0: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
drmn0: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[drm] Detected VRAM RAM=2048M, BAR=2048M
[drm] RAM width 128bits DDR4
[TTM] Zone  kernel: Available graphics memory: 3100774 KiB
[TTM] Zone   dma32: Available graphics memory: 2097152 KiB
[TTM] Initializing pool allocator
[drm] amdgpu: 2048M of VRAM memory ready
[drm] amdgpu: 3072M of GTT memory ready.
[drm] GART: num cpu pages 262144, num gpu pages 262144
[drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
drmn0: successfully loaded firmware image 'amdgpu/raven_asd.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_ta.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_pfp.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_me.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_ce.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_rlc.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_mec.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_mec2.bin'
amdgpu: hwmgr_sw_init smu backed is smu10_smu
drmn0: successfully loaded firmware image 'amdgpu/raven_vcn.bin'
[drm] Found VCN firmware Version ENC: 1.12 DEC: 2 VEP: 0 Revision: 1
drmn0: Will use PSP to load VCN firmware
[drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
drmn0: RAS: optional ras ta ucode is not available
drmn0: RAP: optional rap ta ucode is not available
[drm] kiq ring mec 2 pipe 1 q 0
[drm] DM_PPLIB: values for F clock
[drm] DM_PPLIB:  400000 in kHz, 3649 in mV
[drm] DM_PPLIB:  933000 in kHz, 4074 in mV
[drm] DM_PPLIB:  1200000 in kHz, 4399 in mV
[drm] DM_PPLIB:  1333000 in kHz, 4399 in mV
[drm] DM_PPLIB: values for DCF clock
[drm] DM_PPLIB:  300000 in kHz, 3649 in mV
[drm] DM_PPLIB:  600000 in kHz, 4074 in mV
[drm] DM_PPLIB:  626000 in kHz, 4250 in mV
[drm] DM_PPLIB:  654000 in kHz, 4399 in mV
[drm] Display Core initialized with v3.2.104!
[drm] VCN decode and encode initialized successfully(under SPG Mode).
drmn0: SE 1, SH per SE 1, CU per SH 11, active_cu_number 8
[drm] fb mappable at 0x60BCA000
[drm] vram apper at 0x60000000
[drm] size 8294400
[drm] fb depth is 24
[drm]    pitch is 7680
VT: Replacing driver "vga" with new "fb".
start FB_INFO:
type=11 height=1080 width=1920 depth=32
pbase=0x60bca000 vbase=0xfffff80060bca000
name=drmn0 flags=0x0 stride=7680 bpp=32
end FB_INFO
drmn0: ring gfx uses VM inv eng 0 on hub 0
drmn0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
drmn0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
drmn0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
drmn0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
drmn0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
drmn0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
drmn0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
drmn0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
drmn0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
drmn0: ring sdma0 uses VM inv eng 0 on hub 1
drmn0: ring sdma0 uses VM inv eng 0 on hub 1
drmn0: ring vcn_dec uses VM inv eng 1 on hub 1
drmn0: ring vcn_enc0 uses VM inv eng 4 on hub 1
drmn0: ring vcn_enc1 uses VM inv eng 5 on hub 1
drmn0: ring jpeg_dec uses VM inv eng 6 on hub 1
vgapci0: child drmn0 requested pci_get_powerstate
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
[drm] Initialized amdgpu 3.40.0 20150101 for drmn0 on minor 0
 
Is the sysctl_warn_reuse message anything to worry about?

2. Adding amdgpu to the kld_list in rc.conf still crashes more often than not, as previously reported.

3. Attempting to load amdgpu via /boot/loader.conf appears to load the module in memory but not actually make it functional.  (X uses VESA mode as if the module isn't there.)
Comment 14 George Mitchell 2022-11-11 23:56:38 UTC
Created attachment 238024 [details]
/var/crash/core.txt.3 crash description from today

Contrary to comment #13, today I got a crash despite booting to single user mode, typing "kldload amdgpu", and then control-d.  But it looks indistinguishable from the earlier /var/crash/core.txt.1 description.  Next I'll try booting to single user mode and kldloading zfs before kldloading amdgpu.
Comment 15 George Mitchell 2022-11-14 17:40:31 UTC
Created attachment 238075 [details]
Another crash

This time, I booted into single user mode and typed "kldload zfs amdgpu" with no problems.  Then when I typed ctrl-d I got this crash (which looks pretty much the same as all the other ones, except that the places in the backtrace that used to refer to zfs now refer to vboxnetflt, which I load for VirtualBox).  So it seems likely that the crash has nothing to do with whichever specific kernel loadable module happens to be cited in the backtrace.
Comment 16 George Mitchell 2022-11-18 22:34:21 UTC
The following comment is based on zero actual knowledge of how kernel loadable modules work.  Still, based on what I'm seeing with this bug, I hypothesize that after one module is loaded, there is a mechanism by which the next module (and maybe other later ones) call back to modules already loaded in order to prevent incompatible modules (whatever that might mean) from trying to coexist.  And somewhere in that path in the amdgpu module, it is detected that some lock that was taken while amdgpu was loading was erroneously not released.  (Most of the time, the lock IS released, and I don't know exactly under what circumstances it isn't.)

I hope this is helpful.
Comment 17 George Mitchell 2022-12-05 18:27:02 UTC
I've discovered how to avoid this crash (at least the last 20-30 times I have booted up): boot into single user mode, type <ENTER> to run /bin/sh, type "kldload amdgpu," and then (key step!) wait at least five seconds before typing ctrl-D to exit single user mode.  Since I don't know why this helps, I guess it falls into the voodoo category, but maybe it's a clue.
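
Roughly, the sequence that has been working for me (the pause length is arbitrary; five seconds is just what I've been using):

# choose single user mode at the loader menu, press ENTER for /bin/sh
kldload amdgpu
sleep 5        # the apparently magic pause
exit           # (or ctrl-D) continue to multi-user mode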
Comment 18 George Mitchell 2022-12-07 16:58:21 UTC
I hate to say again how little I know about kernel module loading, but by any chance is there multithreading in the code that gets called when amdgpu.ko is first loaded?  I can't help thinking that perhaps that code is returning prematurely, before some initialization is completely finished and all locks released.  If I knew where to put it, I would throw in a five-second delay at the end of whatever gets called to load amdgpu.ko.
Comment 19 George Mitchell 2022-12-09 17:35:52 UTC
Created attachment 238668 [details]
Another crash summary; looks like all the earlier ones

Contrary to my comment #17, I got this same crash this morning, even waiting five seconds after loading amdgpu.ko before proceeding.  So the delay doesn't prevent the crash.
Comment 20 George Mitchell 2022-12-11 17:38:02 UTC
I've figured out why this crash is timing related, and also why ZFS is involved.

My system has a 1 TB USB disk, which contains a ZFS file system.  When I power my system on, it takes a variable amount of time for that disk to become ready and for ZFS to take note of it.  (I'm booting from a SATA disk with a traditional old UFS file system.)  So if the USB disk becomes ready while amdgpu is still initializing, apparently this crash happens.  I have no clue why that is true, but I am pretty sure this explains why the crash happens only part of the time and is timing dependent.

It remains true that the most reliable way to cause the crash is to include amdgpu in the kld_list in /etc/rc.conf and simply boot normally (and to have a *ZFS-formatted USB* disk attached to the system).
Comment 21 George Mitchell 2022-12-12 14:34:05 UTC
I've updated to version drm-510-kmod-5.10.113_8 and it hasn't crashed yet, but I've only had time for one test so far.
Comment 22 Graham Perrin freebsd_committer freebsd_triage 2022-12-13 21:06:07 UTC
(In reply to George Mitchell from comment #21)

If a crash _does_ occur/recur, then maybe test for reproducibility with this in your /boot/loader.conf

kern.smp.disabled=1

<https://www.freebsd.org/cgi/man.cgi?query=smp&sektion=4&manpath=FreeBSD>

(Be prepared for significantly reduced performance after restarting with SMP disabled.)

This is a gut feeling, more than anything (apologies for the noise), partly based on experiences with virtual hardware …
Comment 23 George Mitchell 2022-12-13 22:45:23 UTC
Thanks!  So far, I've booted four times with amdgpu in my kld_list, with no crash; previously that would likely have yielded at least one.  So I have my fingers crossed, but I'll try your hack if it crashes again (and your theory certainly sounds plausible).
Comment 24 George Mitchell 2022-12-14 22:37:31 UTC
Created attachment 238802 [details]
New core.txt

The latest version definitely crashes less often, but I just now got a new crash that (to me) looks different than the earlier one.  I was just about ready to mark this fixed!
Comment 25 George Mitchell 2022-12-15 18:11:35 UTC
After further consideration (and a partly sleepless night), I've decided that the latest crash is not an instance of this bug and possibly isn't related to amdgpu.ko at all.  So I'm going to close this bug and maybe open a new one when I understand the new one better.

Anyone looking at this bug in the future should pay no attention to the "New core.txt" attachment, but should refer to the obsolete attachments.
Comment 26 George Mitchell 2022-12-16 22:40:09 UTC
Created attachment 238849 [details]
A new crash

I regret to say I'm going to have to reopen this bug.  But I will try the proposed workaround and see if it helps (at least until the single-core performance drives me up the wall).
Comment 27 George Mitchell 2022-12-16 22:42:46 UTC
Created attachment 238850 [details]
A new instance of the same crash

I regret to say the crash has happened again.  I'll try the proposed workaround and see if it helps (at least until the reduced performance drives me up the wall).
Comment 28 George Mitchell 2022-12-16 22:47:49 UTC
Reopening bug.
Comment 29 Graham Perrin freebsd_committer freebsd_triage 2022-12-17 10:02:15 UTC
(In reply to George Mitchell from comment #26)

> FreeBSD court 13.1-RELEASE-p2 FreeBSD 13.1-RELEASE-p2 752f813d6 M5P  amd64

Please update the OS. 


----

Given comment #5 from avg@, and the different types of kernel panic (for example, comment #24):

fs@ x11@ please: if panics recur with an updated OS, would you recommend continuing with this report (267028)? Or start afresh, with a new report for the more recent type of panic?
Comment 30 Emmanuel Vadot freebsd_committer freebsd_triage 2022-12-17 10:21:09 UTC
This is not drm related; the drm messages are noise that we should fix one day when we switch ttys.
Comment 31 George Mitchell 2022-12-17 15:12:17 UTC
(In reply to Graham Perrin from comment #29)
I'm on the release branch, not the stable branch.  So you are suggesting I update from 13.1-RELEASE-p2 to 13.1-RELEASE-p5?  And then recompile the kernel module as well, I assume?
Comment 32 George Mitchell 2022-12-17 17:45:23 UTC
For what it's worth, I'm doing this testing on a desktop machine, so setting kern.smp.disabled=1 actually doesn't impact operation too much -- except for Thunderbird.  And so far I haven't seen the crash with that setting.
Comment 33 Tomasz "CeDeROM" CEDRO 2022-12-17 17:51:30 UTC
Does switching to graphics/drm-510-kmod and updating graphics/gpu-firmware-amd-kmod help?
Comment 34 George Mitchell 2022-12-17 18:03:21 UTC
In fact, switching to graphics/drm-510-kmod from the generic VESA driver is what originally triggered this bug.  Without using amdgpu.ko there is no problem.
Comment 35 Emmanuel Vadot freebsd_committer freebsd_triage 2022-12-17 18:33:59 UTC
All your reports show that it's from zfs; again, the drm messages are noise.
Comment 36 George Mitchell 2022-12-18 02:16:34 UTC
Created attachment 238886 [details]
Crash after updating kernel/world to 13.1-RELEASE-p5

This is after updating my kernel and world to 13.1-RELEASE-p5.  I grant you the backtrace here sure points to the openzfs code, but why does the crash happen only with graphics/drm-510-kmod installed and amdgpu.ko loaded, but not otherwise?  For the time being, I will be running WITHOUT amdgpu.ko in my kld_list, and I am confident this crash will not occur.

I got maybe six boots with kern.smp.disabled=1 with no crashes on 13.1-RELEASE-p2.  But based on an earlier comment I updated to 13.1-RELEASE-p5.  Then after going back to kern.smp.disabled=0 I got another instance of the crash.

I did observe that something in sys/contrib/openzfs/module/zfs got updated between p2 and p5, but it doesn't seem to have fixed this crash.

Compiling graphics/drm-510-kmod under p5 yielded an amdgpu.ko that was identical to amdgpu.ko compiled under p2.
Comment 37 George Mitchell 2022-12-29 00:12:03 UTC
I'm still having this problem, though I can reduce its frequency by booting in single-user mode, kldloading amdgpu, waiting five or ten seconds, and then going to multi-user mode with control-D.  I've updated the title to emphasize that the bug happens only when amdgpu.ko (from graphics/drm-510-kmod version 5.10.113_8) and ZFS are both in use.  Also, it happens during booting, or else never.
Comment 38 George Mitchell 2022-12-29 00:22:31 UTC
I don't want the title to become too wordy, but also I'll note again that my 1TB USB disk (GPT formatted with one ZFS partition only) that takes a measurable, variable amount of time to become ready may be the main reason this crash doesn't always happen.
Comment 39 Graham Perrin freebsd_committer freebsd_triage 2023-01-06 19:19:13 UTC
(In reply to George Mitchell from comment #37)

grep -e solaris -e zfs /boot/loader.conf

grep zfs /etc/rc.conf

What's reported?
Comment 40 George Mitchell 2023-01-06 19:53:18 UTC
(In reply to Graham Perrin from comment #39)

> grep -e solaris -e zfs /boot/loader.conf

> grep zfs /etc/rc.conf
zfs_enable="YES"		# Set to YES to automatically mount ZFS file systems
Comment 41 Graham Perrin freebsd_committer freebsd_triage 2023-01-07 00:33:26 UTC
(In reply to George Mitchell from comment #40)

(In reply to George Mitchell from comment #20)

> … timing related, …

Please add to /boot/loader.conf


zfs_load="YES"
Comment 42 George Mitchell 2023-01-07 18:08:53 UTC
Created attachment 239336 [details]
Crash dump

Well, this helps a bit.  By adding that line to /boot/loader.conf and restoring kld_list="amdgpu" to my /etc/rc.conf, I was able to reboot without the crash four times in a row, whereas before it would crash about every other time.  But it crashed on the fifth time. (See attached core.txt.0.)
Comment 43 George Mitchell 2023-01-07 18:13:56 UTC
In the new core.txt.0, there are about 19 lines of text from the previous shutdown near the beginning of the file.  But the substance of the backtrace looks identical to all the previous ones.  So loading ZFS early mitigates the problem but does not fix it.
Comment 44 Andriy Gapon freebsd_committer freebsd_triage 2023-01-07 23:36:50 UTC
I think that in these frames we clearly see a bogus pointer / address:
#7  <signal handler called>
#8  vtozoneslab (va=18446735277616529408, zone=<optimized out>, 
    slab=<optimized out>) at /usr/src/sys/vm/uma_int.h:635
#9  free (addr=0xfffff80000000007, mtp=0xffffffff824332b0 <M_SOLARIS>)
    at /usr/src/sys/kern/kern_malloc.c:911
#10 0xffffffff8214d251 in nv_mem_free (nvp=<optimized out>, 
    buf=0xfffff80000000007, size=16688648)
    at /usr/src/sys/contrib/openzfs/module/nvpair/nvpair.c:216

I'd recommend poking around frames 11-13 to see from where that address comes.
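
For example, with the matching kernel and the vmcore loaded in kgdb (paths and dump number here are only placeholders):

kgdb /boot/kernel/kernel /var/crash/vmcore.0
(kgdb) frame 11
(kgdb) info locals
(kgdb) frame 12
(kgdb) info locals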

Also, I don't get an impression that the latest crash is similar to earlier ones.
kern_reboot / zfs__fini vs dbuf_evict_thread.
Comment 45 George Mitchell 2023-01-15 00:28:56 UTC
It appears I could mitigate this problem if I could load amdgpu.ko from /boot/loader.conf, which currently doesn't work.  See bug #268962.  Alternatively, at present I can completely avoid this crash by:

1. having zfs_load="YES" in /boot/loader.conf.
2. booting into single user mode.
3. typing kldload amdgpu.
4. typing control-D.
Comment 46 George Mitchell 2023-01-20 17:43:31 UTC
(In reply to George Mitchell from comment #45)
Correction to comment #45: I can avoid the problem around 95% of the time with the specified steps, but not 100%.
Comment 47 George Mitchell 2023-01-28 00:13:57 UTC
Created attachment 239752 [details]
Latest crash dump

The last couple of crashes strongly resemble all the earlier ones, but they are now less frequent with zfs.ko being loaded at /boot/loader.conf time and amdgpu.ko loaded while booted into single user mode.  The difference (see core.txt.2 from today's date) is that the backtrace line where modlist_lookup2 is called is now looking up vboxnetflt instead of zfs.  My rcorder list shows:

/etc/rc.d/dumpon
/etc/rc.d/sysctl
/etc/rc.d/natd
/etc/rc.d/dhclient
/etc/rc.d/hostid
/etc/rc.d/ddb
/etc/rc.d/ccd
/etc/rc.d/gbde
/etc/rc.d/geli
/etc/rc.d/zpool
/etc/rc.d/swap
/etc/rc.d/zfskeys
/etc/rc.d/fsck
/etc/rc.d/zvol
/etc/rc.d/growfs
/etc/rc.d/root
/etc/rc.d/sppp
/etc/rc.d/mdconfig
/etc/rc.d/hostid_save
/etc/rc.d/serial
/etc/rc.d/mountcritlocal
/etc/rc.d/zfsbe
/etc/rc.d/tmp
/etc/rc.d/zfs
/etc/rc.d/var
/etc/rc.d/cfumass
/etc/rc.d/cleanvar
/etc/rc.d/FILESYSTEMS
/etc/rc.d/geli2
/etc/rc.d/ldconfig
/etc/rc.d/kldxref
/etc/rc.d/adjkerntz
/etc/rc.d/hostname
/etc/rc.d/ip6addrctl
/etc/rc.d/ippool
/etc/rc.d/netoptions
/etc/rc.d/opensm
/etc/rc.d/random
/etc/rc.d/iovctl
/etc/rc.d/rctl
/usr/local/etc/rc.d/vboxnet
/etc/rc.d/ugidfw
/etc/rc.d/autounmountd
/etc/rc.d/mixer
/etc/rc.d/ipsec
/usr/local/etc/rc.d/uuidd
/etc/rc.d/kld
/etc/rc.d/ipfilter
/etc/rc.d/devmatch
/etc/rc.d/addswap
/etc/rc.d/ipnat
/etc/rc.d/ipmon
/etc/rc.d/ipfs
/etc/rc.d/netif
/etc/rc.d/ppp
/etc/rc.d/pfsync
/etc/rc.d/pflog
/etc/rc.d/rtsold
/etc/rc.d/static_ndp
/etc/rc.d/static_arp
/etc/rc.d/devd
/etc/rc.d/resolv
/etc/rc.d/stf
/etc/rc.d/ipfw
/etc/rc.d/routing
/etc/rc.d/bridge
/etc/rc.d/zfsd
/etc/rc.d/defaultroute
/etc/rc.d/routed
/etc/rc.d/pf
/etc/rc.d/route6d
/etc/rc.d/ipfw_netflow
/etc/rc.d/blacklistd
/etc/rc.d/netwait
/etc/rc.d/local_unbound
/etc/rc.d/NETWORKING
/etc/rc.d/kdc
/etc/rc.d/tlsservd
/etc/rc.d/iscsid
/etc/rc.d/pppoed
/etc/rc.d/ctld
/etc/rc.d/nfsuserd
/etc/rc.d/tlsclntd
/etc/rc.d/kfd
/usr/local/etc/rc.d/sndiod
/etc/rc.d/gssd
/etc/rc.d/nfscbd
/etc/rc.d/ipropd_master
/etc/rc.d/ipropd_slave
/etc/rc.d/kadmind
/etc/rc.d/kpasswdd
/etc/rc.d/iscsictl
/etc/rc.d/mountcritremote
/etc/rc.d/archdep
/etc/rc.d/dmesg
/etc/rc.d/wpa_supplicant
/etc/rc.d/hostapd
/etc/rc.d/accounting
/etc/rc.d/mdconfig2
/etc/rc.d/devfs
/etc/rc.d/gptboot
/etc/rc.d/virecover
/etc/rc.d/os-release
/etc/rc.d/motd
/etc/rc.d/cleartmp
/etc/rc.d/newsyslog
/etc/rc.d/syslogd
/etc/rc.d/linux
/etc/rc.d/sysvipc
/etc/rc.d/hastd
/etc/rc.d/localpkg
/etc/rc.d/auditd
/etc/rc.d/bsnmpd
/etc/rc.d/ntpdate
/etc/rc.d/watchdogd
/etc/rc.d/savecore
/etc/rc.d/pwcheck
/etc/rc.d/power_profile
/etc/rc.d/auditdistd
/etc/rc.d/SERVERS
/etc/rc.d/rpcbind
/etc/rc.d/nisdomain
/etc/rc.d/nfsclient
/etc/rc.d/ypserv
/etc/rc.d/ypupdated
/etc/rc.d/ypxfrd
/etc/rc.d/ypbind
/etc/rc.d/ypldap
/etc/rc.d/ypset
/etc/rc.d/keyserv
/etc/rc.d/automountd
/etc/rc.d/yppasswdd
/etc/rc.d/quota
/etc/rc.d/automount
/etc/rc.d/mountd
/etc/rc.d/nfsd
/etc/rc.d/statd
/etc/rc.d/lockd
/etc/rc.d/DAEMON
/etc/rc.d/rwho
/etc/rc.d/utx
/etc/rc.d/bootparams
/etc/rc.d/hcsecd
/etc/rc.d/ftp-proxy
/etc/rc.d/local
/usr/local/etc/rc.d/git_daemon
/etc/rc.d/lpd
/usr/local/etc/rc.d/dbus
/etc/rc.d/mountlate
/etc/rc.d/nscd
/etc/rc.d/ntpd
/etc/rc.d/powerd
/usr/local/etc/rc.d/slurmd
/usr/local/etc/rc.d/slurmctld
/etc/rc.d/ubthidhci
/etc/rc.d/rarpd
/etc/rc.d/sdpd
/etc/rc.d/apm
/etc/rc.d/rtadvd
/etc/rc.d/moused
/etc/rc.d/rfcomm_pppd_server
/usr/local/etc/rc.d/avahi-daemon
/etc/rc.d/swaplate
/etc/rc.d/bthidd
/etc/rc.d/bluetooth
/usr/local/etc/rc.d/avahi-dnsconfd
/etc/rc.d/LOGIN
/etc/rc.d/sshd
/usr/local/etc/rc.d/vboxheadless
/etc/rc.d/syscons
/etc/rc.d/sysctl_lastload
/usr/local/etc/rc.d/xdm
/usr/local/etc/rc.d/vboxwatchdog
/etc/rc.d/inetd
/usr/local/etc/rc.d/dnetc
/usr/local/etc/rc.d/munged
/etc/rc.d/sendmail
/etc/rc.d/ftpd
/usr/local/etc/rc.d/rsyncd
/usr/local/etc/rc.d/saned
/etc/rc.d/cron
/etc/rc.d/msgs
/etc/rc.d/othermta
/etc/rc.d/jail
/etc/rc.d/bgfsck
/usr/local/etc/rc.d/smartd
/etc/rc.d/securelevel

The vboxnetflt.ko module is loaded by /usr/local/etc/rc.d/vboxnet.
Comment 48 George Mitchell 2023-01-28 00:29:36 UTC
And the list of kernel modules loaded by a non-crashing boot is:

kernel
sem.ko
zfs.ko
if_re.ko
vboxdrv.ko
amdgpu.ko
drm.ko
linuxkpi_gplv2.ko
dmabuf.ko
ttm.ko
amdgpu_raven_sdma_bin.ko
amdgpu_raven_asd_bin.ko
amdgpu_raven_ta_bin.ko
amdgpu_raven_pfp_bin.ko
amdgpu_raven_me_bin.ko
amdgpu_raven_ce_bin.ko
amdgpu_raven_rlc_bin.ko
amdgpu_raven_mec_bin.ko
amdgpu_raven_mec2_bin.ko
amdgpu_raven_vcn_bin.ko
vboxnetflt.ko
(and a whole bunch more)

In other words, when the crash happens, it always involves a call to modlist_lookup2 from whatever kernel module gets loaded following amdgpu.
Comment 49 George Mitchell 2023-02-07 15:05:42 UTC
*** Bug 268416 has been marked as a duplicate of this bug. ***
Comment 50 George Mitchell 2023-02-07 15:10:25 UTC
Created attachment 239967 [details]
Crash after loading vboxnetflt early by hand

Since the previous crash included a reference to vboxnetflt.ko, I experimented a few times with amdgpu.ko added to my kld_list in /etc/rc.conf, and loading vboxnetflt by hand after booting to single user mode.

I think it's pretty clear at this point that there is no problem in ZFS code.  It's a lock mismanagement problem of some sort in amdgpu.ko (from graphics/drm-510-kmod).  If I have permission to change the assignee of this bug, I will.
Comment 51 George Mitchell 2023-02-07 15:12:55 UTC
I think this needs to be assigned to x11@freebsd.org, but I don't seem to have the permission to do it.
Comment 52 George Mitchell 2023-02-07 15:15:48 UTC
I think this needs to be assigned to x11@freebsd.org, but I don't seem to have the permission to do it.
Comment 53 George Mitchell 2023-02-07 15:23:23 UTC
(In reply to Graham Perrin from comment #10)

It does appear that amdgpu.ko always loads successfully.  But then the loading of some other module subsequently (which might be zfs.ko or vboxnetflt.ko or maybe something else) somehow causes an unexpected call back into the amdgpu code.  I have no idea how.

The current situation:
1. zfs.ko is loaded from /boot/loader.conf.
2. I always boot into single user mode.
3. The last few times, I had kld_list="amdgpu.ko" in my /etc/rc.conf, but for now I'm taking it back out.
4. So I'm loading amdgpu.ko manually in single user mode and then waiting ten seconds or so before going multiuser.  It's voodoo but it usually avoids the crash.
Comment 54 Emmanuel Vadot freebsd_committer freebsd_triage 2023-02-07 15:23:42 UTC
(In reply to George Mitchell from comment #52)

No it's not; I've told you already that what's printed by drm is not the panic, it's noise when we switch ttys during a panic.
All your crash logs talk about zfs dbufs; this isn't amdgpu.
Comment 55 George Mitchell 2023-02-07 15:27:40 UTC
(In reply to Emmanuel Vadot from comment #54)

If I boot up without loading amdgpu.ko at all, then I NEVER get the crash.  Confirmed many many times.
Comment 56 Andriy Gapon freebsd_committer freebsd_triage 2023-02-07 15:31:15 UTC
(In reply to Emmanuel Vadot from comment #54)
I think that George's point was not about anything that gets printed, but what happens depending on whether amdgpu gets loaded (and when) or not.

It's not unimaginable that an exotic bug in one module (or in the module loading code or the code for resolving symbols) results in a memory corruption and a crash elsewhere.

A very wild guess, but I'd check if there are any duplicate symbols between amdgpu and zfs.ko... and even kernel itself.
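
Something along these lines should list the globally defined symbols that appear in more than one of them (the module paths are a guess; adjust to wherever the modules actually live):

nm -g --defined-only /boot/modules/amdgpu.ko | awk '{print $3}' | sort -u > /tmp/amdgpu.syms
nm -g --defined-only /boot/kernel/zfs.ko | awk '{print $3}' | sort -u > /tmp/zfs.syms
nm -g --defined-only /boot/kernel/kernel | awk '{print $3}' | sort -u > /tmp/kernel.syms
comm -12 /tmp/amdgpu.syms /tmp/zfs.syms       # defined in both modules
comm -12 /tmp/amdgpu.syms /tmp/kernel.syms    # also defined in the kernel itself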
Comment 57 Emmanuel Vadot freebsd_committer freebsd_triage 2023-02-07 15:33:20 UTC
(In reply to Andriy Gapon from comment #56)

But then anyone else using zfs+amdgpu would have the same problem, and that's not the case (I use both on multiple machines running either 13.1, stable/13 or CURRENT).
Comment 58 George Mitchell 2023-02-07 17:41:17 UTC
If it is ZFS, then the only exotic factor on my system is an external USB one-terabyte drive (WDC WD10EZEX-08WN4A0), formatted with GPT and one ZFS partition, that seems to take a variable amount of time to come on line at power up.  I theorized at one point that tasting that drive at an unpredictable time was a factor in the crash.  Your mileage may vary.
Comment 59 Mark Millard 2023-02-07 17:46:27 UTC
(In reply to Emmanuel Vadot from comment #54)

QUOTE
All your crash logs talk about zfs dbufs
END QUOTE

Not true: "Crash dump" and "Latest crash dump" have no
examples of "dbuf" in the submitted text.

Also: The backtrace in "Latest crash dump" makes no mention
of "zfs" at all. (It does occur in other text.)
Comment 60 Mark Millard 2023-02-07 17:55:01 UTC
(In reply to George Mitchell from comment #58)

Could a test be formed on your hardware, loading ZFS
but having no actual import of any pool, possibly
not even a pool to find (empty "zpool import")?

As it stands, your context is hard for anyone else to
reproduce as an analogous context for testing. Finding a
failure in a simpler-to-replicate context could
help with avoiding your having the only known
failure context.

So any other variations that are simpler contexts for
others to replicate and test would be a good thing.

But, also, if such effort ends up unable to replicate
the problem in your environment, that might be useful
information as well.
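
A rough sketch of such a test, assuming the USB disk is detached
and zfs_enable is left out of rc.conf so that nothing is imported
automatically:

kldload zfs
zpool list        # expect "no pools available"
kldload amdgpu    # then continue to multi-user and see whether the panic follows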
Comment 61 George Mitchell 2023-02-07 19:06:42 UTC
(In reply to Mark Millard from comment #60)
In addition to my external USB ZFS drive, also my /usr file system is a ZFS slice.  My main hard drive has a very small UFS root (and /var and /tmp) slice, because I have a superstitious fear of ZFS on root.  The /usr slice (the rest of the drive) is big enough to take an annoying amount of time to fsck, so when I first added this drive to my system (which was also when I updated from 12 to 13), I chose ZFS for /usr to minimize that time.  For a while, I suppose I could copy my /usr slice onto the /usr slice from my old internal drive and mount that in place of the current /usr slice for some tests, and I could do without the external drive.  I'll have to think about this.
Comment 62 Mark Millard 2023-02-07 19:25:38 UTC
(In reply to George Mitchell from comment #61)

If you can boot an external USB3 drive or some such,
maybe a minimal separate drive: UFS 13.1-RELEASE with
enough added to also have amdgpu.ko . With such a
context, do you still manage to see boot failures?

Progressing from the simplest independent context
towards an independent one more like your normal 
context might be easier --and might avoid needing to
change your normal context as much.

Just a test context, not a normal use one. Fewer
constraints on the configuration that way.

Food for thought.
Comment 63 Tomasz "CeDeROM" CEDRO 2023-02-07 21:27:38 UTC
I had the same problem on 13.1-STABLE: loading the vbox module caused an immediate kernel panic, so I rolled back to 13.1-STABLE because of this.

When the kernel was loaded from the bare loader, loading the vbox drivers by hand was okay. When the vbox drivers were loaded from /boot/loader.conf or /etc/rc.conf, it caused an immediate kernel panic (no dump). virtualbox-ose-kmod was recompiled from ports on a newly installed kernel and system. Not sure if this is amdgpu or zfs related though..?
Comment 64 Tomasz "CeDeROM" CEDRO 2023-02-07 21:28:49 UTC
*rolled back to 13.1-RELEASE sorry :-) All works fine here. Might be vbox + amdgpu api desync?
Comment 65 Mark Millard 2023-02-08 03:53:41 UTC
(In reply to George Mitchell from comment #36)

Have you ever gotten a crash with kern.smp.disabled=1 ?
If not, how many tests did you try?
Comment 66 Mark Millard 2023-02-08 04:01:22 UTC
(In reply to George Mitchell from comment #53)

A test might be to load something simple or unusual for
your context after amdgpu.ko and see if it still crashes.
I'm not sure it is a good example, but does, say, loading
amdgpu.ko and then filemon.ko also lead to a crash (not
loading more after that)?
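
Something like this from single user mode, for instance (filemon.ko
is only an example of a small, self-contained module from the base
system):

kldload amdgpu
sleep 5
kldload filemon
kldstat        # confirm both are loaded, then wait and see whether the panic follows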
Comment 67 Andriy Gapon freebsd_committer freebsd_triage 2023-02-08 05:52:00 UTC
(In reply to Emmanuel Vadot from comment #57)
Only "good", easy bugs are like that. That's why I said that this one must be exotic. But there must be something specific about George's environment too. Maybe configuration, maybe build, maybe specific hardware, maybe even a hardware glitch.
E.g., maybe if the graphics is active the RAM is more likely to randomly flip a bit.
Comment 68 George Mitchell 2023-02-08 14:58:54 UTC
(In reply to Mark Millard from comment #65)
Yes, I got the crash.  See comment #26.
Comment 69 George Mitchell 2023-02-08 15:00:46 UTC
I have a spare disk I can use for a test without ZFS.  It's currently at 12.0-RELEASE so it will take me a while to update it to 13.  Possibly I won't have a chance today, but I will try it.
Comment 70 Mark Millard 2023-02-08 17:30:16 UTC
(In reply to George Mitchell from comment #68)

#26 and #27 indicate that you would try the workaround kern.smp.disabled=1 ,
not the result of trying it:

#26:
I will try the proposed workaround and see if it helps (at least until the single-core performance drives me up the wall)

#27:
I'll try the proposed workaround and see if it helps (at least until the reduced performance drives me up the wall).

That is part of why I asked.

Did the failure result with kern.smp.disabled=1 seem the same/similar to the
other failures --or was it distinct in some way?
Comment 71 Mark Millard 2023-02-08 17:49:28 UTC
(In reply to George Mitchell from comment #24)

It looks to me like the backtrace in "Latest crash dump":

KDB: stack backtrace:
#0 0xffffffff80c66ec5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbcf at vpanic+0x17f
#2 0xffffffff80c1ba43 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff81084fb8 at calltrap+0x8
#5 0xffffffff80be8c3d at linker_load_module+0x17d
#6 0xffffffff80beb17a at kern_kldload+0x16a
#7 0xffffffff80beb29b at sys_kldload+0x5b
#8 0xffffffff810ae6ec at amd64_syscall+0x10c
#9 0xffffffff810858cb at fast_syscall_common+0xf8

basically matches the 4 attachments that have been set to be Obsolete.

Should the Obsolete status be undone on the 4? Vs.: Should "Latest
crash dump" be made to also be Obsolete?

I'm guessing that none of the attachments should be obsolete at this
point.
Comment 72 Mark Millard 2023-02-08 17:55:11 UTC
(In reply to Mark Millard from comment #70)

There was also #36 with:

QUOTE
I got maybe six boots with kern.smp.disabled=1 with no crashes on 13.1-RELEASE-p2.  But based on an earlier comment I updated to 13.1-RELEASE-p5.  Then after going back to kern.smp.disabled=0 I got another instance of the crash.
END QUOTE

It only reported not getting a crash for kern.smp.disabled=1 .
Comment 73 George Mitchell 2023-02-08 17:58:00 UTC
(In reply to Mark Millard from comment #70)
I should have referred you to comment #27, not #26.  But I definitely got the crash with smp.disabled=1.
(In reply to Mark Millard from comment #71)
I could make a case for obsoleting all but two of them, but possibly I would be throwing away useful information.  To my unpracticed eye, though, the ones I DID obsolete were pretty redundant with the ones I kept.  They all look pretty similar to me.
Comment 74 Mark Millard 2023-02-08 18:12:44 UTC
(In reply to George Mitchell from comment #73)

But they are all the examples where the backtraces have nothing
from zfs or dbuf. Having 5 of 11 reports that way looks rather
different from 1 out of 7.

I'd say that the frequency is notable.
Comment 75 Mark Millard 2023-02-08 18:18:45 UTC
(In reply to George Mitchell from comment #73)

#27:
QUOTE
I'll try the proposed workaround and see if it helps (at least until the reduced performance drives me up the wall).
END QUOTE

It still says that you will try in the future, not explicitly that you
had a failure with kern.smp.disabled=1 .

#36 reports not having failures with kern.smp.disabled=1 .

I did not find any wording I could interpret as reporting a failure with
kern.smp.disabled=1 (prior to #73).

Do you remember noticing anything distinct? (Probably not, or you would
have commented in #73. But just to be sure . . .)
Comment 76 George Mitchell 2023-02-08 18:29:47 UTC
(In reply to Mark Millard from comment #75)
It's close to two months ago, so my memory may be misleading me, since my age is beginning to resemble the number of this comment.  But I'm pretty sure smp.disabled=1 did not prevent the bug.  I could be wrong.
Comment 77 George Mitchell 2023-02-14 14:55:36 UTC
I have been remiss in testing this without ZFS, because I will have to shuffle a couple of disks around.  I apologize for the delay.  I hope to be able to try this test later this week.
Comment 78 George Mitchell 2023-02-24 22:45:08 UTC
Although I have not yet managed to test this without ZFS, I have established that with zfs_load="YES" but without "vboxnet_enable="YES"" in /etc/rc.conf (zfs.ko and vboxnetflt.ko seeming to be the two modules with which amdgpu.ko has, um, personality conflicts), I can now boot up without crashing (so far).  Does anyone have any idea what zfs.ko and vboxnetflt.ko do that other modules don't do?
Comment 79 George Mitchell 2023-02-24 22:46:20 UTC
I omitted an important phrase.  It should have said, "with zfs_load="YES" in /boot/loader.conf ..."
Comment 80 George Mitchell 2023-02-26 17:35:29 UTC
Created attachment 240427 [details]
New version of the crash, from acpi_wmi

Here's another module that doesn't get along well with amdgpu.ko on my system: acpi_wmi.ko.  Other than that this crash looks identical to all the earlier ones, as far as I can tell.

It took about a dozen boot-up tries after I put zfs_load="YES" into /boot/loader.conf (so that ZFS gets loaded early to minimize its interaction with amdgpu.ko) and vboxnet_enable="NO" in /etc/rc.conf (so that vboxnetflt.ko doesn't get its chance to cause trouble either) before I got this new crash.

I'll mention again that this crash always happens within a minute of booting up, or else never.  Anyone have any ideas about what acpi_wmi.ko has in common with zfs.ko and vboxnetflt.ko?
Comment 81 Mark Millard 2023-02-26 20:15:09 UTC
(In reply to George Mitchell from comment #80)

There are multiple, distinct backtraces in your various examples.
This one matches the 4 still-listed-as Obsolete ones and the
"Latest crash dump" one, but not the others (if I remember right).

So it is another example where there is no mention of
dbuf or of zfs in the backtrace's text, unlike some other
backtraces.

So far as I can tell, there has still been no evidence gathered
on whether the problem can happen absent zfs being loaded, or with
zfs loaded but no pools ever imported.

If I gather correctly, we now do have evidence that the specific
type of backtrace can happen without vboxnetflt.ko ever having
been loaded, proving it is not necessary for that kind of failure.
That is a form of progress as far as evidence goes. It also
suggests that merely being listed in a backtrace does not mean
that fact necessarily tells one much about the basic problem.

There is some possibility here that there is more than one basic
problem and some of the backtrace variability is associated with
that.
Comment 82 Mark Millard 2023-02-26 20:56:08 UTC
Using the gdb-based backtrace information:

#8  0xffffffff80be8c5d in modlist_lookup (name=0xfffff80006217400 "acpi_wmi", 
    ver=0) at /usr/src/sys/kern/kern_linker.c:1487

is for the strcmp code line in:

static modlist_t
modlist_lookup(const char *name, int ver)
{
        modlist_t mod;

        TAILQ_FOREACH(mod, &found_modules, link) {
                if (strcmp(mod->name, name) == 0 &&
                    (ver == 0 || mod->version == ver))
                        return (mod);
        }
        return (NULL);
}

We also see that strcmp was called via:

#6  <signal handler called>
#7  strcmp (s1=<optimized out>, s2=<optimized out>)
    at /usr/src/sys/libkern/strcmp.c:46

We also see name was accessible, as shown in the "#8" line above.
We see from #7 that strcmp was actually entered, suggesting that
fetching mod->name itself did not fault. The implication would be
that the value of mod->name was a bad pointer when strcmp tried
to use it.

Nothing says that mod->name was or should have been for acpi_wmi
at all. The "acpi_wmi" side of the comparison need not be
relevant information. Other backtraces that look similar may
well have a similar status for the name in the right-hand argument
to the strcmp.

This might be a useful hint to someone with appropriate background
or suggest some way of detecting the bad value in mod->name earlier
when that earlier context might be of more use for investigations.
Comment 83 George Mitchell 2023-02-26 23:21:09 UTC
I have set up a disk with FreeBSD 13.1-RELEASE-p7 and drm-510-kmod 5.10.113_8 WITHOUT ZFS and vbox-anything.  I don't know how to avoid loading acpi_wmi.ko.  So far it hasn't crashed, but I will try a whole bunch of reboots tomorrow with that disk.
Comment 84 Mark Millard 2023-02-26 23:59:06 UTC
(In reply to George Mitchell from comment #83)

I found the following text on https://cateee.net/lkddb/web-lkddb/ACPI_WMI.html :

QUOTE
ACPI-WMI is a proprietary extension to ACPI to expose parts of the ACPI firmware to userspace - this is done through various vendor defined methods and data blocks in a PNP0C14 device, which are then made available for userspace to call.

The implementation of this in Linux currently only exposes this to other kernel space drivers.

This driver is a required dependency to build the firmware specific drivers needed on many machines, including Acer and HP laptops.
END QUOTE

So, I expect that if acpi_wmi.ko is being loaded by FreeBSD, it may
well be a requirement for that machine to boot and/or operate via
ACPI. But I'm not familiar with the details.
Comment 85 George Mitchell 2023-02-27 17:57:55 UTC
I have a new crash, but I did not get a dump because of an issue I will explain below.

For those who came in late, here's a summary of my system.  dmesg says I have:
CPU: AMD Ryzen 3 2200G with Radeon Vega Graphics     (3493.71-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x810f10  Family=0x17  Model=0x11  Stepping=0
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0x35c233ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX>
  Structured Extended Features=0x209c01a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  AMD Extended Feature Extensions ID EBX=0x1007<CLZERO,IRPerf,XSaveErPtr,IBPB>
  SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
  TSC: P-state invariant, performance statistics

My motherboard is a Gigabyte B450M D53H.
BIOS is American Megatrends version F4, dated 1/25/2019.

pciconf -lv says:
vgapci0@pci0:6:0:0:     class=0x030000 rev=0xc8 hdr=0x00 vendor=0x1002 device=0x15dd subvendor=0x1458 subdevice=0xd000
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]'
    class      = display
    subclass   = VGA

Until recently, when I was running FBSD 12-RELEASE, my box had one hard drive.  I added a new drive when I upgraded to FBSD 13-RELEASE so I would still have FBSD 12 as an emergency backup.  Part of the upgrade is that on the new disk I created a small UFS slice for /, /var, and /tmp, and most of the rest of the disk is a ZFS slice for /usr (so I wouldn't have to wait for fsck on reboot after crashes).  That means that it isn't practical to do a test without ZFS on that new disk (I'll call it my regular disk now).  So I installed FBSD 13 (same version as my regular disk) on the old disk (I'll call it the test disk now), which had (and still has) a small UFS slice for /, /var, and /tmp and a big UFS slice for /usr.

To boot from the test disk, I use the BIOS boot menu, since (unsurprisingly) I have set the default boot disk to my regular disk.

I removed all mentions of ZFS and VBOX from /boot/loader.conf and /etc/rc.conf on the test disk.  Then I booted up a whole bunch of times.  On the thirteenth try, I got the crash.  Unfortunately, I don't have a crash summary from it because the system rebooted from my regular disk instead of the test disk while I was still staring at the crash message on the screen.  Subsequently, I booted 20 more times from the test disk without getting the crash again.

What I saw (for a few seconds) on the screen from the one crash sure looked like the same old backtrace, and I have to say, to an ignorant yokel like myself, it seemed to be saying that there's a locking problem in amdgpu.  There was absolutely no virtual terminal switching, because I had not started an X server and I did not type ALT+Fn.

I'll try getting a proper crash dump later (possibly tomorrow).  My thanks to all of you for your patience.
Comment 86 Mark Millard 2023-02-27 20:02:46 UTC
(In reply to George Mitchell from comment #85)

Where does dumpdev point for "test disk"? Someplace also on
the "test disk" that a "regular disk" boot would not change?

If yes, the first boot of the "test disk" after the crash
should have picked up the dump information, even if the
"regular disk" was booted between times. But if the dumpdev
place is common to both types of boot, then the regular disk
boot would have processed the dump, likely using a different
/var/crash/ place to store things.

Another question would be if there is sufficient room for
/var/crash/ to contain the saved vmcore.* and related files.

Yet another question is if the test disk has /usr/local/bin/gdb
installed vs. not. (When present, /usr/local/bin/gdb is used
to provide one of the forms of backtrace, the one with source
file references and line numbers and such. Much nicer to deal
with.)
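
From a "test disk" boot, something like the following should answer
those three questions (output will vary, of course):

sysrc dumpdev dumpdir
df -h /var/crash
pkg info gdb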

If a vmcore.* was saved but some related information was
not for some reason, it should be possible to have the
related information produced based on the vmcore.* file.


Side note:

In case it is relevant, I'll note that defining dumpdev
in /boot/loader.conf in a form the kernel can handle, instead
of in /etc/rc.conf , can be used to allow the system to produce
dumps for earlier crashes. (But I'm guessing the crash was not
that earliy to need such.)
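For example (a sketch; the partition name is hypothetical and must
match the machine's actual dump device):

In /boot/loader.conf (read by the kernel itself, so dumps can be
taken for panics that happen before rc(8) runs):
    dumpdev="ada0p2"

In /etc/rc.conf (processed later by the rc.d/dumpon script):
    dumpdev="AUTO"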
Comment 87 Mark Millard 2023-02-27 20:13:35 UTC
(In reply to George Mitchell from comment #85)

For booting the test disk, getting the kldstat output
from a successful boot might prove useful reference
material at some point: it should show what to expect
to be loaded by the kernel and in what order.

Since you got a crash before starting the X server and
had not used ALT+Fn, that would be appropriate context
for the kldstat relative to the known UFS-only crash.

Other time frames for kldstat may be relevant at some
point.
Comment 88 Mark Millard 2023-02-27 20:50:05 UTC
I booted a ThreadRipper 1950X system via its UFS-only boot media
alternative. The system is not set up for X. For example, no
use/installation of amdgpu.ko for use with its video card. For
reference:

# kldstat
Id Refs Address                Size Name
 1   58 0xffffffff80200000  295a5a0 kernel
 2    1 0xffffffff83210000     3370 acpi_wmi.ko
 3    1 0xffffffff83214000     3210 intpm.ko
 4    1 0xffffffff83218000     2178 smbus.ko
 5    1 0xffffffff8321b000     2220 cpuctl.ko
 6    1 0xffffffff8321e000     3360 uhid.ko
 7    1 0xffffffff83222000     4364 ums.ko
 8    1 0xffffffff83227000     33a0 usbhid.ko
 9    1 0xffffffff8322b000     32a8 hidbus.ko
10    1 0xffffffff8322f000     4d00 ng_ubt.ko
11    6 0xffffffff83234000     ab28 netgraph.ko
12    2 0xffffffff8323f000     a238 ng_hci.ko
13    4 0xffffffff8324a000     2668 ng_bluetooth.ko
14    1 0xffffffff8324d000     8380 uftdi.ko
15    1 0xffffffff83256000     4e48 ucom.ko
16    1 0xffffffff8325b000     3340 wmt.ko
17    1 0xffffffff8325f000     e250 ng_l2cap.ko
18    1 0xffffffff8326e000    1bf08 ng_btsocket.ko
19    1 0xffffffff8328a000     38b8 ng_socket.ko
20    1 0xffffffff8328e000     2a50 mac_ntpd.ko

# uname -apKU
FreeBSD amd64_UFS 14.0-CURRENT FreeBSD 14.0-CURRENT #61 main-n261026-d04c86717c8c-dirty: Sun Feb 19 15:03:52 PST 2023     root@amd64_ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 1400081 1400081
Comment 89 George Mitchell 2023-03-03 23:20:14 UTC
After getting another instance of my crash on my test disk and then booting from the correct disk, I got a crash summary that said:

Dwarf Error: wrong version in compilation unit header (is 4, should be 2) [in module /usr/lib/debug/boot/kernel/kernel.debug]

It occurred to me that when I updated my test disk from FBSD 12 to 13 I had forgotten to run mergemaster.  So I did so today.  But I haven't been able to reproduce the crash in 25 tries since then.  I'm convinced that running mergemaster did not fix the crash, which is after all highly random.  So I will try some more tomorrow.

I appreciate everybody's patience.
Comment 90 Mark Millard 2023-03-04 02:22:23 UTC
(In reply to George Mitchell from comment #89)

What vintage/version of *gdb was in use? (If it was
gdb that complained.) Was it /usr/local/bin/*gdb ?
/usr/libexec/*gdb ? Actually, for the backtrace
activity, it is kgdb that is used, not gdb. Thus my
use of "*gdb" notation.

But a core.txt.* file in my context shows:

GNU gdb (GDB) 12.1 [GDB v12.1 for FreeBSD]

which would be for /usr/local/bin/*gdb ( not
/usr/libexec/*gdb ). This is because I have:

# pkg info gdb
gdb-12.1_3
Name           : gdb
Version        : 12.1_3
. . .

installed. (I had to generate a livecore.* to have something
to reference/illustrate with, having had no example
vmcore.* files around for a long time.)

A significantly older gdb might indicate use of
an old /usr/libexec/*gdb that had not been cleaned
out.

I'll note that I got no DWARF complaints from
kgdb and:

# llvm-dwarfdump -r 1 /usr/lib/debug/boot/kernel/kernel.debug | grep DWARF | head -1
0x00000000: Compile Unit: length = 0x000001d3, format = DWARF32, version = 0x0004, abbr_offset = 0x0000, addr_size = 0x08 (next unit at 0x000001d7)

indicates version = 0x0004.

This leads me to expect that you have an old
gdb (kgdb) around that is in use.


It sounds like you got a savecore into /var/crash/ .
It should be possible to try investigating that without
having to cause another crash, presuming the system
is not updated (so that it matches the crash contents).
For example, the same sort of command that crashinfo
uses on the saved system-core file could be manually
tried, possibly with a more modern kgdb vintage being
used that would handle the more recent dwarf version.
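For example, a rough manual equivalent of what crashinfo(8) runs
(the vmcore index here is hypothetical; use whichever file savecore
actually wrote, and assume the gdb package's kgdb is first in PATH):

# kgdb /usr/lib/debug/boot/kernel/kernel.debug /var/crash/vmcore.3
(kgdb) bt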

Attaching your core.txt.* file content might prove
useful.
Comment 91 George Mitchell 2023-03-05 03:21:43 UTC
Created attachment 240591 [details]
A new but related crash (I think)

This one was at shutdown time rather than boot-up time, so potentially virtual terminal switching was involved.  But once again there are references to "WARNING !drm_modeset_is_locked(&plane->mutex) failed" along with a mention of ZFS.  I don't know what it means.
Comment 92 Mark Millard 2023-03-05 06:57:25 UTC
(In reply to George Mitchell from comment #91)

So, apparently, this was not one of the UFS-only experiments.


The gdb backtrace is messy:

. . .
#7  <signal handler called>
. . .
#27 <signal handler called>
#28 0x00000000002881da in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffd688

This indicates that we are not seeing evidence of whatever
earlier problem led to #27. That, in turn, may or may not have
been the original problem.

The context looks very different from prior reports. But not
seeing what led to #27 makes forming solid judgments
problematic.


I see from this that a modern gdb (kgdb) was in use for the
crashinfo generation after the savecore operation for this
failure, with no problems handling DWARF 4 vs. 2. But it would
seem to be the boot media normally used with ZFS rather than the
boot media intended for UFS-only testing. The two might differ in
which gdb (kgdb) is available for crashinfo to use.
Comment 93 Mark Millard 2023-03-05 07:06:44 UTC
(In reply to Mark Millard from comment #92)

Looking at it some more and comparing to

#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff810ade4f at trap_pfault+0x4f
#5 0xffffffff81084fd8 at calltrap+0x8
#6 0xffffffff8214d251 at spl_nvlist_free+0x61
#7 0xffffffff8220d740 at fm_nvlist_destroy+0x20
#8 0xffffffff822e6e95 at zfs_zevent_post_cb+0x15
#9 0xffffffff8220cd02 at zfs_zevent_drain+0x62
#10 0xffffffff8220cbf8 at zfs_zevent_drain_all+0x58
#11 0xffffffff8220ede9 at fm_fini+0x19
#12 0xffffffff82243b94 at spa_fini+0x54
#13 0xffffffff822ee303 at zfs_kmod_fini+0x33
#14 0xffffffff8215fb3b at zfs_shutdown+0x2b
#15 0xffffffff80c1b76c at kern_reboot+0x3dc
#16 0xffffffff80c1b381 at sys_reboot+0x411
#17 0xffffffff810ae6ec at amd64_syscall+0x10c

both #27 and #28 in:

#26 amd64_syscall (td=0xfffffe000f43ca00, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1185
#27 <signal handler called>
#28 0x00000000002881da in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffd688

are possibly just the normal difficulty with finding where
to stop listing.
Comment 94 Mark Millard 2023-03-05 07:19:56 UTC
(In reply to Mark Millard from comment #93)

#7  <signal handler called>
#8  vtozoneslab (va=18446735277616529408, zone=<optimized out>, 
    slab=<optimized out>) at /usr/src/sys/vm/uma_int.h:635

looks to be the "*slab" line in:

static __inline void
vtozoneslab(vm_offset_t va, uma_zone_t *zone, uma_slab_t *slab)
{
        vm_page_t p;
  
        p = PHYS_TO_VM_PAGE(pmap_kextract(va));
        *slab = p->plinks.uma.slab;
        *zone = p->plinks.uma.zone;
}

For reference: 18446735277616529408 == 0xFFFFF80000000000
Comment 95 George Mitchell 2023-03-06 18:15:30 UTC
Created attachment 240622 [details]
Another crash summary; looks like all the earlier ones

Quick summary: I can't cause this crash on my test setup (amdgpu but no ZFS) over close to 50 tries.

In more detail: I deleted all ports from my test setup and then added drm-510-kmod and gpu-firmware-amd-kmod, and (most importantly) gdb.  I then made many fruitless attempts to reproduce the crash.  Experimentally, I added "zfs" to my mod_list in /etc/rc.conf and got another instance of the crash after 11 attempts (see attachment).  This crash looks like all the ones from my regular setup, but at least it appears to be in the right format to get a backtrace, etc.

I then took "zfs" out of my mod_list and tried another 20 times to get the crash to recur.  It did not recur.
Comment 96 John F. Carr 2023-03-06 20:09:29 UTC
(In reply to Mark Millard from comment #94)

The "signal handler called" line hides a function call.  I think the crash is due to a null pointer dereference ("fault virtual address = 0x0") in pmap_kextract called from the line above.  Tracking down the PC address 0xffffffff80bf3727 in the kernel image should clarify.
Comment 97 Mark Millard 2023-03-06 20:20:59 UTC
(In reply to George Mitchell from comment #95)

But, as I understand it, comments #85 and #89 reported
crashes of the test setup (no ZFS).
(I ignore #91, which was at shutdown and looks different.)

If true, we do have some existence-proof type evidence
for without ZFS involved. It just may be less common.
(Unfortunately some detail was not available for
validating a context match.)

You may not want to spend all your time with the
no-ZFS style tests, but spending some time on
occasion could eventually prove useful. Any big,
complicated thing (like ZFS) that can be eliminated
may help isolate the problem.
Comment 98 Mark Millard 2023-03-06 20:24:11 UTC
(In reply to John F. Carr from comment #96)

As I understand it, "fault virtual address = 0x0" is
for #7 and not for #27. As far as I can tell,
what led to #27, and its specific type, is not
available to us.
Comment 99 Mark Millard 2023-03-06 20:43:24 UTC
(In reply to George Mitchell from comment #95)

FYI: "Another crash summary; looks like all the earlier ones"
is a crash when it is getting ready to load ZFS, not after
ZFS has been loaded. So ZFS had not been started yet.

So it is evidence for a problem without having ZFS in
operation at all.
Comment 100 John F. Carr 2023-03-06 20:50:18 UTC
(In reply to Mark Millard from comment #98)

Frame 27 is the entry into the kernel via the system call trap.  We know this because it calls amd64_syscall.  Frame 28 is a user program.  We know this because the addresses are at the user address space and not the kernel address space (program counter at 0x2881da, stack frame at 0x7fffffffd688).
Comment 101 Mark Millard 2023-03-06 20:51:49 UTC
(In reply to George Mitchell from comment #95)

FYI: "Another crash summary; looks like all the earlier ones"
is a crash when it is getting ready to load ZFS, not after
ZFS has been loaded. So ZFS had not been started yet.

So it is evidence for a problem without having had ZFS in
operation at all.
Comment 102 George Mitchell 2023-03-06 20:57:27 UTC
(In reply to Mark Millard from comment #97)
You are correct that I did get two crashes without ZFS, but their dumps did not appear to be decipherable.  I'll keep trying for another dump without ZFS now that I know we can obtain a usable dump on the test setup.
(In reply to Mark Millard from comment #101)
That's why we stopped seeing the reference to ZFS when I took "zfs" out of mod_list and put "zfs_load="YES"" in /boot/loader.conf in response to comment #41.
Comment 103 Mark Millard 2023-03-06 21:00:43 UTC
(In reply to John F. Carr from comment #100)

Ahh, so kgdb ends up with fast_syscall_common+0xf8
or the like translated to a <signal handler called>.
For this part, believe the kernel's own backtrace and
look at the area that says
fast_syscall_common+0xf8 (or whatever).

Good to know. Thanks.
Comment 104 John F. Carr 2023-03-06 21:13:38 UTC
If the problem is memory corruption, running a debug kernel might catch the corruption closer to when it happens.  Are you able to build and run your own kernel with a configuration file like

include GENERIC
ident   DEBUG
options       INVARIANTS
options       INVARIANT_SUPPORT

?
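If so, a typical sequence would be roughly the following (a sketch,
assuming the matching release sources are checked out in /usr/src and
the configuration above is saved as /usr/src/sys/amd64/conf/DEBUG):

# cd /usr/src
# make -j8 buildkernel KERNCONF=DEBUG
# make installkernel KERNCONF=DEBUG
# shutdown -r now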
Comment 105 Mark Millard 2023-03-06 21:26:35 UTC
(In reply to George Mitchell from comment #102)

So are all the load-time crashes with things
loaded via use of:

     kld_list    (str) A whitespace-separated list of kernel modules to load
                 right after the local disks are mounted, without any .ko
                 extension or path.  Loading modules at this point in the
                 boot process is much faster than doing it via
                 /boot/loader.conf for those modules not necessary for
                 mounting local disks.

and never with things that are loaded via
/boot/loader.conf activity?

It is a possible distinction in the test
results that I'd managed to miss.


(I'll note that the "for those modules not
necessary for mounting local disks" wording may make
listing zfs in kld_list unusual. That, in
turn, might help explain why, so far, you are
the only one known to be seeing these load-time
crashes.)
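For reference, the two load mechanisms being contrasted look like
this (both spellings are taken from earlier comments in this report):

# /etc/rc.conf -- rc.d/kld loads these right after local disks are mounted:
kld_list="amdgpu"

# /boot/loader.conf -- staged by the loader before the kernel starts:
zfs_load="YES"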
Comment 106 George Mitchell 2023-03-07 13:39:24 UTC
(In reply to John F. Carr from comment #104)
I will try this today.  By the way, perhaps I should have mentioned already that I use SCHED_4BSD (I'm the guy who periodically rants that it should be the default, or at least that the scheduler should be a kernel loadable module), though it's hard to see how that could be a factor.
(In reply to Mark Millard from comment #105)
Yes, I had an occurrence of brain fade when I put zfs into mod_list.  I promise never to have brain fade ever again.
Comment 107 George Mitchell 2023-03-07 18:40:34 UTC
Created attachment 240642 [details]
Crash without any use of ZFS, with acpi_wmi

Here's a crash from my test setup with no use of ZFS at all.  It looks like the earlier crash with acpi_wmi, without which I suspect this hardware won't run.  Also, this kernel had INVARIANTS and INVARIANT_SUPPORT compiled in (confirmed by the config shown in the summary), though I couldn't tell that from anything I saw on the screen.  Next I'll attach the relevant part of /var/log/messages, though I didn't see anything there either.
Comment 108 George Mitchell 2023-03-07 18:43:11 UTC
Created attachment 240643 [details]
Relevant part of /var/log/messages

Here's the log from the time of the crash, up to now.
Comment 109 Mark Millard 2023-03-07 18:52:05 UTC
(In reply to George Mitchell from comment #107)

I'll note that in the example kldstat that I reported
earlier the order started with:

# kldstat
Id Refs Address                Size Name
 1   58 0xffffffff80200000  295a5a0 kernel
 2    1 0xffffffff83210000     3370 acpi_wmi.ko
. . .

So acpi_wmi.ko appears to be the first module loaded
in my context. I'd guess that is true for your context
as well.

This would mean that prior module loads are not
required for the problem to happen when loading the
first of the modules. That should narrow the
range of possibilities (for someone sufficiently
knowledgeable in the subject area).
Comment 110 George Mitchell 2023-03-08 22:35:17 UTC
Created attachment 240683 [details]
New instance

This is from running my regular setup, not the debug setup.  Almost immediately after I got this dump, my system crashed two more times in a row; see next attachment, which appears to contain a summary of both crashes (the 2nd and the 3rd).  None of the stack dumps seem to have a call to modlist_lookup2, so possibly all three of these are some new amdgpu crash.
Comment 111 George Mitchell 2023-03-08 22:37:30 UTC
Created attachment 240684 [details]
Crashes 2 and 3

The second crash was very late in the boot process, unlike most of the others.  Running meld on these files might prove enlightening.
Comment 112 Mark Millard 2023-03-08 23:41:07 UTC
(In reply to George Mitchell from comment #110)

The backtraces mentioning "zap_evict_sync" are not new.
You submitted prior examples as attachments, such as
"New core.txt".

The backtrace(s) with "spa_all_configs" may well be new.
I do not remember such.
Comment 113 George Mitchell 2023-03-09 19:29:58 UTC
Would it help if I attached my system log from the period of time yesterday when I got three crashes in a row?
Comment 114 George Mitchell 2023-03-10 18:28:16 UTC
Created attachment 240729 [details]
Another instance of attachment #240591 [details] crash at shutdown time

For the sake of completeness I'm attaching one more instance of the crash I see every few days at shutdown time instead of boot-up time.

My plan for now is to restore my configuration to the one that most frequently provokes the crash: namely, I load ZFS with zfs_enable in /etc/rc.conf instead of zfs_load in /boot/loader.conf, and I'm adding vbox_enable="YES" back into /etc/rc.conf.  Also, I'm updating from drm-510-kmod-5.10.113_8 to drm-510-kmod-5.10.163_2 since it's available, and I'll see whether it still crashes.  If so, then I will stop using amdgpu for a week and verify, for the purpose of maintaining my own sanity, that the crashes stop.  And I'll report back here.
Comment 115 Mark Millard 2023-03-10 18:35:47 UTC
(In reply to George Mitchell from comment #114)

All of the crashes that listed "acpi_wmi" happened before
amdgpu could have been involved: acpi_wmi loads first;
amdgpu would come later.
Comment 116 George Mitchell 2023-03-10 18:45:49 UTC
Created attachment 240731 [details]
After upgrading to v5.10.163_2

I re-enabled the crashes (i.e. stopped loading ZFS early and turned vbox_enable back on) and got a crash on my very first reboot.  Now I have disabled amdgpu and I'll be astonished if I get a crash before the twelfth of never.

This crash does look slightly different, though, and seems to have had a trap 22 in ZFS code.
Comment 117 George Mitchell 2023-03-10 18:50:20 UTC
(In reply to Mark Millard from comment #115)
Be that as it may, over the period of time from when I first upgraded to FBSD 13.1 until I started seriously trying to use drm-510-kmod, I never saw any occurrences at all of the ZFS crash, the vboxnetflt crash, or the acpi_wmi crash.  And I don't expect to see any of them as long as I don't load amdgpu.ko.
Comment 118 Mark Millard 2023-03-11 22:45:37 UTC
(In reply to George Mitchell from comment #117)

Yea, my expectation that acpi_wmi would always be loaded first
was just wrong. Sorry. With the ZFS boot media, I see:

Id Refs Address                Size Name
 1   94 0xffffffff80200000  295a9b0 kernel
 2    1 0xffffffff82b5b000   5b80d8 zfs.ko
 3    1 0xffffffff83115000     76f8 cryptodev.ko
 4    1 0xffffffff83a10000     3370 acpi_wmi.ko
. . .

I looked at all your attachments again. It appears amdgpu
was already present before the first crash point in all of
them.
Comment 119 Mark Millard 2023-03-11 23:45:30 UTC
For:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff80d17870

objdump -d --prefix-addresses /boot/kernel/kernel | less

shows:

ffffffff80d1786b <qsort+0x12ab> mov    %esi,0x4(%r11,%rdx,4)
ffffffff80d17870 <qsort+0x12b0> mov    0x8(%rcx,%rdx,4),%esi

As for other "instruction pointer" examples . . .

Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer	= 0x20:0xffffffff80d17890

ffffffff80d1788f <qsort+0x12cf> mov    %esi,0xc(%r11,%rdx,4)
ffffffff80d17894 <qsort+0x12d4> add    $0x4,%rdx

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x7
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff82600ba6

The above is outside the kernel's code.

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf3707

ffffffff80bf3701 <free+0x11> je     ffffffff80bf378d <free+0x9d>
ffffffff80bf3707 <free+0x17> mov    %rsi,%r14

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff82231ba6

The above is outside the kernel's code.

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf3707

ffffffff80bf3701 <free+0x11> je     ffffffff80bf378d <free+0x9d>
ffffffff80bf3707 <free+0x17> mov    %rsi,%r14

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf3727

ffffffff80bf3722 <free+0x32> call   ffffffff80f66670 <PHYS_TO_VM_PAGE>
ffffffff80bf3727 <free+0x37> mov    (%rax),%r13

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff80d0cea0

ffffffff80d0ce9c <vn_ioctl+0x1fc> jne    ffffffff80d0cff2 <vn_ioctl+0x352>
ffffffff80d0cea2 <vn_ioctl+0x202> movzwl 0x2(%r13),%ecx
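
As an aside, a single instruction pointer can also be resolved to a
function and source line with addr2line against the debug file (a
sketch; it works here because the objdump addresses above already
match the runtime instruction pointers, i.e. the kernel is running at
its link addresses):

# addr2line -f -e /usr/lib/debug/boot/kernel/kernel.debug 0xffffffff80d17870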
Comment 120 Mark Millard 2023-03-12 20:14:25 UTC
(In reply to Mark Millard from comment #119)

[Sorry for the accidental duplication of the block that
had "instruction pointer = 0x20:0xffffffff80bf3707".]

The qsort, free, and vn_ioctl addresses do not look to
match up with any of the multi-level backtraces. So we
have very little evidence about what the context was.

I've no clue for the addresses that were outside the
kernel.
Comment 121 Mark Millard 2023-03-12 20:54:27 UTC
(In reply to Mark Millard from comment #120)

Ugg. I just realized that I'd not looked at an official
releng/13.1 build. So using a download of an official
kernel.txz this time . . . (the subroutines stay the same
but the detailed code is different).


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff80d17870

ffffffff80d1786d <qsort+0x130d> mov    -0x38(%rbp),%rdi
ffffffff80d17871 <qsort+0x1311> mov    %dl,(%rdi,%rsi,1)


As for other "instruction pointer" examples . . .

Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer	= 0x20:0xffffffff80d17890

ffffffff80d1788f <qsort+0x132f> cmp    $0x3,%r8
ffffffff80d17893 <qsort+0x1333> jae    ffffffff80d17910 <qsort+0x13b0>


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x7
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff82600ba6

The above is outside the kernel's code.


Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff82231ba6

The above is outside the kernel's code.


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf3707

ffffffff80bf3700 <free+0x70> mov    %gs:0xb0,%rax
ffffffff80bf3709 <free+0x79> add    %r15,0x8(%rcx,%rax,1)


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf3727

ffffffff80bf3724 <free+0x94> cmpb   $0x0,0x128(%rbx)
ffffffff80bf372b <free+0x9b> jne    ffffffff80bf3777 <free+0xe7>


Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff80d0cea0

ffffffff80d0ce9a <vn_ioctl+0x25a> mov    %r14,-0xc8(%rbp)
ffffffff80d0cea1 <vn_ioctl+0x261> cmpb   $0x0,0xaf417e(%rip)        # ffffffff81801026 <sdt_probes_enabled>
Comment 122 Mark Millard 2023-03-12 23:13:57 UTC
(In reply to George Mitchell from comment #117)

Would it be reasonable to have some testing with amdgpu.ko
loaded but never having a desktop environment active?

Or maybe I should form the idea as questions: what is
the minimal form of having amdgpu.ko loaded in the system?
Can that be tested (if it has not been already)? Does this
minimal form behave any differently than more involved
use of amdgpu.ko (and the associated card firmware)?

In a different direction . . .

In/for a separate context, I once built amdgpu and its
firmware and installed it. But I did not set up an
automatic load. For the rare test, I manually loaded
amdgpu and then started lumina. (It is an old memory. I
might not have the details correct.) This procedure might
have largely avoided later loads of kernel modules and,
so, avoided discovering a problem.
Comment 123 George Mitchell 2023-03-12 23:23:49 UTC
All my so-called test setup tests were run without starting a desktop environment (by which I assume you mean not starting X).  There were still crashes such as in comment #107, attachment #240642 [details].

With my normal setup, kldloading amdgpu manually instead of automatically noticeably reduced the incidence of crashes but did not eliminate them.
Comment 124 Mark Millard 2023-03-13 01:22:33 UTC
(In reply to George Mitchell from comment #123)

"kldloading amdgpu manually": there are two possibilities:

A) Using boot -s and doing kldload and then exiting to
   normal mode. There are examples in your attachments
   of doing this.

B) Getting to normal mode, logging in, and only after that
   doing the first kldload of amdgpu. I do not remember any
   of the attachments clearly indicating such a sequence.
   It puts the amdgpu load after all the other normal loads.
Comment 125 Mark Millard 2023-03-13 03:51:59 UTC
Well, I was going to try testing in an environment where I've
got a serial console: an aarch64 main [so: 14] context. But
it turns out that there is at least one missing function
declaration for that type of context at this point:

/wrkdirs/usr/ports/graphics/drm-515-kmod/work/drm-kmod-drm_v5.15.25/drivers/gpu/drm/drm_cache.c:362:10: error: call to undeclared function 'in_interrupt'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
        WARN_ON(in_interrupt());
                ^
1 error generated.
*** [drm_cache.o] Error code 1

as is visible in the official build log:

http://ampere2.nyi.freebsd.org/data/main-arm64-default/p64e3eb722c17_s7fc82fd1f8/logs/errors/drm-515-kmod-5.15.25.log

Turns out the drm-510-kmod variant allowed for releng/13.1
and later is missing possible macro definitions for aarch64:

/wrkdirs/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_2/drivers/gpu/drm/amd/display/dc/core/dc.c:741:3: error: call to undeclared function 'DC_FP_START'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                DC_FP_START();
                ^
/wrkdirs/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_2/drivers/gpu/drm/amd/display/dc/core/dc.c:743:3: error: call to undeclared function 'DC_FP_END'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                DC_FP_END();
                ^
2 errors generated.
*** [dc.o] Error code 1

as is visible in:

http://ampere2.nyi.freebsd.org/data/main-arm64-default/p64e3eb722c17_s7fc82fd1f8/logs/errors/drm-510-kmod-5.10.163_2.log

(It is not just my builds that have such issues:
official builds have the problems as well.)

I was hoping I'd be able to do some testing in the
alternative type of context (likely never starting
X11). That looks to not be in the cards at this
time.
Comment 126 Mark Millard 2023-03-14 03:22:27 UTC
(In reply to Mark Millard from comment #125)

Picking the drm-515-kmod one: it looks like the source 
file referenced needs to include the content of the
file providing the #define :

/usr/main-src/sys/compat/linuxkpi/common/include/linux/preempt.h:#define        in_interrupt() \


There are overall, some other uses:

drm-kmod//drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c:     if (r < 1 && in_interrupt())
drm-kmod//drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:      if (r < 1 && (amdgpu_in_reset(adev) || in_interrupt()))
drm-kmod//drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:      if (r < 1 && (amdgpu_in_reset(adev) || in_interrupt()))
drm-kmod//drivers/gpu/drm/drm_cache.c:  if (WARN_ON(in_interrupt())) {
drm-kmod//drivers/gpu/drm/drm_cache.c:  WARN_ON(in_interrupt());

I have not checked if any others of those do get preempt.h
already.

amd64 might be working via header pollution in some way
that aarch64 does not?
Comment 127 Mark Millard 2023-03-17 05:22:39 UTC
(In reply to Mark Millard from comment #126)

Further inspection of what comes next after making
drm_cache.c pick up the in_interrupt definition
suggests that trying builds of aarch64 is premature
at this point, making the type of test I was
intending also premature.
Comment 128 George Mitchell 2023-03-17 17:43:18 UTC
(In reply to George Mitchell from comment #114)
> [...] I will stop using amdgpu for a week and verify, for the purpose
> of maintaining my own sanity, that the crashes stop.   [...]

Back in amd64 land, since the time of that comment, I have rebooted my system 25 times and there have been no crashes at all.  I guess I'm sane.
Comment 129 Mark Millard 2023-03-17 20:01:45 UTC
(In reply to George Mitchell from comment #12)

Could you also share your "kldstat" output for when amdgpu
has been loaded?

Loading amdgpu may add more than just amdgpu itself to
what is loaded, compared to when amdgpu is not loaded at
all. For example, some of:

# find /boot/ker*/ -name 'linux*' -print | more
/boot/kernel/linux64.ko
/boot/kernel/linux_common.ko
/boot/kernel/linuxkpi.ko
/boot/kernel/linuxkpi_wlan.ko

might be involved, not just amdgpu.

Loading only some prerequisites for amdgpu, but not
amdgpu itself, might prove a useful isolation test.
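For instance, one small version of that test might be (a sketch; the
kernel linker resolves dependencies automatically, so loading drm
alone should pull in linuxkpi_gplv2 and dmabuf without touching
amdgpu):

# kldload drm
# kldstat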
Comment 130 Mark Millard 2023-03-17 20:30:16 UTC
(In reply to Mark Millard from comment #129)

I wrote "what is loaded before" relative to amdgpu.
But what amdgpu in turn leads to loading that is
listed after amdgpu in "kld stat" output likely is
just as relevant. For all I know all of it may be
from after amdgpu's position in the "kld stat" list.
Comment 131 Mark Millard 2023-03-18 04:20:49 UTC
(In reply to Mark Millard from comment #130)

Based on drm-515-kmod related materials for amd64
running main [so: 14] and the type of card that happened
to be present, I saw:

22    1 0xffffffff83c00000   4fd918 amdgpu.ko
23    2 0xffffffff83a8e000    79f50 drm.ko
24    1 0xffffffff83b08000     22a8 iic.ko
25    3 0xffffffff83b0b000     30d8 linuxkpi_gplv2.ko
26    4 0xffffffff83b0f000     6320 dmabuf.ko
27    3 0xffffffff83b16000     3360 lindebugfs.ko
28    1 0xffffffff83b1a000     b350 ttm.ko
29    1 0xffffffff83b26000     a118 amdgpu_polaris11_k_mc_bin.ko
30    1 0xffffffff83b31000     6370 amdgpu_polaris11_pfp_2_bin.ko
31    1 0xffffffff83b38000     6370 amdgpu_polaris11_me_2_bin.ko
32    1 0xffffffff83b3f000     4370 amdgpu_polaris11_ce_2_bin.ko
33    1 0xffffffff83b44000     7978 amdgpu_polaris11_rlc_bin.ko
34    1 0xffffffff83b4c000    42380 amdgpu_polaris11_mec_2_bin.ko
35    1 0xffffffff83b8f000    42380 amdgpu_polaris11_mec2_2_bin.ko
36    1 0xffffffff83bd2000     5270 amdgpu_polaris11_sdma_bin.ko
37    1 0xffffffff83bd8000     5270 amdgpu_polaris11_sdma1_bin.ko
38    1 0xffffffff840fe000    5db58 amdgpu_polaris11_uvd_bin.ko
39    1 0xffffffff8415c000    2ac78 amdgpu_polaris11_vce_bin.ko
40    1 0xffffffff83bde000    21d90 amdgpu_polaris11_k_smc_bin.ko

This was from deliberately using kldload amdgpu after all the
normal boot/login load activity. No kld_list= use involved at
all.

I wonder whether your environment would still crash with amdgpu
loaded this late.

FYI: The prior load activity was:

Id Refs Address                Size Name
 1  132 0xffffffff80200000  295b050 kernel
 2    1 0xffffffff82b5d000     76f8 cryptodev.ko
 3    1 0xffffffff82b65000   5b80d8 zfs.ko
 4    1 0xffffffff83a10000     3370 acpi_wmi.ko
 5    1 0xffffffff83a14000     3210 intpm.ko
 6    1 0xffffffff83a18000     2178 smbus.ko
 7    1 0xffffffff83a1b000     2220 cpuctl.ko
 8    1 0xffffffff83a1e000     3360 uhid.ko
 9    1 0xffffffff83a22000     4364 ums.ko
10    1 0xffffffff83a27000     33a0 usbhid.ko
11    1 0xffffffff83a2b000     32a8 hidbus.ko
12    1 0xffffffff83a2f000     4d00 ng_ubt.ko
13    6 0xffffffff83a34000     ab28 netgraph.ko
14    2 0xffffffff83a3f000     a238 ng_hci.ko
15    4 0xffffffff83a4a000     2668 ng_bluetooth.ko
16    1 0xffffffff83a4d000     8380 uftdi.ko
17    1 0xffffffff83a56000     4e48 ucom.ko
18    1 0xffffffff83a5b000     3340 wmt.ko
19    1 0xffffffff83a5f000     e250 ng_l2cap.ko
20    1 0xffffffff83a6e000    1bf08 ng_btsocket.ko
21    1 0xffffffff83a8a000     38b8 ng_socket.ko
Comment 132 George Mitchell 2023-03-18 17:49:14 UTC
(In reply to Mark Millard from comment #129)
When I boot up to single-user mode, kldstat says:
Id Refs Address                Size Name
 1    7 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    8cc90 vboxdrv.ko
 3    1 0xffffffff821be000    ff4b8 if_re.ko
 4    1 0xffffffff822be000     77e0 sem.ko
After "kldload amdgpu," it says:
Id Refs Address                Size Name
 1   59 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    8cc90 vboxdrv.ko
 3    1 0xffffffff821be000    ff4b8 if_re.ko
 4    1 0xffffffff822be000     77e0 sem.ko
 5    1 0xffffffff82600000   417220 amdgpu.ko
 6    2 0xffffffff82518000    739e0 drm.ko
 7    3 0xffffffff8258c000     5220 linuxkpi_gplv2.ko
 8    4 0xffffffff82592000     62d8 dmabuf.ko
 9    1 0xffffffff82599000     c758 ttm.ko
10    1 0xffffffff825a6000     2218 amdgpu_raven_gpu_info_bin.ko
11    1 0xffffffff825a9000     64d8 amdgpu_raven_sdma_bin.ko
12    1 0xffffffff825b0000    2e2d8 amdgpu_raven_asd_bin.ko
13    1 0xffffffff825df000     93d8 amdgpu_raven_ta_bin.ko
14    1 0xffffffff825e9000     7558 amdgpu_raven_pfp_bin.ko
15    1 0xffffffff825f1000     6558 amdgpu_raven_me_bin.ko
16    1 0xffffffff825f8000     4558 amdgpu_raven_ce_bin.ko
17    1 0xffffffff82a18000     b9c0 amdgpu_raven_rlc_bin.ko
18    1 0xffffffff82a24000    437e8 amdgpu_raven_mec_bin.ko
19    1 0xffffffff82a68000    437e8 amdgpu_raven_mec2_bin.ko
20    1 0xffffffff82aac000    5a638 amdgpu_raven_vcn_bin.ko
But after a full boot without amdgpu, it says:
Id Refs Address                Size Name
 1   66 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    ff4b8 if_re.ko
 3    3 0xffffffff82230000    8cc90 vboxdrv.ko
 4    1 0xffffffff822bd000     77e0 sem.ko
 5    1 0xffffffff82600000   3df128 zfs.ko
 6    2 0xffffffff82518000     4240 vboxnetflt.ko
 7    2 0xffffffff8251d000     aac8 netgraph.ko
 8    1 0xffffffff82528000     31c8 ng_ether.ko
 9    1 0xffffffff8252c000     55e0 vboxnetadp.ko
10    1 0xffffffff82532000     3378 acpi_wmi.ko
11    1 0xffffffff82536000     3218 intpm.ko
12    1 0xffffffff8253a000     2180 smbus.ko
13    1 0xffffffff8253d000     33c0 uslcom.ko
14    1 0xffffffff82541000     4d90 ucom.ko
15    1 0xffffffff82546000     2340 uhid.ko
16    1 0xffffffff82549000     3380 usbhid.ko
17    1 0xffffffff8254d000     31f8 hidbus.ko
18    1 0xffffffff82551000     3320 wmt.ko
19    1 0xffffffff82555000     4350 ums.ko
20    1 0xffffffff8255a000     5af8 autofs.ko
21    1 0xffffffff82560000     2a08 mac_ntpd.ko
22    1 0xffffffff82563000     20f0 green_saver.ko
Comment 133 Mark Millard 2023-03-18 18:46:34 UTC
(In reply to George Mitchell from comment #132)

I wonder if, in your context, the following boot
sequencing might sidestep the boot-crash issue:

"A full boot without amdgpu"
then: "kldload amdgpu"
then: normal use.

Basically: doing the amdgpu load as late as
possible relative to everything else loaded,
limiting what all loads after amdgpu.
Comment 134 George Mitchell 2023-03-19 22:23:03 UTC
Okay, my machine is set up as you requested.  It boots to multiuser mode without starting an X session, at which point I load amdgpu and then start my normal XFCE session.  I'll run it this way for a week.

Undoubtedly, it won't exhibit the bootup crash in this mode of operation, but I won't be surprised if I still get a shutdown crash or two.  And in any case this isn't a fix for the underlying bug.

Not sure what new information this is likely to yield.
Comment 135 Mark Millard 2023-03-20 01:39:10 UTC
(In reply to George Mitchell from comment #134)

Having the kldstat output for this combination would
help identify what module is initially involved in any
crash.

Part of what may be of use is how often you see the
dbuf_evict_thread type of backtrace and what module
the first "instruction pointer	=" references in
such cases (if any). Another would be if new crash
contexts show up that have not been seen before.

So far there is no evidence for how many bugs there
are, given the varying failure-structures that show
up. There could even be the possibility of unreliable
memory or bugs specific to amdgpu_raven_*.ko files (such
as sometimes trashing some memory).

I've yet to induce any failure in the amdgpu_polaris11_*.ko
based amd64 context that I have access to (a ThreadRipper
1950X), although by no means is it a close match to your
context. To my knowledge, you still have the only known
examples of any of the failures.

To some extent, if trying new things leads to new forms
of failure for you, it potentially gives me new sequences
to try on the ThreadRipper 1950X. How (un)likely that is
to yield useful information I do not know. (My hope to
also try on aarch64, where I've access to a serial
console, did not pan out.)
Comment 136 George Mitchell 2023-03-20 15:18:16 UTC
Sorry, meant to put these in yesterday.  After booting to single-user mode, kldstat reports:

Id Refs Address                Size Name
 1    7 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    8cc90 vboxdrv.ko
 3    1 0xffffffff821be000    ff4b8 if_re.ko
 4    1 0xffffffff822be000     77e0 sem.ko

If I boot to single-user mode and kldload amdgpu, kldstat reports:

Id Refs Address                Size Name
 1   59 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    8cc90 vboxdrv.ko
 3    1 0xffffffff821be000    ff4b8 if_re.ko
 4    1 0xffffffff822be000     77e0 sem.ko
 5    1 0xffffffff82600000   417220 amdgpu.ko
 6    2 0xffffffff82518000    739e0 drm.ko
 7    3 0xffffffff8258c000     5220 linuxkpi_gplv2.ko
 8    4 0xffffffff82592000     62d8 dmabuf.ko
 9    1 0xffffffff82599000     c758 ttm.ko
10    1 0xffffffff825a6000     2218 amdgpu_raven_gpu_info_bin.ko
11    1 0xffffffff825a9000     64d8 amdgpu_raven_sdma_bin.ko
12    1 0xffffffff825b0000    2e2d8 amdgpu_raven_asd_bin.ko
13    1 0xffffffff825df000     93d8 amdgpu_raven_ta_bin.ko
14    1 0xffffffff825e9000     7558 amdgpu_raven_pfp_bin.ko
15    1 0xffffffff825f1000     6558 amdgpu_raven_me_bin.ko
16    1 0xffffffff825f8000     4558 amdgpu_raven_ce_bin.ko
17    1 0xffffffff82a18000     b9c0 amdgpu_raven_rlc_bin.ko
18    1 0xffffffff82a24000    437e8 amdgpu_raven_mec_bin.ko
19    1 0xffffffff82a68000    437e8 amdgpu_raven_mec2_bin.ko
20    1 0xffffffff82aac000    5a638 amdgpu_raven_vcn_bin.ko

If I boot to multi-user mode without kldloading amdgpu, kldstat reports:

Id Refs Address                Size Name
 1   66 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    ff4b8 if_re.ko
 3    1 0xffffffff82231000     77e0 sem.ko
 4    3 0xffffffff82239000    8cc90 vboxdrv.ko
 5    1 0xffffffff82600000   3df128 zfs.ko
 6    2 0xffffffff82518000     4240 vboxnetflt.ko
 7    2 0xffffffff8251d000     aac8 netgraph.ko
 8    1 0xffffffff82528000     31c8 ng_ether.ko
 9    1 0xffffffff8252c000     55e0 vboxnetadp.ko
10    1 0xffffffff82532000     3378 acpi_wmi.ko
11    1 0xffffffff82536000     3218 intpm.ko
12    1 0xffffffff8253a000     2180 smbus.ko
13    1 0xffffffff8253d000     33c0 uslcom.ko
14    1 0xffffffff82541000     4d90 ucom.ko
15    1 0xffffffff82546000     2340 uhid.ko
16    1 0xffffffff82549000     3380 usbhid.ko
17    1 0xffffffff8254d000     31f8 hidbus.ko
18    1 0xffffffff82551000     3320 wmt.ko
19    1 0xffffffff82555000     4350 ums.ko
20    1 0xffffffff8255a000     5af8 autofs.ko
21    1 0xffffffff82560000     2a08 mac_ntpd.ko
22    1 0xffffffff82563000     20f0 green_saver.ko

If I then kldload amdgpu, it says the same as above, plus:

23    1 0xffffffff82a00000   417220 amdgpu.ko
24    2 0xffffffff82566000    739e0 drm.ko
25    3 0xffffffff825da000     5220 linuxkpi_gplv2.ko
26    4 0xffffffff825e0000     62d8 dmabuf.ko
27    1 0xffffffff825e7000     c758 ttm.ko
28    1 0xffffffff825f4000     2218 amdgpu_raven_gpu_info_bin.ko
29    1 0xffffffff825f7000     64d8 amdgpu_raven_sdma_bin.ko
30    1 0xffffffff82e18000    2e2d8 amdgpu_raven_asd_bin.ko
31    1 0xffffffff829e0000     93d8 amdgpu_raven_ta_bin.ko
32    1 0xffffffff829ea000     7558 amdgpu_raven_pfp_bin.ko
33    1 0xffffffff829f2000     6558 amdgpu_raven_me_bin.ko
34    1 0xffffffff829f9000     4558 amdgpu_raven_ce_bin.ko
35    1 0xffffffff82e47000     b9c0 amdgpu_raven_rlc_bin.ko
36    1 0xffffffff82e53000    437e8 amdgpu_raven_mec_bin.ko
37    1 0xffffffff82e97000    437e8 amdgpu_raven_mec2_bin.ko
38    1 0xffffffff82edb000    5a638 amdgpu_raven_vcn_bin.ko
Comment 137 George Mitchell 2023-03-20 22:17:43 UTC
Created attachment 241022 [details]
Four boot-time crashes in a row

For some reason, I just got four boot-up crashes immediately in a row.  After I cycled power, I was able to boot up without crashing.  I think I'm going to load zfs.ko from /boot/loader.conf to get it loaded earlier, which mitigates this problem.  (It's currently loaded with zfs_enable="YES" in /etc/rc.conf.)
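
Concretely, that change would mean relying on the following (both
spellings already appear earlier in this report):

# /boot/loader.conf -- module staged by the loader, before the kernel runs:
zfs_load="YES"

# /etc/rc.conf -- still wanted so that rc.d/zfs sets up the pools and mounts:
zfs_enable="YES"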
Comment 138 Mark Millard 2023-03-20 22:22:34 UTC
(In reply to George Mitchell from comment #137)

Your upload ended up being: application/octet-stream this
time, instead of text/plain .
Comment 139 George Mitchell 2023-03-20 22:25:37 UTC
Yes.  It's a compressed tar file with four core.txt files for the price of one.  They are different enough that I thought I'd better attach them all, though mainly the later ones include increasing portions of the earlier ones because they were on immediately successive boots.
Comment 140 Mark Millard 2023-03-20 23:03:19 UTC
(In reply to George Mitchell from comment #137)

All 4 are examples related to dbuf_evict_thread (a.k.a.
zfs dbuf related crashes), as I feared. All 4 look like:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address	= 0x7
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff82600ba6

Looks to be in:

 5    1 0xffffffff82600000   3df128 zfs.ko


 panic: page fault
cpuid = 1
time = 1679349400
KDB: stack backtrace:
#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff810ade4f at trap_pfault+0x4f
#5 0xffffffff81084fd8 at calltrap+0x8
#6 0xffffffff827ac768 at zap_evict_sync+0x68
#7 0xffffffff8267d74a at dbuf_destroy+0xba
#8 0xffffffff82683129 at dbuf_evict_one+0xf9
#9 0xffffffff8267b43d at dbuf_evict_thread+0x31d
#10 0xffffffff80bd8abe at fork_exit+0x7e
#11 0xffffffff8108604e at fork_trampoline+0xe

#6  0xffffffff810ade4f in trap_pfault (frame=0xfffffe00b3bb6d00, 
    usermode=false, signo=<optimized out>, ucode=<optimized out>)
    at /usr/src/sys/amd64/amd64/trap.c:763
#7  <signal handler called>
#8  avl_destroy_nodes (tree=tree@entry=0xfffff8001a80b5a0, 
    cookie=cookie@entry=0xfffffe00b3bb6dd0)
    at /usr/src/sys/contrib/openzfs/module/avl/avl.c:1023
#9  0xffffffff827ac768 in mze_destroy (zap=0xfffff8001a80b480)
    at /usr/src/sys/contrib/openzfs/module/zfs/zap_micro.c:402

A question would be whether this repeats when amdgpu has been
loaded (again, last) but no X11-like activity has ever been
started: limiting amdgpu use to just the load itself, or as
close to that as possible. (This is separate from your zfs
load-time adjustment test.)

My guess is that the content of some memory area(s) is being
trashed in your context. I'm not sure how to track down
what is doing the trashing, or where all the trashed area(s)
are, if that is what is going on.

At least we now have a clue how to get the specific type of
crash. Before I had no clue what an example initial-context
might be like.


Note: Changing the load order should get a matching kldstat
report to indicate the address ranges that end up involved.
Comment 141 Mark Millard 2023-03-20 23:12:37 UTC
(In reply to George Mitchell from comment #139)

The upload did not look compressed to me: I just
had to use tools that would tolerate the binary
content at the start and end. The rest looked like
normal text without me doing anything to decompress
the file.

But, looking, the prefix text does look like a
partially-binary header, likely added by a tool.
The tail end might just be binary padding.

At least I've a clue for next time.
Comment 142 George Mitchell 2023-03-20 23:17:15 UTC
So I should just boot up to multi-user mode and kldload amdgpu, but not start XFCE?  And repeat until it crashes again?
Comment 143 Mark Millard 2023-03-20 23:29:08 UTC
(In reply to George Mitchell from comment #142)

Seeing if that no-XFCE context crashes vs. not would
be a good idea. If it crashes similarly, then XFCE
activity is not likely to be involved. If it does
not crash, then XFCE activity is likely involved.


FYI: all 4 crashes had:

fault virtual address	= 0x7

(the same small offset from a NULL pointer in
C terms). This does not look like random trashing
of memory (for the few examples available).
Comment 144 George Mitchell 2023-03-21 00:07:59 UTC
Created attachment 241027 [details]
Another shutdown-time crash

I got another shutdown-time crash.  The part of this file that is relevant to this crash starts around line 1400; all the earlier stuff appears to be from the crashes earlier today.
Comment 145 Mark Millard 2023-03-21 01:07:55 UTC
(In reply to George Mitchell from comment #144)

Looking at your full list of attachments, it appears that . . .

All the shutdown time crashes have:

fault virtual address	= 0x0

(And we might now have a known type of context
for getting the type of failure: late amdgpu
but no XFCE.)

All the dbuf_evict_thread related crashes have:

fault virtual address	= 0x7

(Late admgpu but having used XFCE.)

All the kldload related crashes have:

Fatal trap 9: general protection fault while in kernel mode
(but no explicit fault address listed)

(Early amdgpu loading.)


My guess is something is trashing memory in a way
that involves writing zeros over some pointer values
that it should not be touching. Later code extracts
such zeros and applies any offset and then tries to
dereference the result, resulting in a crash.

That you got "fault virtual address = 0x0" for shutdown
without having involved XFCE, suggests that a problem is
already in place before XFCE is potentially involved:
XFCE is not required. (XFCE use might lead to more
trashed memory than otherwise, leading to the 0x7
fault address cases.)

But I do not see how to get solid evidence for or
against such a hypothesis (or related ones).

The only thing I can identify that is likely unique to
your context --but is involved with amdgpu-- is the
involvement of the amdgpu_raven_gpu_*.ko modules.

Unfortunately moving your context to a different system
that avoids such module use or finding someone with a
separate system that does have such (and is willing to
set up experiments), is non-trivial for both directions
of testing.

Beyond possibly some checking on the degree/ease of
repeatability, I do not see how to gather better
information, much less get anywhere near directly
actionable information for fixing the crashes.

The one thing we have not looked at is the crash
dumps themselves, examining what memory looks like
and such. But I do not know what to do for that
either, relative to known-useful information. Such a
direction would be very exploratory and likely very
time consuming.
Comment 146 Mark Millard 2023-03-21 19:32:10 UTC
(In reply to Mark Millard from comment #145)

For the:

fault virtual address	= 0x7

examples, it looks like the value stored in RAM has the 0x7
in it instead of being a later offset addition. The loop
in question in avl_destroy_nodes just uses "mov (%rdi),%rdi"
with no offset involved:

NOTE: Loop starts below
   0x0000000000000ba0 <+64>:	mov    %rdi,%rax
   0x0000000000000ba3 <+67>:	mov    %rdx,%rcx
   0x0000000000000ba6 <+70>:	mov    (%rdi),%rdi
   0x0000000000000ba9 <+73>:	mov    %rax,%rdx
   0x0000000000000bac <+76>:	test   %rdi,%rdi
   0x0000000000000baf <+79>:	jne    0xba0 <avl_destroy_nodes+64>
NOTE: The above is the loop end
Comment 147 George Mitchell 2023-03-22 00:23:01 UTC
Created attachment 241046 [details]
Crash at shutdown time

Another occurrence of the crash at shutdown time rather than boot time.  I'm reluctant to post a vmcore file here, but I can make it available to anyone who thinks it will be useful.
Comment 148 Mark Millard 2023-03-22 01:09:47 UTC
(In reply to George Mitchell from comment #147)

That crash is different from all prior ones. It crashed
in nfsd via a:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff80c895cb
stack pointer	        = 0x28:0xfffffe00b555dba0
frame pointer	        = 0x28:0xfffffe00b555dbb0
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 1109 (nfsd)

None of the prior kldstat outputs have shown nfsd as
loaded.

For reference:

panic: general protection fault
cpuid = 1
time = 1679441112
KDB: stack backtrace:
#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff81084fd8 at calltrap+0x8
#5 0xffffffff80c8866b at seltdclear+0x2b
#6 0xffffffff80c88355 at kern_select+0xbd5
#7 0xffffffff80c88456 at sys_select+0x56
#8 0xffffffff810ae6ec at amd64_syscall+0x10c
#9 0xffffffff810858eb at fast_syscall_common+0xf8

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55		__asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c1b7ec in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:487
#3  0xffffffff80c1bc5e in vpanic (fmt=0xffffffff811b2f41 "%s", 
    ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4  0xffffffff80c1ba63 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:844
#5  0xffffffff810addf5 in trap_fatal (frame=0xfffffe00b555dae0, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:944
#6  <signal handler called>
#7  0xffffffff80c895cb in atomic_fcmpset_long (src=18446741877726026240, 
    dst=<optimized out>, expect=<optimized out>)
    at /usr/src/sys/amd64/include/atomic.h:225
#8  selfdfree (stp=stp@entry=0xfffff80012aa8080, sfp=0xfffff80000000007)
    at /usr/src/sys/kern/sys_generic.c:1755
#9  0xffffffff80c8866b in seltdclear (td=td@entry=0xfffffe00b52e9a00)
    at /usr/src/sys/kern/sys_generic.c:1967
#10 0xffffffff80c88355 in kern_select (td=<optimized out>, 
    td@entry=0xfffffe00b52e9a00, nd=7, fd_in=<optimized out>, 
    fd_ou=<optimized out>, fd_ex=<optimized out>, tvp=<optimized out>, 
    tvp@entry=0x0, abi_nfdbits=64) at /usr/src/sys/kern/sys_generic.c:1210
#11 0xffffffff80c88456 in sys_select (td=0xfffffe00b52e9a00, 
    uap=0xfffffe00b52e9de8) at /usr/src/sys/kern/sys_generic.c:1014
#12 0xffffffff810ae6ec in syscallenter (td=0xfffffe00b52e9a00)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#13 amd64_syscall (td=0xfffffe00b52e9a00, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1185
#14 <signal handler called>
#15 0x00000008011a373a in ?? ()

Note: 18446741877726026240 == 0xfffffe00b52e9a00
Comment 149 George Mitchell 2023-03-22 01:15:31 UTC
(In reply to Mark Millard from comment #148)
> None of the prior kldstat outputs have shown nfsd as loaded.
That's because they weren't verbose kldstats.  nfsd is statically linked into the kernel.  kldstat -v definitely shows that nfsd is present.
Comment 150 George Mitchell 2023-03-23 00:30:00 UTC
In order to reconfirm my sincere belief that the key factor in these crashes is amdgpu (and also because I need a respite from the crashes), I'm running without amdgpu (and running X in VESA mode) for a while.  I fully expect that the crashes will stop as a result.
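
For reference, forcing the vesa driver amounts to something like the
following drop-in (a sketch; the file name is arbitrary and the
x11-drivers/xf86-video-vesa package has to be installed):

# /usr/local/etc/X11/xorg.conf.d/10-vesa.conf
Section "Device"
        Identifier "Card0"
        Driver     "vesa"
EndSection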
Comment 151 Mark Millard 2023-03-23 01:21:54 UTC
(In reply to George Mitchell from comment #150)

Sounds appropriate.

"amdgpu" is really the sort of bundle:

23    1 0xffffffff82a00000   417220 amdgpu.ko
24    2 0xffffffff82566000    739e0 drm.ko
25    3 0xffffffff825da000     5220 linuxkpi_gplv2.ko
26    4 0xffffffff825e0000     62d8 dmabuf.ko
27    1 0xffffffff825e7000     c758 ttm.ko
28    1 0xffffffff825f4000     2218 amdgpu_raven_gpu_info_bin.ko
29    1 0xffffffff825f7000     64d8 amdgpu_raven_sdma_bin.ko
30    1 0xffffffff82e18000    2e2d8 amdgpu_raven_asd_bin.ko
31    1 0xffffffff829e0000     93d8 amdgpu_raven_ta_bin.ko
32    1 0xffffffff829ea000     7558 amdgpu_raven_pfp_bin.ko
33    1 0xffffffff829f2000     6558 amdgpu_raven_me_bin.ko
34    1 0xffffffff829f9000     4558 amdgpu_raven_ce_bin.ko
35    1 0xffffffff82e47000     b9c0 amdgpu_raven_rlc_bin.ko
36    1 0xffffffff82e53000    437e8 amdgpu_raven_mec_bin.ko
37    1 0xffffffff82e97000    437e8 amdgpu_raven_mec2_bin.ko
38    1 0xffffffff82edb000    5a638 amdgpu_raven_vcn_bin.ko

I'm still at a loss for getting any improved type of
evidence. Spending time related to the dnetc related
scheduler benchmarking today has been a nice break
from pondering this.
Comment 152 George Mitchell 2023-03-28 14:43:26 UTC
As expected, I have had no crashes since avoiding drm-510-kmod and running in VESA mode.  Might it be worth updating 5.10.163_2 to 5.10.163_3?

Notes I haven't mentioned recently: Prior to FBSD 13, whenever I tried drm-510-kmod, my machine would lock up hard and not respond to anything other than cycling power.  I have an AMD Ryzen 3 2200G with Radeon Vega Graphics running on a Gigabyte B450M DS3H motherboard.  Every time I boot up, I see the following ACPI warnings, which don't otherwise seem to affect operation:

Firmware Warning (ACPI): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20201113/tbfadt-796)
ACPI: \134AOD.WQBA: 1 arguments were passed to a non-method ACPI object (Buffer) (20201113/nsarguments-361)
ACPI: \134GSA1.WQCC: 1 arguments were passed to a non-method ACPI object (Buffer) (20201113/nsarguments-361)

Do any of you understand these?
Comment 153 Mark Millard 2023-03-28 16:50:42 UTC
(In reply to George Mitchell from comment #152)

I'm not sure what all is involved in setting up the VESA
usage test, but it sounds like it was a great test for
isolating the problem to the material associated with
amdgpu loading for your Radeon Vega Graphics context.

Are there any negative consequences to the use of VESA?

If the notes are simple/short could you supply instructions
so that I could try the analogous thing in the Polaris 11
context that I have access to?
Comment 154 Mark Millard 2023-03-28 17:02:36 UTC
(In reply to George Mitchell from comment #152)

Looked at my ACPI boot warning/error messages and I get just (with a little
context shown from the grep for ACPI lines):

acpi_wmi0: <ACPI-WMI mapping> on acpi0
ACPI: \134AOD.WQBA: 1 arguments were passed to a non-method ACPI object (Buffer) (20221020/nsarguments-361)
acpi_wmi1: <ACPI-WMI mapping> on acpi0
ACPI: \134GSA1.WQCC: 1 arguments were passed to a non-method ACPI object (Buffer) (20221020/nsarguments-361)
acpi_wmi2: <ACPI-WMI mapping> on acpi0

But I do not get anything analogous to your reported:

Firmware Warning (ACPI): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20201113/tbfadt-796)

So that last one has some chance of being involved in your context,
since I've been unable to reproduce your problems and the message
is unique to your context. (Only suggestive.)

Any chance that there is an UEFI update available for your machine?
Comment 155 Mark Millard 2023-03-28 22:15:02 UTC
(In reply to Mark Millard from comment #153)

Hmm. I see that:

https://docs.freebsd.org/en/books/handbook/x11/#x-install

reports:

"VESA module must be used when booting in BIOS mode and SCFB module must
be used when booting in UEFI mode."

My context is UEFI so VESA looks to be inappropriate for my context.

Your using BIOS (non-UEFI) vs. my using UEFI (non-BIOS) is another
context difference relative to my not managing to reproduce the
problems.
Comment 156 George Mitchell 2023-03-29 18:34:07 UTC
Ironically, I am presently forced back into using amdgpu.ko because the xorg-server update from 21.1.6,1 to 21.1.7,1 broke the VESA driver (bug #270509).
Comment 157 George Mitchell 2023-03-29 20:34:43 UTC
I forgot to mention earlier: Whenever I start chrome from a terminal window, I see the message:

amdgpu: os_same_file_description couldn't determine if two DRM fds reference the same file description.

Probably not related to this bug, but I thought I'd better mention it.
Comment 158 Graham Perrin freebsd_committer freebsd_triage 2023-04-01 19:42:15 UTC
(In reply to George Mitchell from comment #32)

> … I'm doing this testing on a desktop machine, …

(In reply to George Mitchell from comment #152)

> … not respond to anything other than cycling power. …

In that situation, does the system respond to a normal (not long) press on the power button? 

----

On my everyday notebook here, I have this in sysctl.conf(5): 

hw.acpi.power_button_state="S5"
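
(If it helps, the same setting can be tried on the running system -- a sketch, assuming the sysctl is writable on your hardware:

sysctl hw.acpi.power_button_state=S5

so no reboot is needed to test it.)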
Comment 159 George Mitchell 2023-04-01 19:58:58 UTC
(In reply to Graham Perrin from comment #158)
When I referred to cycling power, I meant by a long press of the power button, which worked just fine (except that I was going to have to run fsck on the next boot).  Also, that was when I was running FBSD 12 and I'm not in a position to repeat that test any more.  Thanks for the input.
Comment 160 Tomasz "CeDeROM" CEDRO 2023-04-13 12:49:11 UTC
I also use vbox + zfs + amdgpu. On 13.2-STABLE I had a kernel panic on vboxdrv / vboxnetadp load. So I switched to 13.1-RELEASE. Now, after upgrading to 13.2, I have this problem again. Maybe related?

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=270809
Comment 161 Mark Millard 2023-04-13 13:39:22 UTC
(In reply to Tomasz "CeDeROM" CEDRO from comment #160)

The package builds via 13.2-RELEASE have not even started yet.

Systems using/needing kernel-specific ports should wait to upgrade
to 13.2-RELEASE until the packages are known to be available if
they are updating via binary packages.

This is normal when a new release happens. FreeBSD does not hold
the release until after the packages are available. 13.1-RELEASE
is still supported for some time but cannot use 13.2-RELEASE based
packages generally.
Comment 162 Tomasz "CeDeROM" CEDRO 2023-04-13 14:03:22 UTC
Thanks Mark :-) The problem is that even a build from ports crashes the kernel on module load :-(
Comment 163 Mark Millard 2023-04-13 14:17:27 UTC
(In reply to Tomasz "CeDeROM" CEDRO from comment #162)

Crashing from having a wrong module vintage for the
kernel is normal/historical as I understand. So,
unfortunately, not anything new.

The package build servers will not start building
based on 13.2-RELEASE until 13.1-RELEASE goes EOL
as I understand. Prior to that building from source
is what is supported when such kernel-dependent
ports are involved. FreeBSD still has some
build-from-source biases in its handling of things.
Resource limitations may well still be forcing such,
for all I know.

So, either wait to use 13.2-RELEASE or build and
install (some) ports via source based builds if
you require ports with kernel-dependent modules.
Comment 164 Mark Millard 2023-04-13 14:32:34 UTC
(In reply to Tomasz "CeDeROM" CEDRO from comment #162)

Sorry that I misinterpreted some of the context/wording.

And nice to see that the 13.1-RELEASE build is rejected
with a message, now that I look again.
Comment 165 George Mitchell 2023-04-16 02:02:10 UTC
Created attachment 241523 [details]
Crash that happened neither at startup nor shutdown

Perhaps not related to my original crash, but undoubtedly a crash that happened in amdgpu code.  I was watching a movie using vlc.  I decided I was finished watching and I typed control-q.  The screen froze with a frame from the movie still showing, and after a few seconds the machine rebooted and saved a coredump, with the attached crash summary that really doesn't resemble any of the earlier ones saved here.  Does anyone have any words of wisdom?

To avoid the startup crash, I had booted to single user mode and had kldloaded vboxnetflt and amdgpu before continuing to multiuser mode.
Comment 166 Tomasz "CeDeROM" CEDRO 2023-04-16 02:26:38 UTC
I got tired of all those VirtualBox problems. I do not really care anymore about that program, or whether its problems are related to amdgpu or zfs. I have switched to bhyve, which can be easily managed from a shell with the vm utility [1]. I recommend doing the same.

[1] https://github.com/churchers/vm-bhyve
Comment 167 George Mitchell 2023-04-16 02:48:46 UTC
This is an amdgpu problem.  Although vboxnetflt is one of the kernel modules that can, in cooperation with amdgpu, exhibit the crash, zfs and acpi_wmi have also exhibited the same failure -- and the most recent crash summary contains no reference to vboxnetflt participating in the crash.  (It does show that I manually typed "kldload vboxnetflt" in single-user mode about an hour and a half before the crash occurred.)
Comment 168 George Mitchell 2023-04-17 22:21:52 UTC
After upgrading to 5.10.163_5 today, I haven't yet had this crash -- but I've booted only a couple of times so far and it's too soon to jump to any conclusions.
Comment 169 George Mitchell 2023-04-25 18:12:20 UTC
Created attachment 241741 [details]
Shutdown crash with version 5.10.163_5

5.10.163_5 still crashes.  This time it was at shutdown time.
Comment 170 George Mitchell 2023-04-25 22:46:16 UTC
Created attachment 241750 [details]
And another plain old boot time crash

I had thought I could artificially provoke the crash by booting to single user mode, loading the amdgpu, zfs, vboxnetflt, and acpi_wmi kernel modules in quick succession, and then continuing to multiuser mode.  But that didn't do it.  So yesterday I went back to the old way of loading zfs with "zfs_enable="YES"" in rc.conf instead of "zfs_load="YES"" in /boot/loader.conf, and loading amdgpu by setting kld_list="amdgpu" in rc.conf.  And now I get the crashes again.
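
For reference, the two arrangements I have been switching between look roughly like this (a sketch from memory rather than verbatim copies of my config files):

# Early loading, from the loader (/boot/loader.conf):
zfs_load="YES"

# Late loading, by rc(8) during startup (/etc/rc.conf):
zfs_enable="YES"
kld_list="amdgpu"

It is the second, rc.conf-driven arrangement that brings the crashes back.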
Comment 171 Mark Millard 2023-04-25 23:02:18 UTC
(In reply to George Mitchell from comment #170)

I'm unclear on the contrasting case: when you use
/boot/loader.conf material instead of /etc/rc.conf
material what happens these days? No crashes?
Fairly rare crashes of the usual types? Fairly
rare crashes of other types? A mix of fairly rare
crashes of the 2 categories? (I may well not be
thinking of everything that would be of note.
So take the questions as just illustrative.)
Comment 172 Mark Millard 2023-04-26 23:24:29 UTC
One of the things that makes this hard to analyze is
that the first failure quickly leads to other failures,
and most of the evidence is for the later failures.
For example, in the following, note that the original
trap number is 12 but the backtrace is for/after
a later trap, of type-number 22 instead. There
is very little information directly about the
original trap type-number 12:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf3727
stack pointer	        = 0x28:0xfffffe000e1a7ba0
frame pointer	        = 0x28:0xfffffe000e1a7bd0
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 1 (init)
trap number		= 12
WARNING !drm_modeset_is_locked(&crtc->mutex) failed at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_4/drivers/gpu/drm/drm_atomic_helper.c:619
. . .
WARNING !drm_modeset_is_locked(&plane->mutex) failed at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_4/drivers/gpu/drm/drm_atomic_helper.c:894
kernel trap 22 with interrupts disabled
                            kernel trap 22 with interrupts disabled
 panic: page fault
cpuid = 0
time = 1682435560
KDB: stack backtrace:
#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff810ade4f at trap_pfault+0x4f
#5 0xffffffff81084fd8 at calltrap+0x8
#6 0xffffffff8261d251 at spl_nvlist_free+0x61
#7 0xffffffff826dd740 at fm_nvlist_destroy+0x20
#8 0xffffffff827b6e95 at zfs_zevent_post_cb+0x15
#9 0xffffffff826dcd02 at zfs_zevent_drain+0x62
#10 0xffffffff826dcbf8 at zfs_zevent_drain_all+0x58
#11 0xffffffff826dede9 at fm_fini+0x19
#12 0xffffffff82713b94 at spa_fini+0x54
#13 0xffffffff827be303 at zfs_kmod_fini+0x33
#14 0xffffffff8262fb3b at zfs_shutdown+0x2b
#15 0xffffffff80c1b76c at kern_reboot+0x3dc
#16 0xffffffff80c1b381 at sys_reboot+0x411
#17 0xffffffff810ae6ec at amd64_syscall+0x10c
. . .

The primary hint about what code execution context led
to the original instance of trap type 12 above is
basically:

instruction pointer	= 0x20:0xffffffff80bf3727

amdgpu does not leave in place a clean context for
debugging kernel crashes. Trying to keep the video
context operational for a kernel that has crashed,
without messing up the analysis context for the
original problem, is problematic.

My guess would be that normal analysis of such tries
to have the problem occur in a virtual machine sort
of context where another (outer) context is available
that is independent and can look at the details from
outside the failing context. But even that would
require the failing context in the VM to stop before
amdgpu or the like messed up the evidence in the VM.
(Not that I've ever done that type of evidence
gathering.)
Comment 173 George Mitchell 2023-04-27 00:17:34 UTC
Here are a collection of points in response to Mark Millard's request.

1. Regardless of the order in which I load kernel modules by hand in single-user mode, I can't ever duplicate the crash.

2. The crash never happens if amdgpu.ko is not loaded.

3. Emmanuel Vadot categorically states that the many, many references to drm_modeset_is_locked failures in the crash summaries are noise caused by virtual terminal switching and don't indicate drm failures.  But I still get crashes even when there are no virtual terminal switches (because I didn't start X and I didn't type ALT-Fn).

4. The crash always happens after amdgpu.ko is loaded, and (in terms of time of occurrence) at about the time vboxnetflt.ko or acpi_wmi.ko is loaded.  The seeming zfs crash can happen even when zfs.ko is loaded before amdgpu.ko, and I theorize that it happens when my large (1TB) USB ZFS-formatted drive comes on line and gets tasted (after amdgpu.ko is loaded).

5. But I can't come up with any theory in which I can blame the actual crash on vboxnetflt.ko, acpi_wmi.ko, or zfs.ko.  This bug should not be assigned to freebsd-fs. But I can't tell you to whom it should be assigned.
Comment 174 George Mitchell 2023-05-17 22:13:48 UTC
Since my last note on April 27, I have been booting up in this manner:

1. Boot to single user mode.
2. Run a script that loads amdgpu.ko, zfs.ko, vboxnetflt.ko, and acpi_wmi.ko in immediate succession.
3. Exit to multiuser mode.

In the course of roughly 50-60 bootups, there have been only two crashes during single user mode, but regrettably they leave no trace because the root partition is still mounted read-only.  At least I think that's why there's no dump.  So something about single-user mode makes the crash much less likely to occur.  Anyway, jumping through these hoops does enable me to run my graphics with the improved driver.
Comment 175 Graham Perrin freebsd_committer freebsd_triage 2023-05-17 23:26:52 UTC
(In reply to George Mitchell from comment #174)

> … crashes during single user mode, but regrettably they leave no trace 
> … the root partition is still mounted read-only. …

Hint (whilst in single-user mode): 

mount -uw / && zfs mount -a


sysrc dumpdev

– you'll probably find a different device, typically the swap partition. 


sysrc dumpdir

– you'll probably find /var/crash.


service dumpon describe

– if you boot in single user mode after a kernel panic, then /var/crash will not yet include information about the panic.


service savecore describe
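
Putting those together, a rough single-user-mode sequence might be (a sketch; <dumpdev> is a placeholder for whatever sysrc dumpdev reports):

mount -uw / && zfs mount -a

sysrc dumpdev
sysrc dumpdir

savecore /var/crash /dev/<dumpdev>

– the last step extracts any pending dump from the dump device into /var/crash.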
Comment 176 Vladimir Druzenko freebsd_committer freebsd_triage 2024-07-14 21:57:54 UTC
Is this still relevant?
Comment 177 George Mitchell 2024-07-14 23:22:40 UTC
Yes, even after updating to 13.3-RELEASE-p4.  I'm not brave enough yet to upgrade to 14.   I work around the problem by booting in single-user mode, running this script:

#!/bin/sh
mount -u /
mount -r /usr
kldload amdgpu.ko
kldload zfs.ko
kldload vboxnetflt.ko
kldload acpi_wmi.ko
sleep 3
mount -u /usr

which 99% of the time doesn't crash, and then exiting to multiuser.  I haven't yet figured out how to get a crash dump with /, /tmp/, and /var/ mounted R/W (they're all on one physical partition) and /usr/ mounted RO.

Probably irrelevant fact: every time I start chrome from the command line, I get the message:

amdgpu: os_same_file_description couldn't determine if two DRM fds reference the same file description.
If they do, bad things may happen!

But in fact there seem to be no ill effects.
Comment 178 Vladimir Druzenko freebsd_committer freebsd_triage 2024-07-15 10:13:46 UTC
(In reply to George Mitchell from comment #177)
I'm from vbox@ and, after partially reading the comments, I would say that it is unlikely that the root of the problem is VirtualBox.
It looks like the problem is with amdgpu. In the extreme case, it could be a fundamental issue in the kernel's handling of modules. Or an intermittent hardware issue…

There is graphics/drm-515-kmod for 14.0+ and graphics/drm-61-kmod for 14.1+.
Maybe you can check that without upgrading to 14.1.
Comment 179 George Mitchell 2024-07-15 14:26:23 UTC
Of course it's highly unlikely that the problem is in VirtualBox -- or zfs or acpi_wmi.  But that's a minority view here.  If I can get a proper core dump when in single-user mode (with RO /usr), it would surely clarify the issue.

It seems unlikely in the extreme to me that the 14 version would work with 13, given that a 13.2 compile of the port would not work with a 13.3 kernel.
Comment 180 Vladimir Druzenko freebsd_committer freebsd_triage 2024-07-15 15:41:47 UTC
(In reply to George Mitchell from comment #179)
> It seems unlikely in the extreme to me that the 14 version would work with 13, given that a 13.2 compile of the port would not work with a 13.3 kernel.
I meant running 14.1 without upgrading the current system (install it on a different, empty HDD/SSD) in order to test a more recent version of amdgpu.
Comment 181 George Mitchell 2024-12-08 15:50:46 UTC
After upgrading my system from 13.3-RELEASE-p8 to 13.4-RELEASE-p2 and recompiling drm-510-kmod (getting version drm-510-kmod-5.10.163_10 as reported by pkg info, despite the different version string seen below), I got a crash in single-user mode that at least left intelligible text on the screen, though I did not get a dump.  Here, manually transcribed (unfortunately), are the things I saw:

(once) Fatal Trap 9 in kernel mode
(five times) kernel trap 22 with interrupts disabled

Backtrace:
kdb_backtrace
vpanic
panic
trap_fatal
call_trap
linker_load_dependencies
link_elf_load_file
linker_load_module
kern_kldload
sys_kldload
amd64_syscall
fast_syscall_common

The string "v5.10.163_7" once.  I don't know where it came from, and my system log definitely says:

Dec  7 18:04:33 court pkg[1782]: drm-510-kmod-5.10.163_9 deinstalled
Dec  7 18:04:41 court pkg[1786]: drm-510-kmod-5.10.163_10 installed

That's all I have, regrettably.  Probably before the end of the year I will be upgrading to 14-RELEASE.  Perhaps this helps.

Despite the involvement of the zfs, vboxnetflt, and acpi_wmi kernel modules, none of those modules ever causes any trouble when drm-510-kmod is not present, and my (long) software engineering experience tells me that the simplest explanation is a bug in drm-510-kmod.
Comment 182 George Mitchell 2024-12-08 18:12:22 UTC
I forgot to mention the most obvious thing on the screen, which has been the hallmark of this crash all along: fifty-plus repetitions of the famous "WARNING !drm_modeset_is_locked(&crtc->mutex) failed" message.
Comment 183 Mark Millard 2024-12-08 21:20:06 UTC
(In reply to George Mitchell from comment #182)

Over the 2+ years of failures, what is just the first failure-indicating
message from each failing boot? Well, likely you could only approximate
that. The point is to try to ignore later messages from failing boots
that could just be consequences of prior failure activity for which
there is already evidence.

For example, if "WARNING !drm_modeset_is_locked(&crtc->mutex) failed"
is never first, it is less likely to be of interest. But if it is
always first it could be more likely to be of interest. (This is
just for illustration, not special to the specific message.)

Going the other way, again just as an example message: does it
sometimes occur even when no overall failure happens? Such would also
make such a message somewhat less likely to be of interest.

Of course, different boots might not get the same kind of first
failure-indicating message. But the list and relative frequency
of occurrence might be of some use.

Another issue could be that you might not have good evidence
of the first failure-indicating message from the failing boot
attempts: no way to answer the question then.
Comment 184 George Mitchell 2024-12-08 22:43:04 UTC
To answer many of your questions about the drm_modeset_is_locked message, may I direct your attention to the second attachment to the bug: https://bz-attachments.freebsd.org/attachment.cgi?id=238849.  TL;DR: It's never literally first, but it's pretty close to first and there are always multiple, multiple copies of it that are impossible to ignore.
Comment 185 George Mitchell 2024-12-08 23:56:29 UTC
And I buried the lede.  I never see any evidence of drm_modeset_is_locked during normal operation.
Comment 186 Mark Millard 2024-12-09 01:26:50 UTC
(In reply to George Mitchell from comment #184)

Then I'm afraid some information from earlier in the
failure sequence will prove essential to identifying
and fixing the overall problem: Necessary --but such
information need not be sufficient on its own.
Comment 187 Mark Millard 2024-12-09 01:45:44 UTC
(In reply to Mark Millard from comment #186)

Which of the following also happen for boots where there is
no evidence of a problem?

[drm] BIOS signature incorrect 53 7
. . .
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
. . .
ACPI: \134AOD.WQBA: 1 arguments were passed to a non-method ACPI object (Buffer) (20201113/nsarguments-361)
. . .
acpi_wmi1: cannot find EC device
. . .
ACPI: \134GSA1.WQCC: 1 arguments were passed to a non-method ACPI object (Buffer) (20201113/nsarguments-361)
driver bug: Unable to set devclass (class: ppc devname: (unknown))


Do any of those only happen when there is a failure?
Comment 188 Mark Millard 2024-12-09 01:48:49 UTC
(In reply to Mark Millard from comment #187)

I managed to not list as one of the messages to
ask about:

Firmware Warning (ACPI): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20201113/tbfadt-796)
Comment 189 George Mitchell 2024-12-09 14:33:18 UTC
ALL of the messages you cited occur all of the time.
Comment 190 John F. Carr 2024-12-09 14:43:04 UTC
From the stack trace:

#7  <signal handler called>
#8  avl_destroy_nodes (tree=tree@entry=0xfffff8001b6c0420, 
    cookie=cookie@entry=0xfffffe00b3bcddd0)
    at /usr/src/sys/contrib/openzfs/module/avl/avl.c:1023

I used to have a repeatable crash with a bad pointer in ZFS AVL code.  See bug #268909.  Eventually it went away, whether due to disk data changes or bug fixes I can't say.  I did not have any graphics drivers loaded.
Comment 191 Mark Millard 2024-12-09 17:21:14 UTC
(In reply to John F. Carr from comment #190)

Note the attachment named "Crash without any use of ZFS, with acpi_wmi"
( see also comment #107 and "Relevant part of /var/log/messages" ).
There is not even the likes of:

ZFS filesystem version: 5
ZFS storage pool version: features support (5000)

in the log, much less a backtrace with zfs content.

Also, "Latest crash dump" is one that makes no mention of
ZFS in its backtraces, as I remember.

Each module that exposed the failure had a similar "does not
need to be in the backtrace for a failure to occur" status, at
least when other failure-reporting code was involved.
My memory of the history is that the failure never happened
without drm-510-kmod being in use, but the initial exposure
of the problem was never via a backtrace involving
drm-510-kmod.
Comment 192 George Mitchell 2024-12-10 14:46:41 UTC
I got the crash again.  I still haven't figured out how to get a dump in single user mode, but at least I got pictures:

https://www.m5p.com/public/george/267028/IMG_20241210_093221099.jpg
https://www.m5p.com/public/george/267028/IMG_20241210_093239744_HDR.jpg

I'm sorry for the quality of the first one; I'll try to get a better one on the next crash.  Apparently, now that I'm on 13.4 instead of 13.3 (and with the latest version of the kernel module), I'll get at least the text screen instead of the immediate reboot.
Comment 193 Mark Millard 2024-12-10 16:02:31 UTC
(In reply to George Mitchell from comment #192)

Viewed at actual size or zoomed in on an iPad I was able
to read some of the text in:

https://www.m5p.com/public/george/267028/IMG_20241210_093239744_HDR.jpg

. . .
vgapci0: child drmn0 requested pci_get_powerstate
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
<6>[drm] Initialized amdgpu 3.40.0 20150101 for drmn0 on minor 0

It panicked here instead of getting to the point of showing what
would normally be next: "Autoloading module: acpi_wmi"

However, the <6> is also not normal and may well be significant.

The panic was something like:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1: apic id = 01
instruction pointer = 0x20:0xffffffff00cf0110
. . .
current process = 25 (kldload)
trap number = 9

(I doubt that I could tell c vs. e or 0 vs. 8 .)

In the presence of the panic context I do not expect the drm_modeset
locking is able to operate correctly. In other words: I expect that
the !drm_modeset_is_locked notices are expected, given the Fatal
trap 9 and its being handled in the kernel. That in turn messes
things up even more for later --or so I guess.

I expect that the later trap 22 (in the other picture) is from the
lack of handling during the attempt to present the trap 9 information.

The other picture's backtrace suggests for the instruction
pointer reported above (general range of upper address bits):

instruction pointer = 0x2?:0xffffffff80?f?11?

where the "?"'s are about the 0 vs. 8 and c vs. e question.
Comment 194 Mark Millard 2024-12-10 16:16:38 UTC
I'll note that drm-510-kmod has the status:

 .if ${OPSYS} == FreeBSD && ${OSVERSION} >= 1401501
 IGNORE=		not supported on FreeBSD 14.2 and higher
 .endif

drm-515-kmod has the status:

 .if ${OPSYS} == FreeBSD && ${OSVERSION} < 1400081
 IGNORE=		not supported on older than 14.0, no kernel support
 .endif

drm-61-kmod has the status:

 .if ${OPSYS} == FreeBSD && !( ${OSVERSION} >= 1500008 || ( ${OSVERSION} >= 1400508 && ${OSVERSION} < 1500000 ))
 IGNORE=		not supported on older than 14-STABLE 1400508, no kernel support
 .endif


So drm-510-kmod will not last through all of 14.* .
Comment 195 Mark Millard 2024-12-10 17:03:05 UTC
(In reply to George Mitchell from comment #192)

As for having a swap partition ready for use for dumping
as early as possible: assign dumpdev in /boot/loader.conf .
An example from one of my contexts:

# grep -i dump /boot/loader.conf
dumpdev="/dev/gpt/OptBswp364"

The kernel does have to be far enough along to put it to
use but the above is the only way to have things configured
for the earliest point at which dumping is supported, as
far as I know. For dumping to swap partitions and this
earlier time frame, to my knowledge single user mode vs.
not boot attempts would not make a difference.

Actually, the above may do one thing that interferes:
/dev/gpt/ use might not be the best idea for earliest
possible dumps. Use an exact device reference instead. In
my context, as it is now, /dev/gpt/OptBswp364 is actually
/dev/nda2p2 . So:

dumpdev="/dev/nda2p2"

would likely be better for earliest-dump-possible in my
context.
Comment 196 George Mitchell 2024-12-10 17:36:08 UTC
Mark, I honestly thank you sincerely for the suggestion of putting dumpdev in /boot/loader.conf -- that is almost certainly the piece of the puzzle I have been missing!  I've added

dumpdev="/dev/ada0p2"

And we'll see what happens.  Presumably things are far enough along at that point that it can find /var/crash in that partition.

I already knew that I would have to go to drm-515-kmod when updating to 14.2-RELEASE.  But I started having problems with this hardware at least a couple of iterations of operating system and display driver back.  I'll be pleasantly surprised if drm-515-kmod does better.
Comment 197 Mark Millard 2024-12-10 18:23:47 UTC
(In reply to George Mitchell from comment #196)

I think I may have guessed wrong about the staging
that you were referring to:

A) Before the reboot the dump is put in the swap partition
  as raw data. There is a dump command that can be used
  at the db> prompt.

B) Then rebooting analyzes the dump in the swap partition
  and puts the information in /var/crash .

My notes did not cover (B), just (A).
Comment 198 George Mitchell 2024-12-10 18:33:47 UTC
My swap partition is /dev/ada0p3; is that what I should set dumpdev to?
Comment 199 Mark Millard 2024-12-10 18:54:26 UTC
(In reply to George Mitchell from comment #198)
(In reply to George Mitchell from comment #196)

For getting the crash dump out of the swap partition
and into /var/crash ( replacing the: ???? ) with
its analysis:

# savecore /var/crash /dev/????
# crashinfo -b 

(I think that last will automatically pick up the latest
saved core. Otherwise add the path to the dump file after
the -b argument.)

See:

man 8 savecore
man 8 crashinfo

Note that /var/crash/ would need to be writable. For
single user mode, you first need to deal with making
it writable as I remember. This gets into your
partitioning and mount point usage and ZFS vs. UFS.
If everything is in one UFS partition, then:

# mount -w /

should be sufficient, as an example.
Comment 200 Mark Millard 2024-12-10 19:05:14 UTC
(In reply to George Mitchell from comment #196)

I do not know if you have all the right tools
installed for crashinfo: you need devel/gdb
(which installs kgdb as well).

QUOTE
Once crashinfo has located a core dump and kernel, it uses several
     utilities to analyze the core including dmesg(8), fstat(1), iostat(8),
     ipcs(1), kgdb(1) (ports/devel/gdb), netstat(1), nfsstat(1), ps(1),
     pstat(8), and vmstat(8).  Note that kgdb must be installed from the
     devel/gdb port or gdb package.
END QUOTE
Comment 201 Mark Millard 2024-12-10 19:10:06 UTC
(In reply to George Mitchell from comment #198)

/boot/loader.conf should have:

dumpdev="/dev/ada0p3"

You may have to explicitly request the dump at the db>
prompt.
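
At the db> prompt that explicit request would look roughly like
(a sketch; see ddb(4) for details):

db> dump
db> reset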
Comment 202 Mark Millard 2024-12-10 20:44:59 UTC
Are you building and installing your own drm-510-kmod?
vs.
Are you using the FreeBSD packaged drm-510-kmod?

The official 13.*-RELEASE builds are still built via
jail 133amd64-default . In other words: the builds
are for 13.3-RELEASE, not for 13.4-RELEASE .

For now, when you are running 13.4-RELEASE you need
to have built drm-510-kmod in a 13.4-RELEASE context
(such as via a 13.4-RELEASE poudriere jail) and to
have installed that build. This includes making sure
that your poudriere jail content is 13.4-RELEASE,
not 13.3-RELEASE .
Comment 203 George Mitchell 2024-12-10 20:57:19 UTC
In reply to comment #202, allow me to refer you to comment #181.

(Off topic: this bug sets my personal record for bug with most comments.)
Comment 204 Mark Millard 2024-12-10 22:55:45 UTC
(In reply to George Mitchell from comment #203)

Just for paranoia, what did the install message show for
%%OPSYS%% and %%OSREL%% in the new installation message
text:

+Please note that this package was built for %%OPSYS%% %%OSREL%%.
+If this is not your current running version, please rebuild
+it from ports to prevent panics when loading the module.

?
Comment 205 George Mitchell 2024-12-10 23:10:07 UTC
pkg info -D drm-510-kmod
[...]
Please note that this package was built for FreeBSD 13.4.
If this is not your current running version, please rebuild
it from ports to prevent panics when loading the module.



With regard to getting crash dumps, I've never (before this bug) had to do anything other than rely on 'dumpdev="AUTO"' to do the right thing automagically, so that's why I've had trouble with that.  But I had figured out that I needed my root partition to be writable; see comment #177.
Comment 206 Mark Millard 2024-12-11 02:18:08 UTC
(In reply to George Mitchell from comment #205)

With the over 2 years of effort at creating your "personal
record for bug with most comments" (with my help!), I've
gotten to the point that I no longer search through it all
to find a piece of information when I do not remember
whether it is present or not.

It was easier to remember back when I was more active for
the issue: I've jumped from 2023-04-26 23:24:29 UTC to
2024-12-08 21:20:06 UTC with no involvement in the
middle, not normally having drm-*-kmod in use in my
environment and not having to deal with the issue in my
environment.


One issue that the dump will likely have is being made
after multiple traps (such as 22's after the 9) and
such, instead of just after the initial one. The later
activity will likely corrupt some of the information
from the initial fault's time frame.

It would be nice if one could configure things so that the
initial trap (likely the 9) initiated the dump directly
and got to the ddb> prompt after the dump was done,
then allowing for a reboot.
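
Something along those lines may be possible via ddb(8) scripting,
assuming the initial trap actually reaches the debugger:
ddb_enable="YES" in /etc/rc.conf loads the scripts in /etc/ddb.conf
at boot, and the stock script already ends in a dump and a reset.
Roughly (from memory, so a sketch rather than the exact stock
contents; "textdump set" selects a textdump instead of a full
vmcore and could be dropped):

# /etc/rc.conf
ddb_enable="YES"

# /etc/ddb.conf
script kdb.enter.default=textdump set; capture on; run lockinfo; show pcpu; bt; ps; alltrace; dump; reset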
Comment 207 Mark Millard 2024-12-11 03:33:44 UTC
For the trap 9's:

instruction pointer = 0x2?:0xffffffff80?f?11?

Looking at the kernel code around:

0xffffffff80cf011?, I find that the code in that area is
in qsort. The old comment #121 found such as well:

   0xffffffff80cf00ff <+6047>:	jae    0xffffffff80cf0470 <qsort+6928>
   0xffffffff80cf0105 <+6053>:	mov    %rbx,%rax
   0xffffffff80cf0108 <+6056>:	shr    $0x2,%rax
   0xffffffff80cf010c <+6060>:	mov    %rbx,%r15
   0xffffffff80cf010f <+6063>:	shr    $0x3,%r15
   0xffffffff80cf0113 <+6067>:	lea    -0x1(%rbx),%rdx
   0xffffffff80cf0117 <+6071>:	mov    %rdx,-0xa0(%rbp)
   0xffffffff80cf011e <+6078>:	lea    -0x1(%rax),%rdx
   0xffffffff80cf0122 <+6082>:	mov    %rdx,-0x98(%rbp)

(Not that the code details inside qsort match.)

Other alternatives:

(kgdb) disass 0xffffffff80cf8110
Dump of assembler code for function deflate_slow:
   0xffffffff80cf80f8 <+1048>:	je     0xffffffff80cf812b <deflate_slow+1099>
   0xffffffff80cf80fa <+1050>:	mov    0x18(%r13),%rdi
   0xffffffff80cf80fe <+1054>:	mov    0x20(%r15),%rsi
   0xffffffff80cf8102 <+1058>:	mov    %r12d,%edx
   0xffffffff80cf8105 <+1061>:	call   0xffffffff80cfeea0 <zmemcpy>
   0xffffffff80cf810a <+1066>:	mov    %r12d,%eax
   0xffffffff80cf810d <+1069>:	add    %rax,0x18(%r13)
   0xffffffff80cf8111 <+1073>:	add    %rax,0x20(%r15)
   0xffffffff80cf8115 <+1077>:	add    %rax,0x28(%r13)
   0xffffffff80cf8119 <+1081>:	sub    %r12d,0x20(%r13)
   0xffffffff80cf811d <+1085>:	sub    %rax,0x28(%r15)
   0xffffffff80cf8121 <+1089>:	jne    0xffffffff80cf812b <deflate_slow+1099>

(kgdb) disass 0xffffffff80ef0110
Dump of assembler code for function mac_vnode_check_write_impl:
   0xffffffff80ef00f7 <+71>:	je     0xffffffff80ef00e0 <mac_vnode_check_write_impl+48>
   0xffffffff80ef00f9 <+73>:	mov    0x188(%rbx),%rcx
   0xffffffff80ef0100 <+80>:	mov    %r12,%rdi
   0xffffffff80ef0103 <+83>:	mov    %r14,%rsi
   0xffffffff80ef0106 <+86>:	mov    %rbx,%rdx
   0xffffffff80ef0109 <+89>:	call   *%rax
   0xffffffff80ef010b <+91>:	mov    %eax,%edi
   0xffffffff80ef010d <+93>:	mov    %r15d,%esi
   0xffffffff80ef0110 <+96>:	call   0xffffffff80edefb0 <mac_error_select>
   0xffffffff80ef0115 <+101>:	mov    %eax,%r15d
   0xffffffff80ef0118 <+104>:	jmp    0xffffffff80ef00e0 <mac_vnode_check_write_impl+48>
   0xffffffff80ef011a <+106>:	cmpq   $0x0,0x11d029e(%rip)        # 0xffffffff820c03c0 <mac_policy_list>
   0xffffffff80ef0122 <+114>:	je     0xffffffff80ef017f <mac_vnode_check_write_impl+207>


(kgdb) disass 0xffffffff80ef8110
Dump of assembler code for function ffs_blkfree_cg:
   0xffffffff80ef80fa <+106>:	jbe    0xffffffff80ef81aa <ffs_blkfree_cg+282>
   0xffffffff80ef8100 <+112>:	mov    %rdi,-0x30(%rbp)
   0xffffffff80ef8104 <+116>:	mov    0x38(%rax),%r15
   0xffffffff80ef8108 <+120>:	lea    -0x38(%rbp),%r8
   0xffffffff80ef810c <+124>:	lea    -0x98(%rbp),%r9
   0xffffffff80ef8113 <+131>:	mov    %rbx,%rdi
   0xffffffff80ef8116 <+134>:	mov    %r10,-0x58(%rbp)
   0xffffffff80ef811a <+138>:	mov    %r10,%rsi
   0xffffffff80ef811d <+141>:	mov    %rdx,-0x48(%rbp)
   0xffffffff80ef8121 <+145>:	mov    $0x80,%ecx
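
As an aside: rather than paging through disassembly, kgdb can map a
candidate address directly to the symbol containing it (assuming the
kernel.debug symbols match the running kernel), for example:

(kgdb) info symbol 0xffffffff80cf0110
(kgdb) info line *0xffffffff80cf0110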
Comment 208 Mark Millard 2024-12-11 07:29:19 UTC
Hmm. The backtrace in:

https://www.m5p.com/public/george/267028/IMG_20241210_093221099.jpg

is incoherent.

Dump of "disass/s" code report for function link_elf_load_file:
/usr/src/sys/kern/link_elf.c:
952	{
   0xffffffff80c1e5a0 <+0>:	push   %rbp
. . .
   0xffffffff80c1ef00 <+2400>:	jmp    0xffffffff80c1ecf4 <link_elf_load_file+1876>
End of assembler dump.

Compare that +2400 to the backtrace's:

link_elf_load_file+0x115c (note: 0x115c == 4444)
Comment 209 George Mitchell 2024-12-11 14:40:56 UTC
Mark, I sincerely appreciate the help you have provided on this bug.  My major character flaw is a tendency to lapse into sarcasm at the drop of a hat.  I'm working on it.
Comment 210 Mark Millard 2024-12-11 17:45:06 UTC
Does your 13.4-RELEASE environment have the likes of:

/usr/lib/debug/boot/kernel/kernel.debug

in addition to /boot/kernel/kernel ?

If yes, is the kernel.debug file up to date with the kernel file?
Comment 211 George Mitchell 2024-12-11 18:02:21 UTC
The two files have the same time stamp (Dec  6 00:40, when I updated to 13.4), so I assume they are in sync.
Comment 212 Mark Millard 2024-12-11 23:44:03 UTC
For a successful boot, could you report the local equivalent of:

# kgdb # So: the live system (not what I was actually doing)
. . .
(kgdb) disass btext
Dump of assembler code for function btext:
   0xffffffff8038e000 <+0>:	push   $0x2
   0xffffffff8038e002 <+2>:	popf
   0xffffffff8038e003 <+3>:	mov    %rsp,%rbp
   0xffffffff8038e006 <+6>:	mov    0x4(%rbp),%edi
   0xffffffff8038e009 <+9>:	mov    0x8(%rbp),%esi
   0xffffffff8038e00c <+12>:	mov    $0xffffffff81d84580,%rsp
   0xffffffff8038e013 <+19>:	xor    %ebp,%ebp
   0xffffffff8038e015 <+21>:	call   0xffffffff8108dca0 <hammer_time>
   0xffffffff8038e01a <+26>:	mov    %rax,%rsp
   0xffffffff8038e01d <+29>:	call   0xffffffff80b7d260 <mi_startup>
   0xffffffff8038e022 <+34>:	hlt
   0xffffffff8038e023 <+35>:	jmp    0xffffffff8038e022 <btext+34>
   0xffffffff8038e025 <+37>:	cs nopw 0x0(%rax,%rax,1)
   0xffffffff8038e02f <+47>:	nop
End of assembler dump.
(kgdb) info files
Symbols from "/usr/home/root/artifacts/13.4R/usr/lib/debug/boot/kernel/kernel.debug".
Local exec file:
	`/usr/home/root/artifacts/13.4R/boot/kernel/kernel', file type elf64-x86-64-freebsd.
	Entry point: 0xffffffff8038e000
	0xffffffff802002a8 - 0xffffffff802002b5 is .interp
	0xffffffff802002b8 - 0xffffffff802310f0 is .hash
	0xffffffff802310f0 - 0xffffffff8025f9c0 is .gnu.hash
	0xffffffff8025f9c0 - 0xffffffff802f2450 is .dynsym
	0xffffffff802f2450 - 0xffffffff8036d0c4 is .dynstr
	0xffffffff8036d0c8 - 0xffffffff8038da68 is .rela.dyn
	0xffffffff8038e000 - 0xffffffff811863f8 is .text
	0xffffffff81186400 - 0xffffffff817f8c20 is .rodata
. . . (past the first .text section anyway) . . .

Note that my output above is for the kernel file instead of
for a live system. For all I know a live system might relocate
sections to distinct address ranges vs. the above. That is
part of what I'm checking for your context.

Also:

(kgdb) disass 0xffffffff80cf0110
. . . (various pages later) . . .
A range of  lines spanning at least from somewhat before
0xffffffff80cf011? to somewhat after it.
. . .

(The above will likely name what function it is displaying
-- the one that spans the address listed.)
Comment 213 George Mitchell 2024-12-11 23:53:15 UTC
This is what I get:

kgdb
GNU gdb (GDB) 15.1 [GDB v15.1 for FreeBSD]
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd13.4".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...
Reading symbols from /boot/kernel/sem.ko...
Reading symbols from /usr/lib/debug//boot/kernel/sem.ko.debug...
Reading symbols from /boot/modules/if_re.ko...
(No debugging symbols found in /boot/modules/if_re.ko)
Reading symbols from /boot/kernel/fusefs.ko...
Reading symbols from /usr/lib/debug//boot/kernel/fusefs.ko.debug...
Reading symbols from /boot/modules/amdgpu.ko...
(No debugging symbols found in /boot/modules/amdgpu.ko)
Reading symbols from /boot/modules/drm.ko...
(No debugging symbols found in /boot/modules/drm.ko)
Reading symbols from /boot/kernel/iic.ko...
Reading symbols from /usr/lib/debug//boot/kernel/iic.ko.debug...
Reading symbols from /boot/modules/linuxkpi_gplv2.ko...
(No debugging symbols found in /boot/modules/linuxkpi_gplv2.ko)
--Type <RET> for more, q to quit, c to continue without paging--
Reading symbols from /boot/modules/dmabuf.ko...
(No debugging symbols found in /boot/modules/dmabuf.ko)
Reading symbols from /boot/modules/ttm.ko...
(No debugging symbols found in /boot/modules/ttm.ko)
Reading symbols from /boot/modules/amdgpu_raven_gpu_info_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_gpu_info_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_sdma_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_sdma_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_asd_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_asd_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_ta_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_ta_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_pfp_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_pfp_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_me_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_me_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_ce_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_ce_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_rlc_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_rlc_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_mec_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_mec_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_mec2_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_mec2_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_vcn_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_vcn_bin.ko)
Reading symbols from /boot/kernel/zfs.ko...
Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...
Reading symbols from /boot/kernel/netgraph.ko...
Reading symbols from /usr/lib/debug//boot/kernel/netgraph.ko.debug...
Reading symbols from /boot/kernel/acpi_wmi.ko...
--Type <RET> for more, q to quit, c to continue without paging--
Reading symbols from /usr/lib/debug//boot/kernel/acpi_wmi.ko.debug...
Reading symbols from /boot/kernel/intpm.ko...
Reading symbols from /usr/lib/debug//boot/kernel/intpm.ko.debug...
Reading symbols from /boot/kernel/smbus.ko...
Reading symbols from /usr/lib/debug//boot/kernel/smbus.ko.debug...
Reading symbols from /boot/kernel/uhid.ko...
Reading symbols from /usr/lib/debug//boot/kernel/uhid.ko.debug...
Reading symbols from /boot/kernel/usbhid.ko...
Reading symbols from /usr/lib/debug//boot/kernel/usbhid.ko.debug...
Reading symbols from /boot/kernel/hidbus.ko...
Reading symbols from /usr/lib/debug//boot/kernel/hidbus.ko.debug...
Reading symbols from /boot/kernel/wmt.ko...
Reading symbols from /usr/lib/debug//boot/kernel/wmt.ko.debug...
Reading symbols from /boot/kernel/ums.ko...
Reading symbols from /usr/lib/debug//boot/kernel/ums.ko.debug...
Reading symbols from /boot/kernel/autofs.ko...
Reading symbols from /usr/lib/debug//boot/kernel/autofs.ko.debug...
Reading symbols from /boot/kernel/mac_ntpd.ko...
Reading symbols from /usr/lib/debug//boot/kernel/mac_ntpd.ko.debug...
Reading symbols from /boot/kernel/green_saver.ko...
Reading symbols from /usr/lib/debug//boot/kernel/green_saver.ko.debug...
sched_switch (td=td@entry=0xffffffff82043780 <thread0_st>, flags=flags@entry=260) at /usr/src/sys/kern/sched_4bsd.c:1085
1085			SDT_PROBE0(sched, , , on__cpu);
(kgdb) disass btext
Dump of assembler code for function btext:
   0xffffffff8038e000 <+0>:	push   $0x2
   0xffffffff8038e002 <+2>:	popf
   0xffffffff8038e003 <+3>:	mov    %rsp,%rbp
   0xffffffff8038e006 <+6>:	mov    0x4(%rbp),%edi
   0xffffffff8038e009 <+9>:	mov    0x8(%rbp),%esi
   0xffffffff8038e00c <+12>:	mov    $0xffffffff81d83880,%rsp
   0xffffffff8038e013 <+19>:	xor    %ebp,%ebp
   0xffffffff8038e015 <+21>:	call   0xffffffff8108bca0 <hammer_time>
   0xffffffff8038e01a <+26>:	mov    %rax,%rsp
   0xffffffff8038e01d <+29>:	call   0xffffffff80b7d260 <mi_startup>
   0xffffffff8038e022 <+34>:	hlt
   0xffffffff8038e023 <+35>:	jmp    0xffffffff8038e022 <btext+34>
   0xffffffff8038e025 <+37>:	cs nopw 0x0(%rax,%rax,1)
   0xffffffff8038e02f <+47>:	nop
End of assembler dump.
(kgdb) info files
Symbols from "/boot/kernel/kernel".
Kernel core dump file:
	`/dev/mem', file type FreeBSD kernel vmcore.
Local exec file:
	`/boot/kernel/kernel', file type elf64-x86-64-freebsd.
	Entry point: 0xffffffff8038e000
	0xffffffff802002a8 - 0xffffffff802002b5 is .interp
	0xffffffff802002b8 - 0xffffffff80231108 is .hash
	0xffffffff80231108 - 0xffffffff8025f9e4 is .gnu.hash
	0xffffffff8025f9e8 - 0xffffffff802f24c0 is .dynsym
	0xffffffff802f24c0 - 0xffffffff8036d162 is .dynstr
	0xffffffff8036d168 - 0xffffffff8038db08 is .rela.dyn
	0xffffffff8038e000 - 0xffffffff811843f8 is .text
	0xffffffff81184400 - 0xffffffff817f68d0 is .rodata
	0xffffffff817f68d0 - 0xffffffff817fba38 is set_sysctl_set
	0xffffffff817fba38 - 0xffffffff817fef60 is set_modmetadata_set
	0xffffffff817fef60 - 0xffffffff817fefb8 is set_cam_xpt_xport_set
	0xffffffff817fefb8 - 0xffffffff817fefe0 is set_cam_xpt_proto_set
	0xffffffff817fefe0 - 0xffffffff817ff028 is set_ah_chips
	0xffffffff817ff028 - 0xffffffff817ff078 is set_ah_rfs
	0xffffffff817ff078 - 0xffffffff817ff098 is set_kbddriver_set
	0xffffffff817ff098 - 0xffffffff817ff150 is set_sdt_providers_set
	0xffffffff817ff150 - 0xffffffff81800268 is set_sdt_probes_set
	0xffffffff81800268 - 0xffffffff818035c8 is set_sdt_argtypes_set
	0xffffffff818035c8 - 0xffffffff818035e0 is set_scterm_set
	0xffffffff818035e0 - 0xffffffff81803608 is set_cons_set
	0xffffffff81803608 - 0xffffffff81803610 is set_uart_acpi_class_and_device_set
	0xffffffff81803620 - 0xffffffff81803660 is usb_host_id
	0xffffffff81803660 - 0xffffffff81803680 is set_vt_drv_set
	0xffffffff81803680 - 0xffffffff818036a8 is set_elf64_regset
	0xffffffff818036a8 - 0xffffffff818036d8 is set_elf32_regset
--Type <RET> for more, q to quit, c to continue without paging--
	0xffffffff818036d8 - 0xffffffff818036e8 is set_compressors
	0xffffffff818036e8 - 0xffffffff818036f0 is set_kdb_dbbe_set
	0xffffffff818036f0 - 0xffffffff81803700 is set_ratectl_set
	0xffffffff81803700 - 0xffffffff81803718 is set_crypto_set
	0xffffffff81803718 - 0xffffffff81803730 is set_ieee80211_ioctl_getset
	0xffffffff81803730 - 0xffffffff81803748 is set_ieee80211_ioctl_setset
	0xffffffff81803748 - 0xffffffff81803770 is set_scanner_set
	0xffffffff81803770 - 0xffffffff81803790 is set_videodriver_set
	0xffffffff81803790 - 0xffffffff818037d8 is set_scrndr_set
	0xffffffff818037d8 - 0xffffffff81803820 is set_vga_set
	0xffffffff81803820 - 0xffffffff81804881 is kern_conf
	0xffffffff81804884 - 0xffffffff818048a8 is .note.gnu.build-id
	0xffffffff818048a8 - 0xffffffff8180493c is .eh_frame
	0xffffffff81a00000 - 0xffffffff81a00140 is .dynamic
	0xffffffff81a00140 - 0xffffffff81a01000 is .relro_padding
	0xffffffff81c00000 - 0xffffffff81c00035 is .data.read_frequently
	0xffffffff81c00040 - 0xffffffff81c017f4 is .data.read_mostly
	0xffffffff81c01800 - 0xffffffff81c07680 is .data.exclusive_cache_line
	0xffffffff81c08000 - 0xffffffff81d51248 is .data
	0xffffffff81d51248 - 0xffffffff81d54688 is set_sysinit_set
	0xffffffff81d54688 - 0xffffffff81d55e48 is set_sysuninit_set
	0xffffffff81d55e80 - 0xffffffff81d592e8 is set_pcpu
	0xffffffff81d592f0 - 0xffffffff81d82851 is set_vnet
	0xffffffff81d82880 - 0xffffffff82200000 is .bss
	0xffffffff82545000 - 0xffffffff82547000 is .text in /boot/kernel/sem.ko
	0xffffffff82547000 - 0xffffffff82548000 is .rodata in /boot/kernel/sem.ko
	0xffffffff82548000 - 0xffffffff8254895c is .data in /boot/kernel/sem.ko
	0xffffffff82548960 - 0xffffffff82548978 is set_sysctl_set in /boot/kernel/sem.ko
	0xffffffff82548978 - 0xffffffff82548988 is set_sysinit_set in /boot/kernel/sem.ko
	0xffffffff82548988 - 0xffffffff82548990 is set_sysuninit_set in /boot/kernel/sem.ko
	0xffffffff82548990 - 0xffffffff82548a10 is .bss in /boot/kernel/sem.ko
--Type <RET> for more, q to quit, c to continue without paging--
	0xffffffff82548a10 - 0xffffffff82548a28 is set_modmetadata_set in /boot/kernel/sem.ko
	0xffffffff82548a28 - 0xffffffff82548a4c is .note.gnu.build-id in /boot/kernel/sem.ko
	0xffffffff8254d000 - 0xffffffff825d4000 is .text in /boot/modules/if_re.ko
	0xffffffff825d4000 - 0xffffffff825db000 is .rodata in /boot/modules/if_re.ko
	0xffffffff825db000 - 0xffffffff825db4a0 is .data in /boot/modules/if_re.ko
	0xffffffff825db4a0 - 0xffffffff825db4f0 is set_sysctl_set in /boot/modules/if_re.ko
	0xffffffff825db4f0 - 0xffffffff825db500 is set_modmetadata_set in /boot/modules/if_re.ko
	0xffffffff825db500 - 0xffffffff825db508 is set_sysinit_set in /boot/modules/if_re.ko
	0xffffffff825db508 - 0xffffffff825db528 is .bss in /boot/modules/if_re.ko
	0xffffffff825db528 - 0xffffffff825db54c is .note.gnu.build-id in /boot/modules/if_re.ko
	0xffffffff8264d000 - 0xffffffff8265a000 is .text in /boot/kernel/fusefs.ko
	0xffffffff8265a000 - 0xffffffff8265c000 is .rodata in /boot/kernel/fusefs.ko
	0xffffffff8265c000 - 0xffffffff8265e874 is .data in /boot/kernel/fusefs.ko
	0xffffffff8265e878 - 0xffffffff8265e970 is set_sdt_probes_set in /boot/kernel/fusefs.ko
	0xffffffff8265e970 - 0xffffffff8265eba0 is set_sdt_argtypes_set in /boot/kernel/fusefs.ko
	0xffffffff8265eba0 - 0xffffffff8265ebd8 is set_sysinit_set in /boot/kernel/fusefs.ko
	0xffffffff8265ebd8 - 0xffffffff8265ebf8 is set_sysuninit_set in /boot/kernel/fusefs.ko
	0xffffffff8265ebf8 - 0xffffffff8265ec60 is set_sysctl_set in /boot/kernel/fusefs.ko
	0xffffffff8265ec60 - 0xffffffff8265ecc0 is .bss in /boot/kernel/fusefs.ko
	0xffffffff8265ecc0 - 0xffffffff8265ecc8 is set_sdt_providers_set in /boot/kernel/fusefs.ko
	0xffffffff8265ecc8 - 0xffffffff8265ece0 is set_modmetadata_set in /boot/kernel/fusefs.ko
	0xffffffff8265ece0 - 0xffffffff8265ed04 is .note.gnu.build-id in /boot/kernel/fusefs.ko
	0xffffffff82a00000 - 0xffffffff82cf4000 is .text in /boot/modules/amdgpu.ko
	0xffffffff82cf4000 - 0xffffffff82dfa000 is .rodata in /boot/modules/amdgpu.ko
	0xffffffff82dfa000 - 0xffffffff82e07378 is .bss in /boot/modules/amdgpu.ko
	0xffffffff82e07380 - 0xffffffff82e0fd74 is .data in /boot/modules/amdgpu.ko
	0xffffffff82e0fd78 - 0xffffffff82e10150 is set_sysctl_set in /boot/modules/amdgpu.ko
	0xffffffff82e10150 - 0xffffffff82e10178 is set_sysinit_set in /boot/modules/amdgpu.ko
	0xffffffff82e10178 - 0xffffffff82e10188 is set_sysuninit_set in /boot/modules/amdgpu.ko
	0xffffffff82e10188 - 0xffffffff82e101e0 is set_modmetadata_set in /boot/modules/amdgpu.ko
	0xffffffff82e101e0 - 0xffffffff82e10204 is .note.gnu.build-id in /boot/modules/amdgpu.ko
--Type <RET> for more, q to quit, c to continue without paging--
	0xffffffff82918000 - 0xffffffff8296c000 is .text in /boot/modules/drm.ko
	0xffffffff8296c000 - 0xffffffff82988000 is .rodata in /boot/modules/drm.ko
	0xffffffff82988000 - 0xffffffff82988190 is .bss in /boot/modules/drm.ko
	0xffffffff82988190 - 0xffffffff829899a8 is .data in /boot/modules/drm.ko
	0xffffffff829899a8 - 0xffffffff82989a20 is set_sysinit_set in /boot/modules/drm.ko
	0xffffffff82989a20 - 0xffffffff82989a80 is set_sysuninit_set in /boot/modules/drm.ko
	0xffffffff82989a80 - 0xffffffff82989b50 is set_sysctl_set in /boot/modules/drm.ko
	0xffffffff82989b50 - 0xffffffff82989b5c is .data.read_mostly in /boot/modules/drm.ko
	0xffffffff82989b60 - 0xffffffff82989bd8 is set_modmetadata_set in /boot/modules/drm.ko
	0xffffffff82989bd8 - 0xffffffff82989bfc is .note.gnu.build-id in /boot/modules/drm.ko
	0xffffffff8298a000 - 0xffffffff8298b000 is .text in /boot/kernel/iic.ko
	0xffffffff8298b000 - 0xffffffff8298c000 is .rodata in /boot/kernel/iic.ko
	0xffffffff8298c000 - 0xffffffff8298c270 is .data in /boot/kernel/iic.ko
	0xffffffff8298c270 - 0xffffffff8298c280 is set_sysinit_set in /boot/kernel/iic.ko
	0xffffffff8298c280 - 0xffffffff8298c288 is set_sysuninit_set in /boot/kernel/iic.ko
	0xffffffff8298c288 - 0xffffffff8298c2a8 is set_modmetadata_set in /boot/kernel/iic.ko
	0xffffffff8298c2a8 - 0xffffffff8298c2b0 is .bss in /boot/kernel/iic.ko
	0xffffffff8298c2b0 - 0xffffffff8298c2d4 is .note.gnu.build-id in /boot/kernel/iic.ko
	0xffffffff8298d000 - 0xffffffff8298f000 is .text in /boot/modules/linuxkpi_gplv2.ko
	0xffffffff8298f000 - 0xffffffff82990000 is .rodata in /boot/modules/linuxkpi_gplv2.ko
	0xffffffff82990000 - 0xffffffff829900c8 is .data in /boot/modules/linuxkpi_gplv2.ko
	0xffffffff829900c8 - 0xffffffff829900f0 is set_modmetadata_set in /boot/modules/linuxkpi_gplv2.ko
	0xffffffff829900f0 - 0xffffffff829900f8 is set_sysinit_set in /boot/modules/linuxkpi_gplv2.ko
	0xffffffff829900f8 - 0xffffffff829900fc is .bss in /boot/modules/linuxkpi_gplv2.ko
	0xffffffff829900fc - 0xffffffff82990120 is .note.gnu.build-id in /boot/modules/linuxkpi_gplv2.ko
	0xffffffff82991000 - 0xffffffff82996000 is .text in /boot/modules/dmabuf.ko
	0xffffffff82996000 - 0xffffffff82997000 is .rodata in /boot/modules/dmabuf.ko
	0xffffffff82997000 - 0xffffffff82997240 is .data in /boot/modules/dmabuf.ko
	0xffffffff82997240 - 0xffffffff82997250 is set_modmetadata_set in /boot/modules/dmabuf.ko
	0xffffffff82997250 - 0xffffffff82997268 is set_sysinit_set in /boot/modules/dmabuf.ko
	0xffffffff82997268 - 0xffffffff82997280 is set_sysuninit_set in /boot/modules/dmabuf.ko
--Type <RET> for more, q to quit, c to continue without paging--
	0xffffffff82997280 - 0xffffffff82997318 is .bss in /boot/modules/dmabuf.ko
	0xffffffff82997318 - 0xffffffff8299733c is .note.gnu.build-id in /boot/modules/dmabuf.ko
	0xffffffff82998000 - 0xffffffff829a2000 is .text in /boot/modules/ttm.ko
	0xffffffff829a2000 - 0xffffffff829a3000 is .rodata in /boot/modules/ttm.ko
	0xffffffff829a3000 - 0xffffffff829a3500 is .data in /boot/modules/ttm.ko
	0xffffffff829a3500 - 0xffffffff829a3520 is set_sysinit_set in /boot/modules/ttm.ko
	0xffffffff829a3520 - 0xffffffff829a3538 is set_sysuninit_set in /boot/modules/ttm.ko
	0xffffffff829a3540 - 0xffffffff829a4720 is .bss in /boot/modules/ttm.ko
	0xffffffff829a4720 - 0xffffffff829a4758 is set_modmetadata_set in /boot/modules/ttm.ko
	0xffffffff829a4758 - 0xffffffff829a477c is .note.gnu.build-id in /boot/modules/ttm.ko
	0xffffffff829a5000 - 0xffffffff829a6000 is .text in /boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a6000 - 0xffffffff829a7000 is .rodata in /boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a7000 - 0xffffffff829a713c is rodata in /boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a7140 - 0xffffffff829a71f0 is .data in /boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a71f0 - 0xffffffff829a7210 is set_modmetadata_set in /boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a7210 - 0xffffffff829a7218 is set_sysinit_set in /boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a7218 - 0xffffffff829a723c is .note.gnu.build-id in /boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a8000 - 0xffffffff829a9000 is .text in /boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829a9000 - 0xffffffff829aa000 is .rodata in /boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829aa000 - 0xffffffff829ae400 is rodata in /boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829ae400 - 0xffffffff829ae4b0 is .data in /boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829ae4b0 - 0xffffffff829ae4d0 is set_modmetadata_set in /boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829ae4d0 - 0xffffffff829ae4d8 is set_sysinit_set in /boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829ae4d8 - 0xffffffff829ae4fc is .note.gnu.build-id in /boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829af000 - 0xffffffff829b0000 is .text in /boot/modules/amdgpu_raven_asd_bin.ko
	0xffffffff829b0000 - 0xffffffff829b1000 is .rodata in /boot/modules/amdgpu_raven_asd_bin.ko
	0xffffffff829b1000 - 0xffffffff829dd200 is rodata in /boot/modules/amdgpu_raven_asd_bin.ko
	0xffffffff829dd200 - 0xffffffff829dd2b0 is .data in /boot/modules/amdgpu_raven_asd_bin.ko
	0xffffffff829dd2b0 - 0xffffffff829dd2d0 is set_modmetadata_set in /boot/modules/amdgpu_raven_asd_bin.ko
	0xffffffff829dd2d0 - 0xffffffff829dd2d8 is set_sysinit_set in /boot/modules/amdgpu_raven_asd_bin.ko
	0xffffffff829dd2d8 - 0xffffffff829dd2fc is .note.gnu.build-id in /boot/modules/amdgpu_raven_asd_bin.ko
--Type <RET> for more, q to quit, c to continue without paging--
	0xffffffff829de000 - 0xffffffff829df000 is .text in /boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829df000 - 0xffffffff829e0000 is .rodata in /boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829e0000 - 0xffffffff829e7300 is rodata in /boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829e7300 - 0xffffffff829e73b0 is .data in /boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829e73b0 - 0xffffffff829e73d0 is set_modmetadata_set in /boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829e73d0 - 0xffffffff829e73d8 is set_sysinit_set in /boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829e73d8 - 0xffffffff829e73fc is .note.gnu.build-id in /boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829e8000 - 0xffffffff829e9000 is .text in /boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829e9000 - 0xffffffff829ea000 is .rodata in /boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829ea000 - 0xffffffff829ef480 is rodata in /boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829ef480 - 0xffffffff829ef530 is .data in /boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829ef530 - 0xffffffff829ef550 is set_modmetadata_set in /boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829ef550 - 0xffffffff829ef558 is set_sysinit_set in /boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829ef558 - 0xffffffff829ef57c is .note.gnu.build-id in /boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829f0000 - 0xffffffff829f1000 is .text in /boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f1000 - 0xffffffff829f2000 is .rodata in /boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f2000 - 0xffffffff829f6480 is rodata in /boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f6480 - 0xffffffff829f6530 is .data in /boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f6530 - 0xffffffff829f6550 is set_modmetadata_set in /boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f6550 - 0xffffffff829f6558 is set_sysinit_set in /boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f6558 - 0xffffffff829f657c is .note.gnu.build-id in /boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f7000 - 0xffffffff829f8000 is .text in /boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff829f8000 - 0xffffffff829f9000 is .rodata in /boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff829f9000 - 0xffffffff829fb480 is rodata in /boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff829fb480 - 0xffffffff829fb530 is .data in /boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff829fb530 - 0xffffffff829fb550 is set_modmetadata_set in /boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff829fb550 - 0xffffffff829fb558 is set_sysinit_set in /boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff829fb558 - 0xffffffff829fb57c is .note.gnu.build-id in /boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff82e11000 - 0xffffffff82e12000 is .text in /boot/modules/amdgpu_raven_rlc_bin.ko
	0xffffffff82e12000 - 0xffffffff82e13000 is .rodata in /boot/modules/amdgpu_raven_rlc_bin.ko
	0xffffffff82e13000 - 0xffffffff82e1c8e4 is rodata in /boot/modules/amdgpu_raven_rlc_bin.ko
	0xffffffff82e1c8e8 - 0xffffffff82e1c998 is .data in /boot/modules/amdgpu_raven_rlc_bin.ko
	0xffffffff82e1c998 - 0xffffffff82e1c9b8 is set_modmetadata_set in /boot/modules/amdgpu_raven_rlc_bin.ko
	0xffffffff82e1c9b8 - 0xffffffff82e1c9c0 is set_sysinit_set in /boot/modules/amdgpu_raven_rlc_bin.ko
	0xffffffff82e1c9c0 - 0xffffffff82e1c9e4 is .note.gnu.build-id in /boot/modules/amdgpu_raven_rlc_bin.ko
	0xffffffff82e1d000 - 0xffffffff82e1e000 is .text in /boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e1e000 - 0xffffffff82e1f000 is .rodata in /boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e1f000 - 0xffffffff82e60710 is rodata in /boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e60710 - 0xffffffff82e607c0 is .data in /boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e607c0 - 0xffffffff82e607e0 is set_modmetadata_set in /boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e607e0 - 0xffffffff82e607e8 is set_sysinit_set in /boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e607e8 - 0xffffffff82e6080c is .note.gnu.build-id in /boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e61000 - 0xffffffff82e62000 is .text in /boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82e62000 - 0xffffffff82e63000 is .rodata in /boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82e63000 - 0xffffffff82ea4710 is rodata in /boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82ea4710 - 0xffffffff82ea47c0 is .data in /boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82ea47c0 - 0xffffffff82ea47e0 is set_modmetadata_set in /boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82ea47e0 - 0xffffffff82ea47e8 is set_sysinit_set in /boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82ea47e8 - 0xffffffff82ea480c is .note.gnu.build-id in /boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82ea5000 - 0xffffffff82ea6000 is .text in /boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff82ea6000 - 0xffffffff82ea7000 is .rodata in /boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff82ea7000 - 0xffffffff82eff560 is rodata in /boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff82eff560 - 0xffffffff82eff610 is .data in /boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff82eff610 - 0xffffffff82eff630 is set_modmetadata_set in /boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff82eff630 - 0xffffffff82eff638 is set_sysinit_set in /boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff82eff638 - 0xffffffff82eff65c is .note.gnu.build-id in /boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff83000000 - 0xffffffff8324c000 is .text in /boot/kernel/zfs.ko
	0xffffffff8324c000 - 0xffffffff832dc000 is .rodata in /boot/kernel/zfs.ko
	0xffffffff832dc000 - 0xffffffff832fe228 is .data in /boot/kernel/zfs.ko
	0xffffffff832fe228 - 0xffffffff832fe318 is set_sysinit_set in /boot/kernel/zfs.ko
	0xffffffff832fe318 - 0xffffffff832fe398 is set_sysuninit_set in /boot/kernel/zfs.ko
	0xffffffff832fe400 - 0xffffffff833b79c8 is .bss in /boot/kernel/zfs.ko
	0xffffffff833b79c8 - 0xffffffff833b85e8 is set_sysctl_set in /boot/kernel/zfs.ko
	0xffffffff833b85e8 - 0xffffffff833b8810 is set_sdt_probes_set in /boot/kernel/zfs.ko
	0xffffffff833b8810 - 0xffffffff833b8c30 is set_sdt_argtypes_set in /boot/kernel/zfs.ko
	0xffffffff833b8c30 - 0xffffffff833b8c98 is set_modmetadata_set in /boot/kernel/zfs.ko
	0xffffffff833b8c98 - 0xffffffff833b8cbc is .note.gnu.build-id in /boot/kernel/zfs.ko
	0xffffffff82f05000 - 0xffffffff82f0d000 is .text in /boot/kernel/netgraph.ko
	0xffffffff82f0d000 - 0xffffffff82f0f000 is .rodata in /boot/kernel/netgraph.ko
	0xffffffff82f0f000 - 0xffffffff82f0f900 is .data in /boot/kernel/netgraph.ko
	0xffffffff82f0f900 - 0xffffffff82f0f918 is set_modmetadata_set in /boot/kernel/netgraph.ko
	0xffffffff82f0f918 - 0xffffffff82f0f960 is set_sysinit_set in /boot/kernel/netgraph.ko
	0xffffffff82f0f960 - 0xffffffff82f0f9a0 is set_sysuninit_set in /boot/kernel/netgraph.ko
	0xffffffff82f0f9a0 - 0xffffffff82f0f9d8 is set_vnet in /boot/kernel/netgraph.ko
	0xffffffff82f0f9d8 - 0xffffffff82f0fa98 is .bss in /boot/kernel/netgraph.ko
	0xffffffff82f0fa98 - 0xffffffff82f0fac8 is set_sysctl_set in /boot/kernel/netgraph.ko
	0xffffffff82f0fac8 - 0xffffffff82f0faec is .note.gnu.build-id in /boot/kernel/netgraph.ko
	0xffffffff829fc000 - 0xffffffff829fe000 is .text in /boot/kernel/acpi_wmi.ko
	0xffffffff829fe000 - 0xffffffff829ff000 is .rodata in /boot/kernel/acpi_wmi.ko
	0xffffffff829ff000 - 0xffffffff829ff2f8 is .data in /boot/kernel/acpi_wmi.ko
	0xffffffff829ff2f8 - 0xffffffff829ff310 is set_sysinit_set in /boot/kernel/acpi_wmi.ko
	0xffffffff829ff310 - 0xffffffff829ff320 is set_sysuninit_set in /boot/kernel/acpi_wmi.ko
	0xffffffff829ff320 - 0xffffffff829ff350 is set_modmetadata_set in /boot/kernel/acpi_wmi.ko
	0xffffffff829ff350 - 0xffffffff829ff378 is .bss in /boot/kernel/acpi_wmi.ko
	0xffffffff829ff378 - 0xffffffff829ff39c is .note.gnu.build-id in /boot/kernel/acpi_wmi.ko
	0xffffffff82f00000 - 0xffffffff82f02000 is .text in /boot/kernel/intpm.ko
	0xffffffff82f02000 - 0xffffffff82f03000 is .rodata in /boot/kernel/intpm.ko
	0xffffffff82f03000 - 0xffffffff82f031c8 is .data in /boot/kernel/intpm.ko
	0xffffffff82f031c8 - 0xffffffff82f03200 is set_modmetadata_set in /boot/kernel/intpm.ko
	0xffffffff82f03200 - 0xffffffff82f03210 is set_sysinit_set in /boot/kernel/intpm.ko
	0xffffffff82f03210 - 0xffffffff82f03218 is .bss in /boot/kernel/intpm.ko
	0xffffffff82f03218 - 0xffffffff82f0323c is .note.gnu.build-id in /boot/kernel/intpm.ko
	0xffffffff82f10000 - 0xffffffff82f11000 is .text in /boot/kernel/smbus.ko
	0xffffffff82f11000 - 0xffffffff82f12000 is .rodata in /boot/kernel/smbus.ko
	0xffffffff82f12000 - 0xffffffff82f1216c is .data in /boot/kernel/smbus.ko
	0xffffffff82f12170 - 0xffffffff82f12178 is set_modmetadata_set in /boot/kernel/smbus.ko
	0xffffffff82f12178 - 0xffffffff82f12180 is .bss in /boot/kernel/smbus.ko
	0xffffffff82f12180 - 0xffffffff82f121a4 is .note.gnu.build-id in /boot/kernel/smbus.ko
	0xffffffff82f13000 - 0xffffffff82f15000 is .text in /boot/kernel/uhid.ko
	0xffffffff82f15000 - 0xffffffff82f16000 is .rodata in /boot/kernel/uhid.ko
	0xffffffff82f16000 - 0xffffffff82f162a4 is .data in /boot/kernel/uhid.ko
	0xffffffff82f162a8 - 0xffffffff82f162b8 is set_sysctl_set in /boot/kernel/uhid.ko
	0xffffffff82f162b8 - 0xffffffff82f162e8 is set_modmetadata_set in /boot/kernel/uhid.ko
	0xffffffff82f162e8 - 0xffffffff82f162f0 is set_sysinit_set in /boot/kernel/uhid.ko
	0xffffffff82f162f0 - 0xffffffff82f16300 is .bss in /boot/kernel/uhid.ko
	0xffffffff82f16300 - 0xffffffff82f16340 is usb_host_id in /boot/kernel/uhid.ko
	0xffffffff82f16340 - 0xffffffff82f16364 is .note.gnu.build-id in /boot/kernel/uhid.ko
	0xffffffff82f17000 - 0xffffffff82f19000 is .text in /boot/kernel/usbhid.ko
	0xffffffff82f19000 - 0xffffffff82f1a000 is .rodata in /boot/kernel/usbhid.ko
	0xffffffff82f1a000 - 0xffffffff82f1a290 is .data in /boot/kernel/usbhid.ko
	0xffffffff82f1a290 - 0xffffffff82f1a2a8 is set_sysctl_set in /boot/kernel/usbhid.ko
	0xffffffff82f1a2a8 - 0xffffffff82f1a2e0 is set_modmetadata_set in /boot/kernel/usbhid.ko
	0xffffffff82f1a2e0 - 0xffffffff82f1a2e8 is set_sysinit_set in /boot/kernel/usbhid.ko
	0xffffffff82f1a2e8 - 0xffffffff82f1a2f8 is .bss in /boot/kernel/usbhid.ko
	0xffffffff82f1a300 - 0xffffffff82f1a380 is usb_host_id in /boot/kernel/usbhid.ko
	0xffffffff82f1a380 - 0xffffffff82f1a3a4 is .note.gnu.build-id in /boot/kernel/usbhid.ko
	0xffffffff82f1b000 - 0xffffffff82f1d000 is .text in /boot/kernel/hidbus.ko
	0xffffffff82f1d000 - 0xffffffff82f1e000 is .rodata in /boot/kernel/hidbus.ko
	0xffffffff82f1e000 - 0xffffffff82f1e008 is .bss in /boot/kernel/hidbus.ko
	0xffffffff82f1e008 - 0xffffffff82f1e258 is .data in /boot/kernel/hidbus.ko
	0xffffffff82f1e258 - 0xffffffff82f1e298 is set_modmetadata_set in /boot/kernel/hidbus.ko
	0xffffffff82f1e298 - 0xffffffff82f1e2b0 is set_sysinit_set in /boot/kernel/hidbus.ko
	0xffffffff82f1e2b0 - 0xffffffff82f1e2d4 is .note.gnu.build-id in /boot/kernel/hidbus.ko
	0xffffffff82f1f000 - 0xffffffff82f21000 is .text in /boot/kernel/wmt.ko
	0xffffffff82f21000 - 0xffffffff82f22000 is .rodata in /boot/kernel/wmt.ko
	0xffffffff82f22000 - 0xffffffff82f22290 is .data in /boot/kernel/wmt.ko
	0xffffffff82f22290 - 0xffffffff82f222a8 is set_sysctl_set in /boot/kernel/wmt.ko
	0xffffffff82f222a8 - 0xffffffff82f222e0 is set_modmetadata_set in /boot/kernel/wmt.ko
	0xffffffff82f222e0 - 0xffffffff82f222e8 is set_sysinit_set in /boot/kernel/wmt.ko
	0xffffffff82f222e8 - 0xffffffff82f222f8 is .bss in /boot/kernel/wmt.ko
	0xffffffff82f22300 - 0xffffffff82f22320 is usb_host_id in /boot/kernel/wmt.ko
	0xffffffff82f22320 - 0xffffffff82f22344 is .note.gnu.build-id in /boot/kernel/wmt.ko
	0xffffffff82f23000 - 0xffffffff82f26000 is .text in /boot/kernel/ums.ko
	0xffffffff82f26000 - 0xffffffff82f27000 is .rodata in /boot/kernel/ums.ko
	0xffffffff82f27000 - 0xffffffff82f272c0 is .data in /boot/kernel/ums.ko
	0xffffffff82f272c0 - 0xffffffff82f272d0 is set_sysctl_set in /boot/kernel/ums.ko
	0xffffffff82f272e0 - 0xffffffff82f27300 is usb_host_id in /boot/kernel/ums.ko
	0xffffffff82f27300 - 0xffffffff82f27338 is set_modmetadata_set in /boot/kernel/ums.ko
	0xffffffff82f27338 - 0xffffffff82f27340 is set_sysinit_set in /boot/kernel/ums.ko
	0xffffffff82f27340 - 0xffffffff82f27350 is .bss in /boot/kernel/ums.ko
	0xffffffff82f27350 - 0xffffffff82f27374 is .note.gnu.build-id in /boot/kernel/ums.ko
	0xffffffff82f28000 - 0xffffffff82f2c000 is .text in /boot/kernel/autofs.ko
	0xffffffff82f2c000 - 0xffffffff82f2d000 is .rodata in /boot/kernel/autofs.ko
	0xffffffff82f2d000 - 0xffffffff82f2da24 is .data in /boot/kernel/autofs.ko
	0xffffffff82f2da28 - 0xffffffff82f2da78 is set_sysinit_set in /boot/kernel/autofs.ko
	0xffffffff82f2da78 - 0xffffffff82f2da80 is set_sysuninit_set in /boot/kernel/autofs.ko
	0xffffffff82f2da80 - 0xffffffff82f2dac0 is set_sysctl_set in /boot/kernel/autofs.ko
	0xffffffff82f2dac0 - 0xffffffff82f2dae0 is .bss in /boot/kernel/autofs.ko
	0xffffffff82f2dae0 - 0xffffffff82f2daf8 is set_modmetadata_set in /boot/kernel/autofs.ko
	0xffffffff82f2daf8 - 0xffffffff82f2db1c is .note.gnu.build-id in /boot/kernel/autofs.ko
	0xffffffff82f2e000 - 0xffffffff82f2f000 is .text in /boot/kernel/mac_ntpd.ko
	0xffffffff82f2f000 - 0xffffffff82f30000 is .rodata in /boot/kernel/mac_ntpd.ko
	0xffffffff82f30000 - 0xffffffff82f309d0 is .data in /boot/kernel/mac_ntpd.ko
	0xffffffff82f309d0 - 0xffffffff82f309e8 is set_sysctl_set in /boot/kernel/mac_ntpd.ko
	0xffffffff82f309e8 - 0xffffffff82f30a00 is set_modmetadata_set in /boot/kernel/mac_ntpd.ko
	0xffffffff82f30a00 - 0xffffffff82f30a08 is set_sysinit_set in /boot/kernel/mac_ntpd.ko
	0xffffffff82f30a08 - 0xffffffff82f30a2c is .note.gnu.build-id in /boot/kernel/mac_ntpd.ko
	0xffffffff82f31000 - 0xffffffff82f32000 is .text in /boot/kernel/green_saver.ko
	0xffffffff82f32000 - 0xffffffff82f33000 is .rodata in /boot/kernel/green_saver.ko
	0xffffffff82f33000 - 0xffffffff82f330cc is .data in /boot/kernel/green_saver.ko
	0xffffffff82f330d0 - 0xffffffff82f330e8 is set_modmetadata_set in /boot/kernel/green_saver.ko
	0xffffffff82f330e8 - 0xffffffff82f330f0 is set_sysinit_set in /boot/kernel/green_saver.ko
	0xffffffff82f330f0 - 0xffffffff82f33114 is .note.gnu.build-id in /boot/kernel/green_saver.ko
(kgdb) disass 0xffffffff80cf0110
Dump of assembler code for function strcmp:
   0xffffffff80cf0100 <+0>:	push   %rbp
   0xffffffff80cf0101 <+1>:	mov    %rsp,%rbp
   0xffffffff80cf0104 <+4>:	xor    %ecx,%ecx
   0xffffffff80cf0106 <+6>:	cs nopw 0x0(%rax,%rax,1)
   0xffffffff80cf0110 <+16>:	movzbl (%rdi,%rcx,1),%eax
   0xffffffff80cf0114 <+20>:	movzbl (%rsi,%rcx,1),%edx
   0xffffffff80cf0118 <+24>:	cmp    %dl,%al
   0xffffffff80cf011a <+26>:	jne    0xffffffff80cf0127 <strcmp+39>
   0xffffffff80cf011c <+28>:	inc    %rcx
   0xffffffff80cf011f <+31>:	test   %eax,%eax
   0xffffffff80cf0121 <+33>:	jne    0xffffffff80cf0110 <strcmp+16>
   0xffffffff80cf0123 <+35>:	xor    %eax,%eax
   0xffffffff80cf0125 <+37>:	pop    %rbp
   0xffffffff80cf0126 <+38>:	ret
   0xffffffff80cf0127 <+39>:	sub    %edx,%eax
   0xffffffff80cf0129 <+41>:	pop    %rbp
   0xffffffff80cf012a <+42>:	ret
End of assembler dump.
(kgdb)
Comment 214 Mark Millard 2024-12-12 00:29:04 UTC
Interesting. Your context:

Local exec file:
	`/boot/kernel/kernel', file type elf64-x86-64-freebsd.
	Entry point: 0xffffffff8038e000
	0xffffffff802002a8 - 0xffffffff802002b5 is .interp
	0xffffffff802002b8 - 0xffffffff80231108 is .hash
	0xffffffff80231108 - 0xffffffff8025f9e4 is .gnu.hash
	0xffffffff8025f9e8 - 0xffffffff802f24c0 is .dynsym
	0xffffffff802f24c0 - 0xffffffff8036d162 is .dynstr
	0xffffffff8036d168 - 0xffffffff8038db08 is .rela.dyn
	0xffffffff8038e000 - 0xffffffff811843f8 is .text
	0xffffffff81184400 - 0xffffffff817f68d0 is .rodata
. . .

The downloaded kernel.txz expanded:

Local exec file:
	`/usr/home/root/artifacts/13.4R/boot/kernel/kernel', file type elf64-x86-64-freebsd.
	Entry point: 0xffffffff8038e000
	0xffffffff802002a8 - 0xffffffff802002b5 is .interp
	0xffffffff802002b8 - 0xffffffff802310f0 is .hash
	0xffffffff802310f0 - 0xffffffff8025f9c0 is .gnu.hash
	0xffffffff8025f9c0 - 0xffffffff802f2450 is .dynsym
	0xffffffff802f2450 - 0xffffffff8036d0c4 is .dynstr
	0xffffffff8036d0c8 - 0xffffffff8038da68 is .rela.dyn
	0xffffffff8038e000 - 0xffffffff811863f8 is .text
	0xffffffff81186400 - 0xffffffff817f8c20 is .rodata
. . .

And, your context:

(kgdb) disass 0xffffffff80cf0110
Dump of assembler code for function strcmp:
   0xffffffff80cf0100 <+0>:	push   %rbp
   0xffffffff80cf0101 <+1>:	mov    %rsp,%rbp
   0xffffffff80cf0104 <+4>:	xor    %ecx,%ecx
   0xffffffff80cf0106 <+6>:	cs nopw 0x0(%rax,%rax,1)
   0xffffffff80cf0110 <+16>:	movzbl (%rdi,%rcx,1),%eax
   0xffffffff80cf0114 <+20>:	movzbl (%rsi,%rcx,1),%edx
   0xffffffff80cf0118 <+24>:	cmp    %dl,%al
   0xffffffff80cf011a <+26>:	jne    0xffffffff80cf0127 <strcmp+39>
   0xffffffff80cf011c <+28>:	inc    %rcx
   0xffffffff80cf011f <+31>:	test   %eax,%eax
   0xffffffff80cf0121 <+33>:	jne    0xffffffff80cf0110 <strcmp+16>
   0xffffffff80cf0123 <+35>:	xor    %eax,%eax
   0xffffffff80cf0125 <+37>:	pop    %rbp
   0xffffffff80cf0126 <+38>:	ret
   0xffffffff80cf0127 <+39>:	sub    %edx,%eax
   0xffffffff80cf0129 <+41>:	pop    %rbp
   0xffffffff80cf012a <+42>:	ret
End of assembler dump.

My 13.4-RELEASE kernel.txz expansion:

(kgdb) disass strcmp
Dump of assembler code for function strcmp:
   0xffffffff80cf2290 <+0>:	push   %rbp
   0xffffffff80cf2291 <+1>:	mov    %rsp,%rbp
   0xffffffff80cf2294 <+4>:	xor    %ecx,%ecx
   0xffffffff80cf2296 <+6>:	cs nopw 0x0(%rax,%rax,1)
   0xffffffff80cf22a0 <+16>:	movzbl (%rdi,%rcx,1),%eax
   0xffffffff80cf22a4 <+20>:	movzbl (%rsi,%rcx,1),%edx
   0xffffffff80cf22a8 <+24>:	cmp    %dl,%al
   0xffffffff80cf22aa <+26>:	jne    0xffffffff80cf22b7 <strcmp+39>
   0xffffffff80cf22ac <+28>:	inc    %rcx
   0xffffffff80cf22af <+31>:	test   %eax,%eax
   0xffffffff80cf22b1 <+33>:	jne    0xffffffff80cf22a0 <strcmp+16>
   0xffffffff80cf22b3 <+35>:	xor    %eax,%eax
   0xffffffff80cf22b5 <+37>:	pop    %rbp
   0xffffffff80cf22b6 <+38>:	ret
   0xffffffff80cf22b7 <+39>:	sub    %edx,%eax
   0xffffffff80cf22b9 <+41>:	pop    %rbp
   0xffffffff80cf22ba <+42>:	ret
End of assembler dump.

Same code, different address range.

It does not look like I can investigate backtraces
via just the kernel*.txz contents. That invalidates
a lot of my older notes that involve such.

strcmp in your context is at least believable,
presuming the address interpretation of the blur
is accurate: it lands on the clean start of an
instruction that could well produce the general
protection fault:

   0xffffffff80cf0110 <+16>:	movzbl (%rdi,%rcx,1),%eax

Most bad values would not point to the intended
start of an instruction that could generate the
initially reported failure.

For reference, for the kernel.txz expansion:

# strings boot/kernel/kernel | grep "\-RELEASE "
@(#)FreeBSD 13.4-RELEASE releng/13.4-n258257-58066db597be GENERIC
FreeBSD 13.4-RELEASE releng/13.4-n258257-58066db597be GENERIC
Comment 215 George Mitchell 2024-12-12 00:55:50 UTC
I was running 13.1-RELEASE when this bug was filed; 13.2-RELEASE later on (both those kernels no longer exist); 13.3 up until last Saturday; and now 13.4, whence the most recent output in the last couple of comments ...
Comment 216 Mark Millard 2024-12-12 01:18:54 UTC
(In reply to George Mitchell from comment #215)

My means of picking up kernel files to look at was
not matching the patch level involved, just the 13.*
status each time. I was making a bad assumption by
doing so; I only noticed that now.

So far as I know, there are no pre-made official
distributions of the patched variants of the kernel
files for RELEASE.

I do not know whether PkgBase for 14.* release builds
would match a 14.*-RELEASE-p* well enough for the
purpose. 13.* has no PkgBase distributions to try.

https://pkg.freebsd.org/FreeBSD:14:amd64/base_release_*/
Comment 217 Mark Millard 2024-12-12 18:44:36 UTC
comment #5 and comment #82 both identify strcmp for the failure
context, and both identify it as a strcmp during modlist_lookup
that got the failures in those examples. This is part of the
linker_load_module activity, something the backtrace in your
recent example also indicates is going on.

comment #5  was for a context attempting to find "zfs".
comment #82 was for a context attempting to find "acpi_wmi".

(That aspect varies across the failures.)

The comment #82 notes are likely the closest thing we have to
detailed failure information, as far as I can tell.

Only the kgdb backtrace seems to be all that useful; it is what
comment #5 and comment #82 were based on, apparently with
contexts correctly matching the live system kernel of the
times in question.
Comment 218 Mark Millard 2024-12-12 19:16:39 UTC
(In reply to Mark Millard from comment #217)

I'll note that linker_load_dependencies, which shows up
in the modern example's kernel (non-kgdb) backtrace, also
calls modlist_lookup . It also uses strcmp directly.
And it also calls modlist_lookup2, which in turn calls
modlist_lookup . It can also recurse back out to
linker_load_module.

It appears that getting a dump during the initial general
protection fault, having it savecore'd and crashinfo'd, and
then looking at a kgdb backtrace would be the primary
next useful thing.

I also hope that:

static modlisthead_t found_modules;

can be examined to see if the modlist has any bad name
pointers that strcmp ends up trying to use.
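
For orientation, the structure and loop involved look roughly like the
following. This is only a sketch: the field names come from the kgdb
prints and the disass/s source interleaving later in this report, so
treat it as a paraphrase rather than the exact 13.4 source.

typedef struct modlist {
	TAILQ_ENTRY(modlist) link;	/* chains all known modules together */
	linker_file_t	container;	/* the linker file providing the module */
	const char	*name;		/* the pointer strcmp dereferences */
	int		version;
} *modlist_t;

static modlist_t
modlist_lookup(const char *name, int ver)
{
	modlist_t mod;

	TAILQ_FOREACH(mod, &found_modules, link) {
		/* A garbage mod->name here would fault inside strcmp. */
		if (strcmp(mod->name, name) == 0 &&
		    (ver == 0 || mod->version == ver))
			return (mod);
	}
	return (NULL);
}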
Comment 219 Mark Millard 2024-12-12 19:53:38 UTC
For a successful boot, could you try:

# kgdb
. . .
(kgdb) disass/s linker_load_dependencies
. . . ( a range around offset +0x274 ) . . .

If you have /usr/src/ in place as a copy of the source
for 13.4-RELEASE, the "/s" should also show the
related source code, though tracked in code generation
order, not source code order.

As things may be inlined, you may see the strcmp and such
from called routines, not just source from
linker_load_dependencies itself.

This might give an idea of which phase
linker_load_dependencies was in when it ended up leading
to the failure during strcmp .
Comment 220 George Mitchell 2024-12-12 20:00:26 UTC
root@court:/home/george # kgdb
GNU gdb (GDB) 15.1 [GDB v15.1 for FreeBSD]
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd13.4".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...
Reading symbols from /boot/kernel/sem.ko...
Reading symbols from /usr/lib/debug//boot/kernel/sem.ko.debug...
Reading symbols from /boot/modules/if_re.ko...
(No debugging symbols found in /boot/modules/if_re.ko)
Reading symbols from /boot/kernel/fusefs.ko...
Reading symbols from /usr/lib/debug//boot/kernel/fusefs.ko.debug...
Reading symbols from /boot/modules/amdgpu.ko...
(No debugging symbols found in /boot/modules/amdgpu.ko)
Reading symbols from /boot/modules/drm.ko...
(No debugging symbols found in /boot/modules/drm.ko)
Reading symbols from /boot/kernel/iic.ko...
Reading symbols from /usr/lib/debug//boot/kernel/iic.ko.debug...
Reading symbols from /boot/modules/linuxkpi_gplv2.ko...
(No debugging symbols found in /boot/modules/linuxkpi_gplv2.ko)
Reading symbols from /boot/modules/dmabuf.ko...
(No debugging symbols found in /boot/modules/dmabuf.ko)
Reading symbols from /boot/modules/ttm.ko...
(No debugging symbols found in /boot/modules/ttm.ko)
Reading symbols from /boot/modules/amdgpu_raven_gpu_info_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_gpu_info_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_sdma_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_sdma_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_asd_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_asd_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_ta_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_ta_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_pfp_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_pfp_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_me_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_me_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_ce_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_ce_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_rlc_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_rlc_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_mec_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_mec_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_mec2_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_mec2_bin.ko)
Reading symbols from /boot/modules/amdgpu_raven_vcn_bin.ko...
(No debugging symbols found in /boot/modules/amdgpu_raven_vcn_bin.ko)
Reading symbols from /boot/kernel/zfs.ko...
Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug...
Reading symbols from /boot/kernel/netgraph.ko...
Reading symbols from /usr/lib/debug//boot/kernel/netgraph.ko.debug...
Reading symbols from /boot/kernel/acpi_wmi.ko...
Reading symbols from /usr/lib/debug//boot/kernel/acpi_wmi.ko.debug...
Reading symbols from /boot/kernel/intpm.ko...
Reading symbols from /usr/lib/debug//boot/kernel/intpm.ko.debug...
Reading symbols from /boot/kernel/smbus.ko...
Reading symbols from /usr/lib/debug//boot/kernel/smbus.ko.debug...
Reading symbols from /boot/kernel/uhid.ko...
Reading symbols from /usr/lib/debug//boot/kernel/uhid.ko.debug...
Reading symbols from /boot/kernel/usbhid.ko...
Reading symbols from /usr/lib/debug//boot/kernel/usbhid.ko.debug...
Reading symbols from /boot/kernel/hidbus.ko...
Reading symbols from /usr/lib/debug//boot/kernel/hidbus.ko.debug...
Reading symbols from /boot/kernel/wmt.ko...
Reading symbols from /usr/lib/debug//boot/kernel/wmt.ko.debug...
Reading symbols from /boot/kernel/ums.ko...
Reading symbols from /usr/lib/debug//boot/kernel/ums.ko.debug...
Reading symbols from /boot/kernel/autofs.ko...
Reading symbols from /usr/lib/debug//boot/kernel/autofs.ko.debug...
Reading symbols from /boot/kernel/mac_ntpd.ko...
Reading symbols from /usr/lib/debug//boot/kernel/mac_ntpd.ko.debug...
Reading symbols from /boot/kernel/green_saver.ko...
Reading symbols from /usr/lib/debug//boot/kernel/green_saver.ko.debug...
sched_switch (td=td@entry=0xffffffff82043780 <thread0_st>, flags=flags@entry=260) at /usr/src/sys/kern/sched_4bsd.c:1085
1085			SDT_PROBE0(sched, , , on__cpu);
(kgdb) disass/s linker_load_dependencies
Dump of assembler code for function linker_load_dependencies:
/usr/src/sys/kern/kern_linker.c:
2213	{
   0xffffffff80bc0840 <+0>:	push   %rbp
   0xffffffff80bc0841 <+1>:	mov    %rsp,%rbp
   0xffffffff80bc0844 <+4>:	push   %r15
   0xffffffff80bc0846 <+6>:	push   %r14
   0xffffffff80bc0848 <+8>:	push   %r13
   0xffffffff80bc084a <+10>:	push   %r12
   0xffffffff80bc084c <+12>:	push   %rbx
   0xffffffff80bc084d <+13>:	sub    $0x18,%rsp
   0xffffffff80bc0851 <+17>:	mov    %rdi,%r14

2214		linker_file_t lfdep;
2215		struct mod_metadata **start, **stop, **mdp, **nmdp;
2216		struct mod_metadata *mp, *nmp;
2217		const struct mod_depend *verinfo;
2218		modlist_t mod;
2219		const char *modname, *nmodname;
2220		int ver, error = 0;
2221	
2222		/*
2223		 * All files are dependent on /kernel.
2224		 */
2225		sx_assert(&kld_sx, SA_XLOCKED);
2226		if (linker_kernel_file) {
   0xffffffff80bc0854 <+20>:	mov    0x1484f6d(%rip),%rbx        # 0xffffffff820457c8 <linker_kernel_file>
   0xffffffff80bc085b <+27>:	test   %rbx,%rbx
   0xffffffff80bc085e <+30>:	je     0xffffffff80bc0895 <linker_load_dependencies+85>

2227			linker_kernel_file->refs++;
   0xffffffff80bc0860 <+32>:	incl   0x8(%rbx)

773		file->deps = realloc(file->deps, (file->ndeps + 1) * sizeof(*newdeps),
   0xffffffff80bc0863 <+35>:	mov    0x78(%r14),%rdi
   0xffffffff80bc0867 <+39>:	mov    0x70(%r14),%eax
   0xffffffff80bc086b <+43>:	inc    %eax
   0xffffffff80bc086d <+45>:	movslq %eax,%rsi
   0xffffffff80bc0870 <+48>:	shl    $0x3,%rsi
   0xffffffff80bc0874 <+52>:	mov    $0xffffffff81cc6be0,%rdx
   0xffffffff80bc087b <+59>:	mov    $0x102,%ecx
   0xffffffff80bc0880 <+64>:	call   0xffffffff80bc84f0 <realloc>
   0xffffffff80bc0885 <+69>:	mov    %rax,0x78(%r14)

774		    M_LINKER, M_WAITOK | M_ZERO);
775		file->deps[file->ndeps] = dep;
   0xffffffff80bc0889 <+73>:	movslq 0x70(%r14),%rcx
   0xffffffff80bc088d <+77>:	mov    %rbx,(%rax,%rcx,8)

776		file->ndeps++;
   0xffffffff80bc0891 <+81>:	incl   0x70(%r14)

./linker_if.h:
142		KOBJOPLOOKUP(((kobj_t)file)->ops,linker_lookup_set);
   0xffffffff80bc0895 <+85>:	mov    (%r14),%rcx
   0xffffffff80bc0898 <+88>:	movzbl 0x1112839(%rip),%edx        # 0xffffffff81cd30d8 <linker_lookup_set_desc>
   0xffffffff80bc089f <+95>:	mov    (%rcx,%rdx,8),%rax
   0xffffffff80bc08a3 <+99>:	cmpq   $0xffffffff81cd30d8,(%rax)
   0xffffffff80bc08aa <+106>:	je     0xffffffff80bc08c3 <linker_load_dependencies+131>
   0xffffffff80bc08ac <+108>:	lea    (%rcx,%rdx,8),%rsi
   0xffffffff80bc08b0 <+112>:	mov    0x800(%rcx),%rdi
   0xffffffff80bc08b7 <+119>:	mov    $0xffffffff81cd30d8,%rdx
   0xffffffff80bc08be <+126>:	call   0xffffffff80c3ca30 <kobj_lookup_method>
   0xffffffff80bc08c3 <+131>:	xor    %ebx,%ebx
   0xffffffff80bc08c5 <+133>:	lea    -0x38(%rbp),%rdx
   0xffffffff80bc08c9 <+137>:	lea    -0x30(%rbp),%rcx

143		rc = ((linker_lookup_set_t *) _m)(file, name, start, stop, count);
   0xffffffff80bc08cd <+141>:	mov    %r14,%rdi
   0xffffffff80bc08d0 <+144>:	mov    $0xffffffff8122e707,%rsi
   0xffffffff80bc08d7 <+151>:	xor    %r8d,%r8d
   0xffffffff80bc08da <+154>:	call   *0x8(%rax)

/usr/src/sys/kern/kern_linker.c:
2231		    NULL) != 0)
   0xffffffff80bc08dd <+157>:	test   %eax,%eax

2230		if (linker_file_lookup_set(lf, MDT_SETNAME, &start, &stop,
   0xffffffff80bc08df <+159>:	jne    0xffffffff80bc0af7 <linker_load_dependencies+695>

2232			return (0);
2233		for (mdp = start; mdp < stop; mdp++) {
   0xffffffff80bc08e5 <+165>:	mov    -0x38(%rbp),%r15
   0xffffffff80bc08e9 <+169>:	mov    -0x30(%rbp),%rdx
   0xffffffff80bc08ed <+173>:	cmp    %rdx,%r15
   0xffffffff80bc08f0 <+176>:	jb     0xffffffff80bc0a76 <linker_load_dependencies+566>

2244				return (EEXIST);
2245			}
2246		}
2247	
2248		for (mdp = start; mdp < stop; mdp++) {
   0xffffffff80bc08f6 <+182>:	cmp    %rdx,%r15
   0xffffffff80bc08f9 <+185>:	jae    0xffffffff80bc0aea <linker_load_dependencies+682>
   0xffffffff80bc08ff <+191>:	mov    %r14,-0x40(%rbp)
   0xffffffff80bc0903 <+195>:	jmp    0xffffffff80bc0941 <linker_load_dependencies+257>

2269				linker_file_add_dependency(lf, lfdep);
2270				continue;
2271			}
2272			error = linker_load_module(NULL, modname, lf, verinfo, NULL);
   0xffffffff80bc0905 <+197>:	xor    %edi,%edi
   0xffffffff80bc0907 <+199>:	mov    %r12,%rsi
   0xffffffff80bc090a <+202>:	mov    -0x40(%rbp),%r14
   0xffffffff80bc090e <+206>:	mov    %r14,%rdx
   0xffffffff80bc0911 <+209>:	mov    %r13,%rcx
   0xffffffff80bc0914 <+212>:	xor    %r8d,%r8d
   0xffffffff80bc0917 <+215>:	call   0xffffffff80bbd3f0 <linker_load_module>

2273			if (error) {
   0xffffffff80bc091c <+220>:	test   %eax,%eax
   0xffffffff80bc091e <+222>:	jne    0xffffffff80bc0b17 <linker_load_dependencies+727>
   0xffffffff80bc0924 <+228>:	data16 data16 cs nopw 0x0(%rax,%rax,1)

2248		for (mdp = start; mdp < stop; mdp++) {
   0xffffffff80bc0930 <+240>:	add    $0x8,%r15
   0xffffffff80bc0934 <+244>:	mov    -0x30(%rbp),%rdx
   0xffffffff80bc0938 <+248>:	cmp    %rdx,%r15
   0xffffffff80bc093b <+251>:	jae    0xffffffff80bc0ae6 <linker_load_dependencies+678>

2249			mp = *mdp;
   0xffffffff80bc0941 <+257>:	mov    (%r15),%rax

2250			if (mp->md_type != MDT_DEPEND)
   0xffffffff80bc0944 <+260>:	cmpl   $0x1,0x4(%rax)
   0xffffffff80bc0948 <+264>:	jne    0xffffffff80bc0930 <linker_load_dependencies+240>

2253			verinfo = mp->md_data;
   0xffffffff80bc094a <+266>:	mov    0x8(%rax),%r13

2252			modname = mp->md_cval;
   0xffffffff80bc094e <+270>:	mov    0x10(%rax),%r12

2254			nmodname = NULL;
2255			for (nmdp = start; nmdp < stop; nmdp++) {
   0xffffffff80bc0952 <+274>:	mov    -0x38(%rbp),%rbx
   0xffffffff80bc0956 <+278>:	jmp    0xffffffff80bc0964 <linker_load_dependencies+292>
   0xffffffff80bc0958 <+280>:	nopl   0x0(%rax,%rax,1)
   0xffffffff80bc0960 <+288>:	add    $0x8,%rbx
   0xffffffff80bc0964 <+292>:	cmp    %rdx,%rbx
   0xffffffff80bc0967 <+295>:	jae    0xffffffff80bc0990 <linker_load_dependencies+336>

2256				nmp = *nmdp;
   0xffffffff80bc0969 <+297>:	mov    (%rbx),%rax

2257				if (nmp->md_type != MDT_VERSION)
   0xffffffff80bc096c <+300>:	cmpl   $0x3,0x4(%rax)
   0xffffffff80bc0970 <+304>:	jne    0xffffffff80bc0960 <linker_load_dependencies+288>

2258					continue;
2259				nmodname = nmp->md_cval;
   0xffffffff80bc0972 <+306>:	mov    0x10(%rax),%rsi

2260				if (strcmp(modname, nmodname) == 0)
   0xffffffff80bc0976 <+310>:	mov    %r12,%rdi
   0xffffffff80bc0979 <+313>:	call   0xffffffff80cf0100 <strcmp>

2261					break;
2262			}
2263			if (nmdp < stop)/* early exit, it's a self reference */
   0xffffffff80bc097e <+318>:	mov    -0x30(%rbp),%rdx

2260				if (strcmp(modname, nmodname) == 0)
   0xffffffff80bc0982 <+322>:	test   %eax,%eax
   0xffffffff80bc0984 <+324>:	jne    0xffffffff80bc0960 <linker_load_dependencies+288>
   0xffffffff80bc0986 <+326>:	cs nopw 0x0(%rax,%rax,1)

2261					break;
2262			}
2263			if (nmdp < stop)/* early exit, it's a self reference */
   0xffffffff80bc0990 <+336>:	cmp    %rdx,%rbx
   0xffffffff80bc0993 <+339>:	jb     0xffffffff80bc0930 <linker_load_dependencies+240>
   0xffffffff80bc0995 <+341>:	mov    0x1484e0c(%rip),%r14        # 0xffffffff820457a8 <found_modules>

1501		if (verinfo == NULL)
   0xffffffff80bc099c <+348>:	test   %r13,%r13
   0xffffffff80bc099f <+351>:	je     0xffffffff80bc09b3 <linker_load_dependencies+371>
   0xffffffff80bc09a1 <+353>:	test   %r14,%r14

1502			return (modlist_lookup(name, 0));
1503		bestmod = NULL;
1504		TAILQ_FOREACH(mod, &found_modules, link) {
   0xffffffff80bc09a4 <+356>:	je     0xffffffff80bc0905 <linker_load_dependencies+197>
   0xffffffff80bc09aa <+362>:	xor    %ebx,%ebx
   0xffffffff80bc09ac <+364>:	jmp    0xffffffff80bc09e8 <linker_load_dependencies+424>
   0xffffffff80bc09ae <+366>:	xchg   %ax,%ax

1487		TAILQ_FOREACH(mod, &found_modules, link) {
   0xffffffff80bc09b0 <+368>:	mov    (%r14),%r14
   0xffffffff80bc09b3 <+371>:	test   %r14,%r14
   0xffffffff80bc09b6 <+374>:	je     0xffffffff80bc0905 <linker_load_dependencies+197>

1488			if (strcmp(mod->name, name) == 0 &&
   0xffffffff80bc09bc <+380>:	mov    0x18(%r14),%rdi
   0xffffffff80bc09c0 <+384>:	mov    %r12,%rsi
   0xffffffff80bc09c3 <+387>:	call   0xffffffff80cf0100 <strcmp>
   0xffffffff80bc09c8 <+392>:	test   %eax,%eax
   0xffffffff80bc09ca <+394>:	jne    0xffffffff80bc09b0 <linker_load_dependencies+368>
   0xffffffff80bc09cc <+396>:	mov    %r14,%rbx
   0xffffffff80bc09cf <+399>:	jmp    0xffffffff80bc0a23 <linker_load_dependencies+483>
   0xffffffff80bc09d1 <+401>:	mov    %r14,%rbx
   0xffffffff80bc09d4 <+404>:	data16 data16 cs nopw 0x0(%rax,%rax,1)

1502			return (modlist_lookup(name, 0));
1503		bestmod = NULL;
1504		TAILQ_FOREACH(mod, &found_modules, link) {
   0xffffffff80bc09e0 <+416>:	mov    (%r14),%r14
   0xffffffff80bc09e3 <+419>:	test   %r14,%r14
   0xffffffff80bc09e6 <+422>:	je     0xffffffff80bc0a1a <linker_load_dependencies+474>

1505			if (strcmp(mod->name, name) != 0)
   0xffffffff80bc09e8 <+424>:	mov    0x18(%r14),%rdi
   0xffffffff80bc09ec <+428>:	mov    %r12,%rsi
   0xffffffff80bc09ef <+431>:	call   0xffffffff80cf0100 <strcmp>
   0xffffffff80bc09f4 <+436>:	test   %eax,%eax
   0xffffffff80bc09f6 <+438>:	jne    0xffffffff80bc09e0 <linker_load_dependencies+416>

1506				continue;
1507			ver = mod->version;
   0xffffffff80bc09f8 <+440>:	mov    0x20(%r14),%eax

1508			if (ver == verinfo->md_ver_preferred)
   0xffffffff80bc09fc <+444>:	cmp    0x4(%r13),%eax
   0xffffffff80bc0a00 <+448>:	je     0xffffffff80bc09cc <linker_load_dependencies+396>

1509				return (mod);
1510			if (ver >= verinfo->md_ver_minimum &&
   0xffffffff80bc0a02 <+450>:	cmp    0x0(%r13),%eax
   0xffffffff80bc0a06 <+454>:	jl     0xffffffff80bc09e0 <linker_load_dependencies+416>

1511			    ver <= verinfo->md_ver_maximum &&
   0xffffffff80bc0a08 <+456>:	cmp    0x8(%r13),%eax
   0xffffffff80bc0a0c <+460>:	jg     0xffffffff80bc09e0 <linker_load_dependencies+416>

1512			    (bestmod == NULL || ver > bestmod->version))
   0xffffffff80bc0a0e <+462>:	test   %rbx,%rbx
   0xffffffff80bc0a11 <+465>:	je     0xffffffff80bc09d1 <linker_load_dependencies+401>
   0xffffffff80bc0a13 <+467>:	cmp    0x20(%rbx),%eax

1510			if (ver >= verinfo->md_ver_minimum &&
   0xffffffff80bc0a16 <+470>:	jg     0xffffffff80bc09d1 <linker_load_dependencies+401>
   0xffffffff80bc0a18 <+472>:	jmp    0xffffffff80bc09e0 <linker_load_dependencies+416>

2264				continue;
2265			mod = modlist_lookup2(modname, verinfo);
2266			if (mod) {	/* woohoo, it's loaded already */
   0xffffffff80bc0a1a <+474>:	test   %rbx,%rbx
   0xffffffff80bc0a1d <+477>:	je     0xffffffff80bc0905 <linker_load_dependencies+197>

2267				lfdep = mod->container;
   0xffffffff80bc0a23 <+483>:	mov    0x10(%rbx),%rbx

2268				lfdep->refs++;
   0xffffffff80bc0a27 <+487>:	incl   0x8(%rbx)
   0xffffffff80bc0a2a <+490>:	mov    -0x40(%rbp),%r14

773		file->deps = realloc(file->deps, (file->ndeps + 1) * sizeof(*newdeps),
   0xffffffff80bc0a2e <+494>:	mov    0x78(%r14),%rdi
   0xffffffff80bc0a32 <+498>:	mov    0x70(%r14),%eax
   0xffffffff80bc0a36 <+502>:	inc    %eax
   0xffffffff80bc0a38 <+504>:	movslq %eax,%rsi
   0xffffffff80bc0a3b <+507>:	shl    $0x3,%rsi
   0xffffffff80bc0a3f <+511>:	mov    $0xffffffff81cc6be0,%rdx
   0xffffffff80bc0a46 <+518>:	mov    $0x102,%ecx
   0xffffffff80bc0a4b <+523>:	call   0xffffffff80bc84f0 <realloc>
   0xffffffff80bc0a50 <+528>:	mov    %rax,0x78(%r14)

774		    M_LINKER, M_WAITOK | M_ZERO);
775		file->deps[file->ndeps] = dep;
   0xffffffff80bc0a54 <+532>:	movslq 0x70(%r14),%rcx
   0xffffffff80bc0a58 <+536>:	mov    %rbx,(%rax,%rcx,8)

776		file->ndeps++;
   0xffffffff80bc0a5c <+540>:	incl   0x70(%r14)
   0xffffffff80bc0a60 <+544>:	jmp    0xffffffff80bc0930 <linker_load_dependencies+240>

2232			return (0);
2233		for (mdp = start; mdp < stop; mdp++) {
   0xffffffff80bc0a65 <+549>:	mov    -0x30(%rbp),%rdx
   0xffffffff80bc0a69 <+553>:	add    $0x8,%r15
   0xffffffff80bc0a6d <+557>:	cmp    %rdx,%r15
   0xffffffff80bc0a70 <+560>:	jae    0xffffffff80bc0b08 <linker_load_dependencies+712>

2234			mp = *mdp;
   0xffffffff80bc0a76 <+566>:	mov    (%r15),%rax

2235			if (mp->md_type != MDT_VERSION)
   0xffffffff80bc0a79 <+569>:	cmpl   $0x3,0x4(%rax)
   0xffffffff80bc0a7d <+573>:	jne    0xffffffff80bc0a69 <linker_load_dependencies+553>

1487		TAILQ_FOREACH(mod, &found_modules, link) {
   0xffffffff80bc0a7f <+575>:	mov    0x1484d22(%rip),%rbx        # 0xffffffff820457a8 <found_modules>
   0xffffffff80bc0a86 <+582>:	test   %rbx,%rbx
   0xffffffff80bc0a89 <+585>:	je     0xffffffff80bc0a69 <linker_load_dependencies+553>
   0xffffffff80bc0a8b <+587>:	mov    0x8(%rax),%rcx
   0xffffffff80bc0a8f <+591>:	mov    0x10(%rax),%r12
   0xffffffff80bc0a93 <+595>:	mov    (%rcx),%r13d
   0xffffffff80bc0a96 <+598>:	jmp    0xffffffff80bc0aa8 <linker_load_dependencies+616>
   0xffffffff80bc0a98 <+600>:	nopl   0x0(%rax,%rax,1)
   0xffffffff80bc0aa0 <+608>:	mov    (%rbx),%rbx
   0xffffffff80bc0aa3 <+611>:	test   %rbx,%rbx
   0xffffffff80bc0aa6 <+614>:	je     0xffffffff80bc0a65 <linker_load_dependencies+549>

1488			if (strcmp(mod->name, name) == 0 &&
   0xffffffff80bc0aa8 <+616>:	mov    0x18(%rbx),%rdi
   0xffffffff80bc0aac <+620>:	mov    %r12,%rsi
   0xffffffff80bc0aaf <+623>:	call   0xffffffff80cf0100 <strcmp>
   0xffffffff80bc0ab4 <+628>:	test   %eax,%eax
   0xffffffff80bc0ab6 <+630>:	jne    0xffffffff80bc0aa0 <linker_load_dependencies+608>
   0xffffffff80bc0ab8 <+632>:	test   %r13d,%r13d

1489			    (ver == 0 || mod->version == ver))
   0xffffffff80bc0abb <+635>:	je     0xffffffff80bc0ac3 <linker_load_dependencies+643>
   0xffffffff80bc0abd <+637>:	cmp    %r13d,0x20(%rbx)

1488			if (strcmp(mod->name, name) == 0 &&
   0xffffffff80bc0ac1 <+641>:	jne    0xffffffff80bc0aa0 <linker_load_dependencies+608>

2242				    " '%s'!\n", modname, ver,
2243				    mod->container->filename);
   0xffffffff80bc0ac3 <+643>:	mov    0x10(%rbx),%rax
   0xffffffff80bc0ac7 <+647>:	mov    0x28(%rax),%rcx

2241				printf("interface %s.%d already present in the KLD"
   0xffffffff80bc0acb <+651>:	mov    $0xffffffff811fbd3e,%rdi
   0xffffffff80bc0ad2 <+658>:	mov    %r12,%rsi
   0xffffffff80bc0ad5 <+661>:	mov    %r13d,%edx
   0xffffffff80bc0ad8 <+664>:	xor    %eax,%eax
   0xffffffff80bc0ada <+666>:	call   0xffffffff80c42bc0 <printf>
   0xffffffff80bc0adf <+671>:	mov    $0x11,%ebx
   0xffffffff80bc0ae4 <+676>:	jmp    0xffffffff80bc0af7 <linker_load_dependencies+695>

2276				break;
2277			}
2278		}
2279	
2280		if (error)
2281			return (error);
2282		linker_addmodules(lf, start, stop, 0);
   0xffffffff80bc0ae6 <+678>:	mov    -0x38(%rbp),%r15
   0xffffffff80bc0aea <+682>:	mov    %r14,%rdi
   0xffffffff80bc0aed <+685>:	mov    %r15,%rsi
   0xffffffff80bc0af0 <+688>:	call   0xffffffff80bc0b80 <linker_addmodules>
   0xffffffff80bc0af5 <+693>:	xor    %ebx,%ebx

2283		return (error);
2284	}
   0xffffffff80bc0af7 <+695>:	mov    %ebx,%eax
   0xffffffff80bc0af9 <+697>:	add    $0x18,%rsp
   0xffffffff80bc0afd <+701>:	pop    %rbx
   0xffffffff80bc0afe <+702>:	pop    %r12
   0xffffffff80bc0b00 <+704>:	pop    %r13
   0xffffffff80bc0b02 <+706>:	pop    %r14
   0xffffffff80bc0b04 <+708>:	pop    %r15
   0xffffffff80bc0b06 <+710>:	pop    %rbp
   0xffffffff80bc0b07 <+711>:	ret

2248		for (mdp = start; mdp < stop; mdp++) {
   0xffffffff80bc0b08 <+712>:	mov    -0x38(%rbp),%r15
   0xffffffff80bc0b0c <+716>:	cmp    %rdx,%r15
   0xffffffff80bc0b0f <+719>:	jb     0xffffffff80bc08ff <linker_load_dependencies+191>
   0xffffffff80bc0b15 <+725>:	jmp    0xffffffff80bc0aea <linker_load_dependencies+682>

2275				    " version mismatch\n", lf->filename, modname);
   0xffffffff80bc0b17 <+727>:	mov    0x28(%r14),%rsi

2274				printf("KLD %s: depends on %s - not available or"
   0xffffffff80bc0b1b <+731>:	mov    $0xffffffff81223a3c,%rdi
   0xffffffff80bc0b22 <+738>:	mov    %r12,%rdx
   0xffffffff80bc0b25 <+741>:	mov    %eax,%ebx
   0xffffffff80bc0b27 <+743>:	xor    %eax,%eax
   0xffffffff80bc0b29 <+745>:	call   0xffffffff80c42bc0 <printf>
   0xffffffff80bc0b2e <+750>:	jmp    0xffffffff80bc0af7 <linker_load_dependencies+695>
End of assembler dump.
(kgdb)
Comment 221 Mark Millard 2024-12-12 21:15:56 UTC
(In reply to George Mitchell from comment #220)

+0x274 == +628 (at 0xffffffff80bc0ab4)

1488                    if (strcmp(mod->name, name) == 0 &&
  0xffffffff80bc0aa8 <+616>:   mov    0x18(%rbx),%rdi
  0xffffffff80bc0aac <+620>:   mov    %r12,%rsi
  0xffffffff80bc0aaf <+623>:   call   0xffffffff80cf0100 <strcmp>
  0xffffffff80bc0ab4 <+628>:   test   %eax,%eax
  0xffffffff80bc0ab6 <+630>:   jne    0xffffffff80bc0aa0
<linker_load_dependencies+608>

I expect that kgdb would expose strcmp in the backtrace, as
it does a better job generally. (The original Fatal Trap
notice reported the address in strcmp as well.)

This suggests problems for the value in mp->md_cval (via modname)
as of the strcmp call (based on where the Fatal Trap was reported
to be at in strcmp).

Note the mod->name argument to strcmp .

Note that if mod itself could not be dereferenced, the failure would
be before strcmp was called. Thus it looks to be the ->name value
that strcmp ends up using that led to the failure.

Same sort of thing as for comment #82 (but at a separate inlined call site).

in strcmp it showed:

   0xffffffff80cf0110 <+16>:	movzbl (%rdi,%rcx,1),%eax

So if you ever get a chance to have it report the %rdi value
on an actual failure, the value may prove interesting.

Of course, none of this gets far enough to suggest why the
value of mod->name would be messed up.
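
A minimal sketch of what that could look like in kgdb against a crash
dump (the frame number and per-frame register availability are
assumptions; kgdb cannot always recover every register for every frame):

(kgdb) bt
. . .
(kgdb) frame N                    # the strcmp / modlist_lookup frame from bt
(kgdb) info registers rdi rcx     # rdi = mod->name, rcx = index within strcmp
(kgdb) x/s $rdi                   # only works if the pointer is readable at all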
Comment 222 Mark Millard 2024-12-12 23:20:54 UTC
(In reply to Mark Millard from comment #221)

References to "mp->md_cval (via modname)" should have
been to mod->name instead.
Comment 223 George Mitchell 2024-12-13 00:17:35 UTC
Created attachment 255825 [details]
Latest crash dump/text

Setting dumpdev="/dev/ada0p3" was enough to get an actual dump in single-user mode, and it's attached.  I marked all previous attachments obsolete, but of course they are still there if someone wants to look at them.  My core file is too big to attach, even compressed, but I can make it available if you think it will actually be more helpful than the core.txt.8 file that I've attached.
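
(For anyone following along, the knobs involved are roughly these; the
device name is specific to this machine, and crashinfo(8) is what
produces the core.txt.N summaries from the saved vmcore.N:)

# /etc/rc.conf
dumpdev="/dev/ada0p3"   # swap/dump partition; "AUTO" is also accepted
dumpdir="/var/crash"    # where savecore(8) writes vmcore.N at next boot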
Comment 224 Mark Millard 2024-12-13 01:52:22 UTC
(In reply to George Mitchell from comment #223)

Cool.

System core files tend to have information one does not
want to publish. You might want any transfers to be in
a more secure person-to-person form instead of being
public, possibly via encryption.

But it gets messier overall: one would need the kernel
/boot/kernel/* files and the /usr/lib/debug/boot/kernel/*.debug
files if one is not also running a matching 13.4-RELEASE-p?
build someplace. (kgdb uses the information in these files as
well.)

There are no simple reference copies of those files to
download for use in analyzing a system crash file as far as
I know --given it is a patched update that is in use.

For now, I'll probably just ask you to use
kgdb, the kernel file, the core file, and the implicit
*.debug and other files to report some things if I come up
with questions.
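
(For what it is worth, if the /boot/kernel files, the matching
/usr/lib/debug/boot/kernel/*.debug files, and the vmcore are copied
somewhere, kgdb can be pointed at the copies explicitly; the paths
below are placeholders and the exact invocation may need adjusting:)

# kgdb /path/to/copy/boot/kernel/kernel /path/to/copy/vmcore.N
(kgdb) set debug-file-directory /path/to/copy/usr/lib/debug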
Comment 225 Mark Millard 2024-12-13 02:27:50 UTC
(In reply to George Mitchell from comment #223)

If you happen to get other examples that list alternatives
to zfsctrl and zfs.ko in the likes of:

#8  0xffffffff80bc0ab4 in modlist_lookup (name=0xffffffff83255959 "zfsctrl", 
    ver=1) at /usr/src/sys/kern/kern_linker.c:1488

or in the likes of:

#14 0xffffffff80bbfa04 in kern_kldload (td=td@entry=0xfffffe0080377e00, 
    file=file@entry=0xfffff800045da000 "zfs.ko", 
    fileid=fileid@entry=0xfffffe0075f8ede4)
    at /usr/src/sys/kern/kern_linker.c:1149
        lf = 0x258800000000
        error = 0
        modname = 0xffffffff83255959 "zfsctrl"
        kldname = 0xfffff800045da000 "zfs.ko"

It likely would be good to capture them for future reference as
well.
Comment 226 George Mitchell 2024-12-13 02:32:37 UTC
Most of my effort has been toward discovering how to make this bug bite me less often.  But as you might guess from the summary of the bug, there have been plenty of instances where "zfs" was replaced with "acpi_wmi" or "vboxnetflt" with otherwise similar contexts.
Comment 227 Mark Millard 2024-12-13 03:28:34 UTC
(In reply to George Mitchell from comment #226)

"zfsctrl" is new compared to past examples as far as I know.
Even for "zfs" and "acpi_wmi" I think it would be good to
have an example from the 13.4-RELEASE-p? in use. I'll note
that "zfsctrl" vs. "zfs" are not equivalent: different code
path and call-chain, or so it appears. I can identify where
"zfsctrl" is from --but cannot for the historical "zfs".

Other new strings and/or *.ko files would be good too.

In part what I'm looking for is the earliest example to occur.
In other respects, the variability indicates a race or something
is involved in whatever the original corruption is: not always
failing in the same place.

Note that in the below sequence, acpi_wmi.ko would happen
later. Does it ever crash before zfs.ko ends up hitting a
corruption? (I expect that the below list is in the order
of the *.ko loads but am not sure.)

Never-fails up to the first sometimes-fails puts some
sort of bounds on things. (A quick way to confirm the load
order on a successful boot is sketched after the list below.)

Reading symbols from /boot/kernel/fusefs.ko...
Reading symbols from /boot/kernel/sem.ko...
Reading symbols from /boot/modules/if_re.ko...
Reading symbols from /boot/modules/amdgpu.ko...
Reading symbols from /boot/modules/drm.ko...
Reading symbols from /boot/kernel/iic.ko...
Reading symbols from /boot/modules/linuxkpi_gplv2.ko...
Reading symbols from /boot/modules/dmabuf.ko...
Reading symbols from /boot/modules/ttm.ko...
Reading symbols from /boot/modules/amdgpu_raven_gpu_info_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_sdma_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_asd_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_ta_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_pfp_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_me_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_ce_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_rlc_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_mec_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_mec2_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_vcn_bin.ko...
Reading symbols from /boot/kernel/zfs.ko...
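
As a side note, a quick way to confirm the load order on a boot that
survives (just a suggestion, not something already tried here):

# kldstat                     # ids are assigned in load order
# kldstat -v | grep -i raven  # per-file module names, e.g. the *_bin_fw entries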
Comment 228 Mark Millard 2024-12-13 06:14:49 UTC
In kgdb, based on the kernel and system crash core file (and related
files), it should be possible to do a sequence like:

(kgdb) print *found_modules->tqh_first
$51 = {link = {tqe_next = 0xfffff8010175d340, tqe_prev = 0xffffffff81b8e218 <found_modules>}, container = 0xfffff80101918c00, name = 0xffffffff81113803 "cam", version = 1}
(kgdb) print *found_modules->tqh_first->link->tqe_next
$52 = {link = {tqe_next = 0xfffff8010175d300, tqe_prev = 0xfffff8010175d380}, container = 0xfffff80101918c00, name = 0xffffffff811e1b57 "xz", version = 1}
(kgdb) print *found_modules->tqh_first->link->tqe_next->link->tqe_next
$53 = {link = {tqe_next = 0xfffff8010175d2c0, tqe_prev = 0xfffff8010175d340}, container = 0xfffff80101918c00, name = 0xffffffff8123ecdc "acpi", version = 1}
. . .
until the problematic name field is shown (bad
pointer or non-terminated string).

This should allow reporting what the last good name is and
what the failing example looks like.
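
To avoid typing ever-longer ->link->tqe_next chains by hand, a small
kgdb command loop along these lines should walk the list until it
reaches the end or a node that cannot be printed (untested sketch):

(kgdb) set $m = found_modules.tqh_first
(kgdb) while ($m != 0)
 >print *$m
 >set $m = $m->link.tqe_next
 >end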
Comment 229 Mark Millard 2024-12-15 06:01:45 UTC
I used freebsd-update fetch and then freebsd-update install
to get a 13.4-RELEASE-p2 (so: a 13.4-RELEASE-p1 kernel)
based on a 13.4-RELEASE install.

However, the result led to kgdb reporting:

warning: the debug information found in "/usr/lib/debug//boot/kernel/kernel.debug" does not match "/boot/kernel/kernel" (CRC mismatch).

That was because freebsd-update did not update:

usr/lib/debug/boot/kernel/kernel.debug

but did update boot/kernel/kernel .

As it stands, it does not look like I can reproduce even that much
of your environment. It looks like I'll continue to be limited to
you reporting the results of experiments done in your
environment. It does not look like my having a crash-core file
would do much good.

# strings 13.4R*/boot/kernel/kernel | grep 13.4-RELEASE
@(#)FreeBSD 13.4-RELEASE releng/13.4-n258257-58066db597be GENERIC
FreeBSD 13.4-RELEASE releng/13.4-n258257-58066db597be GENERIC
13.4-RELEASE
@(#)FreeBSD 13.4-RELEASE-p1 GENERIC
FreeBSD 13.4-RELEASE-p1 GENERIC
13.4-RELEASE-p1
Comment 230 George Mitchell 2024-12-15 13:21:42 UTC
On top of which I use SCHED_4BSD.

However, I'm happy to give you access to my /usr/lib/debug/boot/kernel/kernel.debug and
/boot/kernel/kernel.
Comment 231 George Mitchell 2024-12-15 18:41:47 UTC
Mark points out I should specifically post this, since I am not using a stock distribution of the kernel:

diff -u sys/amd64/conf/{GENERIC,M5P}
--- sys/amd64/conf/GENERIC    2024-07-03 16:23:56.252550000 -0400
+++ sys/amd64/conf/M5P    2024-07-03 16:25:05.287604000 -0400
@@ -18,12 +18,13 @@
 #

 cpu        HAMMER
-ident        GENERIC
+ident        M5P

 makeoptions    DEBUG=-g        # Build kernel with gdb(1) debug symbols
 makeoptions    WITH_CTF=1        # Run ctfconvert(1) for DTrace support

-options     SCHED_ULE        # ULE scheduler
+#options     SCHED_ULE        # ULE scheduler
+options     SCHED_4BSD        # 4BSD scheduler
 options     NUMA            # Non-Uniform Memory Architecture support
 options     PREEMPTION        # Enable kernel thread preemption
 options     VIMAGE            # Subsystem virtualization, e.g. VNET
Comment 232 Mark Millard 2024-12-15 20:05:04 UTC
(In reply to Mark Millard from comment #228)

My crude traversal of the long list of nodes ends with the
sequence:

(kgdb) print *found_modules->tqh_first->link->tqe_next . . .->link->tqe_next
$206 = {link = {tqe_next = 0xfffff8000465bc80, tqe_prev = 0xfffff80004607c40}, container = 0xfffff80004b29a80, name = 0xffffffff829f801d "amdgpu_raven_ce_bin_fw", version = 1}
(kgdb) print *found_modules->tqh_first->link->tqe_next . . .->link->tqe_next
$207 = {link = {tqe_next = 0xfffff8000465bbc0, tqe_prev = 0xfffff80004607b80}, container = 0xfffff80004b29780, name = 0xffffffff82e12000 <mmhub_client_ids_vega20> "amdgpu_raven_rlc_bin_fw", 
  version = 1}
(kgdb) print *found_modules->tqh_first->link->tqe_next . . .->link->tqe_next
$208 = {link = {tqe_next = 0xfffff80004607a00, tqe_prev = 0xfffff8000465bc80}, container = 0xfffff80003868c00, name = 0xffffffff82e1e000 <xgpu_fiji_mgcg_cgcg_init+368> "amdgpu_raven_mec_bin_fw", 
  version = 1}
(kgdb) print *found_modules->tqh_first->link->tqe_next->link->tqe_next . . .->link->tqe_next
$209 = {link = {tqe_next = 0xfffff80000000007, tqe_prev = 0xfffff8000465bbc0}, container = 0xfffff80004b29600, name = 0xffffffff82e62026 <se_mask+242> "amdgpu_raven_mec2_bin_fw", version = 1}
(kgdb) print *found_modules->tqh_first->link->tqe_next->link->tqe_next . . .->link->tqe_next
$210 = {link = {tqe_next = 0xeef3f000e2c3f0, tqe_prev = 0xff54f000eef3f0}, container = 0x322ff0003287f0, name = 0xe987f000fea5f0 <error: Cannot access memory at address 0xe987f000fea5f0>, 
  version = 15660016}

The ones that also show prefix <...> text like:

<mmhub_client_ids_vega20> "amdgpu_raven_rlc_bin_fw"
<xgpu_fiji_mgcg_cgcg_init+368> "amdgpu_raven_mec_bin_fw"
<se_mask+242> "amdgpu_raven_mec2_bin_fw"

are not the first ones to do so. Also note the duplication of "amdgpu_raven_mec2_bin_fw".

Also note the tqe_next = 0xfffff80000000007 that, when dereferenced,
ends up as clear garbage for the purposes of the list:

link = {tqe_next = 0xeef3f000e2c3f0, tqe_prev = 0xff54f000eef3f0}, container = 0x322ff0003287f0, name = 0xe987f000fea5f0 <error: Cannot access memory at address 0xe987f000fea5f0>, 
  version = 15660016}

For reference:

(kgdb) print &mmhub_client_ids_vega20
$211 = (<data variable, no debug info> *) 0xffffffff82e12000 <mmhub_client_ids_vega20>
(kgdb) print &xgpu_fiji_mgcg_cgcg_init
$212 = (<data variable, no debug info> *) 0xffffffff82e1de90 <xgpu_fiji_mgcg_cgcg_init>
(kgdb) print &se_mask
$213 = (<data variable, no debug info> *) 0xffffffff82e41d7c <se_mask>

Those addresses are in the .rodata for /boot/modules/amdgpu.ko :

	0xffffffff81d82880 - 0xffffffff82200000 is .bss
	0xffffffff82a00000 - 0xffffffff82d9a000 is .text in /boot/modules/amdgpu.ko
	0xffffffff82d9a000 - 0xffffffff82eea000 is .rodata in /boot/modules/amdgpu.ko
	0xffffffff82eea000 - 0xffffffff82ef7948 is .bss in /boot/modules/amdgpu.ko
	0xffffffff82ef7950 - 0xffffffff82f064b8 is .data in /boot/modules/amdgpu.ko
	0xffffffff82f064b8 - 0xffffffff82f068d0 is set_sysctl_set in /boot/modules/amdgpu.ko
	0xffffffff82f068d0 - 0xffffffff82f068f8 is set_sysinit_set in /boot/modules/amdgpu.ko
	0xffffffff82f068f8 - 0xffffffff82f06908 is set_sysuninit_set in /boot/modules/amdgpu.ko
	0xffffffff82f06908 - 0xffffffff82f06958 is set_modmetadata_set in /boot/modules/amdgpu.ko
	0xffffffff82f06958 - 0xffffffff82f0697c is .note.gnu.build-id in /boot/modules/amdgpu.ko

That matches up with the node with:

link = {tqe_next = 0xfffff8000465bc80

referencing:

$207 = {link = {tqe_next = 0xfffff8000465bbc0, tqe_prev = 0xfffff80004607b80}, container = 0xfffff80004b29780, name = 0xffffffff82e12000 <mmhub_client_ids_vega20> "amdgpu_raven_rlc_bin_fw"

where name has the address 0xffffffff82e12000 .

I'll note that the name = 0xffffffff829f801d "amdgpu_raven_ce_bin_fw" before
the oddities lands between the .bss for the kernel and the .text
for /boot/modules/amdgpu.ko (not in either one):

	0xffffffff81d82880 - 0xffffffff82200000 is .bss
	0xffffffff82a00000 - 0xffffffff82d9a000 is .text in /boot/modules/amdgpu.ko
. . .
	0xffffffff829a3360 - 0xffffffff829a3384 is .note.gnu.build-id in /boot/modules/ttm.ko

For reference, the first node's name field has:

name = 0xffffffff81184803 "cam"

That is in the kernel's .rodata :

	0xffffffff81184400 - 0xffffffff817f68d0 is .rodata

Earlier in the list there are:

name = 0xffffffff8298b0de <drm_ioctls+350> "iic"
and:
name = 0xffffffff8298f24f <orientation_data+6415> "linuxkpi_gplv2"

in:

	0xffffffff82973000 - 0xffffffff82991000 is .rodata in /boot/modules/drm.ko

name = 0xffffffff829a21c2 <global_write_combined+370> "ttm"

in:

	0xffffffff829a2000 - 0xffffffff829a2eb0 is .bss in /boot/modules/ttm.ko

(Not in the just-prior .rodata for /boot/modules/ttm.ko .)

For reference:

$198 = {link = {tqe_next = 0xfffff80003904d00, tqe_prev = 0xfffff8000465a1c0}, container = 0xfffff8000464b180, name = 0xffffffff8297644b "drmn", version = 2}

has its tqe_next pointing to the ttm using .bss for the name string:

$199 = {link = {tqe_next = 0xfffff8000465bd00, tqe_prev = 0xfffff8000465a3c0}, container = 0xfffff8000469da80, name = 0xffffffff829a21c2 <global_write_combined+370> "ttm", version = 1}

I will note that /boot/modules/ttm.ko is the last (most recent) to
show up in the "info file" kgdb output:

(kgdb) info file
Symbols from "/usr/home/root/failing-kernel-files/usr/lib/debug/boot/kernel/kernel.debug".
Kernel core dump file:
	`/usr/home/root/failing-kernel-files/vmcore.8', file type FreeBSD kernel vmcore.
Local exec file:
	`/usr/home/root/failing-kernel-files/boot/kernel/kernel', file type elf64-x86-64-freebsd.
	Entry point: 0xffffffff8038e000
	0xffffffff802002a8 - 0xffffffff802002b5 is .interp
	0xffffffff802002b8 - 0xffffffff80231108 is .hash
	0xffffffff80231108 - 0xffffffff8025f9e4 is .gnu.hash
	0xffffffff8025f9e8 - 0xffffffff802f24c0 is .dynsym
	0xffffffff802f24c0 - 0xffffffff8036d162 is .dynstr
	0xffffffff8036d168 - 0xffffffff8038db08 is .rela.dyn
	0xffffffff8038e000 - 0xffffffff811843f8 is .text
	0xffffffff81184400 - 0xffffffff817f68d0 is .rodata
	0xffffffff817f68d0 - 0xffffffff817fba38 is set_sysctl_set
	0xffffffff817fba38 - 0xffffffff817fef60 is set_modmetadata_set
	0xffffffff817fef60 - 0xffffffff817fefb8 is set_cam_xpt_xport_set
	0xffffffff817fefb8 - 0xffffffff817fefe0 is set_cam_xpt_proto_set
	0xffffffff817fefe0 - 0xffffffff817ff028 is set_ah_chips
	0xffffffff817ff028 - 0xffffffff817ff078 is set_ah_rfs
	0xffffffff817ff078 - 0xffffffff817ff098 is set_kbddriver_set
	0xffffffff817ff098 - 0xffffffff817ff150 is set_sdt_providers_set
	0xffffffff817ff150 - 0xffffffff81800268 is set_sdt_probes_set
	0xffffffff81800268 - 0xffffffff818035c8 is set_sdt_argtypes_set
	0xffffffff818035c8 - 0xffffffff818035e0 is set_scterm_set
	0xffffffff818035e0 - 0xffffffff81803608 is set_cons_set
	0xffffffff81803608 - 0xffffffff81803610 is set_uart_acpi_class_and_device_set
	0xffffffff81803620 - 0xffffffff81803660 is usb_host_id
	0xffffffff81803660 - 0xffffffff81803680 is set_vt_drv_set
	0xffffffff81803680 - 0xffffffff818036a8 is set_elf64_regset
	0xffffffff818036a8 - 0xffffffff818036d8 is set_elf32_regset
	0xffffffff818036d8 - 0xffffffff818036e8 is set_compressors
	0xffffffff818036e8 - 0xffffffff818036f0 is set_kdb_dbbe_set
	0xffffffff818036f0 - 0xffffffff81803700 is set_ratectl_set
	0xffffffff81803700 - 0xffffffff81803718 is set_crypto_set
	0xffffffff81803718 - 0xffffffff81803730 is set_ieee80211_ioctl_getset
	0xffffffff81803730 - 0xffffffff81803748 is set_ieee80211_ioctl_setset
	0xffffffff81803748 - 0xffffffff81803770 is set_scanner_set
	0xffffffff81803770 - 0xffffffff81803790 is set_videodriver_set
	0xffffffff81803790 - 0xffffffff818037d8 is set_scrndr_set
	0xffffffff818037d8 - 0xffffffff81803820 is set_vga_set
	0xffffffff81803820 - 0xffffffff81804881 is kern_conf
	0xffffffff81804884 - 0xffffffff818048a8 is .note.gnu.build-id
	0xffffffff818048a8 - 0xffffffff8180493c is .eh_frame
	0xffffffff81a00000 - 0xffffffff81a00140 is .dynamic
	0xffffffff81a00140 - 0xffffffff81a01000 is .relro_padding
	0xffffffff81c00000 - 0xffffffff81c00035 is .data.read_frequently
	0xffffffff81c00040 - 0xffffffff81c017f4 is .data.read_mostly
	0xffffffff81c01800 - 0xffffffff81c07680 is .data.exclusive_cache_line
	0xffffffff81c08000 - 0xffffffff81d51248 is .data
	0xffffffff81d51248 - 0xffffffff81d54688 is set_sysinit_set
	0xffffffff81d54688 - 0xffffffff81d55e48 is set_sysuninit_set
	0xffffffff81d55e80 - 0xffffffff81d592e8 is set_pcpu
	0xffffffff81d592f0 - 0xffffffff81d82851 is set_vnet
	0xffffffff81d82880 - 0xffffffff82200000 is .bss
	0xffffffff82a00000 - 0xffffffff82d9a000 is .text in /boot/modules/amdgpu.ko
	0xffffffff82d9a000 - 0xffffffff82eea000 is .rodata in /boot/modules/amdgpu.ko
	0xffffffff82eea000 - 0xffffffff82ef7948 is .bss in /boot/modules/amdgpu.ko
	0xffffffff82ef7950 - 0xffffffff82f064b8 is .data in /boot/modules/amdgpu.ko
	0xffffffff82f064b8 - 0xffffffff82f068d0 is set_sysctl_set in /boot/modules/amdgpu.ko
	0xffffffff82f068d0 - 0xffffffff82f068f8 is set_sysinit_set in /boot/modules/amdgpu.ko
	0xffffffff82f068f8 - 0xffffffff82f06908 is set_sysuninit_set in /boot/modules/amdgpu.ko
	0xffffffff82f06908 - 0xffffffff82f06958 is set_modmetadata_set in /boot/modules/amdgpu.ko
	0xffffffff82f06958 - 0xffffffff82f0697c is .note.gnu.build-id in /boot/modules/amdgpu.ko
	0xffffffff82918000 - 0xffffffff82973000 is .text in /boot/modules/drm.ko
	0xffffffff82973000 - 0xffffffff82991000 is .rodata in /boot/modules/drm.ko
	0xffffffff82991000 - 0xffffffff829911e0 is .bss in /boot/modules/drm.ko
	0xffffffff829911e0 - 0xffffffff82992df8 is .data in /boot/modules/drm.ko
	0xffffffff82992df8 - 0xffffffff82992e80 is set_sysinit_set in /boot/modules/drm.ko
	0xffffffff82992e80 - 0xffffffff82992ef0 is set_sysuninit_set in /boot/modules/drm.ko
	0xffffffff82992ef0 - 0xffffffff82992fc0 is set_sysctl_set in /boot/modules/drm.ko
	0xffffffff82992fc0 - 0xffffffff82992fcc is .data.read_mostly in /boot/modules/drm.ko
	0xffffffff82992fd0 - 0xffffffff82993050 is set_modmetadata_set in /boot/modules/drm.ko
	0xffffffff82993050 - 0xffffffff82993074 is .note.gnu.build-id in /boot/modules/drm.ko
	0xffffffff8298d000 - 0xffffffff8298d000 is .text in /boot/modules/linuxkpi_gplv2.ko
	0xffffffff8298d000 - 0xffffffff8298e000 is .rodata in /boot/modules/linuxkpi_gplv2.ko
	0xffffffff8298e000 - 0xffffffff8298e0d0 is .data in /boot/modules/linuxkpi_gplv2.ko
	0xffffffff8298e0d0 - 0xffffffff8298e100 is set_modmetadata_set in /boot/modules/linuxkpi_gplv2.ko
	0xffffffff8298e100 - 0xffffffff8298e124 is .note.gnu.build-id in /boot/modules/linuxkpi_gplv2.ko
	0xffffffff82991000 - 0xffffffff82996000 is .text in /boot/modules/dmabuf.ko
	0xffffffff82996000 - 0xffffffff82997000 is .rodata in /boot/modules/dmabuf.ko
	0xffffffff82997000 - 0xffffffff82997280 is .data in /boot/modules/dmabuf.ko
	0xffffffff82997280 - 0xffffffff82997290 is set_modmetadata_set in /boot/modules/dmabuf.ko
	0xffffffff82997290 - 0xffffffff829972a8 is set_sysinit_set in /boot/modules/dmabuf.ko
	0xffffffff829972a8 - 0xffffffff829972c0 is set_sysuninit_set in /boot/modules/dmabuf.ko
	0xffffffff829972c0 - 0xffffffff82997358 is .bss in /boot/modules/dmabuf.ko
	0xffffffff82997358 - 0xffffffff8299737c is .note.gnu.build-id in /boot/modules/dmabuf.ko
	0xffffffff82998000 - 0xffffffff829a1000 is .text in /boot/modules/ttm.ko
	0xffffffff829a1000 - 0xffffffff829a2000 is .rodata in /boot/modules/ttm.ko
	0xffffffff829a2000 - 0xffffffff829a2eb0 is .bss in /boot/modules/ttm.ko
	0xffffffff829a2eb0 - 0xffffffff829a32e8 is .data in /boot/modules/ttm.ko
	0xffffffff829a32e8 - 0xffffffff829a3320 is set_sysctl_set in /boot/modules/ttm.ko
	0xffffffff829a3320 - 0xffffffff829a3350 is set_modmetadata_set in /boot/modules/ttm.ko
	0xffffffff829a3350 - 0xffffffff829a3358 is set_sysinit_set in /boot/modules/ttm.ko
	0xffffffff829a3358 - 0xffffffff829a3360 is set_sysuninit_set in /boot/modules/ttm.ko
	0xffffffff829a3360 - 0xffffffff829a3384 is .note.gnu.build-id in /boot/modules/ttm.ko
(kgdb) 

For reference:

$210 = {link = {tqe_next = 0xeef3f000e2c3f0, tqe_prev = 0xff54f000eef3f0}, container = 0x322ff0003287f0, name = 0xe987f000fea5f0 <error: Cannot access memory at address 0xe987f000fea5f0>,

shows up only after dereferencing through something like
200 prior nodes in the list.
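
(As an aside: instead of chaining link.tqe_next by hand, a rough
kgdb loop along these lines can do the walking. This is an untested
sketch that just reuses the expression and field names already shown
above; it will still print garbage once it reaches the corrupted
node, which is where to stop reading.)

(kgdb) set $m = found_modules->tqh_first
(kgdb) set $i = 0
(kgdb) while ($m != 0 && $i < 300)
 >print *$m
 >set $m = $m->link.tqe_next
 >set $i = $i + 1
 >end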

With that I'll stop this specific note.
Comment 233 Mark Millard 2024-12-15 20:11:00 UTC
(In reply to George Mitchell from comment #231)

Also: the build is based on the -p2 source code (hash 3f40d5821):

# strings boot/kernel/kernel | grep "\-RELEASE"
@(#)FreeBSD 13.4-RELEASE-p2 3f40d5821 M5P
FreeBSD 13.4-RELEASE-p2 3f40d5821 M5P
13.4-RELEASE-p2

Because it is a rebuild, the kernel ends up labeled -p2 instead
of the official -p1 (since -p2 did not update boot/kernel/kernel
in the official distributions).
Comment 234 Mark Millard 2024-12-15 21:08:11 UTC
(In reply to Mark Millard from comment #232)

I mistakenly wrote of a duplication in:

QUOTE
<mmhub_client_ids_vega20> "amdgpu_raven_rlc_bin_fw"
<xgpu_fiji_mgcg_cgcg_init+368> "amdgpu_raven_mec_bin_fw"
<se_mask+242> "amdgpu_raven_mec2_bin_fw"

are not the first ones to do so. Also note the duplication of "amdgpu_raven_mec2_bin_fw".
END QUOTE

amdgpu_raven_mec_bin_fw
vs.
amdgpu_raven_mec2_bin_fw

is not a duplication. Sorry.
Comment 235 Mark Millard 2024-12-15 22:01:19 UTC
For the 3-node sequence (a good node, then the last partially-good
node, and then just junk):

$208 = {link = {tqe_next = 0xfffff80004607a00, tqe_prev = 0xfffff8000465bc80}, container = 0xfffff80003868c00, name = 0xffffffff82e1e000 <xgpu_fiji_mgcg_cgcg_init+368> "amdgpu_raven_mec_bin_fw", 
  version = 1}
$209 = {link = {tqe_next = 0xfffff80000000007, tqe_prev = 0xfffff8000465bbc0}, container = 0xfffff80004b29600, name = 0xffffffff82e62026 <se_mask+242> "amdgpu_raven_mec2_bin_fw", version = 1}
$210 = {link = {tqe_next = 0xeef3f000e2c3f0, tqe_prev = 0xff54f000eef3f0}, container = 0x322ff0003287f0, name = 0xe987f000fea5f0 <error: Cannot access memory at address 0xe987f000fea5f0>, 
  version = 15660016}

it looks like the:

$209 = {link = {tqe_next = 0xfffff80000000007,

is the earliest example of (evidence of) corruption. The
address is outside of (smaller than) the start address of
the kernel:

Local exec file:
	`/usr/home/root/failing-kernel-files/boot/kernel/kernel', file type elf64-x86-64-freebsd.
	Entry point: 0xffffffff8038e000
	0xffffffff802002a8 - 0xffffffff802002b5 is .interp

Having 0000000007 also looks odd.

However, the rest of that node:

tqe_prev = 0xfffff8000465bbc0}, container = 0xfffff80004b29600, name = 0xffffffff82e62026 <se_mask+242> "amdgpu_raven_mec2_bin_fw", version = 1}

does not appear to have any obvious problems with its content. The
contents of the container are shown as:

$214 = {ops = 0xfffff80003164000, refs = 1, userrefs = 0, flags = 1, link = {tqe_next = 0xfffff8000469ed80, tqe_prev = 0xfffff80003868c18}, filename = 0xfffff80004b22120 "amdgpu_raven_mec2_bin.ko", 
  pathname = 0xfffff80004607a40 "/boot/modules/amdgpu_raven_mec2_bin.ko", id = 20, address = 0xffffffff82e61000 <link_enc_regs+1520> "\203\376\001tL\270\026", size = 276456, ctors_addr = 0x0, 
  ctors_size = 0, dtors_addr = 0x0, dtors_size = 0, ndeps = 3, deps = 0xfffff80004b220e0, common = {stqh_first = 0x0, stqh_last = 0xfffff80004b29680}, modules = {tqh_first = 0xfffff80004b1ff00, 
    tqh_last = 0xfffff80004b1ff10}, loaded = {tqe_next = 0x0, tqe_prev = 0x0}, loadcnt = 20, nenabled = 0, fbt_nentries = 0}

which also does not seem to have any obvious problems.

This type of vmcore.* does not provide threads, stack content, or
backtrace information. Nor is there any indication of exactly when
the tqe_next = 0xfffff80000000007 came to be.

It is also not obvious whether the list was longer before the
0xfffff80000000007 appeared.

There does not seem to be a way to tell whether the corrupted value
is because of "raven"-specific code vs. more general code. It would
be interesting to know whether an alternate card type has the problem
or not.

As for the raven context, getting vmcore.* captures that fail at a
different stage (such as the failure that mentioned acpi_wmi but did
not produce a vmcore.*) would help indicate whether the point in the
list where the corruption happens moves around (relative to other
content).
Comment 236 George Mitchell 2024-12-15 23:00:06 UTC
Mark, thank you sincerely for your help in tracking this down.  I have temporarily rearranged my startup script to increase my chance of getting more crashes (it seems loading amdgpu AFTER zfs, acpi_wmi, and vboxnetflt makes the crash more likely, so of late I have been loading amdgpu FIRST so I can get more work done), and if that occurs, I will put some more vmcores in the directory I told you about earlier today.  Your assistance is GREATLY appreciated!
Comment 237 Mark Millard 2024-12-15 23:17:10 UTC
(In reply to Mark Millard from comment #235)

Old comments that reference one or both of:

0xFFFFF80000000000 (also  known as 18446735277616529408)
0xFFFFF80000000007

comment #44
comment #94
comment #148

Example from 44 (that 94 references):

#8  vtozoneslab (va=18446735277616529408, zone=<optimized out>, 
    slab=<optimized out>) at /usr/src/sys/vm/uma_int.h:635
#9  free (addr=0xfffff80000000007, mtp=0xffffffff824332b0 <M_SOLARIS>)
    at /usr/src/sys/kern/kern_malloc.c:911
#10 0xffffffff8214d251 in nv_mem_free (nvp=<optimized out>, 
    buf=0xfffff80000000007, size=16688648)
    at /usr/src/sys/contrib/openzfs/module/nvpair/nvpair.c:216

Example from 148 (an nfsd process context):

#7  0xffffffff80c895cb in atomic_fcmpset_long (src=18446741877726026240, 
    dst=<optimized out>, expect=<optimized out>)
    at /usr/src/sys/amd64/include/atomic.h:225
#8  selfdfree (stp=stp@entry=0xfffff80012aa8080, sfp=0xfffff80000000007)
    at /usr/src/sys/kern/sys_generic.c:1755
#9  0xffffffff80c8866b in seltdclear (td=td@entry=0xfffffe00b52e9a00)
    at /usr/src/sys/kern/sys_generic.c:1967

[I'll note that 18446741877726026240 = 0xFFFFFE00B52E9A00, which is
likely the result of dereferencing something based on the
0xfffff80000000007 in some way.]

The history suggests that 0xfffff80000000007 (or 0xfffff80000000000)
corruption is not limited to a specific place.
Comment 238 Mark Millard 2024-12-15 23:24:26 UTC
(In reply to Mark Millard from comment #237)

I should have noted: the #44 , #94 , #148 comments material
are not tied to the found_modules->tqh_first->. . . list
as far as I can tell.
Comment 239 Mark Millard 2024-12-16 15:40:00 UTC
In:

https://lists.freebsd.org/archives/freebsd-hackers/2024-December/004100.html

Philipp writes:

QUOTE
By simple grep through sys/ I found following comment in sys/amd64/include/vmparam.h:

> /*
>  * Virtual addresses of things.  Derived from the page directory and
>  * page table indexes from pmap.h for precision.
> [...]
>  * 0xfffff80000000000 - 0xfffffbffffffffff   4TB direct map

The direct map is 4TB of virtual address space mapping the physical
address space 1:1 (minus the base). So I would guess this is caused by
a NULL pointer converted by PHYS_TO_DMAP.
END QUOTE

So either:

PHYS_TO_DMAP(0x0)+7
or:
PHYS_TO_DMAP(0x0+7)

looks likely to be involved for 0xfffff80000000007 showing up in:

$209 = {link = {tqe_next = 0xfffff80000000007,
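
To make the arithmetic concrete, here is a minimal user-space C
sketch (not the kernel's actual macro; the direct-map base value is
assumed from the vmparam.h excerpt quoted above):

#include <inttypes.h>
#include <stdio.h>

/* Assumed amd64 direct-map base, per the quoted vmparam.h comment. */
#define DMAP_BASE UINT64_C(0xfffff80000000000)
/* Simplified stand-in for PHYS_TO_DMAP(): physical address plus DMAP base. */
#define PHYS_TO_DMAP(pa) (DMAP_BASE + (uint64_t)(pa))

int
main(void)
{
	/* Either interpretation yields the observed corrupted pointer value. */
	printf("PHYS_TO_DMAP(0x0) + 7 = 0x%" PRIx64 "\n", PHYS_TO_DMAP(0x0) + 7);
	printf("PHYS_TO_DMAP(0x7)     = 0x%" PRIx64 "\n", PHYS_TO_DMAP(0x7));
	return (0);
}

Both lines print 0xfffff80000000007, i.e. the corrupted tqe_next value.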
Comment 240 Mark Millard 2024-12-16 17:19:49 UTC
(In reply to Mark Millard from comment #237)

Interestingly, the traceback from comment #148 involves
a different list:

/*
 * Remove the references to the thread from all of the objects we were
 * polling.
 */
static void
seltdclear(struct thread *td)
{
        struct seltd *stp;
        struct selfd *sfp;
        struct selfd *sfn;

        stp = td->td_sel;
        STAILQ_FOREACH_SAFE(sfp, &stp->st_selq, sf_link, sfn)
                selfdfree(stp, sfp);
        stp->st_flags = 0;
}

It was an sfp value that ended up being reported as: 0xfffff80000000007
Comment 241 Mark Millard 2024-12-16 19:17:42 UTC
One of the older ("obsolete") crash dump reports is
for:

/*
 *      free:
 * 
 *      Free a block of memory allocated by malloc.
 * 
 *      This routine may not block.
 */
void
free(void *addr, struct malloc_type *mtp)
{
        uma_zone_t zone;
        uma_slab_t slab;
        u_long size;
 
#ifdef MALLOC_DEBUG
        if (free_dbg(&addr, mtp) != 0)
                return;
#endif
        /* free(NULL, ...) does nothing */
        if (addr == NULL)
                return;
 
        vtozoneslab((vm_offset_t)addr & (~UMA_SLAB_MASK), &zone, &slab);
. . .

where addr ended up being 0xfffff80000000007, in other words
PHYS_TO_DMAP(0x7). The (vm_offset_t)addr & (~UMA_SLAB_MASK)
turned it into 0xfffff80000000000 for vtozoneslab, which in
turn reported a failure.
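
(As a side note, the masking step is plain bit arithmetic; a tiny
sketch, assuming UMA_SLAB_MASK is PAGE_SIZE - 1 with 4 KiB pages,
which matches the values seen here:)

#include <inttypes.h>
#include <stdio.h>

#define PAGE_SIZE     UINT64_C(4096)      /* assumed amd64 page size */
#define UMA_SLAB_MASK (PAGE_SIZE - 1)     /* assumption mirroring uma_int.h */

int
main(void)
{
	uint64_t addr = UINT64_C(0xfffff80000000007);

	/* Clearing the low bits reproduces the value vtozoneslab() saw. */
	printf("0x%" PRIx64 "\n", addr & ~UMA_SLAB_MASK);
	return (0);
}

This prints 0xfffff80000000000.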

The presence of a NULL check in the kernel's free suggests to me that
the kernel's free may not be intended to handle DMAP addresses.
Similarly for other kernel code that checks against NULL but not
against PHYS_TO_DMAP(NULL).

How does one tell where DMAP addresses should not appear when
looking around via kgdb?
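
One thing that can be checked mechanically is how far a suspect
pointer sits above the direct-map base; a value within a page or so
of 0xfffff80000000000 corresponds to PHYS_TO_DMAP of (nearly) NULL,
which a legitimate kernel pointer should not look like. An untested
kgdb sketch:

(kgdb) print/x 0xfffff80000000007 - 0xfffff80000000000

The tiny result (0x7 here) is the give-away. Whether a larger DMAP
address is legitimate at a given spot still depends on what the
field is supposed to hold, of course.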
Comment 242 George Mitchell 2024-12-17 14:38:24 UTC
Mark, I'm extremely grateful for all your recent disassemblies, but don't you need some older kernel/kernel.debug files to match up with those older files?  I'm in the process of extracting them from my old backups ...
Comment 243 George Mitchell 2024-12-17 14:46:50 UTC
Argh, it's worse than I thought -- there's -p2, -p5, etc. etc.
Comment 244 Mark Millard 2024-12-17 15:30:28 UTC
(In reply to George Mitchell from comment #243)

I got what I reported for the obsolete materials via the
attachments. I can get to the source via git.

There are no vmcore.* 's that I'm aware of.

But it still allows seeing that the pointer value
0xfffff80000000007 was showing up in various places,
just based on the kgdb backtrace reports.

I would not worry about providing pre-13.4-RELEASE-p1
vmcore.* files or related kernel or kernel.debug files.
Comment 245 George Mitchell 2024-12-17 17:33:19 UTC
Thanks, Mark; that's a relief!
Comment 246 George Mitchell 2024-12-17 17:34:09 UTC
Meanwhile, I haven't been able to cause any new crashes . . . I'll have to try something different.
Comment 247 Mark Millard 2024-12-17 23:21:43 UTC
(In reply to Mark Millard from comment #235)

I found another context issue that might eventually prove
to be of interest for the vmcore.8 that I've got a copy of.

First remember:

(kgdb) print *found_modules->tqh_first->link.tqe_next->. . .->link.tqe_next
$1 = {link = {tqe_next = 0xfffff80000000007, tqe_prev = 0xfffff8000465bbc0}, container = 0xfffff80004b29600, name = 0xffffffff82e62026 "amdgpu_raven_mec2_bin_fw", version = 1}

Note that amdgpu_raven_mec2_bin_fw is the last name. Now:

(kgdb) info sharedlibrary 
From                To                  Syms Read   Shared Object Library
0xffffffff82545000  0xffffffff82552000  Yes         ./boot/kernel/fusefs.ko
0xffffffff8256d000  0xffffffff8256f000  Yes         ./boot/kernel/sem.ko
                                        No          /boot/modules/if_re.ko
                                        No          /boot/modules/amdgpu.ko
                                        No          /boot/modules/drm.ko
0xffffffff8298a000  0xffffffff8298b000  Yes         ./boot/kernel/iic.ko
                                        No          /boot/modules/linuxkpi_gplv2.ko
                                        No          /boot/modules/dmabuf.ko
                                        No          /boot/modules/ttm.ko
                                        No          /boot/modules/amdgpu_raven_gpu_info_bin.ko
                                        No          /boot/modules/amdgpu_raven_sdma_bin.ko
                                        No          /boot/modules/amdgpu_raven_asd_bin.ko
                                        No          /boot/modules/amdgpu_raven_ta_bin.ko
                                        No          /boot/modules/amdgpu_raven_pfp_bin.ko
                                        No          /boot/modules/amdgpu_raven_me_bin.ko
                                        No          /boot/modules/amdgpu_raven_ce_bin.ko
                                        No          /boot/modules/amdgpu_raven_rlc_bin.ko
                                        No          /boot/modules/amdgpu_raven_mec_bin.ko
                                        No          /boot/modules/amdgpu_raven_mec2_bin.ko
                                        No          /boot/modules/amdgpu_raven_vcn_bin.ko
0xffffffff83000000  0xffffffff8324c000  Yes         ./boot/kernel/zfs.ko

So neither amdgpu_raven_vcn_bin.ko nor zfs.ko is in the found_modules
list before the failing point in the list -- but the failure was not
seen until the activity associated with zfs.ko's load attempt.

Note: So far, I'm operating without copies of /boot/modules/*.ko or any
debug information for them. My guess is that port builds do not normally
generate debug information for /boot/modules/*.ko files, so only the
public symbols used for linking might show up for them.

I had previously (and carelessly) worked in a way that referenced some
files from my live system. But those were not from the same vintage of
drm-*-kmod and related.

For reference:

(kgdb) info files
Symbols from "/usr/home/root/failing-13.4Rp2-kp1-freebsd-kernel-files/usr/lib/debug/boot/kernel/kernel.debug".
Kernel core dump file:
	`/usr/home/root/failing-13.4Rp2-kp1-freebsd-kernel-files/vmcore.8', file type FreeBSD kernel vmcore.
Local exec file:
	`/usr/home/root/failing-13.4Rp2-kp1-freebsd-kernel-files/boot/kernel/kernel', file type elf64-x86-64-freebsd.
	Entry point: 0xffffffff8038e000
	0xffffffff802002a8 - 0xffffffff802002b5 is .interp
	0xffffffff802002b8 - 0xffffffff80231108 is .hash
	0xffffffff80231108 - 0xffffffff8025f9e4 is .gnu.hash
	0xffffffff8025f9e8 - 0xffffffff802f24c0 is .dynsym
	0xffffffff802f24c0 - 0xffffffff8036d162 is .dynstr
	0xffffffff8036d168 - 0xffffffff8038db08 is .rela.dyn
	0xffffffff8038e000 - 0xffffffff811843f8 is .text
	0xffffffff81184400 - 0xffffffff817f68d0 is .rodata
	0xffffffff817f68d0 - 0xffffffff817fba38 is set_sysctl_set
	0xffffffff817fba38 - 0xffffffff817fef60 is set_modmetadata_set
	0xffffffff817fef60 - 0xffffffff817fefb8 is set_cam_xpt_xport_set
	0xffffffff817fefb8 - 0xffffffff817fefe0 is set_cam_xpt_proto_set
	0xffffffff817fefe0 - 0xffffffff817ff028 is set_ah_chips
	0xffffffff817ff028 - 0xffffffff817ff078 is set_ah_rfs
	0xffffffff817ff078 - 0xffffffff817ff098 is set_kbddriver_set
	0xffffffff817ff098 - 0xffffffff817ff150 is set_sdt_providers_set
	0xffffffff817ff150 - 0xffffffff81800268 is set_sdt_probes_set
	0xffffffff81800268 - 0xffffffff818035c8 is set_sdt_argtypes_set
	0xffffffff818035c8 - 0xffffffff818035e0 is set_scterm_set
	0xffffffff818035e0 - 0xffffffff81803608 is set_cons_set
	0xffffffff81803608 - 0xffffffff81803610 is set_uart_acpi_class_and_device_set
	0xffffffff81803620 - 0xffffffff81803660 is usb_host_id
	0xffffffff81803660 - 0xffffffff81803680 is set_vt_drv_set
	0xffffffff81803680 - 0xffffffff818036a8 is set_elf64_regset
	0xffffffff818036a8 - 0xffffffff818036d8 is set_elf32_regset
	0xffffffff818036d8 - 0xffffffff818036e8 is set_compressors
	0xffffffff818036e8 - 0xffffffff818036f0 is set_kdb_dbbe_set
	0xffffffff818036f0 - 0xffffffff81803700 is set_ratectl_set
	0xffffffff81803700 - 0xffffffff81803718 is set_crypto_set
	0xffffffff81803718 - 0xffffffff81803730 is set_ieee80211_ioctl_getset
	0xffffffff81803730 - 0xffffffff81803748 is set_ieee80211_ioctl_setset
	0xffffffff81803748 - 0xffffffff81803770 is set_scanner_set
	0xffffffff81803770 - 0xffffffff81803790 is set_videodriver_set
	0xffffffff81803790 - 0xffffffff818037d8 is set_scrndr_set
	0xffffffff818037d8 - 0xffffffff81803820 is set_vga_set
	0xffffffff81803820 - 0xffffffff81804881 is kern_conf
	0xffffffff81804884 - 0xffffffff818048a8 is .note.gnu.build-id
	0xffffffff818048a8 - 0xffffffff8180493c is .eh_frame
	0xffffffff81a00000 - 0xffffffff81a00140 is .dynamic
	0xffffffff81a00140 - 0xffffffff81a01000 is .relro_padding
	0xffffffff81c00000 - 0xffffffff81c00035 is .data.read_frequently
	0xffffffff81c00040 - 0xffffffff81c017f4 is .data.read_mostly
	0xffffffff81c01800 - 0xffffffff81c07680 is .data.exclusive_cache_line
	0xffffffff81c08000 - 0xffffffff81d51248 is .data
	0xffffffff81d51248 - 0xffffffff81d54688 is set_sysinit_set
	0xffffffff81d54688 - 0xffffffff81d55e48 is set_sysuninit_set
	0xffffffff81d55e80 - 0xffffffff81d592e8 is set_pcpu
	0xffffffff81d592f0 - 0xffffffff81d82851 is set_vnet
	0xffffffff81d82880 - 0xffffffff82200000 is .bss
	0xffffffff82545000 - 0xffffffff82552000 is .text in ./boot/kernel/fusefs.ko
	0xffffffff82552000 - 0xffffffff82554000 is .rodata in ./boot/kernel/fusefs.ko
	0xffffffff82554000 - 0xffffffff82556874 is .data in ./boot/kernel/fusefs.ko
	0xffffffff82556878 - 0xffffffff82556970 is set_sdt_probes_set in ./boot/kernel/fusefs.ko
--Type <RET> for more, q to quit, c to continue without paging--
	0xffffffff82556970 - 0xffffffff82556ba0 is set_sdt_argtypes_set in ./boot/kernel/fusefs.ko
	0xffffffff82556ba0 - 0xffffffff82556bd8 is set_sysinit_set in ./boot/kernel/fusefs.ko
	0xffffffff82556bd8 - 0xffffffff82556bf8 is set_sysuninit_set in ./boot/kernel/fusefs.ko
	0xffffffff82556bf8 - 0xffffffff82556c60 is set_sysctl_set in ./boot/kernel/fusefs.ko
	0xffffffff82556c60 - 0xffffffff82556cc0 is .bss in ./boot/kernel/fusefs.ko
	0xffffffff82556cc0 - 0xffffffff82556cc8 is set_sdt_providers_set in ./boot/kernel/fusefs.ko
	0xffffffff82556cc8 - 0xffffffff82556ce0 is set_modmetadata_set in ./boot/kernel/fusefs.ko
	0xffffffff82556ce0 - 0xffffffff82556d04 is .note.gnu.build-id in ./boot/kernel/fusefs.ko
	0xffffffff8256d000 - 0xffffffff8256f000 is .text in ./boot/kernel/sem.ko
	0xffffffff8256f000 - 0xffffffff82570000 is .rodata in ./boot/kernel/sem.ko
	0xffffffff82570000 - 0xffffffff8257095c is .data in ./boot/kernel/sem.ko
	0xffffffff82570960 - 0xffffffff82570978 is set_sysctl_set in ./boot/kernel/sem.ko
	0xffffffff82570978 - 0xffffffff82570988 is set_sysinit_set in ./boot/kernel/sem.ko
	0xffffffff82570988 - 0xffffffff82570990 is set_sysuninit_set in ./boot/kernel/sem.ko
	0xffffffff82570990 - 0xffffffff82570a10 is .bss in ./boot/kernel/sem.ko
	0xffffffff82570a10 - 0xffffffff82570a28 is set_modmetadata_set in ./boot/kernel/sem.ko
	0xffffffff82570a28 - 0xffffffff82570a4c is .note.gnu.build-id in ./boot/kernel/sem.ko
	0xffffffff8298a000 - 0xffffffff8298b000 is .text in ./boot/kernel/iic.ko
	0xffffffff8298b000 - 0xffffffff8298c000 is .rodata in ./boot/kernel/iic.ko
	0xffffffff8298c000 - 0xffffffff8298c270 is .data in ./boot/kernel/iic.ko
	0xffffffff8298c270 - 0xffffffff8298c280 is set_sysinit_set in ./boot/kernel/iic.ko
	0xffffffff8298c280 - 0xffffffff8298c288 is set_sysuninit_set in ./boot/kernel/iic.ko
	0xffffffff8298c288 - 0xffffffff8298c2a8 is set_modmetadata_set in ./boot/kernel/iic.ko
	0xffffffff8298c2a8 - 0xffffffff8298c2b0 is .bss in ./boot/kernel/iic.ko
	0xffffffff8298c2b0 - 0xffffffff8298c2d4 is .note.gnu.build-id in ./boot/kernel/iic.ko
	0xffffffff83000000 - 0xffffffff8324c000 is .text in ./boot/kernel/zfs.ko
	0xffffffff8324c000 - 0xffffffff832dc000 is .rodata in ./boot/kernel/zfs.ko
	0xffffffff832dc000 - 0xffffffff832fe228 is .data in ./boot/kernel/zfs.ko
	0xffffffff832fe228 - 0xffffffff832fe318 is set_sysinit_set in ./boot/kernel/zfs.ko
	0xffffffff832fe318 - 0xffffffff832fe398 is set_sysuninit_set in ./boot/kernel/zfs.ko
	0xffffffff832fe400 - 0xffffffff833b79c8 is .bss in ./boot/kernel/zfs.ko
	0xffffffff833b79c8 - 0xffffffff833b85e8 is set_sysctl_set in ./boot/kernel/zfs.ko
	0xffffffff833b85e8 - 0xffffffff833b8810 is set_sdt_probes_set in ./boot/kernel/zfs.ko
	0xffffffff833b8810 - 0xffffffff833b8c30 is set_sdt_argtypes_set in ./boot/kernel/zfs.ko
	0xffffffff833b8c30 - 0xffffffff833b8c98 is set_modmetadata_set in ./boot/kernel/zfs.ko
	0xffffffff833b8c98 - 0xffffffff833b8cbc is .note.gnu.build-id in ./boot/kernel/zfs.ko
(kgdb)
Comment 248 George Mitchell 2024-12-17 23:54:29 UTC
Would a debug build of drm-510-kmod be of use here?
Comment 249 Mark Millard 2024-12-18 00:41:09 UTC
(In reply to George Mitchell from comment #248)

A debug build of drm-510-kmod under 13.4-RELEASE would need
to match up with vmcore.* examples that used the debug build.

A non-debug build of drm-510-kmod under 13.4-RELEASE would need
to match up with vmcore.* examples that used the non-debug build.

So it is more driven by the vmcore.* content than anything for
debug vs. non-debug: which was in use?

The only benefit I know of from having the matching *.ko
files for a vmcore.* is that "info files" would likely show
the (correct) address ranges for the sections of the
additional *.ko files that were loaded but are not
available in what I now have.
Comment 250 Mark Millard 2024-12-18 04:10:21 UTC
Which did you use for graphics/gpu-firmware-amd-kmod@raven :

latest?
( updated 2024-Dec-14: graphics/gpu-firmware-amd-kmod@raven )
( possibly after the vmcore.8 context )

quarterly?

So I'm not sure if all the boot/modules/*.ko that I now have
are what they should be to match vmcore.8 .
Comment 251 Mark Millard 2024-12-18 06:06:46 UTC
(In reply to Mark Millard from comment #250)

Sorry for the mistaken reference. Actually:

2024-Dec-12 source commit
2024-Dec-14 FreeBSD Package distribution

And it is unclear how far back in the commit sequence the most
recent actual change to raven is.
Comment 252 Mark Millard 2024-12-18 06:38:44 UTC
(In reply to Mark Millard from comment #247)

(Looks like you get if_re.ko via /boot/modules/ as well.)

For reference:

(kgdb) info sharedlibrary 
From                To                  Syms Read   Shared Object Library
0xffffffff82545000  0xffffffff82552000  Yes         ./boot/kernel/fusefs.ko
0xffffffff8256d000  0xffffffff8256f000  Yes         ./boot/kernel/sem.ko
                                        No          /boot/modules/if_re.ko
0xffffffff82a00000  0xffffffff82cf5000  Yes (*)     ./boot/modules/amdgpu.ko
0xffffffff82918000  0xffffffff8296d000  Yes (*)     ./boot/modules/drm.ko
0xffffffff8298a000  0xffffffff8298b000  Yes         ./boot/kernel/iic.ko
0xffffffff8298d000  0xffffffff8298f000  Yes (*)     ./boot/modules/linuxkpi_gplv2.ko
0xffffffff82991000  0xffffffff82996000  Yes (*)     ./boot/modules/dmabuf.ko
0xffffffff82998000  0xffffffff829a2000  Yes (*)     ./boot/modules/ttm.ko
0xffffffff829a5000  0xffffffff829a6000  Yes (*)     ./boot/modules/amdgpu_raven_gpu_info_bin.ko
0xffffffff829a8000  0xffffffff829a9000  Yes (*)     ./boot/modules/amdgpu_raven_sdma_bin.ko
0xffffffff829af000  0xffffffff829b0000  Yes (*)     ./boot/modules/amdgpu_raven_asd_bin.ko
0xffffffff829de000  0xffffffff829df000  Yes (*)     ./boot/modules/amdgpu_raven_ta_bin.ko
0xffffffff829e8000  0xffffffff829e9000  Yes (*)     ./boot/modules/amdgpu_raven_pfp_bin.ko
0xffffffff829f0000  0xffffffff829f1000  Yes (*)     ./boot/modules/amdgpu_raven_me_bin.ko
0xffffffff829f7000  0xffffffff829f8000  Yes (*)     ./boot/modules/amdgpu_raven_ce_bin.ko
0xffffffff82e11000  0xffffffff82e12000  Yes (*)     ./boot/modules/amdgpu_raven_rlc_bin.ko
0xffffffff82e1d000  0xffffffff82e1e000  Yes (*)     ./boot/modules/amdgpu_raven_mec_bin.ko
0xffffffff82e61000  0xffffffff82e62000  Yes (*)     ./boot/modules/amdgpu_raven_mec2_bin.ko
0xffffffff82ea5000  0xffffffff82ea6000  Yes (*)     ./boot/modules/amdgpu_raven_vcn_bin.ko
0xffffffff83000000  0xffffffff8324c000  Yes         ./boot/kernel/zfs.ko


(kgdb) info file
Symbols from "/usr/home/root/failing-13.4Rp2-kp1-freebsd-kernel-files/usr/lib/debug/boot/kernel/kernel.debug".
Kernel core dump file:
	`/usr/home/root/failing-13.4Rp2-kp1-freebsd-kernel-files/vmcore.8', file type FreeBSD kernel vmcore.
Local exec file:
	`/usr/home/root/failing-13.4Rp2-kp1-freebsd-kernel-files/boot/kernel/kernel', file type elf64-x86-64-freebsd.
	Entry point: 0xffffffff8038e000
	0xffffffff802002a8 - 0xffffffff802002b5 is .interp
	0xffffffff802002b8 - 0xffffffff80231108 is .hash
	0xffffffff80231108 - 0xffffffff8025f9e4 is .gnu.hash
	0xffffffff8025f9e8 - 0xffffffff802f24c0 is .dynsym
	0xffffffff802f24c0 - 0xffffffff8036d162 is .dynstr
	0xffffffff8036d168 - 0xffffffff8038db08 is .rela.dyn
	0xffffffff8038e000 - 0xffffffff811843f8 is .text
	0xffffffff81184400 - 0xffffffff817f68d0 is .rodata
	0xffffffff817f68d0 - 0xffffffff817fba38 is set_sysctl_set
	0xffffffff817fba38 - 0xffffffff817fef60 is set_modmetadata_set
	0xffffffff817fef60 - 0xffffffff817fefb8 is set_cam_xpt_xport_set
	0xffffffff817fefb8 - 0xffffffff817fefe0 is set_cam_xpt_proto_set
	0xffffffff817fefe0 - 0xffffffff817ff028 is set_ah_chips
	0xffffffff817ff028 - 0xffffffff817ff078 is set_ah_rfs
	0xffffffff817ff078 - 0xffffffff817ff098 is set_kbddriver_set
	0xffffffff817ff098 - 0xffffffff817ff150 is set_sdt_providers_set
	0xffffffff817ff150 - 0xffffffff81800268 is set_sdt_probes_set
	0xffffffff81800268 - 0xffffffff818035c8 is set_sdt_argtypes_set
	0xffffffff818035c8 - 0xffffffff818035e0 is set_scterm_set
	0xffffffff818035e0 - 0xffffffff81803608 is set_cons_set
	0xffffffff81803608 - 0xffffffff81803610 is set_uart_acpi_class_and_device_set
	0xffffffff81803620 - 0xffffffff81803660 is usb_host_id
	0xffffffff81803660 - 0xffffffff81803680 is set_vt_drv_set
	0xffffffff81803680 - 0xffffffff818036a8 is set_elf64_regset
	0xffffffff818036a8 - 0xffffffff818036d8 is set_elf32_regset
	0xffffffff818036d8 - 0xffffffff818036e8 is set_compressors
	0xffffffff818036e8 - 0xffffffff818036f0 is set_kdb_dbbe_set
	0xffffffff818036f0 - 0xffffffff81803700 is set_ratectl_set
	0xffffffff81803700 - 0xffffffff81803718 is set_crypto_set
	0xffffffff81803718 - 0xffffffff81803730 is set_ieee80211_ioctl_getset
	0xffffffff81803730 - 0xffffffff81803748 is set_ieee80211_ioctl_setset
	0xffffffff81803748 - 0xffffffff81803770 is set_scanner_set
	0xffffffff81803770 - 0xffffffff81803790 is set_videodriver_set
	0xffffffff81803790 - 0xffffffff818037d8 is set_scrndr_set
	0xffffffff818037d8 - 0xffffffff81803820 is set_vga_set
	0xffffffff81803820 - 0xffffffff81804881 is kern_conf
	0xffffffff81804884 - 0xffffffff818048a8 is .note.gnu.build-id
	0xffffffff818048a8 - 0xffffffff8180493c is .eh_frame
	0xffffffff81a00000 - 0xffffffff81a00140 is .dynamic
	0xffffffff81a00140 - 0xffffffff81a01000 is .relro_padding
	0xffffffff81c00000 - 0xffffffff81c00035 is .data.read_frequently
	0xffffffff81c00040 - 0xffffffff81c017f4 is .data.read_mostly
	0xffffffff81c01800 - 0xffffffff81c07680 is .data.exclusive_cache_line
	0xffffffff81c08000 - 0xffffffff81d51248 is .data
	0xffffffff81d51248 - 0xffffffff81d54688 is set_sysinit_set
	0xffffffff81d54688 - 0xffffffff81d55e48 is set_sysuninit_set
	0xffffffff81d55e80 - 0xffffffff81d592e8 is set_pcpu
	0xffffffff81d592f0 - 0xffffffff81d82851 is set_vnet
	0xffffffff81d82880 - 0xffffffff82200000 is .bss
	0xffffffff82545000 - 0xffffffff82552000 is .text in ./boot/kernel/fusefs.ko
	0xffffffff82552000 - 0xffffffff82554000 is .rodata in ./boot/kernel/fusefs.ko
	0xffffffff82554000 - 0xffffffff82556874 is .data in ./boot/kernel/fusefs.ko
	0xffffffff82556878 - 0xffffffff82556970 is set_sdt_probes_set in ./boot/kernel/fusefs.ko
--Type <RET> for more, q to quit, c to continue without paging--
	0xffffffff82556970 - 0xffffffff82556ba0 is set_sdt_argtypes_set in ./boot/kernel/fusefs.ko
	0xffffffff82556ba0 - 0xffffffff82556bd8 is set_sysinit_set in ./boot/kernel/fusefs.ko
	0xffffffff82556bd8 - 0xffffffff82556bf8 is set_sysuninit_set in ./boot/kernel/fusefs.ko
	0xffffffff82556bf8 - 0xffffffff82556c60 is set_sysctl_set in ./boot/kernel/fusefs.ko
	0xffffffff82556c60 - 0xffffffff82556cc0 is .bss in ./boot/kernel/fusefs.ko
	0xffffffff82556cc0 - 0xffffffff82556cc8 is set_sdt_providers_set in ./boot/kernel/fusefs.ko
	0xffffffff82556cc8 - 0xffffffff82556ce0 is set_modmetadata_set in ./boot/kernel/fusefs.ko
	0xffffffff82556ce0 - 0xffffffff82556d04 is .note.gnu.build-id in ./boot/kernel/fusefs.ko
	0xffffffff8256d000 - 0xffffffff8256f000 is .text in ./boot/kernel/sem.ko
	0xffffffff8256f000 - 0xffffffff82570000 is .rodata in ./boot/kernel/sem.ko
	0xffffffff82570000 - 0xffffffff8257095c is .data in ./boot/kernel/sem.ko
	0xffffffff82570960 - 0xffffffff82570978 is set_sysctl_set in ./boot/kernel/sem.ko
	0xffffffff82570978 - 0xffffffff82570988 is set_sysinit_set in ./boot/kernel/sem.ko
	0xffffffff82570988 - 0xffffffff82570990 is set_sysuninit_set in ./boot/kernel/sem.ko
	0xffffffff82570990 - 0xffffffff82570a10 is .bss in ./boot/kernel/sem.ko
	0xffffffff82570a10 - 0xffffffff82570a28 is set_modmetadata_set in ./boot/kernel/sem.ko
	0xffffffff82570a28 - 0xffffffff82570a4c is .note.gnu.build-id in ./boot/kernel/sem.ko
	0xffffffff82a00000 - 0xffffffff82cf5000 is .text in ./boot/modules/amdgpu.ko
	0xffffffff82cf5000 - 0xffffffff82dfc000 is .rodata in ./boot/modules/amdgpu.ko
	0xffffffff82dfc000 - 0xffffffff82e09378 is .bss in ./boot/modules/amdgpu.ko
	0xffffffff82e09380 - 0xffffffff82e11d74 is .data in ./boot/modules/amdgpu.ko
	0xffffffff82e11d78 - 0xffffffff82e12150 is set_sysctl_set in ./boot/modules/amdgpu.ko
	0xffffffff82e12150 - 0xffffffff82e12178 is set_sysinit_set in ./boot/modules/amdgpu.ko
	0xffffffff82e12178 - 0xffffffff82e12188 is set_sysuninit_set in ./boot/modules/amdgpu.ko
	0xffffffff82e12188 - 0xffffffff82e121e0 is set_modmetadata_set in ./boot/modules/amdgpu.ko
	0xffffffff82e121e0 - 0xffffffff82e12204 is .note.gnu.build-id in ./boot/modules/amdgpu.ko
	0xffffffff82918000 - 0xffffffff8296d000 is .text in ./boot/modules/drm.ko
	0xffffffff8296d000 - 0xffffffff82989000 is .rodata in ./boot/modules/drm.ko
	0xffffffff82989000 - 0xffffffff82989190 is .bss in ./boot/modules/drm.ko
	0xffffffff82989190 - 0xffffffff8298a9a8 is .data in ./boot/modules/drm.ko
	0xffffffff8298a9a8 - 0xffffffff8298aa20 is set_sysinit_set in ./boot/modules/drm.ko
	0xffffffff8298aa20 - 0xffffffff8298aa80 is set_sysuninit_set in ./boot/modules/drm.ko
	0xffffffff8298aa80 - 0xffffffff8298ab50 is set_sysctl_set in ./boot/modules/drm.ko
	0xffffffff8298ab50 - 0xffffffff8298ab5c is .data.read_mostly in ./boot/modules/drm.ko
	0xffffffff8298ab60 - 0xffffffff8298abd8 is set_modmetadata_set in ./boot/modules/drm.ko
	0xffffffff8298abd8 - 0xffffffff8298abfc is .note.gnu.build-id in ./boot/modules/drm.ko
	0xffffffff8298a000 - 0xffffffff8298b000 is .text in ./boot/kernel/iic.ko
	0xffffffff8298b000 - 0xffffffff8298c000 is .rodata in ./boot/kernel/iic.ko
	0xffffffff8298c000 - 0xffffffff8298c270 is .data in ./boot/kernel/iic.ko
	0xffffffff8298c270 - 0xffffffff8298c280 is set_sysinit_set in ./boot/kernel/iic.ko
	0xffffffff8298c280 - 0xffffffff8298c288 is set_sysuninit_set in ./boot/kernel/iic.ko
	0xffffffff8298c288 - 0xffffffff8298c2a8 is set_modmetadata_set in ./boot/kernel/iic.ko
	0xffffffff8298c2a8 - 0xffffffff8298c2b0 is .bss in ./boot/kernel/iic.ko
	0xffffffff8298c2b0 - 0xffffffff8298c2d4 is .note.gnu.build-id in ./boot/kernel/iic.ko
	0xffffffff8298d000 - 0xffffffff8298f000 is .text in ./boot/modules/linuxkpi_gplv2.ko
	0xffffffff8298f000 - 0xffffffff82990000 is .rodata in ./boot/modules/linuxkpi_gplv2.ko
	0xffffffff82990000 - 0xffffffff829900c8 is .data in ./boot/modules/linuxkpi_gplv2.ko
	0xffffffff829900c8 - 0xffffffff829900f0 is set_modmetadata_set in ./boot/modules/linuxkpi_gplv2.ko
	0xffffffff829900f0 - 0xffffffff829900f8 is set_sysinit_set in ./boot/modules/linuxkpi_gplv2.ko
	0xffffffff829900f8 - 0xffffffff829900fc is .bss in ./boot/modules/linuxkpi_gplv2.ko
	0xffffffff829900fc - 0xffffffff82990120 is .note.gnu.build-id in ./boot/modules/linuxkpi_gplv2.ko
	0xffffffff82991000 - 0xffffffff82996000 is .text in ./boot/modules/dmabuf.ko
	0xffffffff82996000 - 0xffffffff82997000 is .rodata in ./boot/modules/dmabuf.ko
	0xffffffff82997000 - 0xffffffff82997200 is .data in ./boot/modules/dmabuf.ko
	0xffffffff82997200 - 0xffffffff82997210 is set_modmetadata_set in ./boot/modules/dmabuf.ko
	0xffffffff82997210 - 0xffffffff82997228 is set_sysinit_set in ./boot/modules/dmabuf.ko
	0xffffffff82997228 - 0xffffffff82997240 is set_sysuninit_set in ./boot/modules/dmabuf.ko
	0xffffffff82997240 - 0xffffffff829972d8 is .bss in ./boot/modules/dmabuf.ko
	0xffffffff829972d8 - 0xffffffff829972fc is .note.gnu.build-id in ./boot/modules/dmabuf.ko
--Type <RET> for more, q to quit, c to continue without paging--
	0xffffffff82998000 - 0xffffffff829a2000 is .text in ./boot/modules/ttm.ko
	0xffffffff829a2000 - 0xffffffff829a3000 is .rodata in ./boot/modules/ttm.ko
	0xffffffff829a3000 - 0xffffffff829a3500 is .data in ./boot/modules/ttm.ko
	0xffffffff829a3500 - 0xffffffff829a3520 is set_sysinit_set in ./boot/modules/ttm.ko
	0xffffffff829a3520 - 0xffffffff829a3538 is set_sysuninit_set in ./boot/modules/ttm.ko
	0xffffffff829a3540 - 0xffffffff829a4720 is .bss in ./boot/modules/ttm.ko
	0xffffffff829a4720 - 0xffffffff829a4758 is set_modmetadata_set in ./boot/modules/ttm.ko
	0xffffffff829a4758 - 0xffffffff829a477c is .note.gnu.build-id in ./boot/modules/ttm.ko
	0xffffffff829a5000 - 0xffffffff829a6000 is .text in ./boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a6000 - 0xffffffff829a7000 is .rodata in ./boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a7000 - 0xffffffff829a713c is rodata in ./boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a7140 - 0xffffffff829a71f0 is .data in ./boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a71f0 - 0xffffffff829a7210 is set_modmetadata_set in ./boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a7210 - 0xffffffff829a7218 is set_sysinit_set in ./boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a7218 - 0xffffffff829a723c is .note.gnu.build-id in ./boot/modules/amdgpu_raven_gpu_info_bin.ko
	0xffffffff829a8000 - 0xffffffff829a9000 is .text in ./boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829a9000 - 0xffffffff829aa000 is .rodata in ./boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829aa000 - 0xffffffff829ae400 is rodata in ./boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829ae400 - 0xffffffff829ae4b0 is .data in ./boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829ae4b0 - 0xffffffff829ae4d0 is set_modmetadata_set in ./boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829ae4d0 - 0xffffffff829ae4d8 is set_sysinit_set in ./boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829ae4d8 - 0xffffffff829ae4fc is .note.gnu.build-id in ./boot/modules/amdgpu_raven_sdma_bin.ko
	0xffffffff829af000 - 0xffffffff829b0000 is .text in ./boot/modules/amdgpu_raven_asd_bin.ko
	0xffffffff829b0000 - 0xffffffff829b1000 is .rodata in ./boot/modules/amdgpu_raven_asd_bin.ko
	0xffffffff829b1000 - 0xffffffff829da200 is rodata in ./boot/modules/amdgpu_raven_asd_bin.ko
	0xffffffff829da200 - 0xffffffff829da2b0 is .data in ./boot/modules/amdgpu_raven_asd_bin.ko
	0xffffffff829da2b0 - 0xffffffff829da2d0 is set_modmetadata_set in ./boot/modules/amdgpu_raven_asd_bin.ko
	0xffffffff829da2d0 - 0xffffffff829da2d8 is set_sysinit_set in ./boot/modules/amdgpu_raven_asd_bin.ko
	0xffffffff829da2d8 - 0xffffffff829da2fc is .note.gnu.build-id in ./boot/modules/amdgpu_raven_asd_bin.ko
	0xffffffff829de000 - 0xffffffff829df000 is .text in ./boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829df000 - 0xffffffff829e0000 is .rodata in ./boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829e0000 - 0xffffffff829e8300 is rodata in ./boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829e8300 - 0xffffffff829e83b0 is .data in ./boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829e83b0 - 0xffffffff829e83d0 is set_modmetadata_set in ./boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829e83d0 - 0xffffffff829e83d8 is set_sysinit_set in ./boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829e83d8 - 0xffffffff829e83fc is .note.gnu.build-id in ./boot/modules/amdgpu_raven_ta_bin.ko
	0xffffffff829e8000 - 0xffffffff829e9000 is .text in ./boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829e9000 - 0xffffffff829ea000 is .rodata in ./boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829ea000 - 0xffffffff829ef480 is rodata in ./boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829ef480 - 0xffffffff829ef530 is .data in ./boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829ef530 - 0xffffffff829ef550 is set_modmetadata_set in ./boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829ef550 - 0xffffffff829ef558 is set_sysinit_set in ./boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829ef558 - 0xffffffff829ef57c is .note.gnu.build-id in ./boot/modules/amdgpu_raven_pfp_bin.ko
	0xffffffff829f0000 - 0xffffffff829f1000 is .text in ./boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f1000 - 0xffffffff829f2000 is .rodata in ./boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f2000 - 0xffffffff829f6480 is rodata in ./boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f6480 - 0xffffffff829f6530 is .data in ./boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f6530 - 0xffffffff829f6550 is set_modmetadata_set in ./boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f6550 - 0xffffffff829f6558 is set_sysinit_set in ./boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f6558 - 0xffffffff829f657c is .note.gnu.build-id in ./boot/modules/amdgpu_raven_me_bin.ko
	0xffffffff829f7000 - 0xffffffff829f8000 is .text in ./boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff829f8000 - 0xffffffff829f9000 is .rodata in ./boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff829f9000 - 0xffffffff829fb480 is rodata in ./boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff829fb480 - 0xffffffff829fb530 is .data in ./boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff829fb530 - 0xffffffff829fb550 is set_modmetadata_set in ./boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff829fb550 - 0xffffffff829fb558 is set_sysinit_set in ./boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff829fb558 - 0xffffffff829fb57c is .note.gnu.build-id in ./boot/modules/amdgpu_raven_ce_bin.ko
	0xffffffff82e11000 - 0xffffffff82e12000 is .text in ./boot/modules/amdgpu_raven_rlc_bin.ko
	0xffffffff82e12000 - 0xffffffff82e13000 is .rodata in ./boot/modules/amdgpu_raven_rlc_bin.ko
--Type <RET> for more, q to quit, c to continue without paging--
	0xffffffff82e13000 - 0xffffffff82e1c8e4 is rodata in ./boot/modules/amdgpu_raven_rlc_bin.ko
	0xffffffff82e1c8e8 - 0xffffffff82e1c998 is .data in ./boot/modules/amdgpu_raven_rlc_bin.ko
	0xffffffff82e1c998 - 0xffffffff82e1c9b8 is set_modmetadata_set in ./boot/modules/amdgpu_raven_rlc_bin.ko
	0xffffffff82e1c9b8 - 0xffffffff82e1c9c0 is set_sysinit_set in ./boot/modules/amdgpu_raven_rlc_bin.ko
	0xffffffff82e1c9c0 - 0xffffffff82e1c9e4 is .note.gnu.build-id in ./boot/modules/amdgpu_raven_rlc_bin.ko
	0xffffffff82e1d000 - 0xffffffff82e1e000 is .text in ./boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e1e000 - 0xffffffff82e1f000 is .rodata in ./boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e1f000 - 0xffffffff82e60710 is rodata in ./boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e60710 - 0xffffffff82e607c0 is .data in ./boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e607c0 - 0xffffffff82e607e0 is set_modmetadata_set in ./boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e607e0 - 0xffffffff82e607e8 is set_sysinit_set in ./boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e607e8 - 0xffffffff82e6080c is .note.gnu.build-id in ./boot/modules/amdgpu_raven_mec_bin.ko
	0xffffffff82e61000 - 0xffffffff82e62000 is .text in ./boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82e62000 - 0xffffffff82e63000 is .rodata in ./boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82e63000 - 0xffffffff82ea4710 is rodata in ./boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82ea4710 - 0xffffffff82ea47c0 is .data in ./boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82ea47c0 - 0xffffffff82ea47e0 is set_modmetadata_set in ./boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82ea47e0 - 0xffffffff82ea47e8 is set_sysinit_set in ./boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82ea47e8 - 0xffffffff82ea480c is .note.gnu.build-id in ./boot/modules/amdgpu_raven_mec2_bin.ko
	0xffffffff82ea5000 - 0xffffffff82ea6000 is .text in ./boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff82ea6000 - 0xffffffff82ea7000 is .rodata in ./boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff82ea7000 - 0xffffffff82f003e0 is rodata in ./boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff82f003e0 - 0xffffffff82f00490 is .data in ./boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff82f00490 - 0xffffffff82f004b0 is set_modmetadata_set in ./boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff82f004b0 - 0xffffffff82f004b8 is set_sysinit_set in ./boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff82f004b8 - 0xffffffff82f004dc is .note.gnu.build-id in ./boot/modules/amdgpu_raven_vcn_bin.ko
	0xffffffff83000000 - 0xffffffff8324c000 is .text in ./boot/kernel/zfs.ko
	0xffffffff8324c000 - 0xffffffff832dc000 is .rodata in ./boot/kernel/zfs.ko
	0xffffffff832dc000 - 0xffffffff832fe228 is .data in ./boot/kernel/zfs.ko
	0xffffffff832fe228 - 0xffffffff832fe318 is set_sysinit_set in ./boot/kernel/zfs.ko
	0xffffffff832fe318 - 0xffffffff832fe398 is set_sysuninit_set in ./boot/kernel/zfs.ko
	0xffffffff832fe400 - 0xffffffff833b79c8 is .bss in ./boot/kernel/zfs.ko
	0xffffffff833b79c8 - 0xffffffff833b85e8 is set_sysctl_set in ./boot/kernel/zfs.ko
	0xffffffff833b85e8 - 0xffffffff833b8810 is set_sdt_probes_set in ./boot/kernel/zfs.ko
	0xffffffff833b8810 - 0xffffffff833b8c30 is set_sdt_argtypes_set in ./boot/kernel/zfs.ko
	0xffffffff833b8c30 - 0xffffffff833b8c98 is set_modmetadata_set in ./boot/kernel/zfs.ko
	0xffffffff833b8c98 - 0xffffffff833b8cbc is .note.gnu.build-id in ./boot/kernel/zfs.ko
Comment 253 George Mitchell 2024-12-18 19:01:27 UTC
(Answering a bunch of comments.)

I'm using non-debug builds of all ports.

I'm using locally built ports, but I will probably be phasing that out.

All the gpu-firmware-amd-kmods are version 20220511, and drm-510-kmod-5.10.163_10.  I am using realtek-re-kmod-197.00, locally built.  My /boot/loader.conf says:

if_re_load="YES"
if_re_name="/boot/modules/if_re.ko"
loader_logo="beastie"
sem_load="YES"
hw.vga.textmode="1"
hw.syscons.disable="1"
fusefs_load="YES"
dumpdev="/dev/ada0p3"

It looks like there's a version 20230625 for graphics/gpu-firmware-amd-kmod; I guess I should update that.  Hmmm - now I'm thoroughly confused, because I typed "portmaster -BDg graphics/gpu-firmware-amd-kmod" and it seems to have compiled the "aldebaran" flavor (version 20230625) while leaving everything else (meaning 85 other installed packages) alone (version 20220511).
Comment 254 Mark Millard 2024-12-18 20:47:25 UTC
(In reply to George Mitchell from comment #253)

QUOTE
Hmmm - now I'm thoroughly confused, because I typed "portmaster -BDg graphics/gpu-firmware-amd-kmod" and it seems to have compiled the "aldebaran" flavor (version 20230625) while leaving everything else (meaning 85 other installed packages) alone (version 20220511).
END QUOTE

The default flavor is the first in the list unless extra work was done
to control it in the port:

PKGNAMESUFFIX=  -${FLAVOR:C/_/-/g}
FLAVORS=        aldebaran \
                arcturus \
                banks \
                beige_goby \
. . .

You implicitly requested that only the aldebaran flavor
be built.

You probably meant something like (at least poudriere has
such @all support):

portmaster -BDg graphics/gpu-firmware-amd-kmod@all

but looking at the man page I see no hint of portmaster
supporting use of @all . Maybe the ports Makefile
handling does that automatically for @all use? That
might make it work with portmaster as well.

In your case you likely could use:

portmaster -BDg graphics/gpu-firmware-amd-kmod@raven

for what we are investigating.

The extra notation for setting a default that is not
the starting value in FLAVORS is like:

FLAVOR?=	SOMEFLAVOR

in the Makefile .

There is a notation for referencing into the FLAVORS
list to pick out the default:

FLAVOR?=	${FLAVORS:[1]}

The port in question does not seem to have these.
On make command lines, FLAVOR=SOMEFLAVOR can be used.
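
For example (assuming the standard ports-tree layout), building and
installing only the raven flavor straight from the ports tree would
look something like:

# cd /usr/ports/graphics/gpu-firmware-amd-kmod
# make FLAVOR=raven install clean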


Side note:

The fact that you use portmaster instead of poudriere
or poudriere-devel (or the like) is likely something
else that should generally be mentioned for a self-built
context: it tells folks something about how much they need
to worry about odd interactions from an unclean build
environment.

It is commonly more difficult to make portmaster or
Makefile-based builds reproducible across systems and
across other context switching: the builds pick up more
of the variation in the live contexts.
Comment 255 George Mitchell 2024-12-18 21:00:17 UTC
Apparently I overlooked the appearance of flavors in this port.  If I believe pciconf -lv, I do indeed have a Raven Ridge chip.  But if drm-510-kmod is going to preemptively load all the firmware packages, it's very surprising (to me, anyway) that a simple compile of the port doesn't preemptively compile all of the flavors.  Certainly it did that by default before the appearance of flavors in the port.

I'll see if this changes any behavior.
Comment 256 Mark Millard 2024-12-18 21:45:09 UTC
(In reply to George Mitchell from comment #255)

The various ports have been set up to allow avoiding the
installation of unnecessary gpu-firmware packages. Just install:

graphics/gpu-firmware-amd-kmod@raven

If you have more installed, you might want to
first uninstall what you have and then do the
above so that only raven related things are
present.

The loading behavior shown in our investigative
materials shows that the system is picking out raven-related
files as what to load for the amdgpu_* naming:

0xffffffff829a5000  0xffffffff829a6000  Yes (*)     ./boot/modules/amdgpu_raven_gpu_info_bin.ko
0xffffffff829a8000  0xffffffff829a9000  Yes (*)     ./boot/modules/amdgpu_raven_sdma_bin.ko
0xffffffff829af000  0xffffffff829b0000  Yes (*)     ./boot/modules/amdgpu_raven_asd_bin.ko
0xffffffff829de000  0xffffffff829df000  Yes (*)     ./boot/modules/amdgpu_raven_ta_bin.ko
0xffffffff829e8000  0xffffffff829e9000  Yes (*)     ./boot/modules/amdgpu_raven_pfp_bin.ko
0xffffffff829f0000  0xffffffff829f1000  Yes (*)     ./boot/modules/amdgpu_raven_me_bin.ko
0xffffffff829f7000  0xffffffff829f8000  Yes (*)     ./boot/modules/amdgpu_raven_ce_bin.ko
0xffffffff82e11000  0xffffffff82e12000  Yes (*)     ./boot/modules/amdgpu_raven_rlc_bin.ko
0xffffffff82e1d000  0xffffffff82e1e000  Yes (*)     ./boot/modules/amdgpu_raven_mec_bin.ko
0xffffffff82e61000  0xffffffff82e62000  Yes (*)     ./boot/modules/amdgpu_raven_mec2_bin.ko
0xffffffff82ea5000  0xffffffff82ea6000  Yes (*)     ./boot/modules/amdgpu_raven_vcn_bin.ko
Comment 257 Mark Millard 2024-12-18 22:26:57 UTC
(In reply to Mark Millard from comment #256)

There is no reason for you to build what you do not
need to install now that such is allowed.

The flavored graphics/gpu-firmware-amd-kmod dates
back to 2022-05-01, and raven was present at the time.

It is very different for the official package builders:
there is no direct user installation, but everything is made
available for a wide variety of installation contexts:
build all flavors but allow installing just what is needed,
using flavors to advantage.

Default build procedures are biased to the official-builder
context. That is why there is a graphics/gpu-firmware-kmod
that builds or installs all the gpu-firmware* but also a
graphics/gpu-firmware-amd-kmod that supports using just
graphics/gpu-firmware-amd-kmod@raven to build or install
just the one variant.

Building graphics/drm-510-kmod does not build any
graphics/gpu-firmware*-kmod as far as I know for
how things are now.


I'll note that, in my view, testing the official packages
that FreeBSD builds is appropriate, even if you want to go
back to building your own afterwards. Why? Being able to
compare/contrast the two. If things just work one way and
fail the other way, that is significant. Also, if you
demonstrate the failure using official package builds, you
are more likely to get support for the problem.
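
For what that comparison would look like in practice (package names
assumed here from the PKGNAMESUFFIX/flavor naming shown earlier),
something like:

# pkg delete -g 'gpu-firmware-amd-kmod-*'
# pkg install gpu-firmware-amd-kmod-raven drm-510-kmod

would replace the locally built firmware bits with the official ones.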
Comment 258 Mark Millard 2024-12-18 22:46:55 UTC
(In reply to Mark Millard from comment #257)

I forgot to write:

Installing graphics/drm-510-kmod also does not
install any graphics/gpu-firmware*-kmod as far
as I know for how things are now.

And I forgot to list what does have the
dependency structure to bundle it all for
builds or installation: graphics/drm-kmod .

It picks between 510 , 515 , and 61 .
It does depend on graphics/gpu-firmware-kmod, which
in turn has a run-dependency on the various gpu
firmware flavors. It is more biased toward creating a
context ready for most anything supported, much as the
official package builders would want for building, for
example. But drm-kmod does not have to be used.
Comment 259 satanist+freebsd 2024-12-21 09:06:39 UTC
I think you are looking in the wrong direction. The question is where the NULL pointer comes from.

So let's look at the 'found_modules->tqh_first->link.tqe_next->. . .->link.tqe_next' instance. This list is only managed by sys/kern/kern_linker.c, and there is only one point where an insert happens:

```
static modlist_t
modlist_newmodule(const char *modname, int version, linker_file_t container)
{
        modlist_t mod;
                
        mod = malloc(sizeof(struct modlist), M_LINKER, M_NOWAIT | M_ZERO);
        if (mod == NULL)
                panic("no memory for module list");
        mod->container = container;
        mod->name = modname;
        mod->version = version;
        TAILQ_INSERT_TAIL(&found_modules, mod, link); 
        return (mod);
}
```

So I would guess the +7 is from the TAILQ list and the fake NULL pointer is directly from malloc(9). So a build with MALLOC_DEBUG might help.

Also, I have looked a bit for PHYS_TO_DMAP in sys/compat/linuxkpi and found arch_io_reserve_memtype_wc(). This function is used in drm-kmod/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:

```
                int r = arch_io_reserve_memtype_wc(adev->gmc.aper_base,
                                adev->gmc.aper_size);

                if (r) {
                        DRM_ERROR("Unable to set WC memtype for the aperture base\n");
#ifdef __linux__
                        /*
                         * BSDFIXME: On recent AMD GPU requested area crosses
                         * DMAP boundries resulting in error. Ignore it for now
                         */
                        return r;
#endif
                }
```

This could also sneak in a fake NULL pointer and cause UB.
Comment 260 Mark Millard 2024-12-21 16:28:27 UTC
(In reply to satanist+freebsd from comment #259)

In:

        mod = malloc(sizeof(struct modlist), M_LINKER, M_NOWAIT | M_ZERO);
        if (mod == NULL)
                panic("no memory for module list");
        mod->container = container;

if something similar to mod == 0xfffff80000000007 had resulted
there, it appears to me that the dereference in mod->container
(or the like) would have gotten a general protection fault right
away, given the actual failures that later happen because of the
0xfffff80000000007 value.

I'll note also that, for example, one of the historical crashes
involving 0xfffff80000000007 was in handling a different list:

/*
 * Remove the references to the thread from all of the objects we were
 * polling.
 */
static void
seltdclear(struct thread *td)
{
        struct seltd *stp;
        struct selfd *sfp;
        struct selfd *sfn;

        stp = td->td_sel;
        STAILQ_FOREACH_SAFE(sfp, &stp->st_selq, sf_link, sfn)
                selfdfree(stp, sfp);
        stp->st_flags = 0;
}

so the issue does not appear to be list-specific, even
if one list fails more commonly than the others for
some reason.

I do not know if there is some relevant relationship with
the likes of code from:

drm-kmod/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c

for alternate failure points.

No simple reproduction test has ever been discovered.


MALLOC_DEBUG is controlled in the kernel via
sys/kern/kern_malloc.c having the code:

#if defined(INVARIANTS) || defined(MALLOC_MAKE_FAILURES) ||             \
    defined(DEBUG_MEMGUARD) || defined(DEBUG_REDZONE)
#define MALLOC_DEBUG    1
#endif

That, in turn, leads to the definition and use of the kernel's
malloc_dbg() and free_dbg(). I certainly have no objection
to such testing, say via an INVARIANTS-based kernel build.
But I am not doing that testing myself, having no context
in which to reproduce the problem; I'm just looking at
vmcore.* file(s) via kgdb .
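
For anyone who does want to run such a test, a minimal custom
kernel configuration along the following lines should be enough
to activate those debug paths (a sketch only: the GENERIC-DEBUG
name and the exact option list are illustrative, not taken from
this report):

include GENERIC
ident   GENERIC-DEBUG
options INVARIANTS        # also defines MALLOC_DEBUG per the kern_malloc.c code above
options INVARIANT_SUPPORT

It would then be built and installed the usual way, e.g.
make buildkernel installkernel KERNCONF=GENERIC-DEBUG from /usr/src.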

But I'll also note that we recently appear to have learned
that some of the software in use was rather old and was not
being updated, so it was not tracking kernel updates. Testing
whether modern software, built to match the kernel in use,
also produces the problem seems appropriate, as that is what
would be changed if there is still a bug to be fixed. As I
understand it, that testing is what is going on now.
Comment 261 George Mitchell 2024-12-21 17:54:53 UTC
Created attachment 256016 [details]
core.txt.9

There have been fewer crashes related to this bug recently, but I do have two more.  "core.txt.9" is basically the same as the December 13 "Latest crash dump/text," in that zfs.ko was the other module involved in the crash.  So I reordered the loading of kernel modules, moving zfs.ko after vboxnetflt.ko and acpi_wmi.ko.  Sure enough, in "core.txt.0" (from a few moments ago) it is vboxnetflt.ko instead of zfs.ko that is involved in the crash.
Comment 262 George Mitchell 2024-12-21 17:55:22 UTC
Created attachment 256017 [details]
core.txt.0
Comment 263 Mark Millard 2024-12-21 18:35:24 UTC
(In reply to George Mitchell from comment #261)

It looks like the build is somewhat different, possibly more
of a debug build? I think this is the first time that
mod = 0xfffff80000000007 has been reported from inside
modlist_lookup :

core.txt.9 :

#6  <signal handler called>
No locals.
#7  strcmp (s1=<optimized out>, s2=<optimized out>)
    at /usr/src/sys/libkern/strcmp.c:44
No locals.
#8  0xffffffff80bc0ab4 in modlist_lookup (name=0xffffffff83255959 "zfsctrl", 
    ver=1) at /usr/src/sys/kern/kern_linker.c:1488
        mod = 0xfffff80000000007

core.txt.0 :

#6  <signal handler called>
No locals.
#7  strcmp (s1=<optimized out>, s2=<optimized out>)
    at /usr/src/sys/libkern/strcmp.c:44
No locals.
#8  0xffffffff80bc0ab4 in modlist_lookup (
    name=0xffffffff829fd0c4 "vboxnetflt", ver=1)
    at /usr/src/sys/kern/kern_linker.c:1488
        mod = 0xfffff80000000007

If so, I'll need to synchronize to any updated files
that I'd previously downloaded, not just the vmcore.[90]
files.

(The mod value is not a surprise. It is from the same
linking field that was found to have 0xfffff80000000007
as its value in the earlier vmcore.8 .)
Comment 264 George Mitchell 2024-12-21 18:45:13 UTC
The last update I made to drm-510-kmod was on December 7.  The only change more recent than that was the order in which I kldload modules at boot time.  For a long time, that was zfs, vboxnetflt, acpi_wmi; and just yesterday I moved zfs to last.
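
(For illustration only, since the report does not say which mechanism is used to load these at boot: if it is kld_list in /etc/rc.conf, the reordering would look something like

kld_list="vboxnetflt acpi_wmi zfs"    # previously "zfs vboxnetflt acpi_wmi"

with the rest of the loader/rc setup unchanged.)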
Comment 265 Mark Millard 2024-12-21 19:29:57 UTC
(In reply to George Mitchell from comment #264)

What about amdgpu_raven*.ko and the like from
graphics/gpu-firmware-amd-kmod@raven (so:
gpu-firmware-amd-kmod-raven-20230625_2)? Did you
rebuild and install it? Did you install an official
FreeBSD package? You had indicated that you had
accidentally only been updating @aldebaran, given
the way that you were doing things with portmaster:
The flavored graphics/gpu-firmware-amd-kmod dates
back to 2022-05-01, and raven was present at the
time. Rebuilds of graphics/gpu-firmware-amd-kmod
via portmaster without an explicit @raven have not
been rebuilding raven's files ever since.

I'll note that rebuilding/installing drm-510-kmod
does not rebuild/install graphics/gpu-firmware-amd-kmod .

(What does span both a drm-*-kmod and all of the
graphics/gpu-firmware-amd-kmod@??? flavors is
drm-kmod . But that builds and installs a lot of
material that is unnecessary in most personal
contexts. drm-510-kmod plus
graphics/gpu-firmware-amd-kmod@raven may be more
reasonable.)
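
For completeness, making sure the raven firmware actually gets
rebuilt means naming the flavor explicitly. Roughly (illustrative
commands, to be adapted to the local workflow):

portmaster graphics/gpu-firmware-amd-kmod@raven

or, via the official package:

pkg install gpu-firmware-amd-kmod-raven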

I gather there have been no updates to the FreeBSD kernel
files, such as a GENERIC-DEBUG build and install? The same
files that I used with vmcore.8 ?

I do not think I have ever had a copy of your locally
built realtek-re-kmod-197.00 *.ko file.
Comment 266 George Mitchell 2024-12-21 21:41:08 UTC
I did install a new version of gpu-firmware-amd-kmod-raven-20230625.1304000_2 on December 18, after "Latest crash dump/text" but before both "core.txt.9" and "core.txt.0".  No kernel changes since December 4, when I updated from 13.3 to 13.4.