Created attachment 237279 [details] /var/crash/core.txt.1 crash description It doesn't happen every time. If I use kld_list="amdgpu" in /etc/rc.conf, it happens close to 50% of the time. If instead I boot to single user mode and manually kldload amdgpu, it happens maybe 20% of the time. If I have amdgpu_load="YES" in /boot/loader.conf, the module fails to load at all, without saying anything. FreeBSD 13.1-RELEASE-p2, drm-510-kmod-5.10.113_7, AMD Ryzen 3 2200G with Radeon Vega Graphics. Crashes are always general protection fault panics, replete with complaints about drm_modeset_is_locked being false.
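The three ways of loading amdgpu described above (kld_list in /etc/rc.conf, amdgpu_load in /boot/loader.conf, or kldload by hand) can be told apart with a small check. This is only a sketch: the function name amdgpu_load_method is made up for illustration, and a plain grep is a simplification of what sysrc(8) would report.

```shell
#!/bin/sh
# Report where amdgpu is requested. Both paths are parameters so the
# check can be tried on copies of the real configuration files.
# amdgpu_load_method is a hypothetical helper name, not a FreeBSD tool.
amdgpu_load_method() {
    loader_conf=${1:-/boot/loader.conf}
    rc_conf=${2:-/etc/rc.conf}
    methods=""
    # amdgpu_load="YES" asks the boot loader to preload the module.
    if grep -Eq '^[[:space:]]*amdgpu_load="?[Yy][Ee][Ss]"?' "$loader_conf" 2>/dev/null; then
        methods="loader.conf"
    fi
    # kld_list="... amdgpu ..." makes rc(8) load it during multi-user startup.
    if grep -Eq '^[[:space:]]*kld_list=.*amdgpu' "$rc_conf" 2>/dev/null; then
        methods="${methods:+$methods }rc.conf"
    fi
    # Neither file mentions it: the module would only be loaded by hand.
    echo "${methods:-manual}"
}
```

Called without arguments it reads the live files; recording its output next to each crash report would make it clearer which loading method was in effect for that boot.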
This is marginally an improvement from FreeBSD 12, where kldload amdgpu would always immediately totally lock up the machine, with no recovery path short of powering down and back on. And when this crash DOESN'T happen, everything works marvelously well (and considerably better than running in VESA mode), so thanks for the work so far!
I have four more of the /var/crash/core.txt files, and core dumps (very large, too big to attach here even compressed) for each of them.
Thank you, and please note that issues for <https://github.com/freebsd/drm-kmod> are normally raised in GitHub.
Ugh, I don't have a GitHub account and I would rather not open one. (Yes, that does seem selfish of me and I apologize.)
From a _very_ quick look, it does not appear that this is an amdgpu problem. The crash is in the core kernel code and the stack trace has mentions of zfs.
#6 <signal handler called>
#7 strcmp (s1=<optimized out>, s2=<optimized out>) at /usr/src/sys/libkern/strcmp.c:46
#8 0xffffffff80be8c3d in modlist_lookup (name=0xfffff80004b71000 "zfs", ver=0) at /usr/src/sys/kern/kern_linker.c:1487
#9 modlist_lookup2 (name=0xfffff80004b71000 "zfs", verinfo=0x0) at /usr/src/sys/kern/kern_linker.c:1501
#10 linker_load_module (kldname=kldname@entry=0x0, modname=modname@entry=0xfffff80004b71000 "zfs", parent=parent@entry=0x0, verinfo=<optimized out>, verinfo@entry=0x0, lfpp=lfpp@entry=0xfffffe0075fddd90) at /usr/src/sys/kern/kern_linker.c:2165
#11 0xffffffff80beb17a in kern_kldload (td=td@entry=0xfffffe007f505a00, file=<optimized out>, file@entry=0xfffff80004b71000 "zfs", fileid=fileid@entry=0xfffffe0075fddde4) at /usr/src/sys/kern/kern_linker.c:1150
#12 0xffffffff80beb29b in sys_kldload (td=0xfffffe007f505a00, uap=<optimized out>) at /usr/src/sys/kern/kern_linker.c:1173
#13 0xffffffff810ae6ec in syscallenter (td=0xfffffe007f505a00) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#14 amd64_syscall (td=0xfffffe007f505a00, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1185
To the reporter: do you by chance have zfs in kld_list?
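Since the module named in the linker_load_module frame is the interesting variable across crash reports, it may help to pull it out of each core.txt mechanically. A minimal sketch, assuming the backtrace text keeps the modname@entry=0x... "name" form shown above; the function name looked_up_module is hypothetical.

```shell
#!/bin/sh
# Print the module name that the kernel linker was asked to load,
# taken from kgdb backtrace text on standard input.
# Relies only on the 'modname@entry=0x<addr> "<name>"' pattern.
looked_up_module() {
    sed -n 's/.*modname@entry=0x[0-9a-f]* "\([^"]*\)".*/\1/p'
}
```

Usage would be along the lines of looked_up_module < /var/crash/core.txt.1, which makes it quick to compare which module was being looked up in each dump.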
Also, how is the root file system tuned?
tunefs -p /
(In reply to George Mitchell from comment #1)
> … from FreeBSD 12 …
Did you run 13.0⋯ for a while, or did you upgrade from 12.⋯ direct to 13.1⋯?
> … immediately after kldload amdgpu …
(In reply to George Mitchell from comment #0)
If I understand correctly, the attachment shows:
1. kldload amdgpu whilst in single user mode
2. a subsequent, but non-immediate, exit ^D to multi-user mode
3. panic
…
ugen0.4: <Logitech USB Optical Mouse> at usbus0
<118>Enter full pathname of shell or RETURN for /bin/sh: Cannot read termcap database;
<118>using dumb terminal settings.
<118>root@:/ # kldload amdgpu
<6>[drm] amdgpu kernel modesetting enabled.
…
<6>[drm] Initialized amdgpu 3.40.0 20150101 for drmn0 on minor 0
<118>root@:/ # ^D
<118>Setting hostuuid: 032e02b4-0499-0547-c106-430700080009.
<118>Setting hostid: 0x82f0750c.
Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff80d17870
stack pointer = 0x28:0xfffffe0075fdda60
frame pointer = 0x28:0xfffffe0075fdda60
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 52 (kldload)
…
Thanks for the work so far. "zfs" is not explicitly in the kld_list, but I do use ZFS and zfs_enable is set to "YES". Also:
tunefs: POSIX.1e ACLs: (-a) disabled
tunefs: NFSv4 ACLs: (-N) disabled
tunefs: MAC multilabel: (-l) disabled
tunefs: soft updates: (-n) disabled
tunefs: soft update journaling: (-j) disabled
tunefs: gjournal: (-J) disabled
tunefs: trim: (-t) disabled
tunefs: maximum blocks per file in a cylinder group: (-e) 4096
tunefs: average file size: (-f) 16384
tunefs: average number of files in a directory: (-s) 64
tunefs: minimum percentage of free space: (-m) 8%
tunefs: space to hold for metadata blocks: (-k) 6408
tunefs: optimization preference: (-o) time
tunefs: volume label: (-L)
I never ran 13.0; I'm always leery of upgrading to x.0 from x-1. (My upgrade was from 12.3-p6.) Also, I still remember a collection of severe crashes from years back with soft updates plus journaling. Are those problems known to be solved now? (Sorry to be getting off the main topic.)
In this particular crash, I manually loaded amdgpu in single-user mode, and then immediately hit control-D.
sysrc -f /etc/rc.conf kld_list – is there amdgpu alone, or are other modules listed? (In reply to George Mitchell from comment #9) Given the brief analysis by avg@ (comment #5), I'm inclined to: * view the load of amdgpu as successful * give thought to other modules, ones that are (or should be) subsequently loaded. Do you use IRC, Matrix (e.g. Element) or Discord?
(In reply to George Mitchell from comment #8)
> …crashes from years back with soft updates plus journaling.
> Are those problems known to be solved now? …
For what's described: without a bug number, it might be impossible for me to tell.
> … I never ran 13.0; …
13.1 fixed a bug that involved soft updates _without_ soft update journaling: <https://www.freebsd.org/releases/13.1R/relnotes/#storage-ufs>
<https://docs.freebsd.org/en/books/handbook/config/#soft-updates> recommends soft updates. If there's no explicit recommendation to also enable soft update journaling, this could be because (bug 261944) there's not yet, in the Handbook, a suitable explanation of the feature.
tunefs(8) <https://www.freebsd.org/cgi/man.cgi?query=tunefs&sektion=8&manpath=FreeBSD> for FreeBSD 13.1-RELEASE lacks a recently added explanation; you can see it by switching the online view of the manual page to FreeBSD 14.0-CURRENT.
Without amdgpu in the kld_list, kld_list currently is not even defined. Perhaps it's more helpful to show what gets loaded aside from amdgpu in the course of a normal boot:
kldstat
 1 64 0xffffffff80200000 1f300f0 kernel
 2 1 0xffffffff82132000 77e0 sem.ko
 3 3 0xffffffff8213a000 8cc90 vboxdrv.ko
 4 1 0xffffffff82600000 3df128 zfs.ko
 5 2 0xffffffff82518000 4240 vboxnetflt.ko
 6 2 0xffffffff8251d000 aac8 netgraph.ko
 7 1 0xffffffff82528000 31c8 ng_ether.ko
 8 1 0xffffffff8252c000 55e0 vboxnetadp.ko
 9 1 0xffffffff82532000 3378 acpi_wmi.ko
10 1 0xffffffff82536000 3218 intpm.ko
11 1 0xffffffff8253a000 2180 smbus.ko
12 1 0xffffffff8253d000 33c0 uslcom.ko
13 1 0xffffffff82541000 4d90 ucom.ko
14 1 0xffffffff82546000 2340 uhid.ko
15 1 0xffffffff82549000 3380 usbhid.ko
16 1 0xffffffff8254d000 31f8 hidbus.ko
17 1 0xffffffff82551000 3320 wmt.ko
18 1 0xffffffff82555000 4350 ums.ko
19 1 0xffffffff8255a000 5af8 autofs.ko
20 1 0xffffffff82560000 2a08 mac_ntpd.ko
21 1 0xffffffff82563000 20f0 green_saver.ko
The SU+J thing is totally anecdotal, based on what I used to see on freebsd-hackers. Right now, I format my disks with UFS for root/var/tmp (no more than 8GB for fast fscking), and then a ZFS partition for /usr. I don't use IRC, Matrix, or Element (not sure what those last two are) and on the rare occasions I use Discord, I use the web site.
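For comparing what is loaded on crashing and non-crashing boots, the names can be pulled out of saved kldstat output. A rough sketch, assuming the five-column layout shown above (Id, Refs, Address, Size, Name); kld_names and kld_is_loaded are illustrative helper names, not existing commands.

```shell
#!/bin/sh
# Print the file name column from kldstat output on standard input.
# Header lines are skipped because their first field is not a number.
kld_names() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 5 { print $5 }'
}

# Succeed if the given file name (for example zfs.ko) is in the listing.
kld_is_loaded() {
    kld_names | grep -Fxq "$1"
}
```

For instance, kldstat | kld_is_loaded zfs.ko && echo "zfs.ko present" could be run in single user mode just before loading amdgpu, to record whether ZFS was already in the kernel at that point.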
As of today, with version drm-510-kmod-5.10.113_8:
1. I can reliably prevent a crash by booting to single user mode, manually kldloading amdgpu, and continuing (typing control-d). dmesg then reports:
[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
[drm] initializing kernel modesetting (RAVEN 0x1002:0x15DD 0x1458:0xD000 0xC8).
drmn0: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[drm] register mmio base: 0xFE600000
[drm] register mmio size: 524288
[drm] add ip block number 0 <soc15_common>
[drm] add ip block number 1 <gmc_v9_0>
[drm] add ip block number 2 <vega10_ih>
[drm] add ip block number 3 <psp>
[drm] add ip block number 4 <gfx_v9_0>
[drm] add ip block number 5 <sdma_v4_0>
[drm] add ip block number 6 <powerplay>
[drm] add ip block number 7 <dm>
[drm] add ip block number 8 <vcn_v1_0>
drmn0: successfully loaded firmware image 'amdgpu/raven_gpu_info.bin'
[drm] BIOS signature incorrect 44 f
drmn0: Fetched VBIOS from ROM BAR
amdgpu: ATOM BIOS: 113-RAVEN-111
drmn0: successfully loaded firmware image 'amdgpu/raven_sdma.bin'
[drm] VCN decode is enabled in VM mode
[drm] VCN encode is enabled in VM mode
[drm] JPEG decode is enabled in VM mode
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
drmn0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
drmn0: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
drmn0: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[drm] Detected VRAM RAM=2048M, BAR=2048M
[drm] RAM width 128bits DDR4
[TTM] Zone kernel: Available graphics memory: 3100774 KiB
[TTM] Zone dma32: Available graphics memory: 2097152 KiB
[TTM] Initializing pool allocator
[drm] amdgpu: 2048M of VRAM memory ready
[drm] amdgpu: 3072M of GTT memory ready.
[drm] GART: num cpu pages 262144, num gpu pages 262144
[drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
drmn0: successfully loaded firmware image 'amdgpu/raven_asd.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_ta.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_pfp.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_me.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_ce.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_rlc.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_mec.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_mec2.bin'
amdgpu: hwmgr_sw_init smu backed is smu10_smu
drmn0: successfully loaded firmware image 'amdgpu/raven_vcn.bin'
[drm] Found VCN firmware Version ENC: 1.12 DEC: 2 VEP: 0 Revision: 1
drmn0: Will use PSP to load VCN firmware
[drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
drmn0: RAS: optional ras ta ucode is not available
drmn0: RAP: optional rap ta ucode is not available
[drm] kiq ring mec 2 pipe 1 q 0
[drm] DM_PPLIB: values for F clock
[drm] DM_PPLIB: 400000 in kHz, 3649 in mV
[drm] DM_PPLIB: 933000 in kHz, 4074 in mV
[drm] DM_PPLIB: 1200000 in kHz, 4399 in mV
[drm] DM_PPLIB: 1333000 in kHz, 4399 in mV
[drm] DM_PPLIB: values for DCF clock
[drm] DM_PPLIB: 300000 in kHz, 3649 in mV
[drm] DM_PPLIB: 600000 in kHz, 4074 in mV
[drm] DM_PPLIB: 626000 in kHz, 4250 in mV
[drm] DM_PPLIB: 654000 in kHz, 4399 in mV
[drm] Display Core initialized with v3.2.104!
[drm] VCN decode and encode initialized successfully(under SPG Mode).
drmn0: SE 1, SH per SE 1, CU per SH 11, active_cu_number 8
[drm] fb mappable at 0x60BCA000
[drm] vram apper at 0x60000000
[drm] size 8294400
[drm] fb depth is 24
[drm] pitch is 7680
VT: Replacing driver "vga" with new "fb".
start FB_INFO:
type=11 height=1080 width=1920 depth=32
pbase=0x60bca000 vbase=0xfffff80060bca000
name=drmn0 flags=0x0 stride=7680 bpp=32
end FB_INFO
drmn0: ring gfx uses VM inv eng 0 on hub 0
drmn0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
drmn0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
drmn0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
drmn0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
drmn0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
drmn0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
drmn0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
drmn0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
drmn0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
drmn0: ring sdma0 uses VM inv eng 0 on hub 1
drmn0: ring sdma0 uses VM inv eng 0 on hub 1
drmn0: ring vcn_dec uses VM inv eng 1 on hub 1
drmn0: ring vcn_enc0 uses VM inv eng 4 on hub 1
drmn0: ring vcn_enc1 uses VM inv eng 5 on hub 1
drmn0: ring jpeg_dec uses VM inv eng 6 on hub 1
vgapci0: child drmn0 requested pci_get_powerstate
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
[drm] Initialized amdgpu 3.40.0 20150101 for drmn0 on minor 0
Is the sysctl_warn_reuse message anything to worry about?
2. Adding amdgpu to the kld_list in rc.conf still crashes more often than not, as previously reported.
3. Attempting to load amdgpu via /boot/loader.conf appears to load the module in memory but not actually make it functional. (X uses VESA mode as if the module isn't there.)
Created attachment 238024 [details] /var/crash/core.txt.3 crash description from today Contrary to comment #13, today I got a crash despite booting to single user mode, typing "kldload amdgpu", and then control-d. But it looks indistinguishable from the other /var/crash/core.txt.1 description. Next I'll try booting to single user mode and kldloading zfs before kldloading amdgpu.
Created attachment 238075 [details] Another crash This time, I booted into single user mode and typed "kldload zfs amdgpu" with no problems. Then when I typed ctrl-d I got this crash (which looks pretty much the same as all the other ones, except the places in the backtrace that used to refer to zfs now refer to vboxnetflt, which I load for VirtualBox). So it seems likely that the crash has nothing to do with any specific other kernel loadable module that might be cited in the backtrace.
The following comment is based on zero actual knowledge of how kernel loadable modules work. Still, based on what I'm seeing with this bug, I hypothesize that after one module is loaded, there is a mechanism by which the next module (and maybe other later ones) call back to modules already loaded in order to prevent incompatible modules (whatever that might mean) from trying to coexist. And somewhere in that path in the amdgpu module, it is detected that some lock that was taken while amdgpu was loading was erroneously not released. (Most of the time, the lock IS released, and I don't know exactly under what circumstances it isn't.) I hope this is helpful.
I've discovered how to avoid this crash (at least the last 20-30 times I have booted up): boot into single user mode, type <ENTER> to run /bin/sh, type "kldload amdgpu," and then (key step!) wait at least five seconds before typing ctrl-D to exit single user mode. Since I don't know why this helps, I guess it falls into the voodoo category, but maybe it's a clue.
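To make the "wait a few seconds" step repeatable rather than timed by hand, the pause could be tied to something observable. The following is only a sketch: wait_for_path is a hypothetical helper, and the use of /dev/dri/card0 as a sign that amdgpu has attached is an assumption that does not prove initialization is complete.

```shell
#!/bin/sh
# Wait until a path exists or a timeout (in seconds) expires.
# Prints the number of seconds waited on success; returns 1 on timeout.
# wait_for_path is an illustrative name, not an existing utility.
wait_for_path() {
    path=$1
    timeout=${2:-10}
    waited=0
    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "$waited"
}
```

In single user mode this might be used as kldload amdgpu && wait_for_path /dev/dri/card0 10, followed by an extra sleep 5 before typing ctrl-D. The printed wait time, noted for crashing and non-crashing boots, would give a little more data on whether timing matters.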
I hate to say again how little I know about kernel module loading, but by any chance is there multithreading in the code that gets called when amdgpu.ko is first loaded? I can't help thinking that perhaps that code is returning prematurely, before some initialization is completely finished and all locks released. If I knew where to put it, I would throw in a five-second delay at the end of whatever gets called to load amdgpu.ko.
Created attachment 238668 [details] Another crash summary; looks like all the earlier ones Contrary to my comment #17, I got this same crash this morning, even waiting five seconds after loading amdgpu.ko before proceeding. So the delay doesn't prevent the crash.
I've figured out why this crash is timing related, and also why ZFS is involved. My system has a 1 TB USB disk, which contains a ZFS file system. When I power my system on, it takes a variable amount of time for that disk to become ready and for ZFS to take note of it. (I'm booting from a SATA disk with a traditional old UFS file system.) So if the USB disk becomes ready while amdgpu is still initializing, apparently this crash happens. I have no clue why that is true, but I am pretty sure this explains why the crash happens only part of the time and is timing dependent. It remains true that the most reliable way to cause the crash is to include amdgpu in the kld_list in /etc/rc.conf and simply boot normally (and to have a *ZFS-formatted USB* disk attached to the system).
I've updated to version drm-510-kmod-5.10.113_8 and it hasn't crashed yet, but I've only had time for one test so far.
(In reply to George Mitchell from comment #21)
If a crash _does_ occur/recur, then maybe test for reproducibility with this in your /boot/loader.conf
kern.smp.disabled=1
<https://www.freebsd.org/cgi/man.cgi?query=smp&sektion=4&manpath=FreeBSD>
(Be prepared for significantly reduced performance after restarting with SMP disabled.)
This is a gut feeling, more than anything (apologies for the noise), partly based on experiences with virtual hardware …
Thanks! So far, I've booted four times with amdgpu in my kld_list, which previously would likely have yielded at least one crash, with no crash. So I have my fingers crossed, but I'll try your hack if it crashes again (and your theory certainly sounds plausible).
Created attachment 238802 [details] New core.txt The latest version definitely crashes less often, but I just now got a new crash that (to me) looks different than the earlier one. I was just about ready to mark this fixed!
After further consideration (and a partly sleepless night), I've decided that the latest crash is not an instance of this bug and possibly isn't related to amdgpu.ko at all. So I'm going to close this bug and maybe open a new one when I understand the new one better. Anyone looking at this bug in the future should pay no attention to "New core.txt" attachment, but should refer to the obsolete attachments.
Created attachment 238849 [details] A new crash I regret to say I'm going to have to reopen this bug. But I will try the proposed workaround and see if it helps (at least until the single-core performance drives me up the wall).
Created attachment 238850 [details] A new instance of the same crash I regret to say the crash has happened again. I'll try the proposed workaround and see if it helps (at least until the reduced performance drives me up the wall).
Reopening bug.
(In reply to George Mitchell from comment #26)
> FreeBSD court 13.1-RELEASE-p2 FreeBSD 13.1-RELEASE-p2 752f813d6 M5P amd64
Please update the OS.
----
Given comment #5 from avg@ and (for example) comment #24, with the different types of kernel panic:
fs@ x11@ please: if panics recur with an updated OS, would you recommend continuing with this report (267028), or starting afresh with a new report for the more recent type of panic?
This is not drm related; the drm messages are noise, printed when we switch ttys, that we should fix one day.
(In reply to Graham Perrin from comment #29) I'm on the release branch, not the stable branch. So you are suggesting I update from 13.1-RELEASE-p2 to 13.1-RELEASE-p5? And then recompile the kernel module as well, I assume?
For what it's worth, I'm doing this testing on a desktop machine, so setting kern.smp.disabled=1 actually doesn't impact operation too much -- except for Thunderbird. And so far I haven't seen the crash with that setting.
Does switching to graphics/drm-510-kmod and updating graphics/gpu-firmware-amd-kmod help?
In fact, switching to graphics/drm-510-kmod from the generic VESA driver is what originally triggered this bug. Without using amdgpu.ko there is no problem.
All your reports show that it's from zfs; again, the drm messages are noise.
Created attachment 238886 [details] Crash after updating kernel/world to 13.1-RELEASE-p5 This is after updating my kernel and world to 13.1-RELEASE-p5. I grant you the backtrace here sure points to the openzfs code, but why does the crash happen only with graphics/drm-510-kmod installed and amdgpu.ko loaded, but not otherwise? For the time being, I will be running WITHOUT amdgpu.ko in my kld_list, and I am confident this crash will not occur. I got maybe six boots with kern.smp.disabled=1 with no crashes on 13.1-RELEASE-p2. But based on an earlier comment I updated to 13.1-RELEASE-p5. Then after going back to kern.smp.disabled=0 I got another instance of the crash. I did observe that something in sys/contrib/openzfs/module/zfs got updated between p2 and p5, but it doesn't seem to have fixed this crash. Compiling graphics/drm-510-kmod under p5 yielded an amdgpu.ko that was identical to amdgpu.ko compiled under p2.
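The "identical amdgpu.ko" observation is easy to re-check after any rebuild. A minimal sketch; same_module is a made-up name, and where the older copy of the module is kept is left to the reader.

```shell
#!/bin/sh
# Succeed when two module files both exist and are byte-for-byte identical.
# same_module is a hypothetical helper built on cmp(1).
same_module() {
    [ -f "$1" ] && [ -f "$2" ] && cmp -s "$1" "$2"
}
```

For example, same_module /path/to/saved/amdgpu.ko /boot/modules/amdgpu.ko && echo identical, where the first path is an assumed location for a copy saved before the rebuild.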
I'm still having this problem, though I can reduce its frequency by booting in single-user mode, kldloading amdgpu, waiting five or ten seconds, and then going to multi-user mode with control-D. I've updated the title to emphasize that the bug happens only when amdgpu.ko (from graphics/drm-510-kmod version 5.10.113_8) and ZFS are both in use. Also, it happens during booting, or else never.
I don't want the title to become too wordy, but also I'll note again that my 1TB USB disk (GPT formatted with one ZFS partition only) that takes a measurable, variable amount of time to become ready may be the main reason this crash doesn't always happen.
(In reply to George Mitchell from comment #37)
grep -e solaris -e zfs /boot/loader.conf
grep zfs /etc/rc.conf
What's reported?
(In reply to Graham Perrin from comment #39) > grep -e solaris -e zfs /boot/loader.conf > grep zfs /etc/rc.conf zfs_enable="YES" # Set to YES to automatically mount ZFS file systems
(In reply to George Mitchell from comment #40) (In reply to George Mitchell from comment #20) > … timing related, … Please add to /boot/loader.conf zfs_load="YES"
Created attachment 239336 [details] Crash dump Well, this helps a bit. By adding that line to /boot/loader.conf and restoring kld_list="amdgpu" to my /etc/rc.conf, I was able to reboot without the crash four times in a row, whereas before it would crash about every other time. But it crashed on the fifth time. (See attached core.txt.0.)
In the new core.txt.0, there are about 19 lines of text from the previous shutdown near the beginning of the file. But the substance of the backtrace looks identical to all the previous ones. So loading ZFS early mitigates the problem but does not fix it.
I think that in these frames we clearly see a bogus pointer / address:
#7 <signal handler called>
#8 vtozoneslab (va=18446735277616529408, zone=<optimized out>, slab=<optimized out>) at /usr/src/sys/vm/uma_int.h:635
#9 free (addr=0xfffff80000000007, mtp=0xffffffff824332b0 <M_SOLARIS>) at /usr/src/sys/kern/kern_malloc.c:911
#10 0xffffffff8214d251 in nv_mem_free (nvp=<optimized out>, buf=0xfffff80000000007, size=16688648) at /usr/src/sys/contrib/openzfs/module/nvpair/nvpair.c:216
I'd recommend poking around frames 11-13 to see from where that address comes. Also, I don't get an impression that the latest crash is similar to earlier ones. kern_reboot / zfs__fini vs dbuf_evict_thread.
It appears I could mitigate this problem if I could load amdgpu.ko from /boot/loader.conf, which currently doesn't work. See bug #268962. Alternatively, at present I can completely avoid this crash by:
1. having zfs_load="YES" in /boot/loader.conf.
2. booting into single user mode.
3. typing kldload amdgpu.
4. typing control-D.
(In reply to George Mitchell from comment #45) Correction to comment #45: I can avoid the problem around 95% of the time with the specified steps, but not 100%.
Created attachment 239752 [details] Latest crash dump The last couple of crashes strongly resemble all the earlier ones, but they are now less frequent with zfs.ko being loaded at /boot/loader.conf time and amdgpu.ko loaded while booted into single user mode. The difference (see core.txt.2 from today's date) is that the backtrace line where modlist_lookup2 is called is now looking up vboxnetflt instead of zfs. My rcorder list shows:
/etc/rc.d/dumpon
/etc/rc.d/sysctl
/etc/rc.d/natd
/etc/rc.d/dhclient
/etc/rc.d/hostid
/etc/rc.d/ddb
/etc/rc.d/ccd
/etc/rc.d/gbde
/etc/rc.d/geli
/etc/rc.d/zpool
/etc/rc.d/swap
/etc/rc.d/zfskeys
/etc/rc.d/fsck
/etc/rc.d/zvol
/etc/rc.d/growfs
/etc/rc.d/root
/etc/rc.d/sppp
/etc/rc.d/mdconfig
/etc/rc.d/hostid_save
/etc/rc.d/serial
/etc/rc.d/mountcritlocal
/etc/rc.d/zfsbe
/etc/rc.d/tmp
/etc/rc.d/zfs
/etc/rc.d/var
/etc/rc.d/cfumass
/etc/rc.d/cleanvar
/etc/rc.d/FILESYSTEMS
/etc/rc.d/geli2
/etc/rc.d/ldconfig
/etc/rc.d/kldxref
/etc/rc.d/adjkerntz
/etc/rc.d/hostname
/etc/rc.d/ip6addrctl
/etc/rc.d/ippool
/etc/rc.d/netoptions
/etc/rc.d/opensm
/etc/rc.d/random
/etc/rc.d/iovctl
/etc/rc.d/rctl
/usr/local/etc/rc.d/vboxnet
/etc/rc.d/ugidfw
/etc/rc.d/autounmountd
/etc/rc.d/mixer
/etc/rc.d/ipsec
/usr/local/etc/rc.d/uuidd
/etc/rc.d/kld
/etc/rc.d/ipfilter
/etc/rc.d/devmatch
/etc/rc.d/addswap
/etc/rc.d/ipnat
/etc/rc.d/ipmon
/etc/rc.d/ipfs
/etc/rc.d/netif
/etc/rc.d/ppp
/etc/rc.d/pfsync
/etc/rc.d/pflog
/etc/rc.d/rtsold
/etc/rc.d/static_ndp
/etc/rc.d/static_arp
/etc/rc.d/devd
/etc/rc.d/resolv
/etc/rc.d/stf
/etc/rc.d/ipfw
/etc/rc.d/routing
/etc/rc.d/bridge
/etc/rc.d/zfsd
/etc/rc.d/defaultroute
/etc/rc.d/routed
/etc/rc.d/pf
/etc/rc.d/route6d
/etc/rc.d/ipfw_netflow
/etc/rc.d/blacklistd
/etc/rc.d/netwait
/etc/rc.d/local_unbound
/etc/rc.d/NETWORKING
/etc/rc.d/kdc
/etc/rc.d/tlsservd
/etc/rc.d/iscsid
/etc/rc.d/pppoed
/etc/rc.d/ctld
/etc/rc.d/nfsuserd
/etc/rc.d/tlsclntd
/etc/rc.d/kfd
/usr/local/etc/rc.d/sndiod
/etc/rc.d/gssd
/etc/rc.d/nfscbd
/etc/rc.d/ipropd_master
/etc/rc.d/ipropd_slave
/etc/rc.d/kadmind
/etc/rc.d/kpasswdd
/etc/rc.d/iscsictl
/etc/rc.d/mountcritremote
/etc/rc.d/archdep
/etc/rc.d/dmesg
/etc/rc.d/wpa_supplicant
/etc/rc.d/hostapd
/etc/rc.d/accounting
/etc/rc.d/mdconfig2
/etc/rc.d/devfs
/etc/rc.d/gptboot
/etc/rc.d/virecover
/etc/rc.d/os-release
/etc/rc.d/motd
/etc/rc.d/cleartmp
/etc/rc.d/newsyslog
/etc/rc.d/syslogd
/etc/rc.d/linux
/etc/rc.d/sysvipc
/etc/rc.d/hastd
/etc/rc.d/localpkg
/etc/rc.d/auditd
/etc/rc.d/bsnmpd
/etc/rc.d/ntpdate
/etc/rc.d/watchdogd
/etc/rc.d/savecore
/etc/rc.d/pwcheck
/etc/rc.d/power_profile
/etc/rc.d/auditdistd
/etc/rc.d/SERVERS
/etc/rc.d/rpcbind
/etc/rc.d/nisdomain
/etc/rc.d/nfsclient
/etc/rc.d/ypserv
/etc/rc.d/ypupdated
/etc/rc.d/ypxfrd
/etc/rc.d/ypbind
/etc/rc.d/ypldap
/etc/rc.d/ypset
/etc/rc.d/keyserv
/etc/rc.d/automountd
/etc/rc.d/yppasswdd
/etc/rc.d/quota
/etc/rc.d/automount
/etc/rc.d/mountd
/etc/rc.d/nfsd
/etc/rc.d/statd
/etc/rc.d/lockd
/etc/rc.d/DAEMON
/etc/rc.d/rwho
/etc/rc.d/utx
/etc/rc.d/bootparams
/etc/rc.d/hcsecd
/etc/rc.d/ftp-proxy
/etc/rc.d/local
/usr/local/etc/rc.d/git_daemon
/etc/rc.d/lpd
/usr/local/etc/rc.d/dbus
/etc/rc.d/mountlate
/etc/rc.d/nscd
/etc/rc.d/ntpd
/etc/rc.d/powerd
/usr/local/etc/rc.d/slurmd
/usr/local/etc/rc.d/slurmctld
/etc/rc.d/ubthidhci
/etc/rc.d/rarpd
/etc/rc.d/sdpd
/etc/rc.d/apm
/etc/rc.d/rtadvd
/etc/rc.d/moused
/etc/rc.d/rfcomm_pppd_server
/usr/local/etc/rc.d/avahi-daemon
/etc/rc.d/swaplate
/etc/rc.d/bthidd
/etc/rc.d/bluetooth
/usr/local/etc/rc.d/avahi-dnsconfd
/etc/rc.d/LOGIN
/etc/rc.d/sshd
/usr/local/etc/rc.d/vboxheadless
/etc/rc.d/syscons
/etc/rc.d/sysctl_lastload
/usr/local/etc/rc.d/xdm
/usr/local/etc/rc.d/vboxwatchdog
/etc/rc.d/inetd
/usr/local/etc/rc.d/dnetc
/usr/local/etc/rc.d/munged
/etc/rc.d/sendmail
/etc/rc.d/ftpd
/usr/local/etc/rc.d/rsyncd
/usr/local/etc/rc.d/saned
/etc/rc.d/cron
/etc/rc.d/msgs
/etc/rc.d/othermta
/etc/rc.d/jail
/etc/rc.d/bgfsck
/usr/local/etc/rc.d/smartd
/etc/rc.d/securelevel
The vboxnetflt.ko module is loaded by /usr/local/etc/rc.d/vboxnet.
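In the rcorder list above, /usr/local/etc/rc.d/vboxnet comes before /etc/rc.d/kld, which is the script that processes kld_list. A small sketch for checking such orderings without reading the list by eye; runs_before is a hypothetical helper and expects rcorder output with one script per line on standard input.

```shell
#!/bin/sh
# Succeed if script A appears before script B in rcorder output on stdin.
# Both names must be present; otherwise the function fails.
runs_before() {
    awk -v a="$1" -v b="$2" '
        $0 == a { pa = NR }    # remember the line number of A
        $0 == b { pb = NR }    # remember the line number of B
        END { exit !(pa && pb && pa < pb) }'
}
```

A possible use is rcorder /etc/rc.d/* /usr/local/etc/rc.d/* 2>/dev/null | runs_before /usr/local/etc/rc.d/vboxnet /etc/rc.d/kld && echo "vboxnet runs first", which would confirm which module loads are attempted before and after amdgpu on a normal boot.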
And the list of kernel modules loaded by a non-crashing boot is:
kernel sem.ko zfs.ko if_re.ko vboxdrv.ko amdgpu.ko drm.ko linuxkpi_gplv2.ko dmabuf.ko ttm.ko amdgpu_raven_sdma_bin.ko amdgpu_raven_asd_bin.ko amdgpu_raven_ta_bin.ko amdgpu_raven_pfp_bin.ko amdgpu_raven_me_bin.ko amdgpu_raven_ce_bin.ko amdgpu_raven_rlc_bin.ko amdgpu_raven_mec_bin.ko amdgpu_raven_mec2_bin.ko amdgpu_raven_vcn_bin.ko vboxnetflt.ko (and a whole bunch more)
In other words, when the crash happens, it always involves a call to modlist_lookup2 from whatever kernel module gets loaded following amdgpu.
*** Bug 268416 has been marked as a duplicate of this bug. ***
Created attachment 239967 [details] Crash after loading vboxnetflt early by hand Since the previous crash included a reference to vboxnetflt.ko, I experimented a few times with amdgpu.ko added to my kld_list in /etc/rc.conf, and loading vboxnetflt by hand after booting to single user mode. I think it's pretty clear at this point that there is no problem in ZFS code. It's a lock mismanagement problem of some sort in amdgpu.ko (from graphics/drm-510-kmod). If I have permission to change the assignee of this bug, I will.
I think this needs to be assigned to x11@freebsd.org, but I don't seem to have the permission to do it.
(In reply to Graham Perrin from comment #10) It does appear that amdgpu.ko always loads successfully. But then the loading of some other module subsequently (which might be zfs.ko or vboxnetflt.ko or maybe something else) somehow causes an unexpected call back into the amdgpu code. I have no idea how. The current situation:
1. zfs.ko is loaded from /boot/loader.conf.
2. I always boot into single user mode.
3. The last few times, I had kld_list="amdgpu.ko" in my /etc/rc.conf, but for now I'm taking it back out.
4. So I'm loading amdgpu.ko manually in single user mode and then waiting ten seconds or so before going multiuser. It's voodoo but it usually avoids the crash.
(In reply to George Mitchell from comment #52) No, it's not. I've told you already that what's printed by drm is not the panic; it's noise when we switch ttys during a panic. All your crash logs talk about zfs dbufs; this isn't amdgpu.
(In reply to Emmanuel Vadot from comment #54) If I boot up without loading amdgpu.ko at all, then I NEVER get the crash. Confirmed many many times.
(In reply to Emmanuel Vadot from comment #54) I think that George's point was not about anything that gets printed, but what happens depending on whether amdgpu gets loaded (and when) or not. It's not unimaginable that an exotic bug in one module (or in the module loading code or the code for resolving symbols) results in a memory corruption and a crash elsewhere. A very wild guess, but I'd check if there are any duplicate symbols between amdgpu and zfs.ko... and even the kernel itself.
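The duplicate-symbol check suggested above could be sketched roughly as follows. It works on saved nm(1) output rather than on the modules directly; defined_symbols and common_symbols are made-up names, and restricting the comparison to global T, D, B and R symbols is an assumption about what counts as a candidate clash.

```shell
#!/bin/sh
# Print defined global symbol names (types T, D, B, R) from nm output
# on standard input, sorted and without repeats. Undefined symbols
# (type U) have only two fields and are skipped.
defined_symbols() {
    awk 'NF == 3 && $2 ~ /^[TDBR]$/ { print $3 }' | sort -u
}

# Print names defined in both files; each file holds the nm output
# of one module.
common_symbols() {
    a=$(mktemp) && b=$(mktemp) || return 1
    defined_symbols < "$1" > "$a"
    defined_symbols < "$2" > "$b"
    comm -12 "$a" "$b"
    rm -f "$a" "$b"
}
```

One way to feed it: nm /boot/kernel/zfs.ko > zfs.sym and nm /boot/modules/amdgpu.ko > amdgpu.sym, then common_symbols zfs.sym amdgpu.sym. Any name it prints is only a lead to look at, not proof of a conflict.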
(In reply to Andriy Gapon from comment #56) But then anyone else using zfs+amdgpu will have the same problem and that's not the case (I use both on multiple machines running either 13.1, stable/13 or CURRENT).
If it is ZFS, then the only exotic factor on my system is an external USB one-terabyte drive (WDC WD10EZEX-08WN4A0), formatted with GPT and one ZFS partition, that seems to take a variable amount of time to come on line at power up. I theorized at one point that tasting that drive at an unpredictable time was a factor in the crash. Your mileage may vary.
(In reply to Emmanuel Vadot from comment #54) QUOTE All you crash logs talk about zfs dbufs END QUOTE Not true: "Crash dump" and "Latest crash dump" have no examples of "dbuf" in the submitted text. Also: The backtrace in "Latest crash dump" makes no mention of "zfs" at all. (It does occur in other text.)
(In reply to George Mitchell from comment #58) Could a test be formed on your hardware, loading ZFS but having no actual import of any pool, possibly not even a pool to find (empty "zpool import")? As it stands, it is hard for anyone else to build a context analogous to yours for testing. Finding a failure in a context that is simpler to replicate could help avoid yours being the only known failure context. So any other variations that are simpler contexts for others to replicate and test would be a good thing. But, also, if such effort ends up unable to replicate the problem in your environment, that might be useful information as well.
(In reply to Mark Millard from comment #60) In addition to my external USB ZFS drive, also my /usr file system is a ZFS slice. My main hard drive has a very small UFS root (and /var and /tmp) slice, because I have a superstitious fear of ZFS on root. The /usr slice (the rest of the drive) is big enough to take an annoying amount of time to fsck, so when I first added this drive to my system (which was also when I updated from 12 to 13), I chose ZFS for /usr to minimize that time. For a while, I suppose I could copy my /usr slice onto the /usr slice from my old internal drive and mount that in place of the current /usr slice for some tests, and I could do without the external drive. I'll have to think about this.
(In reply to George Mitchell from comment #61) If you can boot an external USB3 drive or some such, may be a minimal separate drive: UFS 13.1-RELEASE with enough added to also have amdgpu.ko . With such a context, do you still manage to see boot failures? Progressing from the simplest independent context towards an independent one more like your normal context might be easier --and might avoid needing to change your normal context as much. Just a test context, not a normal use one. Fewer constraints on the configuration that way. Food for thought.
I had the same problem on 13.1-STABLE: loading the vbox modules caused an immediate kernel panic, so I rolled back to 13.1-STABLE. Loading the vbox drivers from the bare loader, once the kernel was loaded, was okay. When the vbox drivers were listed in /boot/loader.conf or /etc/rc.conf, it caused an immediate kernel panic (no dump). virtualbox-ose-kmod was recompiled from ports on a newly installed kernel and system. Not sure if this is amdgpu or zfs related though..?
*rolled back to 13.1-RELEASE sorry :-) All works fine here. Might be vbox + amdgpu api desync?
(In reply to George Mitchell from comment #36) Have you ever gotten a crash with kern.smp.disabled=1 ? If not, how many tests did you try?
(In reply to George Mitchell from comment #53) A test might be to load something simple or unusual for your context after amdgpu.ko and seeing if it still crashes. I'm not sure it is a good example, but does, say, loading amdgpu.ko and then filemon.ko also lead to a crash (not loading more after that)?
(In reply to Emmanuel Vadot from comment #57) Only "good", easy bugs are like that. That's why I said that this one must be exotic. But there must be something specific about George's environment too. Maybe configuration, maybe build, maybe specific hardware, maybe even a hardware glitch. E.g., maybe if the graphics is active the RAM is more likely to randomly flip a bit.
(In reply to Mark Millard from comment #65) Yes, I got the crash. See comment #26.
I have a spare disk I can use for a test without ZFS. It's currently at 12.0-RELEASE so it will take me a while to update it to 13. Possibly I won't have a chance today, but I will try it.
(In reply to George Mitchell from comment #68) #26 and #27 indicate that you would try the workaround kern.smp.disabled=1 , not the result of trying: #26: I will try the proposed workaround and see if it helps (at least until the single-core performance drives me up the wall) #27: I'll try the proposed workaround and see if it helps (at least until the reduced performance drives me up the wall). That is part of why I asked. Did the failure result with kern.smp.disabled=1 seem the same/similar to the other failures --or was it distinct in some way?
(In reply to George Mitchell from comment #24) It looks to me like the backtrace in "Latest crash dump":
    KDB: stack backtrace:
    #0 0xffffffff80c66ec5 at kdb_backtrace+0x65
    #1 0xffffffff80c1bbcf at vpanic+0x17f
    #2 0xffffffff80c1ba43 at panic+0x43
    #3 0xffffffff810addf5 at trap_fatal+0x385
    #4 0xffffffff81084fb8 at calltrap+0x8
    #5 0xffffffff80be8c3d at linker_load_module+0x17d
    #6 0xffffffff80beb17a at kern_kldload+0x16a
    #7 0xffffffff80beb29b at sys_kldload+0x5b
    #8 0xffffffff810ae6ec at amd64_syscall+0x10c
    #9 0xffffffff810858cb at fast_syscall_common+0xf8
basically matches the 4 attachments that have been set to be Obsolete. Should the Obsolete status be undone on the 4? Vs.: Should "Latest crash dump" be made to also be Obsolete? I'm guessing that none of the attachments should be obsolete at this point.
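For comparing backtraces from different attachments, it can help to turn each "address at symbol+offset" line into the start address of the function. What follows is only a minimal user-space sketch: `struct bt_frame`, `parse_bt_line` and `bt_function_base` are hypothetical names invented for this illustration, not part of any FreeBSD tool, and the parser assumes exactly the line shape shown in the kernel's "KDB: stack backtrace" output.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One parsed line of a "KDB: stack backtrace" listing (hypothetical type). */
struct bt_frame {
	int      index;       /* the "#N" frame number */
	uint64_t addr;        /* address printed for the frame */
	char     symbol[64];  /* function name before the '+' */
	uint64_t offset;      /* hexadecimal offset after the '+' */
};

/*
 * Parse a line such as
 *   "#5 0xffffffff80be8c3d at linker_load_module+0x17d"
 * Returns 1 on success and 0 if the line does not have that shape.
 */
static int
parse_bt_line(const char *line, struct bt_frame *f)
{
	int n;

	assert(line != NULL && f != NULL);
	n = sscanf(line, "#%d 0x%" SCNx64 " at %63[^+]+0x%" SCNx64,
	    &f->index, &f->addr, f->symbol, &f->offset);
	return (n == 4);
}

/* Start address of the function that the frame's address lies in. */
static uint64_t
bt_function_base(const struct bt_frame *f)
{
	return (f->addr - f->offset);
}
```

Applied to frame #5 above, this gives 0xffffffff80be8c3d - 0x17d = 0xffffffff80be8ac0 as the start of linker_load_module. If the same symbol yields different start addresses in two attachments, the two dumps came from kernels with a different layout, which is worth knowing before comparing raw addresses between them.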
(In reply to Mark Millard from comment #70) There was also #36 with: QUOTE I got maybe six boots with kern.smp.disabled=1 with no crashes on 13.1-RELEASE-p2. But based on an earlier comment I updated to 13.1-RELEASE-p5. Then after going back to kern.smp.disabled=0 I got another of the crash. END QUOTE It only reported not getting a crash for kern.smp.disabled=1 .
(In reply to Mark Millard from comment #70) I should have referred you to comment #27, not #26. But I definitely got the crash with smp.disabled=1. (In reply to Mark Millard from comment #71) I could make a case for obsoleting all but two of them, but possibly I would be throwing away useful information. To my unpracticed eye, though, the ones I DID obsolete were pretty redundant with the ones I kept. They all look pretty similar to me.
(In reply to George Mitchell from comment #73) But they are all the examples where the backtraces have nothing from zfs or dbuf. Having 5 of 11 reports that way looks rather different from 1 out of 7. I'd say that the frequency is notable.
(In reply to George Mitchell from comment #73) #27: QUOTE I'll try the proposed workaround and see if it helps (at least until the reduced performance drives me up the wall). END QUOTE It still says that you will try in the future, not explicitly that you had a failure with kern.smp.disabled=1 . #36 reports not having failures with kern.smp.disabled=1 . I did not find any wording I could interpret as reporting a failure with kern.smp.disabled=1 (prior to #73). Do you remember noticing anything distinct? (Probably not, or you would have commented in #73. But just to be sure . . .)
(In reply to Mark Millard from comment #75) It's close to two months ago, so my memory may be misleading me, since my age is beginning to resemble the number of this comment. But I'm pretty sure smp.disabled=1 did not prevent the bug. I could be wrong.
I have been remiss in testing this without ZFS, because I will have to shuffle a couple of disks around. I apologize for the delay. I hope to be able to try this test later this week.
Although I have not yet managed to test this without ZFS, I have established that with zfs_load="YES" but without "vboxnet_enable="YES"" in /etc/rc.conf (zfs.ko and vboxnetflt.ko seeming to be the two modules with which amdgpu.ko has, um, personality conflicts), I can now boot up without crashing (so far). Does anyone have any idea what zfs.ko and vboxnetflt.ko do that other modules don't do?
I omitted an important phrase. It should have said, "with zfs_load="YES" in /boot/loader.conf ..."
Created attachment 240427 [details] New version of the crash, from acpi_wmi Here's another module that doesn't get along well with amdgpu.ko on my system: acpi_wmi.ko. Other than that this crash looks identical to all the earlier ones, as far as I can tell. It's been about a dozen boot-up tries since I put zfs_load="YES" into /boot/loader.conf (so that ZFS gets loaded early to minimize its interaction with amdgpu.ko) and vboxnet_enable="NO" in /etc/rc.conf (so that vboxnetflt.ko doesn't get its chance to cause trouble either) until I got this new crash. I'll mention again that this crash always happens within a minute of booting up, or else never. Anyone have any ideas about what acpi_wmi.ko has in common with zfs.ko and vboxnetflt.ko?
(In reply to George Mitchell from comment #80) There are multiple, distinct backtraces in your various examples. This one matches the 4 still-listed-as Obsolete ones and the "Latest crash dump" one, but not the others (if I remember right). So it is another example where there is no mention of dbuf or of zfs in the backtrace's text, unlike some other backtraces. So far as I can tell, there still has been no evidence gathering seeing if the problem can happen absent zfs being loaded or zfs loaded but no pools ever imported. If I gather correctly, we now do have evidence that the specific type of backtrace can happen without vboxnetflt.ko ever having been loaded, proving it is not necessary for that kind of failure. That is a form of progress as far as evidence goes. It also suggests that merely being listed in a backtrace does not mean that fact necessarily tells one much about the basic problem. There is some possibility here that there is more than one basic problem and some of the backtrace variability is associated with that.
Using the gdb-based backtrace information:
    #8 0xffffffff80be8c5d in modlist_lookup (name=0xfffff80006217400 "acpi_wmi", ver=0) at /usr/src/sys/kern/kern_linker.c:1487
is for the strcmp code line in:
    static modlist_t
    modlist_lookup(const char *name, int ver)
    {
            modlist_t mod;
            TAILQ_FOREACH(mod, &found_modules, link) {
                    if (strcmp(mod->name, name) == 0 &&
                        (ver == 0 || mod->version == ver))
                            return (mod);
            }
            return (NULL);
    }
We also see that strcmp was called via:
    #6 <signal handler called>
    #7 strcmp (s1=<optimized out>, s2=<optimized out>) at /usr/src/sys/libkern/strcmp.c:46
We also see name was accessible, as shown in the "#8" line above. We see from #7 that strcmp was actually entered, suggesting that reading mod->name did not itself fault. The implication would be that the value in mod->name was a bad pointer when strcmp tried to use it. Nothing says that mod->name was or should have been for acpi_wmi at all. The "acpi_wmi" side of the comparison need not be relevant information. Other backtraces that look similar may well have a similar status for the name in the right-hand argument to the strcmp. This might be a useful hint to someone with appropriate background, or suggest some way of detecting the bad value in mod->name earlier, when that earlier context might be of more use for investigations.
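To make the point about mod->name concrete, here is a minimal user-space model of the same list walk, with a check on each entry before strcmp touches it. Everything in it is an assumption made for illustration: `struct modlist` is a simplified stand-in for the kernel's structure, and `modlist_lookup_checked` and its `bad` counter are hypothetical, not existing kernel code. It can only notice a NULL name; a dangling or otherwise corrupted non-NULL pointer, which is what the crashes seem to involve, would still fault inside strcmp.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/queue.h>

/* Simplified stand-in for the kernel's module list entry (assumption). */
struct modlist {
	TAILQ_ENTRY(modlist) link;
	const char *name;
	int version;
};
TAILQ_HEAD(modlisthead, modlist);

/*
 * Same walk as modlist_lookup(), but each entry's name is inspected
 * before strcmp() dereferences it. *bad is incremented once for every
 * entry visited whose name pointer is NULL; such entries are skipped.
 */
static struct modlist *
modlist_lookup_checked(struct modlisthead *head, const char *name, int ver,
    int *bad)
{
	struct modlist *mod;

	assert(head != NULL && name != NULL && bad != NULL);
	TAILQ_FOREACH(mod, head, link) {
		if (mod->name == NULL) {
			(*bad)++;
			continue;
		}
		if (strcmp(mod->name, name) == 0 &&
		    (ver == 0 || mod->version == ver))
			return (mod);
	}
	return (NULL);
}
```

The design point is only that the faulting entry is whichever one the walk reaches first, independent of the name being searched for, so the module named in frame #8 says little about which entry was damaged.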
I have set up a disk with FreeBSD 13.1-RELEASE-p7 and drm-510-kmod 5.10.113_8 WITHOUT ZFS and vbox-anything. I don't know how to avoid loading acpi_wmi.ko. So far it hasn't crashed, but I will try a whole bunch of reboots tomorrow with that disk.
(In reply to George Mitchell from comment #83) I found the following text on https://cateee.net/lkddb/web-lkddb/ACPI_WMI.html : QUOTE ACPI-WMI is a proprietary extension to ACPI to expose parts of the ACPI firmware to userspace - this is done through various vendor defined methods and data blocks in a PNP0C14 device, which are then made available for userspace to call. The implementation of this in Linux currently only exposes this to other kernel space drivers. This driver is a required dependency to build the firmware specific drivers needed on many machines, including Acer and HP laptops. END QUOTE So, I expect that if acpi_wmi.ko is being loaded by FreeBSD, it may well be a requirement for that machine to boot and/or operate via ACPI. But I'm not familiar with the details.
I have a new crash, but I did not get a dump because of an issue I will explain below. For those who came in late, here's a summary of my system. dmesg says I have:
    CPU: AMD Ryzen 3 2200G with Radeon Vega Graphics (3493.71-MHz K8-class CPU)
    Origin="AuthenticAMD" Id=0x810f10 Family=0x17 Model=0x11 Stepping=0
    Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
    Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
    AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
    AMD Features2=0x35c233ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX>
    Structured Extended Features=0x209c01a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA>
    XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
    AMD Extended Feature Extensions ID EBX=0x1007<CLZERO,IRPerf,XSaveErPtr,IBPB>
    SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
    TSC: P-state invariant, performance statistics
My motherboard is a Gigabyte B450M D53H. BIOS is American Megatrends version F4, dated 1/25/2019. pciconf -lv says:
    vgapci0@pci0:6:0:0: class=0x030000 rev=0xc8 hdr=0x00 vendor=0x1002 device=0x15dd subvendor=0x1458 subdevice=0xd000
        vendor = 'Advanced Micro Devices, Inc. [AMD/ATI]'
        device = 'Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]'
        class = display
        subclass = VGA
Until recently, when I was running FBSD 12-RELEASE, my box had one hard drive. I added a new drive when I upgraded to FBSD 13-RELEASE so I would still have FBSD 12 as an emergency backup. Part of the upgrade is that on the new disk I created a small UFS slice for /, /var, and /tmp, and most of the rest of the disk is a ZFS slice for /usr (so I wouldn't have to wait for fsck on reboot after crashes). That means that it isn't practical to do a test without ZFS on that new disk (I'll call it my regular disk now).
So I installed FBSD 13 (same version as my regular disk) on the old disk (I'll call it the test disk now), which had (and still has) a small UFS slice for /, /var, and /tmp and a big UFS slice for /usr. To boot from the test disk, I use the BIOS boot menu, since (unsurprisingly) I have set the default boot disk to my regular disk. I removed all mentions of ZFS and VBOX from /boot/loader.conf and /etc/rc.conf on the test disk. Then I booted up a whole bunch of times. On the thirteenth try, I got the crash. Unfortunately, I don't have a crash summary from it because the system rebooted from my regular disk instead of the test disk while I was still staring at the crash message on the screen. Subsequently, I booted 20 more times from the test disk without getting the crash again. What I saw (for a few seconds) on the screen from the one crash sure looked like the same old backtrace, and I have to say, to an ignorant yokel like myself, it seemed to be saying that there's a locking problem in amdgpu. There was absolutely no virtual terminal switching, because I had not started an X server and I did not type ALT+Fn. I'll try getting a proper crash dump later (possibly tomorrow). My thanks to all of you for your patience.
(In reply to George Mitchell from comment #85) Where does dumpdev point for "test disk"? Someplace also on the "test disk" that a "regular disk" boot would not change? If yes, the first boot of the "test disk" after the crash should have picked up the dump information, even if the "regular disk" was booted between times. But if the dumpdev place is common to both types of boot, then the regular disk boot would have processed the dump, likely using a different /var/crash/ place to store things. Another question would be if there is sufficient room for /var/crash/ to contain the saved vmcore.* and related files. Yet another question is if the test disk has /usr/local/bin/gdb installed vs. not. (When present, /usr/local/bin/gdb is used to provide one of the forms of backtrace, the one with source file references and line numbers and such. Much nicer to deal with.) If a vmcore.* was saved but some related information was not for some reason, it should be possible to have the related information produced based on the vmcore.* file. Side note: In case it is relevant, I'll note that defining dumpdev in /boot/loader.conf in a form the kernel can handle, instead of in /etc/rc.conf, can be used to allow the system to produce dumps for earlier crashes. (But I'm guessing the crash was not early enough to need such.)
(In reply to George Mitchell from comment #85) For booting the test disk, getting the kldstat output from a successful boot might prove useful reference material at some point: it should show what to expect to be loaded by the kernel and in what order. Since you got a crash before starting the X server and had not used ALT+Fn, that would be appropriate context for the kldstat relative to the known UFS-only crash. Other time frames for kldstat may be relevant at some point.
I booted a ThreadRipper 1950X system via its UFS-only boot media alternative. The system is not set up for X. For example, no use/installation of amdgpu.ko for use with its video card. For reference:
    # kldstat
    Id Refs Address            Size    Name
     1   58 0xffffffff80200000 295a5a0 kernel
     2    1 0xffffffff83210000 3370    acpi_wmi.ko
     3    1 0xffffffff83214000 3210    intpm.ko
     4    1 0xffffffff83218000 2178    smbus.ko
     5    1 0xffffffff8321b000 2220    cpuctl.ko
     6    1 0xffffffff8321e000 3360    uhid.ko
     7    1 0xffffffff83222000 4364    ums.ko
     8    1 0xffffffff83227000 33a0    usbhid.ko
     9    1 0xffffffff8322b000 32a8    hidbus.ko
    10    1 0xffffffff8322f000 4d00    ng_ubt.ko
    11    6 0xffffffff83234000 ab28    netgraph.ko
    12    2 0xffffffff8323f000 a238    ng_hci.ko
    13    4 0xffffffff8324a000 2668    ng_bluetooth.ko
    14    1 0xffffffff8324d000 8380    uftdi.ko
    15    1 0xffffffff83256000 4e48    ucom.ko
    16    1 0xffffffff8325b000 3340    wmt.ko
    17    1 0xffffffff8325f000 e250    ng_l2cap.ko
    18    1 0xffffffff8326e000 1bf08   ng_btsocket.ko
    19    1 0xffffffff8328a000 38b8    ng_socket.ko
    20    1 0xffffffff8328e000 2a50    mac_ntpd.ko
    # uname -apKU
    FreeBSD amd64_UFS 14.0-CURRENT FreeBSD 14.0-CURRENT #61 main-n261026-d04c86717c8c-dirty: Sun Feb 19 15:03:52 PST 2023 root@amd64_ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 1400081 1400081
After getting another instance of my crash on my test disk and then booting from the correct disk, I got a crash summary that said: Dwarf Error: wrong version in compilation unit header (is 4, should be 2) [in module /usr/lib/debug/boot/kernel/kernel.debug] It occurred to me that when I updated my test disk from FBSD 12 to 13 I had forgotten to run mergemaster. So I did so today. But I haven't been able to reproduce the crash in 25 tries since then. I'm convinced that running mergemaster did not fix the crash, which is after all highly random. So I will try some more tomorrow. I appreciate everybody's patience.
(In reply to George Mitchell from comment #89) What vintage/version of *gdb was in use? (If it was gdb that complained.) Was it /usr/local/bin/*gdb ? /usr/libexec/*gdb ? Actually, for the backtrace activity, it is kgdb that is used, not gdb. Thus my use of "*gdb" notation. But a core.txt.* file in my context shows: GNU gdb (GDB) 12.1 [GDB v12.1 for FreeBSD] which would be for /usr/local/bin/*gdb ( not /usr/libexec/*gdb ). This is because I have: # pkg info gdb gdb-12.1_3 Name : gdb Version : 12.1_3 . . . installed. (I had to create a livecore.* to have something to reference/illustrate with, having had no example vmcore.* files around for a long time.) A significantly older gdb might indicate use of an old /usr/libexec/*gdb that had not been cleaned out. I'll note that I got no DWARF complaints from kgdb and: # llvm-dwarfdump -r 1 /usr/lib/debug/boot/kernel/kernel.debug | grep DWARF | head -1 0x00000000: Compile Unit: length = 0x000001d3, format = DWARF32, version = 0x0004, abbr_offset = 0x0000, addr_size = 0x08 (next unit at 0x000001d7) indicates version = 0x0004 . This leads me to expect that you have an old gdb (kgdb) around that is in use. It sounds like you got a savecore into /var/crash/ . It should be possible to try investigating that without having to cause another crash, presuming the system is not updated (so that it matches the crash contents). For example, the same sort of command that crashinfo uses on the saved system-core file could be manually tried, possibly with a more modern kgdb vintage being used that would handle the more recent DWARF version. Attaching your core.txt.* file content might prove useful.
Created attachment 240591 [details] A new but related crash (I think) This one was at shutdown time rather than boot-up time, so potentially virtual terminal switching was involved. But once again there are references to "WARNING !drm_modeset_is_locked(&plane->mutex) failed" along with a mention of ZFS. I don't know what it means.
(In reply to George Mitchell from comment #91) So, apparently, this was not one of the UFS-only experiments. The gdb backtrace is messy: . . . #7 <signal handler called> . . . #27 <signal handler called> #28 0x00000000002881da in ?? () Backtrace stopped: Cannot access memory at address 0x7fffffffd688 This indicates that we are not seeing evidence from the earlier problem that got #27. That, in turn, may or may not have been the original problem. The context looks to be a very different context than prior reports. But not seeing what led to #27 makes forming solid judgments problematic. I see from this that a modern gdb (kgdb) was in use for this failure for the crashinfo generation after the savecore operation, having no problems with DWARF 4 vs. 2. But it would seem to be the boot media normally used with ZFS instead of the boot media intended for UFS-only testing. The two might differ in what gdb (kgdb) is around for crashinfo to use.
(In reply to Mark Millard from comment #92) Looking at it some more and comparing to
    #0 0xffffffff80c66ee5 at kdb_backtrace+0x65
    #1 0xffffffff80c1bbef at vpanic+0x17f
    #2 0xffffffff80c1ba63 at panic+0x43
    #3 0xffffffff810addf5 at trap_fatal+0x385
    #4 0xffffffff810ade4f at trap_pfault+0x4f
    #5 0xffffffff81084fd8 at calltrap+0x8
    #6 0xffffffff8214d251 at spl_nvlist_free+0x61
    #7 0xffffffff8220d740 at fm_nvlist_destroy+0x20
    #8 0xffffffff822e6e95 at zfs_zevent_post_cb+0x15
    #9 0xffffffff8220cd02 at zfs_zevent_drain+0x62
    #10 0xffffffff8220cbf8 at zfs_zevent_drain_all+0x58
    #11 0xffffffff8220ede9 at fm_fini+0x19
    #12 0xffffffff82243b94 at spa_fini+0x54
    #13 0xffffffff822ee303 at zfs_kmod_fini+0x33
    #14 0xffffffff8215fb3b at zfs_shutdown+0x2b
    #15 0xffffffff80c1b76c at kern_reboot+0x3dc
    #16 0xffffffff80c1b381 at sys_reboot+0x411
    #17 0xffffffff810ae6ec at amd64_syscall+0x10c
both #27 and #28 in:
    #26 amd64_syscall (td=0xfffffe000f43ca00, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1185
    #27 <signal handler called>
    #28 0x00000000002881da in ?? ()
    Backtrace stopped: Cannot access memory at address 0x7fffffffd688
are possibly just the normal difficulty with finding where to stop listing.
(In reply to Mark Millard from comment #93)
    #7 <signal handler called>
    #8 vtozoneslab (va=18446735277616529408, zone=<optimized out>, slab=<optimized out>) at /usr/src/sys/vm/uma_int.h:635
looks to be the "*slab" line in:
    static __inline void
    vtozoneslab(vm_offset_t va, uma_zone_t *zone, uma_slab_t *slab)
    {
            vm_page_t p;
            p = PHYS_TO_VM_PAGE(pmap_kextract(va));
            *slab = p->plinks.uma.slab;
            *zone = p->plinks.uma.zone;
    }
For reference: 18446735277616529408 == 0xFFFFF80000000000
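The decimal-to-hexadecimal equivalence quoted for va can be checked mechanically. The sketch below is only arithmetic on the value printed by kgdb; `VA_DECIMAL` and `low_zero_bits` are names made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The va argument printed in frame #8, written as the decimal kgdb showed. */
#define VA_DECIMAL 18446735277616529408ULL

/* Number of consecutive zero bits at the low end of v (64 if v is 0). */
static unsigned
low_zero_bits(uint64_t v)
{
	unsigned n = 0;

	if (v == 0)
		return (64);
	while ((v & 1) == 0) {
		v >>= 1;
		n++;
	}
	return (n);
}
```

With these definitions, `VA_DECIMAL == 0xFFFFF80000000000ULL` holds and `low_zero_bits(VA_DECIMAL)` is 43, so the value is 2^64 - 2^43: the top 21 bits are set and everything below is zero. That is a suspiciously round number, which fits the idea that the pointer passed in was not a real allocation address.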
Created attachment 240622 [details] Another crash summary; looks like all the earlier ones Quick summary: I can't cause this crash on my test setup (amdgpu but no ZFS) over close to 50 tries. In more detail: I deleted all ports from my test setup and then added drm-510-kmod and gpu-firmware-amd-kmod, and (most importantly) gdb. I then made many fruitless attempts to reproduce the crash. Experimentally, I added "zfs" to my kld_list in /etc/rc.conf and got another instance of the crash after 11 attempts (see attachment). This crash looks like all the ones from my regular setup, but at least it appears to be in the right format to get a backtrace, etc. I then took "zfs" out of my kld_list and tried another 20 times to get the crash to recur. It did not recur.
(In reply to Mark Millard from comment #94) The "signal handler called" line hides a function call. I think the crash is due to a null pointer dereference ("fault virtual address = 0x0") in pmap_kextract called from the line above. Tracking down the PC address 0xffffffff80bf3727 in the kernel image should clarify.
(In reply to George Mitchell from comment #95) But, as I understand it, comments #85 and #89 reported crashes of the test setup (no ZFS). (I ignore #91, which was at shutdown and looks different.) If true, we do have some existence-proof type evidence for the crash without ZFS involved. It just may be less common. (Unfortunately some detail was not available for validating a context match.) You may not want to spend all your time with the no-ZFS style tests, but spending some time on occasion could eventually prove useful. Any big, complicated thing (like ZFS) that can be eliminated may help isolate the problem.
(In reply to John F. Carr from comment #96) As I understand it, "fault virtual address = 0x0" is for #7 and not for #27. As far as I can tell, what led to #27 and its specific type is not available to us.
(In reply to George Mitchell from comment #95) FYI: "Another crash summary; looks like all the earlier ones" is a crash when it is getting ready to load ZFS, not after ZFS has been loaded. So ZFS had not been started yet. So it is evidence for a problem without having ZFS in operation at all.
(In reply to Mark Millard from comment #98) Frame 27 is the entry into the kernel via the system call trap. We know this because it calls amd64_syscall. Frame 28 is a user program. We know this because the addresses are at the user address space and not the kernel address space (program counter at 0x2881da, stack frame at 0x7fffffffd688).
(In reply to George Mitchell from comment #95) FYI: "Another crash summary; looks like all the earlier ones" is a crash when it is getting ready to load ZFS, not after ZFS has been loaded. So ZFS had not been started yet. So it is evidence for a problem without having had ZFS in operation at all.
(In reply to Mark Millard from comment #97) You are correct that I did get two crashes without ZFS, but they did not appear to produce decipherable dumps. I'll keep trying for another dump without ZFS now that I know we will obtain a usable dump on the test setup. (In reply to Mark Millard from comment #101) That's why we stopped seeing the reference to ZFS when I took "zfs" out of kld_list and put "zfs_load="YES"" in /boot/loader.conf in response to comment #41.
(In reply to John F. Carr from comment #100) Ahh, so kgdb ends up with fast_syscall_common+0xf8 or the like translated to a <signal handler called> . For this part, believe and look at the kernel's backtrace for the area that says fast_syscall_common+0xf8 (or whatever). Good to know. Thanks.
If the problem is memory corruption, running a debug kernel might find the corruption closer to when it happens. Are you able to build and run your own kernel with a configuration file like
    include GENERIC
    ident DEBUG
    options INVARIANTS
    options INVARIANT_SUPPORT
?
(In reply to George Mitchell from comment #102) So are all the load-time crashes with things loaded via use of:
    kld_list (str) A whitespace-separated list of kernel modules to load right after the local disks are mounted, without any .ko extension or path. Loading modules at this point in the boot process is much faster than doing it via /boot/loader.conf for those modules not necessary for mounting local disks.
and never with things that are loaded via /boot/loader.conf activity? It is a possible distinction in the test results that I'd managed to miss. (I'll note that the "for those modules not necessary for mounting local disks" may make zfs being listed in kld_list unusual. That, in turn, might help explain why, so far, you are the only one known to be having the load-time crash problem examples.)
(In reply to John F. Carr from comment #104) I will try this today. By the way, perhaps I should have mentioned already that I use SCHED_4BSD (I'm the guy who periodically rants that it should be the default, or at least that the scheduler should be a kernel loadable module), though it's hard to see how that could be a factor. (In reply to Mark Millard from comment #105) Yes, I had an occurrence of brain fade when I put zfs into kld_list. I promise never to have brain fade ever again.
Created attachment 240642 [details] Crash without any use of ZFS, with acpi_wmi Here's a crash from my test setup with no use of ZFS at all. It looks like the earlier crash with acpi_wmi, without which I suspect this hardware won't run. Also, this kernel had INVARIANTS and INVARIANT_SUPPORT compiled in (confirmed by the config shown in the summary), though I couldn't tell from anything I saw on the screen. Next I'll attach the relevant part of /var/log/messages, though I didn't see anything there either.
Created attachment 240643 [details] Relevant part of /var/log/messages Here's the log from the time of the crash, up to now.
(In reply to George Mitchell from comment #107) I'll note that in the example kldstat that I reported earlier the order started with:
    # kldstat
    Id Refs Address            Size    Name
     1   58 0xffffffff80200000 295a5a0 kernel
     2    1 0xffffffff83210000 3370    acpi_wmi.ko
    . . .
So acpi_wmi.ko appears to be the first module loaded in my context. I'd guess that is true for your context as well. This would mean that prior module loads are not required for the problem to happen (loading the first of the modules). That should narrow the range of possibilities (for someone sufficiently knowledgeable in the subject area).
Created attachment 240683 [details] New instance This is from running my regular setup, not the debug setup. Almost immediately after I got this dump, my system crashed two more times in a row; see next attachment, which appears to contain a summary of both crashes (the 2nd and the 3rd). None of the stack dumps seem to have a call to modlist_lookup2, so possibly all three of these are some new amdgpu crash.
Created attachment 240684 [details] Crashes 2 and 3 The second crash was very late in the boot process, unlike most of the others. Running meld on these files might prove enlightening.
(In reply to George Mitchell from comment #110) The backtraces mentioning "zap_evict_sync" are not new. You submitted prior examples as attachments, such as "New core.txt". The backtrace(s) with "spa_all_configs" may well be new. I do not remember such.
Would it help if I attached my system log from the period of time yesterday when I got three crashes in a row?
Created attachment 240729 [details] Another instance of attachment #240591 [details] crash at shutdown time For the sake of completeness I'm attaching one more instance of the crash I see every few days at shutdown time instead of boot-up time. My plan for now is to restore my configuration to the one that most frequently provokes the crash: namely, I load ZFS with zfs_enable in /etc/rc.conf instead of zfs_load in /boot/loader.conf, and I'm adding vbox_enable="YES" back into /etc/rc.conf. Also, I'm updating from drm-510-kmod-5.10.113_8 to drm-510-kmod-5.10.163_2 since it's available, and I'll see if that crashes still. If so, then I will stop using amdgpu for a week and verify, for the purpose of maintaining my own sanity, that the crashes stop. And I'll report back here.
(In reply to George Mitchell from comment #114) All of the crashes that listed "acpi_wmi" were before amdgpu could have been involved: acpi_wmi loads first; amdgpu would be later.
Created attachment 240731 [details] After upgrading to v5.10.163_2 I re-enabled the crashes (i.e. stopped loading ZFS early and turned vbox_enable back on) and got a crash on my very first reboot. Now I have disabled amdgpu and I'll be astonished if I get a crash before the twelfth of never. This crash does look slightly different, though, and seems to have had a trap 22 in ZFS code.
(In reply to Mark Millard from comment #115) Be that as it may, over the period of time from when I first upgraded to FBSD 13.1 until I started seriously trying to use drm-510-kmod, I never saw any occurrences at all of the ZFS crash, the vboxnetflt crash, or the acpi_wmi crash. And I don't expect to see any of them as long as I don't load amdgpu.ko.
(In reply to George Mitchell from comment #117) Yeah, my expectation that acpi_wmi would always be loaded first was just wrong. Sorry. With the ZFS boot media, I see:
    Id Refs Address            Size    Name
     1   94 0xffffffff80200000 295a9b0 kernel
     2    1 0xffffffff82b5b000 5b80d8  zfs.ko
     3    1 0xffffffff83115000 76f8    cryptodev.ko
     4    1 0xffffffff83a10000 3370    acpi_wmi.ko
    . . .
I looked at all your attachments again. It appears amdgpu was already present before the first crash point in all of them.
For:
    Fatal trap 9: general protection fault while in kernel mode
    cpuid = 0; apic id = 00
    instruction pointer = 0x20:0xffffffff80d17870
objdump -d --prefix-addresses /boot/kernel/kernel | less shows:
    ffffffff80d1786b <qsort+0x12ab> mov %esi,0x4(%r11,%rdx,4)
    ffffffff80d17870 <qsort+0x12b0> mov 0x8(%rcx,%rdx,4),%esi
As for other "instruction pointer" examples . . .
    Fatal trap 9: general protection fault while in kernel mode
    cpuid = 2; apic id = 02
    instruction pointer = 0x20:0xffffffff80d17890
    ffffffff80d1788f <qsort+0x12cf> mov %esi,0xc(%r11,%rdx,4)
    ffffffff80d17894 <qsort+0x12d4> add $0x4,%rdx
    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    fault virtual address = 0x7
    fault code = supervisor read data, page not present
    instruction pointer = 0x20:0xffffffff82600ba6
The above is outside the kernel's code.
    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    fault virtual address = 0x0
    fault code = supervisor read data, page not present
    instruction pointer = 0x20:0xffffffff80bf3707
    ffffffff80bf3701 <free+0x11> je ffffffff80bf378d <free+0x9d>
    ffffffff80bf3707 <free+0x17> mov %rsi,%r14
    Fatal trap 9: general protection fault while in kernel mode
    cpuid = 1; apic id = 01
    instruction pointer = 0x20:0xffffffff82231ba6
The above is outside the kernel's code.
    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    fault virtual address = 0x0
    fault code = supervisor read data, page not present
    instruction pointer = 0x20:0xffffffff80bf3707
    ffffffff80bf3701 <free+0x11> je ffffffff80bf378d <free+0x9d>
    ffffffff80bf3707 <free+0x17> mov %rsi,%r14
    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    fault virtual address = 0x0
    fault code = supervisor read data, page not present
    instruction pointer = 0x20:0xffffffff80bf3727
    ffffffff80bf3722 <free+0x32> call ffffffff80f66670 <PHYS_TO_VM_PAGE>
    ffffffff80bf3727 <free+0x37> mov (%rax),%r13
    Fatal trap 9: general protection fault while in kernel mode
    cpuid = 1; apic id = 01
    instruction pointer = 0x20:0xffffffff80d0cea0
    ffffffff80d0ce9c <vn_ioctl+0x1fc> jne ffffffff80d0cff2 <vn_ioctl+0x352>
    ffffffff80d0cea2 <vn_ioctl+0x202> movzwl 0x2(%r13),%ecx
(In reply to Mark Millard from comment #119) [Sorry for the accidental duplication of the block that had "instruction pointer = 0x20:0xffffffff80bf3707".] The qsort, free, and vn_ioctl addresses do not look to match up with any of the multi-level backtraces. So we have very little evidence about what the context was. I've no clue for the addresses that were outside the kernel.
(In reply to Mark Millard from comment #120) Ugh. I just realized that I'd not looked at an official releng/13.1 build. So using a download of an official kernel.txz this time . . . (the subroutines stay the same but the detailed code is different).
Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff80d17870
ffffffff80d1786d <qsort+0x130d> mov -0x38(%rbp),%rdi
ffffffff80d17871 <qsort+0x1311> mov %dl,(%rdi,%rsi,1)
As for other "instruction pointer" examples . . .
Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer = 0x20:0xffffffff80d17890
ffffffff80d1788f <qsort+0x132f> cmp $0x3,%r8
ffffffff80d17893 <qsort+0x1333> jae ffffffff80d17910 <qsort+0x13b0>
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x7
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff82600ba6
The above is outside the kernel's code.
Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer = 0x20:0xffffffff82231ba6
The above is outside the kernel's code.
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80bf3707
ffffffff80bf3700 <free+0x70> mov %gs:0xb0,%rax
ffffffff80bf3709 <free+0x79> add %r15,0x8(%rcx,%rax,1)
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80bf3727
ffffffff80bf3724 <free+0x94> cmpb $0x0,0x128(%rbx)
ffffffff80bf372b <free+0x9b> jne ffffffff80bf3777 <free+0xe7>
Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer = 0x20:0xffffffff80d0cea0
ffffffff80d0ce9a <vn_ioctl+0x25a> mov %r14,-0xc8(%rbp)
ffffffff80d0cea1 <vn_ioctl+0x261> cmpb $0x0,0xaf417e(%rip) # ffffffff81801026 <sdt_probes_enabled>
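The lookups above pair a reported instruction pointer with the neighbouring rows of an objdump listing. A minimal sketch of that step, assuming the instruction start addresses have already been pulled out of the listing (the helper name `enclosing_instruction` is made up for illustration and is not part of any tool):

```python
from bisect import bisect_right


def enclosing_instruction(starts, ip):
    """Return the greatest instruction start address <= ip, or None."""
    ordered = sorted(starts)
    index = bisect_right(ordered, ip)
    return ordered[index - 1] if index else None


# Two adjacent instruction starts quoted from the official kernel.txz listing.
official_qsort = [0xffffffff80d1786d, 0xffffffff80d17871]
ip = 0xffffffff80d17870

start = enclosing_instruction(official_qsort, ip)
offset_into_instruction = ip - start
print(hex(start), offset_into_instruction)
```

A nonzero offset only says that, in this particular listing, the reported address is not an instruction start. What that implies about how closely the listing matches the kernel that was actually running is left open here.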
(In reply to George Mitchell from comment #117) Would it be reasonable to have some testing with amdgpu.ko loaded but never having a desktop environment active? Or maybe I should frame the idea as questions: What is the minimal form of having amdgpu.ko loaded in the system? Can that be tested (if it has not been already)? Does this minimal form behave any differently than more involved use of amdgpu.ko (and the associated card firmware)? In a different direction . . . In/for a separate context, I once built amdgpu and its firmware and installed it. But I did not set up an automatic load. For the rare test, I manually loaded amdgpu and then started lumina. (It is an old memory. I might not have the details correct.) This procedure might have largely avoided later loads of kernel modules and, so, avoided discovering a problem.
All my so-called test setup tests were run without starting a desktop environment (by which I assume you mean not starting X). There were still crashes such as in comment #107, attachment #240642 [details]. With my normal setup, kldloading amdgpu manually instead of automatically noticeably reduced the incidence of crashes but did not eliminate them.
(In reply to George Mitchell from comment #123) "kldloading amdgpu manually": there are two possibilities:
A) Using boot -s and doing kldload and then exiting to normal mode. There are examples in your attachments of doing this.
B) Getting to normal mode, logging in, and only after that doing the first kldload of amdgpu. I do not remember any of the attachments clearly indicating such a sequence. It puts the amdgpu load after the other normal loads.
Well, I was going to try testing in an environment where I've got a serial console: an aarch64 main [so: 14] context. But it turns out that there is at least one missing function declaration for the type of context at this point:
/wrkdirs/usr/ports/graphics/drm-515-kmod/work/drm-kmod-drm_v5.15.25/drivers/gpu/drm/drm_cache.c:362:10: error: call to undeclared function 'in_interrupt'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
WARN_ON(in_interrupt());
^
1 error generated.
*** [drm_cache.o] Error code 1
as is visible in the official build log:
http://ampere2.nyi.freebsd.org/data/main-arm64-default/p64e3eb722c17_s7fc82fd1f8/logs/errors/drm-515-kmod-5.15.25.log
Turns out the drm-510-kmod variant allowed for releng/13.1 and later is missing possible macro definitions for aarch64:
/wrkdirs/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_2/drivers/gpu/drm/amd/display/dc/core/dc.c:741:3: error: call to undeclared function 'DC_FP_START'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
DC_FP_START();
^
/wrkdirs/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_2/drivers/gpu/drm/amd/display/dc/core/dc.c:743:3: error: call to undeclared function 'DC_FP_END'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
DC_FP_END();
^
2 errors generated.
*** [dc.o] Error code 1
as is visible in:
http://ampere2.nyi.freebsd.org/data/main-arm64-default/p64e3eb722c17_s7fc82fd1f8/logs/errors/drm-510-kmod-5.10.163_2.log
(It is not just my builds that have such issues: official builds have the problems as well.) I was hoping I'd be able to do some testing in the alternative type of context (likely never starting X11). That looks to not be in the cards at this time.
(In reply to Mark Millard from comment #125) Picking the drm-515-kmod one: it looks like the source file referenced needs to include the content of the file providing the #define:
/usr/main-src/sys/compat/linuxkpi/common/include/linux/preempt.h:#define in_interrupt() \
Overall, there are some other uses:
drm-kmod//drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c: if (r < 1 && in_interrupt())
drm-kmod//drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c: if (r < 1 && (amdgpu_in_reset(adev) || in_interrupt()))
drm-kmod//drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c: if (r < 1 && (amdgpu_in_reset(adev) || in_interrupt()))
drm-kmod//drivers/gpu/drm/drm_cache.c: if (WARN_ON(in_interrupt())) {
drm-kmod//drivers/gpu/drm/drm_cache.c: WARN_ON(in_interrupt());
I have not checked whether any of the others already get preempt.h. amd64 might be working via header pollution in some way that aarch64 does not?
(In reply to Mark Millard from comment #126) Further inspection of what comes next after making drm_cache.c pick up the in_interrupt definition suggests that trying builds of aarch64 is premature at this point, making the type of test I was intending also premature.
(In reply to George Mitchell from comment #114)
> [...] I will stop using amdgpu for a week and verify, for the purpose
> of maintaining my own sanity, that the crashes stop. [...]
Back in amd64 land, since the time of that comment, I have rebooted my system 25 times and there have been no crashes at all. I guess I'm sane.
(In reply to George Mitchell from comment #12) Could you also share your "kldstat" output for when amdgpu has been loaded? More than just amdgpu might be added to what is loaded before amdgpu compared to when amdgpu is not loaded at all. For example some of:
# find /boot/ker*/ -name 'linux*' -print | more
/boot/kernel/linux64.ko
/boot/kernel/linux_common.ko
/boot/kernel/linuxkpi.ko
/boot/kernel/linuxkpi_wlan.ko
might be involved, not just amdgpu. Loading only some prerequisites for amdgpu, but not amdgpu itself, might prove a useful isolation test.
(In reply to Mark Millard from comment #129) I wrote "what is loaded before" relative to amdgpu. But what amdgpu in turn leads to loading, listed after amdgpu in "kldstat" output, is likely just as relevant. For all I know all of it may be from after amdgpu's position in the "kldstat" list.
(In reply to Mark Millard from comment #130) Based on drm-515-kmod related materials on/for amd64 running main [so: 14] and the type of card that happened to be present, I saw:
22    1 0xffffffff83c00000   4fd918 amdgpu.ko
23    2 0xffffffff83a8e000    79f50 drm.ko
24    1 0xffffffff83b08000     22a8 iic.ko
25    3 0xffffffff83b0b000     30d8 linuxkpi_gplv2.ko
26    4 0xffffffff83b0f000     6320 dmabuf.ko
27    3 0xffffffff83b16000     3360 lindebugfs.ko
28    1 0xffffffff83b1a000     b350 ttm.ko
29    1 0xffffffff83b26000     a118 amdgpu_polaris11_k_mc_bin.ko
30    1 0xffffffff83b31000     6370 amdgpu_polaris11_pfp_2_bin.ko
31    1 0xffffffff83b38000     6370 amdgpu_polaris11_me_2_bin.ko
32    1 0xffffffff83b3f000     4370 amdgpu_polaris11_ce_2_bin.ko
33    1 0xffffffff83b44000     7978 amdgpu_polaris11_rlc_bin.ko
34    1 0xffffffff83b4c000    42380 amdgpu_polaris11_mec_2_bin.ko
35    1 0xffffffff83b8f000    42380 amdgpu_polaris11_mec2_2_bin.ko
36    1 0xffffffff83bd2000     5270 amdgpu_polaris11_sdma_bin.ko
37    1 0xffffffff83bd8000     5270 amdgpu_polaris11_sdma1_bin.ko
38    1 0xffffffff840fe000    5db58 amdgpu_polaris11_uvd_bin.ko
39    1 0xffffffff8415c000    2ac78 amdgpu_polaris11_vce_bin.ko
40    1 0xffffffff83bde000    21d90 amdgpu_polaris11_k_smc_bin.ko
This was from deliberately using kldload amdgpu after all the normal boot/login load activity. No kld_list= use involved at all. I wonder how much your environment would crash for amdgpu loaded this late.
FYI: The prior load activity was:
Id Refs Address                Size Name
 1  132 0xffffffff80200000  295b050 kernel
 2    1 0xffffffff82b5d000     76f8 cryptodev.ko
 3    1 0xffffffff82b65000   5b80d8 zfs.ko
 4    1 0xffffffff83a10000     3370 acpi_wmi.ko
 5    1 0xffffffff83a14000     3210 intpm.ko
 6    1 0xffffffff83a18000     2178 smbus.ko
 7    1 0xffffffff83a1b000     2220 cpuctl.ko
 8    1 0xffffffff83a1e000     3360 uhid.ko
 9    1 0xffffffff83a22000     4364 ums.ko
10    1 0xffffffff83a27000     33a0 usbhid.ko
11    1 0xffffffff83a2b000     32a8 hidbus.ko
12    1 0xffffffff83a2f000     4d00 ng_ubt.ko
13    6 0xffffffff83a34000     ab28 netgraph.ko
14    2 0xffffffff83a3f000     a238 ng_hci.ko
15    4 0xffffffff83a4a000     2668 ng_bluetooth.ko
16    1 0xffffffff83a4d000     8380 uftdi.ko
17    1 0xffffffff83a56000     4e48 ucom.ko
18    1 0xffffffff83a5b000     3340 wmt.ko
19    1 0xffffffff83a5f000     e250 ng_l2cap.ko
20    1 0xffffffff83a6e000    1bf08 ng_btsocket.ko
21    1 0xffffffff83a8a000     38b8 ng_socket.ko
(In reply to Mark Millard from comment #129) When I boot up to single-user mode, kldstat says:
Id Refs Address                Size Name
 1    7 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    8cc90 vboxdrv.ko
 3    1 0xffffffff821be000    ff4b8 if_re.ko
 4    1 0xffffffff822be000     77e0 sem.ko
After "kldload amdgpu," it says:
Id Refs Address                Size Name
 1   59 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    8cc90 vboxdrv.ko
 3    1 0xffffffff821be000    ff4b8 if_re.ko
 4    1 0xffffffff822be000     77e0 sem.ko
 5    1 0xffffffff82600000   417220 amdgpu.ko
 6    2 0xffffffff82518000    739e0 drm.ko
 7    3 0xffffffff8258c000     5220 linuxkpi_gplv2.ko
 8    4 0xffffffff82592000     62d8 dmabuf.ko
 9    1 0xffffffff82599000     c758 ttm.ko
10    1 0xffffffff825a6000     2218 amdgpu_raven_gpu_info_bin.ko
11    1 0xffffffff825a9000     64d8 amdgpu_raven_sdma_bin.ko
12    1 0xffffffff825b0000    2e2d8 amdgpu_raven_asd_bin.ko
13    1 0xffffffff825df000     93d8 amdgpu_raven_ta_bin.ko
14    1 0xffffffff825e9000     7558 amdgpu_raven_pfp_bin.ko
15    1 0xffffffff825f1000     6558 amdgpu_raven_me_bin.ko
16    1 0xffffffff825f8000     4558 amdgpu_raven_ce_bin.ko
17    1 0xffffffff82a18000     b9c0 amdgpu_raven_rlc_bin.ko
18    1 0xffffffff82a24000    437e8 amdgpu_raven_mec_bin.ko
19    1 0xffffffff82a68000    437e8 amdgpu_raven_mec2_bin.ko
20    1 0xffffffff82aac000    5a638 amdgpu_raven_vcn_bin.ko
But after a full boot without amdgpu, it says:
Id Refs Address                Size Name
 1   66 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    ff4b8 if_re.ko
 3    3 0xffffffff82230000    8cc90 vboxdrv.ko
 4    1 0xffffffff822bd000     77e0 sem.ko
 5    1 0xffffffff82600000   3df128 zfs.ko
 6    2 0xffffffff82518000     4240 vboxnetflt.ko
 7    2 0xffffffff8251d000     aac8 netgraph.ko
 8    1 0xffffffff82528000     31c8 ng_ether.ko
 9    1 0xffffffff8252c000     55e0 vboxnetadp.ko
10    1 0xffffffff82532000     3378 acpi_wmi.ko
11    1 0xffffffff82536000     3218 intpm.ko
12    1 0xffffffff8253a000     2180 smbus.ko
13    1 0xffffffff8253d000     33c0 uslcom.ko
14    1 0xffffffff82541000     4d90 ucom.ko
15    1 0xffffffff82546000     2340 uhid.ko
16    1 0xffffffff82549000     3380 usbhid.ko
17    1 0xffffffff8254d000     31f8 hidbus.ko
18    1 0xffffffff82551000     3320 wmt.ko
19    1 0xffffffff82555000     4350 ums.ko
20    1 0xffffffff8255a000     5af8 autofs.ko
21    1 0xffffffff82560000     2a08 mac_ntpd.ko
22    1 0xffffffff82563000     20f0 green_saver.ko
(In reply to George Mitchell from comment #132) I wonder if, in your context, the following boot sequencing might sidestep the boot-crash issue: "A full boot without amdgpu" then: "kldload amdgpu" then: normal use. Basically: doing the amdgpu load as late as possible relative to everything else loaded, limiting what all loads after amdgpu.
Okay, my machine is set up as you requested. It boots to multiuser mode without starting an X session, at which point I load amdgpu and then start my normal XFCE session. I'll run it this way for a week. Undoubtedly, it won't exhibit the bootup crash in this mode of operation, but I won't be surprised if I still get a shutdown crash or two. And in any case this isn't a fix for the underlying bug. Not sure what new information this is likely to yield.
(In reply to George Mitchell from comment #134) Having the kldstat output for this combination would help identify what module is initially involved in any crash. Part of what may be of use is how often you see the dbuf_evict_thread type of backtrace and what module the first "instruction pointer =" references in such cases (if any). Another would be if new crash contexts show up that have not been seen before. So far there is no evidence for how many bugs there are, given the varying failure-structures that show up. There could even be the possibility of unreliable memory or bugs specific to amdgpu_raven_*.ko files (such as sometimes trashing some memory). I've yet to induce any failure in the amdgpu_polaris11_*.ko based amd64 context that I have access to (a ThreadRipper 1950X), although by no means is it a close match to your context. To my knowledge, you still have the only known examples of any of the failures. To some extent, if trying new things leads to new forms of failure for you, it potentially gives me new sequences to try on the ThreadRipper 1950X. How (un)likely that is to yield useful information I do not know. (My hope to also try on aarch64, where I've access to a serial console, did not pan out.)
Sorry, meant to put these in yesterday. After booting to single-user mode, kldstat reports:
Id Refs Address                Size Name
 1    7 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    8cc90 vboxdrv.ko
 3    1 0xffffffff821be000    ff4b8 if_re.ko
 4    1 0xffffffff822be000     77e0 sem.ko
If I boot to single-user mode and kldload amdgpu, kldstat reports:
Id Refs Address                Size Name
 1   59 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    8cc90 vboxdrv.ko
 3    1 0xffffffff821be000    ff4b8 if_re.ko
 4    1 0xffffffff822be000     77e0 sem.ko
 5    1 0xffffffff82600000   417220 amdgpu.ko
 6    2 0xffffffff82518000    739e0 drm.ko
 7    3 0xffffffff8258c000     5220 linuxkpi_gplv2.ko
 8    4 0xffffffff82592000     62d8 dmabuf.ko
 9    1 0xffffffff82599000     c758 ttm.ko
10    1 0xffffffff825a6000     2218 amdgpu_raven_gpu_info_bin.ko
11    1 0xffffffff825a9000     64d8 amdgpu_raven_sdma_bin.ko
12    1 0xffffffff825b0000    2e2d8 amdgpu_raven_asd_bin.ko
13    1 0xffffffff825df000     93d8 amdgpu_raven_ta_bin.ko
14    1 0xffffffff825e9000     7558 amdgpu_raven_pfp_bin.ko
15    1 0xffffffff825f1000     6558 amdgpu_raven_me_bin.ko
16    1 0xffffffff825f8000     4558 amdgpu_raven_ce_bin.ko
17    1 0xffffffff82a18000     b9c0 amdgpu_raven_rlc_bin.ko
18    1 0xffffffff82a24000    437e8 amdgpu_raven_mec_bin.ko
19    1 0xffffffff82a68000    437e8 amdgpu_raven_mec2_bin.ko
20    1 0xffffffff82aac000    5a638 amdgpu_raven_vcn_bin.ko
If I boot to multi-user mode without kldloading amdgpu, kldstat reports:
Id Refs Address                Size Name
 1   66 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    ff4b8 if_re.ko
 3    1 0xffffffff82231000     77e0 sem.ko
 4    3 0xffffffff82239000    8cc90 vboxdrv.ko
 5    1 0xffffffff82600000   3df128 zfs.ko
 6    2 0xffffffff82518000     4240 vboxnetflt.ko
 7    2 0xffffffff8251d000     aac8 netgraph.ko
 8    1 0xffffffff82528000     31c8 ng_ether.ko
 9    1 0xffffffff8252c000     55e0 vboxnetadp.ko
10    1 0xffffffff82532000     3378 acpi_wmi.ko
11    1 0xffffffff82536000     3218 intpm.ko
12    1 0xffffffff8253a000     2180 smbus.ko
13    1 0xffffffff8253d000     33c0 uslcom.ko
14    1 0xffffffff82541000     4d90 ucom.ko
15    1 0xffffffff82546000     2340 uhid.ko
16    1 0xffffffff82549000     3380 usbhid.ko
17    1 0xffffffff8254d000     31f8 hidbus.ko
18    1 0xffffffff82551000     3320 wmt.ko
19    1 0xffffffff82555000     4350 ums.ko
20    1 0xffffffff8255a000     5af8 autofs.ko
21    1 0xffffffff82560000     2a08 mac_ntpd.ko
22    1 0xffffffff82563000     20f0 green_saver.ko
If I then kldload amdgpu, it says the same as above, plus:
23    1 0xffffffff82a00000   417220 amdgpu.ko
24    2 0xffffffff82566000    739e0 drm.ko
25    3 0xffffffff825da000     5220 linuxkpi_gplv2.ko
26    4 0xffffffff825e0000     62d8 dmabuf.ko
27    1 0xffffffff825e7000     c758 ttm.ko
28    1 0xffffffff825f4000     2218 amdgpu_raven_gpu_info_bin.ko
29    1 0xffffffff825f7000     64d8 amdgpu_raven_sdma_bin.ko
30    1 0xffffffff82e18000    2e2d8 amdgpu_raven_asd_bin.ko
31    1 0xffffffff829e0000     93d8 amdgpu_raven_ta_bin.ko
32    1 0xffffffff829ea000     7558 amdgpu_raven_pfp_bin.ko
33    1 0xffffffff829f2000     6558 amdgpu_raven_me_bin.ko
34    1 0xffffffff829f9000     4558 amdgpu_raven_ce_bin.ko
35    1 0xffffffff82e47000     b9c0 amdgpu_raven_rlc_bin.ko
36    1 0xffffffff82e53000    437e8 amdgpu_raven_mec_bin.ko
37    1 0xffffffff82e97000    437e8 amdgpu_raven_mec2_bin.ko
38    1 0xffffffff82edb000    5a638 amdgpu_raven_vcn_bin.ko
Created attachment 241022 [details] Four boot-time crashes in a row For some reason, I just got four boot-up crashes immediately in a row. After I cycled power, I was able to boot up without crashing. I think I'm going to load zfs.ko from /boot/loader.conf to get it loaded earlier, which mitigates this problem. (It's currently loaded with zfs_enable="YES" in /etc/rc.conf.)
(In reply to George Mitchell from comment #137) Your upload ended up being: application/octet-stream this time, instead of text/plain .
Yes. It's a compressed tar file with four core.txt files for the price of one. They are different enough that I thought I'd better attach them all, though mainly the later ones include increasing portions of the earlier ones because they were on immediately successive boots.
(In reply to George Mitchell from comment #137) All 4 are examples related to dbuf_evict_thread (a.k.a. zfs dbuf related crashes), as I feared. All 4 look like:
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0x7
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff82600ba6
Looks to be in:
 5    1 0xffffffff82600000   3df128 zfs.ko
panic: page fault
cpuid = 1
time = 1679349400
KDB: stack backtrace:
#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff810ade4f at trap_pfault+0x4f
#5 0xffffffff81084fd8 at calltrap+0x8
#6 0xffffffff827ac768 at zap_evict_sync+0x68
#7 0xffffffff8267d74a at dbuf_destroy+0xba
#8 0xffffffff82683129 at dbuf_evict_one+0xf9
#9 0xffffffff8267b43d at dbuf_evict_thread+0x31d
#10 0xffffffff80bd8abe at fork_exit+0x7e
#11 0xffffffff8108604e at fork_trampoline+0xe
#6 0xffffffff810ade4f in trap_pfault (frame=0xfffffe00b3bb6d00, usermode=false, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:763
#7 <signal handler called>
#8 avl_destroy_nodes (tree=tree@entry=0xfffff8001a80b5a0, cookie=cookie@entry=0xfffffe00b3bb6dd0) at /usr/src/sys/contrib/openzfs/module/avl/avl.c:1023
#9 0xffffffff827ac768 in mze_destroy (zap=0xfffff8001a80b480) at /usr/src/sys/contrib/openzfs/module/zfs/zap_micro.c:402
A question would be if this repeats based on amdgpu having been loaded (again last) but no X11 like activity having ever been started: limiting amdgpu use to just the load activity or as close to that limited of use as is possible. (This is separate from your zfs load time adjustment test.) My guess is that the content of some memory area(s) is being trashed in your context. I'm not sure how to track down what is doing the trashing or where all the trashed area(s) are, if that is what is going on.
At least we now have a clue how to get the specific type of crash. Before I had no clue what an example initial-context might be like. Note: Changing the load order should get a matching kldstat report to indicate the address ranges that end up involved.
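Matching an "instruction pointer =" value against a kldstat report, as done above for zfs.ko, can be sketched as below. This is only an illustration: the function names `parse_kldstat` and `find_module` are invented here, and the sample rows are a subset of the multi-user kldstat report quoted earlier in the thread.

```python
def parse_kldstat(text):
    """Parse kldstat rows into (name, base, size) tuples; the header is skipped."""
    modules = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have five fields and a hexadecimal load address in column 3.
        if len(fields) == 5 and fields[2].startswith("0x"):
            modules.append((fields[4], int(fields[2], 16), int(fields[3], 16)))
    return modules


def find_module(modules, addr):
    """Return (name, offset) for the module whose range contains addr, else None."""
    for name, base, size in modules:
        if base <= addr < base + size:
            return name, addr - base
    return None


# Subset of the multi-user (no amdgpu) report quoted earlier in the thread.
KLDSTAT = """\
Id Refs Address Size Name
1 66 0xffffffff80200000 1f2ffd0 kernel
2 1 0xffffffff82130000 ff4b8 if_re.ko
5 1 0xffffffff82600000 3df128 zfs.ko
6 2 0xffffffff82518000 4240 vboxnetflt.ko
"""

modules = parse_kldstat(KLDSTAT)
print(find_module(modules, 0xffffffff82600ba6))
print(find_module(modules, 0xffffffff80d17870))
```

Because the load addresses move when the load order changes, the lookup is only meaningful against the kldstat report from the same boot configuration as the crash.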
(In reply to George Mitchell from comment #139) The upload did not look compressed to me: I just had to use tools that would tolerate the binary content at the start and end. The rest looked like normal text without me doing anything to decompress the file. But, looking, the prefix text does look like a partially-binary header, likely added by a tool. The tail end might just be binary padding. At least I've a clue for next time.
So I should just boot up to multi-user mode and kldload amdgpu, but not start XFCE? And repeat until it crashes again?
(In reply to George Mitchell from comment #142) Seeing if that no-XFCE context crashes vs. not would be a good idea. If it crashes similarly, then XFCE activity is not likely to be involved. If it does not crash, then XFCE activity is likely involved. FYI: all 4 crashes had: fault virtual address = 0x7 (the same small offset from a NULL pointer in C terms). This does not look like random trashing of memory (for the few examples available).
Created attachment 241027 [details] Another shutdown-time crash I got another shutdown-time crash. The part of this file that is relevant to this crash starts around line 1400; all the earlier stuff appears to be from the crashes earlier today.
(In reply to George Mitchell from comment #144) Looking at your full list of attachments, it appears that . . .
All the shutdown time crashes have: fault virtual address = 0x0 (And we might now have a known type of context for getting the type of failure: late amdgpu but no XFCE.)
All the dbuf_evict_thread related crashes have: fault virtual address = 0x7 (Late amdgpu but having used XFCE.)
All the kldload related crashes have: Fatal trap 9: general protection fault while in kernel mode (but no explicit fault address listed) (Early amdgpu loading.)
My guess is something is trashing memory in a way that involves writing zeros over some pointer values that it should not be touching. Later code extracts such zeros and applies any offset and then tries to dereference the result, resulting in a crash. That you got "fault virtual address = 0x0" for shutdown without having involved XFCE suggests that a problem is already in place before XFCE is potentially involved: XFCE is not required. (XFCE use might lead to more trashed memory than otherwise, leading to the 0x7 fault address cases.) But I do not see how to get solid evidence for or against such a hypothesis (or related ones). The only thing I can identify that is likely unique to your context --but is involved with amdgpu-- is the involvement of the amdgpu_raven_gpu_*.ko modules. Unfortunately moving your context to a different system that avoids such module use or finding someone with a separate system that does have such (and is willing to set up experiments), is non-trivial for both directions of testing. Beyond possibly some checking on the degree/ease of repeatability, I do not see how to gather better information, much less get anywhere near directly actionable information for fixing the crashes. The one thing we have not looked at is the crash dumps themselves, examining what memory looks like and such. But I do not know what to do for that either, relative to known-useful information. 
Such a direction would be very exploratory and likely very time consuming.
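The grouping above is by trap number and fault address. A rough sketch of pulling those fields out of a panic message so that many core.txt excerpts can be sorted the same way; the bucket labels are hypothetical and describe only the observed fields, not a cause:

```python
import re

TRAP = re.compile(r"Fatal trap (\d+): (.+?) while in kernel mode")
FAULT_VA = re.compile(r"fault virtual address\s*=\s*(0x[0-9a-fA-F]+)")
IP = re.compile(r"instruction pointer\s*=\s*0x[0-9a-fA-F]+:(0x[0-9a-fA-F]+)")


def summarize(report):
    """Extract trap number, trap kind, fault address and instruction pointer."""
    trap = TRAP.search(report)
    va = FAULT_VA.search(report)
    ip = IP.search(report)
    return {
        "trap": int(trap.group(1)) if trap else None,
        "kind": trap.group(2) if trap else None,
        "fault_va": int(va.group(1), 16) if va else None,
        "ip": int(ip.group(1), 16) if ip else None,
    }


def bucket(summary):
    """Hypothetical labels mirroring the three groups described above."""
    if summary["trap"] == 9:
        return "gpf-no-fault-address"
    if summary["trap"] == 12 and summary["fault_va"] == 0x0:
        return "page-fault-va-0x0"
    if summary["trap"] == 12 and summary["fault_va"] == 0x7:
        return "page-fault-va-0x7"
    return "other"


# Panic headers quoted earlier in the thread.
REPORTS = [
    "Fatal trap 12: page fault while in kernel mode cpuid = 1; apic id = 01 "
    "fault virtual address = 0x7 fault code = supervisor read data, page not "
    "present instruction pointer = 0x20:0xffffffff82600ba6",
    "Fatal trap 12: page fault while in kernel mode cpuid = 0; apic id = 00 "
    "fault virtual address = 0x0 fault code = supervisor read data, page not "
    "present instruction pointer = 0x20:0xffffffff80bf3707",
    "Fatal trap 9: general protection fault while in kernel mode cpuid = 0; "
    "apic id = 00 instruction pointer = 0x20:0xffffffff80d17870",
]

labels = [bucket(summarize(r)) for r in REPORTS]
print(labels)
```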
(In reply to Mark Millard from comment #145) For the: fault virtual address = 0x7 examples, it looks like the value stored in RAM has the 0x7 in it instead of being a later offset addition. The loop in question in avl_destroy_nodes just uses "mov (%rdi),%rdi" with no offset involved:
NOTE: Loop starts below
0x0000000000000ba0 <+64>: mov %rdi,%rax
0x0000000000000ba3 <+67>: mov %rdx,%rcx
0x0000000000000ba6 <+70>: mov (%rdi),%rdi
0x0000000000000ba9 <+73>: mov %rax,%rdx
0x0000000000000bac <+76>: test %rdi,%rdi
0x0000000000000baf <+79>: jne 0xba0 <avl_destroy_nodes+64>
NOTE: The above is the loop end
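As a cross-check of the listing above, the faulting instruction pointer can be turned into a module-relative offset and looked up in the disassembly. This sketch assumes zfs.ko was loaded at 0xffffffff82600000 (the kldstat row quoted earlier) and that the listing's addresses are offsets from that load address; the matching low bits suggest this, but the thread does not confirm it. `parse_listing` is an illustrative helper, not an existing tool.

```python
import re

LISTING = """\
0x0000000000000ba0 <+64>: mov %rdi,%rax
0x0000000000000ba3 <+67>: mov %rdx,%rcx
0x0000000000000ba6 <+70>: mov (%rdi),%rdi
0x0000000000000ba9 <+73>: mov %rax,%rdx
0x0000000000000bac <+76>: test %rdi,%rdi
0x0000000000000baf <+79>: jne 0xba0 <avl_destroy_nodes+64>
"""

ROW = re.compile(r"^(0x[0-9a-f]+)\s+<\+(\d+)>:\s+(.*)$")


def parse_listing(text):
    """Map each listed address to its instruction text."""
    instructions = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            instructions[int(match.group(1), 16)] = match.group(3).strip()
    return instructions


ZFS_BASE = 0xffffffff82600000  # assumption: load address from the kldstat report
FAULT_IP = 0xffffffff82600ba6  # "instruction pointer" from the panic message

offset = FAULT_IP - ZFS_BASE
instruction = parse_listing(LISTING).get(offset)
print(hex(offset), instruction)
```

Under those assumptions the lookup lands on the plain pointer load with no displacement, which is consistent with the reading that the stored pointer value itself was 0x7.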
Created attachment 241046 [details] Crash at shutdown time Another occurrence of the crash at shutdown time rather than boot time. I'm reluctant to post a vmcore file here, but I can make it available to anyone who thinks it will be useful.
(In reply to George Mitchell from comment #147) That crash is different from all prior ones. It crashed in nfsd via a:
Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer = 0x20:0xffffffff80c895cb
stack pointer = 0x28:0xfffffe00b555dba0
frame pointer = 0x28:0xfffffe00b555dbb0
code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 1109 (nfsd)
None of the prior kldstat outputs have shown nfsd as loaded. For reference:
panic: general protection fault
cpuid = 1
time = 1679441112
KDB: stack backtrace:
#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff81084fd8 at calltrap+0x8
#5 0xffffffff80c8866b at seltdclear+0x2b
#6 0xffffffff80c88355 at kern_select+0xbd5
#7 0xffffffff80c88456 at sys_select+0x56
#8 0xffffffff810ae6ec at amd64_syscall+0x10c
#9 0xffffffff810858eb at fast_syscall_common+0xf8
__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c1b7ec in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:487
#3 0xffffffff80c1bc5e in vpanic (fmt=0xffffffff811b2f41 "%s", ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4 0xffffffff80c1ba63 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:844
#5 0xffffffff810addf5 in trap_fatal (frame=0xfffffe00b555dae0, eva=0) at /usr/src/sys/amd64/amd64/trap.c:944
#6 <signal handler called>
#7 0xffffffff80c895cb in atomic_fcmpset_long (src=18446741877726026240, dst=<optimized out>, expect=<optimized out>) at /usr/src/sys/amd64/include/atomic.h:225
#8 selfdfree (stp=stp@entry=0xfffff80012aa8080, sfp=0xfffff80000000007) at /usr/src/sys/kern/sys_generic.c:1755
#9 0xffffffff80c8866b in seltdclear (td=td@entry=0xfffffe00b52e9a00) at /usr/src/sys/kern/sys_generic.c:1967
#10 0xffffffff80c88355 in kern_select (td=<optimized out>, td@entry=0xfffffe00b52e9a00, nd=7, fd_in=<optimized out>, fd_ou=<optimized out>, fd_ex=<optimized out>, tvp=<optimized out>, tvp@entry=0x0, abi_nfdbits=64) at /usr/src/sys/kern/sys_generic.c:1210
#11 0xffffffff80c88456 in sys_select (td=0xfffffe00b52e9a00, uap=0xfffffe00b52e9de8) at /usr/src/sys/kern/sys_generic.c:1014
#12 0xffffffff810ae6ec in syscallenter (td=0xfffffe00b52e9a00) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#13 amd64_syscall (td=0xfffffe00b52e9a00, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1185
#14 <signal handler called>
#15 0x00000008011a373a in ?? ()
Note: 18446741877726026240 == 0xfffffe00b52e9a00
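The closing note equates the decimal `src` value that kgdb printed in frame #7 with the `td` pointer shown in frames #9 through #13. A quick way to check such a conversion (the helper name `as_kernel_pointer` is illustrative only):

```python
def as_kernel_pointer(value):
    """Format an unsigned 64-bit integer the way kgdb prints pointers."""
    return f"0x{value & (2**64 - 1):016x}"


src_decimal = 18446741877726026240  # src= argument in frame #7
td_pointer = 0xfffffe00b52e9a00     # td= argument in frames #9 to #13

print(as_kernel_pointer(src_decimal))
print(src_decimal == td_pointer)
```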
(In reply to Mark Millard from comment #148)
> None of the prior kldstat outputs have shown nfsd as loaded.
That's because they weren't verbose kldstats. nfsd is statically linked into the kernel. kldstat -v definitely shows that nfsd is present.
In order to reconfirm my sincere belief that the key factor in these crashes is amdgpu (and also because I need a respite from the crashes), I'm running without amdgpu (and running X in VESA mode) for a while. I fully expect that the crashes will stop as a result.
(In reply to George Mitchell from comment #150) Sounds appropriate. "amdgpu" is really the sort of bundle:
23    1 0xffffffff82a00000   417220 amdgpu.ko
24    2 0xffffffff82566000    739e0 drm.ko
25    3 0xffffffff825da000     5220 linuxkpi_gplv2.ko
26    4 0xffffffff825e0000     62d8 dmabuf.ko
27    1 0xffffffff825e7000     c758 ttm.ko
28    1 0xffffffff825f4000     2218 amdgpu_raven_gpu_info_bin.ko
29    1 0xffffffff825f7000     64d8 amdgpu_raven_sdma_bin.ko
30    1 0xffffffff82e18000    2e2d8 amdgpu_raven_asd_bin.ko
31    1 0xffffffff829e0000     93d8 amdgpu_raven_ta_bin.ko
32    1 0xffffffff829ea000     7558 amdgpu_raven_pfp_bin.ko
33    1 0xffffffff829f2000     6558 amdgpu_raven_me_bin.ko
34    1 0xffffffff829f9000     4558 amdgpu_raven_ce_bin.ko
35    1 0xffffffff82e47000     b9c0 amdgpu_raven_rlc_bin.ko
36    1 0xffffffff82e53000    437e8 amdgpu_raven_mec_bin.ko
37    1 0xffffffff82e97000    437e8 amdgpu_raven_mec2_bin.ko
38    1 0xffffffff82edb000    5a638 amdgpu_raven_vcn_bin.ko
I'm still at a loss for getting any improved type of evidence. Spending time related to the dnetc related scheduler benchmarking today has been a nice break from pondering this.
As expected, I have had no crashes since avoiding drm-510-kmod and running in VESA mode. Might it be worth updating 5.10.163_2 to 5.10.163_3? Notes I haven't mentioned recently: Prior to FBSD 13, whenever I tried drm-510-kmod, my machine would lock up hard and not respond to anything other than cycling power. I have an AMD Ryzen 3 2200G with Radeon Vega Graphics running on a Gigabyte B450M D53H motherboard. Every time I boot up, I see the following ACPI warnings, which don't otherwise seem to affect operation:
Firmware Warning (ACPI): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20201113/tbfadt-796)
ACPI: \134AOD.WQBA: 1 arguments were passed to a non-method ACPI object (Buffer) (20201113/nsarguments-361)
ACPI: \134GSA1.WQCC: 1 arguments were passed to a non-method ACPI object (Buffer) (20201113/nsarguments-361)
Do any of you understand these?
(In reply to George Mitchell from comment #152) I'm not sure what all is involved in setting up the VESA usage test, but it sounds like it was a great test for isolating the problem to the material associated with amdgpu loading for your Radeon Vega Graphics context. Are there any negative consequences to the use of VESA? If the notes are simple/short could you supply instructions so that I could try the analogous thing in the Polaris 11 context that I have access to?
(In reply to George Mitchell from comment #152) Looked at my ACPI boot warning/error messages and I get just (with a little context shown from the grep for ACPI lines):
acpi_wmi0: <ACPI-WMI mapping> on acpi0
ACPI: \134AOD.WQBA: 1 arguments were passed to a non-method ACPI object (Buffer) (20221020/nsarguments-361)
acpi_wmi1: <ACPI-WMI mapping> on acpi0
ACPI: \134GSA1.WQCC: 1 arguments were passed to a non-method ACPI object (Buffer) (20221020/nsarguments-361)
acpi_wmi2: <ACPI-WMI mapping> on acpi0
But I do not get anything analogous to your reported:
Firmware Warning (ACPI): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20201113/tbfadt-796)
So that last has some chance of being involved in your context since I've been unable to reproduce your problems and the message is unique to your context. (Only suggestive.) Any chance that there is a UEFI update available for your machine?
(In reply to Mark Millard from comment #153) Hmm. I see that: https://docs.freebsd.org/en/books/handbook/x11/#x-install reports: "VESA module must be used when booting in BIOS mode and SCFB module must be used when booting in UEFI mode." My context is UEFI so VESA looks to be inappropriate for my context. Your using BIOS (non-UEFI) vs. my using UEFI (not-BIOS) is another context difference relative to my not managing to reproduce the problems.
Ironically, I am presently forced back into using amdgpu.ko because the xorg-server update from 21.1.6,1 to 21.1.7,1 broke the VESA driver (bug #270509).
I forgot to mention earlier: Whenever I start chrome from a terminal window, I see the message: amdgpu: os_same_file_description couldn't determine if two DRM fds reference the same file description. Probably not related to this bug, but I thought I'd better mention it.
(In reply to George Mitchell from comment #32) > … I'm doing this testing on a desktop machine, … (In reply to George Mitchell from comment #152) > … not respond to anything other than cycling power. … In that situation, does the system respond to a normal (not long) press on the power button? ---- On my everyday notebook here, I have this in sysctl.conf(5): hw.acpi.power_button_state="S5"
(In reply to Graham Perrin from comment #158) When I referred to cycling power, I meant by a long press of the power button, which worked just fine (except that I was going to have to run fsck on the next boot). Also, that was when I was running FBSD 12 and I'm not in a position to repeat that test any more. Thanks for the input.
I also use vbox + zfs + amdgpu. On 13.2-STABLE I had a kernel panic on vboxdrv / vboxnetadp load. So I switched to 13.1-RELEASE. Now after upgrading to 13.2 I have this problem again. Maybe related? https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=270809
(In reply to Tomasz "CeDeROM" CEDRO from comment #160) The package builds via 13.2-RELEASE have not even started yet. Systems using/needing kernel-specific ports should wait to upgrade to 13.2-RELEASE until the packages are known to be available if they are updating via binary packages. This is normal when a new release happens. FreeBSD does not hold the release until after the packages are available. 13.1-RELEASE is still supported for some time but cannot use 13.2-RELEASE based packages generally.
Thanks Mark :-) The problem is that build from ports kernel crashes on module load :-(
(In reply to Tomasz "CeDeROM" CEDRO from comment #162) Crashing from having a wrong module vintage for the kernel is normal/historical as I understand. So, unfortunately, not anything new. The package build servers will not start building based on 13.2-RELEASE until 13.1-RELEASE goes EOL as I understand. Prior to that building from source is what is supported when such kernel-dependent ports are involved. FreeBSD still has some build-from-source biases in its handling of things. Resource limitations may well still be forcing such, for all I know. So, either wait to use 13.2-RELEASE or build and install (some) ports via source based builds if you require ports with kernel-dependent modules.
(In reply to Tomasz "CeDeROM" CEDRO from comment #162) Sorry that I misinterpreted some of the context/wording. And nice to see that the 13.1-RELEASE build is rejected with a message, now that I look again.
Created attachment 241523 [details] Crash that happened neither at startup nor shutdown Perhaps not related to my original crash, but undoubtedly a crash that happened in amdgpu code. I was watching a movie using vlc. I decided I was finished watching and I typed control-q. The screen froze with a frame from the movie still showing, and after a few seconds the machine rebooted and saved a coredump, with the attached crash summary that really doesn't resemble any of the earlier ones saved here. Does anyone have any words of wisdom? To avoid the startup crash, I had booted to single user mode and had kldloaded vboxnetflt and amdgpu before continuing to multiuser mode.
I got tired of all those VirtualBox problems. I do not really care anymore about that program, if its problems are related to amdgpu or zfs. I have switched to bhyve that can be easily managed from a shell with vm utility [1]. I recommend doing the same. [1] https://github.com/churchers/vm-bhyve
This is an amdgpu problem. Although vboxnetflt is one of the kernel modules that can, in cooperation with amdgpu, exhibit the crash, zfs and acpi_wmi have also exhibited the same failure -- and the most recent crash summary contains no reference to vboxnetflt participating in the crash. (It does show that I manually typed "kldload vboxnetflt" in single-user mode about an hour and a half before the crash occurred.)
After upgrading to 5.10.163_5 today, I haven't yet had this crash -- but I've booted only a couple of times so far and it's too soon to jump to any conclusions.
Created attachment 241741 [details] Shutdown crash with version 5.10.163_5 5.10.163_5 still crashes. This time it was at shutdown time.
Created attachment 241750 [details] And another plain old boot time crash I had thought I could artificially provoke the crash by booting to single user mode, loading the amdgpu, zfs, vboxnetflt, and acpi_wmi kernel modules in quick succession, and then continuing to multiuser mode. But that didn't do it. So yesterday I went back to the old way of loading zfs with "zfs_enable="YES"" in rc.conf instead of "zfs_load="YES"" in /boot/loader.conf, and loading amdgpu by setting kld_list="amdgpu" in rc.conf. And now I get the crashes again.
(In reply to George Mitchell from comment #170) I'm unclear on the contrasting case: when you use /boot/loader.conf material instead of /etc/rc.conf material what happens these days? No crashes? Fairly rare crashes of the usual types? Fairly rare crashes of other types? A mix of fairly rare crashes of the 2 categories? (I may well not be thinking of everything that would be of note. So take the questions as just illustrative.)
One of the things that makes this hard to analyze is that the first failure quickly leads to other failures and most of the evidence is for the later failures. For example, in the following note that the original trap number is 12 but the backtrace is for/after a later trap, of type-number 22 instead. There is very little information directly about the original trap type-number 12: Fatal trap 12: page fault while in kernel mode cpuid = 0; apic id = 00 fault virtual address = 0x0 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff80bf3727 stack pointer = 0x28:0xfffffe000e1a7ba0 frame pointer = 0x28:0xfffffe000e1a7bd0 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 1 (init) trap number = 12 WARNING !drm_modeset_is_locked(&crtc->mutex) failed at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_4/drivers/gpu/drm/drm_atomic_helper.c:619 . . . 
WARNING !drm_modeset_is_locked(&plane->mutex) failed at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_4/drivers/gpu/drm/drm_atomic_helper.c:894 kernel trap 22 with interrupts disabled kernel trap 22 with interrupts disabled panic: page fault cpuid = 0 time = 1682435560 KDB: stack backtrace: #0 0xffffffff80c66ee5 at kdb_backtrace+0x65 #1 0xffffffff80c1bbef at vpanic+0x17f #2 0xffffffff80c1ba63 at panic+0x43 #3 0xffffffff810addf5 at trap_fatal+0x385 #4 0xffffffff810ade4f at trap_pfault+0x4f #5 0xffffffff81084fd8 at calltrap+0x8 #6 0xffffffff8261d251 at spl_nvlist_free+0x61 #7 0xffffffff826dd740 at fm_nvlist_destroy+0x20 #8 0xffffffff827b6e95 at zfs_zevent_post_cb+0x15 #9 0xffffffff826dcd02 at zfs_zevent_drain+0x62 #10 0xffffffff826dcbf8 at zfs_zevent_drain_all+0x58 #11 0xffffffff826dede9 at fm_fini+0x19 #12 0xffffffff82713b94 at spa_fini+0x54 #13 0xffffffff827be303 at zfs_kmod_fini+0x33 #14 0xffffffff8262fb3b at zfs_shutdown+0x2b #15 0xffffffff80c1b76c at kern_reboot+0x3dc #16 0xffffffff80c1b381 at sys_reboot+0x411 #17 0xffffffff810ae6ec at amd64_syscall+0x10c . . . The primary hint about what code execution context led to the original instance of trap type 12 above is basically: instruction pointer = 0x20:0xffffffff80bf3727
amdgpu does not leave in place a clean context for debugging kernel crashes. Trying to keep the video context operational for a kernel that has crashed, while not messing up the analysis context for the original problem, is problematic. My guess would be that normal analysis of such tries to have the problem occur in a virtual machine sort of context where another (outer) context is available that is independent and can look at the details from outside the failing context. But even that would require the failing context in the VM to stop before amdgpu or the like messed up the evidence in the VM. (Not that I've ever done that type of evidence gathering.)
Here is a collection of points in response to Mark Millard's request.
1. Regardless of the order in which I load kernel modules by hand in single-user mode, I can't ever duplicate the crash.
2. The crash never happens if amdgpu.ko is not loaded.
3. Emmanuel Vadot categorically states that the many, many references to drm_modeset_is_locked failures in the crash summaries are noise and don't indicate drm failures and are caused by virtual terminal switching. But I still get crashes even when there are no virtual terminal switches (because I didn't start X windows and I didn't type ALT-Fn).
4. The crash always happens after amdgpu.ko is loaded, and (in terms of time of occurrence) at about the time vboxnetflt.ko or acpi_wmi.ko is loaded. The seeming zfs crash can happen even when zfs.ko is loaded before amdgpu.ko, and I theorize that it happens when my large (1TB) USB ZFS-formatted drive comes on line and gets tasted (after amdgpu.ko is loaded).
5. But I can't come up with any theory in which I can blame the actual crash on vboxnetflt.ko, acpi_wmi.ko, or zfs.ko.
This bug should not be assigned to freebsd-fs. But I can't tell you to whom it should be assigned.
Since my last note on April 27, I have been booting up in this manner:
1. Boot to single user mode.
2. Run a script that loads amdgpu.ko, zfs.ko, vboxnetflt.ko, and acpi_wmi.ko in immediate succession.
3. Exit to multiuser mode.
In the course of roughly 50-60 bootups, there have been only two crashes during single user mode, but regrettably they leave no trace because the root partition is still mounted read-only. At least I think that's why there's no dump. So something about single-user mode makes the crash much less likely to occur. Anyway, jumping through these hoops does enable me to run my graphics with the improved driver.
(In reply to George Mitchell from comment #174) > … crashes during single user mode, but regrettably they leave no trace > … the root partition is still mounted read-only. … Hint (whilst in single-user mode): mount -uw / && zfs mount -a sysrc dumpdev – you'll probably find a different device, typically the swap partition. sysrc dumpdir – you'll probably find /var/crash. service dumpon describe – if you boot in single user mode after a kernel panic, then /var/crash will not yet include information about the panic. service savecore describe
Is this still relevant?
Yes, even after updating to 13.3-RELEASE-p4. I'm not brave enough yet to upgrade to 14. I work around the problem by booting in single-user mode, running this script:
#!/bin/sh
mount -u /
mount -r /usr
kldload amdgpu.ko
kldload zfs.ko
kldload vboxnetflt.ko
kldload acpi_wmi.ko
sleep 3
mount -u /usr
which 99% of the time doesn't crash, and then exiting to multiuser. I haven't yet figured out how to get a crash dump with /, /tmp/, and /var/ mounted R/W (they're all on one physical partition) and /usr/ mounted RO. Probably irrelevant fact: every time I start chrome from the command line, I get the message: amdgpu: os_same_file_description couldn't determine if two DRM fds reference the same file description. If they do, bad things may happen! But in fact there seem to be no ill effects.
(In reply to George Mitchell from comment #177) I'm from vbox@ and after partially reading the comments I would say that it is unlikely that the root of the problem is VirtualBox. It looks like the problem is with amdgpu. In extreme cases, it could be a fundamental issue in the kernel in handling modules. Or a floating hardware issue… There is graphics/drm-515-kmod for 14.0+ and graphics/drm-61-kmod for 14.1+. Maybe you can check that without upgrading to 14.1.
Of course it's highly unlikely that the problem is in VirtualBox -- or zfs or acpi_wmi. But that's a minority view here. If I can get a proper core dump when in single-user mode (with RO /usr), it would surely clarify the issue. It seems unlikely in the extreme to me that the 14 version would work with 13, given that a 13.2 compile of the port would not work with a 13.3 kernel.
(In reply to George Mitchell from comment #179) > It seems unlikely in the extreme to me that the 14 version would work with 13, given that a 13.2 compile of the port would not work with a 13.3 kernel. I meant running 14.1 without upgrading the current system (install it on a separate, empty HDD/SSD) and testing a more recent version of amdgpu there.
After upgrading my system from 13.3-RELEASE-p8 to 13.4-RELEASE-p2 and recompiling drm-510-kmod (getting version drm-510-kmod-5.10.163_10 as reported by pkg info, despite the different version string seen below), I got a crash in single-user mode that at least left intelligible text on the screen, though I did not get a dump. Here, manually transcribed (unfortunately), are the things I saw: (once) Fatal Trap 9 in kernel mode (five times) kernel trap 22 with interrupts disabled Backtrace: kdb_backtrace vpanic panic trap_fatal call_trap linker_load_dependencies link_elf_load_file linker_load_module kern_kldload sys_kldload amd64_syscall fast_syscall_common The string "v5.10.163_7" once. I don't know where it came from, and my system log definitely says: Dec 7 18:04:33 court pkg[1782]: drm-510-kmod-5.10.163_9 deinstalled Dec 7 18:04:41 court pkg[1786]: drm-510-kmod-5.10.163_10 installed That's all I have, regrettably. Probably before the end of the year I will be upgrading to 14-RELEASE. Perhaps this helps. Despite the involvement of the zfs, vboxnetflt, and acpi_wmi kernel modules, none of those modules ever causes any trouble when drm-510-kmod is not present, and my (long) software engineering experience tells me that the simplest explanation is a bug in drm-510-kmod.
I forgot to mention the most obvious thing on the screen, which has been the hallmark of this crash all along: fifty-plus repetitions of the famous "WARNING !drm_modeset_is_locked(&crtc->mutex) failed" message.
(In reply to George Mitchell from comment #182) Over the 2+ years of failures, what is just the first failure-indicating message from each failing boot? Well, likely you could only approximate that. The point is to try to ignore later messages from failing boots that could just be consequences of prior failure activity for which there is already evidence. For example, if "WARNING !drm_modeset_is_locked(&crtc->mutex) failed" is never first, it is less likely to be of interest. But if it is always first it could be more likely to be of interest. (This is just for illustration, not special to the specific message.) Going the other way, again just as an example message: does it sometimes occur even when no overall failure happens? Such would also make such a message somewhat less likely to be of interest. Of course, different boots might not get the same kind of first failure-indicating message. But the list and relative frequency of occurrence might be of some use. Another issue could be that you might not have good evidence for first failure-indicating from the failing boot attempts: no way to answer the question then.
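One way to approximate the "first failure-indicating message per failing boot" tally is sketched below. It assumes (hypothetically) that the console text of each failing boot has been saved to its own file, and it classifies only the four message kinds already quoted in this report; the function names are made up for illustration, and real logs may need more patterns.

```shell
#!/bin/sh
# Print the kind of the first failure-indicating line in one saved boot log.
# The four patterns are message prefixes quoted in this report.
first_failure_kind() {
    awk '
        /Fatal trap/                     { print "Fatal trap"; exit }
        /kernel trap/                    { print "kernel trap"; exit }
        /panic:/                         { print "panic"; exit }
        /WARNING !drm_modeset_is_locked/ { print "drm_modeset_is_locked warning"; exit }
    ' "$1"
}

# Count how often each kind comes first across several saved boot logs.
tally_first_failures() {
    for f in "$@"; do
        first_failure_kind "$f"
    done | sort | uniq -c | sort -rn
}
```

Running tally_first_failures over all saved logs gives a rough frequency list; a message kind that is never first is less likely to be the root cause.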
To answer many of your questions about the drm_modeset_is_locked message, may I direct your attention to the second attachment to the bug: https://bz-attachments.freebsd.org/attachment.cgi?id=238849. TL;DR: It's never literally first, but it's pretty close to first and there are always multiple, multiple copies of it that are impossible to ignore.
And I buried the lede. I never see any evidence of drm_modeset_is_locked during normal operation.
(In reply to George Mitchell from comment #184) Then I'm afraid some information from earlier in the failure sequence will prove essential to identifying and fixing the overall problem: Necessary --but such information need not be sufficient on its own.
(In reply to Mark Millard from comment #186) Which of the following also happen for boots where there is no evidence of a problem? [drm] BIOS signature incorrect 53 7 . . . sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)! . . . ACPI: \134AOD.WQBA: 1 arguments were passed to a non-method ACPI object (Buffer) (20201113/nsarguments-361) . . . acpi_wmi1: cannot find EC device . . . ACPI: \134GSA1.WQCC: 1 arguments were passed to a non-method ACPI object (Buffer) (20201113/nsarguments-361) driver bug: Unable to set devclass (class: ppc devname: (unknown)) Do any of those only happen when there is a failure?
(In reply to Mark Millard from comment #187) I managed to not list as one of the messages to ask about: Firmware Warning (ACPI): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20201113/tbfadt-796)
ALL of the messages you cited occur all of the time.
From the stack trace: #7 <signal handler called> #8 avl_destroy_nodes (tree=tree@entry=0xfffff8001b6c0420, cookie=cookie@entry=0xfffffe00b3bcddd0) at /usr/src/sys/contrib/openzfs/module/avl/avl.c:1023 I used to have a repeatable crash with a bad pointer in ZFS AVL code. See bug #268909. Eventually it went away, whether due to disk data changes or bug fixes I can't say. I did not have any graphics drivers loaded.
(In reply to John F. Carr from comment #190) Note the attachment named "Crash without any use of ZFS, with acpi_wmi" ( see also comment #107 and "Relevant part of /var/log/messages" ). There is not even the likes of: ZFS filesystem version: 5 ZFS storage pool version: features support (5000) in the log, much less a backtrace with zfs content. Also, "Latest crash dump" is one that makes no mention of ZFS in its backtraces, as I remember. Each exposes-failure context had a similar "does not need to be in the backtrace for a failure to occur" status, at least given other example failure-reporting code was involved. My memory of the history was that failure never happened without drm-510-kmod being in use but the initial exposure of the problem was never via a backtrace involving drm-510-kmod .
I got the crash again. I still haven't figured out how to get a dump in single user mode, but at least I got pictures: https://www.m5p.com/public/george/267028/IMG_20241210_093221099.jpg https://www.m5p.com/public/george/267028/IMG_20241210_093239744_HDR.jpg I'm sorry for the quality of the first one; I'll try to get a better one on the next crash. Apparently, now that I'm on 13.4 instead of 13.3 (and with the latest version of the kernel module), I'll get at least the text screen instead of the immediate reboot.
(In reply to George Mitchell from comment #192) Viewed at actual size or zoomed in on an iPad I was able to read some of the text in: https://www.m5p.com/public/george/267028/IMG_20241210_093239744_HDR.jpg . . . vgapci0: child drmn0 requested pci_get_powerstate sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)! <6>[drm] Initialized amdgpu 3.40.0 20150101 for drmn0 on minor 0 It panicked here instead of getting to the point of showing what would normally be next: "Autoloading module: acpi_wmi" However, the <6> is also not normal and may well be significant. The panic was something like: Fatal trap 9: general protection fault while in kernel mode cpuid = 1: apic id = 01 instruction pointer = 0x20:0xffffffff00cf0110 . . . current process = 25 (kldload) trap number = 9 (I doubt that I could tell c vs. e or 0 vs. 8 .) In the presence of the panic context I do not expect the drm_modeset locking is able to operate correctly. In other words: I expect that the !drm_modeset_is_locked notices are expected, given the Fatal trap 9 and its being handled in the kernel. That in turn messes things up even more for later --or so I guess. I expect that the later trap 22 (in the other picture) is from the lack of handling during the attempt to present the trap 9 information. The other picture's backtrace suggests for the instruction pointer reported above (general range of upper address bits): instruction pointer = 0x2?:0xffffffff80?f?11? where the "?"'s are about the 0 vs. 8 and c vs. e question.
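The ambiguity in that pattern is small enough to enumerate. The sketch below assumes the first "?" is the c vs. e digit and the other two are 0 vs. 8 digits (a reading of the photo, not something confirmed), and it only builds address strings; the function name is hypothetical.

```shell
#!/bin/sh
# Enumerate every address consistent with the partly legible pattern
# 0xffffffff80?f?11?, where each "?" is one of two look-alike hex digits.
# Pure string building; nothing here touches the kernel.
candidate_ips() {
    for a in c e; do            # first "?": c vs. e (assumed)
        for b in 0 8; do        # second "?": 0 vs. 8 (assumed)
            for c in 0 8; do    # third "?": 0 vs. 8 (assumed)
                printf '0xffffffff80%sf%s11%s\n' "$a" "$b" "$c"
            done
        done
    done
}

candidate_ips
```

Each printed address can then be given to kgdb's disass command to see which function contains it.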
I'll note that drm-510-kmod has the status: .if ${OPSYS} == FreeBSD && ${OSVERSION} >= 1401501 IGNORE= not supported on FreeBSD 14.2 and higher .endif drm-515-kmod has the status: .if ${OPSYS} == FreeBSD && ${OSVERSION} < 1400081 IGNORE= not supported on older than 14.0, no kernel support .endif drm-61-kmod has the status: .if ${OPSYS} == FreeBSD && !( ${OSVERSION} >= 1500008 || ( ${OSVERSION} >= 1400508 && ${OSVERSION} < 1500000 )) IGNORE= not supported on older than 14-STABLE 1400508, no kernel support .endif So drm-510-kmod will not last through all of 14.* .
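As a quick cross-check of those three conditions, the sketch below reports which of the ports would not be marked IGNORE for a given OSVERSION. It encodes only the conditions quoted above (the real Makefiles may carry further restrictions), and the function name is made up for illustration.

```shell
#!/bin/sh
# List the drm-*-kmod ports that the quoted IGNORE conditions allow
# for the OSVERSION given as $1.
supported_drm_ports() {
    osv=$1
    out=""
    # drm-510-kmod: IGNORE when OSVERSION >= 1401501
    if [ "$osv" -lt 1401501 ]; then
        out="$out drm-510-kmod"
    fi
    # drm-515-kmod: IGNORE when OSVERSION < 1400081
    if [ "$osv" -ge 1400081 ]; then
        out="$out drm-515-kmod"
    fi
    # drm-61-kmod: IGNORE unless >= 1500008, or within [1400508, 1500000)
    if [ "$osv" -ge 1500008 ] || { [ "$osv" -ge 1400508 ] && [ "$osv" -lt 1500000 ]; }; then
        out="$out drm-61-kmod"
    fi
    echo "${out# }"
}
```

For example, supported_drm_ports 1304000 (an OSVERSION assumed here to correspond to 13.4-RELEASE) lists only drm-510-kmod, while a value at or above 1401501 drops it from the list.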
(In reply to George Mitchell from comment #192) As for having a swap partition ready for use for dumping as early as possible: assign dumpdev in /boot/loader.conf . An example from one of my contexts: # grep -i dump /boot/loader.conf dumpdev="/dev/gpt/OptBswp364" The kernel does have to be far enough along to put it to use but the above is the only way to have things configured for the earliest point at which dumping is supported, as far as I know. For dumping to swap partitions and this earlier time frame, to my knowledge single user mode vs. not boot attempts would not make a difference. Actually, the above may do one thing that interferes: /dev/gpt/ use might not be the best idea for earliest possible dumps. Use an exact device reference instead. In my context, as it is now, /dev/gpt/OptBswp364 is actually /dev/nda2p2 . So: dumpdev="/dev/nda2p2" would likely be better for earliest-dump-possible in my context.
Mark, I honestly thank you sincerely for putting dumpdev in /boot/loader.conf -- that is almost certainly the piece of the puzzle I have been missing! I've added dumpdev="/dev/ada0p2" And we'll see what happens. Presumably things are far enough along at that point that it can find /var/crash in that partition. I already knew that I would have to go to drm-515-kmod when updating to 14.2-RELEASE. But I started having problems with this hardware at least a couple of iterations of operating system and display driver back. I'll be pleasantly surprised if drm-515-kmod does better.
(In reply to George Mitchell from comment #196) I think I may have guessed wrong about the staging that you were referring to: A) Before the reboot the dump is put in the swap partition as raw data. There is a dump command that can be used at the db> prompt. B) Then rebooting analyzes the dump in the swap partition and puts the information in /var/crash . My notes did not cover (B), just (A).
My swap partition is /dev/ada0p3; is that what I should set dumpdev to?
(In reply to George Mitchell from comment #198) (In reply to George Mitchell from comment #196) For getting the crash dump out of the swap partition and into /var/crash ( replacing the: ???? ) with its analysis: # savecore /var/crash /dev/???? # crashinfo -b (I think that last will automatically pick up the latest saved core. Otherwise add the path to the dump file after the -b argument.) See: man 8 savecore man 8 crashinfo Note that /var/crash/ would need to be writable. For single user mode, you first need to deal with making it writable as I remember. This gets into your partitioning and mount point usage and ZFS vs. UFS. If everything is in one UFS partition, then: # mount -w / should be sufficient, as an example.
(In reply to George Mitchell from comment #196) I do not know if you have all the right tools installed for crashinfo : you need devel/gdb (that installs a kgdb as well). QUOTE Once crashinfo has located a core dump and kernel, it uses several utilities to analyze the core including dmesg(8), fstat(1), iostat(8), ipcs(1), kgdb(1) (ports/devel/gdb), netstat(1), nfsstat(1), ps(1), pstat(8), and vmstat(8). Note that kgdb must be installed from the devel/gdb port or gdb package. END QUOTE
(In reply to George Mitchell from comment #198) /boot/loader.conf should have: dumpdev="/dev/ada0p3" You may have to explicitly request the dump at the db> prompt.
Are you building and installing your own drm-510-kmod? vs. Are you using the FreeBSD packaged drm-510-kmod? The official 13.*-RELEASE builds are still built via jail 133amd64-default . In other words: the builds are for 13.3-RELEASE, not for 13.4-RELEASE . For now, when you are running 13.4-RELEASE you need to have built drm-510-kmod in a 13.4-RELEASE context (such as via a 13.4-RELEASE poudriere jail) and to have installed that build. This includes making sure that your poudriere jail content is 13.4-RELEASE, not 13.3-RELEASE .
In reply to comment #202, allow me to refer you to comment #181. (Off topic: this bug sets my personal record for bug with most comments.)
(In reply to George Mitchell from comment #203) Just for paranoia, what did the install message show for %%OPSYS%% and %%OSREL%% in the new installation message text: +Please note that this package was built for %%OPSYS%% %%OSREL%%. +If this is not your current running version, please rebuild +it from ports to prevent panics when loading the module. ?
pkg info -D drm-510-kmod [...] Please note that this package was built for FreeBSD 13.4. If this is not your current running version, please rebuild it from ports to prevent panics when loading the module. With regard to getting crash dumps, I've never (before this bug) had to do anything other than rely on 'dumpdev="AUTO"' to do the right thing automagically, so that's why I've had trouble with that. But I had figured out that I needed my root partition to be writable; see comment #177.
(In reply to George Mitchell from comment #205) With the over 2 years of effort at creating your "personal record for bug with most comments" (with my help!), I've gotten to the point that I no longer search it all to find some piece of information when I do not remember whether it is present or not. It was easier to remember back when I was more active for the issue: I've jumped from 2023-04-26 23:24:29 UTC to 2024-12-08 21:20:06 UTC with no involvement in the middle, not normally having drm-*-kmod in use in my environment and not having to deal with the issue in my environment. One issue that the dump will likely have is being made after multiple traps (such as 22's after the 9) and such, instead of just after the initial one. The later activity will likely corrupt some of the information from the initial fault's time frame. It would be nice if one could configure to have the initial trap (likely 9) initiate the dump directly and get to the ddb> prompt after the dump was done, then allowing for a reboot.
For the trap 9's: instruction pointer = 0x2?:0xffffffff80?f?11? Looking at the kernel code around: 0xffffffff80cf011? I find the code in that area is in qsort. The old comment # 121 found such as well: 0xffffffff80cf00ff <+6047>: jae 0xffffffff80cf0470 <qsort+6928> 0xffffffff80cf0105 <+6053>: mov %rbx,%rax 0xffffffff80cf0108 <+6056>: shr $0x2,%rax 0xffffffff80cf010c <+6060>: mov %rbx,%r15 0xffffffff80cf010f <+6063>: shr $0x3,%r15 0xffffffff80cf0113 <+6067>: lea -0x1(%rbx),%rdx 0xffffffff80cf0117 <+6071>: mov %rdx,-0xa0(%rbp) 0xffffffff80cf011e <+6078>: lea -0x1(%rax),%rdx 0xffffffff80cf0122 <+6082>: mov %rdx,-0x98(%rbp) (Not that the code details inside qsort match.) Other alternatives: (kgdb) disass 0xffffffff80cf8110 Dump of assembler code for function deflate_slow: 0xffffffff80cf80f8 <+1048>: je 0xffffffff80cf812b <deflate_slow+1099> 0xffffffff80cf80fa <+1050>: mov 0x18(%r13),%rdi 0xffffffff80cf80fe <+1054>: mov 0x20(%r15),%rsi 0xffffffff80cf8102 <+1058>: mov %r12d,%edx 0xffffffff80cf8105 <+1061>: call 0xffffffff80cfeea0 <zmemcpy> 0xffffffff80cf810a <+1066>: mov %r12d,%eax 0xffffffff80cf810d <+1069>: add %rax,0x18(%r13) 0xffffffff80cf8111 <+1073>: add %rax,0x20(%r15) 0xffffffff80cf8115 <+1077>: add %rax,0x28(%r13) 0xffffffff80cf8119 <+1081>: sub %r12d,0x20(%r13) 0xffffffff80cf811d <+1085>: sub %rax,0x28(%r15) 0xffffffff80cf8121 <+1089>: jne 0xffffffff80cf812b <deflate_slow+1099> (kgdb) disass 0xffffffff80ef0110 Dump of assembler code for function mac_vnode_check_write_impl: 0xffffffff80ef00f7 <+71>: je 0xffffffff80ef00e0 <mac_vnode_check_write_impl+48> 0xffffffff80ef00f9 <+73>: mov 0x188(%rbx),%rcx 0xffffffff80ef0100 <+80>: mov %r12,%rdi 0xffffffff80ef0103 <+83>: mov %r14,%rsi 0xffffffff80ef0106 <+86>: mov %rbx,%rdx 0xffffffff80ef0109 <+89>: call *%rax 0xffffffff80ef010b <+91>: mov %eax,%edi 0xffffffff80ef010d <+93>: mov %r15d,%esi 0xffffffff80ef0110 <+96>: call 0xffffffff80edefb0 <mac_error_select> 0xffffffff80ef0115 <+101>: mov %eax,%r15d 
0xffffffff80ef0118 <+104>: jmp 0xffffffff80ef00e0 <mac_vnode_check_write_impl+48> 0xffffffff80ef011a <+106>: cmpq $0x0,0x11d029e(%rip) # 0xffffffff820c03c0 <mac_policy_list> 0xffffffff80ef0122 <+114>: je 0xffffffff80ef017f <mac_vnode_check_write_impl+207> (kgdb) disass 0xffffffff80ef8110 Dump of assembler code for function ffs_blkfree_cg: 0xffffffff80ef80fa <+106>: jbe 0xffffffff80ef81aa <ffs_blkfree_cg+282> 0xffffffff80ef8100 <+112>: mov %rdi,-0x30(%rbp) 0xffffffff80ef8104 <+116>: mov 0x38(%rax),%r15 0xffffffff80ef8108 <+120>: lea -0x38(%rbp),%r8 0xffffffff80ef810c <+124>: lea -0x98(%rbp),%r9 0xffffffff80ef8113 <+131>: mov %rbx,%rdi 0xffffffff80ef8116 <+134>: mov %r10,-0x58(%rbp) 0xffffffff80ef811a <+138>: mov %r10,%rsi 0xffffffff80ef811d <+141>: mov %rdx,-0x48(%rbp) 0xffffffff80ef8121 <+145>: mov $0x80,%ecx
Hmm. The backtrace in: https://www.m5p.com/public/george/267028/IMG_20241210_093221099.jpg is incoherent. Dump of "disass/s" code report for function link_elf_load_file: /usr/src/sys/kern/link_elf.c: 952 { 0xffffffff80c1e5a0 <+0>: push %rbp . . . 0xffffffff80c1ef00 <+2400>: jmp 0xffffffff80c1ecf4 <link_elf_load_file+1876> End of assembler dump. Compare that +2400 to the backtrace's: link_elf_load_file+0x115c (note: 0x115c == 4444)
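The mismatch can be checked mechanically. The sketch below converts the backtrace offset from hex and compares it with the offset of the last instruction in the disassembly; it assumes the disassembled kernel matches the one that produced the photo, and the function name is made up for illustration.

```shell
#!/bin/sh
# Report whether a backtrace offset ($1, hex such as 0x115c) falls at or
# before the last instruction offset ($2, decimal) listed for a function.
offset_in_function() {
    off=$(printf '%d' "$1")   # printf converts the 0x-prefixed value to decimal
    last=$2
    if [ "$off" -le "$last" ]; then
        echo "inside"
    else
        echo "outside"
    fi
}

# link_elf_load_file+0x115c against the final instruction at +2400
offset_in_function 0x115c 2400
```

An offset well past the end of the function suggests either a misread digit in the photo or a kernel that differs from the one being disassembled.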
Mark, I sincerely appreciate the help you have provided on this bug. My major character flaw is a tendency to lapse into sarcasm at the drop of a hat. I'm working on it.
Does your 13.4-RELEASE environment have the likes of: /usr/lib/debug/boot/kernel/kernel.debug in addition to /boot/kernel/kernel ? If yes, is the kernel.debug file up to date with the kernel file?
The two files have the same time stamp (Dec 6 00:40, when I updated to 13.4), so I assume they are in sync.
For a successful boot, could you report the local equivalent of: # kgdb # So: the live system (not what I was actually doing) . . . (kgdb) disass btext Dump of assembler code for function btext: 0xffffffff8038e000 <+0>: push $0x2 0xffffffff8038e002 <+2>: popf 0xffffffff8038e003 <+3>: mov %rsp,%rbp 0xffffffff8038e006 <+6>: mov 0x4(%rbp),%edi 0xffffffff8038e009 <+9>: mov 0x8(%rbp),%esi 0xffffffff8038e00c <+12>: mov $0xffffffff81d84580,%rsp 0xffffffff8038e013 <+19>: xor %ebp,%ebp 0xffffffff8038e015 <+21>: call 0xffffffff8108dca0 <hammer_time> 0xffffffff8038e01a <+26>: mov %rax,%rsp 0xffffffff8038e01d <+29>: call 0xffffffff80b7d260 <mi_startup> 0xffffffff8038e022 <+34>: hlt 0xffffffff8038e023 <+35>: jmp 0xffffffff8038e022 <btext+34> 0xffffffff8038e025 <+37>: cs nopw 0x0(%rax,%rax,1) 0xffffffff8038e02f <+47>: nop End of assembler dump. (kgdb) info files Symbols from "/usr/home/root/artifacts/13.4R/usr/lib/debug/boot/kernel/kernel.debug". Local exec file: `/usr/home/root/artifacts/13.4R/boot/kernel/kernel', file type elf64-x86-64-freebsd. Entry point: 0xffffffff8038e000 0xffffffff802002a8 - 0xffffffff802002b5 is .interp 0xffffffff802002b8 - 0xffffffff802310f0 is .hash 0xffffffff802310f0 - 0xffffffff8025f9c0 is .gnu.hash 0xffffffff8025f9c0 - 0xffffffff802f2450 is .dynsym 0xffffffff802f2450 - 0xffffffff8036d0c4 is .dynstr 0xffffffff8036d0c8 - 0xffffffff8038da68 is .rela.dyn 0xffffffff8038e000 - 0xffffffff811863f8 is .text 0xffffffff81186400 - 0xffffffff817f8c20 is .rodata . . . (past the first .text section anyway) . . . Note that my output above is for the kernel file instead of for a live system. For all I know a live system might relocate sections to distinct address ranges vs. the above. That is part of what I'm checking for your context. Also: (kgdb) disass 0xffffffff80cf0110 . . . (various pages later) . . . A range of lines spanning at least from somewhat before 0xffffffff80cf011? to somewhat after it. . . . 
(The above will likely name what function it is displaying -- the one that spans the address listed.)
This is what I get: kgdb GNU gdb (GDB) 15.1 [GDB v15.1 for FreeBSD] Copyright (C) 2024 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-portbld-freebsd13.4". Type "show configuration" for configuration details. For bug reporting instructions, please see: <https://www.gnu.org/software/gdb/bugs/>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from /boot/kernel/kernel... Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug... Reading symbols from /boot/kernel/sem.ko... Reading symbols from /usr/lib/debug//boot/kernel/sem.ko.debug... Reading symbols from /boot/modules/if_re.ko... (No debugging symbols found in /boot/modules/if_re.ko) Reading symbols from /boot/kernel/fusefs.ko... Reading symbols from /usr/lib/debug//boot/kernel/fusefs.ko.debug... Reading symbols from /boot/modules/amdgpu.ko... (No debugging symbols found in /boot/modules/amdgpu.ko) Reading symbols from /boot/modules/drm.ko... (No debugging symbols found in /boot/modules/drm.ko) Reading symbols from /boot/kernel/iic.ko... Reading symbols from /usr/lib/debug//boot/kernel/iic.ko.debug... Reading symbols from /boot/modules/linuxkpi_gplv2.ko... (No debugging symbols found in /boot/modules/linuxkpi_gplv2.ko) --Type <RET> for more, q to quit, c to continue without paging-- Reading symbols from /boot/modules/dmabuf.ko... (No debugging symbols found in /boot/modules/dmabuf.ko) Reading symbols from /boot/modules/ttm.ko... (No debugging symbols found in /boot/modules/ttm.ko) Reading symbols from /boot/modules/amdgpu_raven_gpu_info_bin.ko... 
(No debugging symbols found in /boot/modules/amdgpu_raven_gpu_info_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_sdma_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_sdma_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_asd_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_asd_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_ta_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_ta_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_pfp_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_pfp_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_me_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_me_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_ce_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_ce_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_rlc_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_rlc_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_mec_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_mec_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_mec2_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_mec2_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_vcn_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_vcn_bin.ko) Reading symbols from /boot/kernel/zfs.ko... Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug... Reading symbols from /boot/kernel/netgraph.ko... Reading symbols from /usr/lib/debug//boot/kernel/netgraph.ko.debug... Reading symbols from /boot/kernel/acpi_wmi.ko... --Type <RET> for more, q to quit, c to continue without paging-- Reading symbols from /usr/lib/debug//boot/kernel/acpi_wmi.ko.debug... Reading symbols from /boot/kernel/intpm.ko... Reading symbols from /usr/lib/debug//boot/kernel/intpm.ko.debug... Reading symbols from /boot/kernel/smbus.ko... 
Reading symbols from /usr/lib/debug//boot/kernel/smbus.ko.debug... Reading symbols from /boot/kernel/uhid.ko... Reading symbols from /usr/lib/debug//boot/kernel/uhid.ko.debug... Reading symbols from /boot/kernel/usbhid.ko... Reading symbols from /usr/lib/debug//boot/kernel/usbhid.ko.debug... Reading symbols from /boot/kernel/hidbus.ko... Reading symbols from /usr/lib/debug//boot/kernel/hidbus.ko.debug... Reading symbols from /boot/kernel/wmt.ko... Reading symbols from /usr/lib/debug//boot/kernel/wmt.ko.debug... Reading symbols from /boot/kernel/ums.ko... Reading symbols from /usr/lib/debug//boot/kernel/ums.ko.debug... Reading symbols from /boot/kernel/autofs.ko... Reading symbols from /usr/lib/debug//boot/kernel/autofs.ko.debug... Reading symbols from /boot/kernel/mac_ntpd.ko... Reading symbols from /usr/lib/debug//boot/kernel/mac_ntpd.ko.debug... Reading symbols from /boot/kernel/green_saver.ko... Reading symbols from /usr/lib/debug//boot/kernel/green_saver.ko.debug... sched_switch (td=td@entry=0xffffffff82043780 <thread0_st>, flags=flags@entry=260) at /usr/src/sys/kern/sched_4bsd.c:1085 1085 SDT_PROBE0(sched, , , on__cpu); (kgdb) disass btext Dump of assembler code for function btext: 0xffffffff8038e000 <+0>: push $0x2 0xffffffff8038e002 <+2>: popf 0xffffffff8038e003 <+3>: mov %rsp,%rbp 0xffffffff8038e006 <+6>: mov 0x4(%rbp),%edi 0xffffffff8038e009 <+9>: mov 0x8(%rbp),%esi 0xffffffff8038e00c <+12>: mov $0xffffffff81d83880,%rsp 0xffffffff8038e013 <+19>: xor %ebp,%ebp 0xffffffff8038e015 <+21>: call 0xffffffff8108bca0 <hammer_time> 0xffffffff8038e01a <+26>: mov %rax,%rsp 0xffffffff8038e01d <+29>: call 0xffffffff80b7d260 <mi_startup> 0xffffffff8038e022 <+34>: hlt 0xffffffff8038e023 <+35>: jmp 0xffffffff8038e022 <btext+34> 0xffffffff8038e025 <+37>: cs nopw 0x0(%rax,%rax,1) 0xffffffff8038e02f <+47>: nop End of assembler dump. (kgdb) info files Symbols from "/boot/kernel/kernel". Kernel core dump file: `/dev/mem', file type FreeBSD kernel vmcore. 
Local exec file: `/boot/kernel/kernel', file type elf64-x86-64-freebsd. Entry point: 0xffffffff8038e000 0xffffffff802002a8 - 0xffffffff802002b5 is .interp 0xffffffff802002b8 - 0xffffffff80231108 is .hash 0xffffffff80231108 - 0xffffffff8025f9e4 is .gnu.hash 0xffffffff8025f9e8 - 0xffffffff802f24c0 is .dynsym 0xffffffff802f24c0 - 0xffffffff8036d162 is .dynstr 0xffffffff8036d168 - 0xffffffff8038db08 is .rela.dyn 0xffffffff8038e000 - 0xffffffff811843f8 is .text 0xffffffff81184400 - 0xffffffff817f68d0 is .rodata 0xffffffff817f68d0 - 0xffffffff817fba38 is set_sysctl_set 0xffffffff817fba38 - 0xffffffff817fef60 is set_modmetadata_set 0xffffffff817fef60 - 0xffffffff817fefb8 is set_cam_xpt_xport_set 0xffffffff817fefb8 - 0xffffffff817fefe0 is set_cam_xpt_proto_set 0xffffffff817fefe0 - 0xffffffff817ff028 is set_ah_chips 0xffffffff817ff028 - 0xffffffff817ff078 is set_ah_rfs 0xffffffff817ff078 - 0xffffffff817ff098 is set_kbddriver_set 0xffffffff817ff098 - 0xffffffff817ff150 is set_sdt_providers_set 0xffffffff817ff150 - 0xffffffff81800268 is set_sdt_probes_set 0xffffffff81800268 - 0xffffffff818035c8 is set_sdt_argtypes_set 0xffffffff818035c8 - 0xffffffff818035e0 is set_scterm_set 0xffffffff818035e0 - 0xffffffff81803608 is set_cons_set 0xffffffff81803608 - 0xffffffff81803610 is set_uart_acpi_class_and_device_set 0xffffffff81803620 - 0xffffffff81803660 is usb_host_id 0xffffffff81803660 - 0xffffffff81803680 is set_vt_drv_set 0xffffffff81803680 - 0xffffffff818036a8 is set_elf64_regset 0xffffffff818036a8 - 0xffffffff818036d8 is set_elf32_regset --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff818036d8 - 0xffffffff818036e8 is set_compressors 0xffffffff818036e8 - 0xffffffff818036f0 is set_kdb_dbbe_set 0xffffffff818036f0 - 0xffffffff81803700 is set_ratectl_set 0xffffffff81803700 - 0xffffffff81803718 is set_crypto_set 0xffffffff81803718 - 0xffffffff81803730 is set_ieee80211_ioctl_getset 0xffffffff81803730 - 0xffffffff81803748 is set_ieee80211_ioctl_setset 
0xffffffff81803748 - 0xffffffff81803770 is set_scanner_set 0xffffffff81803770 - 0xffffffff81803790 is set_videodriver_set 0xffffffff81803790 - 0xffffffff818037d8 is set_scrndr_set 0xffffffff818037d8 - 0xffffffff81803820 is set_vga_set 0xffffffff81803820 - 0xffffffff81804881 is kern_conf 0xffffffff81804884 - 0xffffffff818048a8 is .note.gnu.build-id 0xffffffff818048a8 - 0xffffffff8180493c is .eh_frame 0xffffffff81a00000 - 0xffffffff81a00140 is .dynamic 0xffffffff81a00140 - 0xffffffff81a01000 is .relro_padding 0xffffffff81c00000 - 0xffffffff81c00035 is .data.read_frequently 0xffffffff81c00040 - 0xffffffff81c017f4 is .data.read_mostly 0xffffffff81c01800 - 0xffffffff81c07680 is .data.exclusive_cache_line 0xffffffff81c08000 - 0xffffffff81d51248 is .data 0xffffffff81d51248 - 0xffffffff81d54688 is set_sysinit_set 0xffffffff81d54688 - 0xffffffff81d55e48 is set_sysuninit_set 0xffffffff81d55e80 - 0xffffffff81d592e8 is set_pcpu 0xffffffff81d592f0 - 0xffffffff81d82851 is set_vnet 0xffffffff81d82880 - 0xffffffff82200000 is .bss 0xffffffff82545000 - 0xffffffff82547000 is .text in /boot/kernel/sem.ko 0xffffffff82547000 - 0xffffffff82548000 is .rodata in /boot/kernel/sem.ko 0xffffffff82548000 - 0xffffffff8254895c is .data in /boot/kernel/sem.ko 0xffffffff82548960 - 0xffffffff82548978 is set_sysctl_set in /boot/kernel/sem.ko 0xffffffff82548978 - 0xffffffff82548988 is set_sysinit_set in /boot/kernel/sem.ko 0xffffffff82548988 - 0xffffffff82548990 is set_sysuninit_set in /boot/kernel/sem.ko 0xffffffff82548990 - 0xffffffff82548a10 is .bss in /boot/kernel/sem.ko --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff82548a10 - 0xffffffff82548a28 is set_modmetadata_set in /boot/kernel/sem.ko 0xffffffff82548a28 - 0xffffffff82548a4c is .note.gnu.build-id in /boot/kernel/sem.ko 0xffffffff8254d000 - 0xffffffff825d4000 is .text in /boot/modules/if_re.ko 0xffffffff825d4000 - 0xffffffff825db000 is .rodata in /boot/modules/if_re.ko 0xffffffff825db000 - 0xffffffff825db4a0 is 
.data in /boot/modules/if_re.ko 0xffffffff825db4a0 - 0xffffffff825db4f0 is set_sysctl_set in /boot/modules/if_re.ko 0xffffffff825db4f0 - 0xffffffff825db500 is set_modmetadata_set in /boot/modules/if_re.ko 0xffffffff825db500 - 0xffffffff825db508 is set_sysinit_set in /boot/modules/if_re.ko 0xffffffff825db508 - 0xffffffff825db528 is .bss in /boot/modules/if_re.ko 0xffffffff825db528 - 0xffffffff825db54c is .note.gnu.build-id in /boot/modules/if_re.ko 0xffffffff8264d000 - 0xffffffff8265a000 is .text in /boot/kernel/fusefs.ko 0xffffffff8265a000 - 0xffffffff8265c000 is .rodata in /boot/kernel/fusefs.ko 0xffffffff8265c000 - 0xffffffff8265e874 is .data in /boot/kernel/fusefs.ko 0xffffffff8265e878 - 0xffffffff8265e970 is set_sdt_probes_set in /boot/kernel/fusefs.ko 0xffffffff8265e970 - 0xffffffff8265eba0 is set_sdt_argtypes_set in /boot/kernel/fusefs.ko 0xffffffff8265eba0 - 0xffffffff8265ebd8 is set_sysinit_set in /boot/kernel/fusefs.ko 0xffffffff8265ebd8 - 0xffffffff8265ebf8 is set_sysuninit_set in /boot/kernel/fusefs.ko 0xffffffff8265ebf8 - 0xffffffff8265ec60 is set_sysctl_set in /boot/kernel/fusefs.ko 0xffffffff8265ec60 - 0xffffffff8265ecc0 is .bss in /boot/kernel/fusefs.ko 0xffffffff8265ecc0 - 0xffffffff8265ecc8 is set_sdt_providers_set in /boot/kernel/fusefs.ko 0xffffffff8265ecc8 - 0xffffffff8265ece0 is set_modmetadata_set in /boot/kernel/fusefs.ko 0xffffffff8265ece0 - 0xffffffff8265ed04 is .note.gnu.build-id in /boot/kernel/fusefs.ko 0xffffffff82a00000 - 0xffffffff82cf4000 is .text in /boot/modules/amdgpu.ko 0xffffffff82cf4000 - 0xffffffff82dfa000 is .rodata in /boot/modules/amdgpu.ko 0xffffffff82dfa000 - 0xffffffff82e07378 is .bss in /boot/modules/amdgpu.ko 0xffffffff82e07380 - 0xffffffff82e0fd74 is .data in /boot/modules/amdgpu.ko 0xffffffff82e0fd78 - 0xffffffff82e10150 is set_sysctl_set in /boot/modules/amdgpu.ko 0xffffffff82e10150 - 0xffffffff82e10178 is set_sysinit_set in /boot/modules/amdgpu.ko 0xffffffff82e10178 - 0xffffffff82e10188 is set_sysuninit_set in 
/boot/modules/amdgpu.ko 0xffffffff82e10188 - 0xffffffff82e101e0 is set_modmetadata_set in /boot/modules/amdgpu.ko 0xffffffff82e101e0 - 0xffffffff82e10204 is .note.gnu.build-id in /boot/modules/amdgpu.ko --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff82918000 - 0xffffffff8296c000 is .text in /boot/modules/drm.ko 0xffffffff8296c000 - 0xffffffff82988000 is .rodata in /boot/modules/drm.ko 0xffffffff82988000 - 0xffffffff82988190 is .bss in /boot/modules/drm.ko 0xffffffff82988190 - 0xffffffff829899a8 is .data in /boot/modules/drm.ko 0xffffffff829899a8 - 0xffffffff82989a20 is set_sysinit_set in /boot/modules/drm.ko 0xffffffff82989a20 - 0xffffffff82989a80 is set_sysuninit_set in /boot/modules/drm.ko 0xffffffff82989a80 - 0xffffffff82989b50 is set_sysctl_set in /boot/modules/drm.ko 0xffffffff82989b50 - 0xffffffff82989b5c is .data.read_mostly in /boot/modules/drm.ko 0xffffffff82989b60 - 0xffffffff82989bd8 is set_modmetadata_set in /boot/modules/drm.ko 0xffffffff82989bd8 - 0xffffffff82989bfc is .note.gnu.build-id in /boot/modules/drm.ko 0xffffffff8298a000 - 0xffffffff8298b000 is .text in /boot/kernel/iic.ko 0xffffffff8298b000 - 0xffffffff8298c000 is .rodata in /boot/kernel/iic.ko 0xffffffff8298c000 - 0xffffffff8298c270 is .data in /boot/kernel/iic.ko 0xffffffff8298c270 - 0xffffffff8298c280 is set_sysinit_set in /boot/kernel/iic.ko 0xffffffff8298c280 - 0xffffffff8298c288 is set_sysuninit_set in /boot/kernel/iic.ko 0xffffffff8298c288 - 0xffffffff8298c2a8 is set_modmetadata_set in /boot/kernel/iic.ko 0xffffffff8298c2a8 - 0xffffffff8298c2b0 is .bss in /boot/kernel/iic.ko 0xffffffff8298c2b0 - 0xffffffff8298c2d4 is .note.gnu.build-id in /boot/kernel/iic.ko 0xffffffff8298d000 - 0xffffffff8298f000 is .text in /boot/modules/linuxkpi_gplv2.ko 0xffffffff8298f000 - 0xffffffff82990000 is .rodata in /boot/modules/linuxkpi_gplv2.ko 0xffffffff82990000 - 0xffffffff829900c8 is .data in /boot/modules/linuxkpi_gplv2.ko 0xffffffff829900c8 - 0xffffffff829900f0 is 
set_modmetadata_set in /boot/modules/linuxkpi_gplv2.ko 0xffffffff829900f0 - 0xffffffff829900f8 is set_sysinit_set in /boot/modules/linuxkpi_gplv2.ko 0xffffffff829900f8 - 0xffffffff829900fc is .bss in /boot/modules/linuxkpi_gplv2.ko 0xffffffff829900fc - 0xffffffff82990120 is .note.gnu.build-id in /boot/modules/linuxkpi_gplv2.ko 0xffffffff82991000 - 0xffffffff82996000 is .text in /boot/modules/dmabuf.ko 0xffffffff82996000 - 0xffffffff82997000 is .rodata in /boot/modules/dmabuf.ko 0xffffffff82997000 - 0xffffffff82997240 is .data in /boot/modules/dmabuf.ko 0xffffffff82997240 - 0xffffffff82997250 is set_modmetadata_set in /boot/modules/dmabuf.ko 0xffffffff82997250 - 0xffffffff82997268 is set_sysinit_set in /boot/modules/dmabuf.ko 0xffffffff82997268 - 0xffffffff82997280 is set_sysuninit_set in /boot/modules/dmabuf.ko --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff82997280 - 0xffffffff82997318 is .bss in /boot/modules/dmabuf.ko 0xffffffff82997318 - 0xffffffff8299733c is .note.gnu.build-id in /boot/modules/dmabuf.ko 0xffffffff82998000 - 0xffffffff829a2000 is .text in /boot/modules/ttm.ko 0xffffffff829a2000 - 0xffffffff829a3000 is .rodata in /boot/modules/ttm.ko 0xffffffff829a3000 - 0xffffffff829a3500 is .data in /boot/modules/ttm.ko 0xffffffff829a3500 - 0xffffffff829a3520 is set_sysinit_set in /boot/modules/ttm.ko 0xffffffff829a3520 - 0xffffffff829a3538 is set_sysuninit_set in /boot/modules/ttm.ko 0xffffffff829a3540 - 0xffffffff829a4720 is .bss in /boot/modules/ttm.ko 0xffffffff829a4720 - 0xffffffff829a4758 is set_modmetadata_set in /boot/modules/ttm.ko 0xffffffff829a4758 - 0xffffffff829a477c is .note.gnu.build-id in /boot/modules/ttm.ko 0xffffffff829a5000 - 0xffffffff829a6000 is .text in /boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a6000 - 0xffffffff829a7000 is .rodata in /boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a7000 - 0xffffffff829a713c is rodata in /boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a7140 - 
0xffffffff829a71f0 is .data in /boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a71f0 - 0xffffffff829a7210 is set_modmetadata_set in /boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a7210 - 0xffffffff829a7218 is set_sysinit_set in /boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a7218 - 0xffffffff829a723c is .note.gnu.build-id in /boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a8000 - 0xffffffff829a9000 is .text in /boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829a9000 - 0xffffffff829aa000 is .rodata in /boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829aa000 - 0xffffffff829ae400 is rodata in /boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829ae400 - 0xffffffff829ae4b0 is .data in /boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829ae4b0 - 0xffffffff829ae4d0 is set_modmetadata_set in /boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829ae4d0 - 0xffffffff829ae4d8 is set_sysinit_set in /boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829ae4d8 - 0xffffffff829ae4fc is .note.gnu.build-id in /boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829af000 - 0xffffffff829b0000 is .text in /boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829b0000 - 0xffffffff829b1000 is .rodata in /boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829b1000 - 0xffffffff829dd200 is rodata in /boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829dd200 - 0xffffffff829dd2b0 is .data in /boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829dd2b0 - 0xffffffff829dd2d0 is set_modmetadata_set in /boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829dd2d0 - 0xffffffff829dd2d8 is set_sysinit_set in /boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829dd2d8 - 0xffffffff829dd2fc is .note.gnu.build-id in /boot/modules/amdgpu_raven_asd_bin.ko --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff829de000 - 0xffffffff829df000 is .text in /boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829df000 - 0xffffffff829e0000 is .rodata in /boot/modules/amdgpu_raven_ta_bin.ko 
0xffffffff829e0000 - 0xffffffff829e7300 is rodata in /boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829e7300 - 0xffffffff829e73b0 is .data in /boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829e73b0 - 0xffffffff829e73d0 is set_modmetadata_set in /boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829e73d0 - 0xffffffff829e73d8 is set_sysinit_set in /boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829e73d8 - 0xffffffff829e73fc is .note.gnu.build-id in /boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829e8000 - 0xffffffff829e9000 is .text in /boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829e9000 - 0xffffffff829ea000 is .rodata in /boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829ea000 - 0xffffffff829ef480 is rodata in /boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829ef480 - 0xffffffff829ef530 is .data in /boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829ef530 - 0xffffffff829ef550 is set_modmetadata_set in /boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829ef550 - 0xffffffff829ef558 is set_sysinit_set in /boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829ef558 - 0xffffffff829ef57c is .note.gnu.build-id in /boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829f0000 - 0xffffffff829f1000 is .text in /boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f1000 - 0xffffffff829f2000 is .rodata in /boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f2000 - 0xffffffff829f6480 is rodata in /boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f6480 - 0xffffffff829f6530 is .data in /boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f6530 - 0xffffffff829f6550 is set_modmetadata_set in /boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f6550 - 0xffffffff829f6558 is set_sysinit_set in /boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f6558 - 0xffffffff829f657c is .note.gnu.build-id in /boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f7000 - 0xffffffff829f8000 is .text in /boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff829f8000 - 0xffffffff829f9000 is .rodata in /boot/modules/amdgpu_raven_ce_bin.ko 
0xffffffff829f9000 - 0xffffffff829fb480 is rodata in /boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff829fb480 - 0xffffffff829fb530 is .data in /boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff829fb530 - 0xffffffff829fb550 is set_modmetadata_set in /boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff829fb550 - 0xffffffff829fb558 is set_sysinit_set in /boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff829fb558 - 0xffffffff829fb57c is .note.gnu.build-id in /boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff82e11000 - 0xffffffff82e12000 is .text in /boot/modules/amdgpu_raven_rlc_bin.ko 0xffffffff82e12000 - 0xffffffff82e13000 is .rodata in /boot/modules/amdgpu_raven_rlc_bin.ko 0xffffffff82e13000 - 0xffffffff82e1c8e4 is rodata in /boot/modules/amdgpu_raven_rlc_bin.ko --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff82e1c8e8 - 0xffffffff82e1c998 is .data in /boot/modules/amdgpu_raven_rlc_bin.ko 0xffffffff82e1c998 - 0xffffffff82e1c9b8 is set_modmetadata_set in /boot/modules/amdgpu_raven_rlc_bin.ko 0xffffffff82e1c9b8 - 0xffffffff82e1c9c0 is set_sysinit_set in /boot/modules/amdgpu_raven_rlc_bin.ko 0xffffffff82e1c9c0 - 0xffffffff82e1c9e4 is .note.gnu.build-id in /boot/modules/amdgpu_raven_rlc_bin.ko 0xffffffff82e1d000 - 0xffffffff82e1e000 is .text in /boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e1e000 - 0xffffffff82e1f000 is .rodata in /boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e1f000 - 0xffffffff82e60710 is rodata in /boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e60710 - 0xffffffff82e607c0 is .data in /boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e607c0 - 0xffffffff82e607e0 is set_modmetadata_set in /boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e607e0 - 0xffffffff82e607e8 is set_sysinit_set in /boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e607e8 - 0xffffffff82e6080c is .note.gnu.build-id in /boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e61000 - 0xffffffff82e62000 is .text in /boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82e62000 - 
0xffffffff82e63000 is .rodata in /boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82e63000 - 0xffffffff82ea4710 is rodata in /boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82ea4710 - 0xffffffff82ea47c0 is .data in /boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82ea47c0 - 0xffffffff82ea47e0 is set_modmetadata_set in /boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82ea47e0 - 0xffffffff82ea47e8 is set_sysinit_set in /boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82ea47e8 - 0xffffffff82ea480c is .note.gnu.build-id in /boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82ea5000 - 0xffffffff82ea6000 is .text in /boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff82ea6000 - 0xffffffff82ea7000 is .rodata in /boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff82ea7000 - 0xffffffff82eff560 is rodata in /boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff82eff560 - 0xffffffff82eff610 is .data in /boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff82eff610 - 0xffffffff82eff630 is set_modmetadata_set in /boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff82eff630 - 0xffffffff82eff638 is set_sysinit_set in /boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff82eff638 - 0xffffffff82eff65c is .note.gnu.build-id in /boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff83000000 - 0xffffffff8324c000 is .text in /boot/kernel/zfs.ko 0xffffffff8324c000 - 0xffffffff832dc000 is .rodata in /boot/kernel/zfs.ko 0xffffffff832dc000 - 0xffffffff832fe228 is .data in /boot/kernel/zfs.ko 0xffffffff832fe228 - 0xffffffff832fe318 is set_sysinit_set in /boot/kernel/zfs.ko 0xffffffff832fe318 - 0xffffffff832fe398 is set_sysuninit_set in /boot/kernel/zfs.ko 0xffffffff832fe400 - 0xffffffff833b79c8 is .bss in /boot/kernel/zfs.ko --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff833b79c8 - 0xffffffff833b85e8 is set_sysctl_set in /boot/kernel/zfs.ko 0xffffffff833b85e8 - 0xffffffff833b8810 is set_sdt_probes_set in /boot/kernel/zfs.ko 0xffffffff833b8810 - 0xffffffff833b8c30 is set_sdt_argtypes_set in 
/boot/kernel/zfs.ko 0xffffffff833b8c30 - 0xffffffff833b8c98 is set_modmetadata_set in /boot/kernel/zfs.ko 0xffffffff833b8c98 - 0xffffffff833b8cbc is .note.gnu.build-id in /boot/kernel/zfs.ko 0xffffffff82f05000 - 0xffffffff82f0d000 is .text in /boot/kernel/netgraph.ko 0xffffffff82f0d000 - 0xffffffff82f0f000 is .rodata in /boot/kernel/netgraph.ko 0xffffffff82f0f000 - 0xffffffff82f0f900 is .data in /boot/kernel/netgraph.ko 0xffffffff82f0f900 - 0xffffffff82f0f918 is set_modmetadata_set in /boot/kernel/netgraph.ko 0xffffffff82f0f918 - 0xffffffff82f0f960 is set_sysinit_set in /boot/kernel/netgraph.ko 0xffffffff82f0f960 - 0xffffffff82f0f9a0 is set_sysuninit_set in /boot/kernel/netgraph.ko 0xffffffff82f0f9a0 - 0xffffffff82f0f9d8 is set_vnet in /boot/kernel/netgraph.ko 0xffffffff82f0f9d8 - 0xffffffff82f0fa98 is .bss in /boot/kernel/netgraph.ko 0xffffffff82f0fa98 - 0xffffffff82f0fac8 is set_sysctl_set in /boot/kernel/netgraph.ko 0xffffffff82f0fac8 - 0xffffffff82f0faec is .note.gnu.build-id in /boot/kernel/netgraph.ko 0xffffffff829fc000 - 0xffffffff829fe000 is .text in /boot/kernel/acpi_wmi.ko 0xffffffff829fe000 - 0xffffffff829ff000 is .rodata in /boot/kernel/acpi_wmi.ko 0xffffffff829ff000 - 0xffffffff829ff2f8 is .data in /boot/kernel/acpi_wmi.ko 0xffffffff829ff2f8 - 0xffffffff829ff310 is set_sysinit_set in /boot/kernel/acpi_wmi.ko 0xffffffff829ff310 - 0xffffffff829ff320 is set_sysuninit_set in /boot/kernel/acpi_wmi.ko 0xffffffff829ff320 - 0xffffffff829ff350 is set_modmetadata_set in /boot/kernel/acpi_wmi.ko 0xffffffff829ff350 - 0xffffffff829ff378 is .bss in /boot/kernel/acpi_wmi.ko 0xffffffff829ff378 - 0xffffffff829ff39c is .note.gnu.build-id in /boot/kernel/acpi_wmi.ko 0xffffffff82f00000 - 0xffffffff82f02000 is .text in /boot/kernel/intpm.ko 0xffffffff82f02000 - 0xffffffff82f03000 is .rodata in /boot/kernel/intpm.ko 0xffffffff82f03000 - 0xffffffff82f031c8 is .data in /boot/kernel/intpm.ko 0xffffffff82f031c8 - 0xffffffff82f03200 is set_modmetadata_set in 
/boot/kernel/intpm.ko 0xffffffff82f03200 - 0xffffffff82f03210 is set_sysinit_set in /boot/kernel/intpm.ko 0xffffffff82f03210 - 0xffffffff82f03218 is .bss in /boot/kernel/intpm.ko 0xffffffff82f03218 - 0xffffffff82f0323c is .note.gnu.build-id in /boot/kernel/intpm.ko 0xffffffff82f10000 - 0xffffffff82f11000 is .text in /boot/kernel/smbus.ko --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff82f11000 - 0xffffffff82f12000 is .rodata in /boot/kernel/smbus.ko 0xffffffff82f12000 - 0xffffffff82f1216c is .data in /boot/kernel/smbus.ko 0xffffffff82f12170 - 0xffffffff82f12178 is set_modmetadata_set in /boot/kernel/smbus.ko 0xffffffff82f12178 - 0xffffffff82f12180 is .bss in /boot/kernel/smbus.ko 0xffffffff82f12180 - 0xffffffff82f121a4 is .note.gnu.build-id in /boot/kernel/smbus.ko 0xffffffff82f13000 - 0xffffffff82f15000 is .text in /boot/kernel/uhid.ko 0xffffffff82f15000 - 0xffffffff82f16000 is .rodata in /boot/kernel/uhid.ko 0xffffffff82f16000 - 0xffffffff82f162a4 is .data in /boot/kernel/uhid.ko 0xffffffff82f162a8 - 0xffffffff82f162b8 is set_sysctl_set in /boot/kernel/uhid.ko 0xffffffff82f162b8 - 0xffffffff82f162e8 is set_modmetadata_set in /boot/kernel/uhid.ko 0xffffffff82f162e8 - 0xffffffff82f162f0 is set_sysinit_set in /boot/kernel/uhid.ko 0xffffffff82f162f0 - 0xffffffff82f16300 is .bss in /boot/kernel/uhid.ko 0xffffffff82f16300 - 0xffffffff82f16340 is usb_host_id in /boot/kernel/uhid.ko 0xffffffff82f16340 - 0xffffffff82f16364 is .note.gnu.build-id in /boot/kernel/uhid.ko 0xffffffff82f17000 - 0xffffffff82f19000 is .text in /boot/kernel/usbhid.ko 0xffffffff82f19000 - 0xffffffff82f1a000 is .rodata in /boot/kernel/usbhid.ko 0xffffffff82f1a000 - 0xffffffff82f1a290 is .data in /boot/kernel/usbhid.ko 0xffffffff82f1a290 - 0xffffffff82f1a2a8 is set_sysctl_set in /boot/kernel/usbhid.ko 0xffffffff82f1a2a8 - 0xffffffff82f1a2e0 is set_modmetadata_set in /boot/kernel/usbhid.ko 0xffffffff82f1a2e0 - 0xffffffff82f1a2e8 is set_sysinit_set in /boot/kernel/usbhid.ko 
0xffffffff82f1a2e8 - 0xffffffff82f1a2f8 is .bss in /boot/kernel/usbhid.ko 0xffffffff82f1a300 - 0xffffffff82f1a380 is usb_host_id in /boot/kernel/usbhid.ko 0xffffffff82f1a380 - 0xffffffff82f1a3a4 is .note.gnu.build-id in /boot/kernel/usbhid.ko 0xffffffff82f1b000 - 0xffffffff82f1d000 is .text in /boot/kernel/hidbus.ko 0xffffffff82f1d000 - 0xffffffff82f1e000 is .rodata in /boot/kernel/hidbus.ko 0xffffffff82f1e000 - 0xffffffff82f1e008 is .bss in /boot/kernel/hidbus.ko 0xffffffff82f1e008 - 0xffffffff82f1e258 is .data in /boot/kernel/hidbus.ko 0xffffffff82f1e258 - 0xffffffff82f1e298 is set_modmetadata_set in /boot/kernel/hidbus.ko 0xffffffff82f1e298 - 0xffffffff82f1e2b0 is set_sysinit_set in /boot/kernel/hidbus.ko 0xffffffff82f1e2b0 - 0xffffffff82f1e2d4 is .note.gnu.build-id in /boot/kernel/hidbus.ko 0xffffffff82f1f000 - 0xffffffff82f21000 is .text in /boot/kernel/wmt.ko --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff82f21000 - 0xffffffff82f22000 is .rodata in /boot/kernel/wmt.ko 0xffffffff82f22000 - 0xffffffff82f22290 is .data in /boot/kernel/wmt.ko 0xffffffff82f22290 - 0xffffffff82f222a8 is set_sysctl_set in /boot/kernel/wmt.ko 0xffffffff82f222a8 - 0xffffffff82f222e0 is set_modmetadata_set in /boot/kernel/wmt.ko 0xffffffff82f222e0 - 0xffffffff82f222e8 is set_sysinit_set in /boot/kernel/wmt.ko 0xffffffff82f222e8 - 0xffffffff82f222f8 is .bss in /boot/kernel/wmt.ko 0xffffffff82f22300 - 0xffffffff82f22320 is usb_host_id in /boot/kernel/wmt.ko 0xffffffff82f22320 - 0xffffffff82f22344 is .note.gnu.build-id in /boot/kernel/wmt.ko 0xffffffff82f23000 - 0xffffffff82f26000 is .text in /boot/kernel/ums.ko 0xffffffff82f26000 - 0xffffffff82f27000 is .rodata in /boot/kernel/ums.ko 0xffffffff82f27000 - 0xffffffff82f272c0 is .data in /boot/kernel/ums.ko 0xffffffff82f272c0 - 0xffffffff82f272d0 is set_sysctl_set in /boot/kernel/ums.ko 0xffffffff82f272e0 - 0xffffffff82f27300 is usb_host_id in /boot/kernel/ums.ko 0xffffffff82f27300 - 0xffffffff82f27338 is 
set_modmetadata_set in /boot/kernel/ums.ko 0xffffffff82f27338 - 0xffffffff82f27340 is set_sysinit_set in /boot/kernel/ums.ko 0xffffffff82f27340 - 0xffffffff82f27350 is .bss in /boot/kernel/ums.ko 0xffffffff82f27350 - 0xffffffff82f27374 is .note.gnu.build-id in /boot/kernel/ums.ko 0xffffffff82f28000 - 0xffffffff82f2c000 is .text in /boot/kernel/autofs.ko 0xffffffff82f2c000 - 0xffffffff82f2d000 is .rodata in /boot/kernel/autofs.ko 0xffffffff82f2d000 - 0xffffffff82f2da24 is .data in /boot/kernel/autofs.ko 0xffffffff82f2da28 - 0xffffffff82f2da78 is set_sysinit_set in /boot/kernel/autofs.ko 0xffffffff82f2da78 - 0xffffffff82f2da80 is set_sysuninit_set in /boot/kernel/autofs.ko 0xffffffff82f2da80 - 0xffffffff82f2dac0 is set_sysctl_set in /boot/kernel/autofs.ko 0xffffffff82f2dac0 - 0xffffffff82f2dae0 is .bss in /boot/kernel/autofs.ko 0xffffffff82f2dae0 - 0xffffffff82f2daf8 is set_modmetadata_set in /boot/kernel/autofs.ko 0xffffffff82f2daf8 - 0xffffffff82f2db1c is .note.gnu.build-id in /boot/kernel/autofs.ko 0xffffffff82f2e000 - 0xffffffff82f2f000 is .text in /boot/kernel/mac_ntpd.ko 0xffffffff82f2f000 - 0xffffffff82f30000 is .rodata in /boot/kernel/mac_ntpd.ko 0xffffffff82f30000 - 0xffffffff82f309d0 is .data in /boot/kernel/mac_ntpd.ko 0xffffffff82f309d0 - 0xffffffff82f309e8 is set_sysctl_set in /boot/kernel/mac_ntpd.ko 0xffffffff82f309e8 - 0xffffffff82f30a00 is set_modmetadata_set in /boot/kernel/mac_ntpd.ko --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff82f30a00 - 0xffffffff82f30a08 is set_sysinit_set in /boot/kernel/mac_ntpd.ko 0xffffffff82f30a08 - 0xffffffff82f30a2c is .note.gnu.build-id in /boot/kernel/mac_ntpd.ko 0xffffffff82f31000 - 0xffffffff82f32000 is .text in /boot/kernel/green_saver.ko 0xffffffff82f32000 - 0xffffffff82f33000 is .rodata in /boot/kernel/green_saver.ko 0xffffffff82f33000 - 0xffffffff82f330cc is .data in /boot/kernel/green_saver.ko 0xffffffff82f330d0 - 0xffffffff82f330e8 is set_modmetadata_set in 
/boot/kernel/green_saver.ko 0xffffffff82f330e8 - 0xffffffff82f330f0 is set_sysinit_set in /boot/kernel/green_saver.ko 0xffffffff82f330f0 - 0xffffffff82f33114 is .note.gnu.build-id in /boot/kernel/green_saver.ko (kgdb) disass 0xffffffff80cf0110 Dump of assembler code for function strcmp: 0xffffffff80cf0100 <+0>: push %rbp 0xffffffff80cf0101 <+1>: mov %rsp,%rbp 0xffffffff80cf0104 <+4>: xor %ecx,%ecx 0xffffffff80cf0106 <+6>: cs nopw 0x0(%rax,%rax,1) 0xffffffff80cf0110 <+16>: movzbl (%rdi,%rcx,1),%eax 0xffffffff80cf0114 <+20>: movzbl (%rsi,%rcx,1),%edx 0xffffffff80cf0118 <+24>: cmp %dl,%al 0xffffffff80cf011a <+26>: jne 0xffffffff80cf0127 <strcmp+39> 0xffffffff80cf011c <+28>: inc %rcx 0xffffffff80cf011f <+31>: test %eax,%eax 0xffffffff80cf0121 <+33>: jne 0xffffffff80cf0110 <strcmp+16> 0xffffffff80cf0123 <+35>: xor %eax,%eax 0xffffffff80cf0125 <+37>: pop %rbp 0xffffffff80cf0126 <+38>: ret 0xffffffff80cf0127 <+39>: sub %edx,%eax 0xffffffff80cf0129 <+41>: pop %rbp 0xffffffff80cf012a <+42>: ret End of assembler dump. (kgdb)
Interesting. Your context: Local exec file: `/boot/kernel/kernel', file type elf64-x86-64-freebsd. Entry point: 0xffffffff8038e000 0xffffffff802002a8 - 0xffffffff802002b5 is .interp 0xffffffff802002b8 - 0xffffffff80231108 is .hash 0xffffffff80231108 - 0xffffffff8025f9e4 is .gnu.hash 0xffffffff8025f9e8 - 0xffffffff802f24c0 is .dynsym 0xffffffff802f24c0 - 0xffffffff8036d162 is .dynstr 0xffffffff8036d168 - 0xffffffff8038db08 is .rela.dyn 0xffffffff8038e000 - 0xffffffff811843f8 is .text 0xffffffff81184400 - 0xffffffff817f68d0 is .rodata . . . The downloaded kernel.txz expanded: Local exec file: `/usr/home/root/artifacts/13.4R/boot/kernel/kernel', file type elf64-x86-64-freebsd. Entry point: 0xffffffff8038e000 0xffffffff802002a8 - 0xffffffff802002b5 is .interp 0xffffffff802002b8 - 0xffffffff802310f0 is .hash 0xffffffff802310f0 - 0xffffffff8025f9c0 is .gnu.hash 0xffffffff8025f9c0 - 0xffffffff802f2450 is .dynsym 0xffffffff802f2450 - 0xffffffff8036d0c4 is .dynstr 0xffffffff8036d0c8 - 0xffffffff8038da68 is .rela.dyn 0xffffffff8038e000 - 0xffffffff811863f8 is .text 0xffffffff81186400 - 0xffffffff817f8c20 is .rodata . . . And, your context: (kgdb) disass 0xffffffff80cf0110 Dump of assembler code for function strcmp: 0xffffffff80cf0100 <+0>: push %rbp 0xffffffff80cf0101 <+1>: mov %rsp,%rbp 0xffffffff80cf0104 <+4>: xor %ecx,%ecx 0xffffffff80cf0106 <+6>: cs nopw 0x0(%rax,%rax,1) 0xffffffff80cf0110 <+16>: movzbl (%rdi,%rcx,1),%eax 0xffffffff80cf0114 <+20>: movzbl (%rsi,%rcx,1),%edx 0xffffffff80cf0118 <+24>: cmp %dl,%al 0xffffffff80cf011a <+26>: jne 0xffffffff80cf0127 <strcmp+39> 0xffffffff80cf011c <+28>: inc %rcx 0xffffffff80cf011f <+31>: test %eax,%eax 0xffffffff80cf0121 <+33>: jne 0xffffffff80cf0110 <strcmp+16> 0xffffffff80cf0123 <+35>: xor %eax,%eax 0xffffffff80cf0125 <+37>: pop %rbp 0xffffffff80cf0126 <+38>: ret 0xffffffff80cf0127 <+39>: sub %edx,%eax 0xffffffff80cf0129 <+41>: pop %rbp 0xffffffff80cf012a <+42>: ret End of assembler dump. 
My 13.4-RELEASE kernel.txz expansion:
(kgdb) disass strcmp
Dump of assembler code for function strcmp:
0xffffffff80cf2290 <+0>: push %rbp
0xffffffff80cf2291 <+1>: mov %rsp,%rbp
0xffffffff80cf2294 <+4>: xor %ecx,%ecx
0xffffffff80cf2296 <+6>: cs nopw 0x0(%rax,%rax,1)
0xffffffff80cf22a0 <+16>: movzbl (%rdi,%rcx,1),%eax
0xffffffff80cf22a4 <+20>: movzbl (%rsi,%rcx,1),%edx
0xffffffff80cf22a8 <+24>: cmp %dl,%al
0xffffffff80cf22aa <+26>: jne 0xffffffff80cf22b7 <strcmp+39>
0xffffffff80cf22ac <+28>: inc %rcx
0xffffffff80cf22af <+31>: test %eax,%eax
0xffffffff80cf22b1 <+33>: jne 0xffffffff80cf22a0 <strcmp+16>
0xffffffff80cf22b3 <+35>: xor %eax,%eax
0xffffffff80cf22b5 <+37>: pop %rbp
0xffffffff80cf22b6 <+38>: ret
0xffffffff80cf22b7 <+39>: sub %edx,%eax
0xffffffff80cf22b9 <+41>: pop %rbp
0xffffffff80cf22ba <+42>: ret
End of assembler dump.
Same code, different address range. It does not look like I can investigate backtraces via just the kernel*.txz contents. That invalidates a lot of my older notes that relied on doing so. strcmp in your context is at least believable, presuming the address read from the blur is accurate: the address is the start of an instruction, and one that could well produce the general protection fault:
0xffffffff80cf0110 <+16>: movzbl (%rdi,%rcx,1),%eax
Most bad values would not point to the intended start of an instruction, let alone one that could generate the initially reported failure. For reference, for the kernel.txz expansion:
# strings boot/kernel/kernel | grep "\-RELEASE "
@(#)FreeBSD 13.4-RELEASE releng/13.4-n258257-58066db597be GENERIC
FreeBSD 13.4-RELEASE releng/13.4-n258257-58066db597be GENERIC
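The "same code, different address range" observation can be checked with a little arithmetic. The following is only a sketch: the two start addresses and the faulting address come from the listings above, the 43-byte length is inferred from the final `ret` at `<+42>`, and the helper name `offset_in` is made up for illustration.

```python
# Start of strcmp in the reporter's running kernel and in the stock
# 13.4-RELEASE kernel.txz, both taken from the disassembly listings.
STRCMP_LOCAL = 0xffffffff80cf0100
STRCMP_STOCK = 0xffffffff80cf2290
STRCMP_LEN = 43  # inferred: last instruction is a 1-byte ret at <+42>

FAULT_ADDR = 0xffffffff80cf0110  # address examined with "disass"


def offset_in(func_start, func_len, addr):
    """Return addr's offset inside the function, or None if it is outside."""
    off = addr - func_start
    return off if 0 <= off < func_len else None


# How far the function moved between the two builds.
shift = STRCMP_STOCK - STRCMP_LOCAL

print("shift between builds:", hex(shift))
print("local kernel :", offset_in(STRCMP_LOCAL, STRCMP_LEN, FAULT_ADDR))
print("stock kernel :", offset_in(STRCMP_STOCK, STRCMP_LEN, FAULT_ADDR))
```

Under these assumptions the address resolves to strcmp+16 only against the reporter's own kernel, which is why symbols from a non-matching kernel.txz cannot be used to interpret the backtrace.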
I was running 13.1-RELEASE when this bug was filed; 13.2-RELEASE later on (both those kernels no longer exist); 13.3 up until last Saturday; and now 13.4, whence the most recent output in the last couple of comments ...
(In reply to George Mitchell from comment #215) My means of picking up kernel files to look at did not match the patch level involved, just the 13.* release each time. I was making a bad assumption by doing so and only noticed now. So far as I know, there are no pre-made official distributions of the patched variants of the kernel files for RELEASE. I do not know whether PkgBase for 14.* release builds would match a 14.*-RELEASE-p* well enough for the purpose. 13.* has no PkgBase distributions to try. https://pkg.freebsd.org/FreeBSD:14:amd64/base_release_*/
comment #5 and comment #82 both identify strcmp as the failure context, and both identify that it was a strcmp during modlist_lookup that failed in those examples. This is part of the linker_load_module activity, which the backtrace in your recent example also shows. comment #5 was for a context attempting to find "zfs". comment #82 was for a context attempting to find "acpi_wmi". (That aspect varies across the failures.) The comment #82 notes are likely the closest to being failure details, as far as I can tell. Only the kgdb backtrace seems to be all that useful; it is what comment #5 and comment #82 were based on, apparently with correct contexts matching the live system kernel of the times in question.
(In reply to Mark Millard from comment #217) I'll note that linker_load_dependencies that shows up in the modern example kernel (non-kgdb) backtrace also calls modlist_lookup . It also uses strcmp directly. And it also calls modlist_lookup2 that in turn calls modlist_lookup . It can also recurse back out to linker_load_module. It appears that getting a dump during the initial general protection fault and getting it savecore'd and crashinfo'd so that we can see a kgdb backtrace is what would be the primary next-useful-thing. I also hope that: static modlisthead_t found_modules; can be examined to see if the modlist has any bad name pointers that strcmp ends up trying to use.
For a successful boot, could you try: # kgdb . . . (kgdb) disass/s linker_load_dependencies . . . ( a range around offset +0x274 ) . . . If you have /usr/src/ in place as a copy of the source for 13.4-RELEASE, the "/s" should also show the related source code, but following code generation order, not source code order. As things may be inlined, you may see the strcmp and such from called routines, not just source from linker_load_dependencies itself. This might give an idea of which phase linker_load_dependencies was in when it ended up leading to the failure during strcmp .
root@court:/home/george # kgdb GNU gdb (GDB) 15.1 [GDB v15.1 for FreeBSD] Copyright (C) 2024 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-portbld-freebsd13.4". Type "show configuration" for configuration details. For bug reporting instructions, please see: <https://www.gnu.org/software/gdb/bugs/>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from /boot/kernel/kernel... Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug... Reading symbols from /boot/kernel/sem.ko... Reading symbols from /usr/lib/debug//boot/kernel/sem.ko.debug... Reading symbols from /boot/modules/if_re.ko... (No debugging symbols found in /boot/modules/if_re.ko) Reading symbols from /boot/kernel/fusefs.ko... Reading symbols from /usr/lib/debug//boot/kernel/fusefs.ko.debug... Reading symbols from /boot/modules/amdgpu.ko... (No debugging symbols found in /boot/modules/amdgpu.ko) Reading symbols from /boot/modules/drm.ko... (No debugging symbols found in /boot/modules/drm.ko) Reading symbols from /boot/kernel/iic.ko... Reading symbols from /usr/lib/debug//boot/kernel/iic.ko.debug... Reading symbols from /boot/modules/linuxkpi_gplv2.ko... (No debugging symbols found in /boot/modules/linuxkpi_gplv2.ko) --Type <RET> for more, q to quit, c to continue without paging--c Reading symbols from /boot/modules/dmabuf.ko... (No debugging symbols found in /boot/modules/dmabuf.ko) Reading symbols from /boot/modules/ttm.ko... (No debugging symbols found in /boot/modules/ttm.ko) Reading symbols from /boot/modules/amdgpu_raven_gpu_info_bin.ko... 
(No debugging symbols found in /boot/modules/amdgpu_raven_gpu_info_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_sdma_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_sdma_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_asd_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_asd_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_ta_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_ta_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_pfp_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_pfp_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_me_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_me_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_ce_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_ce_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_rlc_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_rlc_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_mec_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_mec_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_mec2_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_mec2_bin.ko) Reading symbols from /boot/modules/amdgpu_raven_vcn_bin.ko... (No debugging symbols found in /boot/modules/amdgpu_raven_vcn_bin.ko) Reading symbols from /boot/kernel/zfs.ko... Reading symbols from /usr/lib/debug//boot/kernel/zfs.ko.debug... Reading symbols from /boot/kernel/netgraph.ko... Reading symbols from /usr/lib/debug//boot/kernel/netgraph.ko.debug... Reading symbols from /boot/kernel/acpi_wmi.ko... Reading symbols from /usr/lib/debug//boot/kernel/acpi_wmi.ko.debug... Reading symbols from /boot/kernel/intpm.ko... Reading symbols from /usr/lib/debug//boot/kernel/intpm.ko.debug... Reading symbols from /boot/kernel/smbus.ko... Reading symbols from /usr/lib/debug//boot/kernel/smbus.ko.debug... 
Reading symbols from /boot/kernel/uhid.ko... Reading symbols from /usr/lib/debug//boot/kernel/uhid.ko.debug... Reading symbols from /boot/kernel/usbhid.ko... Reading symbols from /usr/lib/debug//boot/kernel/usbhid.ko.debug... Reading symbols from /boot/kernel/hidbus.ko... Reading symbols from /usr/lib/debug//boot/kernel/hidbus.ko.debug... Reading symbols from /boot/kernel/wmt.ko... Reading symbols from /usr/lib/debug//boot/kernel/wmt.ko.debug... Reading symbols from /boot/kernel/ums.ko... Reading symbols from /usr/lib/debug//boot/kernel/ums.ko.debug... Reading symbols from /boot/kernel/autofs.ko... Reading symbols from /usr/lib/debug//boot/kernel/autofs.ko.debug... Reading symbols from /boot/kernel/mac_ntpd.ko... Reading symbols from /usr/lib/debug//boot/kernel/mac_ntpd.ko.debug... Reading symbols from /boot/kernel/green_saver.ko... Reading symbols from /usr/lib/debug//boot/kernel/green_saver.ko.debug... sched_switch (td=td@entry=0xffffffff82043780 <thread0_st>, flags=flags@entry=260) at /usr/src/sys/kern/sched_4bsd.c:1085 1085 SDT_PROBE0(sched, , , on__cpu); (kgdb) disass/s linker_load_dependencies Dump of assembler code for function linker_load_dependencies: /usr/src/sys/kern/kern_linker.c: 2213 { 0xffffffff80bc0840 <+0>: push %rbp 0xffffffff80bc0841 <+1>: mov %rsp,%rbp 0xffffffff80bc0844 <+4>: push %r15 0xffffffff80bc0846 <+6>: push %r14 0xffffffff80bc0848 <+8>: push %r13 0xffffffff80bc084a <+10>: push %r12 0xffffffff80bc084c <+12>: push %rbx 0xffffffff80bc084d <+13>: sub $0x18,%rsp 0xffffffff80bc0851 <+17>: mov %rdi,%r14 2214 linker_file_t lfdep; 2215 struct mod_metadata **start, **stop, **mdp, **nmdp; 2216 struct mod_metadata *mp, *nmp; 2217 const struct mod_depend *verinfo; 2218 modlist_t mod; 2219 const char *modname, *nmodname; 2220 int ver, error = 0; 2221 2222 /* 2223 * All files are dependent on /kernel. 
2224 */ 2225 sx_assert(&kld_sx, SA_XLOCKED); 2226 if (linker_kernel_file) { 0xffffffff80bc0854 <+20>: mov 0x1484f6d(%rip),%rbx # 0xffffffff820457c8 <linker_kernel_file> 0xffffffff80bc085b <+27>: test %rbx,%rbx 0xffffffff80bc085e <+30>: je 0xffffffff80bc0895 <linker_load_dependencies+85> 2227 linker_kernel_file->refs++; --Type <RET> for more, q to quit, c to continue without paging--c 0xffffffff80bc0860 <+32>: incl 0x8(%rbx) 773 file->deps = realloc(file->deps, (file->ndeps + 1) * sizeof(*newdeps), 0xffffffff80bc0863 <+35>: mov 0x78(%r14),%rdi 0xffffffff80bc0867 <+39>: mov 0x70(%r14),%eax 0xffffffff80bc086b <+43>: inc %eax 0xffffffff80bc086d <+45>: movslq %eax,%rsi 0xffffffff80bc0870 <+48>: shl $0x3,%rsi 0xffffffff80bc0874 <+52>: mov $0xffffffff81cc6be0,%rdx 0xffffffff80bc087b <+59>: mov $0x102,%ecx 0xffffffff80bc0880 <+64>: call 0xffffffff80bc84f0 <realloc> 0xffffffff80bc0885 <+69>: mov %rax,0x78(%r14) 774 M_LINKER, M_WAITOK | M_ZERO); 775 file->deps[file->ndeps] = dep; 0xffffffff80bc0889 <+73>: movslq 0x70(%r14),%rcx 0xffffffff80bc088d <+77>: mov %rbx,(%rax,%rcx,8) 776 file->ndeps++; 0xffffffff80bc0891 <+81>: incl 0x70(%r14) ./linker_if.h: 142 KOBJOPLOOKUP(((kobj_t)file)->ops,linker_lookup_set); 0xffffffff80bc0895 <+85>: mov (%r14),%rcx 0xffffffff80bc0898 <+88>: movzbl 0x1112839(%rip),%edx # 0xffffffff81cd30d8 <linker_lookup_set_desc> 0xffffffff80bc089f <+95>: mov (%rcx,%rdx,8),%rax 0xffffffff80bc08a3 <+99>: cmpq $0xffffffff81cd30d8,(%rax) 0xffffffff80bc08aa <+106>: je 0xffffffff80bc08c3 <linker_load_dependencies+131> 0xffffffff80bc08ac <+108>: lea (%rcx,%rdx,8),%rsi 0xffffffff80bc08b0 <+112>: mov 0x800(%rcx),%rdi 0xffffffff80bc08b7 <+119>: mov $0xffffffff81cd30d8,%rdx 0xffffffff80bc08be <+126>: call 0xffffffff80c3ca30 <kobj_lookup_method> 0xffffffff80bc08c3 <+131>: xor %ebx,%ebx 0xffffffff80bc08c5 <+133>: lea -0x38(%rbp),%rdx 0xffffffff80bc08c9 <+137>: lea -0x30(%rbp),%rcx 143 rc = ((linker_lookup_set_t *) _m)(file, name, start, stop, count); 0xffffffff80bc08cd 
<+141>: mov %r14,%rdi 0xffffffff80bc08d0 <+144>: mov $0xffffffff8122e707,%rsi 0xffffffff80bc08d7 <+151>: xor %r8d,%r8d 0xffffffff80bc08da <+154>: call *0x8(%rax) /usr/src/sys/kern/kern_linker.c: 2231 NULL) != 0) 0xffffffff80bc08dd <+157>: test %eax,%eax 2230 if (linker_file_lookup_set(lf, MDT_SETNAME, &start, &stop, 0xffffffff80bc08df <+159>: jne 0xffffffff80bc0af7 <linker_load_dependencies+695> 2232 return (0); 2233 for (mdp = start; mdp < stop; mdp++) { 0xffffffff80bc08e5 <+165>: mov -0x38(%rbp),%r15 0xffffffff80bc08e9 <+169>: mov -0x30(%rbp),%rdx 0xffffffff80bc08ed <+173>: cmp %rdx,%r15 0xffffffff80bc08f0 <+176>: jb 0xffffffff80bc0a76 <linker_load_dependencies+566> 2244 return (EEXIST); 2245 } 2246 } 2247 2248 for (mdp = start; mdp < stop; mdp++) { 0xffffffff80bc08f6 <+182>: cmp %rdx,%r15 0xffffffff80bc08f9 <+185>: jae 0xffffffff80bc0aea <linker_load_dependencies+682> 0xffffffff80bc08ff <+191>: mov %r14,-0x40(%rbp) 0xffffffff80bc0903 <+195>: jmp 0xffffffff80bc0941 <linker_load_dependencies+257> 2269 linker_file_add_dependency(lf, lfdep); 2270 continue; 2271 } 2272 error = linker_load_module(NULL, modname, lf, verinfo, NULL); 0xffffffff80bc0905 <+197>: xor %edi,%edi 0xffffffff80bc0907 <+199>: mov %r12,%rsi 0xffffffff80bc090a <+202>: mov -0x40(%rbp),%r14 0xffffffff80bc090e <+206>: mov %r14,%rdx 0xffffffff80bc0911 <+209>: mov %r13,%rcx 0xffffffff80bc0914 <+212>: xor %r8d,%r8d 0xffffffff80bc0917 <+215>: call 0xffffffff80bbd3f0 <linker_load_module> 2273 if (error) { 0xffffffff80bc091c <+220>: test %eax,%eax 0xffffffff80bc091e <+222>: jne 0xffffffff80bc0b17 <linker_load_dependencies+727> 0xffffffff80bc0924 <+228>: data16 data16 cs nopw 0x0(%rax,%rax,1) 2248 for (mdp = start; mdp < stop; mdp++) { 0xffffffff80bc0930 <+240>: add $0x8,%r15 0xffffffff80bc0934 <+244>: mov -0x30(%rbp),%rdx 0xffffffff80bc0938 <+248>: cmp %rdx,%r15 0xffffffff80bc093b <+251>: jae 0xffffffff80bc0ae6 <linker_load_dependencies+678> 2249 mp = *mdp; 0xffffffff80bc0941 <+257>: mov (%r15),%rax 2250 if 
(mp->md_type != MDT_DEPEND) 0xffffffff80bc0944 <+260>: cmpl $0x1,0x4(%rax) 0xffffffff80bc0948 <+264>: jne 0xffffffff80bc0930 <linker_load_dependencies+240> 2253 verinfo = mp->md_data; 0xffffffff80bc094a <+266>: mov 0x8(%rax),%r13 2252 modname = mp->md_cval; 0xffffffff80bc094e <+270>: mov 0x10(%rax),%r12 2254 nmodname = NULL; 2255 for (nmdp = start; nmdp < stop; nmdp++) { 0xffffffff80bc0952 <+274>: mov -0x38(%rbp),%rbx 0xffffffff80bc0956 <+278>: jmp 0xffffffff80bc0964 <linker_load_dependencies+292> 0xffffffff80bc0958 <+280>: nopl 0x0(%rax,%rax,1) 0xffffffff80bc0960 <+288>: add $0x8,%rbx 0xffffffff80bc0964 <+292>: cmp %rdx,%rbx 0xffffffff80bc0967 <+295>: jae 0xffffffff80bc0990 <linker_load_dependencies+336> 2256 nmp = *nmdp; 0xffffffff80bc0969 <+297>: mov (%rbx),%rax 2257 if (nmp->md_type != MDT_VERSION) 0xffffffff80bc096c <+300>: cmpl $0x3,0x4(%rax) 0xffffffff80bc0970 <+304>: jne 0xffffffff80bc0960 <linker_load_dependencies+288> 2258 continue; 2259 nmodname = nmp->md_cval; 0xffffffff80bc0972 <+306>: mov 0x10(%rax),%rsi 2260 if (strcmp(modname, nmodname) == 0) 0xffffffff80bc0976 <+310>: mov %r12,%rdi 0xffffffff80bc0979 <+313>: call 0xffffffff80cf0100 <strcmp> 2261 break; 2262 } 2263 if (nmdp < stop)/* early exit, it's a self reference */ 0xffffffff80bc097e <+318>: mov -0x30(%rbp),%rdx 2260 if (strcmp(modname, nmodname) == 0) 0xffffffff80bc0982 <+322>: test %eax,%eax 0xffffffff80bc0984 <+324>: jne 0xffffffff80bc0960 <linker_load_dependencies+288> 0xffffffff80bc0986 <+326>: cs nopw 0x0(%rax,%rax,1) 2261 break; 2262 } 2263 if (nmdp < stop)/* early exit, it's a self reference */ 0xffffffff80bc0990 <+336>: cmp %rdx,%rbx 0xffffffff80bc0993 <+339>: jb 0xffffffff80bc0930 <linker_load_dependencies+240> 0xffffffff80bc0995 <+341>: mov 0x1484e0c(%rip),%r14 # 0xffffffff820457a8 <found_modules> 1501 if (verinfo == NULL) 0xffffffff80bc099c <+348>: test %r13,%r13 0xffffffff80bc099f <+351>: je 0xffffffff80bc09b3 <linker_load_dependencies+371> 0xffffffff80bc09a1 <+353>: test %r14,%r14 
1502 return (modlist_lookup(name, 0)); 1503 bestmod = NULL; 1504 TAILQ_FOREACH(mod, &found_modules, link) { 0xffffffff80bc09a4 <+356>: je 0xffffffff80bc0905 <linker_load_dependencies+197> 0xffffffff80bc09aa <+362>: xor %ebx,%ebx 0xffffffff80bc09ac <+364>: jmp 0xffffffff80bc09e8 <linker_load_dependencies+424> 0xffffffff80bc09ae <+366>: xchg %ax,%ax 1487 TAILQ_FOREACH(mod, &found_modules, link) { 0xffffffff80bc09b0 <+368>: mov (%r14),%r14 0xffffffff80bc09b3 <+371>: test %r14,%r14 0xffffffff80bc09b6 <+374>: je 0xffffffff80bc0905 <linker_load_dependencies+197> 1488 if (strcmp(mod->name, name) == 0 && 0xffffffff80bc09bc <+380>: mov 0x18(%r14),%rdi 0xffffffff80bc09c0 <+384>: mov %r12,%rsi 0xffffffff80bc09c3 <+387>: call 0xffffffff80cf0100 <strcmp> 0xffffffff80bc09c8 <+392>: test %eax,%eax 0xffffffff80bc09ca <+394>: jne 0xffffffff80bc09b0 <linker_load_dependencies+368> 0xffffffff80bc09cc <+396>: mov %r14,%rbx 0xffffffff80bc09cf <+399>: jmp 0xffffffff80bc0a23 <linker_load_dependencies+483> 0xffffffff80bc09d1 <+401>: mov %r14,%rbx 0xffffffff80bc09d4 <+404>: data16 data16 cs nopw 0x0(%rax,%rax,1) 1502 return (modlist_lookup(name, 0)); 1503 bestmod = NULL; 1504 TAILQ_FOREACH(mod, &found_modules, link) { 0xffffffff80bc09e0 <+416>: mov (%r14),%r14 0xffffffff80bc09e3 <+419>: test %r14,%r14 0xffffffff80bc09e6 <+422>: je 0xffffffff80bc0a1a <linker_load_dependencies+474> 1505 if (strcmp(mod->name, name) != 0) 0xffffffff80bc09e8 <+424>: mov 0x18(%r14),%rdi 0xffffffff80bc09ec <+428>: mov %r12,%rsi 0xffffffff80bc09ef <+431>: call 0xffffffff80cf0100 <strcmp> 0xffffffff80bc09f4 <+436>: test %eax,%eax 0xffffffff80bc09f6 <+438>: jne 0xffffffff80bc09e0 <linker_load_dependencies+416> 1506 continue; 1507 ver = mod->version; 0xffffffff80bc09f8 <+440>: mov 0x20(%r14),%eax 1508 if (ver == verinfo->md_ver_preferred) 0xffffffff80bc09fc <+444>: cmp 0x4(%r13),%eax 0xffffffff80bc0a00 <+448>: je 0xffffffff80bc09cc <linker_load_dependencies+396> 1509 return (mod); 1510 if (ver >= 
verinfo->md_ver_minimum && 0xffffffff80bc0a02 <+450>: cmp 0x0(%r13),%eax 0xffffffff80bc0a06 <+454>: jl 0xffffffff80bc09e0 <linker_load_dependencies+416> 1511 ver <= verinfo->md_ver_maximum && 0xffffffff80bc0a08 <+456>: cmp 0x8(%r13),%eax 0xffffffff80bc0a0c <+460>: jg 0xffffffff80bc09e0 <linker_load_dependencies+416> 1512 (bestmod == NULL || ver > bestmod->version)) 0xffffffff80bc0a0e <+462>: test %rbx,%rbx 0xffffffff80bc0a11 <+465>: je 0xffffffff80bc09d1 <linker_load_dependencies+401> 0xffffffff80bc0a13 <+467>: cmp 0x20(%rbx),%eax 1510 if (ver >= verinfo->md_ver_minimum && 0xffffffff80bc0a16 <+470>: jg 0xffffffff80bc09d1 <linker_load_dependencies+401> 0xffffffff80bc0a18 <+472>: jmp 0xffffffff80bc09e0 <linker_load_dependencies+416> 2264 continue; 2265 mod = modlist_lookup2(modname, verinfo); 2266 if (mod) { /* woohoo, it's loaded already */ 0xffffffff80bc0a1a <+474>: test %rbx,%rbx 0xffffffff80bc0a1d <+477>: je 0xffffffff80bc0905 <linker_load_dependencies+197> 2267 lfdep = mod->container; 0xffffffff80bc0a23 <+483>: mov 0x10(%rbx),%rbx 2268 lfdep->refs++; 0xffffffff80bc0a27 <+487>: incl 0x8(%rbx) 0xffffffff80bc0a2a <+490>: mov -0x40(%rbp),%r14 773 file->deps = realloc(file->deps, (file->ndeps + 1) * sizeof(*newdeps), 0xffffffff80bc0a2e <+494>: mov 0x78(%r14),%rdi 0xffffffff80bc0a32 <+498>: mov 0x70(%r14),%eax 0xffffffff80bc0a36 <+502>: inc %eax 0xffffffff80bc0a38 <+504>: movslq %eax,%rsi 0xffffffff80bc0a3b <+507>: shl $0x3,%rsi 0xffffffff80bc0a3f <+511>: mov $0xffffffff81cc6be0,%rdx 0xffffffff80bc0a46 <+518>: mov $0x102,%ecx 0xffffffff80bc0a4b <+523>: call 0xffffffff80bc84f0 <realloc> 0xffffffff80bc0a50 <+528>: mov %rax,0x78(%r14) 774 M_LINKER, M_WAITOK | M_ZERO); 775 file->deps[file->ndeps] = dep; 0xffffffff80bc0a54 <+532>: movslq 0x70(%r14),%rcx 0xffffffff80bc0a58 <+536>: mov %rbx,(%rax,%rcx,8) 776 file->ndeps++; 0xffffffff80bc0a5c <+540>: incl 0x70(%r14) 0xffffffff80bc0a60 <+544>: jmp 0xffffffff80bc0930 <linker_load_dependencies+240> 2232 return (0); 2233 for (mdp 
= start; mdp < stop; mdp++) { 0xffffffff80bc0a65 <+549>: mov -0x30(%rbp),%rdx 0xffffffff80bc0a69 <+553>: add $0x8,%r15 0xffffffff80bc0a6d <+557>: cmp %rdx,%r15 0xffffffff80bc0a70 <+560>: jae 0xffffffff80bc0b08 <linker_load_dependencies+712> 2234 mp = *mdp; 0xffffffff80bc0a76 <+566>: mov (%r15),%rax 2235 if (mp->md_type != MDT_VERSION) 0xffffffff80bc0a79 <+569>: cmpl $0x3,0x4(%rax) 0xffffffff80bc0a7d <+573>: jne 0xffffffff80bc0a69 <linker_load_dependencies+553> 1487 TAILQ_FOREACH(mod, &found_modules, link) { 0xffffffff80bc0a7f <+575>: mov 0x1484d22(%rip),%rbx # 0xffffffff820457a8 <found_modules> 0xffffffff80bc0a86 <+582>: test %rbx,%rbx 0xffffffff80bc0a89 <+585>: je 0xffffffff80bc0a69 <linker_load_dependencies+553> 0xffffffff80bc0a8b <+587>: mov 0x8(%rax),%rcx 0xffffffff80bc0a8f <+591>: mov 0x10(%rax),%r12 0xffffffff80bc0a93 <+595>: mov (%rcx),%r13d 0xffffffff80bc0a96 <+598>: jmp 0xffffffff80bc0aa8 <linker_load_dependencies+616> 0xffffffff80bc0a98 <+600>: nopl 0x0(%rax,%rax,1) 0xffffffff80bc0aa0 <+608>: mov (%rbx),%rbx 0xffffffff80bc0aa3 <+611>: test %rbx,%rbx 0xffffffff80bc0aa6 <+614>: je 0xffffffff80bc0a65 <linker_load_dependencies+549> 1488 if (strcmp(mod->name, name) == 0 && 0xffffffff80bc0aa8 <+616>: mov 0x18(%rbx),%rdi 0xffffffff80bc0aac <+620>: mov %r12,%rsi 0xffffffff80bc0aaf <+623>: call 0xffffffff80cf0100 <strcmp> 0xffffffff80bc0ab4 <+628>: test %eax,%eax 0xffffffff80bc0ab6 <+630>: jne 0xffffffff80bc0aa0 <linker_load_dependencies+608> 0xffffffff80bc0ab8 <+632>: test %r13d,%r13d 1489 (ver == 0 || mod->version == ver)) 0xffffffff80bc0abb <+635>: je 0xffffffff80bc0ac3 <linker_load_dependencies+643> 0xffffffff80bc0abd <+637>: cmp %r13d,0x20(%rbx) 1488 if (strcmp(mod->name, name) == 0 && 0xffffffff80bc0ac1 <+641>: jne 0xffffffff80bc0aa0 <linker_load_dependencies+608> 2242 " '%s'!\n", modname, ver, 2243 mod->container->filename); 0xffffffff80bc0ac3 <+643>: mov 0x10(%rbx),%rax 0xffffffff80bc0ac7 <+647>: mov 0x28(%rax),%rcx 2241 printf("interface %s.%d already 
present in the KLD" 0xffffffff80bc0acb <+651>: mov $0xffffffff811fbd3e,%rdi 0xffffffff80bc0ad2 <+658>: mov %r12,%rsi 0xffffffff80bc0ad5 <+661>: mov %r13d,%edx 0xffffffff80bc0ad8 <+664>: xor %eax,%eax 0xffffffff80bc0ada <+666>: call 0xffffffff80c42bc0 <printf> 0xffffffff80bc0adf <+671>: mov $0x11,%ebx 0xffffffff80bc0ae4 <+676>: jmp 0xffffffff80bc0af7 <linker_load_dependencies+695> 2276 break; 2277 } 2278 } 2279 2280 if (error) 2281 return (error); 2282 linker_addmodules(lf, start, stop, 0); 0xffffffff80bc0ae6 <+678>: mov -0x38(%rbp),%r15 0xffffffff80bc0aea <+682>: mov %r14,%rdi 0xffffffff80bc0aed <+685>: mov %r15,%rsi 0xffffffff80bc0af0 <+688>: call 0xffffffff80bc0b80 <linker_addmodules> 0xffffffff80bc0af5 <+693>: xor %ebx,%ebx 2283 return (error); 2284 } 0xffffffff80bc0af7 <+695>: mov %ebx,%eax 0xffffffff80bc0af9 <+697>: add $0x18,%rsp 0xffffffff80bc0afd <+701>: pop %rbx 0xffffffff80bc0afe <+702>: pop %r12 0xffffffff80bc0b00 <+704>: pop %r13 0xffffffff80bc0b02 <+706>: pop %r14 0xffffffff80bc0b04 <+708>: pop %r15 0xffffffff80bc0b06 <+710>: pop %rbp 0xffffffff80bc0b07 <+711>: ret 2248 for (mdp = start; mdp < stop; mdp++) { 0xffffffff80bc0b08 <+712>: mov -0x38(%rbp),%r15 0xffffffff80bc0b0c <+716>: cmp %rdx,%r15 0xffffffff80bc0b0f <+719>: jb 0xffffffff80bc08ff <linker_load_dependencies+191> 0xffffffff80bc0b15 <+725>: jmp 0xffffffff80bc0aea <linker_load_dependencies+682> 2275 " version mismatch\n", lf->filename, modname); 0xffffffff80bc0b17 <+727>: mov 0x28(%r14),%rsi 2274 printf("KLD %s: depends on %s - not available or" 0xffffffff80bc0b1b <+731>: mov $0xffffffff81223a3c,%rdi 0xffffffff80bc0b22 <+738>: mov %r12,%rdx 0xffffffff80bc0b25 <+741>: mov %eax,%ebx 0xffffffff80bc0b27 <+743>: xor %eax,%eax 0xffffffff80bc0b29 <+745>: call 0xffffffff80c42bc0 <printf> 0xffffffff80bc0b2e <+750>: jmp 0xffffffff80bc0af7 <linker_load_dependencies+695> End of assembler dump. (kgdb)
(In reply to George Mitchell from comment #220) +0x274 == +628 (at 0xffffffff80bc0ab4)
1488 if (strcmp(mod->name, name) == 0 &&
0xffffffff80bc0aa8 <+616>: mov 0x18(%rbx),%rdi
0xffffffff80bc0aac <+620>: mov %r12,%rsi
0xffffffff80bc0aaf <+623>: call 0xffffffff80cf0100 <strcmp>
0xffffffff80bc0ab4 <+628>: test %eax,%eax
0xffffffff80bc0ab6 <+630>: jne 0xffffffff80bc0aa0 <linker_load_dependencies+608>
I expect that kgdb would expose strcmp in the backtrace, as it generally does a better job. (The original Fatal Trap notice reported the address in strcmp as well.) This suggests problems for the value in mp->md_cval (via modname) as of the strcmp call (based on where the Fatal Trap was reported to be in strcmp). Note the mod->name argument to strcmp . Note that if mod-> could not be dereferenced, the failure would be before strcmp was called. Thus it looks to be the ->name value that strcmp ends up using that led to the failure. Same sort of thing as for comment #82 (but a separate inlined copy). In strcmp it showed:
0xffffffff80cf0110 <+16>: movzbl (%rdi,%rcx,1),%eax
So if you ever get a chance to have it report the %rdi value on an actual failure, the value may prove interesting. Of course, none of this gets far enough to suggest why the value of mod->name would be messed up.
(In reply to Mark Millard from comment #221) References to "mp->md_cval (via modname)" should have been to mod->name instead.
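The mapping from the backtrace offset to the strcmp call site can be written out as a short worked example. This is a sketch only: the function start address and the offsets are taken from the disassembly in the earlier comment, and the 5-byte call length is read off the listing (the call sits at <+623> and the next instruction at <+628>).

```python
# Start of linker_load_dependencies in the reporter's kernel (from the listing).
FUNC_START = 0xffffffff80bc0840

# Offset shown in the non-kgdb backtrace. A backtrace records the return
# address, i.e. the instruction *after* the call that was in progress.
RET_OFFSET = 0x274

# Length of the "call strcmp" instruction, as implied by the listing.
CALL_LEN = 5

ret_addr = FUNC_START + RET_OFFSET   # where execution would resume
call_addr = ret_addr - CALL_LEN      # the call instruction itself
call_offset = call_addr - FUNC_START

print("return offset (decimal):", RET_OFFSET)
print("return address         :", hex(ret_addr))
print("call instruction       :", hex(call_addr), f"<+{call_offset}>")
```

The call identified this way is the inlined modlist_lookup copy at source line 1488, whose first argument is mod->name.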
Created attachment 255825 [details] Latest crash dump/text Setting dumpdev="/dev/ada0p3" was enough to get an actual dump in single-user mode, and it's attached. I marked all previous attachments obsolete, but of course they are still there if someone wants to look at them. My core file is too big to attach, even compressed, but I can make it available if you think it will actually be more helpful than the core.txt.8 file that I've attached.
(In reply to George Mitchell from comment #223) Cool. System core files tend to have information one does not want to publish. You might want any transfers to be in a more secure person-to-person form instead of being public, possibly via encryption. But it gets messier overall: one would need the kernel /boot/kernel/* files and the /usr/lib/debug/boot/kernel/*.debug files if one is not also running a matching 13.4-RELEASE-p? build someplace. (kgdb uses the information in these files as well.) There are no simple reference copies of those files to download for use in analyzing a system crash file, as far as I know, given that it is a patched update that is in use. For now, for me, I'll probably just ask you to use kgdb, the kernel file, the core file, and the implicit *.debug and other files to report some things if I come up with questions.
(In reply to George Mitchell from comment #223) If you happen to get other examples that list alternatives to zfsctrl and zfs.ko in the likes of: #8 0xffffffff80bc0ab4 in modlist_lookup (name=0xffffffff83255959 "zfsctrl", ver=1) at /usr/src/sys/kern/kern_linker.c:1488 or in the likes of: #14 0xffffffff80bbfa04 in kern_kldload (td=td@entry=0xfffffe0080377e00, file=file@entry=0xfffff800045da000 "zfs.ko", fileid=fileid@entry=0xfffffe0075f8ede4) at /usr/src/sys/kern/kern_linker.c:1149 lf = 0x258800000000 error = 0 modname = 0xffffffff83255959 "zfsctrl" kldname = 0xfffff800045da000 "zfs.ko" It likely would be good to capture them for future reference as well.
Most of my effort has been toward discovering how to make this bug bite me less often. But as you might guess from the summary of the bug, there have been plenty of instances where "zfs" was replaced with "acpi_wmi" or "vboxnetflt" with otherwise similar contexts.
(In reply to George Mitchell from comment #226) "zfsctrl" is new compared to past examples as far as I know. Even for "zfs" and "acpi_wmi" I think it would be good to have an example from the 13.4-RELEASE-p? in use. I'll note that "zfsctrl" vs. "zfs" are not equivalent: different code path and call-chain, or so it appears. I can identify where "zfsctrl" is from, but cannot for the historical "zfs". Other new strings and/or *.ko files would be good too. In part what I'm looking for is the earliest example to occur. In other respects, the variability indicates that a race or something similar is involved in whatever the original corruption is: it is not always failing in the same place. Note that in the sequence below, acpi_wmi.ko would happen later. Does it ever crash before zfs.ko ends up hitting a corruption? (I expect that the list below is in the order of the *.ko loads but am not sure.) The span from the modules that never fail up to the first one that sometimes fails puts some sort of bound on things.
Reading symbols from /boot/kernel/fusefs.ko...
Reading symbols from /boot/kernel/sem.ko...
Reading symbols from /boot/modules/if_re.ko...
Reading symbols from /boot/modules/amdgpu.ko...
Reading symbols from /boot/modules/drm.ko...
Reading symbols from /boot/kernel/iic.ko...
Reading symbols from /boot/modules/linuxkpi_gplv2.ko...
Reading symbols from /boot/modules/dmabuf.ko...
Reading symbols from /boot/modules/ttm.ko...
Reading symbols from /boot/modules/amdgpu_raven_gpu_info_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_sdma_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_asd_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_ta_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_pfp_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_me_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_ce_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_rlc_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_mec_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_mec2_bin.ko...
Reading symbols from /boot/modules/amdgpu_raven_vcn_bin.ko...
Reading symbols from /boot/kernel/zfs.ko...
In kgdb, based on the kernel and the system crash core file (and related files), it should be possible to do a sequence like:
(kgdb) print *found_modules->tqh_first
$51 = {link = {tqe_next = 0xfffff8010175d340, tqe_prev = 0xffffffff81b8e218 <found_modules>}, container = 0xfffff80101918c00, name = 0xffffffff81113803 "cam", version = 1}
(kgdb) print *found_modules->tqh_first->link->tqe_next
$52 = {link = {tqe_next = 0xfffff8010175d300, tqe_prev = 0xfffff8010175d380}, container = 0xfffff80101918c00, name = 0xffffffff811e1b57 "xz", version = 1}
(kgdb) print *found_modules->tqh_first->link->tqe_next->link->tqe_next
$53 = {link = {tqe_next = 0xfffff8010175d2c0, tqe_prev = 0xfffff8010175d340}, container = 0xfffff80101918c00, name = 0xffffffff8123ecdc "acpi", version = 1}
. . .
until the problematical name field is shown (bad pointer or non-terminated string). This should allow reporting what the last good name is and what the failing example looks like.
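Typing ever-longer `->link->tqe_next` chains by hand gets tedious for a list with a couple of hundred entries. The manual walk could be sketched as a small Python helper, assuming the kgdb in use was built with Python scripting support (not confirmed anywhere in this report). The names `first_bad_entry` and `walk_found_modules` are hypothetical; the field names (`tqh_first`, `link`, `tqe_next`, `name`) are the ones visible in the output above. The traversal logic is kept separate from the gdb calls so it can also be exercised outside the debugger.

```python
try:
    import gdb  # only importable inside gdb/kgdb built with Python support
except ImportError:
    gdb = None


def first_bad_entry(first, next_of, name_of, limit=10000):
    """Walk a list and stop at the first node that cannot be read.

    first    -- first node, or None for an empty list
    next_of  -- callable returning the following node (None at the end)
    name_of  -- callable returning a node's name; raises if unreadable
    Returns (index, last_good_name, reason) for the first bad node,
    or None if every node up to `limit` could be read.
    """
    last_good = None
    node = first
    index = 0
    while node is not None and index < limit:
        try:
            name = name_of(node)
            following = next_of(node)
        except Exception as exc:  # gdb.MemoryError in a real session
            return index, last_good, str(exc)
        last_good = name
        node = following
        index += 1
    return None


def walk_found_modules():
    """gdb-only wrapper: apply first_bad_entry() to found_modules."""
    def as_node(ptr):
        # Treat a NULL pointer as the end of the list.
        return None if int(ptr) == 0 else ptr

    head = gdb.parse_and_eval("found_modules")
    return first_bad_entry(
        as_node(head["tqh_first"]),
        lambda n: as_node(n.dereference()["link"]["tqe_next"]),
        lambda n: n.dereference()["name"].string(),
    )
```

Inside kgdb this would be loaded with `source` on the file and invoked with `python print(walk_found_modules())`. The result names the last good entry and the reason the next one failed, which is the same information the manual sequence is meant to produce.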
I used freebsd-update fetch and then freebsd-update install to get a 13.4-RELEASE-p2 (so: a 13.4-RELEASE-p1 kernel) based on a 13.4-RELEASE install. However, the result led to kgdb reporting:
warning: the debug information found in "/usr/lib/debug//boot/kernel/kernel.debug" does not match "/boot/kernel/kernel" (CRC mismatch).
That was because freebsd-update did not update usr/lib/debug/boot/kernel/kernel.debug but did update boot/kernel/kernel . As it stands, it does not look like I can reproduce even that much of your environment. It looks like I'll continue to be limited to your reports of the results of experiments done in your environment. It does not look like my having a crash-core file would do much good.
# strings 13.4R*/boot/kernel/kernel | grep 13.4-RELEASE
@(#)FreeBSD 13.4-RELEASE releng/13.4-n258257-58066db597be GENERIC
FreeBSD 13.4-RELEASE releng/13.4-n258257-58066db597be GENERIC
13.4-RELEASE
@(#)FreeBSD 13.4-RELEASE-p1 GENERIC
FreeBSD 13.4-RELEASE-p1 GENERIC
13.4-RELEASE-p1
On top of which I use SCHED_4BSD. However, I'm happy to give you access to my /usr/lib/debug/boot/kernel/kernel.debug and /boot/kernel/kernel.
Mark points out I should specifically post this, since I am not using a stock distribution of the kernel:
diff -u sys/amd64/conf/{GENERIC,M5P}
--- sys/amd64/conf/GENERIC 2024-07-03 16:23:56.252550000 -0400
+++ sys/amd64/conf/M5P 2024-07-03 16:25:05.287604000 -0400
@@ -18,12 +18,13 @@
 #
 cpu HAMMER
-ident GENERIC
+ident M5P
 makeoptions DEBUG=-g # Build kernel with gdb(1) debug symbols
 makeoptions WITH_CTF=1 # Run ctfconvert(1) for DTrace support
-options SCHED_ULE # ULE scheduler
+#options SCHED_ULE # ULE scheduler
+options SCHED_4BSD # 4BSD scheduler
 options NUMA # Non-Uniform Memory Architecture support
 options PREEMPTION # Enable kernel thread preemption
 options VIMAGE # Subsystem virtualization, e.g. VNET
(In reply to Mark Millard from comment #228) My crude traversal of the long list of nodes ends with the sequence:
(kgdb) print *found_modules->tqh_first->link->tqe_next . . .->link->tqe_next
$206 = {link = {tqe_next = 0xfffff8000465bc80, tqe_prev = 0xfffff80004607c40}, container = 0xfffff80004b29a80, name = 0xffffffff829f801d "amdgpu_raven_ce_bin_fw", version = 1}
(kgdb) print *found_modules->tqh_first->link->tqe_next . . .->link->tqe_next
$207 = {link = {tqe_next = 0xfffff8000465bbc0, tqe_prev = 0xfffff80004607b80}, container = 0xfffff80004b29780, name = 0xffffffff82e12000 <mmhub_client_ids_vega20> "amdgpu_raven_rlc_bin_fw", version = 1}
(kgdb) print *found_modules->tqh_first->link->tqe_next . . .->link->tqe_next
$208 = {link = {tqe_next = 0xfffff80004607a00, tqe_prev = 0xfffff8000465bc80}, container = 0xfffff80003868c00, name = 0xffffffff82e1e000 <xgpu_fiji_mgcg_cgcg_init+368> "amdgpu_raven_mec_bin_fw", version = 1}
(kgdb) print *found_modules->tqh_first->link->tqe_next->link->tqe_next . . .->link->tqe_next
$209 = {link = {tqe_next = 0xfffff80000000007, tqe_prev = 0xfffff8000465bbc0}, container = 0xfffff80004b29600, name = 0xffffffff82e62026 <se_mask+242> "amdgpu_raven_mec2_bin_fw", version = 1}
(kgdb) print *found_modules->tqh_first->link->tqe_next->link->tqe_next . . .->link->tqe_next
$210 = {link = {tqe_next = 0xeef3f000e2c3f0, tqe_prev = 0xff54f000eef3f0}, container = 0x322ff0003287f0, name = 0xe987f000fea5f0 <error: Cannot access memory at address 0xe987f000fea5f0>, version = 15660016}
The ones that also show prefix <...> text like:
<mmhub_client_ids_vega20> "amdgpu_raven_rlc_bin_fw"
<xgpu_fiji_mgcg_cgcg_init+368> "amdgpu_raven_mec_bin_fw"
<se_mask+242> "amdgpu_raven_mec2_bin_fw"
are not the first ones to do so. Also note the duplication of "amdgpu_raven_mec2_bin_fw".
Also note the:
tqe_next = 0xfffff80000000007
that, when dereferenced, ends up with what is clearly garbage for the purpose of the list:
link = {tqe_next = 0xeef3f000e2c3f0, tqe_prev = 0xff54f000eef3f0}, container = 0x322ff0003287f0, name = 0xe987f000fea5f0 <error: Cannot access memory at address 0xe987f000fea5f0>, version = 15660016}
For reference:
(kgdb) print &mmhub_client_ids_vega20
$211 = (<data variable, no debug info> *) 0xffffffff82e12000 <mmhub_client_ids_vega20>
(kgdb) print &xgpu_fiji_mgcg_cgcg_init
$212 = (<data variable, no debug info> *) 0xffffffff82e1de90 <xgpu_fiji_mgcg_cgcg_init>
(kgdb) print &se_mask
$213 = (<data variable, no debug info> *) 0xffffffff82e41d7c <se_mask>
Those addresses are in the .rodata for /boot/modules/amdgpu.ko :
0xffffffff81d82880 - 0xffffffff82200000 is .bss
0xffffffff82a00000 - 0xffffffff82d9a000 is .text in /boot/modules/amdgpu.ko
0xffffffff82d9a000 - 0xffffffff82eea000 is .rodata in /boot/modules/amdgpu.ko
0xffffffff82eea000 - 0xffffffff82ef7948 is .bss in /boot/modules/amdgpu.ko
0xffffffff82ef7950 - 0xffffffff82f064b8 is .data in /boot/modules/amdgpu.ko
0xffffffff82f064b8 - 0xffffffff82f068d0 is set_sysctl_set in /boot/modules/amdgpu.ko
0xffffffff82f068d0 - 0xffffffff82f068f8 is set_sysinit_set in /boot/modules/amdgpu.ko
0xffffffff82f068f8 - 0xffffffff82f06908 is set_sysuninit_set in /boot/modules/amdgpu.ko
0xffffffff82f06908 - 0xffffffff82f06958 is set_modmetadata_set in /boot/modules/amdgpu.ko
0xffffffff82f06958 - 0xffffffff82f0697c is .note.gnu.build-id in /boot/modules/amdgpu.ko
That matches up with the node with:
link = {tqe_next = 0xfffff8000465bc80
referencing:
$207 = {link = {tqe_next = 0xfffff8000465bbc0, tqe_prev = 0xfffff80004607b80}, container = 0xfffff80004b29780, name = 0xffffffff82e12000 <mmhub_client_ids_vega20> "amdgpu_raven_rlc_bin_fw"
where name has the address 0xffffffff82e12000 .
I'll note that the name = 0xffffffff829f801d "amdgpu_raven_ce_bin_fw" before the oddities lands between the .bss for the kernel and the .text for /boot/modules/amdgpu.ko (not in either one): 0xffffffff81d82880 - 0xffffffff82200000 is .bss 0xffffffff82a00000 - 0xffffffff82d9a000 is .text in /boot/modules/amdgpu.ko . . . 0xffffffff829a3360 - 0xffffffff829a3384 is .note.gnu.build-id in /boot/modules/ttm.ko For reference, the first node's name field has: name = 0xffffffff81184803 "cam" That is in the kernel's .rodata : 0xffffffff81184400 - 0xffffffff817f68d0 is .rodata there are earlier: name = 0xffffffff8298b0de <drm_ioctls+350> "iic" and: name = 0xffffffff8298f24f <orientation_data+6415> "linuxkpi_gplv2" in: 0xffffffff82973000 - 0xffffffff82991000 is .rodata in /boot/modules/drm.ko name = 0xffffffff829a21c2 <global_write_combined+370> "ttm" in: 0xffffffff829a2000 - 0xffffffff829a2eb0 is .bss in /boot/modules/ttm.ko (Not the just prior .rodata for /boot/modules/ttm.ko .) For reference: $198 = {link = {tqe_next = 0xfffff80003904d00, tqe_prev = 0xfffff8000465a1c0}, container = 0xfffff8000464b180, name = 0xffffffff8297644b "drmn", version = 2} has its tqe_next pointing to the ttm using .bss for the name string: $199 = {link = {tqe_next = 0xfffff8000465bd00, tqe_prev = 0xfffff8000465a3c0}, container = 0xfffff8000469da80, name = 0xffffffff829a21c2 <global_write_combined+370> "ttm", version = 1} I will note that /boot/modules/ttm.ko is the last (most recent) to show up in the "info file" kgdb output: (kgdb) info file Symbols from "/usr/home/root/failing-kernel-files/usr/lib/debug/boot/kernel/kernel.debug". Kernel core dump file: `/usr/home/root/failing-kernel-files/vmcore.8', file type FreeBSD kernel vmcore. Local exec file: `/usr/home/root/failing-kernel-files/boot/kernel/kernel', file type elf64-x86-64-freebsd. 
Entry point: 0xffffffff8038e000 0xffffffff802002a8 - 0xffffffff802002b5 is .interp 0xffffffff802002b8 - 0xffffffff80231108 is .hash 0xffffffff80231108 - 0xffffffff8025f9e4 is .gnu.hash 0xffffffff8025f9e8 - 0xffffffff802f24c0 is .dynsym 0xffffffff802f24c0 - 0xffffffff8036d162 is .dynstr 0xffffffff8036d168 - 0xffffffff8038db08 is .rela.dyn 0xffffffff8038e000 - 0xffffffff811843f8 is .text 0xffffffff81184400 - 0xffffffff817f68d0 is .rodata 0xffffffff817f68d0 - 0xffffffff817fba38 is set_sysctl_set 0xffffffff817fba38 - 0xffffffff817fef60 is set_modmetadata_set 0xffffffff817fef60 - 0xffffffff817fefb8 is set_cam_xpt_xport_set 0xffffffff817fefb8 - 0xffffffff817fefe0 is set_cam_xpt_proto_set 0xffffffff817fefe0 - 0xffffffff817ff028 is set_ah_chips 0xffffffff817ff028 - 0xffffffff817ff078 is set_ah_rfs 0xffffffff817ff078 - 0xffffffff817ff098 is set_kbddriver_set 0xffffffff817ff098 - 0xffffffff817ff150 is set_sdt_providers_set 0xffffffff817ff150 - 0xffffffff81800268 is set_sdt_probes_set 0xffffffff81800268 - 0xffffffff818035c8 is set_sdt_argtypes_set 0xffffffff818035c8 - 0xffffffff818035e0 is set_scterm_set 0xffffffff818035e0 - 0xffffffff81803608 is set_cons_set 0xffffffff81803608 - 0xffffffff81803610 is set_uart_acpi_class_and_device_set 0xffffffff81803620 - 0xffffffff81803660 is usb_host_id 0xffffffff81803660 - 0xffffffff81803680 is set_vt_drv_set 0xffffffff81803680 - 0xffffffff818036a8 is set_elf64_regset 0xffffffff818036a8 - 0xffffffff818036d8 is set_elf32_regset 0xffffffff818036d8 - 0xffffffff818036e8 is set_compressors 0xffffffff818036e8 - 0xffffffff818036f0 is set_kdb_dbbe_set 0xffffffff818036f0 - 0xffffffff81803700 is set_ratectl_set 0xffffffff81803700 - 0xffffffff81803718 is set_crypto_set 0xffffffff81803718 - 0xffffffff81803730 is set_ieee80211_ioctl_getset 0xffffffff81803730 - 0xffffffff81803748 is set_ieee80211_ioctl_setset 0xffffffff81803748 - 0xffffffff81803770 is set_scanner_set 0xffffffff81803770 - 0xffffffff81803790 is set_videodriver_set 0xffffffff81803790 - 
0xffffffff818037d8 is set_scrndr_set 0xffffffff818037d8 - 0xffffffff81803820 is set_vga_set 0xffffffff81803820 - 0xffffffff81804881 is kern_conf 0xffffffff81804884 - 0xffffffff818048a8 is .note.gnu.build-id 0xffffffff818048a8 - 0xffffffff8180493c is .eh_frame 0xffffffff81a00000 - 0xffffffff81a00140 is .dynamic 0xffffffff81a00140 - 0xffffffff81a01000 is .relro_padding 0xffffffff81c00000 - 0xffffffff81c00035 is .data.read_frequently 0xffffffff81c00040 - 0xffffffff81c017f4 is .data.read_mostly 0xffffffff81c01800 - 0xffffffff81c07680 is .data.exclusive_cache_line 0xffffffff81c08000 - 0xffffffff81d51248 is .data 0xffffffff81d51248 - 0xffffffff81d54688 is set_sysinit_set 0xffffffff81d54688 - 0xffffffff81d55e48 is set_sysuninit_set 0xffffffff81d55e80 - 0xffffffff81d592e8 is set_pcpu 0xffffffff81d592f0 - 0xffffffff81d82851 is set_vnet 0xffffffff81d82880 - 0xffffffff82200000 is .bss 0xffffffff82a00000 - 0xffffffff82d9a000 is .text in /boot/modules/amdgpu.ko 0xffffffff82d9a000 - 0xffffffff82eea000 is .rodata in /boot/modules/amdgpu.ko 0xffffffff82eea000 - 0xffffffff82ef7948 is .bss in /boot/modules/amdgpu.ko 0xffffffff82ef7950 - 0xffffffff82f064b8 is .data in /boot/modules/amdgpu.ko 0xffffffff82f064b8 - 0xffffffff82f068d0 is set_sysctl_set in /boot/modules/amdgpu.ko 0xffffffff82f068d0 - 0xffffffff82f068f8 is set_sysinit_set in /boot/modules/amdgpu.ko 0xffffffff82f068f8 - 0xffffffff82f06908 is set_sysuninit_set in /boot/modules/amdgpu.ko 0xffffffff82f06908 - 0xffffffff82f06958 is set_modmetadata_set in /boot/modules/amdgpu.ko 0xffffffff82f06958 - 0xffffffff82f0697c is .note.gnu.build-id in /boot/modules/amdgpu.ko 0xffffffff82918000 - 0xffffffff82973000 is .text in /boot/modules/drm.ko 0xffffffff82973000 - 0xffffffff82991000 is .rodata in /boot/modules/drm.ko 0xffffffff82991000 - 0xffffffff829911e0 is .bss in /boot/modules/drm.ko 0xffffffff829911e0 - 0xffffffff82992df8 is .data in /boot/modules/drm.ko 0xffffffff82992df8 - 0xffffffff82992e80 is set_sysinit_set in 
/boot/modules/drm.ko 0xffffffff82992e80 - 0xffffffff82992ef0 is set_sysuninit_set in /boot/modules/drm.ko 0xffffffff82992ef0 - 0xffffffff82992fc0 is set_sysctl_set in /boot/modules/drm.ko 0xffffffff82992fc0 - 0xffffffff82992fcc is .data.read_mostly in /boot/modules/drm.ko 0xffffffff82992fd0 - 0xffffffff82993050 is set_modmetadata_set in /boot/modules/drm.ko 0xffffffff82993050 - 0xffffffff82993074 is .note.gnu.build-id in /boot/modules/drm.ko 0xffffffff8298d000 - 0xffffffff8298d000 is .text in /boot/modules/linuxkpi_gplv2.ko 0xffffffff8298d000 - 0xffffffff8298e000 is .rodata in /boot/modules/linuxkpi_gplv2.ko 0xffffffff8298e000 - 0xffffffff8298e0d0 is .data in /boot/modules/linuxkpi_gplv2.ko 0xffffffff8298e0d0 - 0xffffffff8298e100 is set_modmetadata_set in /boot/modules/linuxkpi_gplv2.ko 0xffffffff8298e100 - 0xffffffff8298e124 is .note.gnu.build-id in /boot/modules/linuxkpi_gplv2.ko 0xffffffff82991000 - 0xffffffff82996000 is .text in /boot/modules/dmabuf.ko 0xffffffff82996000 - 0xffffffff82997000 is .rodata in /boot/modules/dmabuf.ko 0xffffffff82997000 - 0xffffffff82997280 is .data in /boot/modules/dmabuf.ko 0xffffffff82997280 - 0xffffffff82997290 is set_modmetadata_set in /boot/modules/dmabuf.ko 0xffffffff82997290 - 0xffffffff829972a8 is set_sysinit_set in /boot/modules/dmabuf.ko 0xffffffff829972a8 - 0xffffffff829972c0 is set_sysuninit_set in /boot/modules/dmabuf.ko 0xffffffff829972c0 - 0xffffffff82997358 is .bss in /boot/modules/dmabuf.ko 0xffffffff82997358 - 0xffffffff8299737c is .note.gnu.build-id in /boot/modules/dmabuf.ko 0xffffffff82998000 - 0xffffffff829a1000 is .text in /boot/modules/ttm.ko 0xffffffff829a1000 - 0xffffffff829a2000 is .rodata in /boot/modules/ttm.ko 0xffffffff829a2000 - 0xffffffff829a2eb0 is .bss in /boot/modules/ttm.ko 0xffffffff829a2eb0 - 0xffffffff829a32e8 is .data in /boot/modules/ttm.ko 0xffffffff829a32e8 - 0xffffffff829a3320 is set_sysctl_set in /boot/modules/ttm.ko 0xffffffff829a3320 - 0xffffffff829a3350 is set_modmetadata_set in 
/boot/modules/ttm.ko 0xffffffff829a3350 - 0xffffffff829a3358 is set_sysinit_set in /boot/modules/ttm.ko 0xffffffff829a3358 - 0xffffffff829a3360 is set_sysuninit_set in /boot/modules/ttm.ko 0xffffffff829a3360 - 0xffffffff829a3384 is .note.gnu.build-id in /boot/modules/ttm.ko (kgdb) For reference: $210 = {link = {tqe_next = 0xeef3f000e2c3f0, tqe_prev = 0xff54f000eef3f0}, container = 0x322ff0003287f0, name = 0xe987f000fea5f0 <error: Cannot access memory at address 0xe987f000fea5f0>, happens after previously dereferencing to see over something like 200 nodes in the list. With that I'll stop this specific note.
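The by-hand matching of name pointers against "info file" ranges in this comment can be sketched as a table lookup. This is a minimal user-space illustration in C, assuming the ranges are half-open (adjacent sections such as amdgpu.ko's .text and .rodata share a boundary, which suggests that) and using only a handful of rows copied from the output above. The names section_range and section_index are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One "info file" row: assumed half-open range [start, end) and its label. */
struct section_range {
	uint64_t start;
	uint64_t end;
	const char *label;
};

/* A few rows copied by hand from the "info file" output in this comment. */
static const struct section_range ranges[] = {
	{ 0xffffffff81184400ULL, 0xffffffff817f68d0ULL, ".rodata" },
	{ 0xffffffff81d82880ULL, 0xffffffff82200000ULL, ".bss" },
	{ 0xffffffff82973000ULL, 0xffffffff82991000ULL, ".rodata in /boot/modules/drm.ko" },
	{ 0xffffffff829a2000ULL, 0xffffffff829a2eb0ULL, ".bss in /boot/modules/ttm.ko" },
	{ 0xffffffff82a00000ULL, 0xffffffff82d9a000ULL, ".text in /boot/modules/amdgpu.ko" },
	{ 0xffffffff82d9a000ULL, 0xffffffff82eea000ULL, ".rodata in /boot/modules/amdgpu.ko" },
};

#define NRANGES (sizeof(ranges) / sizeof(ranges[0]))

/* Return the index of the table row containing addr, or -1 if none does. */
int
section_index(uint64_t addr)
{
	size_t i;

	for (i = 0; i < NRANGES; i++) {
		assert(ranges[i].start < ranges[i].end);
		if (addr >= ranges[i].start && addr < ranges[i].end)
			return ((int)i);
	}
	return (-1);
}
```

With this table, 0xffffffff82e12000 (the "amdgpu_raven_rlc_bin_fw" name) lands in the amdgpu.ko .rodata row, while 0xffffffff829f801d (the "amdgpu_raven_ce_bin_fw" name) matches no row, which mirrors the observation above. A miss only means the address is outside the sections kgdb knew about when the list was produced.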
(In reply to George Mitchell from comment #231) Also: the build is based on the -p2 source code (hash 3f40d5821):
# strings boot/kernel/kernel | grep "\-RELEASE"
@(#)FreeBSD 13.4-RELEASE-p2 3f40d5821 M5P
FreeBSD 13.4-RELEASE-p2 3f40d5821 M5P
13.4-RELEASE-p2
Because it is a rebuild, the kernel ends up with -p2 instead of the official -p1 (the official -p2 distribution did not update boot/kernel/kernel).
(In reply to Mark Millard from comment #232) I mistakenly wrote of a duplication in:
QUOTE
<mmhub_client_ids_vega20> "amdgpu_raven_rlc_bin_fw"
<xgpu_fiji_mgcg_cgcg_init+368> "amdgpu_raven_mec_bin_fw"
<se_mask+242> "amdgpu_raven_mec2_bin_fw"
are not the first ones to do so. Also note the duplication of "amdgpu_raven_mec2_bin_fw".
END QUOTE
amdgpu_raven_mec_bin_fw vs. amdgpu_raven_mec2_bin_fw is not a duplication. Sorry.
For the 3 node sequence (last partially-good and then just-junk): $208 = {link = {tqe_next = 0xfffff80004607a00, tqe_prev = 0xfffff8000465bc80}, container = 0xfffff80003868c00, name = 0xffffffff82e1e000 <xgpu_fiji_mgcg_cgcg_init+368> "amdgpu_raven_mec_bin_fw", version = 1} $209 = {link = {tqe_next = 0xfffff80000000007, tqe_prev = 0xfffff8000465bbc0}, container = 0xfffff80004b29600, name = 0xffffffff82e62026 <se_mask+242> "amdgpu_raven_mec2_bin_fw", version = 1} $210 = {link = {tqe_next = 0xeef3f000e2c3f0, tqe_prev = 0xff54f000eef3f0}, container = 0x322ff0003287f0, name = 0xe987f000fea5f0 <error: Cannot access memory at address 0xe987f000fea5f0>, version = 15660016} it looks like the: $209 = {link = {tqe_next = 0xfffff80000000007, is the earliest example of (evidence of) corruption. The address is outside of (smaller address than) the kernel start: Local exec file: `/usr/home/root/failing-kernel-files/boot/kernel/kernel', file type elf64-x86-64-freebsd. Entry point: 0xffffffff8038e000 0xffffffff802002a8 - 0xffffffff802002b5 is .interp Having 0000000007 also looks odd. However, the rest of that node: tqe_prev = 0xfffff8000465bbc0}, container = 0xfffff80004b29600, name = 0xffffffff82e62026 <se_mask+242> "amdgpu_raven_mec2_bin_fw", version = 1} does not look to have any obvious problems with its content. 
The contents of the container are shown as:
$214 = {ops = 0xfffff80003164000, refs = 1, userrefs = 0, flags = 1, link = {tqe_next = 0xfffff8000469ed80, tqe_prev = 0xfffff80003868c18}, filename = 0xfffff80004b22120 "amdgpu_raven_mec2_bin.ko", pathname = 0xfffff80004607a40 "/boot/modules/amdgpu_raven_mec2_bin.ko", id = 20, address = 0xffffffff82e61000 <link_enc_regs+1520> "\203\376\001tL\270\026", size = 276456, ctors_addr = 0x0, ctors_size = 0, dtors_addr = 0x0, dtors_size = 0, ndeps = 3, deps = 0xfffff80004b220e0, common = {stqh_first = 0x0, stqh_last = 0xfffff80004b29680}, modules = {tqh_first = 0xfffff80004b1ff00, tqh_last = 0xfffff80004b1ff10}, loaded = {tqe_next = 0x0, tqe_prev = 0x0}, loadcnt = 20, nenabled = 0, fbt_nentries = 0}
which also seems to not have obvious problems. This type of vmcore.* does not provide threads, stack content, or backtrace information. Nor is there any indication of when the tqe_next = 0xfffff80000000007 came to be the case. It is also not obvious whether the list was longer before that happened. There does not seem to be a way to tell if the corrupted value might be because of "raven" specific code vs. more general code. It would be interesting to know if an alternate card type has the problem vs. not. As for the raven context, getting vmcore.* captures that fail at a different stage, such as the failure that mentioned acpi_wmi but did not get a vmcore.* , would help indicate whether the place in the list where the corruption happens moves around (relative to other content).
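One way to make "the earliest evidence of corruption" more mechanical is to check the tail queue's back links. The sketch below is a user-space illustration in C that works on values copied from the kgdb output, not on live kernel memory. It assumes that link is the first member of each node (as the kgdb field order suggests), so a node's tqe_prev should equal the address of the previous node. The names node_snapshot and first_inconsistent are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values transcribed from kgdb for one list node. */
struct node_snapshot {
	uint64_t addr;     /* address of the node itself */
	uint64_t tqe_next; /* link.tqe_next as printed by kgdb */
	uint64_t tqe_prev; /* link.tqe_prev as printed by kgdb */
};

/*
 * Walk consecutive snapshots and return the index of the first node whose
 * address or back link disagrees with its predecessor, or n if all agree.
 * Assumption: link is at offset 0, so &prev->link.tqe_next == prev's address.
 */
size_t
first_inconsistent(const struct node_snapshot *nodes, size_t n)
{
	size_t i;

	assert(nodes != NULL || n == 0);
	for (i = 1; i < n; i++) {
		if (nodes[i].addr != nodes[i - 1].tqe_next)
			return (i);
		if (nodes[i].tqe_prev != nodes[i - 1].addr)
			return (i);
	}
	return (n);
}
```

Fed the $207 through $210 values, the check passes for $208 and $209 and first fails at $210, whose tqe_prev does not point back at $209. That is consistent with $209's tqe_next being the first bad field while the rest of $209 looks intact.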
Mark, thank you sincerely for your help in tracking this down. I have temporarily rearranged my startup script to increase my chance of getting more crashes (it seems loading amdgpu AFTER zfs, acpi_wmi, and vboxnetflt makes the crash more likely, so of late I have been loading amdgpu FIRST so I can get more work done), and if that occurs, I will put some more vmcores in the directory I told you about earlier today. Your assistance is GREATLY appreciated!
(In reply to Mark Millard from comment #235) Old comments that reference one or both of:
0xFFFFF80000000000 (also known as 18446735277616529408)
0xFFFFF80000000007
are: comment #44, comment #94, comment #148.
Example from 44 (that 94 references):
#8 vtozoneslab (va=18446735277616529408, zone=<optimized out>, slab=<optimized out>) at /usr/src/sys/vm/uma_int.h:635
#9 free (addr=0xfffff80000000007, mtp=0xffffffff824332b0 <M_SOLARIS>) at /usr/src/sys/kern/kern_malloc.c:911
#10 0xffffffff8214d251 in nv_mem_free (nvp=<optimized out>, buf=0xfffff80000000007, size=16688648) at /usr/src/sys/contrib/openzfs/module/nvpair/nvpair.c:216
Example from 148 (an nfsd process context):
#7 0xffffffff80c895cb in atomic_fcmpset_long (src=18446741877726026240, dst=<optimized out>, expect=<optimized out>) at /usr/src/sys/amd64/include/atomic.h:225
#8 selfdfree (stp=stp@entry=0xfffff80012aa8080, sfp=0xfffff80000000007) at /usr/src/sys/kern/sys_generic.c:1755
#9 0xffffffff80c8866b in seltdclear (td=td@entry=0xfffffe00b52e9a00) at /usr/src/sys/kern/sys_generic.c:1967
[I'll note that 18446741877726026240 = 0xFFFFFE00B52E9A00, but its use here likely results from dereferencing something based on the 0xfffff80000000007 in some way.]
The history suggests that 0xfffff80000000007 (or 0xfffff80000000000) corruption is not limited to a specific place.
(In reply to Mark Millard from comment #237) I should have noted: the material in comments #44, #94, and #148 is not tied to the found_modules->tqh_first->. . . list as far as I can tell.
In: https://lists.freebsd.org/archives/freebsd-hackers/2024-December/004100.html Philipp writes: QUOTE By simple grep through sys/ I found following comment in sys/amd64/include/vmparam.h: > /* > * Virtual addresses of things. Derived from the page directory and > * page table indexes from pmap.h for precision. > [...] > * 0xfffff80000000000 - 0xfffffbffffffffff 4TB direct map The direct map is 4TB of virtuall address space mapping the physical address space 1:1 (minus the base). So I would guess this is caused by an NULL pointer converted by PHYS_TO_DMAP. END QUOTE So either: PHYS_TO_DMAP(0x0)+7 or: PHYS_TO_DMAP(0x0+7) looks likely to be involved for 0xfffff80000000007 showing up in: $209 = {link = {tqe_next = 0xfffff80000000007,
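The arithmetic behind that guess can be written out. The following is a small user-space sketch in C, not the kernel's macro: it assumes, based only on the quoted vmparam.h comment, that the direct map covers 0xfffff80000000000 through 0xfffffbffffffffff and that a physical address maps to the base plus that address. The names phys_to_dmap_sketch, in_dmap, ASSUMED_DMAP_MIN, and ASSUMED_DMAP_MAX are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bounds of the 4TB direct map, from the quoted vmparam.h comment. */
#define ASSUMED_DMAP_MIN 0xfffff80000000000ULL
#define ASSUMED_DMAP_MAX 0xfffffbffffffffffULL

/* Sketch of a physical-to-direct-map conversion: base plus physical address. */
uint64_t
phys_to_dmap_sketch(uint64_t pa)
{
	assert(pa <= ASSUMED_DMAP_MAX - ASSUMED_DMAP_MIN);
	return (ASSUMED_DMAP_MIN + pa);
}

/* Report whether a virtual address falls inside the assumed direct map. */
int
in_dmap(uint64_t va)
{
	return (va >= ASSUMED_DMAP_MIN && va <= ASSUMED_DMAP_MAX);
}
```

Under these assumptions, converting physical address 0 and then adding 7 gives the same result as converting physical address 7, namely 0xfffff80000000007, so the value alone cannot distinguish the two cases.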
(In reply to Mark Millard from comment #237) Interestingly, the traceback from comment #148 involves a different list:
/*
 * Remove the references to the thread from all of the objects we were
 * polling.
 */
static void
seltdclear(struct thread *td)
{
	struct seltd *stp;
	struct selfd *sfp;
	struct selfd *sfn;

	stp = td->td_sel;
	STAILQ_FOREACH_SAFE(sfp, &stp->st_selq, sf_link, sfn)
		selfdfree(stp, sfp);
	stp->st_flags = 0;
}
It was an sfp value that ended up being reported as: 0xfffff80000000007
One of the older ("obsolete") crash dump reports is for:
/*
 * free:
 *
 * Free a block of memory allocated by malloc.
 *
 * This routine may not block.
 */
void
free(void *addr, struct malloc_type *mtp)
{
	uma_zone_t zone;
	uma_slab_t slab;
	u_long size;

#ifdef MALLOC_DEBUG
	if (free_dbg(&addr, mtp) != 0)
		return;
#endif
	/* free(NULL, ...) does nothing */
	if (addr == NULL)
		return;

	vtozoneslab((vm_offset_t)addr & (~UMA_SLAB_MASK), &zone, &slab);
. . .
where addr ended up being 0xfffff80000000007 , in other words PHYS_TO_DMAP(0x7). The (vm_offset_t)addr & (~UMA_SLAB_MASK) turned it into 0xfffff80000000000 for vtozoneslab. That in turn reported a failure. The presence of a NULL check in the kernel's free suggests to me that the kernel's free may not be intended to handle DMAP addresses. Similarly for other kernel code that checks against NULL but not against PHYS_TO_DMAP(NULL). How does one tell where DMAP addresses should not appear when looking around via kgdb?
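The masking step described above is simple enough to check by hand. Here is a minimal user-space sketch in C, assuming the slab mask is one less than a 4096-byte page size (an assumption made for illustration; the value is not taken from this thread). The names slab_base, ASSUMED_PAGE_SIZE, and ASSUMED_SLAB_MASK are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: slabs are page sized and pages are 4096 bytes. */
#define ASSUMED_PAGE_SIZE 4096ULL
#define ASSUMED_SLAB_MASK (ASSUMED_PAGE_SIZE - 1)

/* Clear the low bits of an address, as free() does before vtozoneslab(). */
uint64_t
slab_base(uint64_t addr)
{
	/* The mask trick only works for a power-of-two size. */
	assert((ASSUMED_PAGE_SIZE & (ASSUMED_PAGE_SIZE - 1)) == 0);
	return (addr & ~ASSUMED_SLAB_MASK);
}
```

Applied to 0xfffff80000000007 this yields 0xfffff80000000000, which is the 18446735277616529408 that appears as the va argument of vtozoneslab in the older backtrace.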
Mark, I'm extremely grateful for all your recent disassemblies, but don't you need some older kernel/kernel.debug files to match up with those older files? I'm in the process of extracting them from my old backups ...
Argh, it's worse than I thought -- there's -p2, -p5, etc. etc.
(In reply to George Mitchell from comment #243) I got what I reported for the obsolete materials via the attachments. I can get to the source via git. There are no vmcore.* files that I'm aware of. But the attachments still allow seeing that the pointer value 0xfffff80000000007 was showing up in various places, just based on the kgdb backtrace reports. I would not worry about providing pre-13.4-RELEASE-p1 vmcore.* files or related kernel or kernel.debug files.
Thanks, Mark; that's a relief!
Meanwhile, I haven't been able to cause any new crashes . . . I'll have to try something different.
(In reply to Mark Millard from comment #235) I found another context issue that might eventually prove to be of interest for the vmcore.8 that I've got a copy of. First remember:
(kgdb) print *found_modules->tqh_first->link.tqe_next->. . .->link.tqe_next
$1 = {link = {tqe_next = 0xfffff80000000007, tqe_prev = 0xfffff8000465bbc0}, container = 0xfffff80004b29600, name = 0xffffffff82e62026 "amdgpu_raven_mec2_bin_fw", version = 1}
Note that amdgpu_raven_mec2_bin_fw is the last name. Now:
(kgdb) info sharedlibrary
From               To                 Syms Read   Shared Object Library
0xffffffff82545000 0xffffffff82552000 Yes         ./boot/kernel/fusefs.ko
0xffffffff8256d000 0xffffffff8256f000 Yes         ./boot/kernel/sem.ko
                                      No          /boot/modules/if_re.ko
                                      No          /boot/modules/amdgpu.ko
                                      No          /boot/modules/drm.ko
0xffffffff8298a000 0xffffffff8298b000 Yes         ./boot/kernel/iic.ko
                                      No          /boot/modules/linuxkpi_gplv2.ko
                                      No          /boot/modules/dmabuf.ko
                                      No          /boot/modules/ttm.ko
                                      No          /boot/modules/amdgpu_raven_gpu_info_bin.ko
                                      No          /boot/modules/amdgpu_raven_sdma_bin.ko
                                      No          /boot/modules/amdgpu_raven_asd_bin.ko
                                      No          /boot/modules/amdgpu_raven_ta_bin.ko
                                      No          /boot/modules/amdgpu_raven_pfp_bin.ko
                                      No          /boot/modules/amdgpu_raven_me_bin.ko
                                      No          /boot/modules/amdgpu_raven_ce_bin.ko
                                      No          /boot/modules/amdgpu_raven_rlc_bin.ko
                                      No          /boot/modules/amdgpu_raven_mec_bin.ko
                                      No          /boot/modules/amdgpu_raven_mec2_bin.ko
                                      No          /boot/modules/amdgpu_raven_vcn_bin.ko
0xffffffff83000000 0xffffffff8324c000 Yes         ./boot/kernel/zfs.ko
So both amdgpu_raven_vcn_bin.ko and zfs.ko are not in the found_modules list before the failing point in the list, but the failure was not seen until the activity associated with zfs.ko 's load attempt. Note: So far, I'm operating without copies of /boot/modules/*.ko or any debug information for such. My guess is that port builds do not normally generate debug information for /boot/modules/*.ko files, so only public symbols for linking might show up for such. I had carelessly worked in a way that was referencing some files from my live system previously.
But those were not from the same vintage of drm-*-kmod and related. For reference: (kgdb) info files Symbols from "/usr/home/root/failing-13.4Rp2-kp1-freebsd-kernel-files/usr/lib/debug/boot/kernel/kernel.debug". Kernel core dump file: `/usr/home/root/failing-13.4Rp2-kp1-freebsd-kernel-files/vmcore.8', file type FreeBSD kernel vmcore. Local exec file: `/usr/home/root/failing-13.4Rp2-kp1-freebsd-kernel-files/boot/kernel/kernel', file type elf64-x86-64-freebsd. Entry point: 0xffffffff8038e000 0xffffffff802002a8 - 0xffffffff802002b5 is .interp 0xffffffff802002b8 - 0xffffffff80231108 is .hash 0xffffffff80231108 - 0xffffffff8025f9e4 is .gnu.hash 0xffffffff8025f9e8 - 0xffffffff802f24c0 is .dynsym 0xffffffff802f24c0 - 0xffffffff8036d162 is .dynstr 0xffffffff8036d168 - 0xffffffff8038db08 is .rela.dyn 0xffffffff8038e000 - 0xffffffff811843f8 is .text 0xffffffff81184400 - 0xffffffff817f68d0 is .rodata 0xffffffff817f68d0 - 0xffffffff817fba38 is set_sysctl_set 0xffffffff817fba38 - 0xffffffff817fef60 is set_modmetadata_set 0xffffffff817fef60 - 0xffffffff817fefb8 is set_cam_xpt_xport_set 0xffffffff817fefb8 - 0xffffffff817fefe0 is set_cam_xpt_proto_set 0xffffffff817fefe0 - 0xffffffff817ff028 is set_ah_chips 0xffffffff817ff028 - 0xffffffff817ff078 is set_ah_rfs 0xffffffff817ff078 - 0xffffffff817ff098 is set_kbddriver_set 0xffffffff817ff098 - 0xffffffff817ff150 is set_sdt_providers_set 0xffffffff817ff150 - 0xffffffff81800268 is set_sdt_probes_set 0xffffffff81800268 - 0xffffffff818035c8 is set_sdt_argtypes_set 0xffffffff818035c8 - 0xffffffff818035e0 is set_scterm_set 0xffffffff818035e0 - 0xffffffff81803608 is set_cons_set 0xffffffff81803608 - 0xffffffff81803610 is set_uart_acpi_class_and_device_set 0xffffffff81803620 - 0xffffffff81803660 is usb_host_id 0xffffffff81803660 - 0xffffffff81803680 is set_vt_drv_set 0xffffffff81803680 - 0xffffffff818036a8 is set_elf64_regset 0xffffffff818036a8 - 0xffffffff818036d8 is set_elf32_regset 0xffffffff818036d8 - 0xffffffff818036e8 is set_compressors 
0xffffffff818036e8 - 0xffffffff818036f0 is set_kdb_dbbe_set 0xffffffff818036f0 - 0xffffffff81803700 is set_ratectl_set 0xffffffff81803700 - 0xffffffff81803718 is set_crypto_set 0xffffffff81803718 - 0xffffffff81803730 is set_ieee80211_ioctl_getset 0xffffffff81803730 - 0xffffffff81803748 is set_ieee80211_ioctl_setset 0xffffffff81803748 - 0xffffffff81803770 is set_scanner_set 0xffffffff81803770 - 0xffffffff81803790 is set_videodriver_set 0xffffffff81803790 - 0xffffffff818037d8 is set_scrndr_set 0xffffffff818037d8 - 0xffffffff81803820 is set_vga_set 0xffffffff81803820 - 0xffffffff81804881 is kern_conf 0xffffffff81804884 - 0xffffffff818048a8 is .note.gnu.build-id 0xffffffff818048a8 - 0xffffffff8180493c is .eh_frame 0xffffffff81a00000 - 0xffffffff81a00140 is .dynamic 0xffffffff81a00140 - 0xffffffff81a01000 is .relro_padding 0xffffffff81c00000 - 0xffffffff81c00035 is .data.read_frequently 0xffffffff81c00040 - 0xffffffff81c017f4 is .data.read_mostly 0xffffffff81c01800 - 0xffffffff81c07680 is .data.exclusive_cache_line 0xffffffff81c08000 - 0xffffffff81d51248 is .data 0xffffffff81d51248 - 0xffffffff81d54688 is set_sysinit_set 0xffffffff81d54688 - 0xffffffff81d55e48 is set_sysuninit_set 0xffffffff81d55e80 - 0xffffffff81d592e8 is set_pcpu 0xffffffff81d592f0 - 0xffffffff81d82851 is set_vnet 0xffffffff81d82880 - 0xffffffff82200000 is .bss 0xffffffff82545000 - 0xffffffff82552000 is .text in ./boot/kernel/fusefs.ko 0xffffffff82552000 - 0xffffffff82554000 is .rodata in ./boot/kernel/fusefs.ko 0xffffffff82554000 - 0xffffffff82556874 is .data in ./boot/kernel/fusefs.ko 0xffffffff82556878 - 0xffffffff82556970 is set_sdt_probes_set in ./boot/kernel/fusefs.ko --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff82556970 - 0xffffffff82556ba0 is set_sdt_argtypes_set in ./boot/kernel/fusefs.ko 0xffffffff82556ba0 - 0xffffffff82556bd8 is set_sysinit_set in ./boot/kernel/fusefs.ko 0xffffffff82556bd8 - 0xffffffff82556bf8 is set_sysuninit_set in ./boot/kernel/fusefs.ko 
0xffffffff82556bf8 - 0xffffffff82556c60 is set_sysctl_set in ./boot/kernel/fusefs.ko 0xffffffff82556c60 - 0xffffffff82556cc0 is .bss in ./boot/kernel/fusefs.ko 0xffffffff82556cc0 - 0xffffffff82556cc8 is set_sdt_providers_set in ./boot/kernel/fusefs.ko 0xffffffff82556cc8 - 0xffffffff82556ce0 is set_modmetadata_set in ./boot/kernel/fusefs.ko 0xffffffff82556ce0 - 0xffffffff82556d04 is .note.gnu.build-id in ./boot/kernel/fusefs.ko 0xffffffff8256d000 - 0xffffffff8256f000 is .text in ./boot/kernel/sem.ko 0xffffffff8256f000 - 0xffffffff82570000 is .rodata in ./boot/kernel/sem.ko 0xffffffff82570000 - 0xffffffff8257095c is .data in ./boot/kernel/sem.ko 0xffffffff82570960 - 0xffffffff82570978 is set_sysctl_set in ./boot/kernel/sem.ko 0xffffffff82570978 - 0xffffffff82570988 is set_sysinit_set in ./boot/kernel/sem.ko 0xffffffff82570988 - 0xffffffff82570990 is set_sysuninit_set in ./boot/kernel/sem.ko 0xffffffff82570990 - 0xffffffff82570a10 is .bss in ./boot/kernel/sem.ko 0xffffffff82570a10 - 0xffffffff82570a28 is set_modmetadata_set in ./boot/kernel/sem.ko 0xffffffff82570a28 - 0xffffffff82570a4c is .note.gnu.build-id in ./boot/kernel/sem.ko 0xffffffff8298a000 - 0xffffffff8298b000 is .text in ./boot/kernel/iic.ko 0xffffffff8298b000 - 0xffffffff8298c000 is .rodata in ./boot/kernel/iic.ko 0xffffffff8298c000 - 0xffffffff8298c270 is .data in ./boot/kernel/iic.ko 0xffffffff8298c270 - 0xffffffff8298c280 is set_sysinit_set in ./boot/kernel/iic.ko 0xffffffff8298c280 - 0xffffffff8298c288 is set_sysuninit_set in ./boot/kernel/iic.ko 0xffffffff8298c288 - 0xffffffff8298c2a8 is set_modmetadata_set in ./boot/kernel/iic.ko 0xffffffff8298c2a8 - 0xffffffff8298c2b0 is .bss in ./boot/kernel/iic.ko 0xffffffff8298c2b0 - 0xffffffff8298c2d4 is .note.gnu.build-id in ./boot/kernel/iic.ko 0xffffffff83000000 - 0xffffffff8324c000 is .text in ./boot/kernel/zfs.ko 0xffffffff8324c000 - 0xffffffff832dc000 is .rodata in ./boot/kernel/zfs.ko 0xffffffff832dc000 - 0xffffffff832fe228 is .data in 
./boot/kernel/zfs.ko 0xffffffff832fe228 - 0xffffffff832fe318 is set_sysinit_set in ./boot/kernel/zfs.ko 0xffffffff832fe318 - 0xffffffff832fe398 is set_sysuninit_set in ./boot/kernel/zfs.ko 0xffffffff832fe400 - 0xffffffff833b79c8 is .bss in ./boot/kernel/zfs.ko 0xffffffff833b79c8 - 0xffffffff833b85e8 is set_sysctl_set in ./boot/kernel/zfs.ko 0xffffffff833b85e8 - 0xffffffff833b8810 is set_sdt_probes_set in ./boot/kernel/zfs.ko 0xffffffff833b8810 - 0xffffffff833b8c30 is set_sdt_argtypes_set in ./boot/kernel/zfs.ko 0xffffffff833b8c30 - 0xffffffff833b8c98 is set_modmetadata_set in ./boot/kernel/zfs.ko 0xffffffff833b8c98 - 0xffffffff833b8cbc is .note.gnu.build-id in ./boot/kernel/zfs.ko (kgdb)
Would a debug build of drm_510_kmod be of use here?
(In reply to George Mitchell from comment #248) A debug build of drm-510-kmod under 13.4-RELEASE would need to match up with vmcore.* examples that used the debug build. A non-debug build of drm-510-kmod under 13.4-RELEASE would need to match up with vmcore.* examples that used the non-debug build. So it is driven more by the vmcore.* content than by anything about debug vs. non-debug: which was in use? The only benefit that I know of from having the matching *.ko files for a vmcore.* is that "info files" would likely show the (correct) address ranges for the sections of the additional *.ko files that were loaded but that are missing from what I now have.
Which branch did you use for graphics/gpu-firmware-amd-kmod@raven: latest (updated 2024-Dec-14, possibly after the vmcore.8 context) or quarterly? I'm not sure that all of the boot/modules/*.ko files that I now have are what they should be to match vmcore.8 .
(In reply to Mark Millard from comment #250) Sorry for the mistaken reference. Actually: 2024-Dec-12 is the source commit and 2024-Dec-14 is the FreeBSD package distribution. It is unclear how far back in the commit sequence the most recent actual change to raven is.
(In reply to Mark Millard from comment #247) (Looks like you get if_re.ko via /boot/modules/ as well.) For reference: (kgdb) info sharedlibrary From To Syms Read Shared Object Library 0xffffffff82545000 0xffffffff82552000 Yes ./boot/kernel/fusefs.ko 0xffffffff8256d000 0xffffffff8256f000 Yes ./boot/kernel/sem.ko No /boot/modules/if_re.ko 0xffffffff82a00000 0xffffffff82cf5000 Yes (*) ./boot/modules/amdgpu.ko 0xffffffff82918000 0xffffffff8296d000 Yes (*) ./boot/modules/drm.ko 0xffffffff8298a000 0xffffffff8298b000 Yes ./boot/kernel/iic.ko 0xffffffff8298d000 0xffffffff8298f000 Yes (*) ./boot/modules/linuxkpi_gplv2.ko 0xffffffff82991000 0xffffffff82996000 Yes (*) ./boot/modules/dmabuf.ko 0xffffffff82998000 0xffffffff829a2000 Yes (*) ./boot/modules/ttm.ko 0xffffffff829a5000 0xffffffff829a6000 Yes (*) ./boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a8000 0xffffffff829a9000 Yes (*) ./boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829af000 0xffffffff829b0000 Yes (*) ./boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829de000 0xffffffff829df000 Yes (*) ./boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829e8000 0xffffffff829e9000 Yes (*) ./boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829f0000 0xffffffff829f1000 Yes (*) ./boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f7000 0xffffffff829f8000 Yes (*) ./boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff82e11000 0xffffffff82e12000 Yes (*) ./boot/modules/amdgpu_raven_rlc_bin.ko 0xffffffff82e1d000 0xffffffff82e1e000 Yes (*) ./boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e61000 0xffffffff82e62000 Yes (*) ./boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82ea5000 0xffffffff82ea6000 Yes (*) ./boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff83000000 0xffffffff8324c000 Yes ./boot/kernel/zfs.ko (kgdb) info file Symbols from "/usr/home/root/failing-13.4Rp2-kp1-freebsd-kernel-files/usr/lib/debug/boot/kernel/kernel.debug". 
Kernel core dump file: `/usr/home/root/failing-13.4Rp2-kp1-freebsd-kernel-files/vmcore.8', file type FreeBSD kernel vmcore. Local exec file: `/usr/home/root/failing-13.4Rp2-kp1-freebsd-kernel-files/boot/kernel/kernel', file type elf64-x86-64-freebsd. Entry point: 0xffffffff8038e000 0xffffffff802002a8 - 0xffffffff802002b5 is .interp 0xffffffff802002b8 - 0xffffffff80231108 is .hash 0xffffffff80231108 - 0xffffffff8025f9e4 is .gnu.hash 0xffffffff8025f9e8 - 0xffffffff802f24c0 is .dynsym 0xffffffff802f24c0 - 0xffffffff8036d162 is .dynstr 0xffffffff8036d168 - 0xffffffff8038db08 is .rela.dyn 0xffffffff8038e000 - 0xffffffff811843f8 is .text 0xffffffff81184400 - 0xffffffff817f68d0 is .rodata 0xffffffff817f68d0 - 0xffffffff817fba38 is set_sysctl_set 0xffffffff817fba38 - 0xffffffff817fef60 is set_modmetadata_set 0xffffffff817fef60 - 0xffffffff817fefb8 is set_cam_xpt_xport_set 0xffffffff817fefb8 - 0xffffffff817fefe0 is set_cam_xpt_proto_set 0xffffffff817fefe0 - 0xffffffff817ff028 is set_ah_chips 0xffffffff817ff028 - 0xffffffff817ff078 is set_ah_rfs 0xffffffff817ff078 - 0xffffffff817ff098 is set_kbddriver_set 0xffffffff817ff098 - 0xffffffff817ff150 is set_sdt_providers_set 0xffffffff817ff150 - 0xffffffff81800268 is set_sdt_probes_set 0xffffffff81800268 - 0xffffffff818035c8 is set_sdt_argtypes_set 0xffffffff818035c8 - 0xffffffff818035e0 is set_scterm_set 0xffffffff818035e0 - 0xffffffff81803608 is set_cons_set 0xffffffff81803608 - 0xffffffff81803610 is set_uart_acpi_class_and_device_set 0xffffffff81803620 - 0xffffffff81803660 is usb_host_id 0xffffffff81803660 - 0xffffffff81803680 is set_vt_drv_set 0xffffffff81803680 - 0xffffffff818036a8 is set_elf64_regset 0xffffffff818036a8 - 0xffffffff818036d8 is set_elf32_regset 0xffffffff818036d8 - 0xffffffff818036e8 is set_compressors 0xffffffff818036e8 - 0xffffffff818036f0 is set_kdb_dbbe_set 0xffffffff818036f0 - 0xffffffff81803700 is set_ratectl_set 0xffffffff81803700 - 0xffffffff81803718 is set_crypto_set 0xffffffff81803718 - 
0xffffffff81803730 is set_ieee80211_ioctl_getset 0xffffffff81803730 - 0xffffffff81803748 is set_ieee80211_ioctl_setset 0xffffffff81803748 - 0xffffffff81803770 is set_scanner_set 0xffffffff81803770 - 0xffffffff81803790 is set_videodriver_set 0xffffffff81803790 - 0xffffffff818037d8 is set_scrndr_set 0xffffffff818037d8 - 0xffffffff81803820 is set_vga_set 0xffffffff81803820 - 0xffffffff81804881 is kern_conf 0xffffffff81804884 - 0xffffffff818048a8 is .note.gnu.build-id 0xffffffff818048a8 - 0xffffffff8180493c is .eh_frame 0xffffffff81a00000 - 0xffffffff81a00140 is .dynamic 0xffffffff81a00140 - 0xffffffff81a01000 is .relro_padding 0xffffffff81c00000 - 0xffffffff81c00035 is .data.read_frequently 0xffffffff81c00040 - 0xffffffff81c017f4 is .data.read_mostly 0xffffffff81c01800 - 0xffffffff81c07680 is .data.exclusive_cache_line 0xffffffff81c08000 - 0xffffffff81d51248 is .data 0xffffffff81d51248 - 0xffffffff81d54688 is set_sysinit_set 0xffffffff81d54688 - 0xffffffff81d55e48 is set_sysuninit_set 0xffffffff81d55e80 - 0xffffffff81d592e8 is set_pcpu 0xffffffff81d592f0 - 0xffffffff81d82851 is set_vnet 0xffffffff81d82880 - 0xffffffff82200000 is .bss 0xffffffff82545000 - 0xffffffff82552000 is .text in ./boot/kernel/fusefs.ko 0xffffffff82552000 - 0xffffffff82554000 is .rodata in ./boot/kernel/fusefs.ko 0xffffffff82554000 - 0xffffffff82556874 is .data in ./boot/kernel/fusefs.ko 0xffffffff82556878 - 0xffffffff82556970 is set_sdt_probes_set in ./boot/kernel/fusefs.ko --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff82556970 - 0xffffffff82556ba0 is set_sdt_argtypes_set in ./boot/kernel/fusefs.ko 0xffffffff82556ba0 - 0xffffffff82556bd8 is set_sysinit_set in ./boot/kernel/fusefs.ko 0xffffffff82556bd8 - 0xffffffff82556bf8 is set_sysuninit_set in ./boot/kernel/fusefs.ko 0xffffffff82556bf8 - 0xffffffff82556c60 is set_sysctl_set in ./boot/kernel/fusefs.ko 0xffffffff82556c60 - 0xffffffff82556cc0 is .bss in ./boot/kernel/fusefs.ko 0xffffffff82556cc0 - 0xffffffff82556cc8 
is set_sdt_providers_set in ./boot/kernel/fusefs.ko 0xffffffff82556cc8 - 0xffffffff82556ce0 is set_modmetadata_set in ./boot/kernel/fusefs.ko 0xffffffff82556ce0 - 0xffffffff82556d04 is .note.gnu.build-id in ./boot/kernel/fusefs.ko 0xffffffff8256d000 - 0xffffffff8256f000 is .text in ./boot/kernel/sem.ko 0xffffffff8256f000 - 0xffffffff82570000 is .rodata in ./boot/kernel/sem.ko 0xffffffff82570000 - 0xffffffff8257095c is .data in ./boot/kernel/sem.ko 0xffffffff82570960 - 0xffffffff82570978 is set_sysctl_set in ./boot/kernel/sem.ko 0xffffffff82570978 - 0xffffffff82570988 is set_sysinit_set in ./boot/kernel/sem.ko 0xffffffff82570988 - 0xffffffff82570990 is set_sysuninit_set in ./boot/kernel/sem.ko 0xffffffff82570990 - 0xffffffff82570a10 is .bss in ./boot/kernel/sem.ko 0xffffffff82570a10 - 0xffffffff82570a28 is set_modmetadata_set in ./boot/kernel/sem.ko 0xffffffff82570a28 - 0xffffffff82570a4c is .note.gnu.build-id in ./boot/kernel/sem.ko 0xffffffff82a00000 - 0xffffffff82cf5000 is .text in ./boot/modules/amdgpu.ko 0xffffffff82cf5000 - 0xffffffff82dfc000 is .rodata in ./boot/modules/amdgpu.ko 0xffffffff82dfc000 - 0xffffffff82e09378 is .bss in ./boot/modules/amdgpu.ko 0xffffffff82e09380 - 0xffffffff82e11d74 is .data in ./boot/modules/amdgpu.ko 0xffffffff82e11d78 - 0xffffffff82e12150 is set_sysctl_set in ./boot/modules/amdgpu.ko 0xffffffff82e12150 - 0xffffffff82e12178 is set_sysinit_set in ./boot/modules/amdgpu.ko 0xffffffff82e12178 - 0xffffffff82e12188 is set_sysuninit_set in ./boot/modules/amdgpu.ko 0xffffffff82e12188 - 0xffffffff82e121e0 is set_modmetadata_set in ./boot/modules/amdgpu.ko 0xffffffff82e121e0 - 0xffffffff82e12204 is .note.gnu.build-id in ./boot/modules/amdgpu.ko 0xffffffff82918000 - 0xffffffff8296d000 is .text in ./boot/modules/drm.ko 0xffffffff8296d000 - 0xffffffff82989000 is .rodata in ./boot/modules/drm.ko 0xffffffff82989000 - 0xffffffff82989190 is .bss in ./boot/modules/drm.ko 0xffffffff82989190 - 0xffffffff8298a9a8 is .data in ./boot/modules/drm.ko 
0xffffffff8298a9a8 - 0xffffffff8298aa20 is set_sysinit_set in ./boot/modules/drm.ko 0xffffffff8298aa20 - 0xffffffff8298aa80 is set_sysuninit_set in ./boot/modules/drm.ko 0xffffffff8298aa80 - 0xffffffff8298ab50 is set_sysctl_set in ./boot/modules/drm.ko 0xffffffff8298ab50 - 0xffffffff8298ab5c is .data.read_mostly in ./boot/modules/drm.ko 0xffffffff8298ab60 - 0xffffffff8298abd8 is set_modmetadata_set in ./boot/modules/drm.ko 0xffffffff8298abd8 - 0xffffffff8298abfc is .note.gnu.build-id in ./boot/modules/drm.ko 0xffffffff8298a000 - 0xffffffff8298b000 is .text in ./boot/kernel/iic.ko 0xffffffff8298b000 - 0xffffffff8298c000 is .rodata in ./boot/kernel/iic.ko 0xffffffff8298c000 - 0xffffffff8298c270 is .data in ./boot/kernel/iic.ko 0xffffffff8298c270 - 0xffffffff8298c280 is set_sysinit_set in ./boot/kernel/iic.ko 0xffffffff8298c280 - 0xffffffff8298c288 is set_sysuninit_set in ./boot/kernel/iic.ko 0xffffffff8298c288 - 0xffffffff8298c2a8 is set_modmetadata_set in ./boot/kernel/iic.ko 0xffffffff8298c2a8 - 0xffffffff8298c2b0 is .bss in ./boot/kernel/iic.ko 0xffffffff8298c2b0 - 0xffffffff8298c2d4 is .note.gnu.build-id in ./boot/kernel/iic.ko 0xffffffff8298d000 - 0xffffffff8298f000 is .text in ./boot/modules/linuxkpi_gplv2.ko 0xffffffff8298f000 - 0xffffffff82990000 is .rodata in ./boot/modules/linuxkpi_gplv2.ko 0xffffffff82990000 - 0xffffffff829900c8 is .data in ./boot/modules/linuxkpi_gplv2.ko 0xffffffff829900c8 - 0xffffffff829900f0 is set_modmetadata_set in ./boot/modules/linuxkpi_gplv2.ko 0xffffffff829900f0 - 0xffffffff829900f8 is set_sysinit_set in ./boot/modules/linuxkpi_gplv2.ko 0xffffffff829900f8 - 0xffffffff829900fc is .bss in ./boot/modules/linuxkpi_gplv2.ko 0xffffffff829900fc - 0xffffffff82990120 is .note.gnu.build-id in ./boot/modules/linuxkpi_gplv2.ko 0xffffffff82991000 - 0xffffffff82996000 is .text in ./boot/modules/dmabuf.ko 0xffffffff82996000 - 0xffffffff82997000 is .rodata in ./boot/modules/dmabuf.ko 0xffffffff82997000 - 0xffffffff82997200 is .data in 
./boot/modules/dmabuf.ko 0xffffffff82997200 - 0xffffffff82997210 is set_modmetadata_set in ./boot/modules/dmabuf.ko 0xffffffff82997210 - 0xffffffff82997228 is set_sysinit_set in ./boot/modules/dmabuf.ko 0xffffffff82997228 - 0xffffffff82997240 is set_sysuninit_set in ./boot/modules/dmabuf.ko 0xffffffff82997240 - 0xffffffff829972d8 is .bss in ./boot/modules/dmabuf.ko 0xffffffff829972d8 - 0xffffffff829972fc is .note.gnu.build-id in ./boot/modules/dmabuf.ko --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff82998000 - 0xffffffff829a2000 is .text in ./boot/modules/ttm.ko 0xffffffff829a2000 - 0xffffffff829a3000 is .rodata in ./boot/modules/ttm.ko 0xffffffff829a3000 - 0xffffffff829a3500 is .data in ./boot/modules/ttm.ko 0xffffffff829a3500 - 0xffffffff829a3520 is set_sysinit_set in ./boot/modules/ttm.ko 0xffffffff829a3520 - 0xffffffff829a3538 is set_sysuninit_set in ./boot/modules/ttm.ko 0xffffffff829a3540 - 0xffffffff829a4720 is .bss in ./boot/modules/ttm.ko 0xffffffff829a4720 - 0xffffffff829a4758 is set_modmetadata_set in ./boot/modules/ttm.ko 0xffffffff829a4758 - 0xffffffff829a477c is .note.gnu.build-id in ./boot/modules/ttm.ko 0xffffffff829a5000 - 0xffffffff829a6000 is .text in ./boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a6000 - 0xffffffff829a7000 is .rodata in ./boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a7000 - 0xffffffff829a713c is rodata in ./boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a7140 - 0xffffffff829a71f0 is .data in ./boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a71f0 - 0xffffffff829a7210 is set_modmetadata_set in ./boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a7210 - 0xffffffff829a7218 is set_sysinit_set in ./boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a7218 - 0xffffffff829a723c is .note.gnu.build-id in ./boot/modules/amdgpu_raven_gpu_info_bin.ko 0xffffffff829a8000 - 0xffffffff829a9000 is .text in ./boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829a9000 - 
0xffffffff829aa000 is .rodata in ./boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829aa000 - 0xffffffff829ae400 is rodata in ./boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829ae400 - 0xffffffff829ae4b0 is .data in ./boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829ae4b0 - 0xffffffff829ae4d0 is set_modmetadata_set in ./boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829ae4d0 - 0xffffffff829ae4d8 is set_sysinit_set in ./boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829ae4d8 - 0xffffffff829ae4fc is .note.gnu.build-id in ./boot/modules/amdgpu_raven_sdma_bin.ko 0xffffffff829af000 - 0xffffffff829b0000 is .text in ./boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829b0000 - 0xffffffff829b1000 is .rodata in ./boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829b1000 - 0xffffffff829da200 is rodata in ./boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829da200 - 0xffffffff829da2b0 is .data in ./boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829da2b0 - 0xffffffff829da2d0 is set_modmetadata_set in ./boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829da2d0 - 0xffffffff829da2d8 is set_sysinit_set in ./boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829da2d8 - 0xffffffff829da2fc is .note.gnu.build-id in ./boot/modules/amdgpu_raven_asd_bin.ko 0xffffffff829de000 - 0xffffffff829df000 is .text in ./boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829df000 - 0xffffffff829e0000 is .rodata in ./boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829e0000 - 0xffffffff829e8300 is rodata in ./boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829e8300 - 0xffffffff829e83b0 is .data in ./boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829e83b0 - 0xffffffff829e83d0 is set_modmetadata_set in ./boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829e83d0 - 0xffffffff829e83d8 is set_sysinit_set in ./boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829e83d8 - 0xffffffff829e83fc is .note.gnu.build-id in ./boot/modules/amdgpu_raven_ta_bin.ko 0xffffffff829e8000 - 0xffffffff829e9000 is .text in 
./boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829e9000 - 0xffffffff829ea000 is .rodata in ./boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829ea000 - 0xffffffff829ef480 is rodata in ./boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829ef480 - 0xffffffff829ef530 is .data in ./boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829ef530 - 0xffffffff829ef550 is set_modmetadata_set in ./boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829ef550 - 0xffffffff829ef558 is set_sysinit_set in ./boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829ef558 - 0xffffffff829ef57c is .note.gnu.build-id in ./boot/modules/amdgpu_raven_pfp_bin.ko 0xffffffff829f0000 - 0xffffffff829f1000 is .text in ./boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f1000 - 0xffffffff829f2000 is .rodata in ./boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f2000 - 0xffffffff829f6480 is rodata in ./boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f6480 - 0xffffffff829f6530 is .data in ./boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f6530 - 0xffffffff829f6550 is set_modmetadata_set in ./boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f6550 - 0xffffffff829f6558 is set_sysinit_set in ./boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f6558 - 0xffffffff829f657c is .note.gnu.build-id in ./boot/modules/amdgpu_raven_me_bin.ko 0xffffffff829f7000 - 0xffffffff829f8000 is .text in ./boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff829f8000 - 0xffffffff829f9000 is .rodata in ./boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff829f9000 - 0xffffffff829fb480 is rodata in ./boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff829fb480 - 0xffffffff829fb530 is .data in ./boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff829fb530 - 0xffffffff829fb550 is set_modmetadata_set in ./boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff829fb550 - 0xffffffff829fb558 is set_sysinit_set in ./boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff829fb558 - 0xffffffff829fb57c is .note.gnu.build-id in ./boot/modules/amdgpu_raven_ce_bin.ko 0xffffffff82e11000 - 
0xffffffff82e12000 is .text in ./boot/modules/amdgpu_raven_rlc_bin.ko 0xffffffff82e12000 - 0xffffffff82e13000 is .rodata in ./boot/modules/amdgpu_raven_rlc_bin.ko --Type <RET> for more, q to quit, c to continue without paging-- 0xffffffff82e13000 - 0xffffffff82e1c8e4 is rodata in ./boot/modules/amdgpu_raven_rlc_bin.ko 0xffffffff82e1c8e8 - 0xffffffff82e1c998 is .data in ./boot/modules/amdgpu_raven_rlc_bin.ko 0xffffffff82e1c998 - 0xffffffff82e1c9b8 is set_modmetadata_set in ./boot/modules/amdgpu_raven_rlc_bin.ko 0xffffffff82e1c9b8 - 0xffffffff82e1c9c0 is set_sysinit_set in ./boot/modules/amdgpu_raven_rlc_bin.ko 0xffffffff82e1c9c0 - 0xffffffff82e1c9e4 is .note.gnu.build-id in ./boot/modules/amdgpu_raven_rlc_bin.ko 0xffffffff82e1d000 - 0xffffffff82e1e000 is .text in ./boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e1e000 - 0xffffffff82e1f000 is .rodata in ./boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e1f000 - 0xffffffff82e60710 is rodata in ./boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e60710 - 0xffffffff82e607c0 is .data in ./boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e607c0 - 0xffffffff82e607e0 is set_modmetadata_set in ./boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e607e0 - 0xffffffff82e607e8 is set_sysinit_set in ./boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e607e8 - 0xffffffff82e6080c is .note.gnu.build-id in ./boot/modules/amdgpu_raven_mec_bin.ko 0xffffffff82e61000 - 0xffffffff82e62000 is .text in ./boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82e62000 - 0xffffffff82e63000 is .rodata in ./boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82e63000 - 0xffffffff82ea4710 is rodata in ./boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82ea4710 - 0xffffffff82ea47c0 is .data in ./boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82ea47c0 - 0xffffffff82ea47e0 is set_modmetadata_set in ./boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82ea47e0 - 0xffffffff82ea47e8 is set_sysinit_set in ./boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82ea47e8 - 
0xffffffff82ea480c is .note.gnu.build-id in ./boot/modules/amdgpu_raven_mec2_bin.ko 0xffffffff82ea5000 - 0xffffffff82ea6000 is .text in ./boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff82ea6000 - 0xffffffff82ea7000 is .rodata in ./boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff82ea7000 - 0xffffffff82f003e0 is rodata in ./boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff82f003e0 - 0xffffffff82f00490 is .data in ./boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff82f00490 - 0xffffffff82f004b0 is set_modmetadata_set in ./boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff82f004b0 - 0xffffffff82f004b8 is set_sysinit_set in ./boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff82f004b8 - 0xffffffff82f004dc is .note.gnu.build-id in ./boot/modules/amdgpu_raven_vcn_bin.ko 0xffffffff83000000 - 0xffffffff8324c000 is .text in ./boot/kernel/zfs.ko 0xffffffff8324c000 - 0xffffffff832dc000 is .rodata in ./boot/kernel/zfs.ko 0xffffffff832dc000 - 0xffffffff832fe228 is .data in ./boot/kernel/zfs.ko 0xffffffff832fe228 - 0xffffffff832fe318 is set_sysinit_set in ./boot/kernel/zfs.ko 0xffffffff832fe318 - 0xffffffff832fe398 is set_sysuninit_set in ./boot/kernel/zfs.ko 0xffffffff832fe400 - 0xffffffff833b79c8 is .bss in ./boot/kernel/zfs.ko 0xffffffff833b79c8 - 0xffffffff833b85e8 is set_sysctl_set in ./boot/kernel/zfs.ko 0xffffffff833b85e8 - 0xffffffff833b8810 is set_sdt_probes_set in ./boot/kernel/zfs.ko 0xffffffff833b8810 - 0xffffffff833b8c30 is set_sdt_argtypes_set in ./boot/kernel/zfs.ko 0xffffffff833b8c30 - 0xffffffff833b8c98 is set_modmetadata_set in ./boot/kernel/zfs.ko 0xffffffff833b8c98 - 0xffffffff833b8cbc is .note.gnu.build-id in ./boot/kernel/zfs.ko
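One way to sanity-check whether the *.ko files on hand match a vmcore is to look for section ranges that overlap in the "info files" output, since distinct loaded modules should not share addresses. The following is only an illustrative Python sketch: the function name find_overlaps is hypothetical, the four ranges are copied from the listing above, and treating an overlap as a hint of mismatched files is an assumption, not something kgdb reports.

```python
# Section ranges copied from the "info files" listing above: (start, end, label).
SECTIONS = [
    (0xffffffff82989190, 0xffffffff8298a9a8, "drm.ko .data"),
    (0xffffffff8298a000, 0xffffffff8298b000, "iic.ko .text"),
    (0xffffffff8298d000, 0xffffffff8298f000, "linuxkpi_gplv2.ko .text"),
    (0xffffffff83000000, 0xffffffff8324c000, "zfs.ko .text"),
]


def find_overlaps(sections):
    """Return (label_a, label_b) pairs whose half-open [start, end) ranges overlap."""
    ordered = sorted(sections, key=lambda s: s[0])
    overlaps = []
    for i, (_start_a, end_a, label_a) in enumerate(ordered):
        for start_b, _end_b, label_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so no later range can overlap this one
            overlaps.append((label_a, label_b))
    return overlaps


overlaps = find_overlaps(SECTIONS)
for a, b in overlaps:
    print(f"overlap: {a} <-> {b}")
```

With these inputs the drm.ko .data range and the iic.ko .text range are reported as overlapping, which is the kind of inconsistency that would be expected if some of the files used by kgdb are not the ones that were loaded when the dump was taken.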
(Answering a bunch of comments.) I'm using non-debug builds of all ports. I'm using locally built ports, but I will probably be phasing that out. All the gpu-firmware-amd-kmods are version 20220511, and drm-510-kmod is 5.10.163_10. I am using realtek-re-kmod-197.00, locally built. My /boot/loader.conf says:
    if_re_load="YES"
    if_re_name="/boot/modules/if_re.ko"
    loader_logo="beastie"
    sem_load="YES"
    hw.vga.textmode="1"
    hw.syscons.disable="1"
    fusefs_load="YES"
    dumpdev="/dev/ada0p3"
It looks like there's a version 20230625 for graphics/gpu-firmware-amd-kmod; I guess I should update that. Hmmm - now I'm thoroughly confused, because I typed "portmaster -BDg graphics/gpu-firmware-amd-kmod" and it seems to have compiled the "aldebaran" flavor (version 20230625) while leaving everything else (meaning 85 other installed packages) alone (version 20220511).
(In reply to George Mitchell from comment #253) QUOTE Hmmm - now I'm thoroughly confused, because I typed "portmaster -BDg graphics/gpu-firmware-amd-kmod" and it seems to have compiled the "aldebaran" flavor (version 20230625) while leaving everything else (meaning 85 other installed packages) alone (version 20220511). END QUOTE
The default flavor is the first in the list unless extra work was done to control it in the port:
    PKGNAMESUFFIX= -${FLAVOR:C/_/-/g}
    FLAVORS= aldebaran \
             arcturus \
             banks \
             beige_goby \
             . . .
You implicitly requested that only the aldebaran flavor be built. You probably meant something like the following (at least poudriere has such @all support):
    portmaster -BDg graphics/gpu-firmware-amd-kmod@all
but looking at the man page I see no hint of portmaster supporting use of @all . Maybe the ports Makefile handling does such automatically for @all use? That might make portmaster also work. In your case you likely could use:
    portmaster -BDg graphics/gpu-firmware-amd-kmod@raven
for what we are investigating. The extra notation for setting a default that is not the starting value in FLAVORS is like:
    FLAVOR?= SOMEFLAVOR
in the Makefile . There is a notation for referencing into the FLAVORS list to pick out the default:
    FLAVOR?= ${FLAVORS:[1]}
The port in question does not seem to have these. On make command lines, FLAVOR=SOMEFLAVOR can be used.
Side note: The fact that you use portmaster instead of poudriere or poudriere-devel (or other such) is likely something else that should generally be published for a self-built context: it tells folks something about how much they need to be worried about odd interactions from an unclean build environment. It is commonly more difficult to make portmaster or Makefile based builds reproducible across systems and across other context switching: the builds use more of the variations in the live contexts.
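The flavor selection described above can be modeled roughly as follows. This is a Python sketch and not the ports framework itself: default_flavor and pkg_name_suffix are hypothetical helper names, and the list holds only the four flavors quoted from the port's Makefile.

```python
import re

# First few entries of FLAVORS as quoted from the port's Makefile.
FLAVORS = ["aldebaran", "arcturus", "banks", "beige_goby"]


def default_flavor(flavors, explicit=None):
    """Model of the rule above: an explicit FLAVOR wins, else the first list entry."""
    return explicit if explicit is not None else flavors[0]


def pkg_name_suffix(flavor):
    """Model of PKGNAMESUFFIX= -${FLAVOR:C/_/-/g}: every underscore becomes a hyphen."""
    return "-" + re.sub("_", "-", flavor)


print(default_flavor(FLAVORS))           # no flavor requested
print(default_flavor(FLAVORS, "raven"))  # flavor requested, as with @raven
print(pkg_name_suffix("beige_goby"))
```

With no flavor given, the model picks aldebaran, which matches what portmaster was observed to build.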
Apparently I overlooked the appearance of flavors in this port. If I believe pciconf -lv, I do indeed have a Raven Ridge chip. But if drm-510-kmod is going to preemptively load all the firmware packages, it's very surprising (to me, anyway) that a simple compile of the port doesn't preemptively compile all of the flavors. Certainly it did that by default before the appearance of flavors in the port. I'll see if this changes any behavior.
(In reply to George Mitchell from comment #255) The various ports have been set up to allow avoiding installing unnecessary gpu-firmware ones. Just install: graphics/gpu-firmware-amd-kmod@raven If you have more installed, you might want to first uninstall what you have and then do the above so that only raven related things are present. The loading behavior shown in our investigative materials shows that the system is picking out raven related materials as what to load for the amdgpu_* naming:
0xffffffff829a5000 0xffffffff829a6000 Yes (*) ./boot/modules/amdgpu_raven_gpu_info_bin.ko
0xffffffff829a8000 0xffffffff829a9000 Yes (*) ./boot/modules/amdgpu_raven_sdma_bin.ko
0xffffffff829af000 0xffffffff829b0000 Yes (*) ./boot/modules/amdgpu_raven_asd_bin.ko
0xffffffff829de000 0xffffffff829df000 Yes (*) ./boot/modules/amdgpu_raven_ta_bin.ko
0xffffffff829e8000 0xffffffff829e9000 Yes (*) ./boot/modules/amdgpu_raven_pfp_bin.ko
0xffffffff829f0000 0xffffffff829f1000 Yes (*) ./boot/modules/amdgpu_raven_me_bin.ko
0xffffffff829f7000 0xffffffff829f8000 Yes (*) ./boot/modules/amdgpu_raven_ce_bin.ko
0xffffffff82e11000 0xffffffff82e12000 Yes (*) ./boot/modules/amdgpu_raven_rlc_bin.ko
0xffffffff82e1d000 0xffffffff82e1e000 Yes (*) ./boot/modules/amdgpu_raven_mec_bin.ko
0xffffffff82e61000 0xffffffff82e62000 Yes (*) ./boot/modules/amdgpu_raven_mec2_bin.ko
0xffffffff82ea5000 0xffffffff82ea6000 Yes (*) ./boot/modules/amdgpu_raven_vcn_bin.ko
(In reply to Mark Millard from comment #256) There is no reason for you to build what you do not need to install now that such is allowed. The flavored graphics/gpu-firmware-amd-kmod dates back to: 2022-05-01 and raven was present at the time. It is very different for the official package builders: No user-installation directly but making everything available for a wide variety of installation contexts: build all but allow installing just what is needed based on using flavors to advantage. Default build procedures are biased to the official-builder context. That is why there is a graphics/gpu-firmware-kmod that builds or installs all the gpu-firmware* but also a graphics/gpu-firmware-amd-kmod that supports using just graphics/gpu-firmware-amd-kmod@raven to build or install just the one variant. Building graphics/drm-510-kmod does not build any graphics/gpu-firmware*-kmod as far as I know for how things are now. I'll note that, in my view, testing the official builds that FreeBSD does to produce the packages is appropriate, even if you wanted to go back to building your own. Why? Being able to compare/contrast the two. If one way things just work and the other way they fail, that is significant. Also, if you demonstrate the failure using official package builds, you are more likely to get support for the problem.
(In reply to Mark Millard from comment #257) I forgot to write: Installing graphics/drm-510-kmod also does not install any graphics/gpu-firmware*-kmod as far as I know for how things are now. And I forgot to list what does have the dependency structure to bundle it all for builds or installation: graphics/drm-kmod . It picks between 510 , 515 , and 61 . It does depend on: graphics/gpu-firmware-kmod that in turn has a run-dependency on the various gpu firmware flavors. It is more biased to creating a context ready for most anything supported, much as the official package builders would want for building, for example. But drm-kmod does not have to be used.
I think you are looking in the wrong direction. The question is where the NULL pointer comes from. So let's look at the 'found_modules->tqh_first->link.tqe_next->. . .->link.tqe_next' instance. This list is managed only by sys/kern/kern_linker.c, and there is only one point where an insert happens:
```
static modlist_t
modlist_newmodule(const char *modname, int version, linker_file_t container)
{
	modlist_t mod;

	mod = malloc(sizeof(struct modlist), M_LINKER, M_NOWAIT | M_ZERO);
	if (mod == NULL)
		panic("no memory for module list");
	mod->container = container;
	mod->name = modname;
	mod->version = version;
	TAILQ_INSERT_TAIL(&found_modules, mod, link);
	return (mod);
}
```
So I would guess the +7 is from the TAILQ list and the fake NULL pointer is directly from malloc(9). So a build with MALLOC_DEBUG might help. I have also looked a bit for PHYS_TO_DMAP in sys/compat/linuxkpi and found arch_io_reserve_memtype_wc(). This function is used in drm-kmod/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:
```
int r = arch_io_reserve_memtype_wc(adev->gmc.aper_base, adev->gmc.aper_size);
if (r) {
	DRM_ERROR("Unable to set WC memtype for the aperture base\n");
#ifdef __linux__
	/*
	 * BSDFIXME: On recent AMD GPU requested area crosses
	 * DMAP boundries resulting in error. Ignore it for now
	 */
	return r;
#endif
}
```
This could also sneak in a fake NULL pointer and cause UB.
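To make the "+7" and "fake NULL pointer" reading concrete, the arithmetic can be sketched as below. This is a Python illustration under stated assumptions: DMAP_MIN_ADDRESS is assumed to be the amd64 direct-map base 0xfffff80000000000, the 8-byte alignment expectation for a malloc(9)'d struct modlist is an assumption, and describe_pointer is a hypothetical helper name.

```python
# Assumed amd64 direct-map base; PHYS_TO_DMAP(pa) is modeled as base + pa.
DMAP_MIN_ADDRESS = 0xfffff80000000000


def describe_pointer(value, alignment=8):
    """Split a kernel address into its offset from the DMAP base and an alignment flag."""
    return {
        "dmap_offset": value - DMAP_MIN_ADDRESS,
        "aligned": value % alignment == 0,
    }


info = describe_pointer(0xfffff80000000007)
print(info)
```

Under those assumptions the value decomposes into the direct-map base plus 7, and it is not 8-byte aligned, so it does not look like a pointer that an allocator would hand out for a structure.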
(In reply to satanist+freebsd from comment #259) In:
    mod = malloc(sizeof(struct modlist), M_LINKER, M_NOWAIT | M_ZERO);
    if (mod == NULL)
        panic("no memory for module list");
    mod->container = container;
if something similar to mod == 0xfffff80000000007 resulted, it appears to me that the dereference in mod->container or the like would have gotten a general protection fault right there, given that the later actual failure, when it happens, is caused by that same 0xfffff80000000007 value. I'll note also that, for example, one of the historical crashes involving 0xfffff80000000007 was in handling a different list:
    /*
     * Remove the references to the thread from all of the objects we were
     * polling.
     */
    static void
    seltdclear(struct thread *td)
    {
        struct seltd *stp;
        struct selfd *sfp;
        struct selfd *sfn;

        stp = td->td_sel;
        STAILQ_FOREACH_SAFE(sfp, &stp->st_selq, sf_link, sfn)
            selfdfree(stp, sfp);
        stp->st_flags = 0;
    }
so the issue does not appear to be list specific, even if one list fails more commonly than others for some reason. I do not know if there is some relevant relationship with the likes of code from: drm-kmod/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c for alternate failure points. No simple reproduction test has ever been discovered. MALLOC_DEBUG is controlled in the kernel via sys/kern/kern_malloc.c having the code:
    #if defined(INVARIANTS) || defined(MALLOC_MAKE_FAILURES) || \
        defined(DEBUG_MEMGUARD) || defined(DEBUG_REDZONE)
    #define MALLOC_DEBUG 1
    #endif
It, in turn, leads to definition and use of the kernel's malloc_dbg() and free_dbg(). I certainly have no objection to such testing, say via using an INVARIANTS based kernel build. But I'm not testing, having no context to use to reproduce the problem with. I'm just looking at vmcore.* file(s) via kgdb . But I'll also note that recently we appear to have learned that some of the software in use was rather old and not being updated, so not tracking kernel updates.
Testing if the modern software built to match the kernel in use also produces the problems seems appropriate, as that is what would be changed if there is still a bug to be fixed. As I understand it, that testing is what is going on now.
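The point that a bad link value only hurts when a later traversal follows it can be shown with a toy model. This Python sketch is not the kernel code: memory is a plain dict standing in for mapped addresses, the node addresses 0x1000 and 0x2000 and the names Fault and lookup are made up for illustration, and the model raises as soon as the bad link is followed, whereas in the reported traces the fault surfaces slightly later, inside strcmp.

```python
class Fault(Exception):
    """Stands in for a general protection fault in this toy model."""


def lookup(memory, head, name):
    """Walk a singly linked list of {'name', 'next'} nodes keyed by address."""
    addr = head
    while addr is not None:
        if addr not in memory:
            raise Fault(hex(addr))  # followed a link that points nowhere valid
        node = memory[addr]
        if node["name"] == name:
            return addr
        addr = node["next"]
    return None


# Two valid nodes; the second one carries the corrupted link value from the thread.
memory = {
    0x1000: {"name": "kernel", "next": 0x2000},
    0x2000: {"name": "fusefs", "next": 0xfffff80000000007},
}

found = lookup(memory, 0x1000, "fusefs")  # found before the bad link is reached
try:
    lookup(memory, 0x1000, "zfs")         # must walk past the bad link
    faulted_at = None
except Fault as exc:
    faulted_at = str(exc)
print(hex(found), faulted_at)
```

In this model the corruption goes unnoticed for lookups that succeed before reaching the damaged link, and whichever lookup first walks past it is the one that fails, regardless of which module name was being searched for.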
Created attachment 256016 [details] core.txt.9 There have been fewer crashes related to this bug recently. But I do have two more. "core.txt.9" is basically the very same as the December 13 "Latest crash dump/text," in that zfs.ko was the other module involved in the crash. So I reordered the loading of kernel modules, moving zfs.ko after vboxnetflt.ko and acpi_wmi.ko. Sure enough, in "core.txt.0" (from a few moments ago) it's vboxnetflt.ko instead of zfs.ko that is involved in the crash.
Created attachment 256017 [details] core.txt.0
(In reply to George Mitchell from comment #261) Looks like the build is somewhat different, possibly more of a debug build? I think this is the first time that mod = 0xfffff80000000007 has been reported for inside modlist_lookup : core.txt.9 : #6 <signal handler called> No locals. #7 strcmp (s1=<optimized out>, s2=<optimized out>) at /usr/src/sys/libkern/strcmp.c:44 No locals. #8 0xffffffff80bc0ab4 in modlist_lookup (name=0xffffffff83255959 "zfsctrl", ver=1) at /usr/src/sys/kern/kern_linker.c:1488 mod = 0xfffff80000000007 core.txt.0 : #6 <signal handler called> No locals. #7 strcmp (s1=<optimized out>, s2=<optimized out>) at /usr/src/sys/libkern/strcmp.c:44 No locals. #8 0xffffffff80bc0ab4 in modlist_lookup ( name=0xffffffff829fd0c4 "vboxnetflt", ver=1) at /usr/src/sys/kern/kern_linker.c:1488 mod = 0xfffff80000000007 If so, I'll need to synchronize to any updated files that I'd previously downloaded, not just the vmcore.[90] files. (The mod value is not a surprise. It is from the same linking field that was found to have 0xfffff80000000007 as its value in the earlier vmcore.8 .)
The last update I made to drm-510-kmod was on December 7. The only change more recent than that was the order in which I kldload modules at boot time. For a long time, that was zfs, vboxnetflt, acpi_wmi; and just yesterday I moved zfs to last.
(In reply to George Mitchell from comment #264) What about amdgpu_raven*.ko and the like from graphics/gpu-firmware-amd-kmod@raven ( so: gpu-firmware-amd-kmod-raven-20230625_2 )? Did you rebuild and install it? Install an official FreeBSD package? You had indicated that you had accidentally only been updating @aldebaran the way that you were doing things with portmaster: The flavored graphics/gpu-firmware-amd-kmod dates back to: 2022-05-01 and raven was present at the time. Rebuilds of graphics/gpu-firmware-amd-kmod via portmaster without the @raven being explicit were not rebuilding raven's files ever since. I'll note that rebuilding/installing drm-510-kmod does not rebuild/install graphics/gpu-firmware-amd-kmod . (What does span both a drm-*-kmod and all the graphics/gpu-firmware-amd-kmod@??? is drm-kmod . But that builds and installs a lot of unnecessary materials relative to most personal contexts. drm-510-kmod and graphics/gpu-firmware-amd-kmod@raven may be more reasonable.) I gather: no updates to the FreeBSD kernel files, such as a GENERIC-DEBUG build and install? Same files that I used with vmcore.8 ? I do not think I've ever had a copy of your locally built realtek-re-kmod-197.00 *.ko file.
I did install a new version of gpu-firmware-amd-kmod-raven-20230625.1304000_2 on December 18, after "Latest crash dump/text" but before both "core.txt.9" and "core.txt.0". No kernel changes since December 4 when I updated from 13.3 to 13.4.