Created attachment 237279 [details]
/var/crash/core.txt.1 crash description

It doesn't happen every time. If I use kld_list="amdgpu" in /etc/rc.conf, it happens close to 50% of the time. If instead I boot to single user mode and manually kldload amdgpu, it happens maybe 20% of the time. If I have amdgpu_load="YES" in /boot/loader.conf, the module fails to load at all, without saying anything. FreeBSD 13.1-RELEASE-p2, drm-510-kmod-5.10.113_7, AMD Ryzen 3 2200G with Radeon Vega Graphics. Crashes are always general protection fault panics, replete with complaints about drm_modeset_is_locked being false.
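For reference, the three load paths described in this report can be sketched as configuration fragments (the amdgpu lines are the only relevant parts; everything else in these files is assumed):

```shell
# 1. Load via rc(8) during multi-user boot (crashes ~50% of boots here):
#    in /etc/rc.conf
kld_list="amdgpu"

# 2. Manual load from the single-user shell (crashes ~20% of the time):
#    kldload amdgpu

# 3. Load from the boot loader (silently fails to load here):
#    in /boot/loader.conf
amdgpu_load="YES"
```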
This is a marginal improvement over FreeBSD 12, where kldload amdgpu would always immediately lock up the machine completely, with no recovery path short of powering down and back on. And when this crash DOESN'T happen, everything works marvelously well (and considerably better than running in VESA mode), so thanks for the work so far!
I have four more of the /var/crash/core.txt files, and core dumps (very large, too big to attach here even compressed) for each of them.
Thank you, and please note that issues for <https://github.com/freebsd/drm-kmod> are normally raised in GitHub.
Ugh, I don't have a GitHub account and I would rather not open one. (Yes, that does seem selfish of me and I apologize.)
From a _very_ quick look, it does not appear that this is an amdgpu problem. The crash is in the core kernel code and the stack trace has mentions of zfs:

#6  <signal handler called>
#7  strcmp (s1=<optimized out>, s2=<optimized out>) at /usr/src/sys/libkern/strcmp.c:46
#8  0xffffffff80be8c3d in modlist_lookup (name=0xfffff80004b71000 "zfs", ver=0) at /usr/src/sys/kern/kern_linker.c:1487
#9  modlist_lookup2 (name=0xfffff80004b71000 "zfs", verinfo=0x0) at /usr/src/sys/kern/kern_linker.c:1501
#10 linker_load_module (kldname=kldname@entry=0x0, modname=modname@entry=0xfffff80004b71000 "zfs", parent=parent@entry=0x0, verinfo=<optimized out>, verinfo@entry=0x0, lfpp=lfpp@entry=0xfffffe0075fddd90) at /usr/src/sys/kern/kern_linker.c:2165
#11 0xffffffff80beb17a in kern_kldload (td=td@entry=0xfffffe007f505a00, file=<optimized out>, file@entry=0xfffff80004b71000 "zfs", fileid=fileid@entry=0xfffffe0075fddde4) at /usr/src/sys/kern/kern_linker.c:1150
#12 0xffffffff80beb29b in sys_kldload (td=0xfffffe007f505a00, uap=<optimized out>) at /usr/src/sys/kern/kern_linker.c:1173
#13 0xffffffff810ae6ec in syscallenter (td=0xfffffe007f505a00) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#14 amd64_syscall (td=0xfffffe007f505a00, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1185

To the reporter: do you by chance have zfs in kld_list?
Also, how is the root file system tuned?

tunefs -p /

(In reply to George Mitchell from comment #1)

> … from FreeBSD 12 …

Did you run 13.0… for a while, or did you upgrade from 12.… direct to 13.1…?
> … immediately after kldload amdgpu …

(In reply to George Mitchell from comment #0)

If I understand correctly, the attachment shows:

1. kldload amdgpu whilst in single user mode
2. a subsequent, but non-immediate, exit (^D) to multi-user mode
3. panic

…
ugen0.4: <Logitech USB Optical Mouse> at usbus0
<118>Enter full pathname of shell or RETURN for /bin/sh:
Cannot read termcap database;
<118>using dumb terminal settings.
<118>root@:/ # kldload amdgpu
<6>[drm] amdgpu kernel modesetting enabled.
…
<6>[drm] Initialized amdgpu 3.40.0 20150101 for drmn0 on minor 0
<118>root@:/ # ^D
<118>Setting hostuuid: 032e02b4-0499-0547-c106-430700080009.
<118>Setting hostid: 0x82f0750c.

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff80d17870
stack pointer       = 0x28:0xfffffe0075fdda60
frame pointer       = 0x28:0xfffffe0075fdda60
code segment        = base rx0, limit 0xfffff, type 0x1b
                    = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process     = 52 (kldload)
…
Thanks for the work so far. "zfs" is not explicitly in the kld_list, but I do use ZFS and zfs_enable is set to "YES". Also:

tunefs: POSIX.1e ACLs: (-a) disabled
tunefs: NFSv4 ACLs: (-N) disabled
tunefs: MAC multilabel: (-l) disabled
tunefs: soft updates: (-n) disabled
tunefs: soft update journaling: (-j) disabled
tunefs: gjournal: (-J) disabled
tunefs: trim: (-t) disabled
tunefs: maximum blocks per file in a cylinder group: (-e) 4096
tunefs: average file size: (-f) 16384
tunefs: average number of files in a directory: (-s) 64
tunefs: minimum percentage of free space: (-m) 8%
tunefs: space to hold for metadata blocks: (-k) 6408
tunefs: optimization preference: (-o) time
tunefs: volume label: (-L)

I never ran 13.0; I'm always leery of upgrading to x.0 from x-1. (My upgrade was from 12.3-p6.) Also, I still remember a collection of severe crashes from years back with soft updates plus journaling. Are those problems known to be solved now? (Sorry to be getting off the main topic.)
In this particular crash, I manually loaded amdgpu in single-user mode, and then immediately hit control-D.
sysrc -f /etc/rc.conf kld_list

– is there amdgpu alone, or are other modules listed?

(In reply to George Mitchell from comment #9)

Given the brief analysis by avg@ (comment #5), I'm inclined to:

* view the load of amdgpu as successful
* give thought to other modules, ones that are (or should be) subsequently loaded.

Do you use IRC, Matrix (e.g. Element) or Discord?
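For anyone unfamiliar with sysrc(8), the query and a possible way of setting the value look like this (a sketch; the `+=` append syntax is from my reading of the manual page, so treat it as an assumption):

```shell
# Print the current value of kld_list from a specific rc.conf;
# output has the form "kld_list: amdgpu" (empty after the colon if unset).
sysrc -f /etc/rc.conf kld_list

# Append a module to the list rather than overwriting it.
sysrc kld_list+="amdgpu"
```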
(In reply to George Mitchell from comment #8)

> … crashes from years back with soft updates plus journaling.
> Are those problems known to be solved now? …

For what's described: without a bug number, it might be impossible for me to tell.

> … I never ran 13.0; …

13.1 fixed a bug that involved soft updates _without_ soft update journaling:
<https://www.freebsd.org/releases/13.1R/relnotes/#storage-ufs>

<https://docs.freebsd.org/en/books/handbook/config/#soft-updates> recommends soft updates. If there's no explicit recommendation to also enable soft update journaling, this could be because (bug 261944) there's not yet, in the Handbook, a suitable explanation of the feature.

tunefs(8) <https://www.freebsd.org/cgi/man.cgi?query=tunefs&sektion=8&manpath=FreeBSD> for FreeBSD 13.1-RELEASE lacks a recently added explanation; you can see it by switching the online view of the manual page to FreeBSD 14.0-CURRENT.
Without amdgpu in the kld_list, kld_list currently is not even defined. Perhaps it's more helpful to show what gets loaded aside from amdgpu in the course of a normal boot:

kldstat
1 64 0xffffffff80200000 1f300f0 kernel
2 1 0xffffffff82132000 77e0 sem.ko
3 3 0xffffffff8213a000 8cc90 vboxdrv.ko
4 1 0xffffffff82600000 3df128 zfs.ko
5 2 0xffffffff82518000 4240 vboxnetflt.ko
6 2 0xffffffff8251d000 aac8 netgraph.ko
7 1 0xffffffff82528000 31c8 ng_ether.ko
8 1 0xffffffff8252c000 55e0 vboxnetadp.ko
9 1 0xffffffff82532000 3378 acpi_wmi.ko
10 1 0xffffffff82536000 3218 intpm.ko
11 1 0xffffffff8253a000 2180 smbus.ko
12 1 0xffffffff8253d000 33c0 uslcom.ko
13 1 0xffffffff82541000 4d90 ucom.ko
14 1 0xffffffff82546000 2340 uhid.ko
15 1 0xffffffff82549000 3380 usbhid.ko
16 1 0xffffffff8254d000 31f8 hidbus.ko
17 1 0xffffffff82551000 3320 wmt.ko
18 1 0xffffffff82555000 4350 ums.ko
19 1 0xffffffff8255a000 5af8 autofs.ko
20 1 0xffffffff82560000 2a08 mac_ntpd.ko
21 1 0xffffffff82563000 20f0 green_saver.ko

The SU+J thing is totally anecdotal, based on what I used to see on freebsd-hackers. Right now, I format my disks with UFS for root/var/tmp (no more than 8GB for fast fscking), and then a ZFS partition for /usr. I don't use IRC, Matrix, or Element (not sure what those last two are) and on the rare occasions I use Discord, I use the web site.
As of today, with version drm-510-kmod-5.10.113_8:

1. I can reliably prevent a crash by booting to single user mode, manually kldloading amdgpu, and continuing (typing control-d). dmesg then reports:

[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
[drm] initializing kernel modesetting (RAVEN 0x1002:0x15DD 0x1458:0xD000 0xC8).
drmn0: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[drm] register mmio base: 0xFE600000
[drm] register mmio size: 524288
[drm] add ip block number 0 <soc15_common>
[drm] add ip block number 1 <gmc_v9_0>
[drm] add ip block number 2 <vega10_ih>
[drm] add ip block number 3 <psp>
[drm] add ip block number 4 <gfx_v9_0>
[drm] add ip block number 5 <sdma_v4_0>
[drm] add ip block number 6 <powerplay>
[drm] add ip block number 7 <dm>
[drm] add ip block number 8 <vcn_v1_0>
drmn0: successfully loaded firmware image 'amdgpu/raven_gpu_info.bin'
[drm] BIOS signature incorrect 44 f
drmn0: Fetched VBIOS from ROM BAR
amdgpu: ATOM BIOS: 113-RAVEN-111
drmn0: successfully loaded firmware image 'amdgpu/raven_sdma.bin'
[drm] VCN decode is enabled in VM mode
[drm] VCN encode is enabled in VM mode
[drm] JPEG decode is enabled in VM mode
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
drmn0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
drmn0: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
drmn0: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[drm] Detected VRAM RAM=2048M, BAR=2048M
[drm] RAM width 128bits DDR4
[TTM] Zone kernel: Available graphics memory: 3100774 KiB
[TTM] Zone dma32: Available graphics memory: 2097152 KiB
[TTM] Initializing pool allocator
[drm] amdgpu: 2048M of VRAM memory ready
[drm] amdgpu: 3072M of GTT memory ready.
[drm] GART: num cpu pages 262144, num gpu pages 262144
[drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
drmn0: successfully loaded firmware image 'amdgpu/raven_asd.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_ta.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_pfp.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_me.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_ce.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_rlc.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_mec.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_mec2.bin'
amdgpu: hwmgr_sw_init smu backed is smu10_smu
drmn0: successfully loaded firmware image 'amdgpu/raven_vcn.bin'
[drm] Found VCN firmware Version ENC: 1.12 DEC: 2 VEP: 0 Revision: 1
drmn0: Will use PSP to load VCN firmware
[drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
drmn0: RAS: optional ras ta ucode is not available
drmn0: RAP: optional rap ta ucode is not available
[drm] kiq ring mec 2 pipe 1 q 0
[drm] DM_PPLIB: values for F clock
[drm] DM_PPLIB: 400000 in kHz, 3649 in mV
[drm] DM_PPLIB: 933000 in kHz, 4074 in mV
[drm] DM_PPLIB: 1200000 in kHz, 4399 in mV
[drm] DM_PPLIB: 1333000 in kHz, 4399 in mV
[drm] DM_PPLIB: values for DCF clock
[drm] DM_PPLIB: 300000 in kHz, 3649 in mV
[drm] DM_PPLIB: 600000 in kHz, 4074 in mV
[drm] DM_PPLIB: 626000 in kHz, 4250 in mV
[drm] DM_PPLIB: 654000 in kHz, 4399 in mV
[drm] Display Core initialized with v3.2.104!
[drm] VCN decode and encode initialized successfully(under SPG Mode).
drmn0: SE 1, SH per SE 1, CU per SH 11, active_cu_number 8
[drm] fb mappable at 0x60BCA000
[drm] vram apper at 0x60000000
[drm] size 8294400
[drm] fb depth is 24
[drm] pitch is 7680
VT: Replacing driver "vga" with new "fb".
start FB_INFO:
type=11 height=1080 width=1920 depth=32
pbase=0x60bca000 vbase=0xfffff80060bca000
name=drmn0 flags=0x0 stride=7680 bpp=32
end FB_INFO
drmn0: ring gfx uses VM inv eng 0 on hub 0
drmn0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
drmn0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
drmn0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
drmn0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
drmn0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
drmn0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
drmn0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
drmn0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
drmn0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
drmn0: ring sdma0 uses VM inv eng 0 on hub 1
drmn0: ring sdma0 uses VM inv eng 0 on hub 1
drmn0: ring vcn_dec uses VM inv eng 1 on hub 1
drmn0: ring vcn_enc0 uses VM inv eng 4 on hub 1
drmn0: ring vcn_enc1 uses VM inv eng 5 on hub 1
drmn0: ring jpeg_dec uses VM inv eng 6 on hub 1
vgapci0: child drmn0 requested pci_get_powerstate
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
[drm] Initialized amdgpu 3.40.0 20150101 for drmn0 on minor 0

Is the sysctl_warn_reuse message anything to worry about?

2. Adding amdgpu to the kld_list in rc.conf still crashes more often than not, as previously reported.

3. Attempting to load amdgpu via /boot/loader.conf appears to load the module into memory but not actually make it functional. (X uses VESA mode as if the module isn't there.)
Created attachment 238024 [details]
/var/crash/core.txt.3 crash description from today

Contrary to comment #13, today I got a crash despite booting to single user mode, typing "kldload amdgpu", and then control-D. But it looks indistinguishable from the earlier /var/crash/core.txt.1 description. Next I'll try booting to single user mode and kldloading zfs before kldloading amdgpu.
Created attachment 238075 [details]
Another crash

This time, I booted into single user mode and typed "kldload zfs amdgpu" with no problems. Then, when I typed ctrl-D, I got this crash (which looks pretty much the same as all the other ones, except that the places in the backtrace that used to refer to zfs now refer to vboxnetflt, which I load for VirtualBox). So it seems likely that the crash has nothing to do with whichever specific kernel loadable module happens to be cited in the backtrace.
The following comment is based on zero actual knowledge of how kernel loadable modules work. Still, based on what I'm seeing with this bug, I hypothesize that after one module is loaded, there is a mechanism by which the next module (and maybe other later ones) call back to modules already loaded in order to prevent incompatible modules (whatever that might mean) from trying to coexist. And somewhere in that path in the amdgpu module, it is detected that some lock that was taken while amdgpu was loading was erroneously not released. (Most of the time, the lock IS released, and I don't know exactly under what circumstances it isn't.) I hope this is helpful.
I've discovered how to avoid this crash (at least the last 20-30 times I have booted up): boot into single user mode, type <ENTER> to run /bin/sh, type "kldload amdgpu," and then (key step!) wait at least five seconds before typing ctrl-D to exit single user mode. Since I don't know why this helps, I guess it falls into the voodoo category, but maybe it's a clue.
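The workaround above, written out as a single-user console sketch (the five-second figure is purely the empirical value described in this comment, not anything derived from the code):

```shell
# At the single-user shell (after pressing ENTER for /bin/sh):
kldload amdgpu
sleep 5     # empirical: waiting here avoids the panic most of the time
exit        # same effect as typing control-D: continue to multi-user mode
```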
I hate to say again how little I know about kernel module loading, but by any chance is there multithreading in the code that gets called when amdgpu.ko is first loaded? I can't help thinking that perhaps that code is returning prematurely, before some initialization is completely finished and all locks released. If I knew where to put it, I would throw in a five-second delay at the end of whatever gets called to load amdgpu.ko.
Created attachment 238668 [details]
Another crash summary; looks like all the earlier ones

Contrary to my comment #17, I got this same crash this morning, even after waiting five seconds after loading amdgpu.ko before proceeding. So the delay doesn't prevent the crash.
I've figured out why this crash is timing related, and also why ZFS is involved. My system has a 1 TB USB disk, which contains a ZFS file system. When I power my system on, it takes a variable amount of time for that disk to become ready and for ZFS to take note of it. (I'm booting from a SATA disk with a traditional old UFS file system.) So if the USB disk becomes ready while amdgpu is still initializing, apparently this crash happens. I have no clue why that is, but I am pretty sure it explains why the crash happens only part of the time and is timing dependent. It remains true that the most reliable way to cause the crash is to include amdgpu in the kld_list in /etc/rc.conf and simply boot normally (with a *ZFS-formatted USB* disk attached to the system).
I've updated to version drm-510-kmod-5.10.113_8 and it hasn't crashed yet, but I've only had time for one test so far.
(In reply to George Mitchell from comment #21) If a crash _does_ occur/recur, then maybe test for reproducibility with this in your /boot/loader.conf kern.smp.disabled=1 <https://www.freebsd.org/cgi/man.cgi?query=smp&sektion=4&manpath=FreeBSD> (Be prepared for significantly reduced performance after restarting with SMP disabled.) This is a gut feeling, more than anything (apologies for the noise), partly based on experiences with virtual hardware …
Thanks! So far, I've booted four times with amdgpu in my kld_list, which previously would likely have yielded at least one crash, with no crash. So I have my fingers crossed, but I'll try your hack if it crashes again (and your theory certainly sounds plausible).
Created attachment 238802 [details]
New core.txt

The latest version definitely crashes less often, but I just now got a new crash that (to me) looks different from the earlier one. I was just about ready to mark this fixed!
After further consideration (and a partly sleepless night), I've decided that the latest crash is not an instance of this bug and possibly isn't related to amdgpu.ko at all. So I'm going to close this bug and maybe open a new one when I understand the new one better. Anyone looking at this bug in the future should pay no attention to "New core.txt" attachment, but should refer to the obsolete attachments.
Created attachment 238849 [details]
A new crash

I regret to say I'm going to have to reopen this bug. But I will try the proposed workaround and see if it helps (at least until the single-core performance drives me up the wall).
Created attachment 238850 [details]
A new instance of the same crash

I regret to say the crash has happened again. I'll try the proposed workaround and see if it helps (at least until the reduced performance drives me up the wall).
Reopening bug.
(In reply to George Mitchell from comment #26)

> FreeBSD court 13.1-RELEASE-p2 FreeBSD 13.1-RELEASE-p2 752f813d6 M5P amd64

Please update the OS.

----

Given comment #5 from avg@, and (for example) the different types of kernel panic in comment #24:

fs@ x11@, please: if panics recur with an updated OS, would you recommend continuing with this report (267028)? Or starting afresh, with a new report for the more recent type of panic?
This is not drm related; the drm messages are noise that we should fix one day when we switch ttys.
(In reply to Graham Perrin from comment #29) I'm on the release branch, not the stable branch. So you are suggesting I update from 13.1-RELEASE-p2 to 13.1-RELEASE-p5? And then recompile the kernel module as well, I assume?
For what it's worth, I'm doing this testing on a desktop machine, so setting kern.smp.disabled=1 actually doesn't impact operation too much -- except for Thunderbird. And so far I haven't seen the crash with that setting.
Does switching to graphics/drm-510-kmod and updating graphics/gpu-firmware-amd-kmod help?
In fact, switching to graphics/drm-510-kmod from the generic VESA driver is what originally triggered this bug. Without using amdgpu.ko there is no problem.
All your reports show that it's from zfs; again, the drm messages are noise.
Created attachment 238886 [details]
Crash after updating kernel/world to 13.1-RELEASE-p5

This is after updating my kernel and world to 13.1-RELEASE-p5. I grant you the backtrace here sure points to the openzfs code, but why does the crash happen only with graphics/drm-510-kmod installed and amdgpu.ko loaded, and not otherwise? For the time being, I will be running WITHOUT amdgpu.ko in my kld_list, and I am confident this crash will not occur.

I got maybe six boots with kern.smp.disabled=1 with no crashes on 13.1-RELEASE-p2, but based on an earlier comment I updated to 13.1-RELEASE-p5. Then, after going back to kern.smp.disabled=0, I got another instance of the crash. I did observe that something in sys/contrib/openzfs/module/zfs got updated between p2 and p5, but it doesn't seem to have fixed this crash. Compiling graphics/drm-510-kmod under p5 yielded an amdgpu.ko that was identical to the one compiled under p2.
I'm still having this problem, though I can reduce its frequency by booting in single-user mode, kldloading amdgpu, waiting five or ten seconds, and then going to multi-user mode with control-D. I've updated the title to emphasize that the bug happens only when amdgpu.ko (from graphics/drm-510-kmod version 5.10.113_8) and ZFS are both in use. Also, it happens during booting, or else never.
I don't want the title to become too wordy, but also I'll note again that my 1TB USB disk (GPT formatted with one ZFS partition only) that takes a measurable, variable amount of time to become ready may be the main reason this crash doesn't always happen.
(In reply to George Mitchell from comment #37) grep -e solaris -e zfs /boot/loader.conf grep zfs /etc/rc.conf What's reported?
(In reply to Graham Perrin from comment #39) > grep -e solaris -e zfs /boot/loader.conf > grep zfs /etc/rc.conf zfs_enable="YES" # Set to YES to automatically mount ZFS file systems
(In reply to George Mitchell from comment #40) (In reply to George Mitchell from comment #20) > … timing related, … Please add to /boot/loader.conf zfs_load="YES"
Created attachment 239336 [details]
Crash dump

Well, this helps a bit. By adding that line to /boot/loader.conf and restoring kld_list="amdgpu" to my /etc/rc.conf, I was able to reboot without the crash four times in a row, whereas before it would crash about every other time. But it crashed on the fifth time. (See attached core.txt.0.)
In the new core.txt.0, there are about 19 lines of text from the previous shutdown near the beginning of the file. But the substance of the backtrace looks identical to all the previous ones. So loading ZFS early mitigates the problem but does not fix it.
I think that in these frames we clearly see a bogus pointer / address:

#7  <signal handler called>
#8  vtozoneslab (va=18446735277616529408, zone=<optimized out>, slab=<optimized out>) at /usr/src/sys/vm/uma_int.h:635
#9  free (addr=0xfffff80000000007, mtp=0xffffffff824332b0 <M_SOLARIS>) at /usr/src/sys/kern/kern_malloc.c:911
#10 0xffffffff8214d251 in nv_mem_free (nvp=<optimized out>, buf=0xfffff80000000007, size=16688648) at /usr/src/sys/contrib/openzfs/module/nvpair/nvpair.c:216

I'd recommend poking around frames 11-13 to see where that address comes from. Also, I don't get the impression that the latest crash is similar to earlier ones: kern_reboot / zfs__fini vs dbuf_evict_thread.
It appears I could mitigate this problem if I could load amdgpu.ko from /boot/loader.conf, which currently doesn't work. See bug #268962. Alternatively, at present I can completely avoid this crash by:

1. having zfs_load="YES" in /boot/loader.conf;
2. booting into single user mode;
3. typing kldload amdgpu;
4. typing control-D.
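The avoidance recipe above, as a sketch (a loader.conf fragment plus the single-user steps as comments):

```shell
# /boot/loader.conf
zfs_load="YES"

# Then boot into single user mode ("boot -s" at the loader prompt),
# and at the single-user shell:
#   kldload amdgpu
#   exit            # control-D: continue to multi-user mode
```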
(In reply to George Mitchell from comment #45) Correction to comment #45: I can avoid the problem around 95% of the time with the specified steps, but not 100%.
Created attachment 239752 [details]
Latest crash dump

The last couple of crashes strongly resemble all the earlier ones, but they are now less frequent with zfs.ko loaded at /boot/loader.conf time and amdgpu.ko loaded while booted into single user mode. The difference (see core.txt.2 from today's date) is that the backtrace line where modlist_lookup2 is called is now looking up vboxnetflt instead of zfs. My rcorder list shows:

/etc/rc.d/dumpon
/etc/rc.d/sysctl
/etc/rc.d/natd
/etc/rc.d/dhclient
/etc/rc.d/hostid
/etc/rc.d/ddb
/etc/rc.d/ccd
/etc/rc.d/gbde
/etc/rc.d/geli
/etc/rc.d/zpool
/etc/rc.d/swap
/etc/rc.d/zfskeys
/etc/rc.d/fsck
/etc/rc.d/zvol
/etc/rc.d/growfs
/etc/rc.d/root
/etc/rc.d/sppp
/etc/rc.d/mdconfig
/etc/rc.d/hostid_save
/etc/rc.d/serial
/etc/rc.d/mountcritlocal
/etc/rc.d/zfsbe
/etc/rc.d/tmp
/etc/rc.d/zfs
/etc/rc.d/var
/etc/rc.d/cfumass
/etc/rc.d/cleanvar
/etc/rc.d/FILESYSTEMS
/etc/rc.d/geli2
/etc/rc.d/ldconfig
/etc/rc.d/kldxref
/etc/rc.d/adjkerntz
/etc/rc.d/hostname
/etc/rc.d/ip6addrctl
/etc/rc.d/ippool
/etc/rc.d/netoptions
/etc/rc.d/opensm
/etc/rc.d/random
/etc/rc.d/iovctl
/etc/rc.d/rctl
/usr/local/etc/rc.d/vboxnet
/etc/rc.d/ugidfw
/etc/rc.d/autounmountd
/etc/rc.d/mixer
/etc/rc.d/ipsec
/usr/local/etc/rc.d/uuidd
/etc/rc.d/kld
/etc/rc.d/ipfilter
/etc/rc.d/devmatch
/etc/rc.d/addswap
/etc/rc.d/ipnat
/etc/rc.d/ipmon
/etc/rc.d/ipfs
/etc/rc.d/netif
/etc/rc.d/ppp
/etc/rc.d/pfsync
/etc/rc.d/pflog
/etc/rc.d/rtsold
/etc/rc.d/static_ndp
/etc/rc.d/static_arp
/etc/rc.d/devd
/etc/rc.d/resolv
/etc/rc.d/stf
/etc/rc.d/ipfw
/etc/rc.d/routing
/etc/rc.d/bridge
/etc/rc.d/zfsd
/etc/rc.d/defaultroute
/etc/rc.d/routed
/etc/rc.d/pf
/etc/rc.d/route6d
/etc/rc.d/ipfw_netflow
/etc/rc.d/blacklistd
/etc/rc.d/netwait
/etc/rc.d/local_unbound
/etc/rc.d/NETWORKING
/etc/rc.d/kdc
/etc/rc.d/tlsservd
/etc/rc.d/iscsid
/etc/rc.d/pppoed
/etc/rc.d/ctld
/etc/rc.d/nfsuserd
/etc/rc.d/tlsclntd
/etc/rc.d/kfd
/usr/local/etc/rc.d/sndiod
/etc/rc.d/gssd
/etc/rc.d/nfscbd
/etc/rc.d/ipropd_master
/etc/rc.d/ipropd_slave
/etc/rc.d/kadmind
/etc/rc.d/kpasswdd
/etc/rc.d/iscsictl
/etc/rc.d/mountcritremote
/etc/rc.d/archdep
/etc/rc.d/dmesg
/etc/rc.d/wpa_supplicant
/etc/rc.d/hostapd
/etc/rc.d/accounting
/etc/rc.d/mdconfig2
/etc/rc.d/devfs
/etc/rc.d/gptboot
/etc/rc.d/virecover
/etc/rc.d/os-release
/etc/rc.d/motd
/etc/rc.d/cleartmp
/etc/rc.d/newsyslog
/etc/rc.d/syslogd
/etc/rc.d/linux
/etc/rc.d/sysvipc
/etc/rc.d/hastd
/etc/rc.d/localpkg
/etc/rc.d/auditd
/etc/rc.d/bsnmpd
/etc/rc.d/ntpdate
/etc/rc.d/watchdogd
/etc/rc.d/savecore
/etc/rc.d/pwcheck
/etc/rc.d/power_profile
/etc/rc.d/auditdistd
/etc/rc.d/SERVERS
/etc/rc.d/rpcbind
/etc/rc.d/nisdomain
/etc/rc.d/nfsclient
/etc/rc.d/ypserv
/etc/rc.d/ypupdated
/etc/rc.d/ypxfrd
/etc/rc.d/ypbind
/etc/rc.d/ypldap
/etc/rc.d/ypset
/etc/rc.d/keyserv
/etc/rc.d/automountd
/etc/rc.d/yppasswdd
/etc/rc.d/quota
/etc/rc.d/automount
/etc/rc.d/mountd
/etc/rc.d/nfsd
/etc/rc.d/statd
/etc/rc.d/lockd
/etc/rc.d/DAEMON
/etc/rc.d/rwho
/etc/rc.d/utx
/etc/rc.d/bootparams
/etc/rc.d/hcsecd
/etc/rc.d/ftp-proxy
/etc/rc.d/local
/usr/local/etc/rc.d/git_daemon
/etc/rc.d/lpd
/usr/local/etc/rc.d/dbus
/etc/rc.d/mountlate
/etc/rc.d/nscd
/etc/rc.d/ntpd
/etc/rc.d/powerd
/usr/local/etc/rc.d/slurmd
/usr/local/etc/rc.d/slurmctld
/etc/rc.d/ubthidhci
/etc/rc.d/rarpd
/etc/rc.d/sdpd
/etc/rc.d/apm
/etc/rc.d/rtadvd
/etc/rc.d/moused
/etc/rc.d/rfcomm_pppd_server
/usr/local/etc/rc.d/avahi-daemon
/etc/rc.d/swaplate
/etc/rc.d/bthidd
/etc/rc.d/bluetooth
/usr/local/etc/rc.d/avahi-dnsconfd
/etc/rc.d/LOGIN
/etc/rc.d/sshd
/usr/local/etc/rc.d/vboxheadless
/etc/rc.d/syscons
/etc/rc.d/sysctl_lastload
/usr/local/etc/rc.d/xdm
/usr/local/etc/rc.d/vboxwatchdog
/etc/rc.d/inetd
/usr/local/etc/rc.d/dnetc
/usr/local/etc/rc.d/munged
/etc/rc.d/sendmail
/etc/rc.d/ftpd
/usr/local/etc/rc.d/rsyncd
/usr/local/etc/rc.d/saned
/etc/rc.d/cron
/etc/rc.d/msgs
/etc/rc.d/othermta
/etc/rc.d/jail
/etc/rc.d/bgfsck
/usr/local/etc/rc.d/smartd
/etc/rc.d/securelevel

The vboxnetflt.ko module is loaded by /usr/local/etc/rc.d/vboxnet.
And the list of kernel modules loaded by a non-crashing boot is:

kernel
sem.ko
zfs.ko
if_re.ko
vboxdrv.ko
amdgpu.ko
drm.ko
linuxkpi_gplv2.ko
dmabuf.ko
ttm.ko
amdgpu_raven_sdma_bin.ko
amdgpu_raven_asd_bin.ko
amdgpu_raven_ta_bin.ko
amdgpu_raven_pfp_bin.ko
amdgpu_raven_me_bin.ko
amdgpu_raven_ce_bin.ko
amdgpu_raven_rlc_bin.ko
amdgpu_raven_mec_bin.ko
amdgpu_raven_mec2_bin.ko
amdgpu_raven_vcn_bin.ko
vboxnetflt.ko
(and a whole bunch more)

In other words, when the crash happens, it always involves a call to modlist_lookup2 from whatever kernel module gets loaded following amdgpu.
*** Bug 268416 has been marked as a duplicate of this bug. ***
Created attachment 239967 [details]
Crash after loading vboxnetflt early by hand

Since the previous crash included a reference to vboxnetflt.ko, I experimented a few times with amdgpu.ko added to my kld_list in /etc/rc.conf, and loading vboxnetflt by hand after booting to single user mode. I think it's pretty clear at this point that there is no problem in the ZFS code. It's a lock mismanagement problem of some sort in amdgpu.ko (from graphics/drm-510-kmod). If I have permission to change the assignee of this bug, I will.
I think this needs to be assigned to x11@freebsd.org, but I don't seem to have the permission to do it.
(In reply to Graham Perrin from comment #10)

It does appear that amdgpu.ko always loads successfully. But then the loading of some other module (which might be zfs.ko or vboxnetflt.ko or maybe something else) somehow causes an unexpected call back into the amdgpu code. I have no idea how. The current situation:

1. zfs.ko is loaded from /boot/loader.conf.
2. I always boot into single user mode.
3. The last few times, I had kld_list="amdgpu.ko" in my /etc/rc.conf, but for now I'm taking it back out.
4. So I'm loading amdgpu.ko manually in single user mode and then waiting ten seconds or so before going multi-user. It's voodoo, but it usually avoids the crash.
(In reply to George Mitchell from comment #52)
No it's not; I've told you already that what's printed by drm is not the panic, it's noise from when we switch ttys during a panic. All your crash logs talk about zfs dbufs; this isn't amdgpu.
(In reply to Emmanuel Vadot from comment #54) If I boot up without loading amdgpu.ko at all, then I NEVER get the crash. Confirmed many many times.
(In reply to Emmanuel Vadot from comment #54) I think that George's point was not about anything that gets printed, but what happens depending on whether amdgpu gets loaded (and when) or not. It's not unimaginable that an exotic bug in one module (or in the module loading code or the code for resolving symbols) results in a memory corruption and a crash elsewhere. A very wild guess, but I'd check if there are any duplicate symbols between amdgpu and zfs.ko... and even kernel itself.
(In reply to Andriy Gapon from comment #56)
But then anyone else using zfs+amdgpu would have the same problem, and that's not the case (I use both on multiple machines running either 13.1, stable/13 or CURRENT).
If it is ZFS, then the only exotic factor on my system is an external USB one-terabyte drive (WDC WD10EZEX-08WN4A0), formatted with GPT and one ZFS partition, that seems to take a variable amount of time to come on line at power up. I theorized at one point that tasting that drive at an unpredictable time was a factor in the crash. Your mileage may vary.
(In reply to Emmanuel Vadot from comment #54)

> All your crash logs talk about zfs dbufs

Not true: "Crash dump" and "Latest crash dump" have no examples of "dbuf" in the submitted text. Also: the backtrace in "Latest crash dump" makes no mention of "zfs" at all. (It does occur in other text.)
(In reply to George Mitchell from comment #58)
Could a test be formed on your hardware that loads ZFS but imports no pool, possibly with no pool to find at all (an empty "zpool import")? As it stands, your context is hard for anyone else to reproduce for testing. Finding a failure in a simpler, easier-to-replicate context could help avoid your having the only known failure context, so any variations that are simpler for others to replicate and test would be a good thing. But if such an effort ends up unable to replicate the problem in your environment, that might be useful information as well.
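One way to sketch the test Mark describes, assuming the external ZFS disk is detached so there is no pool to find (illustrative only; not verified on the failing machine):

```shell
kldload zfs         # load the module, but import nothing
zpool import        # with no importable pools attached, this lists nothing
zpool list          # confirm no pool is actually imported
kldload amdgpu      # now load amdgpu
kldload vboxnetflt  # and a follow-on module, the usual trigger point
```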
(In reply to Mark Millard from comment #60) In addition to my external USB ZFS drive, also my /usr file system is a ZFS slice. My main hard drive has a very small UFS root (and /var and /tmp) slice, because I have a superstitious fear of ZFS on root. The /usr slice (the rest of the drive) is big enough to take an annoying amount of time to fsck, so when I first added this drive to my system (which was also when I updated from 12 to 13), I chose ZFS for /usr to minimize that time. For a while, I suppose I could copy my /usr slice onto the /usr slice from my old internal drive and mount that in place of the current /usr slice for some tests, and I could do without the external drive. I'll have to think about this.
(In reply to George Mitchell from comment #61) If you can boot an external USB3 drive or some such, maybe a minimal separate drive: UFS 13.1-RELEASE with enough added to also have amdgpu.ko. With such a context, do you still manage to see boot failures? Progressing from the simplest independent context toward an independent one more like your normal context might be easier, and might avoid needing to change your normal context as much. Just a test context, not a normal-use one; fewer constraints on the configuration that way. Food for thought.
I had the same problem on 13.1-STABLE: loading the vbox module caused an immediate kernel panic, and I had rolled back to 13.1-STABLE because of this. Loading the vbox drivers by hand at the loader prompt, once the kernel was loaded, was okay. When the vbox drivers were loaded via /boot/loader.conf or /etc/rc.conf, it caused an immediate kernel panic (no dump). virtualbox-ose-kmod was recompiled from ports on a freshly installed kernel and system. Not sure if this is amdgpu or zfs related, though.
*rolled back to 13.1-RELEASE sorry :-) All works fine here. Might be vbox + amdgpu api desync?
(In reply to George Mitchell from comment #36) Have you ever gotten a crash with kern.smp.disabled=1 ? If not, how many tests did you try?
(In reply to George Mitchell from comment #53) A test might be to load something simple or unusual for your context after amdgpu.ko and see if it still crashes. I'm not sure it is a good example, but does, say, loading amdgpu.ko and then filemon.ko also lead to a crash (not loading more after that)?
(In reply to Emmanuel Vadot from comment #57) Only "good", easy bugs are like that. That's why I said that this one must be exotic. But there must be something specific about George's environment too. Maybe configuration, maybe build, maybe specific hardware, maybe even a hardware glitch. E.g., maybe if the graphics hardware is active, the RAM is more likely to randomly flip a bit.
(In reply to Mark Millard from comment #65) Yes, I got the crash. See comment #26.
I have a spare disk I can use for a test without ZFS. It's currently at 12.0-RELEASE so it will take me a while to update it to 13. Possibly I won't have a chance today, but I will try it.
(In reply to George Mitchell from comment #68) #26 and #27 indicate that you would try the workaround kern.smp.disabled=1, not the result of trying:

#26: I will try the proposed workaround and see if it helps (at least until the single-core performance drives me up the wall)
#27: I'll try the proposed workaround and see if it helps (at least until the reduced performance drives me up the wall).

That is part of why I asked. Did the failure result with kern.smp.disabled=1 seem the same/similar to the other failures, or was it distinct in some way?
(In reply to George Mitchell from comment #24) It looks to me like the backtrace in "Latest crash dump":

KDB: stack backtrace:
#0 0xffffffff80c66ec5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbcf at vpanic+0x17f
#2 0xffffffff80c1ba43 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff81084fb8 at calltrap+0x8
#5 0xffffffff80be8c3d at linker_load_module+0x17d
#6 0xffffffff80beb17a at kern_kldload+0x16a
#7 0xffffffff80beb29b at sys_kldload+0x5b
#8 0xffffffff810ae6ec at amd64_syscall+0x10c
#9 0xffffffff810858cb at fast_syscall_common+0xf8

basically matches the 4 attachments that have been set to be Obsolete. Should the Obsolete status be undone on the 4? Vs.: should "Latest crash dump" be made Obsolete as well? I'm guessing that none of the attachments should be obsolete at this point.
(In reply to Mark Millard from comment #70) There was also #36 with:

QUOTE
I got maybe six boots with kern.smp.disabled=1 with no crashes on 13.1-RELEASE-p2. But based on an earlier comment I updated to 13.1-RELEASE-p5. Then after going back to kern.smp.disabled=0 I got another of the crash.
END QUOTE

It only reported not getting a crash for kern.smp.disabled=1.
(In reply to Mark Millard from comment #70) I should have referred you to comment #27, not #26. But I definitely got the crash with smp.disabled=1. (In reply to Mark Millard from comment #71) I could make a case for obsoleting all but two of them, but possibly I would be throwing away useful information. To my unpracticed eye, though, the ones I DID obsolete were pretty redundant with the ones I kept. They all look pretty similar to me.
(In reply to George Mitchell from comment #73) But those are all the examples where the backtraces have nothing from zfs or dbuf. Having 5 of 11 reports that way looks rather different from 1 out of 7. I'd say that the frequency is notable.
(In reply to George Mitchell from comment #73) #27:

QUOTE
I'll try the proposed workaround and see if it helps (at least until the reduced performance drives me up the wall).
END QUOTE

It still says that you will try in the future, not explicitly that you had a failure with kern.smp.disabled=1. #36 reports not having failures with kern.smp.disabled=1. I did not find any wording I could interpret as reporting a failure with kern.smp.disabled=1 (prior to #73). Do you remember noticing anything distinct? (Probably not, or you would have commented in #73. But just to be sure . . .)
(In reply to Mark Millard from comment #75) It's close to two months ago, so my memory may be misleading me, since my age is beginning to resemble the number of this comment. But I'm pretty sure smp.disabled=1 did not prevent the bug. I could be wrong.
I have been remiss in testing this without ZFS, because I will have to shuffle a couple of disks around. I apologize for the delay. I hope to be able to try this test later this week.
Although I have not yet managed to test this without ZFS, I have established that with zfs_load="YES" but without "vboxnet_enable="YES"" in /etc/rc.conf (zfs.ko and vboxnetflt.ko seeming to be the two modules with which amdgpu.ko has, um, personality conflicts), I can now boot up without crashing (so far). Does anyone have any idea what zfs.ko and vboxnetflt.ko do that other modules don't do?
I omitted an important phrase. It should have said, "with zfs_load="YES" in /boot/loader.conf ..."
Created attachment 240427 [details] New version of the crash, from acpi_wmi Here's another module that doesn't get along well with amdgpu.ko on my system: acpi_wmi.ko. Other than that this crash looks identical to all the earlier ones, as far as I can tell. It's been about a dozen boot-up tries since I put zfs_load="YES" into /boot/loader.conf (so that ZFS gets loaded early to minimize its interaction with amdgpu.ko) and vboxnet_enable="NO" in /etc/rc.conf (so that vboxnetflt.ko doesn't get its chance to cause trouble either) until I got this new crash. I'll mention again that this crash always happens within a minute of booting up, or else never. Anyone have any ideas about what acpi_wmi.ko has in common with zfs.ko and vboxnetflt.ko?
(In reply to George Mitchell from comment #80) There are multiple, distinct backtraces in your various examples. This one matches the 4 still-listed-as Obsolete ones and the "Latest crash dump" one, but not the others (if I remember right). So it is another example where there is no mention of dbuf or of zfs in the backtrace's text, unlike some other backtraces. So far as I can tell, there still has been no evidence gathering seeing if the problem can happen absent zfs being loaded or zfs loaded but no pools ever imported. If I gather correctly, we now do have evidence that the specific type of backtrace can happen without vboxnetflt.ko ever having been loaded, proving it is not necessary for that kind of failure. That is a form of progress as far as evidence goes. It also suggests that merely being listed in a backtrace does not mean that fact necessarily tells one much about the basic problem. There is some possibility here that there is more than one basic problem and some of the backtrace variability is associated with that.
Using the gdb-based backtrace information:

#8 0xffffffff80be8c5d in modlist_lookup (name=0xfffff80006217400 "acpi_wmi", ver=0) at /usr/src/sys/kern/kern_linker.c:1487

is for the strcmp code line in:

static modlist_t
modlist_lookup(const char *name, int ver)
{
	modlist_t mod;

	TAILQ_FOREACH(mod, &found_modules, link) {
		if (strcmp(mod->name, name) == 0 &&
		    (ver == 0 || mod->version == ver))
			return (mod);
	}
	return (NULL);
}

We also see that strcmp was called via:

#6 <signal handler called>
#7 strcmp (s1=<optimized out>, s2=<optimized out>) at /usr/src/sys/libkern/strcmp.c:46

We also see that name was accessible, as shown in the "#8" line above. Frame #7 shows that strcmp was actually entered, suggesting that fetching mod->name itself did not fault. The implication is that the value stored in mod->name was a bad pointer by the time strcmp tried to use it. Nothing says that mod->name was, or should have been, for acpi_wmi at all: the "acpi_wmi" side of the comparison need not be relevant information. Other backtraces that look similar may well have a similar status for the mod->name argument to the strcmp. This might be a useful hint to someone with the appropriate background, or it might suggest some way of detecting the bad value in mod->name earlier, when that earlier context might be of more use for investigation.
I have set up a disk with FreeBSD 13.1-RELEASE-p7 and drm-510-kmod 5.10.113_8, WITHOUT ZFS or anything vbox-related. I don't know how to avoid loading acpi_wmi.ko. So far it hasn't crashed, but I will try a whole bunch of reboots tomorrow with that disk.
(In reply to George Mitchell from comment #83) I found the following text on https://cateee.net/lkddb/web-lkddb/ACPI_WMI.html :

QUOTE
ACPI-WMI is a proprietary extension to ACPI to expose parts of the ACPI firmware to userspace - this is done through various vendor defined methods and data blocks in a PNP0C14 device, which are then made available for userspace to call. The implementation of this in Linux currently only exposes this to other kernel space drivers. This driver is a required dependency to build the firmware specific drivers needed on many machines, including Acer and HP laptops.
END QUOTE

So, I expect that if acpi_wmi.ko is being loaded by FreeBSD, it may well be a requirement for that machine to boot and/or operate via ACPI. But I'm not familiar with the details.
I have a new crash, but I did not get a dump because of an issue I will explain below. For those who came in late, here's a summary of my system. dmesg says I have:

CPU: AMD Ryzen 3 2200G with Radeon Vega Graphics (3493.71-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x810f10  Family=0x17  Model=0x11  Stepping=0
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0x35c233ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX>
  Structured Extended Features=0x209c01a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  AMD Extended Feature Extensions ID EBX=0x1007<CLZERO,IRPerf,XSaveErPtr,IBPB>
  SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
  TSC: P-state invariant, performance statistics

My motherboard is a Gigabyte B450M D53H. BIOS is American Megatrends version F4, dated 1/25/2019. pciconf -lv says:

vgapci0@pci0:6:0:0: class=0x030000 rev=0xc8 hdr=0x00 vendor=0x1002 device=0x15dd subvendor=0x1458 subdevice=0xd000
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]'
    class      = display
    subclass   = VGA

Until recently, when I was running FBSD 12-RELEASE, my box had one hard drive. I added a new drive when I upgraded to FBSD 13-RELEASE so I would still have FBSD 12 as an emergency backup. Part of the upgrade is that on the new disk I created a small UFS slice for /, /var, and /tmp, and most of the rest of the disk is a ZFS slice for /usr (so I wouldn't have to wait for fsck on reboot after crashes). That means that it isn't practical to do a test without ZFS on that new disk (I'll call it my regular disk now).
So I installed FBSD 13 (same version as my regular disk) on the old disk (I'll call it the test disk now), which had (and still has) a small UFS slice for /, /var, and /tmp and a big UFS slice for /usr. To boot from the test disk, I use the BIOS boot menu, since (unsurprisingly) I have set the default boot disk to my regular disk. I removed all mentions of ZFS and VBOX from /boot/loader.conf and /etc/rc.conf on the test disk. Then I booted up a whole bunch of times. On the thirteenth try, I got the crash. Unfortunately, I don't have a crash summary from it because the system rebooted from my regular disk instead of the test disk while I was still staring at the crash message on the screen. Subsequently, I booted 20 more times from the test disk without getting the crash again. What I saw (for a few seconds) on the screen from the one crash sure looked like the same old backtrace, and I have to say, to an ignorant yokel like myself, it seemed to be saying that there's a locking problem in amdgpu. There was absolutely no virtual terminal switching, because I had not started an X server and I did not type ALT+Fn. I'll try getting a proper crash dump later (possibly tomorrow). My thanks to all of you for your patience.
(In reply to George Mitchell from comment #85) Where does dumpdev point for the "test disk"? Someplace also on the "test disk" that a "regular disk" boot would not change? If yes, the first boot of the "test disk" after the crash should have picked up the dump information, even if the "regular disk" was booted between times. But if the dumpdev place is common to both types of boot, then the regular disk boot would have processed the dump, likely using a different /var/crash/ place to store things.

Another question would be if there is sufficient room for /var/crash/ to contain the saved vmcore.* and related files. Yet another question is whether the test disk has /usr/local/bin/gdb installed or not. (When present, /usr/local/bin/gdb is used to provide one of the forms of backtrace, the one with source file references and line numbers and such. Much nicer to deal with.) If a vmcore.* was saved but some related information was not for some reason, it should be possible to have the related information produced based on the vmcore.* file.

Side note: in case it is relevant, I'll note that defining dumpdev in /boot/loader.conf, in a form the kernel can handle, instead of in /etc/rc.conf, can be used to allow the system to produce dumps for earlier crashes. (But I'm guessing the crash was not that early, to need such.)
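As a sketch of the two placements being discussed (the device name ada1p2 is hypothetical; substitute the test disk's own swap/dump partition):

```
# /etc/rc.conf -- dumpon runs from rc(8), after local disks are mounted
dumpdev="/dev/ada1p2"   # or "AUTO" to use the configured swap device
dumpdir="/var/crash"

# /boot/loader.conf -- setting dumpdev here instead lets the kernel
# take dumps for panics that happen earlier in boot
dumpdev="ada1p2"
```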
(In reply to George Mitchell from comment #85) For booting the test disk, getting the kldstat output from a successful boot might prove useful reference material at some point: it should show what to expect to be loaded by the kernel and in what order. Since you got a crash before starting the X server and had not used ALT+Fn, that would be appropriate context for the kldstat relative to the known UFS-only crash. Other time frames for kldstat may be relevant at some point.
I booted a ThreadRipper 1950X system via its UFS-only boot media alternative. The system is not set up for X; for example, no use/installation of amdgpu.ko for use with its video card. For reference:

# kldstat
Id Refs Address                Size Name
 1   58 0xffffffff80200000  295a5a0 kernel
 2    1 0xffffffff83210000     3370 acpi_wmi.ko
 3    1 0xffffffff83214000     3210 intpm.ko
 4    1 0xffffffff83218000     2178 smbus.ko
 5    1 0xffffffff8321b000     2220 cpuctl.ko
 6    1 0xffffffff8321e000     3360 uhid.ko
 7    1 0xffffffff83222000     4364 ums.ko
 8    1 0xffffffff83227000     33a0 usbhid.ko
 9    1 0xffffffff8322b000     32a8 hidbus.ko
10    1 0xffffffff8322f000     4d00 ng_ubt.ko
11    6 0xffffffff83234000     ab28 netgraph.ko
12    2 0xffffffff8323f000     a238 ng_hci.ko
13    4 0xffffffff8324a000     2668 ng_bluetooth.ko
14    1 0xffffffff8324d000     8380 uftdi.ko
15    1 0xffffffff83256000     4e48 ucom.ko
16    1 0xffffffff8325b000     3340 wmt.ko
17    1 0xffffffff8325f000     e250 ng_l2cap.ko
18    1 0xffffffff8326e000    1bf08 ng_btsocket.ko
19    1 0xffffffff8328a000     38b8 ng_socket.ko
20    1 0xffffffff8328e000     2a50 mac_ntpd.ko

# uname -apKU
FreeBSD amd64_UFS 14.0-CURRENT FreeBSD 14.0-CURRENT #61 main-n261026-d04c86717c8c-dirty: Sun Feb 19 15:03:52 PST 2023 root@amd64_ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 1400081 1400081
After getting another instance of my crash on my test disk and then booting from the correct disk, I got a crash summary that said:

Dwarf Error: wrong version in compilation unit header (is 4, should be 2) [in module /usr/lib/debug/boot/kernel/kernel.debug]

It occurred to me that when I updated my test disk from FBSD 12 to 13 I had forgotten to run mergemaster. So I did so today. But I haven't been able to reproduce the crash in 25 tries since then. I'm convinced that running mergemaster did not fix the crash, which is after all highly random. So I will try some more tomorrow. I appreciate everybody's patience.
(In reply to George Mitchell from comment #89) What vintage/version of *gdb was in use? (If it was gdb that complained.) Was it /usr/local/bin/*gdb ? /usr/libexec/*gdb ?

Actually, for the backtrace activity, it is kgdb that is used, not gdb. Thus my use of the "*gdb" notation. But a core.txt.* file in my context shows:

GNU gdb (GDB) 12.1 [GDB v12.1 for FreeBSD]

which would be for /usr/local/bin/*gdb (not /usr/libexec/*gdb). This is because I have:

# pkg info gdb
gdb-12.1_3
Name    : gdb
Version : 12.1_3
. . .

installed. (I had to use a livecore.* to have something to reference/illustrate with, having had no example vmcore.* files around for a long time.)

A significantly older gdb might indicate use of an old /usr/libexec/*gdb that had not been cleaned out. I'll note that I got no DWARF complaints from kgdb, and:

# llvm-dwarfdump -r 1 /usr/lib/debug/boot/kernel/kernel.debug | grep DWARF | head -1
0x00000000: Compile Unit: length = 0x000001d3, format = DWARF32, version = 0x0004, abbr_offset = 0x0000, addr_size = 0x08 (next unit at 0x000001d7)

indicates version = 0x0004. This leads me to expect that you have an old gdb (kgdb) around that is in use.

It sounds like you got a savecore into /var/crash/. It should be possible to investigate that without having to cause another crash, presuming the system is not updated (so that it matches the crash contents). For example, the same sort of command that crashinfo uses on the saved system-core file could be tried manually, possibly with a more modern kgdb vintage that handles the more recent DWARF version. Attaching your core.txt.* file content might prove useful.
Created attachment 240591 [details] A new but related crash (I think) This one was at shutdown time rather than boot-up time, so potentially virtual terminal switching was involved. But once again there are references to "WARNING !drm_modeset_is_locked(&plane->mutex) failed" along with a mention of ZFS. I don't know what it means.
(In reply to George Mitchell from comment #91) So, apparently, this was not one of the UFS-only experiments, as I understand it. The gdb backtrace is messy:

. . .
#7 <signal handler called>
. . .
#27 <signal handler called>
#28 0x00000000002881da in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffd688

This indicates that we are not seeing evidence of the earlier problem that got to #27. That, in turn, may or may not have been the original problem. The context looks to be a very different one than the prior reports. But not seeing what led to #27 makes forming solid judgments problematical.

I see from this that a modern gdb (kgdb) was in use for this failure's crashinfo generation after the savecore operation, having no problems with DWARF 4 vs. 2. But it would seem to be the boot media normally used with ZFS instead of the boot media intended for UFS-only testing. The two might differ in what is around for gdb (kgdb) for crashinfo to use.
(In reply to Mark Millard from comment #92) Looking at it some more and comparing to:

#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff810ade4f at trap_pfault+0x4f
#5 0xffffffff81084fd8 at calltrap+0x8
#6 0xffffffff8214d251 at spl_nvlist_free+0x61
#7 0xffffffff8220d740 at fm_nvlist_destroy+0x20
#8 0xffffffff822e6e95 at zfs_zevent_post_cb+0x15
#9 0xffffffff8220cd02 at zfs_zevent_drain+0x62
#10 0xffffffff8220cbf8 at zfs_zevent_drain_all+0x58
#11 0xffffffff8220ede9 at fm_fini+0x19
#12 0xffffffff82243b94 at spa_fini+0x54
#13 0xffffffff822ee303 at zfs_kmod_fini+0x33
#14 0xffffffff8215fb3b at zfs_shutdown+0x2b
#15 0xffffffff80c1b76c at kern_reboot+0x3dc
#16 0xffffffff80c1b381 at sys_reboot+0x411
#17 0xffffffff810ae6ec at amd64_syscall+0x10c

both #27 and #28 in:

#26 amd64_syscall (td=0xfffffe000f43ca00, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1185
#27 <signal handler called>
#28 0x00000000002881da in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffd688

are possibly just the normal difficulty with finding where to stop listing.
(In reply to Mark Millard from comment #93)

#7 <signal handler called>
#8 vtozoneslab (va=18446735277616529408, zone=<optimized out>, slab=<optimized out>) at /usr/src/sys/vm/uma_int.h:635

looks to be the "*slab" line in:

static __inline void
vtozoneslab(vm_offset_t va, uma_zone_t *zone, uma_slab_t *slab)
{
	vm_page_t p;

	p = PHYS_TO_VM_PAGE(pmap_kextract(va));
	*slab = p->plinks.uma.slab;
	*zone = p->plinks.uma.zone;
}

For reference: 18446735277616529408 == 0xFFFFF80000000000
Created attachment 240622 [details] Another crash summary; looks like all the earlier ones Quick summary: I can't cause this crash on my test setup (amdgpu but no ZFS) over close to 50 tries. In more detail: I deleted all ports from my test setup and then added drm-510-kmod and gpu-firmware-amd-kmod, and (most importantly) gdb. I then made many fruitless attempts to reproduce the crash. Experimentally, I added "zfs" to my mod_list in /etc/rc.conf and got another instance of the crash after 11 attempts (see attachment). This crash looks like all the ones from my regular setup, but at least it appears to be in the right format to get a backtrace, etc. I then took "zfs" out of my mod_list and tried another 20 times to get the crash to recur. It did not recur.
(In reply to Mark Millard from comment #94) The "signal handler called" line hides a function call. I think the crash is due to a null pointer dereference ("fault virtual address = 0x0") in pmap_kextract called from the line above. Tracking down the PC address 0xffffffff80bf3727 in the kernel image should clarify.
(In reply to George Mitchell from comment #95) But, as I understand it, comments #85 and #89 reported crashes of the test setup (no ZFS). (I ignore #91, which was at shutdown and looks different.) If true, we do have some existence-proof type evidence for the problem occurring without ZFS involved. It just may be less common. (Unfortunately some detail was not available for validating a context match.) You may not want to spend all your time on the no-ZFS style tests, but spending some time on occasion could eventually prove useful. Any big, complicated thing (like ZFS) that can be eliminated may help isolate the problem.
(In reply to John F. Carr from comment #96) As I understand it, "fault virtual address = 0x0" is for #7 and not for #27. As far as I can tell, what led to #27 and its specific type is not available to us.
(In reply to George Mitchell from comment #95) FYI: "Another crash summary; looks like all the earlier ones" is a crash when it is getting ready to load ZFS, not after ZFS has been loaded. So ZFS had not been started yet. So it is evidence for a problem without having ZFS in operation at all.
(In reply to Mark Millard from comment #98) Frame 27 is the entry into the kernel via the system call trap. We know this because it calls amd64_syscall. Frame 28 is a user program. We know this because the addresses are at the user address space and not the kernel address space (program counter at 0x2881da, stack frame at 0x7fffffffd688).
(In reply to Mark Millard from comment #97) You are correct that I did get two dumps without ZFS, but they did not appear to be decipherable. I'll keep trying for another dump without ZFS now that I know we can obtain a usable dump on the test setup. (In reply to Mark Millard from comment #101) That's why we stopped seeing the reference to ZFS when I took "zfs" out of mod_list and put zfs_load="YES" in /boot/loader.conf in response to comment #41.
(In reply to John F. Carr from comment #100) Ahh, so kgdb ends up with fast_syscall_common+0xf8 or the like translated to a <signal handler called>. For this part, believe and look at the kernel's own backtrace for the area that says fast_syscall_common+0xf8 (or whatever). Good to know. Thanks.
If the problem is memory corruption, running a debug kernel might find the corruption closer to when it happens. Are you able to build and run your own kernel with a configuration file like

include GENERIC
ident   DEBUG
options INVARIANTS
options INVARIANT_SUPPORT

?
(In reply to George Mitchell from comment #102) So are all the load-time crashes with things loaded via use of:

     kld_list    (str) A whitespace-separated list of kernel modules to
                 load right after the local disks are mounted, without
                 any .ko extension or path.  Loading modules at this
                 point in the boot process is much faster than doing it
                 via /boot/loader.conf for those modules not necessary
                 for mounting local disks.

and never with things that are loaded via /boot/loader.conf activity? It is a possible distinction in the test results that I'd managed to miss. (I'll note that the "for those modules not necessary for mounting local disks" may make zfs being listed in kld_list unusual. That, in turn, might help explain why, so far, you are the only one known to be having the load-time crash problem examples.)
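For reference, the two load paths being contrasted look like this in configuration (the module names echo the reporter's setup and are illustrative):

```
# /boot/loader.conf -- modules loaded by loader(8) before the kernel starts
zfs_load="YES"

# /etc/rc.conf -- modules loaded by rc(8) right after local disks are mounted
kld_list="amdgpu"
```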
(In reply to John F. Carr from comment #104) I will try this today. By the way, perhaps I should have mentioned already that I use SCHED_4BSD (I'm the guy who periodically rants that it should be the default, or at least that the scheduler should be a kernel loadable module), though it's hard to see how that could be a factor. (In reply to Mark Millard from comment #105) Yes, I had an occurrence of brain fade when I put zfs into mod_list. I promise never to have brain fade ever again.
Created attachment 240642 [details] Crash without any use of ZFS, with acpi_wmi Here's a crash from my test setup with no use of ZFS at all. It looks like the earlier crash with acpi_wmi, without which I suspect this hardware won't run. Also, this kernel had INVARIANTS and INVARIANT_SUPPORT compiled in (confirmed by the config shown in the summary), though I couldn't tell from anything I saw on the screen. Next I'll attach the relevant part of /var/log/messages, though I didn't see anything there either.
Created attachment 240643 [details] Relevant part of /var/log/messages Here's the log from the time of the crash, up to now.
(In reply to George Mitchell from comment #107) I'll note that in the example kldstat that I reported earlier, the order started with:

# kldstat
Id Refs Address                Size Name
 1   58 0xffffffff80200000  295a5a0 kernel
 2    1 0xffffffff83210000     3370 acpi_wmi.ko
. . .

So acpi_wmi.ko appears to be the first module loaded in my context. I'd guess that is true for your context as well. This would mean that prior module loads are not required for the problem to happen (it occurs while loading the first of the modules). That should narrow the range of possibilities (for someone sufficiently knowledgeable in the subject area).
Created attachment 240683 [details] New instance This is from running my regular setup, not the debug setup. Almost immediately after I got this dump, my system crashed two more times in a row; see next attachment, which appears to contain a summary of both crashes (the 2nd and the 3rd). None of the stack dumps seem to have a call to modlist_lookup2, so possibly all three of these are some new amdgpu crash.
Created attachment 240684 [details] Crashes 2 and 3 The second crash was very late in the boot process, unlike most of the others. Running meld on these files might prove enlightening.
(In reply to George Mitchell from comment #110) The backtraces mentioning "zap_evict_sync" are not new. You submitted prior examples as attachments, such as "New core.txt". The backtrace(s) with "spa_all_configs" may well be new. I do not remember such.
Would it help if I attached my system log from the period of time yesterday when I got three crashes in a row?
Created attachment 240729 [details] Another instance of attachment #240591 [details] crash at shutdown time For the sake of completeness I'm attaching one more instance of the crash I see every few days at shutdown time instead of boot-up time. My plan for now is to restore my configuration to the one that most frequently provokes the crash: namely, I load ZFS with zfs_enable in /etc/rc.conf instead of zfs_load in /boot/loader.conf, and I'm adding vbox_enable="YES" back into /etc/rc.conf. Also, I'm updating from drm-510-kmod-5.10.113_8 to drm-510-kmod-5.10.163_2 since it's available, and I'll see if that still crashes. If so, then I will stop using amdgpu for a week and verify, for the purpose of maintaining my own sanity, that the crashes stop. And I'll report back here.
(In reply to George Mitchell from comment #114) All of the crashes that listed "acpi_wmi" were before amdgpu could have been involved: acpi_wmi loads first; amdgpu would come later.
Created attachment 240731 [details] After upgrading to v5.10.163_2 I re-enabled the crashes (i.e. stopped loading ZFS early and turned vbox_enable back on) and got a crash on my very first reboot. Now I have disabled amdgpu and I'll be astonished if I get a crash before the twelfth of never. This crash does look slightly different, though, and seems to have had a trap 22 in ZFS code.
(In reply to Mark Millard from comment #115) Be that as it may, over the period of time from when I first upgraded to FBSD 13.1 until I started seriously trying to use drm-510-kmod, I never saw any occurrences at all of the ZFS crash, the vboxnetflt crash, or the acpi_wmi crash. And I don't expect to see any of them as long as I don't load amdgpu.ko.
(In reply to George Mitchell from comment #117) Yea, my expectation that acpi_wmi would always be loaded first was just wrong. Sorry. With the ZFS boot media, I see:

Id Refs Address                Size Name
 1   94 0xffffffff80200000  295a9b0 kernel
 2    1 0xffffffff82b5b000   5b80d8 zfs.ko
 3    1 0xffffffff83115000     76f8 cryptodev.ko
 4    1 0xffffffff83a10000     3370 acpi_wmi.ko
. . .

I looked at all your attachments again. It appears amdgpu was already present before the first crash point in all of them.
For:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff80d17870

objdump -d --prefix-addresses /boot/kernel/kernel | less

shows:

ffffffff80d1786b <qsort+0x12ab> mov %esi,0x4(%r11,%rdx,4)
ffffffff80d17870 <qsort+0x12b0> mov 0x8(%rcx,%rdx,4),%esi

As for other "instruction pointer" examples . . .

Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer = 0x20:0xffffffff80d17890

ffffffff80d1788f <qsort+0x12cf> mov %esi,0xc(%r11,%rdx,4)
ffffffff80d17894 <qsort+0x12d4> add $0x4,%rdx

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x7
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff82600ba6

The above is outside the kernel's code.

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80bf3707

ffffffff80bf3701 <free+0x11> je ffffffff80bf378d <free+0x9d>
ffffffff80bf3707 <free+0x17> mov %rsi,%r14

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer = 0x20:0xffffffff82231ba6

The above is outside the kernel's code.
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80bf3707

ffffffff80bf3701 <free+0x11> je ffffffff80bf378d <free+0x9d>
ffffffff80bf3707 <free+0x17> mov %rsi,%r14

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80bf3727

ffffffff80bf3722 <free+0x32> call ffffffff80f66670 <PHYS_TO_VM_PAGE>
ffffffff80bf3727 <free+0x37> mov (%rax),%r13

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer = 0x20:0xffffffff80d0cea0

ffffffff80d0ce9c <vn_ioctl+0x1fc> jne ffffffff80d0cff2 <vn_ioctl+0x352>
ffffffff80d0cea2 <vn_ioctl+0x202> movzwl 0x2(%r13),%ecx
(In reply to Mark Millard from comment #119) [Sorry for the accidental duplication of the block that had "instruction pointer = 0x20:0xffffffff80bf3707".] The qsort, free, and vn_ioctl addresses do not look to match up with any of the multi-level backtraces. So we have very little evidence about what the context was. I've no clue for the addresses that were outside the kernel.
(In reply to Mark Millard from comment #120) Ugh. I just realized that I'd not looked at an official releng/13.1 build. So using a download of an official kernel.txz this time . . . (the subroutines stay the same but the detailed code is different).

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff80d17870

ffffffff80d1786d <qsort+0x130d> mov -0x38(%rbp),%rdi
ffffffff80d17871 <qsort+0x1311> mov %dl,(%rdi,%rsi,1)

As for other "instruction pointer" examples . . .

Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer = 0x20:0xffffffff80d17890

ffffffff80d1788f <qsort+0x132f> cmp $0x3,%r8
ffffffff80d17893 <qsort+0x1333> jae ffffffff80d17910 <qsort+0x13b0>

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x7
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff82600ba6

The above is outside the kernel's code.

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer = 0x20:0xffffffff82231ba6

The above is outside the kernel's code.
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80bf3707

ffffffff80bf3700 <free+0x70> mov %gs:0xb0,%rax
ffffffff80bf3709 <free+0x79> add %r15,0x8(%rcx,%rax,1)

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80bf3727

ffffffff80bf3724 <free+0x94> cmpb $0x0,0x128(%rbx)
ffffffff80bf372b <free+0x9b> jne ffffffff80bf3777 <free+0xe7>

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer = 0x20:0xffffffff80d0cea0

ffffffff80d0ce9a <vn_ioctl+0x25a> mov %r14,-0xc8(%rbp)
ffffffff80d0cea1 <vn_ioctl+0x261> cmpb $0x0,0xaf417e(%rip) # ffffffff81801026 <sdt_probes_enabled>
(In reply to George Mitchell from comment #117) Would it be reasonable to have some testing with amdgpu.ko loaded but never having a desktop environment active? Or, may be I should form the idea as questions: What is the minimal form of having amdgpu.ko loaded in the system? Can that be tested (if it has not been already)? Does this minimal form behave any differently than more involved use of amdgpu.ko (and the associated card firmware)? In a different direction . . . In/for a separate context, I once built amdgpu and its firmware and installed it. But I did not set up an automatic load. For the rare test, I manually loaded amdgpu and then started lumina. (It is an old memory. I might not have the details correct.) This procedure might have largely avoided later loads of kernel modules and, so, avoided discovering a problem.
All my so-called test setup tests were run without starting a desktop environment (by which I assume you mean not starting X). There were still crashes such as in comment #107, attachment #240642 [details]. With my normal setup, kldloading amdgpu manually instead of automatically noticeably reduced the incidence of crashes but did not eliminate them.
(In reply to George Mitchell from comment #123) "kldloading amdgpu manually": there are two possibilities: A) Using boot -s, doing the kldload, and then exiting to normal mode. There are examples in your attachments of doing this. B) Getting to normal mode, logging in, and only after that doing the first kldload of amdgpu. I do not remember any of the attachments clearly indicating such a sequence. It puts the amdgpu load after the other normal loads.
Well, I was going to try testing in an environment where I've got a serial console: an aarch64 main [so: 14] context. But it turns out that there is at least one missing function declaration for that type of context at this point:

/wrkdirs/usr/ports/graphics/drm-515-kmod/work/drm-kmod-drm_v5.15.25/drivers/gpu/drm/drm_cache.c:362:10: error: call to undeclared function 'in_interrupt'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
        WARN_ON(in_interrupt());
                ^
1 error generated.
*** [drm_cache.o] Error code 1

as is visible in the official build log:

http://ampere2.nyi.freebsd.org/data/main-arm64-default/p64e3eb722c17_s7fc82fd1f8/logs/errors/drm-515-kmod-5.15.25.log

It turns out the drm-510-kmod variant allowed for releng/13.1 and later is missing possible macro definitions for aarch64:

/wrkdirs/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_2/drivers/gpu/drm/amd/display/dc/core/dc.c:741:3: error: call to undeclared function 'DC_FP_START'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                DC_FP_START();
                ^
/wrkdirs/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_2/drivers/gpu/drm/amd/display/dc/core/dc.c:743:3: error: call to undeclared function 'DC_FP_END'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                DC_FP_END();
                ^
2 errors generated.
*** [dc.o] Error code 1

as is visible in:

http://ampere2.nyi.freebsd.org/data/main-arm64-default/p64e3eb722c17_s7fc82fd1f8/logs/errors/drm-510-kmod-5.10.163_2.log

(It is not just my builds that have such issues: official builds have the problems as well.) I was hoping I'd be able to do some testing in the alternative type of context (likely never starting X11). That looks to not be in the cards at this time.
(In reply to Mark Millard from comment #125) Picking the drm-515-kmod one: it looks like the source file referenced needs to include the content of the file providing the #define:

/usr/main-src/sys/compat/linuxkpi/common/include/linux/preempt.h:#define in_interrupt() \

Overall, there are some other uses:

drm-kmod//drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c: if (r < 1 && in_interrupt())
drm-kmod//drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c: if (r < 1 && (amdgpu_in_reset(adev) || in_interrupt()))
drm-kmod//drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c: if (r < 1 && (amdgpu_in_reset(adev) || in_interrupt()))
drm-kmod//drivers/gpu/drm/drm_cache.c: if (WARN_ON(in_interrupt())) {
drm-kmod//drivers/gpu/drm/drm_cache.c: WARN_ON(in_interrupt());

I have not checked whether any of the others already get preempt.h. amd64 might be working via header pollution in some way that aarch64 does not?
(In reply to Mark Millard from comment #126) Further inspection of what comes next after making drm_cache.c pick up the in_interrupt definition suggests that trying builds of aarch64 is premature at this point, making the type of test I was intending also premature.
(In reply to George Mitchell from comment #114) > [...] I will stop using amdgpu for a week and verify, for the purpose > of maintaining my own sanity, that the crashes stop. [...] Back in amd64 land, since the time of that comment, I have rebooted my system 25 times and there have been no crashes at all. I guess I'm sane.
(In reply to George Mitchell from comment #12) Could you also share your "kldstat" output for when amdgpu has been loaded? Loading amdgpu may add more than just amdgpu itself, compared to when amdgpu is not loaded at all. For example some of:

# find /boot/ker*/ -name 'linux*' -print | more
/boot/kernel/linux64.ko
/boot/kernel/linux_common.ko
/boot/kernel/linuxkpi.ko
/boot/kernel/linuxkpi_wlan.ko

might be involved, not just amdgpu. Loading only some prerequisites for amdgpu, but not amdgpu itself, might prove a useful isolation test.
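A sketch of that prerequisite-only isolation test as an /etc/rc.conf fragment. The module names are taken from the kldstat listings elsewhere in this thread; whether each loads standalone (pulling in its own dependencies) on a given install is an assumption to verify first with a manual kldload:

```
# /etc/rc.conf fragment (sketch): load some of amdgpu's prerequisites
# but not amdgpu itself, to see whether the crashes still occur
# without the driver proper.
kld_list="drm ttm"
```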
(In reply to Mark Millard from comment #129) I wrote "what is loaded before" relative to amdgpu. But whatever amdgpu in turn leads to loading, listed after amdgpu in "kldstat" output, is likely just as relevant. For all I know, all of it may be from after amdgpu's position in the "kldstat" list.
(In reply to Mark Millard from comment #130) Based on drm-515-kmod related materials on/for amd64 running main [so: 14] and the type of card that happened to be present, I saw:

22    1 0xffffffff83c00000 4fd918 amdgpu.ko
23    2 0xffffffff83a8e000 79f50  drm.ko
24    1 0xffffffff83b08000 22a8   iic.ko
25    3 0xffffffff83b0b000 30d8   linuxkpi_gplv2.ko
26    4 0xffffffff83b0f000 6320   dmabuf.ko
27    3 0xffffffff83b16000 3360   lindebugfs.ko
28    1 0xffffffff83b1a000 b350   ttm.ko
29    1 0xffffffff83b26000 a118   amdgpu_polaris11_k_mc_bin.ko
30    1 0xffffffff83b31000 6370   amdgpu_polaris11_pfp_2_bin.ko
31    1 0xffffffff83b38000 6370   amdgpu_polaris11_me_2_bin.ko
32    1 0xffffffff83b3f000 4370   amdgpu_polaris11_ce_2_bin.ko
33    1 0xffffffff83b44000 7978   amdgpu_polaris11_rlc_bin.ko
34    1 0xffffffff83b4c000 42380  amdgpu_polaris11_mec_2_bin.ko
35    1 0xffffffff83b8f000 42380  amdgpu_polaris11_mec2_2_bin.ko
36    1 0xffffffff83bd2000 5270   amdgpu_polaris11_sdma_bin.ko
37    1 0xffffffff83bd8000 5270   amdgpu_polaris11_sdma1_bin.ko
38    1 0xffffffff840fe000 5db58  amdgpu_polaris11_uvd_bin.ko
39    1 0xffffffff8415c000 2ac78  amdgpu_polaris11_vce_bin.ko
40    1 0xffffffff83bde000 21d90  amdgpu_polaris11_k_smc_bin.ko

This was from deliberately using kldload amdgpu after all the normal boot/login load activity. No kld_list= use involved at all. I wonder how much your environment would crash for amdgpu loaded this late.
FYI: The prior load activity was:

Id Refs Address            Size    Name
 1  132 0xffffffff80200000 295b050 kernel
 2    1 0xffffffff82b5d000 76f8    cryptodev.ko
 3    1 0xffffffff82b65000 5b80d8  zfs.ko
 4    1 0xffffffff83a10000 3370    acpi_wmi.ko
 5    1 0xffffffff83a14000 3210    intpm.ko
 6    1 0xffffffff83a18000 2178    smbus.ko
 7    1 0xffffffff83a1b000 2220    cpuctl.ko
 8    1 0xffffffff83a1e000 3360    uhid.ko
 9    1 0xffffffff83a22000 4364    ums.ko
10    1 0xffffffff83a27000 33a0    usbhid.ko
11    1 0xffffffff83a2b000 32a8    hidbus.ko
12    1 0xffffffff83a2f000 4d00    ng_ubt.ko
13    6 0xffffffff83a34000 ab28    netgraph.ko
14    2 0xffffffff83a3f000 a238    ng_hci.ko
15    4 0xffffffff83a4a000 2668    ng_bluetooth.ko
16    1 0xffffffff83a4d000 8380    uftdi.ko
17    1 0xffffffff83a56000 4e48    ucom.ko
18    1 0xffffffff83a5b000 3340    wmt.ko
19    1 0xffffffff83a5f000 e250    ng_l2cap.ko
20    1 0xffffffff83a6e000 1bf08   ng_btsocket.ko
21    1 0xffffffff83a8a000 38b8    ng_socket.ko
(In reply to Mark Millard from comment #129) When I boot up to single-user mode, kldstat says:

Id Refs Address            Size    Name
 1    7 0xffffffff80200000 1f2ffd0 kernel
 2    1 0xffffffff82130000 8cc90   vboxdrv.ko
 3    1 0xffffffff821be000 ff4b8   if_re.ko
 4    1 0xffffffff822be000 77e0    sem.ko

After "kldload amdgpu," it says:

Id Refs Address            Size    Name
 1   59 0xffffffff80200000 1f2ffd0 kernel
 2    1 0xffffffff82130000 8cc90   vboxdrv.ko
 3    1 0xffffffff821be000 ff4b8   if_re.ko
 4    1 0xffffffff822be000 77e0    sem.ko
 5    1 0xffffffff82600000 417220  amdgpu.ko
 6    2 0xffffffff82518000 739e0   drm.ko
 7    3 0xffffffff8258c000 5220    linuxkpi_gplv2.ko
 8    4 0xffffffff82592000 62d8    dmabuf.ko
 9    1 0xffffffff82599000 c758    ttm.ko
10    1 0xffffffff825a6000 2218    amdgpu_raven_gpu_info_bin.ko
11    1 0xffffffff825a9000 64d8    amdgpu_raven_sdma_bin.ko
12    1 0xffffffff825b0000 2e2d8   amdgpu_raven_asd_bin.ko
13    1 0xffffffff825df000 93d8    amdgpu_raven_ta_bin.ko
14    1 0xffffffff825e9000 7558    amdgpu_raven_pfp_bin.ko
15    1 0xffffffff825f1000 6558    amdgpu_raven_me_bin.ko
16    1 0xffffffff825f8000 4558    amdgpu_raven_ce_bin.ko
17    1 0xffffffff82a18000 b9c0    amdgpu_raven_rlc_bin.ko
18    1 0xffffffff82a24000 437e8   amdgpu_raven_mec_bin.ko
19    1 0xffffffff82a68000 437e8   amdgpu_raven_mec2_bin.ko
20    1 0xffffffff82aac000 5a638   amdgpu_raven_vcn_bin.ko

But after a full boot without amdgpu, it says:

Id Refs Address            Size    Name
 1   66 0xffffffff80200000 1f2ffd0 kernel
 2    1 0xffffffff82130000 ff4b8   if_re.ko
 3    3 0xffffffff82230000 8cc90   vboxdrv.ko
 4    1 0xffffffff822bd000 77e0    sem.ko
 5    1 0xffffffff82600000 3df128  zfs.ko
 6    2 0xffffffff82518000 4240    vboxnetflt.ko
 7    2 0xffffffff8251d000 aac8    netgraph.ko
 8    1 0xffffffff82528000 31c8    ng_ether.ko
 9    1 0xffffffff8252c000 55e0    vboxnetadp.ko
10    1 0xffffffff82532000 3378    acpi_wmi.ko
11    1 0xffffffff82536000 3218    intpm.ko
12    1 0xffffffff8253a000 2180    smbus.ko
13    1 0xffffffff8253d000 33c0    uslcom.ko
14    1 0xffffffff82541000 4d90    ucom.ko
15    1 0xffffffff82546000 2340    uhid.ko
16    1 0xffffffff82549000 3380    usbhid.ko
17    1 0xffffffff8254d000 31f8    hidbus.ko
18    1 0xffffffff82551000 3320    wmt.ko
19    1 0xffffffff82555000 4350    ums.ko
20    1 0xffffffff8255a000 5af8    autofs.ko
21    1 0xffffffff82560000 2a08    mac_ntpd.ko
22    1 0xffffffff82563000 20f0    green_saver.ko
(In reply to George Mitchell from comment #132) I wonder if, in your context, the following boot sequencing might sidestep the boot-crash issue: a full boot without amdgpu, then "kldload amdgpu", then normal use. Basically: doing the amdgpu load as late as possible relative to everything else loaded, limiting what loads after amdgpu.
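One hedged way to automate that late-load sequencing (assuming the stock rc ordering, where /etc/rc.local runs near the end of the multi-user boot, after everything in kld_list):

```
# /etc/rc.local (sketch): load amdgpu as late as possible in the normal
# boot, after all the other rc-driven module loads and services.
# "kldload -n" stays quiet if the module is somehow already loaded.
/sbin/kldload -n amdgpu
```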
Okay, my machine is set up as you requested. It boots to multiuser mode without starting an X session, at which point I load amdgpu and then start my normal XFCE session. I'll run it this way for a week. Undoubtedly, it won't exhibit the bootup crash in this mode of operation, but I won't be surprised if I still get a shutdown crash or two. And in any case this isn't a fix for the underlying bug. Not sure what new information this is likely to yield.
(In reply to George Mitchell from comment #134) Having the kldstat output for this combination would help identify what module is initially involved in any crash. Part of what may be of use is how often you see the dbuf_evict_thread type of backtrace and what module the first "instruction pointer =" references in such cases (if any). Another would be if new crash contexts show up that have not been seen before. So far there is no evidence for how many bugs there are, given the varying failure-structures that show up. There could even be the possibility of unreliable memory or bugs specific to amdgpu_raven_*.ko files (such as sometimes trashing some memory). I've yet to induce any failure in the amdgpu_polaris11_*.ko based amd64 context that I have access to (a ThreadRipper 1950X), although by no means is it a close match to your context. To my knowledge, you still have the only known examples of any of the failures. To some extent, if trying new things leads to new forms of failure for you, it potentially gives me new sequences to try on the ThreadRipper 1950X. How (un)likely that is to yield useful information I do not know. (My hope to also try on aarch64, where I've access to a serial console, did not pan out.)
Sorry, meant to put these in yesterday. After booting to single-user mode, kldstat reports:

Id Refs Address            Size    Name
 1    7 0xffffffff80200000 1f2ffd0 kernel
 2    1 0xffffffff82130000 8cc90   vboxdrv.ko
 3    1 0xffffffff821be000 ff4b8   if_re.ko
 4    1 0xffffffff822be000 77e0    sem.ko

If I boot to single-user mode and kldload amdgpu, kldstat reports:

Id Refs Address            Size    Name
 1   59 0xffffffff80200000 1f2ffd0 kernel
 2    1 0xffffffff82130000 8cc90   vboxdrv.ko
 3    1 0xffffffff821be000 ff4b8   if_re.ko
 4    1 0xffffffff822be000 77e0    sem.ko
 5    1 0xffffffff82600000 417220  amdgpu.ko
 6    2 0xffffffff82518000 739e0   drm.ko
 7    3 0xffffffff8258c000 5220    linuxkpi_gplv2.ko
 8    4 0xffffffff82592000 62d8    dmabuf.ko
 9    1 0xffffffff82599000 c758    ttm.ko
10    1 0xffffffff825a6000 2218    amdgpu_raven_gpu_info_bin.ko
11    1 0xffffffff825a9000 64d8    amdgpu_raven_sdma_bin.ko
12    1 0xffffffff825b0000 2e2d8   amdgpu_raven_asd_bin.ko
13    1 0xffffffff825df000 93d8    amdgpu_raven_ta_bin.ko
14    1 0xffffffff825e9000 7558    amdgpu_raven_pfp_bin.ko
15    1 0xffffffff825f1000 6558    amdgpu_raven_me_bin.ko
16    1 0xffffffff825f8000 4558    amdgpu_raven_ce_bin.ko
17    1 0xffffffff82a18000 b9c0    amdgpu_raven_rlc_bin.ko
18    1 0xffffffff82a24000 437e8   amdgpu_raven_mec_bin.ko
19    1 0xffffffff82a68000 437e8   amdgpu_raven_mec2_bin.ko
20    1 0xffffffff82aac000 5a638   amdgpu_raven_vcn_bin.ko

If I boot to multi-user mode without kldloading amdgpu, kldstat reports:

Id Refs Address            Size    Name
 1   66 0xffffffff80200000 1f2ffd0 kernel
 2    1 0xffffffff82130000 ff4b8   if_re.ko
 3    1 0xffffffff82231000 77e0    sem.ko
 4    3 0xffffffff82239000 8cc90   vboxdrv.ko
 5    1 0xffffffff82600000 3df128  zfs.ko
 6    2 0xffffffff82518000 4240    vboxnetflt.ko
 7    2 0xffffffff8251d000 aac8    netgraph.ko
 8    1 0xffffffff82528000 31c8    ng_ether.ko
 9    1 0xffffffff8252c000 55e0    vboxnetadp.ko
10    1 0xffffffff82532000 3378    acpi_wmi.ko
11    1 0xffffffff82536000 3218    intpm.ko
12    1 0xffffffff8253a000 2180    smbus.ko
13    1 0xffffffff8253d000 33c0    uslcom.ko
14    1 0xffffffff82541000 4d90    ucom.ko
15    1 0xffffffff82546000 2340    uhid.ko
16    1 0xffffffff82549000 3380    usbhid.ko
17    1 0xffffffff8254d000 31f8    hidbus.ko
18    1 0xffffffff82551000 3320    wmt.ko
19    1 0xffffffff82555000 4350    ums.ko
20    1 0xffffffff8255a000 5af8    autofs.ko
21    1 0xffffffff82560000 2a08    mac_ntpd.ko
22    1 0xffffffff82563000 20f0    green_saver.ko

If I then kldload amdgpu, it says the same as above, plus:

23    1 0xffffffff82a00000 417220  amdgpu.ko
24    2 0xffffffff82566000 739e0   drm.ko
25    3 0xffffffff825da000 5220    linuxkpi_gplv2.ko
26    4 0xffffffff825e0000 62d8    dmabuf.ko
27    1 0xffffffff825e7000 c758    ttm.ko
28    1 0xffffffff825f4000 2218    amdgpu_raven_gpu_info_bin.ko
29    1 0xffffffff825f7000 64d8    amdgpu_raven_sdma_bin.ko
30    1 0xffffffff82e18000 2e2d8   amdgpu_raven_asd_bin.ko
31    1 0xffffffff829e0000 93d8    amdgpu_raven_ta_bin.ko
32    1 0xffffffff829ea000 7558    amdgpu_raven_pfp_bin.ko
33    1 0xffffffff829f2000 6558    amdgpu_raven_me_bin.ko
34    1 0xffffffff829f9000 4558    amdgpu_raven_ce_bin.ko
35    1 0xffffffff82e47000 b9c0    amdgpu_raven_rlc_bin.ko
36    1 0xffffffff82e53000 437e8   amdgpu_raven_mec_bin.ko
37    1 0xffffffff82e97000 437e8   amdgpu_raven_mec2_bin.ko
38    1 0xffffffff82edb000 5a638   amdgpu_raven_vcn_bin.ko
Created attachment 241022 [details] Four boot-time crashes in a row For some reason, I just got four boot-up crashes immediately in a row. After I cycled power, I was able to boot up without crashing. I think I'm going to load zfs.ko from /boot/loader.conf to get it loaded earlier, which mitigates this problem. (It's currently loaded with zfs_enable="YES" in /etc/rc.conf.)
(In reply to George Mitchell from comment #137) Your upload ended up being: application/octet-stream this time, instead of text/plain .
Yes. It's a compressed tar file with four core.txt files for the price of one. They are different enough that I thought I'd better attach them all, though mainly the later ones include increasing portions of the earlier ones because they were on immediately successive boots.
(In reply to George Mitchell from comment #137) All 4 are examples related to dbuf_evict_thread (a.k.a. zfs dbuf related crashes), as I feared. All 4 look like:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0x7
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff82600ba6

Looks to be in:

 5    1 0xffffffff82600000 3df128  zfs.ko

panic: page fault
cpuid = 1
time = 1679349400
KDB: stack backtrace:
#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff810ade4f at trap_pfault+0x4f
#5 0xffffffff81084fd8 at calltrap+0x8
#6 0xffffffff827ac768 at zap_evict_sync+0x68
#7 0xffffffff8267d74a at dbuf_destroy+0xba
#8 0xffffffff82683129 at dbuf_evict_one+0xf9
#9 0xffffffff8267b43d at dbuf_evict_thread+0x31d
#10 0xffffffff80bd8abe at fork_exit+0x7e
#11 0xffffffff8108604e at fork_trampoline+0xe

#6 0xffffffff810ade4f in trap_pfault (frame=0xfffffe00b3bb6d00, usermode=false, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:763
#7 <signal handler called>
#8 avl_destroy_nodes (tree=tree@entry=0xfffff8001a80b5a0, cookie=cookie@entry=0xfffffe00b3bb6dd0) at /usr/src/sys/contrib/openzfs/module/avl/avl.c:1023
#9 0xffffffff827ac768 in mze_destroy (zap=0xfffff8001a80b480) at /usr/src/sys/contrib/openzfs/module/zfs/zap_micro.c:402

A question would be whether this repeats with amdgpu loaded (again, last) but no X11-like activity ever having been started: limiting amdgpu use to just the load activity, or as close to that limited a use as is possible. (This is separate from your zfs load time adjustment test.) My guess is that the content of some memory area(s) is being trashed in your context. I'm not sure how to track down what is doing the trashing, or where all the trashed area(s) are, if that is what is going on.
At least we now have a clue how to get the specific type of crash. Before I had no clue what an example initial-context might be like. Note: Changing the load order should get a matching kldstat report to indicate the address ranges that end up involved.
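That address-range bookkeeping can be mechanized. Below is a hedged sketch (ip_to_module is an illustrative name, not an existing tool): given a saved kldstat listing on stdin, it reports which module's range base <= ip < base + size contains a faulting instruction pointer, plus the offset into that module, which can then be looked up in an objdump of that .ko:

```shell
# Map an instruction pointer onto a loaded module using saved kldstat
# output.  Relies on sh(1) 64-bit arithmetic; note that kldstat's
# Address column carries a 0x prefix while the Size column does not.
ip_to_module() {
    ip=$(( $1 ))
    while read -r _id _refs addr size name; do
        case $addr in 0x*) ;; *) continue ;; esac   # skip the header line
        base=$(( addr ))
        off=$(( ip - base ))
        if [ "$off" -ge 0 ] && [ "$off" -lt "$(( 0x$size ))" ]; then
            printf '%s+0x%x\n' "$name" "$off"
            return 0
        fi
    done
    echo "not in any listed module" >&2
    return 1
}
```

For the crash above, feeding the saved kldstat output plus 0xffffffff82600ba6 should report zfs.ko with offset 0xba6, matching the manual mapping in the comment.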
(In reply to George Mitchell from comment #139) The upload did not look compressed to me: I just had to use tools that would tolerate the binary content at the start and end. The rest looked like normal text without me doing anything to decompress the file. But, looking, the prefix text does look like a partially-binary header, likely added by a tool. The tail end might just be binary padding. At least I've a clue for next time.
So I should just boot up to multi-user mode and kldload amdgpu, but not start XFCE? And repeat until it crashes again?
(In reply to George Mitchell from comment #142) Seeing if that no-XFCE context crashes vs. not would be a good idea. If it crashes similarly, then XFCE activity is not likely to be involved. If it does not crash, then XFCE activity is likely involved. FYI: all 4 crashes had: fault virtual address = 0x7 (the same small offset from a NULL pointer in C terms). This does not look like random trashing of memory (for the few examples available).
Created attachment 241027 [details] Another shutdown-time crash I got another shutdown-time crash. The part of this file that is relevant to this crash starts around line 1400; all the earlier stuff appears to be from the crashes earlier today.
(In reply to George Mitchell from comment #144) Looking at your full list of attachments, it appears that . . .

All the shutdown-time crashes have:

fault virtual address = 0x0

(And we might now have a known type of context for getting that type of failure: late amdgpu but no XFCE.)

All the dbuf_evict_thread related crashes have:

fault virtual address = 0x7

(Late amdgpu but having used XFCE.)

All the kldload related crashes have:

Fatal trap 9: general protection fault while in kernel mode

(but no explicit fault address listed) (Early amdgpu loading.)

My guess is something is trashing memory in a way that involves writing zeros over some pointer values that it should not be touching. Later code extracts such zeros, applies any offset, and then tries to dereference the result, resulting in a crash. That you got "fault virtual address = 0x0" for shutdown without having involved XFCE suggests that a problem is already in place before XFCE is potentially involved: XFCE is not required. (XFCE use might lead to more trashed memory than otherwise, leading to the 0x7 fault address cases.) But I do not see how to get solid evidence for or against such a hypothesis (or related ones). The only thing I can identify that is likely unique to your context --but is involved with amdgpu-- is the involvement of the amdgpu_raven_*.ko modules. Unfortunately, moving your context to a different system that avoids such module use, or finding someone with a separate system that does have such use (and is willing to set up experiments), is non-trivial for both directions of testing. Beyond possibly some checking on the degree/ease of repeatability, I do not see how to gather better information, much less get anywhere near directly actionable information for fixing the crashes. The one thing we have not looked at is the crash dumps themselves, examining what memory looks like and such. But I do not know what to do for that either, relative to known-useful information.
Such a direction would be very exploratory and likely very time consuming.
(In reply to Mark Millard from comment #145) For the:

fault virtual address = 0x7

examples, it looks like the value stored in RAM has the 0x7 in it, instead of 0x7 being a later offset addition. The loop in question in avl_destroy_nodes just uses "mov (%rdi),%rdi" with no offset involved:

NOTE: Loop starts below
0x0000000000000ba0 <+64>: mov %rdi,%rax
0x0000000000000ba3 <+67>: mov %rdx,%rcx
0x0000000000000ba6 <+70>: mov (%rdi),%rdi
0x0000000000000ba9 <+73>: mov %rax,%rdx
0x0000000000000bac <+76>: test %rdi,%rdi
0x0000000000000baf <+79>: jne 0xba0 <avl_destroy_nodes+64>
NOTE: The above is the loop end
Created attachment 241046 [details] Crash at shutdown time Another occurrence of the crash at shutdown time rather than boot time. I'm reluctant to post a vmcore file here, but I can make it available to anyone who thinks it will be useful.
(In reply to George Mitchell from comment #147) That crash is different from all prior ones. It crashed in nfsd via a:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer = 0x20:0xffffffff80c895cb
stack pointer = 0x28:0xfffffe00b555dba0
frame pointer = 0x28:0xfffffe00b555dbb0
code segment = base rx0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 1109 (nfsd)

None of the prior kldstat outputs have shown nfsd as loaded. For reference:

panic: general protection fault
cpuid = 1
time = 1679441112
KDB: stack backtrace:
#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff81084fd8 at calltrap+0x8
#5 0xffffffff80c8866b at seltdclear+0x2b
#6 0xffffffff80c88355 at kern_select+0xbd5
#7 0xffffffff80c88456 at sys_select+0x56
#8 0xffffffff810ae6ec at amd64_syscall+0x10c
#9 0xffffffff810858eb at fast_syscall_common+0xf8

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb)
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c1b7ec in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:487
#3 0xffffffff80c1bc5e in vpanic (fmt=0xffffffff811b2f41 "%s", ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4 0xffffffff80c1ba63 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:844
#5 0xffffffff810addf5 in trap_fatal (frame=0xfffffe00b555dae0, eva=0) at /usr/src/sys/amd64/amd64/trap.c:944
#6 <signal handler called>
#7 0xffffffff80c895cb in atomic_fcmpset_long (src=18446741877726026240, dst=<optimized out>, expect=<optimized out>) at /usr/src/sys/amd64/include/atomic.h:225
#8 selfdfree (stp=stp@entry=0xfffff80012aa8080, sfp=0xfffff80000000007) at /usr/src/sys/kern/sys_generic.c:1755
#9 0xffffffff80c8866b in seltdclear (td=td@entry=0xfffffe00b52e9a00) at /usr/src/sys/kern/sys_generic.c:1967
#10 0xffffffff80c88355 in kern_select (td=<optimized out>, td@entry=0xfffffe00b52e9a00, nd=7, fd_in=<optimized out>, fd_ou=<optimized out>, fd_ex=<optimized out>, tvp=<optimized out>, tvp@entry=0x0, abi_nfdbits=64) at /usr/src/sys/kern/sys_generic.c:1210
#11 0xffffffff80c88456 in sys_select (td=0xfffffe00b52e9a00, uap=0xfffffe00b52e9de8) at /usr/src/sys/kern/sys_generic.c:1014
#12 0xffffffff810ae6ec in syscallenter (td=0xfffffe00b52e9a00) at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#13 amd64_syscall (td=0xfffffe00b52e9a00, traced=0) at /usr/src/sys/amd64/amd64/trap.c:1185
#14 <signal handler called>
#15 0x00000008011a373a in ?? ()

Note: 18446741877726026240 == 0xfffffe00b52e9a00
(In reply to Mark Millard from comment #148) > None of the prior kldstat outputs have shown nfsd as loaded. That's because they weren't verbose kldstats. nfsd is statically linked into the kernel. kldstat -v definitely shows that nfsd is present.
In order to reconfirm my sincere belief that the key factor in these crashes is amdgpu (and also because I need a respite from the crashes), I'm running without amdgpu (and running X in VESA mode) for a while. I fully expect that the crashes will stop as a result.
(In reply to George Mitchell from comment #150) Sounds appropriate. "amdgpu" really refers to the whole bundle:

23    1 0xffffffff82a00000 417220  amdgpu.ko
24    2 0xffffffff82566000 739e0   drm.ko
25    3 0xffffffff825da000 5220    linuxkpi_gplv2.ko
26    4 0xffffffff825e0000 62d8    dmabuf.ko
27    1 0xffffffff825e7000 c758    ttm.ko
28    1 0xffffffff825f4000 2218    amdgpu_raven_gpu_info_bin.ko
29    1 0xffffffff825f7000 64d8    amdgpu_raven_sdma_bin.ko
30    1 0xffffffff82e18000 2e2d8   amdgpu_raven_asd_bin.ko
31    1 0xffffffff829e0000 93d8    amdgpu_raven_ta_bin.ko
32    1 0xffffffff829ea000 7558    amdgpu_raven_pfp_bin.ko
33    1 0xffffffff829f2000 6558    amdgpu_raven_me_bin.ko
34    1 0xffffffff829f9000 4558    amdgpu_raven_ce_bin.ko
35    1 0xffffffff82e47000 b9c0    amdgpu_raven_rlc_bin.ko
36    1 0xffffffff82e53000 437e8   amdgpu_raven_mec_bin.ko
37    1 0xffffffff82e97000 437e8   amdgpu_raven_mec2_bin.ko
38    1 0xffffffff82edb000 5a638   amdgpu_raven_vcn_bin.ko

I'm still at a loss for getting any improved type of evidence. Spending time on the dnetc related scheduler benchmarking today has been a nice break from pondering this.
As expected, I have had no crashes since avoiding drm-510-kmod and running in VESA mode. Might it be worth updating 5.10.163_2 to 5.10.163_3? Notes I haven't mentioned recently: Prior to FBSD 13, whenever I tried drm-510-kmod, my machine would lock up hard and not respond to anything other than cycling power. I have an AMD Ryzen 3 2200G with Radeon Vega Graphics running on a Gigabyte B450M D53H motherboard. Every time I boot up, I see the following ACPI warnings, which don't otherwise seem to affect operation:

Firmware Warning (ACPI): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20201113/tbfadt-796)
ACPI: \134AOD.WQBA: 1 arguments were passed to a non-method ACPI object (Buffer) (20201113/nsarguments-361)
ACPI: \134GSA1.WQCC: 1 arguments were passed to a non-method ACPI object (Buffer) (20201113/nsarguments-361)

Do any of you understand these?
(In reply to George Mitchell from comment #152) I'm not sure what all is involved in setting up the VESA usage test, but it sounds like it was a great test for isolating the problem to the material associated with amdgpu loading for your Radeon Vega Graphics context. Are there any negative consequences to the use of VESA? If the notes are simple/short could you supply instructions so that I could try the analogous thing in the Polaris 11 context that I have access to?
(In reply to George Mitchell from comment #152) Looked at my ACPI boot warning/error messages and I get just (with a little context shown from the grep for ACPI lines):

acpi_wmi0: <ACPI-WMI mapping> on acpi0
ACPI: \134AOD.WQBA: 1 arguments were passed to a non-method ACPI object (Buffer) (20221020/nsarguments-361)
acpi_wmi1: <ACPI-WMI mapping> on acpi0
ACPI: \134GSA1.WQCC: 1 arguments were passed to a non-method ACPI object (Buffer) (20221020/nsarguments-361)
acpi_wmi2: <ACPI-WMI mapping> on acpi0

But I do not get anything analogous to your reported:

Firmware Warning (ACPI): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20201113/tbfadt-796)

So that last has some chance of being involved, since I've been unable to reproduce your problems and the message is unique to your context. (Only suggestive.) Any chance that there is a UEFI update available for your machine?
(In reply to Mark Millard from comment #153) Hmm. I see that: https://docs.freebsd.org/en/books/handbook/x11/#x-install reports: "VESA module must be used when booting in BIOS mode and SCFB module must be used when booting in UEFI mode." My context is UEFI, so VESA looks to be inappropriate for my context. Your using BIOS (non-UEFI) vs. my using UEFI (non-BIOS) is another context difference relative to my not managing to reproduce the problems.
Ironically, I am presently forced back into using amdgpu.ko because the xorg-server update from 21.1.6,1 to 21.1.7,1 broke the VESA driver (bug #270509).
I forgot to mention earlier: Whenever I start chrome from a terminal window, I see the message: amdgpu: os_same_file_description couldn't determine if two DRM fds reference the same file description. Probably not related to this bug, but I thought I'd better mention it.
(In reply to George Mitchell from comment #32) > … I'm doing this testing on a desktop machine, … (In reply to George Mitchell from comment #152) > … not respond to anything other than cycling power. … In that situation, does the system respond to a normal (not long) press on the power button? ---- On my everyday notebook here, I have this in sysctl.conf(5): hw.acpi.power_button_state="S5"
(In reply to Graham Perrin from comment #158) When I referred to cycling power, I meant by a long press of the power button, which worked just fine (except that I was going to have to run fsck on the next boot). Also, that was when I was running FBSD 12 and I'm not in a position to repeat that test any more. Thanks for the input.
I also use vbox + zfs + amdgpu. On 13.2-STABLE I had a kernel panic on vboxdrv / vboxnetadp load, so I switched to 13.1-RELEASE. Now, after upgrading to 13.2, I have this problem again. Maybe related? https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=270809
(In reply to Tomasz "CeDeROM" CEDRO from comment #160) The package builds for 13.2-RELEASE have not even started yet. Systems using/needing kernel-specific ports should wait to upgrade to 13.2-RELEASE until the packages are known to be available, if they are updating via binary packages. This is normal when a new release happens: FreeBSD does not hold the release until after the packages are available. 13.1-RELEASE is still supported for some time, but generally cannot use 13.2-RELEASE based packages.
Thanks Mark :-) The problem is that the kernel module built from ports crashes on load :-(
(In reply to Tomasz "CeDeROM" CEDRO from comment #162) Crashing from having the wrong module vintage for the kernel is normal/historical as I understand it. So, unfortunately, not anything new. The package build servers will not start building based on 13.2-RELEASE until 13.1-RELEASE goes EOL, as I understand it. Prior to that, building from source is what is supported when such kernel-dependent ports are involved. FreeBSD still has some build-from-source biases in its handling of things; resource limitations may well still be forcing that, for all I know. So, either wait to use 13.2-RELEASE, or build and install (some) ports via source-based builds if you require ports with kernel-dependent modules.
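For anyone following along, a rough sketch of the build-from-source route being described, assuming a standard ports tree under /usr/ports and the usual port origins for the modules mentioned in this thread; run as root on the system already booted into the new kernel:

```shell
# Rebuild kernel-dependent modules against the running kernel sources.
# (Assumes /usr/src matches the installed kernel.)
cd /usr/ports/graphics/drm-510-kmod
make clean
make install

cd /usr/ports/emulators/virtualbox-ose-kmod
make clean
make install
```

The point is simply that the module must be compiled against the kernel it will be loaded into; a package built for 13.1 loaded on a 13.2 kernel is exactly the mismatch being warned about.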
(In reply to Tomasz "CeDeROM" CEDRO from comment #162) Sorry that I misinterpreted some of the context/wording. And nice to see that the 13.1-RELEASE build is rejected with a message, now that I look again.
Created attachment 241523 [details] Crash that happened neither at startup nor shutdown Perhaps not related to my original crash, but undoubtedly a crash that happened in amdgpu code. I was watching a movie using vlc. I decided I was finished watching and I typed control-q. The screen froze with a frame from the movie still showing, and after a few seconds the machine rebooted and saved a coredump, with the attached crash summary that really doesn't resemble any of the earlier ones saved here. Does anyone have any words of wisdom? To avoid the startup crash, I had booted to single user mode and had kldloaded vboxnetflt and amdgpu before continuing to multiuser mode.
I got tired of all those VirtualBox problems. I do not really care anymore about that program, if its problems are related to amdgpu or zfs. I have switched to bhyve that can be easily managed from a shell with vm utility [1]. I recommend doing the same. [1] https://github.com/churchers/vm-bhyve
This is an amdgpu problem. Although vboxnetflt is one of the kernel modules that can, in cooperation with amdgpu, exhibit the crash, zfs and acpi_wmi have also exhibited the same failure -- and the most recent crash summary contains no reference to vboxnetflt participating in the crash. (It does show that I manually typed "kldload vboxnetflt" in single-user mode about an hour and a half before the crash occurred.)
After upgrading to 5.10.163_5 today, I haven't yet had this crash -- but I've booted only a couple of times so far and it's too soon to jump to any conclusions.
Created attachment 241741 [details] Shutdown crash with version 5.10.163_5 5.10.163_5 still crashes. This time it was at shutdown time.
Created attachment 241750 [details] And another plain old boot time crash I had thought I could artificially provoke the crash by booting to single user mode, loading the amdgpu, zfs, vboxnetflt, and acpi_wmi kernel modules in quick succession, and then continuing to multiuser mode. But that didn't do it. So yesterday I went back to the old way of loading zfs with zfs_enable="YES" in rc.conf instead of zfs_load="YES" in /boot/loader.conf, and loading amdgpu by setting kld_list="amdgpu" in rc.conf. And now I get the crashes again.
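For readers comparing the two setups, the contrast being described looks roughly like this (a sketch of the two configuration files; only the lines relevant to this thread are shown):

```shell
# /boot/loader.conf -- modules loaded by the loader, before the kernel starts
# (the non-crashing variant for zfs; amdgpu reportedly fails to load this way)
zfs_load="YES"

# /etc/rc.conf -- modules loaded later by rc(8) during multi-user startup
# (the variant under which the crashes reappear)
zfs_enable="YES"
kld_list="amdgpu"
```

The timing difference matters here: loader.conf modules are in place before init runs, while kld_list modules load concurrently with the rest of multi-user startup, which fits the apparent race being chased in this bug.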
(In reply to George Mitchell from comment #170) I'm unclear on the contrasting case: when you use /boot/loader.conf material instead of /etc/rc.conf material what happens these days? No crashes? Fairly rare crashes of the usual types? Fairly rare crashes of other types? A mix of fairly rare crashes of the 2 categories? (I may well not be thinking of everything that would be of note. So take the questions as just illustrative.)
One of the things that makes this hard to analyze is that the first failure quickly leads to other failures, and most of the evidence is for the later failures. For example, in the following, note that the original trap number is 12 but the backtrace is for/after a later trap, of type-number 22 instead. There is very little information directly about the original trap of type-number 12:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80bf3727
stack pointer = 0x28:0xfffffe000e1a7ba0
frame pointer = 0x28:0xfffffe000e1a7bd0
code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 1 (init)
trap number = 12
WARNING !drm_modeset_is_locked(&crtc->mutex) failed at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_4/drivers/gpu/drm/drm_atomic_helper.c:619
. . . 
WARNING !drm_modeset_is_locked(&plane->mutex) failed at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_4/drivers/gpu/drm/drm_atomic_helper.c:894
kernel trap 22 with interrupts disabled
kernel trap 22 with interrupts disabled
panic: page fault
cpuid = 0
time = 1682435560
KDB: stack backtrace:
#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff810ade4f at trap_pfault+0x4f
#5 0xffffffff81084fd8 at calltrap+0x8
#6 0xffffffff8261d251 at spl_nvlist_free+0x61
#7 0xffffffff826dd740 at fm_nvlist_destroy+0x20
#8 0xffffffff827b6e95 at zfs_zevent_post_cb+0x15
#9 0xffffffff826dcd02 at zfs_zevent_drain+0x62
#10 0xffffffff826dcbf8 at zfs_zevent_drain_all+0x58
#11 0xffffffff826dede9 at fm_fini+0x19
#12 0xffffffff82713b94 at spa_fini+0x54
#13 0xffffffff827be303 at zfs_kmod_fini+0x33
#14 0xffffffff8262fb3b at zfs_shutdown+0x2b
#15 0xffffffff80c1b76c at kern_reboot+0x3dc
#16 0xffffffff80c1b381 at sys_reboot+0x411
#17 0xffffffff810ae6ec at amd64_syscall+0x10c
. . .

The primary hint about what code execution context led to the original instance of trap type 12 above is basically:

instruction pointer = 0x20:0xffffffff80bf3727

amdgpu does not leave in place a clean context for debugging kernel crashes. Trying to keep the video context operational for a kernel that has crashed, while not messing up the analysis context for the original problem, is problematic. My guess would be that normal analysis of such tries to have the problem occur in a virtual machine sort of context, where another (outer) context is available that is independent and can look at the details from outside the failing context. But even that would require the failing context in the VM to stop before amdgpu or the like messed up the evidence in the VM. (Not that I've ever done that type of evidence gathering.)
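One conventional way to turn that instruction pointer into a source location is kgdb against the saved vmcore. A sketch, assuming kernel debug symbols are installed in the FreeBSD default locations and that the relevant dump is vmcore.1 (the dump number comes from /var/crash/info.*):

```shell
# Open the crash dump with the matching debug kernel.
kgdb /usr/lib/debug/boot/kernel/kernel.debug /var/crash/vmcore.1

# Inside kgdb, resolve the faulting instruction pointer from the trap frame:
#   (kgdb) info line *0xffffffff80bf3727
#   (kgdb) list *0xffffffff80bf3727
```

This only identifies where trap 12 fired, of course, not why the pointer was NULL; but it is at least direct evidence about the original trap rather than the later trap-22 aftermath.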
Here are a collection of points in response to Mark Millard's request.

1. Regardless of the order in which I load kernel modules by hand in single-user mode, I can't ever duplicate the crash.
2. The crash never happens if amdgpu.ko is not loaded.
3. Emmanuel Vadot categorically states that the many, many references to drm_modeset_is_locked failures in the crash summaries are noise caused by virtual terminal switching and don't indicate drm failures. But I still get crashes even when there are no virtual terminal switches (because I didn't start X windows and I didn't type ALT-Fn).
4. The crash always happens after amdgpu.ko is loaded, and (in terms of time of occurrence) at about the time vboxnetflt.ko or acpi_wmi.ko is loaded. The seeming zfs crash can happen even when zfs.ko is loaded before amdgpu.ko, and I theorize that it happens when my large (1 TB) USB ZFS-formatted drive comes on line and gets tasted (after amdgpu.ko is loaded).
5. But I can't come up with any theory in which I can blame the actual crash on vboxnetflt.ko, acpi_wmi.ko, or zfs.ko.

This bug should not be assigned to freebsd-fs. But I can't tell you to whom it should be assigned.
Since my last note on April 27, I have been booting up in this manner:

1. Boot to single user mode.
2. Run a script that loads amdgpu.ko, zfs.ko, vboxnetflt.ko, and acpi_wmi.ko in immediate succession.
3. Exit to multiuser mode.

In the course of roughly 50-60 bootups, there have been only two crashes during single user mode, but regrettably they leave no trace because the root partition is still mounted read-only. At least I think that's why there's no dump. So something about single-user mode makes the crash much less likely to occur. Anyway, jumping through these hoops does enable me to run my graphics with the improved driver.
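For anyone wanting to reproduce this workaround, a minimal sketch of such a single-user-mode script, assuming the module names from this thread (the -n flag makes kldload skip modules that are already loaded, so rerunning the script is harmless):

```shell
#!/bin/sh
# Load the suspect modules back to back, as described above.
for mod in amdgpu zfs vboxnetflt acpi_wmi; do
    kldload -n "$mod" || echo "kldload $mod failed" >&2
done
```

Run it from the single-user shell before exiting to multi-user mode; kld_list in rc.conf should then be left empty so rc(8) does not try to load amdgpu a second time during startup.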
(In reply to George Mitchell from comment #174)
> … crashes during single user mode, but regrettably they leave no trace
> … the root partition is still mounted read-only. …

Hint (whilst in single-user mode):

mount -uw / && zfs mount -a

sysrc dumpdev – you'll probably find a different device, typically the swap partition.
sysrc dumpdir – you'll probably find /var/crash.
service dumpon describe – if you boot in single user mode after a kernel panic, then /var/crash will not yet include information about the panic.
service savecore describe