Bug 267028 - kernel panics when booting with both (zfs.ko or vboxnetflt.ko or acpi_wmi.ko) and amdgpu.ko
Summary: kernel panics when booting with both (zfs.ko or vboxnetflt.ko or acpi_wmi.ko)...
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: crash, needs-qa
Duplicates: 268416
Depends on:
Blocks:
 
Reported: 2022-10-13 22:47 UTC by George Mitchell
Modified: 2023-05-17 23:26 UTC
8 users

See Also:
grahamperrin: maintainer-feedback? (vbox)


Attachments
/var/crash/core.txt.1 crash description (80.48 KB, text/plain)
2022-10-13 22:47 UTC, George Mitchell
no flags Details
/var/crash/core.txt.3 crash description from today (79.99 KB, text/plain)
2022-11-11 23:56 UTC, George Mitchell
no flags Details
Another crash (87.45 KB, text/plain)
2022-11-14 17:40 UTC, George Mitchell
no flags Details
Another crash summary; looks like all the earlier ones (79.99 KB, text/plain)
2022-12-09 17:35 UTC, George Mitchell
no flags Details
New core.txt (90.81 KB, text/plain)
2022-12-14 22:37 UTC, George Mitchell
no flags Details
A new crash (96.09 KB, text/plain)
2022-12-16 22:40 UTC, George Mitchell
no flags Details
A new instance of the same crash (96.09 KB, text/plain)
2022-12-16 22:42 UTC, George Mitchell
no flags Details
Crash after updating kernel/world to 13.1-RELEASE-p5 (109.74 KB, text/plain)
2022-12-18 02:16 UTC, George Mitchell
no flags Details
Crash dump (78.93 KB, text/plain)
2023-01-07 18:08 UTC, George Mitchell
no flags Details
Latest crash dump (87.44 KB, text/plain)
2023-01-28 00:13 UTC, George Mitchell
no flags Details
Crash after loading vboxnetflt early by hand (119.82 KB, text/plain)
2023-02-07 15:10 UTC, George Mitchell
no flags Details
New version of the crash, from acpi_wmi (88.80 KB, text/plain)
2023-02-26 17:35 UTC, George Mitchell
no flags Details
A new but related crash (I think) (82.82 KB, text/plain)
2023-03-05 03:21 UTC, George Mitchell
no flags Details
Another crash summary; looks like all the earlier ones (152.66 KB, text/plain)
2023-03-06 18:15 UTC, George Mitchell
no flags Details
Crash without any use of ZFS, with acpi_wmi (120.99 KB, text/plain)
2023-03-07 18:40 UTC, George Mitchell
no flags Details
Relevant part of /var/log/messages (52.16 KB, text/plain)
2023-03-07 18:43 UTC, George Mitchell
no flags Details
New instance (112.96 KB, text/plain)
2023-03-08 22:35 UTC, George Mitchell
no flags Details
Crashes 2 and 3 (196.92 KB, text/plain)
2023-03-08 22:37 UTC, George Mitchell
no flags Details
Another instance of attachment #240591 crash at shutdown time (82.84 KB, text/plain)
2023-03-10 18:28 UTC, George Mitchell
no flags Details
After upgrading to v5.10.163_2 (115.48 KB, text/plain)
2023-03-10 18:45 UTC, George Mitchell
no flags Details
Four boot-time crashes in a row (94.09 KB, application/octet-stream)
2023-03-20 22:17 UTC, George Mitchell
no flags Details
Another shutdown-time crash (83.02 KB, text/plain)
2023-03-21 00:07 UTC, George Mitchell
no flags Details
Crash at shutdown time (106.50 KB, text/plain)
2023-03-22 00:23 UTC, George Mitchell
no flags Details
Crash that happened neither at startup nor shutdown (279.11 KB, text/plain)
2023-04-16 02:02 UTC, George Mitchell
no flags Details
Shutdown crash with version 5.10.163_5 (92.45 KB, text/plain)
2023-04-25 18:12 UTC, George Mitchell
no flags Details
And another plain old boot time crash (105.33 KB, text/plain)
2023-04-25 22:46 UTC, George Mitchell
no flags Details

Description George Mitchell 2022-10-13 22:47:29 UTC
Created attachment 237279 [details]
/var/crash/core.txt.1 crash description

It doesn't happen every time.  If I use kld_list="amdgpu" in /etc/rc.conf, it happens close to 50% of the time.  If instead I boot to single user mode and manually kldload amdgpu, it happens maybe 20% of the time.  If I have amdgpu_load="YES" in /boot/loader.conf, the module fails to load at all, without saying anything.
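
For reference, the three load mechanisms described above, written out (a sketch; only the lines relevant to amdgpu are shown):

# 1. via rc(8), after the root file system is mounted -- /etc/rc.conf:
kld_list="amdgpu"
# 2. by hand, e.g. from single-user mode:
kldload amdgpu
# 3. via the loader, before the kernel starts -- /boot/loader.conf:
amdgpu_load="YES"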

FreeBSD 13.1-RELEASE-p2, drm-510-kmod-5.10.113_7, AMD Ryzen 3 2200G with Radeon Vega Graphics.

Crashes are always general protection fault panics, replete with complaints about drm_modeset_is_locked being false.
Comment 1 George Mitchell 2022-10-13 22:50:06 UTC
This is marginally an improvement from FreeBSD 12, where kldload amdgpu would always immediately totally lock up the machine, with no recovery path short of powering down and back on.  And when this crash DOESN'T happen, everything works marvelously well (and considerably better than running in VESA mode), so thanks for the work so far!
Comment 2 George Mitchell 2022-10-13 22:52:23 UTC
I have four more of the /var/crash/core.txt files, and core dumps (very large, too big to attach here even compressed) for each of them.
Comment 3 Graham Perrin freebsd_committer freebsd_triage 2022-10-13 23:15:13 UTC
Thank you, and please note that issues for <https://github.com/freebsd/drm-kmod> are normally raised in GitHub.
Comment 4 George Mitchell 2022-10-13 23:22:33 UTC
Ugh, I don't have a GitHub account and I would rather not open one.  (Yes, that does seem selfish of me and I apologize.)
Comment 5 Andriy Gapon freebsd_committer freebsd_triage 2022-10-14 21:17:47 UTC
From a _very_ quick look, it does not appear that this is an amdgpu problem.
The crash is in the core kernel code and the stack trace has mentions of zfs.

#6  <signal handler called>
#7  strcmp (s1=<optimized out>, s2=<optimized out>)
    at /usr/src/sys/libkern/strcmp.c:46
#8  0xffffffff80be8c3d in modlist_lookup (name=0xfffff80004b71000 "zfs", 
    ver=0) at /usr/src/sys/kern/kern_linker.c:1487
#9  modlist_lookup2 (name=0xfffff80004b71000 "zfs", verinfo=0x0)
    at /usr/src/sys/kern/kern_linker.c:1501
#10 linker_load_module (kldname=kldname@entry=0x0, 
    modname=modname@entry=0xfffff80004b71000 "zfs", parent=parent@entry=0x0, 
    verinfo=<optimized out>, verinfo@entry=0x0, 
    lfpp=lfpp@entry=0xfffffe0075fddd90)
    at /usr/src/sys/kern/kern_linker.c:2165
#11 0xffffffff80beb17a in kern_kldload (td=td@entry=0xfffffe007f505a00, 
    file=<optimized out>, file@entry=0xfffff80004b71000 "zfs", 
    fileid=fileid@entry=0xfffffe0075fddde4)
    at /usr/src/sys/kern/kern_linker.c:1150
#12 0xffffffff80beb29b in sys_kldload (td=0xfffffe007f505a00, 
    uap=<optimized out>) at /usr/src/sys/kern/kern_linker.c:1173
#13 0xffffffff810ae6ec in syscallenter (td=0xfffffe007f505a00)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#14 amd64_syscall (td=0xfffffe007f505a00, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1185

To the reporter: do you by chance have zfs in kld_list ?
Comment 6 Graham Perrin freebsd_committer freebsd_triage 2022-10-14 22:08:49 UTC
Also, how is the root file system tuned? 

tunefs -p /

(In reply to George Mitchell from comment #1)

> … from FreeBSD 12 …

Did you run 13.0⋯ for a while, or did you upgrade from 12.⋯ direct to 13.1⋯?
Comment 7 Graham Perrin freebsd_committer freebsd_triage 2022-10-14 22:36:56 UTC
> … immediately after kldload amdgpu …

(In reply to George Mitchell from comment #0)

If I understand correctly, the attachment shows: 

1. kldload amdgpu whilst in single user mode

2. a subsequent, but non-immediate, exit ^D to multi-user mode

3. panic 


…
ugen0.4: <Logitech USB Optical Mouse> at usbus0
<118>Enter full pathname of shell or RETURN for /bin/sh: Cannot read termcap database;
<118>using dumb terminal settings.
<118>root@:/ # kldload amdgpu
<6>[drm] amdgpu kernel modesetting enabled.
…
<6>[drm] Initialized amdgpu 3.40.0 20150101 for drmn0 on minor 0
<118>root@:/ # ^D
<118>Setting hostuuid: 032e02b4-0499-0547-c106-430700080009.
<118>Setting hostid: 0x82f0750c.


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff80d17870
stack pointer	        = 0x28:0xfffffe0075fdda60
frame pointer	        = 0x28:0xfffffe0075fdda60
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 52 (kldload)
…
Comment 8 George Mitchell 2022-10-14 22:38:31 UTC
Thanks for the work so far.  "zfs" is not explicitly in the kld_list, but I do use ZFS and zfs_enable is set to "YES".

Also:

tunefs: POSIX.1e ACLs: (-a)                                disabled
tunefs: NFSv4 ACLs: (-N)                                   disabled
tunefs: MAC multilabel: (-l)                               disabled
tunefs: soft updates: (-n)                                 disabled
tunefs: soft update journaling: (-j)                       disabled
tunefs: gjournal: (-J)                                     disabled
tunefs: trim: (-t)                                         disabled
tunefs: maximum blocks per file in a cylinder group: (-e)  4096
tunefs: average file size: (-f)                            16384
tunefs: average number of files in a directory: (-s)       64
tunefs: minimum percentage of free space: (-m)             8%
tunefs: space to hold for metadata blocks: (-k)            6408
tunefs: optimization preference: (-o)                      time
tunefs: volume label: (-L)                                 

I never ran 13.0; I'm always leery of upgrading to x.0 from x-1.  (My upgrade was from 12.3-p6.)  Also, I still remember a collection of severe crashes from years back with soft updates plus journaling.  Are those problems known to be solved now?  (Sorry to be getting off the main topic.)
Comment 9 George Mitchell 2022-10-14 22:39:33 UTC
In this particular crash, I manually loaded amdgpu in single-user mode, and then immediately hit control-D.
Comment 10 Graham Perrin freebsd_committer freebsd_triage 2022-10-15 08:07:38 UTC
sysrc -f /etc/rc.conf kld_list

– is there amdgpu alone, or are other modules listed? 


(In reply to George Mitchell from comment #9)

Given the brief analysis by avg@ (comment #5), I'm inclined to: 

* view the load of amdgpu as successful

* give thought to other modules, ones that are (or should be) subsequently loaded. 

Do you use IRC, Matrix (e.g. Element) or Discord?
Comment 11 Graham Perrin freebsd_committer freebsd_triage 2022-10-15 08:10:54 UTC
(In reply to George Mitchell from comment #8)

> …crashes from years back with soft updates plus journaling.  
> Are those problems known to be solved now? …

For what's described: without a bug number, it might be impossible for me to tell. 


> … I never ran 13.0; …

13.1 fixed a bug that involved soft updates _without_ soft update journaling: <https://www.freebsd.org/releases/13.1R/relnotes/#storage-ufs>

<https://docs.freebsd.org/en/books/handbook/config/#soft-updates> recommends soft updates. If there's no explicit recommendation to also enable soft update journaling, this could be because (bug 261944) there's not yet, in the Handbook, a suitable explanation of the feature. 

tunefs(8) <https://www.freebsd.org/cgi/man.cgi?query=tunefs&sektion=8&manpath=FreeBSD> for FreeBSD 13.1-RELEASE lacks a recently added explanation, you can gain this by switching the online view of the manual page to FreeBSD 14.0-CURRENT.
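
For reference, a minimal sketch of turning on soft updates plus journaling with tunefs(8); the device name is only a placeholder, and the file system should be unmounted (or mounted read-only) when it is retuned:

tunefs -n enable /dev/ada0p2    # soft updates
tunefs -j enable /dev/ada0p2    # soft update journaling
tunefs -p /dev/ada0p2           # confirm the new settings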
Comment 12 George Mitchell 2022-10-15 23:17:40 UTC
Without amdgpu added, kld_list is currently not defined at all.  Perhaps it's more helpful to show what gets loaded aside from amdgpu in the course of a normal boot:

kldstat
 1   64 0xffffffff80200000  1f300f0 kernel
 2    1 0xffffffff82132000     77e0 sem.ko
 3    3 0xffffffff8213a000    8cc90 vboxdrv.ko
 4    1 0xffffffff82600000   3df128 zfs.ko
 5    2 0xffffffff82518000     4240 vboxnetflt.ko
 6    2 0xffffffff8251d000     aac8 netgraph.ko
 7    1 0xffffffff82528000     31c8 ng_ether.ko
 8    1 0xffffffff8252c000     55e0 vboxnetadp.ko
 9    1 0xffffffff82532000     3378 acpi_wmi.ko
10    1 0xffffffff82536000     3218 intpm.ko
11    1 0xffffffff8253a000     2180 smbus.ko
12    1 0xffffffff8253d000     33c0 uslcom.ko
13    1 0xffffffff82541000     4d90 ucom.ko
14    1 0xffffffff82546000     2340 uhid.ko
15    1 0xffffffff82549000     3380 usbhid.ko
16    1 0xffffffff8254d000     31f8 hidbus.ko
17    1 0xffffffff82551000     3320 wmt.ko
18    1 0xffffffff82555000     4350 ums.ko
19    1 0xffffffff8255a000     5af8 autofs.ko
20    1 0xffffffff82560000     2a08 mac_ntpd.ko
21    1 0xffffffff82563000     20f0 green_saver.ko

The SU+J thing is totally anecdotal, based on what I used to see on freebsd-hackers.  Right now, I format my disks with UFS for root/var/tmp (no more than 8GB for fast fscking), and then a ZFS partition for /usr.

I don't use IRC, Matrix, or Element (not sure what those last two are) and on the rare occasions I use Discord, I use the web site.
Comment 13 George Mitchell 2022-11-06 18:15:41 UTC
As of today, with version drm-510-kmod-5.10.113_8:

1. I can reliably prevent a crash by booting to single user mode, manually kldloading amdgpu, and continuing (typing control-d).  dmesg then reports:

[drm] amdgpu kernel modesetting enabled.
drmn0: <drmn> on vgapci0
vgapci0: child drmn0 requested pci_enable_io
vgapci0: child drmn0 requested pci_enable_io
[drm] initializing kernel modesetting (RAVEN 0x1002:0x15DD 0x1458:0xD000 0xC8).
drmn0: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[drm] register mmio base: 0xFE600000
[drm] register mmio size: 524288
[drm] add ip block number 0 <soc15_common>
[drm] add ip block number 1 <gmc_v9_0>
[drm] add ip block number 2 <vega10_ih>
[drm] add ip block number 3 <psp>
[drm] add ip block number 4 <gfx_v9_0>
[drm] add ip block number 5 <sdma_v4_0>
[drm] add ip block number 6 <powerplay>
[drm] add ip block number 7 <dm>
[drm] add ip block number 8 <vcn_v1_0>
drmn0: successfully loaded firmware image 'amdgpu/raven_gpu_info.bin'
[drm] BIOS signature incorrect 44 f
drmn0: Fetched VBIOS from ROM BAR
amdgpu: ATOM BIOS: 113-RAVEN-111
drmn0: successfully loaded firmware image 'amdgpu/raven_sdma.bin'
[drm] VCN decode is enabled in VM mode
[drm] VCN encode is enabled in VM mode
[drm] JPEG decode is enabled in VM mode
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
drmn0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
drmn0: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
drmn0: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[drm] Detected VRAM RAM=2048M, BAR=2048M
[drm] RAM width 128bits DDR4
[TTM] Zone  kernel: Available graphics memory: 3100774 KiB
[TTM] Zone   dma32: Available graphics memory: 2097152 KiB
[TTM] Initializing pool allocator
[drm] amdgpu: 2048M of VRAM memory ready
[drm] amdgpu: 3072M of GTT memory ready.
[drm] GART: num cpu pages 262144, num gpu pages 262144
[drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
drmn0: successfully loaded firmware image 'amdgpu/raven_asd.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_ta.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_pfp.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_me.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_ce.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_rlc.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_mec.bin'
drmn0: successfully loaded firmware image 'amdgpu/raven_mec2.bin'
amdgpu: hwmgr_sw_init smu backed is smu10_smu
drmn0: successfully loaded firmware image 'amdgpu/raven_vcn.bin'
[drm] Found VCN firmware Version ENC: 1.12 DEC: 2 VEP: 0 Revision: 1
drmn0: Will use PSP to load VCN firmware
[drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
drmn0: RAS: optional ras ta ucode is not available
drmn0: RAP: optional rap ta ucode is not available
[drm] kiq ring mec 2 pipe 1 q 0
[drm] DM_PPLIB: values for F clock
[drm] DM_PPLIB:  400000 in kHz, 3649 in mV
[drm] DM_PPLIB:  933000 in kHz, 4074 in mV
[drm] DM_PPLIB:  1200000 in kHz, 4399 in mV
[drm] DM_PPLIB:  1333000 in kHz, 4399 in mV
[drm] DM_PPLIB: values for DCF clock
[drm] DM_PPLIB:  300000 in kHz, 3649 in mV
[drm] DM_PPLIB:  600000 in kHz, 4074 in mV
[drm] DM_PPLIB:  626000 in kHz, 4250 in mV
[drm] DM_PPLIB:  654000 in kHz, 4399 in mV
[drm] Display Core initialized with v3.2.104!
[drm] VCN decode and encode initialized successfully(under SPG Mode).
drmn0: SE 1, SH per SE 1, CU per SH 11, active_cu_number 8
[drm] fb mappable at 0x60BCA000
[drm] vram apper at 0x60000000
[drm] size 8294400
[drm] fb depth is 24
[drm]    pitch is 7680
VT: Replacing driver "vga" with new "fb".
start FB_INFO:
type=11 height=1080 width=1920 depth=32
pbase=0x60bca000 vbase=0xfffff80060bca000
name=drmn0 flags=0x0 stride=7680 bpp=32
end FB_INFO
drmn0: ring gfx uses VM inv eng 0 on hub 0
drmn0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
drmn0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
drmn0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
drmn0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
drmn0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
drmn0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
drmn0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
drmn0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
drmn0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
drmn0: ring sdma0 uses VM inv eng 0 on hub 1
drmn0: ring sdma0 uses VM inv eng 0 on hub 1
drmn0: ring vcn_dec uses VM inv eng 1 on hub 1
drmn0: ring vcn_enc0 uses VM inv eng 4 on hub 1
drmn0: ring vcn_enc1 uses VM inv eng 5 on hub 1
drmn0: ring jpeg_dec uses VM inv eng 6 on hub 1
vgapci0: child drmn0 requested pci_get_powerstate
sysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!
[drm] Initialized amdgpu 3.40.0 20150101 for drmn0 on minor 0
 
Is the sysctl_warn_reuse message anything to worry about?

2. Adding amdgpu to the kldlist in rc.conf still crashes more often than not, as previously reported.

3. Attempting to load amdgpu via /boot/loader.conf appears to load the module in memory but not actually make it functional.  (X uses VESA mode as if the module isn't there.)
Comment 14 George Mitchell 2022-11-11 23:56:38 UTC
Created attachment 238024 [details]
/var/crash/core.txt.3 crash description from today

Contrary to comment #13, today I got a crash despite booting to single user mode, typing "kldload amdgpu", and then control-d.  But it looks indistinguishable from the earlier /var/crash/core.txt.1 description.  Next I'll try booting to single user mode and kldloading zfs before kldloading amdgpu.
Comment 15 George Mitchell 2022-11-14 17:40:31 UTC
Created attachment 238075 [details]
Another crash

This time, I booted into single user mode and typed "kldload zfs amdgpu" with no problems.  Then when I typed ctrl-d I got this crash (which looks pretty much the same as all the other ones, except the places in the backtrace that used to refer to zfs now refer to vboxnetflt, which I load for VirtualBox).  So it seems likely that the crash has nothing to do with whichever specific kernel loadable module happens to be cited in the backtrace.
Comment 16 George Mitchell 2022-11-18 22:34:21 UTC
The following comment is based on zero actual knowledge of how kernel loadable modules work.  Still, based on what I'm seeing with this bug, I hypothesize that after one module is loaded, there is a mechanism by which the next module (and maybe other later ones) call back to modules already loaded in order to prevent incompatible modules (whatever that might mean) from trying to coexist.  And somewhere in that path in the amdgpu module, it is detected that some lock that was taken while amdgpu was loading was erroneously not released.  (Most of the time, the lock IS released, and I don't know exactly under what circumstances it isn't.)

I hope this is helpful.
Comment 17 George Mitchell 2022-12-05 18:27:02 UTC
I've discovered how to avoid this crash (at least the last 20-30 times I have booted up): boot into single user mode, type <ENTER> to run /bin/sh, type "kldload amdgpu," and then (key step!) wait at least five seconds before typing ctrl-D to exit single user mode.  Since I don't know why this helps, I guess it falls into the voodoo category, but maybe it's a clue.
Comment 18 George Mitchell 2022-12-07 16:58:21 UTC
I hate to say again how little I know about kernel module loading, but by any chance is there multithreading in the code that gets called when amdgpu.ko is first loaded?  I can't help thinking that perhaps that code is returning prematurely, before some initialization is completely finished and all locks released.  If I knew where to put it, I would throw in a five-second delay at the end of whatever gets called to load amdgpu.ko.
Comment 19 George Mitchell 2022-12-09 17:35:52 UTC
Created attachment 238668 [details]
Another crash summary; looks like all the earlier ones

Contrary to my comment #17, I got this same crash this morning, even waiting five seconds after loading amdgpu.ko before proceeding.  So the delay doesn't prevent the crash.
Comment 20 George Mitchell 2022-12-11 17:38:02 UTC
I've figured out why this crash is timing related, and also why ZFS is involved.

My system has a 1 TB USB disk, which contains a ZFS file system.  When I power my system on, it takes a variable amount of time for that disk to become ready and for ZFS to take note of it.  (I'm booting from a SATA disk with a traditional old UFS file system.)  So if the USB disk becomes ready while amdgpu is still initializing, apparently this crash happens.  I have no clue why that is true, but I am pretty sure this explains why the crash happens only part of the time and is timing dependent.

It remains true that the most reliable way to cause the crash is to include amdgpu in the kld_list in /etc/rc.conf and simply boot normally (and to have a *ZFS-formatted USB* disk attached to the system).
Comment 21 George Mitchell 2022-12-12 14:34:05 UTC
I've updated to version drm-510-kmod-5.10.113_8 and it hasn't crashed yet, but I've only had time for one test so far.
Comment 22 Graham Perrin freebsd_committer freebsd_triage 2022-12-13 21:06:07 UTC
(In reply to George Mitchell from comment #21)

If a crash _does_ occur/recur, then maybe test for reproducibility with this in your /boot/loader.conf

kern.smp.disabled=1

<https://www.freebsd.org/cgi/man.cgi?query=smp&sektion=4&manpath=FreeBSD>

(Be prepared for significantly reduced performance after restarting with SMP disabled.)

This is a gut feeling, more than anything (apologies for the noise), partly based on experiences with virtual hardware …
Comment 23 George Mitchell 2022-12-13 22:45:23 UTC
Thanks!  So far, I've booted four times with amdgpu in my kld_list, which previously would likely have yielded at least one crash, with no crash.  So I have my fingers crossed, but I'll try your hack if it crashes again (and your theory certainly sounds plausible).
Comment 24 George Mitchell 2022-12-14 22:37:31 UTC
Created attachment 238802 [details]
New core.txt

The latest version definitely crashes less often, but I just now got a new crash that (to me) looks different than the earlier one.  I was just about ready to mark this fixed!
Comment 25 George Mitchell 2022-12-15 18:11:35 UTC
After further consideration (and a partly sleepless night), I've decided that the latest crash is not an instance of this bug and possibly isn't related to amdgpu.ko at all.  So I'm going to close this bug and maybe open a new one when I understand the new one better.

Anyone looking at this bug in the future should pay no attention to "New core.txt" attachment, but should refer to the obsolete attachments.
Comment 26 George Mitchell 2022-12-16 22:40:09 UTC
Created attachment 238849 [details]
A new crash

I regret to say I'm going to have to reopen this bug.  But I will try the proposed workaround and see if it helps (at least until the single-core performance drives me up the wall).
Comment 27 George Mitchell 2022-12-16 22:42:46 UTC
Created attachment 238850 [details]
A new instance of the same crash

I regret to say the crash has happened again.  I'll try the proposed workaround and see if it helps (at least until the reduced performance drives me up the wall).
Comment 28 George Mitchell 2022-12-16 22:47:49 UTC
Reopening bug.
Comment 29 Graham Perrin freebsd_committer freebsd_triage 2022-12-17 10:02:15 UTC
(In reply to George Mitchell from comment #26)

> FreeBSD court 13.1-RELEASE-p2 FreeBSD 13.1-RELEASE-p2 752f813d6 M5P  amd64

Please update the OS. 


----

Given comment #5 from avg@, and (for example) the different type of kernel panic in comment #24: 

fs@ x11@ please: if panics recur with an updated OS, would you recommend continuing with this report (267028)? Or start afresh, with a new report for the more recent type of panic?
Comment 30 Emmanuel Vadot freebsd_committer freebsd_triage 2022-12-17 10:21:09 UTC
This is not drm related; the drm messages are noise that we should fix one day when we switch ttys.
Comment 31 George Mitchell 2022-12-17 15:12:17 UTC
(In reply to Graham Perrin from comment #29)
I'm on the release branch, not the stable branch.  So you are suggesting I update from 13.1-RELEASE-p2 to 13.1-RELEASE-p5?  And then recompile the kernel module as well, I assume?
Comment 32 George Mitchell 2022-12-17 17:45:23 UTC
For what it's worth, I'm doing this testing on a desktop machine, so setting kern.smp.disabled=1 actually doesn't impact operation too much -- except for Thunderbird.  And so far I haven't seen the crash with that setting.
Comment 33 Tomasz "CeDeROM" CEDRO 2022-12-17 17:51:30 UTC
Does switching to graphics/drm-510-kmod and updating graphics/gpu-firmware-amd-kmod help?
Comment 34 George Mitchell 2022-12-17 18:03:21 UTC
In fact, switching to graphics/drm-510-kmod from the generic VESA driver is what originally triggered this bug.  Without using amdgpu.ko there is no problem.
Comment 35 Emmanuel Vadot freebsd_committer freebsd_triage 2022-12-17 18:33:59 UTC
All your reports show that it's from zfs; again, the drm messages are noise.
Comment 36 George Mitchell 2022-12-18 02:16:34 UTC
Created attachment 238886 [details]
Crash after updating kernel/world to 13.1-RELEASE-p5

This is after updating my kernel and world to 13.1-RELEASE-p5.  I grant you the backtrace here sure points to the openzfs code, but why does the crash happen only with graphics/drm-510-kmod installed and amdgpu.ko loaded, but not otherwise?  For the time being, I will be running WITHOUT amdgpu.ko in my kld_list, and I am confident this crash will not occur.

I got maybe six boots with kern.smp.disabled=1 with no crashes on 13.1-RELEASE-p2.  But based on an earlier comment I updated to 13.1-RELEASE-p5.  Then after going back to kern.smp.disabled=0 I got another instance of the crash.

I did observe that something in sys/contrib/openzfs/module/zfs got updated between p2 and p5, but it doesn't seem to have fixed this crash.

Compiling graphics/drm-510-kmod under p5 yielded an amdgpu.ko that was identical to amdgpu.ko compiled under p2.
Comment 37 George Mitchell 2022-12-29 00:12:03 UTC
I'm still having this problem, though I can reduce its frequency by booting in single-user mode, kldloading amdgpu, waiting five or ten seconds, and then going to multi-user mode with control-D.  I've updated the title to emphasize that the bug happens only when amdgpu.ko (from graphics/drm-510-kmod version 5.10.113_8) and ZFS are both in use.  Also, it happens during booting, or else never.
Comment 38 George Mitchell 2022-12-29 00:22:31 UTC
I don't want the title to become too wordy, but I'll note again that my 1TB USB disk (GPT formatted with one ZFS partition only), which takes a measurable, variable amount of time to become ready, may be the main reason this crash doesn't always happen.
Comment 39 Graham Perrin freebsd_committer freebsd_triage 2023-01-06 19:19:13 UTC
(In reply to George Mitchell from comment #37)

grep -e solaris -e zfs /boot/loader.conf

grep zfs /etc/rc.conf

What's reported?
Comment 40 George Mitchell 2023-01-06 19:53:18 UTC
(In reply to Graham Perrin from comment #39)

> grep -e solaris -e zfs /boot/loader.conf

> grep zfs /etc/rc.conf
zfs_enable="YES"		# Set to YES to automatically mount ZFS file systems
Comment 41 Graham Perrin freebsd_committer freebsd_triage 2023-01-07 00:33:26 UTC
(In reply to George Mitchell from comment #40)

(In reply to George Mitchell from comment #20)

> … timing related, …

Please add to /boot/loader.conf


zfs_load="YES"
Comment 42 George Mitchell 2023-01-07 18:08:53 UTC
Created attachment 239336 [details]
Crash dump

Well, this helps a bit.  By adding that line to /boot/loader.conf and restoring kld_list="amdgpu" to my /etc/rc.conf, I was able to reboot without the crash four times in a row, whereas before it would crash about every other time.  But it crashed on the fifth time. (See attached core.txt.0.)
Comment 43 George Mitchell 2023-01-07 18:13:56 UTC
In the new core.txt.0, there are about 19 lines of text from the previous shutdown near the beginning of the file.  But the substance of the backtrace looks identical to all the previous ones.  So loading ZFS early mitigates the problem but does not fix it.
Comment 44 Andriy Gapon freebsd_committer freebsd_triage 2023-01-07 23:36:50 UTC
I think that in these frames we clearly see a bogus pointer / address:
#7  <signal handler called>
#8  vtozoneslab (va=18446735277616529408, zone=<optimized out>, 
    slab=<optimized out>) at /usr/src/sys/vm/uma_int.h:635
#9  free (addr=0xfffff80000000007, mtp=0xffffffff824332b0 <M_SOLARIS>)
    at /usr/src/sys/kern/kern_malloc.c:911
#10 0xffffffff8214d251 in nv_mem_free (nvp=<optimized out>, 
    buf=0xfffff80000000007, size=16688648)
    at /usr/src/sys/contrib/openzfs/module/nvpair/nvpair.c:216

I'd recommend poking around frames 11-13 to see from where that address comes.

Also, I don't get an impression that the latest crash is similar to earlier ones.
kern_reboot / zfs__fini vs dbuf_evict_thread.
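
A sketch of how frames 11-13 could be poked at from the saved dump with kgdb; the kernel and vmcore paths are the usual defaults and may differ here.  Frame 11 is the caller of nv_mem_free in the backtrace above, and "up" steps outward to frames 12 and 13:

kgdb /boot/kernel/kernel /var/crash/vmcore.0
(kgdb) frame 11
(kgdb) info args
(kgdb) info locals
(kgdb) up
(kgdb) info locals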
Comment 45 George Mitchell 2023-01-15 00:28:56 UTC
It appears I could mitigate this problem if I could load amdgpu.ko from /boot/loader.conf, which currently doesn't work.  See bug #268962.  Alternatively, at present I can completely avoid this crash by:

1. having zfs_load="YES" in /boot/loader.conf.
2. booting into single user mode.
3. typing kldload amdgpu.
4. typing control-D.
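
The same workaround written out, for reference (a sketch; "exit" at the end is equivalent to typing control-D):

# /boot/loader.conf:
zfs_load="YES"
# then boot to single user mode and, at the shell:
kldload amdgpu
exit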
Comment 46 George Mitchell 2023-01-20 17:43:31 UTC
(In reply to George Mitchell from comment #45)
Correction to comment #45: I can avoid the problem around 95% of the time with the specified steps, but not 100%.
Comment 47 George Mitchell 2023-01-28 00:13:57 UTC
Created attachment 239752 [details]
Latest crash dump

The last couple of crashes strongly resemble all the earlier ones, but they are now less frequent with zfs.ko being loaded at /boot/loader.conf time and amdgpu.ko loaded while booted into single user mode.  The difference (see core.txt.2 from today's date) is that the backtrace line where modlist_lookup2 is called is now looking up vboxnetflt instead of zfs.  My rcorder list shows:

/etc/rc.d/dumpon
/etc/rc.d/sysctl
/etc/rc.d/natd
/etc/rc.d/dhclient
/etc/rc.d/hostid
/etc/rc.d/ddb
/etc/rc.d/ccd
/etc/rc.d/gbde
/etc/rc.d/geli
/etc/rc.d/zpool
/etc/rc.d/swap
/etc/rc.d/zfskeys
/etc/rc.d/fsck
/etc/rc.d/zvol
/etc/rc.d/growfs
/etc/rc.d/root
/etc/rc.d/sppp
/etc/rc.d/mdconfig
/etc/rc.d/hostid_save
/etc/rc.d/serial
/etc/rc.d/mountcritlocal
/etc/rc.d/zfsbe
/etc/rc.d/tmp
/etc/rc.d/zfs
/etc/rc.d/var
/etc/rc.d/cfumass
/etc/rc.d/cleanvar
/etc/rc.d/FILESYSTEMS
/etc/rc.d/geli2
/etc/rc.d/ldconfig
/etc/rc.d/kldxref
/etc/rc.d/adjkerntz
/etc/rc.d/hostname
/etc/rc.d/ip6addrctl
/etc/rc.d/ippool
/etc/rc.d/netoptions
/etc/rc.d/opensm
/etc/rc.d/random
/etc/rc.d/iovctl
/etc/rc.d/rctl
/usr/local/etc/rc.d/vboxnet
/etc/rc.d/ugidfw
/etc/rc.d/autounmountd
/etc/rc.d/mixer
/etc/rc.d/ipsec
/usr/local/etc/rc.d/uuidd
/etc/rc.d/kld
/etc/rc.d/ipfilter
/etc/rc.d/devmatch
/etc/rc.d/addswap
/etc/rc.d/ipnat
/etc/rc.d/ipmon
/etc/rc.d/ipfs
/etc/rc.d/netif
/etc/rc.d/ppp
/etc/rc.d/pfsync
/etc/rc.d/pflog
/etc/rc.d/rtsold
/etc/rc.d/static_ndp
/etc/rc.d/static_arp
/etc/rc.d/devd
/etc/rc.d/resolv
/etc/rc.d/stf
/etc/rc.d/ipfw
/etc/rc.d/routing
/etc/rc.d/bridge
/etc/rc.d/zfsd
/etc/rc.d/defaultroute
/etc/rc.d/routed
/etc/rc.d/pf
/etc/rc.d/route6d
/etc/rc.d/ipfw_netflow
/etc/rc.d/blacklistd
/etc/rc.d/netwait
/etc/rc.d/local_unbound
/etc/rc.d/NETWORKING
/etc/rc.d/kdc
/etc/rc.d/tlsservd
/etc/rc.d/iscsid
/etc/rc.d/pppoed
/etc/rc.d/ctld
/etc/rc.d/nfsuserd
/etc/rc.d/tlsclntd
/etc/rc.d/kfd
/usr/local/etc/rc.d/sndiod
/etc/rc.d/gssd
/etc/rc.d/nfscbd
/etc/rc.d/ipropd_master
/etc/rc.d/ipropd_slave
/etc/rc.d/kadmind
/etc/rc.d/kpasswdd
/etc/rc.d/iscsictl
/etc/rc.d/mountcritremote
/etc/rc.d/archdep
/etc/rc.d/dmesg
/etc/rc.d/wpa_supplicant
/etc/rc.d/hostapd
/etc/rc.d/accounting
/etc/rc.d/mdconfig2
/etc/rc.d/devfs
/etc/rc.d/gptboot
/etc/rc.d/virecover
/etc/rc.d/os-release
/etc/rc.d/motd
/etc/rc.d/cleartmp
/etc/rc.d/newsyslog
/etc/rc.d/syslogd
/etc/rc.d/linux
/etc/rc.d/sysvipc
/etc/rc.d/hastd
/etc/rc.d/localpkg
/etc/rc.d/auditd
/etc/rc.d/bsnmpd
/etc/rc.d/ntpdate
/etc/rc.d/watchdogd
/etc/rc.d/savecore
/etc/rc.d/pwcheck
/etc/rc.d/power_profile
/etc/rc.d/auditdistd
/etc/rc.d/SERVERS
/etc/rc.d/rpcbind
/etc/rc.d/nisdomain
/etc/rc.d/nfsclient
/etc/rc.d/ypserv
/etc/rc.d/ypupdated
/etc/rc.d/ypxfrd
/etc/rc.d/ypbind
/etc/rc.d/ypldap
/etc/rc.d/ypset
/etc/rc.d/keyserv
/etc/rc.d/automountd
/etc/rc.d/yppasswdd
/etc/rc.d/quota
/etc/rc.d/automount
/etc/rc.d/mountd
/etc/rc.d/nfsd
/etc/rc.d/statd
/etc/rc.d/lockd
/etc/rc.d/DAEMON
/etc/rc.d/rwho
/etc/rc.d/utx
/etc/rc.d/bootparams
/etc/rc.d/hcsecd
/etc/rc.d/ftp-proxy
/etc/rc.d/local
/usr/local/etc/rc.d/git_daemon
/etc/rc.d/lpd
/usr/local/etc/rc.d/dbus
/etc/rc.d/mountlate
/etc/rc.d/nscd
/etc/rc.d/ntpd
/etc/rc.d/powerd
/usr/local/etc/rc.d/slurmd
/usr/local/etc/rc.d/slurmctld
/etc/rc.d/ubthidhci
/etc/rc.d/rarpd
/etc/rc.d/sdpd
/etc/rc.d/apm
/etc/rc.d/rtadvd
/etc/rc.d/moused
/etc/rc.d/rfcomm_pppd_server
/usr/local/etc/rc.d/avahi-daemon
/etc/rc.d/swaplate
/etc/rc.d/bthidd
/etc/rc.d/bluetooth
/usr/local/etc/rc.d/avahi-dnsconfd
/etc/rc.d/LOGIN
/etc/rc.d/sshd
/usr/local/etc/rc.d/vboxheadless
/etc/rc.d/syscons
/etc/rc.d/sysctl_lastload
/usr/local/etc/rc.d/xdm
/usr/local/etc/rc.d/vboxwatchdog
/etc/rc.d/inetd
/usr/local/etc/rc.d/dnetc
/usr/local/etc/rc.d/munged
/etc/rc.d/sendmail
/etc/rc.d/ftpd
/usr/local/etc/rc.d/rsyncd
/usr/local/etc/rc.d/saned
/etc/rc.d/cron
/etc/rc.d/msgs
/etc/rc.d/othermta
/etc/rc.d/jail
/etc/rc.d/bgfsck
/usr/local/etc/rc.d/smartd
/etc/rc.d/securelevel

The vboxnetflt.ko module is loaded by /usr/local/etc/rc.d/vboxnet.
Comment 48 George Mitchell 2023-01-28 00:29:36 UTC
And the list of kernel modules loaded by a non-crashing boot is:

kernel
sem.ko
zfs.ko
if_re.ko
vboxdrv.ko
amdgpu.ko
drm.ko
linuxkpi_gplv2.ko
dmabuf.ko
ttm.ko
amdgpu_raven_sdma_bin.ko
amdgpu_raven_asd_bin.ko
amdgpu_raven_ta_bin.ko
amdgpu_raven_pfp_bin.ko
amdgpu_raven_me_bin.ko
amdgpu_raven_ce_bin.ko
amdgpu_raven_rlc_bin.ko
amdgpu_raven_mec_bin.ko
amdgpu_raven_mec2_bin.ko
amdgpu_raven_vcn_bin.ko
vboxnetflt.ko
(and a whole bunch more)

In other words, when the crash happens, it always involves a call to modlist_lookup2 from whatever kernel module gets loaded following amdgpu.
Comment 49 George Mitchell 2023-02-07 15:05:42 UTC
*** Bug 268416 has been marked as a duplicate of this bug. ***
Comment 50 George Mitchell 2023-02-07 15:10:25 UTC
Created attachment 239967 [details]
Crash after loading vboxnetflt early by hand

Since the previous crash included a reference to vboxnetflt.ko, I experimented a few times with amdgpu.ko added to my kld_list in /etc/rc.conf, and loading vboxnetflt by hand after booting to single user mode.

I think it's pretty clear at this point that there is no problem in ZFS code.  It's a lock mismanagement problem of some sort in amdgpu.ko (from graphics/drm-510-kmod).  If I have permission to change the assignee of this bug, I will.
Comment 51 George Mitchell 2023-02-07 15:12:55 UTC
I think this needs to be assigned to x11@freebsd.org, but I don't seem to have the permission to do it.
Comment 52 George Mitchell 2023-02-07 15:15:48 UTC
I think this needs to be assigned to x11@freebsd.org, but I don't seem to have the permission to do it.
Comment 53 George Mitchell 2023-02-07 15:23:23 UTC
(In reply to Graham Perrin from comment #10)

It does appear that amdgpu.ko always loads successfully.  But then the loading of some other module subsequently (which might be zfs.ko or vboxnetflt.ko or maybe something else) somehow causes an unexpected call back into the amdgpu code.  I have no idea how.

The current situation:
1. zfs.ko is loaded from /boot/loader.conf.
2. I always boot into single user mode.
3. The last few times, I had kld_list="amdgpu.ko" in my /etc/rc.conf, but for now I'm taking it back out.
4. So I'm loading amdgpu.ko manually in single user mode and then waiting ten seconds or so before going multiuser.  It's voodoo but it usually avoids the crash.
Comment 54 Emmanuel Vadot freebsd_committer freebsd_triage 2023-02-07 15:23:42 UTC
(In reply to George Mitchell from comment #52)

No it's not; I've told you already that what's printed by drm is not the panic, it's noise from when we switch ttys during a panic.
All your crash logs talk about zfs dbufs; this isn't amdgpu.
Comment 55 George Mitchell 2023-02-07 15:27:40 UTC
(In reply to Emmanuel Vadot from comment #54)

If I boot up without loading amdgpu.ko at all, then I NEVER get the crash.  Confirmed many many times.
Comment 56 Andriy Gapon freebsd_committer freebsd_triage 2023-02-07 15:31:15 UTC
(In reply to Emmanuel Vadot from comment #54)
I think that George's point was not about anything that gets printed, but what happens depending on whether amdgpu gets loaded (and when) or not.

It's not unimaginable that an exotic bug in one module (or in the module loading code or the code for resolving symbols) results in a memory corruption and a crash elsewhere.

A very wild guess, but I'd check if there are any duplicate symbols between amdgpu and zfs.ko... and even kernel itself.
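
One way that duplicate-symbol check could be done from a booted system (a sketch; the module paths are the usual ones for a ports-installed amdgpu.ko and the stock zfs.ko and kernel, and may differ):

nm /boot/modules/amdgpu.ko | awk 'NF == 3 {print $3}' | sort -u > /tmp/amdgpu.syms
nm /boot/kernel/zfs.ko     | awk 'NF == 3 {print $3}' | sort -u > /tmp/zfs.syms
comm -12 /tmp/amdgpu.syms /tmp/zfs.syms          # names defined in both modules
nm /boot/kernel/kernel | awk 'NF == 3 {print $3}' | sort -u | comm -12 /tmp/amdgpu.syms -    # and shared with the kernel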
Comment 57 Emmanuel Vadot freebsd_committer freebsd_triage 2023-02-07 15:33:20 UTC
(In reply to Andriy Gapon from comment #56)

But then anyone else using zfs+amdgpu would have the same problem, and that's not the case (I use both on multiple machines running either 13.1, stable/13 or CURRENT).
Comment 58 George Mitchell 2023-02-07 17:41:17 UTC
If it is ZFS, then the only exotic factor on my system is an external USB one-terabyte drive (WDC WD10EZEX-08WN4A0), formatted with GPT and one ZFS partition, that seems to take a variable amount of time to come on line at power up.  I theorized at one point that tasting that drive at an unpredictable time was a factor in the crash.  Your mileage may vary.
Comment 59 Mark Millard 2023-02-07 17:46:27 UTC
(In reply to Emmanuel Vadot from comment #54)

QUOTE
All you crash logs talk about zfs dbufs
END QUOTE

Not true: "Crash dump" and "Latest crash dump" have no
examples of "dbuf" in the submitted text.

Also: The backtrace in "Latest crash dump" makes no mention
of "zfs" at all. (It does occur in other text.)
Comment 60 Mark Millard 2023-02-07 17:55:01 UTC
(In reply to George Mitchell from comment #58)

Could a test be formed on your hardware, loading ZFS
but having no actual import of any pool, possibly
not even a pool to find (empty "zpool import")?

As it stands, your context is hard for anyone else to
make an analogous context for testing. Finding a
failure in a simpler-to-replicate context could
help with avoiding your having the only known
failure context.

So any other variations that are simpler contexts for
others to replicate and test would be a good thing.

But, also, if such effort ends up unable to replicate
the problem in your environment, that might be useful
information as well.
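
A sketch of one such simpler variation, along the lines suggested above (assumes the modules are loaded by hand with no pool configured or attached):

kldload amdgpu      # as in the normal failing sequence
kldload zfs         # the module only; no pool gets imported
zpool import        # with nothing attached this should find no pools to import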
Comment 61 George Mitchell 2023-02-07 19:06:42 UTC
(In reply to Mark Millard from comment #60)
In addition to my external USB ZFS drive, also my /usr file system is a ZFS slice.  My main hard drive has a very small UFS root (and /var and /tmp) slice, because I have a superstitious fear of ZFS on root.  The /usr slice (the rest of the drive) is big enough to take an annoying amount of time to fsck, so when I first added this drive to my system (which was also when I updated from 12 to 13), I chose ZFS for /usr to minimize that time.  For a while, I suppose I could copy my /usr slice onto the /usr slice from my old internal drive and mount that in place of the current /usr slice for some tests, and I could do without the external drive.  I'll have to think about this.
Comment 62 Mark Millard 2023-02-07 19:25:38 UTC
(In reply to George Mitchell from comment #61)

If you can boot an external USB3 drive or some such,
may be a minimal separate drive: UFS 13.1-RELEASE with
enough added to also have amdgpu.ko . With such a
context, do you still manage to see boot failures?

Progressing from the simplest independent context
towards an independent one more like your normal 
context might be easier --and might avoid needing to
change your normal context as much.

Just a test context, not a normal use one. Fewer
constraints on the configuration that way.

Food for thought.
Comment 63 Tomasz "CeDeROM" CEDRO 2023-02-07 21:27:38 UTC
I had the same problem on 13.1-STABLE: loading the vbox module caused an immediate kernel panic, so I rolled back to 13.1-STABLE because of this.

From the bare loader, once the kernel was loaded, loading the vbox drivers was okay.  When the vbox drivers were part of /boot/loader.conf or /etc/rc.conf it caused an immediate kernel panic (no dump).  virtualbox-ose-kmod was recompiled from ports on a newly installed kernel and system.  Not sure if this is amdgpu or zfs related though..?
Comment 64 Tomasz "CeDeROM" CEDRO 2023-02-07 21:28:49 UTC
*rolled back to 13.1-RELEASE sorry :-) All works fine here. Might be vbox + amdgpu api desync?
Comment 65 Mark Millard 2023-02-08 03:53:41 UTC
(In reply to George Mitchell from comment #36)

Have you ever gotten a crash with kern.smp.disabled=1 ?
If not, how many tests did you try?
Comment 66 Mark Millard 2023-02-08 04:01:22 UTC
(In reply to George Mitchell from comment #53)

A test might be to load something simple or unusual for
your context after amdgpu.ko and seeing if it still crashes.
I'm not sure it is a good example, but does, say, loading
amdgpu.ko and then filemon.ko also lead to a crash (not
loading more after that)?
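
Written out, that minimal test might look like this from single-user mode (a sketch; filemon is just a small module that is otherwise unused here):

kldload amdgpu
kldload filemon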
Comment 67 Andriy Gapon freebsd_committer freebsd_triage 2023-02-08 05:52:00 UTC
(In reply to Emmanuel Vadot from comment #57)
Only "good", easy bugs are like that. That's why I said that this one must be exotic. But there must be something specific about George's environment too. Maybe configuration, maybe build, maybe specific hardware, maybe even a hardware glitch.
E.g., maybe if the graphics is active the RAM is more likely to randomly flip a bit.
Comment 68 George Mitchell 2023-02-08 14:58:54 UTC
(In reply to Mark Millard from comment #65)
Yes, I got the crash.  See comment #26.
Comment 69 George Mitchell 2023-02-08 15:00:46 UTC
I have a spare disk I can use for a test without ZFS.  It's currently at 12.0-RELEASE so it will take me a while to update it to 13.  Possibly I won't have a chance today, but I will try it.
Comment 70 Mark Millard 2023-02-08 17:30:16 UTC
(In reply to George Mitchell from comment #68)

#26 and #27 indicate that you would try the workaround kern.smp.disabled=1,
not the result of trying it:

#26:
I will try the proposed workaround and see if it helps (at least until the single-core performance drives me up the wall)

#27:
I'll try the proposed workaround and see if it helps (at least until the reduced performance drives me up the wall).

That is part of why I asked.

Did the failure result with kern.smp.disabled=1 seem the same/similar to the
other failures --or was it distinct in some way?
Comment 71 Mark Millard 2023-02-08 17:49:28 UTC
(In reply to George Mitchell from comment #24)

It looks to me like the backtrace in "Latest crash dump":

KDB: stack backtrace:
#0 0xffffffff80c66ec5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbcf at vpanic+0x17f
#2 0xffffffff80c1ba43 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff81084fb8 at calltrap+0x8
#5 0xffffffff80be8c3d at linker_load_module+0x17d
#6 0xffffffff80beb17a at kern_kldload+0x16a
#7 0xffffffff80beb29b at sys_kldload+0x5b
#8 0xffffffff810ae6ec at amd64_syscall+0x10c
#9 0xffffffff810858cb at fast_syscall_common+0xf8

basically matches the 4 attachments that have been set to be Obsolete.

Should the Obsolete status be undone on the 4? Vs.: Should "Latest
crash dump" be made to also be Obsolete?

I'm guessing that none of the attachments should be obsolete at this
point.
Comment 72 Mark Millard 2023-02-08 17:55:11 UTC
(In reply to Mark Millard from comment #70)

There was also #36 with:

QUOTE
I got maybe six boots with kern.smp.disabled=1 with no crashes on 13.1-RELEASE-p2.  But based on an earlier comment I updated to 13.1-RELEASE-p5.  Then after going back to kern.smp.disabled=0 I got another of the crash.
END QUOTE

It only reported not getting a crash for kern.smp.disabled=1 .
Comment 73 George Mitchell 2023-02-08 17:58:00 UTC
(In reply to Mark Millard from comment #70)
I should have referred you to comment #27, not #26.  But I definitely got the crash with smp.disabled=1.
(In reply to Mark Millard from comment #71)
I could make a case for obsoleting all but two of them, but possibly I would be throwing away useful information.  To my unpracticed eye, though, the ones I DID obsolete were pretty redundant with the ones I kept.  They all look pretty similar to me.
Comment 74 Mark Millard 2023-02-08 18:12:44 UTC
(In reply to George Mitchell from comment #73)

But they are all the examples where the backtraces have nothing
from zfs or dbuf. Having 5 of 11 reports that way looks rather
different from 1 out of 7.

I'd say that the frequency is notable.
Comment 75 Mark Millard 2023-02-08 18:18:45 UTC
(In reply to George Mitchell from comment #73)

#27:
QUOTE
I'll try the proposed workaround and see if it helps (at least until the reduced performance drives me up the wall).
END QUOTE

It still says that you will try in the future, not explicitly that you
had a failure with kern.smp.disabled=1 .

#36 reports not having failures with kern.smp.disabled=1 .

I did not find any wording I could interpret as reporting a failure with
kern.smp.disabled=1 (prior to #73).

Do you remember noticing anything distinct? (Probably not, or you would
have commented in #73. But just to be sure . . .)
Comment 76 George Mitchell 2023-02-08 18:29:47 UTC
(In reply to Mark Millard from comment #75)
It's close to two months ago, so my memory may be misleading me, since my age is beginning to resemble the number of this comment.  But I'm pretty sure smp.disabled=1 did not prevent the bug.  I could be wrong.
Comment 77 George Mitchell 2023-02-14 14:55:36 UTC
I have been remiss in testing this without ZFS, because I will have to shuffle a couple of disks around.  I apologize for the delay.  I hope to be able to try this test later this week.
Comment 78 George Mitchell 2023-02-24 22:45:08 UTC
Although I have not yet managed to test this without ZFS, I have established that with zfs_load="YES" but without "vboxnet_enable="YES"" in /etc/rc.conf (zfs.ko and vboxnetflt.ko seeming to be the two modules with which amdgpu.ko has, um, personality conflicts), I can now boot up without crashing (so far).  Does anyone have any idea what zfs.ko and vboxnetflt.ko do that other modules don't do?
Comment 79 George Mitchell 2023-02-24 22:46:20 UTC
I omitted an important phrase.  It should have said, "with zfs_load="YES" in /boot/loader.conf ..."
Comment 80 George Mitchell 2023-02-26 17:35:29 UTC
Created attachment 240427 [details]
New version of the crash, from acpi_wmi

Here's another module that doesn't get along well with amdgpu.ko on my system: acpi_wmi.ko.  Other than that this crash looks identical to all the earlier ones, as far as I can tell.

It took about a dozen boot-up tries after I put zfs_load="YES" into /boot/loader.conf (so that ZFS gets loaded early to minimize its interaction with amdgpu.ko) and vboxnet_enable="NO" in /etc/rc.conf (so that vboxnetflt.ko doesn't get its chance to cause trouble either) before I got this new crash.

I'll mention again that this crash always happens within a minute of booting up, or else never.  Anyone have any ideas about what acpi_wmi.ko has in common with zfs.ko and vboxnetflt.ko?
Comment 81 Mark Millard 2023-02-26 20:15:09 UTC
(In reply to George Mitchell from comment #80)

There are multiple, distinct backtraces in your various examples.
This one matches the 4 still-listed-as Obsolete ones and the
"Latest crash dump" one, but not the others (if I remember right).

So it is another example where there is no mention of
dbuf or of zfs in the backtrace's text, unlike some other
backtraces.

So far as I can tell, there still has been no evidence gathering
seeing if the problem can happen absent zfs being loaded or zfs
loaded but no pools ever imported.

If I gather correctly, we now do have evidence that the specific
type of backtrace can happen without vboxnetflt.ko ever having
been loaded, proving it is not necessary for that kind of failure.
That is a form of progress as far as evidence goes. It also
suggests that merely being listed in a backtrace does not mean
that fact necessarily tells one much about the basic problem.

There is some possibility here that there is more than one basic
problem and some of the backtrace variability is associated with
that.
Comment 82 Mark Millard 2023-02-26 20:56:08 UTC
Using the gdb-based backtrace information:

#8  0xffffffff80be8c5d in modlist_lookup (name=0xfffff80006217400 "acpi_wmi", 
    ver=0) at /usr/src/sys/kern/kern_linker.c:1487

is for the strcmp code line in:

static modlist_t
modlist_lookup(const char *name, int ver)
{
        modlist_t mod;

        TAILQ_FOREACH(mod, &found_modules, link) {
                if (strcmp(mod->name, name) == 0 &&
                    (ver == 0 || mod->version == ver))
                        return (mod);
        }
        return (NULL);
}

We also see that strcmp was called via:

#6  <signal handler called>
#7  strcmp (s1=<optimized out>, s2=<optimized out>)
    at /usr/src/sys/libkern/strcmp.c:46

We also see that name was accessible, as shown in the "#8" line above.
We see from #7 that strcmp was considered called, suggesting that
fetching mod->name itself did not fail. The implication would be that
the value in mod->name was a bad pointer when strcmp tried
to use it.

Nothing says that mod->name was or should have been for acpi_wmi
at all. The "acpi_wmi" side of the comparison need not be
relevant information. Other backtraces that look similar may
well have a similar status for the name in the right-hand argument
to the strcmp.

This might be a useful hint to someone with appropriate background
or suggest some way of detecting the bad value in mod->name earlier
when that earlier context might be of more use for investigations.
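
For someone with the dump open in kgdb, the list that modlist_lookup walks could also be inspected directly to look for an entry with a bad name pointer; a sketch (found_modules and the field names are as used in kern_linker.c and queue(3)):

(kgdb) print found_modules
(kgdb) print *found_modules.tqh_first
(kgdb) print found_modules.tqh_first->name
(kgdb) print found_modules.tqh_first->link.tqe_next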
Comment 83 George Mitchell 2023-02-26 23:21:09 UTC
I have set up a disk with FreeBSD 13.1-RELEASE-p7 and drm-510-kmod 5.10.113_8 WITHOUT ZFS and vbox-anything.  I don't know how to avoid loading acpi_wmi.ko.  So far it hasn't crashed, but I will try a whole bunch of reboots tomorrow with that disk.
Comment 84 Mark Millard 2023-02-26 23:59:06 UTC
(In reply to George Mitchell from comment #83)

I found the following text on https://cateee.net/lkddb/web-lkddb/ACPI_WMI.html :

QUOTE
ACPI-WMI is a proprietary extension to ACPI to expose parts of the ACPI firmware to userspace - this is done through various vendor defined methods and data blocks in a PNP0C14 device, which are then made available for userspace to call.

The implementation of this in Linux currently only exposes this to other kernel space drivers.

This driver is a required dependency to build the firmware specific drivers needed on many machines, including Acer and HP laptops.
END QUOTE

So, I expect that if acpi_wmi.ko is being loaded by FreeBSD, it may
well be a requirement for that machine to boot and/or operate via
ACPI. But I'm not familiar with the details.
Comment 85 George Mitchell 2023-02-27 17:57:55 UTC
I have a new crash, but I did not get a dump because of an issue I will explain below.

For those who came in late, here's a summary of my system.  dmesg says I have:
CPU: AMD Ryzen 3 2200G with Radeon Vega Graphics     (3493.71-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x810f10  Family=0x17  Model=0x11  Stepping=0
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x7ed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0x35c233ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,SKINIT,WDT,TCE,Topology,PCXC,PNXC,DBE,PL2I,MWAITX>
  Structured Extended Features=0x209c01a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  AMD Extended Feature Extensions ID EBX=0x1007<CLZERO,IRPerf,XSaveErPtr,IBPB>
  SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=32768
  TSC: P-state invariant, performance statistics

My motherboard is a Gigabyte B450M D53H.
BIOS is American Megatrends version F4, dated 1/25/2019.

pciconf -lv says:
vgapci0@pci0:6:0:0:     class=0x030000 rev=0xc8 hdr=0x00 vendor=0x1002 device=0x15dd subvendor=0x1458 subdevice=0xd000
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]'
    class      = display
    subclass   = VGA

Until recently, when I was running FBSD 12-RELEASE, my box had one hard drive.  I added a new drive when I upgraded to FBSD 13-RELEASE so I would still have FBSD 12 as an emergency backup.  Part of the upgrade is that on the new disk I created a small UFS slice for /, /var, and /tmp, and most of the rest of the disk is a ZFS slice for /usr (so I wouldn't have to wait for fsck on reboot after crashes).  That means that it isn't practical to do a test without ZFS on that new disk (I'll call it my regular disk now).  So I installed FBSD 13 (same version as my regular disk) on the old disk (I'll call it the test disk now), which had (and still has) a small UFS slice for /, /var, and /tmp and a big UFS slice for /usr.

To boot from the test disk, I use the BIOS boot menu, since (unsurprisingly) I have set the default boot disk to my regular disk.

I removed all mentions of ZFS and VBOX from /boot/loader.conf and /etc/rc.conf on the test disk.  Then I booted up a whole bunch of times.  On the thirteenth try, I got the crash.  Unfortunately, I don't have a crash summary from it because the system rebooted from my regular disk instead of the test disk while I was still staring at the crash message on the screen.  Subsequently, I booted 20 more times from the test disk without getting the crash again.

What I saw (for a few seconds) on the screen from the one crash sure looked like the same old backtrace, and I have to say, to an ignorant yokel like myself, it seemed to be saying that there's a locking problem in amdgpu.  There was absolutely no virtual terminal switching, because I had not started an X server and I did not type ALT+Fn.

I'll try getting a proper crash dump later (possibly tomorrow).  My thanks to all of you for your patience.
Comment 86 Mark Millard 2023-02-27 20:02:46 UTC
(In reply to George Mitchell from comment #85)

Where does dumpdev point for "test disk"? Someplace also on
the "test disk" that a "regular disk" boot would not change?

If yes, the first boot of the "test disk" after the crash
should have picked up the dump information, even if the
"regular disk" was booted between times. But if the dumpdev
place is common to both types of boot, then the regular disk
boot would have processed the dump. likely using a different
/var/crash/ place to store things.

Another question would be if there is sufficient room for
/var/crash/ to contain the saved vmcore.* and related files.

Yet another question is if the test disk has /usr/local/bin/gdb
installed vs. not.  (When present, /usr/local/bin/gdb is used
to provide one of the forms of backtrace, the one with source
file references and line numbers and such. Much nicer to deal
with.)

If a vmcore.* was saved but some related information was
not for some reason, it should be possible to have the
related information produced based on the vmcore.* file.


Side note:

In case it is relevant, I'll note that defining dumpdev
in /boot/loader.conf in a form the kernel can handle, instead
of in /etc/rc.conf, can be used to allow the system to produce
dumps for earlier crashes. (But I'm guessing the crash was not
early enough to need such.)
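
A sketch of what that could look like on the test disk (the device name is only a placeholder for swap space that lives on the test disk itself):

# /etc/rc.conf on the test disk (or dumpdev in /boot/loader.conf, per the side note):
dumpdev="/dev/ada1p3"
dumpdir="/var/crash"
# after booting, confirm which device is currently armed for dumps:
dumpon -l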
Comment 87 Mark Millard 2023-02-27 20:13:35 UTC
(In reply to George Mitchell from comment #85)

For booting the test disk, getting the kldstat output
from a successful boot might prove useful reference
material at some point: it should show what to expect
to be loaded by the kernel and in what order.

Since you got a crash before starting the X server and
without having used ALT+Fn, that would be an appropriate
context in which to capture kldstat for comparison against
the known UFS-only crash.

Other time frames for kldstat may be relevant at some
point.
Comment 88 Mark Millard 2023-02-27 20:50:05 UTC
I booted a ThreadRipper 1950X system via its UFS-only boot media
alternative. The system is not set up for X. For example, no
use/installation of amdgpu.ko for use with its video card. For
reference:

# kldstat
Id Refs Address                Size Name
 1   58 0xffffffff80200000  295a5a0 kernel
 2    1 0xffffffff83210000     3370 acpi_wmi.ko
 3    1 0xffffffff83214000     3210 intpm.ko
 4    1 0xffffffff83218000     2178 smbus.ko
 5    1 0xffffffff8321b000     2220 cpuctl.ko
 6    1 0xffffffff8321e000     3360 uhid.ko
 7    1 0xffffffff83222000     4364 ums.ko
 8    1 0xffffffff83227000     33a0 usbhid.ko
 9    1 0xffffffff8322b000     32a8 hidbus.ko
10    1 0xffffffff8322f000     4d00 ng_ubt.ko
11    6 0xffffffff83234000     ab28 netgraph.ko
12    2 0xffffffff8323f000     a238 ng_hci.ko
13    4 0xffffffff8324a000     2668 ng_bluetooth.ko
14    1 0xffffffff8324d000     8380 uftdi.ko
15    1 0xffffffff83256000     4e48 ucom.ko
16    1 0xffffffff8325b000     3340 wmt.ko
17    1 0xffffffff8325f000     e250 ng_l2cap.ko
18    1 0xffffffff8326e000    1bf08 ng_btsocket.ko
19    1 0xffffffff8328a000     38b8 ng_socket.ko
20    1 0xffffffff8328e000     2a50 mac_ntpd.ko

# uname -apKU
FreeBSD amd64_UFS 14.0-CURRENT FreeBSD 14.0-CURRENT #61 main-n261026-d04c86717c8c-dirty: Sun Feb 19 15:03:52 PST 2023     root@amd64_ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 1400081 1400081
Comment 89 George Mitchell 2023-03-03 23:20:14 UTC
After getting another instance of my crash on my test disk and then booting from the correct disk, I got a crash summary that said:

Dwarf Error: wrong version in compilation unit header (is 4, should be 2) [in module /usr/lib/debug/boot/kernel/kernel.debug]

It occurred to me that when I updated my test disk from FBSD 12 to 13 I had forgotten to run mergemaster.  So I did so today.  But I haven't been able to reproduce the crash in 25 tries since then.  I'm convinced that running mergemaster did not fix the crash, which is after all highly random.  So I will try some more tomorrow.

I appreciate everybody's patience.
Comment 90 Mark Millard 2023-03-04 02:22:23 UTC
(In reply to George Mitchell from comment #89)

What vintage/version of *gdb was in use? (If it was
gdb that complained.) Was it /usr/local/bin/*gdb ?
/usr/libexec/*gdb ? Actually, for the backtrace
activity, it is kgdb that is used, not gdb. Thus my
use of "*gdb" notation.

But a core.txt.* file in my context shows:

GNU gdb (GDB) 12.1 [GDB v12.1 for FreeBSD]

which would be for /usr/local/bin/*gdb ( not
/usr/libexec/*gdb ). This is because I have:

# pkg info gdb
gdb-12.1_3
Name           : gdb
Version        : 12.1_3
. . .

installed. (I had to use a livecore.* to have something
to reference/illustrate with, having had no example
vmcore.* files around for a long time.)

A significantly older gdb might indicate use of
an old /usr/libexec/*gdb that had not been cleaned
out.

I'll note that I got no DWARF complaints from
kgdb and:

# llvm-dwarfdump -r 1 /usr/lib/debug/boot/kernel/kernel.debug | grep DWARF | head -1
0x00000000: Compile Unit: length = 0x000001d3, format = DWARF32, version = 0x0004, abbr_offset = 0x0000, addr_size = 0x08 (next unit at 0x000001d7)

indicates  version = 0x0004 .

This leads me to expect that you have an old
gdb (kgdb) around that is in use.


It sounds like you got a savecore into /var/crash/ .
It should be possible to try investigating that without
having to cause another crash, presuming the system
is not updated (so that it matches the crash contents).
For example, the same sort of command that crashinfo
uses on the saved system-core file could be tried manually,
possibly with a more modern kgdb vintage that would handle
the more recent DWARF version.

Attaching your core.txt.* file content might prove
useful.
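
A minimal sketch of the manual approach (the vmcore suffix and
the paths are assumptions; substitute whatever savecore actually
wrote, and whichever kgdb is current):

# assuming a reasonably current kgdb is on PATH (e.g. from the gdb port)
kgdb /boot/kernel/kernel /var/crash/vmcore.3
(kgdb) bt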
Comment 91 George Mitchell 2023-03-05 03:21:43 UTC
Created attachment 240591 [details]
A new but related crash (I think)

This one was at shutdown time rather than boot-up time, so potentially virtual terminal switching was involved.  But once again there are references to "WARNING !drm_modeset_is_locked(&plane->mutex) failed" along with a mention of ZFS.  I don't know what it means.
Comment 92 Mark Millard 2023-03-05 06:57:25 UTC
(In reply to George Mitchell from comment #91)

So, apparently, this was not one of the UFS-only experiments.


The gdb backtrace is messy:

. . .
#7  <signal handler called>
. . .
#27 <signal handler called>
#28 0x00000000002881da in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffd688

This indicates that we are not seeing evidence from the
earlier problem that got #27. That, in turn, may or may not have
been the original problem.

The context looks to be very different from prior
reports. But not seeing what led to #27 makes forming solid
judgments problematic.


I see from this that a modern gdb (kgdb) was in use for the
crashinfo generation after the savecore operation for this
failure, with no problems handling DWARF 4 vs. 2. But it would
seem to be the boot media normally used with ZFS instead of the
boot media intended for UFS-only testing. The two might differ
in which gdb (kgdb) is around for crashinfo to use.
Comment 93 Mark Millard 2023-03-05 07:06:44 UTC
(In reply to Mark Millard from comment #92)

Looking at it some more and comparing to

#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff810ade4f at trap_pfault+0x4f
#5 0xffffffff81084fd8 at calltrap+0x8
#6 0xffffffff8214d251 at spl_nvlist_free+0x61
#7 0xffffffff8220d740 at fm_nvlist_destroy+0x20
#8 0xffffffff822e6e95 at zfs_zevent_post_cb+0x15
#9 0xffffffff8220cd02 at zfs_zevent_drain+0x62
#10 0xffffffff8220cbf8 at zfs_zevent_drain_all+0x58
#11 0xffffffff8220ede9 at fm_fini+0x19
#12 0xffffffff82243b94 at spa_fini+0x54
#13 0xffffffff822ee303 at zfs_kmod_fini+0x33
#14 0xffffffff8215fb3b at zfs_shutdown+0x2b
#15 0xffffffff80c1b76c at kern_reboot+0x3dc
#16 0xffffffff80c1b381 at sys_reboot+0x411
#17 0xffffffff810ae6ec at amd64_syscall+0x10c

both #27 and #28 in:

#26 amd64_syscall (td=0xfffffe000f43ca00, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1185
#27 <signal handler called>
#28 0x00000000002881da in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffffffd688

are possibly just the normal difficulty with finding where
to stop listing.
Comment 94 Mark Millard 2023-03-05 07:19:56 UTC
(In reply to Mark Millard from comment #93)

#7  <signal handler called>
#8  vtozoneslab (va=18446735277616529408, zone=<optimized out>, 
    slab=<optimized out>) at /usr/src/sys/vm/uma_int.h:635

looks to be the "*slab" line in:

static __inline void
vtozoneslab(vm_offset_t va, uma_zone_t *zone, uma_slab_t *slab)
{
        vm_page_t p;
  
        p = PHYS_TO_VM_PAGE(pmap_kextract(va));
        *slab = p->plinks.uma.slab;
        *zone = p->plinks.uma.zone;
}

For reference: 18446735277616529408 == 0xFFFFF80000000000
Comment 95 George Mitchell 2023-03-06 18:15:30 UTC
Created attachment 240622 [details]
Another crash summary; looks like all the earlier ones

Quick summary: I can't cause this crash on my test setup (amdgpu but no ZFS) over close to 50 tries.

In more detail: I deleted all ports from my test setup and then added drm-510-kmod and gpu-firmware-amd-kmod, and (most importantly) gdb.  I then made many fruitless attempts to reproduce the crash.  Experimentally, I added "zfs" to my mod_list in /etc/rc.conf and got another instance of the crash after 11 attempts (see attachment).  This crash looks like all the ones from my regular setup, but at least it appears to be in the right format to get a backtrace, etc.

I then took "zfs" out of my mod_list and tried another 20 times to get the crash to recur.  It did not recur.
Comment 96 John F. Carr 2023-03-06 20:09:29 UTC
(In reply to Mark Millard from comment #94)

The "signal handler called" line hides a function call.  I think the crash is due to a null pointer dereference ("fault virtual address = 0x0") in pmap_kextract called from the line above.  Tracking down the PC address 0xffffffff80bf3727 in the kernel image should clarify.
Comment 97 Mark Millard 2023-03-06 20:20:59 UTC
(In reply to George Mitchell from comment #95)

But, as I understand it, comments #85 and #89 reported
crashes of the test setup (no ZFS).
(I ignore #91, which was at shutdown and looks different.)

If true, we do have some existence-proof type evidence
for without ZFS involved. It just may be less common.
(Unfortunately some detail was not available for
validating a context match.)

You may not want to spend all your time with the
no-ZFS style tests, but spending some time on
occasion could eventually prove useful. Any big,
complicated thing (like ZFS) that can be eliminated
may help isolate the problem.
Comment 98 Mark Millard 2023-03-06 20:24:11 UTC
(In reply to John F. Carr from comment #96)

As I understand it, "fault virtual address = 0x0" is
for #7 and not for #27. As far as I can tell,
what led to #27 and its specific type is not
available to us.
Comment 99 Mark Millard 2023-03-06 20:43:24 UTC
(In reply to George Mitchell from comment #95)

FYI: "Another crash summary; looks like all the earlier ones"
is a crash when it is getting ready to load ZFS, not after
ZFS has been loaded. So ZFS had not been started yet.

So it is evidence for a problem without having ZFS in
operation at all.
Comment 100 John F. Carr 2023-03-06 20:50:18 UTC
(In reply to Mark Millard from comment #98)

Frame 27 is the entry into the kernel via the system call trap.  We know this because it calls amd64_syscall.  Frame 28 is a user program.  We know this because the addresses are at the user address space and not the kernel address space (program counter at 0x2881da, stack frame at 0x7fffffffd688).
Comment 101 Mark Millard 2023-03-06 20:51:49 UTC
(In reply to George Mitchell from comment #95)

FYI: "Another crash summary; looks like all the earlier ones"
is a crash when it is getting ready to load ZFS, not after
ZFS has been loaded. So ZFS had not been started yet.

So it is evidence for a problem without having had ZFS in
operation at all.
Comment 102 George Mitchell 2023-03-06 20:57:27 UTC
(In reply to Mark Millard from comment #97)
You are correct that I did get two crashes without ZFS, but they did not appear to produce decipherable dumps.  I'll keep trying for another dump without ZFS now that I know we can obtain a usable dump on the test setup.
(In reply to Mark Millard from comment #101)
That's why we stopped seeing the reference to ZFS when I took "zfs" out of mod_list and put "zfs_load="YES"" in /boot/loader.conf in response to comment #41.
Comment 103 Mark Millard 2023-03-06 21:00:43 UTC
(In reply to John F. Carr from comment #100)

Ahh, so kgdb ends up with fast_syscall_common+0xf8
or the like translated to a <signal handler called>.
For this part, believe the kernel's backtrace and
look at the area that says
fast_syscall_common+0xf8 (or whatever).

Good to know. Thanks.
Comment 104 John F. Carr 2023-03-06 21:13:38 UTC
If the problem is memory corruption, running a debug kernel might catch the corruption closer to when it happens.  Are you able to build and run your own kernel with a configuration file like

include GENERIC
ident   DEBUG
options       INVARIANTS
options       INVARIANT_SUPPORT

?
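
A rough sketch of the build steps, assuming /usr/src matches the
installed release and the file above is saved as
/usr/src/sys/amd64/conf/DEBUG:

cd /usr/src
make buildkernel KERNCONF=DEBUG
make installkernel KERNCONF=DEBUG
shutdown -r now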
Comment 105 Mark Millard 2023-03-06 21:26:35 UTC
(In reply to George Mitchell from comment #102)

So are all the load-time crashes with things
loaded via use of:

     kld_list    (str) A whitespace-separated list of kernel modules to load
                 right after the local disks are mounted, without any .ko
                 extension or path.  Loading modules at this point in the
                 boot process is much faster than doing it via
                 /boot/loader.conf for those modules not necessary for
                 mounting local disks.

and never with things that are loaded via
/boot/loader.conf activity?

It is a possible distinction in the test
results that I'd managed to miss.


(I'll note that the "for those modules not
necessary for mounting local disks" wording may make
listing zfs in kld_list unusual. That, in
turn, might help explain why, so far, you are
the only one known to be seeing these load-time
crash examples.)
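
To make the distinction concrete (the contents below are
illustrative, not copied from any particular system):

# /boot/loader.conf: modules loaded by the loader, before the kernel runs
zfs_load="YES"

# /etc/rc.conf: modules loaded by rc(8) via kld_list, after local disks mount
kld_list="amdgpu"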
Comment 106 George Mitchell 2023-03-07 13:39:24 UTC
(In reply to John F. Carr from comment #104)
I will try this today.  By the way, perhaps I should have mentioned already that I use SCHED_4BSD (I'm the guy who periodically rants that it should be the default, or at least that the scheduler should be a kernel loadable module), though it's hard to see how that could be a factor.
(In reply to Mark Millard from comment #105)
Yes, I had an occurrence of brain fade when I put zfs into mod_list.  I promise never to have brain fade ever again.
Comment 107 George Mitchell 2023-03-07 18:40:34 UTC
Created attachment 240642 [details]
Crash without any use of ZFS, with acpi_wmi

Here's a crash from my test setup with no use of ZFS at all.  It looks like the earlier crash with acpi_wmi, without which I suspect this hardware won't run.  Also, this kernel had INVARIANTS and INVARIANT_SUPPORT compiled in (confirmed by the config shown in the summary), though I couldn't tell from anything I saw on the screen.  Next I'll attach the relevant part of /var/log/messages, though I didn't see anything there either.
Comment 108 George Mitchell 2023-03-07 18:43:11 UTC
Created attachment 240643 [details]
Relevant part of /var/log/messages

Here's the log from the time of the crash, up to now.
Comment 109 Mark Millard 2023-03-07 18:52:05 UTC
(In reply to George Mitchell from comment #107)

I'll note that in the example kldstat that I reported
earlier the order started with:

# kldstat
Id Refs Address                Size Name
 1   58 0xffffffff80200000  295a5a0 kernel
 2    1 0xffffffff83210000     3370 acpi_wmi.ko
. . .

So acpi_wmi.ko appears to be the first module loaded
in my context. I'd guess that is true for your context
as well.

This would mean that prior module loads are not
required for the problem to happen (loading the
first of the modules). That should narrow the
range of possibilities (for someone sufficiently
knowledgeable in the subject area).
Comment 110 George Mitchell 2023-03-08 22:35:17 UTC
Created attachment 240683 [details]
New instance

This is from running my regular setup, not the debug setup.  Almost immediately after I got this dump, my system crashed two more times in a row; see next attachment, which appears to contain a summary of both crashes (the 2nd and the 3rd).  None of the stack dumps seem to have a call to modlist_lookup2, so possibly all three of these are some new amdgpu crash.
Comment 111 George Mitchell 2023-03-08 22:37:30 UTC
Created attachment 240684 [details]
Crashes 2 and 3

The second crash was very late in the boot process, unlike most of the others.  Running meld on these files might prove enlightening.
Comment 112 Mark Millard 2023-03-08 23:41:07 UTC
(In reply to George Mitchell from comment #110)

The backtraces mentioning "zap_evict_sync" are not new.
You submitted prior examples as attachments, such as
"New core.txt".

The backtrace(s) with "spa_all_configs" may well be new.
I do not remember such.
Comment 113 George Mitchell 2023-03-09 19:29:58 UTC
Would it help if I attached my system log from the period of time yesterday when I got three crashes in a row?
Comment 114 George Mitchell 2023-03-10 18:28:16 UTC
Created attachment 240729 [details]
Another instance of attachment #240591 [details] crash at shutdown time

For the sake of completeness I'm attaching one more instance of the crash I see every few days at shutdown time instead of boot-up time.

My plan for now is to restore my configuration to the one that most frequently provokes the crash: namely, I load ZFS with zfs_enable in /etc/rc.conf instead of zfs_load in /boot/loader.conf, and I'm adding vbox_enable="YES" back into /etc/rc.conf.  Also, I'm updating from drm-510-kmod-5.10.113_8 to drm-510-kmod-5.10.163_2 since it's available, and I'll see if it still crashes.  If so, then I will stop using amdgpu for a week and verify, for the purpose of maintaining my own sanity, that the crashes stop.  And I'll report back here.
Comment 115 Mark Millard 2023-03-10 18:35:47 UTC
(In reply to George Mitchell from comment #114)

All of the crashes that listed "acpi_wmi" were before
amdgpu could have been involved: acpi_wmi loads first;
amdgpu would come later.
Comment 116 George Mitchell 2023-03-10 18:45:49 UTC
Created attachment 240731 [details]
After upgrading to v5.10.163_2

I re-enabled the crashes (i.e. stopped loading ZFS early and turned vbox_enable back on) and got a crash on my very first reboot.  Now I have disabled amdgpu and I'll be astonished if I get a crash before the twelfth of never.

This crash does look slightly different, though, and seems to have had a trap 22 in ZFS code.
Comment 117 George Mitchell 2023-03-10 18:50:20 UTC
(In reply to Mark Millard from comment #115)
Be that as it may, over the period of time from when I first upgraded to FBSD 13.1 until I started seriously trying to use drm-510-kmod, I never saw any occurrences at all of the ZFS crash, the vboxnetflt crash, or the acpi_wmi crash.  And I don't expect to see any of them as long as I don't load amdgpu.ko.
Comment 118 Mark Millard 2023-03-11 22:45:37 UTC
(In reply to George Mitchell from comment #117)

Yea, my expectation that acpi_wmi would always be loaded first
was just wrong. Sorry. With the ZFS boot media, I see:

Id Refs Address                Size Name
 1   94 0xffffffff80200000  295a9b0 kernel
 2    1 0xffffffff82b5b000   5b80d8 zfs.ko
 3    1 0xffffffff83115000     76f8 cryptodev.ko
 4    1 0xffffffff83a10000     3370 acpi_wmi.ko
. . .

I looked at all your attachments again. It appears amdgpu
was already present before the first crash point in all of
them.
Comment 119 Mark Millard 2023-03-11 23:45:30 UTC
For:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff80d17870

objdump -d --prefix-addresses /boot/kernel/kernel | less

shows:

ffffffff80d1786b <qsort+0x12ab> mov    %esi,0x4(%r11,%rdx,4)
ffffffff80d17870 <qsort+0x12b0> mov    0x8(%rcx,%rdx,4),%esi

As for other "instruction pointer" examples . . .

Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer	= 0x20:0xffffffff80d17890

ffffffff80d1788f <qsort+0x12cf> mov    %esi,0xc(%r11,%rdx,4)
ffffffff80d17894 <qsort+0x12d4> add    $0x4,%rdx

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x7
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff82600ba6

The above is outside the kernel's code.

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf3707

ffffffff80bf3701 <free+0x11> je     ffffffff80bf378d <free+0x9d>
ffffffff80bf3707 <free+0x17> mov    %rsi,%r14

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff82231ba6

The above is outside the kernel's code.

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf3707

ffffffff80bf3701 <free+0x11> je     ffffffff80bf378d <free+0x9d>
ffffffff80bf3707 <free+0x17> mov    %rsi,%r14

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf3727

ffffffff80bf3722 <free+0x32> call   ffffffff80f66670 <PHYS_TO_VM_PAGE>
ffffffff80bf3727 <free+0x37> mov    (%rax),%r13

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff80d0cea0

ffffffff80d0ce9c <vn_ioctl+0x1fc> jne    ffffffff80d0cff2 <vn_ioctl+0x352>
ffffffff80d0cea2 <vn_ioctl+0x202> movzwl 0x2(%r13),%ecx
Comment 120 Mark Millard 2023-03-12 20:14:25 UTC
(In reply to Mark Millard from comment #119)

[Sorry for the accidental duplication of the block that
had "instruction pointer = 0x20:0xffffffff80bf3707".]

The qsort, free, and vn_ioctl addresses do not look to
match up with any of the multi-level backtraces. So we
have very little evidence about what the context was.

I've no clue for the addresses that were outside the
kernel.
Comment 121 Mark Millard 2023-03-12 20:54:27 UTC
(In reply to Mark Millard from comment #120)

Ugg. I just realized that I'd not looked at an official
releng/13.1 build. So using a download of an official
kernel.txz this time . . . (the subroutines stay the same
but the detailed code is different).


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff80d17870

ffffffff80d1786d <qsort+0x130d> mov    -0x38(%rbp),%rdi
ffffffff80d17871 <qsort+0x1311> mov    %dl,(%rdi,%rsi,1)


As for other "instruction pointer" examples . . .

Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer	= 0x20:0xffffffff80d17890

ffffffff80d1788f <qsort+0x132f> cmp    $0x3,%r8
ffffffff80d17893 <qsort+0x1333> jae    ffffffff80d17910 <qsort+0x13b0>


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x7
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff82600ba6

The above is outside the kernel's code.


Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff82231ba6

The above is outside the kernel's code.


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf3707

ffffffff80bf3700 <free+0x70> mov    %gs:0xb0,%rax
ffffffff80bf3709 <free+0x79> add    %r15,0x8(%rcx,%rax,1)


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf3727

ffffffff80bf3724 <free+0x94> cmpb   $0x0,0x128(%rbx)
ffffffff80bf372b <free+0x9b> jne    ffffffff80bf3777 <free+0xe7>


Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff80d0cea0

ffffffff80d0ce9a <vn_ioctl+0x25a> mov    %r14,-0xc8(%rbp)
ffffffff80d0cea1 <vn_ioctl+0x261> cmpb   $0x0,0xaf417e(%rip)        # ffffffff81801026 <sdt_probes_enabled>
Comment 122 Mark Millard 2023-03-12 23:13:57 UTC
(In reply to George Mitchell from comment #117)

Would it be reasonable to have some testing with amdgpu.ko
loaded but never having a desktop environment active?

Or, may be I should form the idea as questions: What is
the minimal form of having amdgpu.ko loaded in the system?
Can that be tested (if it has not been already)? Does this
minimal form behave any differently than more involved
use of amdgpu.ko (and the associated card firmware)?

In a different direction . . .

In/for a separate context, I once built amdgpu and its
firmware and installed it. But I did not set up an
automatic load. For the rare test, I manually loaded
amdgpu and then started lumina. (It is an old memory. I
might not have the details correct.) This procedure might
have largely avoided later loads of kernel modules and,
so, avoided discovering a problem.
Comment 123 George Mitchell 2023-03-12 23:23:49 UTC
All my so-called test setup tests were run without starting a desktop environment (by which I assume you mean not starting X).  There were still crashes such as in comment #107, attachment #240642 [details].

With my normal setup, kldloading amdgpu manually instead of automatically noticeably reduced the incidence of crashes but did not eliminate them.
Comment 124 Mark Millard 2023-03-13 01:22:33 UTC
(In reply to George Mitchell from comment #123)

"kldloading amdgpu manually": there are two possibilities:

A) Using boot -s and doing kldload and then exiting to
   normal mode. There are examples in your attachments
   of doing this.

B) Getting to normal mode, logging in, and only after that
   doing the first kldload of amdgpu. I do not remember any
   of the attachments clearly indicating such a sequence.
   It puts the amdgpu load after other other normal loads.
Comment 125 Mark Millard 2023-03-13 03:51:59 UTC
Well, I was going to try testing in an environment where I've
got a serial console: an aarch64 main [so: 14] context. But
it turns out that there is at least one missing function
declaration for that type of context at this point:

/wrkdirs/usr/ports/graphics/drm-515-kmod/work/drm-kmod-drm_v5.15.25/drivers/gpu/drm/drm_cache.c:362:10: error: call to undeclared function 'in_interrupt'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
        WARN_ON(in_interrupt());
                ^
1 error generated.
*** [drm_cache.o] Error code 1

as is visible in the official build log:

http://ampere2.nyi.freebsd.org/data/main-arm64-default/p64e3eb722c17_s7fc82fd1f8/logs/errors/drm-515-kmod-5.15.25.log

Turns out the drm-510-kmod variant allowed for releng/13.1
and later is missing possible macro definitions for aarch64:

/wrkdirs/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_2/drivers/gpu/drm/amd/display/dc/core/dc.c:741:3: error: call to undeclared function 'DC_FP_START'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                DC_FP_START();
                ^
/wrkdirs/usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_2/drivers/gpu/drm/amd/display/dc/core/dc.c:743:3: error: call to undeclared function 'DC_FP_END'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                DC_FP_END();
                ^
2 errors generated.
*** [dc.o] Error code 1

as is visible in:

http://ampere2.nyi.freebsd.org/data/main-arm64-default/p64e3eb722c17_s7fc82fd1f8/logs/errors/drm-510-kmod-5.10.163_2.log

(It is not just my builds that have such issues:
official builds have the problems as well.)

I was hoping I'd be able to do some testing in the
alternative type of context (likely never starting
X11). That looks to not be in the cards at this
time.
Comment 126 Mark Millard 2023-03-14 03:22:27 UTC
(In reply to Mark Millard from comment #125)

Picking the drm-515-kmod one: it looks like the source 
file referenced needs to include the content of the
file providing the #define :

/usr/main-src/sys/compat/linuxkpi/common/include/linux/preempt.h:#define        in_interrupt() \


There are, overall, some other uses:

drm-kmod//drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c:     if (r < 1 && in_interrupt())
drm-kmod//drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:      if (r < 1 && (amdgpu_in_reset(adev) || in_interrupt()))
drm-kmod//drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:      if (r < 1 && (amdgpu_in_reset(adev) || in_interrupt()))
drm-kmod//drivers/gpu/drm/drm_cache.c:  if (WARN_ON(in_interrupt())) {
drm-kmod//drivers/gpu/drm/drm_cache.c:  WARN_ON(in_interrupt());

I have not checked whether any of the others already get
preempt.h.

amd64 might be working via header pollution in some way
that aarch64 does not?
Comment 127 Mark Millard 2023-03-17 05:22:39 UTC
(In reply to Mark Millard from comment #126)

Further inspection of what comes next after making
drm_cache.c pick up the in_interrupt definition
suggests that trying builds of aarch64 is premature
at this point, making the type of test I was
intending also premature.
Comment 128 George Mitchell 2023-03-17 17:43:18 UTC
(In reply to George Mitchell from comment #114)
> [...] I will stop using amdgpu for a week and verify, for the purpose
> of maintaining my own sanity, that the crashes stop.   [...]

Back in amd64 land, since the time of that comment, I have rebooted my system 25 times and there have been no crashes at all.  I guess I'm sane.
Comment 129 Mark Millard 2023-03-17 20:01:45 UTC
(In reply to George Mitchell from comment #12)

Could you also share your "kldstat" output for when amdgpu
has been loaded?

More modules than just amdgpu might be added to what gets
loaded (possibly before amdgpu) compared to when amdgpu is not
loaded at all. For example, some of:

# find /boot/ker*/ -name 'linux*' -print | more
/boot/kernel/linux64.ko
/boot/kernel/linux_common.ko
/boot/kernel/linuxkpi.ko
/boot/kernel/linuxkpi_wlan.ko

might be involved, not just amdgpu.

Loading only some prerequisites for amdgpu, but not
amdgpu itself, might prove a useful isolation test.
Comment 130 Mark Millard 2023-03-17 20:30:16 UTC
(In reply to Mark Millard from comment #129)

I wrote "what is loaded before" relative to amdgpu.
But what amdgpu in turn leads to loading that is
listed after amdgpu in "kld stat" output likely is
just as relevant. For all I know all of it may be
from after amdgpu's position in the "kld stat" list.
Comment 131 Mark Millard 2023-03-18 04:20:49 UTC
(In reply to Mark Millard from comment #130)

Based on drm-515-kmod related materials on/for amd64
running main [so: 14] and the type of card that happened
to be present, I saw:

22    1 0xffffffff83c00000   4fd918 amdgpu.ko
23    2 0xffffffff83a8e000    79f50 drm.ko
24    1 0xffffffff83b08000     22a8 iic.ko
25    3 0xffffffff83b0b000     30d8 linuxkpi_gplv2.ko
26    4 0xffffffff83b0f000     6320 dmabuf.ko
27    3 0xffffffff83b16000     3360 lindebugfs.ko
28    1 0xffffffff83b1a000     b350 ttm.ko
29    1 0xffffffff83b26000     a118 amdgpu_polaris11_k_mc_bin.ko
30    1 0xffffffff83b31000     6370 amdgpu_polaris11_pfp_2_bin.ko
31    1 0xffffffff83b38000     6370 amdgpu_polaris11_me_2_bin.ko
32    1 0xffffffff83b3f000     4370 amdgpu_polaris11_ce_2_bin.ko
33    1 0xffffffff83b44000     7978 amdgpu_polaris11_rlc_bin.ko
34    1 0xffffffff83b4c000    42380 amdgpu_polaris11_mec_2_bin.ko
35    1 0xffffffff83b8f000    42380 amdgpu_polaris11_mec2_2_bin.ko
36    1 0xffffffff83bd2000     5270 amdgpu_polaris11_sdma_bin.ko
37    1 0xffffffff83bd8000     5270 amdgpu_polaris11_sdma1_bin.ko
38    1 0xffffffff840fe000    5db58 amdgpu_polaris11_uvd_bin.ko
39    1 0xffffffff8415c000    2ac78 amdgpu_polaris11_vce_bin.ko
40    1 0xffffffff83bde000    21d90 amdgpu_polaris11_k_smc_bin.ko

This was from deliberately using kldload amdgpu after all the
normal boot/login load activity. No kld_list= use involved at
all.

I wonder how much your environment would crash for amdgpu loaded
this late.

FYI: The prior load activity was:

Id Refs Address                Size Name
 1  132 0xffffffff80200000  295b050 kernel
 2    1 0xffffffff82b5d000     76f8 cryptodev.ko
 3    1 0xffffffff82b65000   5b80d8 zfs.ko
 4    1 0xffffffff83a10000     3370 acpi_wmi.ko
 5    1 0xffffffff83a14000     3210 intpm.ko
 6    1 0xffffffff83a18000     2178 smbus.ko
 7    1 0xffffffff83a1b000     2220 cpuctl.ko
 8    1 0xffffffff83a1e000     3360 uhid.ko
 9    1 0xffffffff83a22000     4364 ums.ko
10    1 0xffffffff83a27000     33a0 usbhid.ko
11    1 0xffffffff83a2b000     32a8 hidbus.ko
12    1 0xffffffff83a2f000     4d00 ng_ubt.ko
13    6 0xffffffff83a34000     ab28 netgraph.ko
14    2 0xffffffff83a3f000     a238 ng_hci.ko
15    4 0xffffffff83a4a000     2668 ng_bluetooth.ko
16    1 0xffffffff83a4d000     8380 uftdi.ko
17    1 0xffffffff83a56000     4e48 ucom.ko
18    1 0xffffffff83a5b000     3340 wmt.ko
19    1 0xffffffff83a5f000     e250 ng_l2cap.ko
20    1 0xffffffff83a6e000    1bf08 ng_btsocket.ko
21    1 0xffffffff83a8a000     38b8 ng_socket.ko
Comment 132 George Mitchell 2023-03-18 17:49:14 UTC
(In reply to Mark Millard from comment #129)
When I boot up to single-user mode, kldstat says:
Id Refs Address                Size Name
 1    7 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    8cc90 vboxdrv.ko
 3    1 0xffffffff821be000    ff4b8 if_re.ko
 4    1 0xffffffff822be000     77e0 sem.ko
After "kldload amdgpu," it says:
Id Refs Address                Size Name
 1   59 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    8cc90 vboxdrv.ko
 3    1 0xffffffff821be000    ff4b8 if_re.ko
 4    1 0xffffffff822be000     77e0 sem.ko
 5    1 0xffffffff82600000   417220 amdgpu.ko
 6    2 0xffffffff82518000    739e0 drm.ko
 7    3 0xffffffff8258c000     5220 linuxkpi_gplv2.ko
 8    4 0xffffffff82592000     62d8 dmabuf.ko
 9    1 0xffffffff82599000     c758 ttm.ko
10    1 0xffffffff825a6000     2218 amdgpu_raven_gpu_info_bin.ko
11    1 0xffffffff825a9000     64d8 amdgpu_raven_sdma_bin.ko
12    1 0xffffffff825b0000    2e2d8 amdgpu_raven_asd_bin.ko
13    1 0xffffffff825df000     93d8 amdgpu_raven_ta_bin.ko
14    1 0xffffffff825e9000     7558 amdgpu_raven_pfp_bin.ko
15    1 0xffffffff825f1000     6558 amdgpu_raven_me_bin.ko
16    1 0xffffffff825f8000     4558 amdgpu_raven_ce_bin.ko
17    1 0xffffffff82a18000     b9c0 amdgpu_raven_rlc_bin.ko
18    1 0xffffffff82a24000    437e8 amdgpu_raven_mec_bin.ko
19    1 0xffffffff82a68000    437e8 amdgpu_raven_mec2_bin.ko
20    1 0xffffffff82aac000    5a638 amdgpu_raven_vcn_bin.ko
But after a full boot without amdgpu, it says:
Id Refs Address                Size Name
 1   66 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    ff4b8 if_re.ko
 3    3 0xffffffff82230000    8cc90 vboxdrv.ko
 4    1 0xffffffff822bd000     77e0 sem.ko
 5    1 0xffffffff82600000   3df128 zfs.ko
 6    2 0xffffffff82518000     4240 vboxnetflt.ko
 7    2 0xffffffff8251d000     aac8 netgraph.ko
 8    1 0xffffffff82528000     31c8 ng_ether.ko
 9    1 0xffffffff8252c000     55e0 vboxnetadp.ko
10    1 0xffffffff82532000     3378 acpi_wmi.ko
11    1 0xffffffff82536000     3218 intpm.ko
12    1 0xffffffff8253a000     2180 smbus.ko
13    1 0xffffffff8253d000     33c0 uslcom.ko
14    1 0xffffffff82541000     4d90 ucom.ko
15    1 0xffffffff82546000     2340 uhid.ko
16    1 0xffffffff82549000     3380 usbhid.ko
17    1 0xffffffff8254d000     31f8 hidbus.ko
18    1 0xffffffff82551000     3320 wmt.ko
19    1 0xffffffff82555000     4350 ums.ko
20    1 0xffffffff8255a000     5af8 autofs.ko
21    1 0xffffffff82560000     2a08 mac_ntpd.ko
22    1 0xffffffff82563000     20f0 green_saver.ko
Comment 133 Mark Millard 2023-03-18 18:46:34 UTC
(In reply to George Mitchell from comment #132)

I wonder if, in your context, the following boot
sequencing might sidestep the boot-crash issue:

"A full boot without amdgpu"
then: "kldload amdgpu"
then: normal use.

Basically: doing the amdgpu load as late as
possible relative to everything else loaded,
limiting what all loads after amdgpu.
Comment 134 George Mitchell 2023-03-19 22:23:03 UTC
Okay, my machine is set up as you requested.  It boots to multiuser mode without starting an X session, at which point I load amdgpu and then start my normal XFCE session.  I'll run it this way for a week.

Undoubtedly, it won't exhibit the bootup crash in this mode of operation, but I won't be surprised if I still get a shutdown crash or two.  And in any case this isn't a fix for the underlying bug.

Not sure what new information this is likely to yield.
Comment 135 Mark Millard 2023-03-20 01:39:10 UTC
(In reply to George Mitchell from comment #134)

Having the kldstat output for this combination would
help identify what module is initially involved in any
crash.

Part of what may be of use is how often you see the
dbuf_evict_thread type of backtrace and what module
the first "instruction pointer	=" references in
such cases (if any). Another would be if new crash
contexts show up that have not been seen before.

So far there is no evidence for how many bugs there
are, given the varying failure-structures that show
up. There could even be the possibility of unreliable
memory or bugs specific to amdgpu_raven_*.ko files (such
as sometimes trashing some memory).

I've yet to induce any failure in the amdgpu_polaris11_*.ko
based amd64 context that I have access to (a ThreadRipper
1950X), although by no means is it a close match to your
context. To my knowledge, you still have the only known
examples of any of the failures.

To some extent, if trying new things leads to new forms
of failure for you, it potentially gives me new sequences
to try on the ThreadRipper 1950X. How (un)likely that is
to yield useful information I do not know. (My hope to
also try on aarch64, where I've access to a serial
console, did not pan out.)
Comment 136 George Mitchell 2023-03-20 15:18:16 UTC
Sorry, meant to put these in yesterday.  After booting to single-user mode, kldstat reports:

Id Refs Address                Size Name
 1    7 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    8cc90 vboxdrv.ko
 3    1 0xffffffff821be000    ff4b8 if_re.ko
 4    1 0xffffffff822be000     77e0 sem.ko

If I boot to single-user mode and kldload amdgpu, kldstat reports:

Id Refs Address                Size Name
 1   59 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    8cc90 vboxdrv.ko
 3    1 0xffffffff821be000    ff4b8 if_re.ko
 4    1 0xffffffff822be000     77e0 sem.ko
 5    1 0xffffffff82600000   417220 amdgpu.ko
 6    2 0xffffffff82518000    739e0 drm.ko
 7    3 0xffffffff8258c000     5220 linuxkpi_gplv2.ko
 8    4 0xffffffff82592000     62d8 dmabuf.ko
 9    1 0xffffffff82599000     c758 ttm.ko
10    1 0xffffffff825a6000     2218 amdgpu_raven_gpu_info_bin.ko
11    1 0xffffffff825a9000     64d8 amdgpu_raven_sdma_bin.ko
12    1 0xffffffff825b0000    2e2d8 amdgpu_raven_asd_bin.ko
13    1 0xffffffff825df000     93d8 amdgpu_raven_ta_bin.ko
14    1 0xffffffff825e9000     7558 amdgpu_raven_pfp_bin.ko
15    1 0xffffffff825f1000     6558 amdgpu_raven_me_bin.ko
16    1 0xffffffff825f8000     4558 amdgpu_raven_ce_bin.ko
17    1 0xffffffff82a18000     b9c0 amdgpu_raven_rlc_bin.ko
18    1 0xffffffff82a24000    437e8 amdgpu_raven_mec_bin.ko
19    1 0xffffffff82a68000    437e8 amdgpu_raven_mec2_bin.ko
20    1 0xffffffff82aac000    5a638 amdgpu_raven_vcn_bin.ko

If I boot to multi-user mode without kldloading amdgpu, kldstat reports:

Id Refs Address                Size Name
 1   66 0xffffffff80200000  1f2ffd0 kernel
 2    1 0xffffffff82130000    ff4b8 if_re.ko
 3    1 0xffffffff82231000     77e0 sem.ko
 4    3 0xffffffff82239000    8cc90 vboxdrv.ko
 5    1 0xffffffff82600000   3df128 zfs.ko
 6    2 0xffffffff82518000     4240 vboxnetflt.ko
 7    2 0xffffffff8251d000     aac8 netgraph.ko
 8    1 0xffffffff82528000     31c8 ng_ether.ko
 9    1 0xffffffff8252c000     55e0 vboxnetadp.ko
10    1 0xffffffff82532000     3378 acpi_wmi.ko
11    1 0xffffffff82536000     3218 intpm.ko
12    1 0xffffffff8253a000     2180 smbus.ko
13    1 0xffffffff8253d000     33c0 uslcom.ko
14    1 0xffffffff82541000     4d90 ucom.ko
15    1 0xffffffff82546000     2340 uhid.ko
16    1 0xffffffff82549000     3380 usbhid.ko
17    1 0xffffffff8254d000     31f8 hidbus.ko
18    1 0xffffffff82551000     3320 wmt.ko
19    1 0xffffffff82555000     4350 ums.ko
20    1 0xffffffff8255a000     5af8 autofs.ko
21    1 0xffffffff82560000     2a08 mac_ntpd.ko
22    1 0xffffffff82563000     20f0 green_saver.ko

If I then kldload amdgpu, it says the same as above, plus:

23    1 0xffffffff82a00000   417220 amdgpu.ko
24    2 0xffffffff82566000    739e0 drm.ko
25    3 0xffffffff825da000     5220 linuxkpi_gplv2.ko
26    4 0xffffffff825e0000     62d8 dmabuf.ko
27    1 0xffffffff825e7000     c758 ttm.ko
28    1 0xffffffff825f4000     2218 amdgpu_raven_gpu_info_bin.ko
29    1 0xffffffff825f7000     64d8 amdgpu_raven_sdma_bin.ko
30    1 0xffffffff82e18000    2e2d8 amdgpu_raven_asd_bin.ko
31    1 0xffffffff829e0000     93d8 amdgpu_raven_ta_bin.ko
32    1 0xffffffff829ea000     7558 amdgpu_raven_pfp_bin.ko
33    1 0xffffffff829f2000     6558 amdgpu_raven_me_bin.ko
34    1 0xffffffff829f9000     4558 amdgpu_raven_ce_bin.ko
35    1 0xffffffff82e47000     b9c0 amdgpu_raven_rlc_bin.ko
36    1 0xffffffff82e53000    437e8 amdgpu_raven_mec_bin.ko
37    1 0xffffffff82e97000    437e8 amdgpu_raven_mec2_bin.ko
38    1 0xffffffff82edb000    5a638 amdgpu_raven_vcn_bin.ko
Comment 137 George Mitchell 2023-03-20 22:17:43 UTC
Created attachment 241022 [details]
Four boot-time crashes in a row

For some reason, I just got four boot-up crashes immediately in a row.  After I cycled power, I was able to boot up without crashing.  I think I'm going to load zfs.ko from /boot/loader.conf to get it loaded earlier, which mitigates this problem.  (It's currently loaded with zfs_enable="YES" in /etc/rc.conf.)
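
For the record, the two knobs involved look like this (the values
are the obvious ones; the change is only about where zfs.ko gets
loaded, not about whether ZFS is enabled):

# /etc/rc.conf -- today: zfs.ko pulled in late, by the zfs rc script
zfs_enable="YES"

# /boot/loader.conf -- planned addition: the loader pulls zfs.ko in early
zfs_load="YES"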
Comment 138 Mark Millard 2023-03-20 22:22:34 UTC
(In reply to George Mitchell from comment #137)

Your upload ended up being: application/octet-stream this
time, instead of text/plain .
Comment 139 George Mitchell 2023-03-20 22:25:37 UTC
Yes.  It's a compressed tar file with four core.txt files for the price of one.  They are different enough that I thought I'd better attach them all, though mainly the later ones include increasing portions of the earlier ones because they were on immediately successive boots.
Comment 140 Mark Millard 2023-03-20 23:03:19 UTC
(In reply to George Mitchell from comment #137)

All 4 are examples related to dbuf_evict_thread (a.k.a.
zfs dbuf related crashes), as I feared. All 4 look like:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address	= 0x7
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff82600ba6

Looks to be in:

 5    1 0xffffffff82600000   3df128 zfs.ko


 panic: page fault
cpuid = 1
time = 1679349400
KDB: stack backtrace:
#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff810ade4f at trap_pfault+0x4f
#5 0xffffffff81084fd8 at calltrap+0x8
#6 0xffffffff827ac768 at zap_evict_sync+0x68
#7 0xffffffff8267d74a at dbuf_destroy+0xba
#8 0xffffffff82683129 at dbuf_evict_one+0xf9
#9 0xffffffff8267b43d at dbuf_evict_thread+0x31d
#10 0xffffffff80bd8abe at fork_exit+0x7e
#11 0xffffffff8108604e at fork_trampoline+0xe

#6  0xffffffff810ade4f in trap_pfault (frame=0xfffffe00b3bb6d00, 
    usermode=false, signo=<optimized out>, ucode=<optimized out>)
    at /usr/src/sys/amd64/amd64/trap.c:763
#7  <signal handler called>
#8  avl_destroy_nodes (tree=tree@entry=0xfffff8001a80b5a0, 
    cookie=cookie@entry=0xfffffe00b3bb6dd0)
    at /usr/src/sys/contrib/openzfs/module/avl/avl.c:1023
#9  0xffffffff827ac768 in mze_destroy (zap=0xfffff8001a80b480)
    at /usr/src/sys/contrib/openzfs/module/zfs/zap_micro.c:402

A question would be whether this repeats with amdgpu having been
loaded (again, last) but no X11-like activity having ever been
started: limiting amdgpu use to just the load activity, or as
close to that limited a use as is possible. (This is separate
from your zfs load-time adjustment test.)

My guess is that the content of some memory area(s) is being
trashed in your context. I'm not sure how to track down
what is doing the trashing, or where all the trashed area(s)
are, if that is what is going on.

At least we now have a clue how to get the specific type of
crash. Before I had no clue what an example initial-context
might be like.


Note: Changing the load order should get a matching kldstat
report to indicate the address ranges that end up involved.
Comment 141 Mark Millard 2023-03-20 23:12:37 UTC
(In reply to George Mitchell from comment #139)

The upload did not look compressed to me: I just
had to use tools that would tolerate the binary
content at the start and end. The rest looked like
normal text without me doing anything to decompress
the file.

But, looking, the prefix text does look like a
partially-binary header, likely added by a tool.
The tail end might just be binary padding.

At least I've a clue for next time.
Comment 142 George Mitchell 2023-03-20 23:17:15 UTC
So I should just boot up to multi-user mode and kldload amdgpu, but not start XFCE?  And repeat until it crashes again?
Comment 143 Mark Millard 2023-03-20 23:29:08 UTC
(In reply to George Mitchell from comment #142)

Seeing if that no-XFCE context crashes vs. not would
be a good idea. If it crashes similarly, then XFCE
activity is not likely to be involved. If it does
not crash, then XFCE activity is likely involved.


FYI: all 4 crashes had:

fault virtual address	= 0x7

(the same small offset from a NULL pointer in
C terms). This does not look like random trashing
of memory (for the few examples available).
Comment 144 George Mitchell 2023-03-21 00:07:59 UTC
Created attachment 241027 [details]
Another shutdown-time crash

I got another shutdown-time crash.  The part of this file that is relevant to this crash starts around line 1400; all the earlier stuff appears to be from the crashes earlier today.
Comment 145 Mark Millard 2023-03-21 01:07:55 UTC
(In reply to George Mitchell from comment #144)

Looking at your full list of attachments, it appears that . . .

All the shutdown time crashes have:

fault virtual address	= 0x0

(And we might now have a known type of context
for getting the type of failure: late amdgpu
but no XFCE.)

All the dbuf_evict_thread related crashes have:

fault virtual address	= 0x7

(Late amdgpu but having used XFCE.)

All the kldload related crashes have:

Fatal trap 9: general protection fault while in kernel mode
(but no explicit fault address listed)

(Early amdgpu loading.)


My guess is something is trashing memory in a way
that involves writing zeros over some pointer values
that it should not be touching. Later code extracts
such zeros and applies any offset and then tries to
dereference the result, resulting in a crash.

That you got "fault virtual address = 0x0" for shutdown
without having involved XFCE suggests that a problem is
already in place before XFCE is potentially involved:
XFCE is not required. (XFCE use might lead to more
trashed memory than otherwise, leading to the 0x7
fault address cases.)

But I do not see how to get solid evidence for or
against such a hypothesis (or related ones).

The only thing I can identify that is likely unique to
your context --but is involved with amdgpu-- is the
involvement of the amdgpu_raven_*.ko modules.

Unfortunately, moving your context to a different system
that avoids using those modules, or finding someone with a
separate system that does use them (and who is willing to
set up experiments), is non-trivial in both directions
of testing.

Beyond possibly some checking on the degree/ease of
repeatability, I do not see how to gather better
information, much less get anywhere near directly
actionable information for fixing the crashes.

The one thing we have not looked at is the crash
dumps themselves, examining what memory looks like
and such. But I do not know what to do for that
either, relative to known-useful information. Such a
direction would be very exploratory and likely very
time consuming.
Comment 146 Mark Millard 2023-03-21 19:32:10 UTC
(In reply to Mark Millard from comment #145)

For the:

fault virtual address	= 0x7

examples, it looks like the value stored in RAM has the 0x7
in it instead of being a later offset addition. The loop
in question in avl_destroy_nodes just uses "mov (%rdi),%rdi"
with no offset involved:

NOTE: Loop starts below
   0x0000000000000ba0 <+64>:	mov    %rdi,%rax
   0x0000000000000ba3 <+67>:	mov    %rdx,%rcx
   0x0000000000000ba6 <+70>:	mov    (%rdi),%rdi
   0x0000000000000ba9 <+73>:	mov    %rax,%rdx
   0x0000000000000bac <+76>:	test   %rdi,%rdi
   0x0000000000000baf <+79>:	jne    0xba0 <avl_destroy_nodes+64>
NOTE: The above is the loop end
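
A rough C rendering of what that loop does (my interpretation of
the disassembly above, not a quote of the OpenZFS source; the field
name is an assumption based on avl_child[] being the first member
of avl_node_t):

	avl_node_t *node = start, *prev = NULL;

	/* Follow the pointer at offset 0 of each node until it is NULL,
	 * remembering the node we came from.  No displacement is added
	 * before the dereference, so a fault address of 0x7 means the
	 * stored pointer value itself was 0x7. */
	while (node != NULL) {
		prev = node;
		node = node->avl_child[0];
	}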
Comment 147 George Mitchell 2023-03-22 00:23:01 UTC
Created attachment 241046 [details]
Crash at shutdown time

Another occurrence of the crash at shutdown time rather than boot time.  I'm reluctant to post a vmcore file here, but I can make it available to anyone who thinks it will be useful.
Comment 148 Mark Millard 2023-03-22 01:09:47 UTC
(In reply to George Mitchell from comment #147)

That crash is different from all prior ones. It crashed
in nfsd via a:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 01
instruction pointer	= 0x20:0xffffffff80c895cb
stack pointer	        = 0x28:0xfffffe00b555dba0
frame pointer	        = 0x28:0xfffffe00b555dbb0
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 1109 (nfsd)

None of the prior kldstat outputs have shown nfsd as
loaded.

For reference:

panic: general protection fault
cpuid = 1
time = 1679441112
KDB: stack backtrace:
#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff81084fd8 at calltrap+0x8
#5 0xffffffff80c8866b at seltdclear+0x2b
#6 0xffffffff80c88355 at kern_select+0xbd5
#7 0xffffffff80c88456 at sys_select+0x56
#8 0xffffffff810ae6ec at amd64_syscall+0x10c
#9 0xffffffff810858eb at fast_syscall_common+0xf8

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55		__asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c1b7ec in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:487
#3  0xffffffff80c1bc5e in vpanic (fmt=0xffffffff811b2f41 "%s", 
    ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4  0xffffffff80c1ba63 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:844
#5  0xffffffff810addf5 in trap_fatal (frame=0xfffffe00b555dae0, eva=0)
    at /usr/src/sys/amd64/amd64/trap.c:944
#6  <signal handler called>
#7  0xffffffff80c895cb in atomic_fcmpset_long (src=18446741877726026240, 
    dst=<optimized out>, expect=<optimized out>)
    at /usr/src/sys/amd64/include/atomic.h:225
#8  selfdfree (stp=stp@entry=0xfffff80012aa8080, sfp=0xfffff80000000007)
    at /usr/src/sys/kern/sys_generic.c:1755
#9  0xffffffff80c8866b in seltdclear (td=td@entry=0xfffffe00b52e9a00)
    at /usr/src/sys/kern/sys_generic.c:1967
#10 0xffffffff80c88355 in kern_select (td=<optimized out>, 
    td@entry=0xfffffe00b52e9a00, nd=7, fd_in=<optimized out>, 
    fd_ou=<optimized out>, fd_ex=<optimized out>, tvp=<optimized out>, 
    tvp@entry=0x0, abi_nfdbits=64) at /usr/src/sys/kern/sys_generic.c:1210
#11 0xffffffff80c88456 in sys_select (td=0xfffffe00b52e9a00, 
    uap=0xfffffe00b52e9de8) at /usr/src/sys/kern/sys_generic.c:1014
#12 0xffffffff810ae6ec in syscallenter (td=0xfffffe00b52e9a00)
    at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#13 amd64_syscall (td=0xfffffe00b52e9a00, traced=0)
    at /usr/src/sys/amd64/amd64/trap.c:1185
#14 <signal handler called>
#15 0x00000008011a373a in ?? ()

Note: 18446741877726026240 == 0xfffffe00b52e9a00
Comment 149 George Mitchell 2023-03-22 01:15:31 UTC
(In reply to Mark Millard from comment #148)
> None of the prior kldstat outputs have shown nfsd as loaded.
That's because they weren't verbose kldstats.  nfsd is statically linked into the kernel.  kldstat -v definitely shows that nfsd is present.
Comment 150 George Mitchell 2023-03-23 00:30:00 UTC
In order to reconfirm my sincere belief that the key factor in these crashes is amdgpu (and also because I need a respite from the crashes), I'm running without amdgpu (and running X in VESA mode) for a while.  I fully expect that the crashes will stop as a result.
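
(For anyone wanting to repeat the comparison: the VESA arrangement
here is essentially a minimal xorg.conf device section, roughly like
the sketch below; the Identifier string is arbitrary, and there may
be more to it on other hardware.)

Section "Device"
        Identifier "Card0"
        Driver     "vesa"
EndSection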
Comment 151 Mark Millard 2023-03-23 01:21:54 UTC
(In reply to George Mitchell from comment #150)

Sounds appropriate.

"amdgpu" is really the sort of bundle:

23    1 0xffffffff82a00000   417220 amdgpu.ko
24    2 0xffffffff82566000    739e0 drm.ko
25    3 0xffffffff825da000     5220 linuxkpi_gplv2.ko
26    4 0xffffffff825e0000     62d8 dmabuf.ko
27    1 0xffffffff825e7000     c758 ttm.ko
28    1 0xffffffff825f4000     2218 amdgpu_raven_gpu_info_bin.ko
29    1 0xffffffff825f7000     64d8 amdgpu_raven_sdma_bin.ko
30    1 0xffffffff82e18000    2e2d8 amdgpu_raven_asd_bin.ko
31    1 0xffffffff829e0000     93d8 amdgpu_raven_ta_bin.ko
32    1 0xffffffff829ea000     7558 amdgpu_raven_pfp_bin.ko
33    1 0xffffffff829f2000     6558 amdgpu_raven_me_bin.ko
34    1 0xffffffff829f9000     4558 amdgpu_raven_ce_bin.ko
35    1 0xffffffff82e47000     b9c0 amdgpu_raven_rlc_bin.ko
36    1 0xffffffff82e53000    437e8 amdgpu_raven_mec_bin.ko
37    1 0xffffffff82e97000    437e8 amdgpu_raven_mec2_bin.ko
38    1 0xffffffff82edb000    5a638 amdgpu_raven_vcn_bin.ko

I'm still at a loss for how to get any improved type of
evidence. Spending time on the dnetc-related
scheduler benchmarking today has been a nice break
from pondering this.
Comment 152 George Mitchell 2023-03-28 14:43:26 UTC
As expected, I have had no crashes since avoiding drm-510-kmod and running in VESA mode.  Might it be worth updating 5.10.163_2 to 5.10.163_3?

Notes I haven't mentioned recently: Prior to FBSD 13, whenever I tried drm-510-kmod, my machine would lock up hard and not respond to anything other than cycling power.  I have an AMD Ryzen 3 2200G with Radeon Vega Graphics running on a Gigabyte B450M D53H motherboard.  Every time I boot up, I see the following ACPI warnings, which don't otherwise seem to affect operation:

Firmware Warning (ACPI): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20201113/tbfadt-796)
ACPI: \134AOD.WQBA: 1 arguments were passed to a non-method ACPI object (Buffer) (20201113/nsarguments-361)
ACPI: \134GSA1.WQCC: 1 arguments were passed to a non-method ACPI object (Buffer) (20201113/nsarguments-361)

Do any of you understand these?
Comment 153 Mark Millard 2023-03-28 16:50:42 UTC
(In reply to George Mitchell from comment #152)

I'm not sure what all is involved in setting up the VESA
usage test, but it sounds like it was a great test for
isolating the problem to the material associated with
amdgpu loading for your Radeon Vega Graphics context.

Are there any negative consequences to the use of VESA?

If the notes are simple/short could you supply instructions
so that I could try the analogous thing in the Polaris 11
context that I have access to?
Comment 154 Mark Millard 2023-03-28 17:02:36 UTC
(In reply to George Mitchell from comment #152)

Looked at my ACPI boot warning/error messages and I get just (with a little
context shown from the grep for ACPI lines):

acpi_wmi0: <ACPI-WMI mapping> on acpi0
ACPI: \134AOD.WQBA: 1 arguments were passed to a non-method ACPI object (Buffer) (20221020/nsarguments-361)
acpi_wmi1: <ACPI-WMI mapping> on acpi0
ACPI: \134GSA1.WQCC: 1 arguments were passed to a non-method ACPI object (Buffer) (20221020/nsarguments-361)
acpi_wmi2: <ACPI-WMI mapping> on acpi0

But I do not get anything analogous to your reported:

Firmware Warning (ACPI): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20201113/tbfadt-796)

So that last one has some chance of being involved in your context,
since I've been unable to reproduce your problems and the message
is unique to your context. (Only suggestive.)

Any chance that there is an UEFI update available for your machine?
Comment 155 Mark Millard 2023-03-28 22:15:02 UTC
(In reply to Mark Millard from comment #153)

Hmm. I see that:

https://docs.freebsd.org/en/books/handbook/x11/#x-install

reports:

"VESA module must be used when booting in BIOS mode and SCFB module must
be used when booting in UEFI mode."

My context is UEFI so VESA looks to be inappropriate for my context.

Your booting via BIOS (non-UEFI) vs. my booting via UEFI (non-BIOS)
is another context difference that might relate to my not managing
to reproduce the problems.
Comment 156 George Mitchell 2023-03-29 18:34:07 UTC
Ironically, I am presently forced back into using amdgpu.ko because the xorg-server update from 21.1.6,1 to 21.1.7,1 broke the VESA driver (bug #270509).
Comment 157 George Mitchell 2023-03-29 20:34:43 UTC
I forgot to mention earlier: Whenever I start chrome from a terminal window, I see the message:

amdgpu: os_same_file_description couldn't determine if two DRM fds reference the same file description.

Probably not related to this bug, but I thought I'd better mention it.
Comment 158 Graham Perrin freebsd_committer freebsd_triage 2023-04-01 19:42:15 UTC
(In reply to George Mitchell from comment #32)

> … I'm doing this testing on a desktop machine, …

(In reply to George Mitchell from comment #152)

> … not respond to anything other than cycling power. …

In that situation, does the system respond to a normal (not long) press on the power button? 

----

On my everyday notebook here, I have this in sysctl.conf(5): 

hw.acpi.power_button_state="S5"
Comment 159 George Mitchell 2023-04-01 19:58:58 UTC
(In reply to Graham Perrin from comment #158)
When I referred to cycling power, I meant by a long press of the power button, which worked just fine (except that I was going to have to run fsck on the next boot).  Also, that was when I was running FBSD 12 and I'm not in a position to repeat that test any more.  Thanks for the input.
Comment 160 Tomasz "CeDeROM" CEDRO 2023-04-13 12:49:11 UTC
I also use vbox + zfs + amdgpu. On 13.2-STABLE I had a kernel panic on vboxdrv / vboxnetadp load, so I switched to 13.1-RELEASE. Now, after upgrading to 13.2, I have this problem again. Maybe related?

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=270809
Comment 161 Mark Millard 2023-04-13 13:39:22 UTC
(In reply to Tomasz "CeDeROM" CEDRO from comment #160)

The package builds via 13.2-RELEASE have not even started yet.

Systems using/needing kernel-specific ports and updating via
binary packages should wait to upgrade to 13.2-RELEASE until
the packages are known to be available.

This is normal when a new release happens. FreeBSD does not hold
the release until after the packages are available. 13.1-RELEASE
is still supported for some time, but generally cannot use
13.2-RELEASE-based packages.
Comment 162 Tomasz "CeDeROM" CEDRO 2023-04-13 14:03:22 UTC
Thanks Mark :-) The problem is that, even when built from ports, the module crashes the kernel on load :-(
Comment 163 Mark Millard 2023-04-13 14:17:27 UTC
(In reply to Tomasz "CeDeROM" CEDRO from comment #162)

Crashing from having a wrong module vintage for the
kernel is normal/historical, as I understand it. So,
unfortunately, this is nothing new.

The package build servers will not start building
based on 13.2-RELEASE until 13.1-RELEASE goes EOL,
as I understand it. Prior to that, building from
source is what is supported when such kernel-dependent
ports are involved. FreeBSD still has some
build-from-source biases in its handling of things.
Resource limitations may well still be forcing such,
for all I know.

So, either wait to use 13.2-RELEASE or build and
install (some) ports via source-based builds if
you require ports with kernel-dependent modules.
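
For example, for the drm port involved in this report, a source-based
build would look roughly like the following (a sketch; other
kernel-dependent ports, such as emulators/virtualbox-ose-kmod,
follow the same pattern):

# build and install the kmod from ports (needs matching kernel sources in /usr/src)
cd /usr/ports/graphics/drm-510-kmod
make install clean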
Comment 164 Mark Millard 2023-04-13 14:32:34 UTC
(In reply to Tomasz "CeDeROM" CEDRO from comment #162)

Sorry that I misinterpreted some of the context/wording.

And nice to see that the 13.1-RELEASE build is rejected
with a message, now that I look again.
Comment 165 George Mitchell 2023-04-16 02:02:10 UTC
Created attachment 241523 [details]
Crash that happened neither at startup nor shutdown

Perhaps not related to my original crash, but undoubtedly a crash that happened in amdgpu code.  I was watching a movie using vlc.  I decided I was finished watching and typed control-q.  The screen froze with a frame from the movie still showing, and after a few seconds the machine rebooted and saved a coredump; the attached crash summary really doesn't resemble any of the earlier ones saved here.  Does anyone have any words of wisdom?

To avoid the startup crash, I had booted to single user mode and had kldloaded vboxnetflt and amdgpu before continuing to multiuser mode.
Comment 166 Tomasz "CeDeROM" CEDRO 2023-04-16 02:26:38 UTC
I got tired of all those VirtualBox problems.  I do not really care about that program anymore, whether its problems are related to amdgpu or zfs.  I have switched to bhyve, which can be easily managed from a shell with the vm utility [1].  I recommend doing the same.
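
For anyone curious, the basic flow with vm-bhyve is roughly as follows (a sketch from memory; see the project's documentation for the real details, and the dataset, NIC, and guest names here are just examples):

pkg install vm-bhyve
sysrc vm_enable="YES"
sysrc vm_dir="zfs:zroot/vm"    # example dataset; point at whatever you use
vm init
vm switch create public
vm switch add public em0       # example NIC name
vm create guest0
vm install guest0 fbsd.iso     # ISO previously fetched with 'vm iso'
vm start guest0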

[1] https://github.com/churchers/vm-bhyve
Comment 167 George Mitchell 2023-04-16 02:48:46 UTC
This is an amdgpu problem.  Although vboxnetflt is one of the kernel modules that can, in cooperation with amdgpu, exhibit the crash, zfs and acpi_wmi have also exhibited the same failure -- and the most recent crash summary contains no reference to vboxnetflt participating in the crash.  (It does show that I manually typed "kldload vboxnetflt" in single-user mode about an hour and a half before the crash occurred.)
Comment 168 George Mitchell 2023-04-17 22:21:52 UTC
After upgrading to 5.10.163_5 today, I haven't yet had this crash -- but I've booted only a couple of times so far and it's too soon to jump to any conclusions.
Comment 169 George Mitchell 2023-04-25 18:12:20 UTC
Created attachment 241741 [details]
Shutdown crash with version 5.10.163_5

5.10.163_5 still crashes.  This time it was at shutdown time.
Comment 170 George Mitchell 2023-04-25 22:46:16 UTC
Created attachment 241750 [details]
And another plain old boot time crash

I had thought I could artificially provoke the crash by booting to single user mode, loading the amdgpu, zfs, vboxnetflt, and acpi_wmi kernel modules in quick succession, and then continuing to multiuser mode.  But that didn't do it.  So yesterday I went back to the old way of loading zfs with "zfs_enable="YES"" in rc.conf instead of "zfs_load="YES"" in /boot/loader.conf, and loading amdgpu by setting kld_list="amdgpu" in rc.conf.  And now I get the crashes again.
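
For reference, the two arrangements contrasted above look roughly like this (only the lines I changed are shown; a sketch, and in the other arrangement amdgpu was being kldloaded by hand from single-user mode rather than from a config file):

# the "old way" (which crashes again) -- /etc/rc.conf:
zfs_enable="YES"
kld_list="amdgpu"

# what I had been doing instead -- /boot/loader.conf:
zfs_load="YES"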
Comment 171 Mark Millard 2023-04-25 23:02:18 UTC
(In reply to George Mitchell from comment #170)

I'm unclear on the contrasting case: when you use
/boot/loader.conf material instead of /etc/rc.conf
material what happens these days? No crashes?
Fairly rare crashes of the usual types? Fairly
rare crashes of other types? A mix of fairly rare
crashes of the 2 categories? (I may well not be
thinking of everything that would be of note.
So take the questions as just illustrative.)
Comment 172 Mark Millard 2023-04-26 23:24:29 UTC
One of the things that makes this hard to analyze is
that the first failure quickly leads to other failures,
and most of the evidence is for the later failure.
For example, in the following, note that the original
trap number is 12 but the backtrace is for/after
a later trap, of type-number 22 instead. There
is very little information directly about the
original trap type-number 12:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80bf3727
stack pointer	        = 0x28:0xfffffe000e1a7ba0
frame pointer	        = 0x28:0xfffffe000e1a7bd0
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 1 (init)
trap number		= 12
WARNING !drm_modeset_is_locked(&crtc->mutex) failed at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_4/drivers/gpu/drm/drm_atomic_helper.c:619
. . .
WARNING !drm_modeset_is_locked(&plane->mutex) failed at /usr/ports/graphics/drm-510-kmod/work/drm-kmod-drm_v5.10.163_4/drivers/gpu/drm/drm_atomic_helper.c:894
kernel trap 22 with interrupts disabled
                            kernel trap 22 with interrupts disabled
 panic: page fault
cpuid = 0
time = 1682435560
KDB: stack backtrace:
#0 0xffffffff80c66ee5 at kdb_backtrace+0x65
#1 0xffffffff80c1bbef at vpanic+0x17f
#2 0xffffffff80c1ba63 at panic+0x43
#3 0xffffffff810addf5 at trap_fatal+0x385
#4 0xffffffff810ade4f at trap_pfault+0x4f
#5 0xffffffff81084fd8 at calltrap+0x8
#6 0xffffffff8261d251 at spl_nvlist_free+0x61
#7 0xffffffff826dd740 at fm_nvlist_destroy+0x20
#8 0xffffffff827b6e95 at zfs_zevent_post_cb+0x15
#9 0xffffffff826dcd02 at zfs_zevent_drain+0x62
#10 0xffffffff826dcbf8 at zfs_zevent_drain_all+0x58
#11 0xffffffff826dede9 at fm_fini+0x19
#12 0xffffffff82713b94 at spa_fini+0x54
#13 0xffffffff827be303 at zfs_kmod_fini+0x33
#14 0xffffffff8262fb3b at zfs_shutdown+0x2b
#15 0xffffffff80c1b76c at kern_reboot+0x3dc
#16 0xffffffff80c1b381 at sys_reboot+0x411
#17 0xffffffff810ae6ec at amd64_syscall+0x10c
. . .

The primary hint about what code execution context led
to the original instance of trap type 12 above is
basically:

instruction pointer	= 0x20:0xffffffff80bf3727
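
(For what it's worth, the usual rough way to turn such an address
into a function and line, assuming a matching kernel with debug
symbols and kgdb from the gdb package, would be something along
these lines; I have not actually run this against this particular
dump:

kgdb /boot/kernel/kernel /var/crash/vmcore.N
(kgdb) list *0xffffffff80bf3727

where N is the number of the relevant dump.)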

amdgpu does not leave in place a clean context for
debugging kernel crashes. Trying to keep the video
context operational for a kernel that has crashed,
while not messing up the analysis context for the
original problem, is problematic.

My guess would be that the normal way to analyze such a
problem is to have it occur in a virtual machine, where
another (outer) context is available that is independent
and can look at the details from outside the failing
context. But even that would require the failing context
in the VM to stop before amdgpu or the like messed up the
evidence in the VM. (Not that I've ever done that type of
evidence gathering.)
Comment 173 George Mitchell 2023-04-27 00:17:34 UTC
Here are a collection of points in response to Mark Millard's request.

1. Regardless of the order in which I load kernel modules by hand in single-user mode, I can't ever duplicate the crash.

2. The crash never happens if amdgpu.ko is not loaded.

3. Emmanuel Vadot categorically states that the many, many references to drm_modeset_is_locked failures in the crash summaries are noise caused by virtual terminal switching and do not indicate drm failures.  But I still get crashes even when there are no virtual terminal switches (because I didn't start X windows and didn't type ALT-Fn).

4. The crash always happens after amdgpu.ko is loaded, and (in terms of time of occurrence) at about the time vboxnetflt.ko or acpi_wmi.ko is loaded.  The seemingly ZFS-related crash can happen even when zfs.ko is loaded before amdgpu.ko, and I theorize that it happens when my large (1 TB) USB ZFS-formatted drive comes online and gets tasted (after amdgpu.ko is loaded).

5. But I can't come up with any theory in which I can blame the actual crash on vboxnetflt.ko, acpi_wmi.ko, or zfs.ko.  This bug should not be assigned to freebsd-fs. But I can't tell you to whom it should be assigned.
Comment 174 George Mitchell 2023-05-17 22:13:48 UTC
Since my last note on April 27, I have been booting up in this manner:

1. Boot to single user mode.
2. Run a script that loads amdgpu.ko, zfs.ko, vboxnetflt.ko, and acpi_wmi.ko in immediate succession (a rough sketch of such a script appears below).
3. Exit to multiuser mode.

In the course of roughly 50-60 bootups, there have been only two crashes during single-user mode, but regrettably they leave no trace because the root partition is still mounted read-only.  At least, I think that's why there's no dump.  So something about single-user mode makes the crash much less likely to occur.  Anyway, jumping through these hoops does enable me to run my graphics with the improved driver.
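
The script mentioned in step 2 is nothing fancy; it is roughly the following (a sketch, not a verbatim copy of what I actually run):

#!/bin/sh
# load the four modules back to back, in the order I normally use;
# kldload pulls in declared dependencies (e.g. vboxdrv) automatically
kldload amdgpu
kldload zfs
kldload vboxnetflt
kldload acpi_wmi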
Comment 175 Graham Perrin freebsd_committer freebsd_triage 2023-05-17 23:26:52 UTC
(In reply to George Mitchell from comment #174)

> … crashes during single user mode, but regrettably they leave no trace 
> … the root partition is still mounted read-only. …

Hint (whilst in single-user mode): 

mount -uw / && zfs mount -a


sysrc dumpdev

– you'll probably find a different device, typically the swap partition. 


sysrc dumpdir

– you'll probably find /var/crash.


service dumpon describe

– if you boot in single user mode after a kernel panic, then /var/crash will not yet include information about the panic.


service savecore describe