Hello all,

It seems that there is a weird bug affecting my motherboard. Each time I turn on my computer (cold boot, not a reboot), the FreeBSD kernel starts up, I see the first few messages, and then the system reboots. On that second boot it succeeds and there are no issues. If I do a 'reboot', it works. However, if I do a 'poweroff' and then manually start the machine, it will again reboot once while booting and then continue working.

I tried isolating the issue by commenting out any modules that load (amdgpu, vboxdrv, and if_re). The built-in if_re doesn't support my mobo's ethernet chipset (RTL8125 2.5GbE Controller), so I have to use the net/realtek-re-kmod port's version of it.

Attached is some information about my system:

- FreeBSD 13.1-STABLE #4 stable/13-n253282-50f61166f7b9
- Motherboard: ASUS TUF GAMING X670E-PLUS WIFI -> https://www.asus.com/us/motherboards-components/motherboards/tuf-gaming/tuf-gaming-x670e-plus-wifi/

I'll be linking a video demonstrating the issue as well.
Created attachment 238814 [details] dmesg

In the video, the system reboots shortly after the following message (as described, this message also shows up when a subsequent reboot works, but there may be something to it on a cold boot):

Firmware Error (ACPI): Could not resolve symbol [\134_SB.PCI0.GPP7.UP00.DP40.UP00.DP68], AE_NOT_FOUND (20201113/dswload2-315)
ACPI Error: AE_NOT_FOUND, During name lookup/catalog (20201113/psobject-372)
Created attachment 238815 [details] loader.conf
Created attachment 238816 [details] pciconf -vl
Created attachment 238817 [details] rc.conf
Video can be found here: https://xyinn.org/freebsd/bugs/268393/1.mp4
I compiled a debugging kernel with the following options (to try to keep the panic from rebooting the machine, and to display more info). This successfully yielded more information :D. There seems to be a page fault happening in the AMD Raven HDA Controller.

Picture and Video
--------------
https://xyinn.org/freebsd/bugs/268393/268393-2.jpg
https://xyinn.org/freebsd/bugs/268393/268393-2.mp4

Kernel Options
--------------
include GENERIC
ident GENERIC-DEBUG
options KDB
options DDB
options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options DIAGNOSTIC

Info
----------
root@leslie:~ # uname -a
FreeBSD leslie 13.1-STABLE FreeBSD 13.1-STABLE #0 stable/13-n253296-384a885111ad: Tue Dec 20 12:46:46 EST 2022 root@leslie:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-DEBUG amd64
I forgot to also mention that the keyboard isn't functional at this stage. I tried disconnecting my keyboard usb receiver from my KVM to the front IO ports (USB 3.0) to see if that worked but nothing. I could try connecting it to the back ports to see if that helps. I'll need to check again if this motherboard has any USB 2.0 rear ports left since sometimes (on Linux, maybe FreeBSD) USB 3.0 ports have issues - especially if IOMMU is disabled on the BIOS and the system tries to remap the memory space to allow those ports to work.
I experimented a little more today to further narrow it down. The following allows the system to boot from a powered off (cold) state without crashing and rebooting once: /boot/loader.conf: hint.hdac.2.disabled=1 where hdac2 is the audio connector directly on the back of my mobo. Once the driver is attached, the Realtek ALC1220 is expanded and used. jon@leslie:~ $ dmesg | grep hdac hdac0: <ATI (0xab28) HDA Controller> mem 0xfcb20000-0xfcb23fff at device 0.1 on pci3 hdac1: <ATI (0x1640) HDA Controller> mem 0xfc988000-0xfc98bfff at device 0.1 on pci19 hdac2: <AMD Raven HDA Controller> mem 0xfc980000-0xfc987fff at device 0.6 on pci19 hdacc0: <ATI R6xx HDA CODEC> at cad 0 on hdac0 hdaa0: <ATI R6xx Audio Function Group> at nid 1 on hdacc0 hdacc1: <ATI R6xx HDA CODEC> at cad 0 on hdac1 hdaa1: <ATI R6xx Audio Function Group> at nid 1 on hdacc1 hdacc2: <Realtek ALC1220 HDA CODEC> at cad 0 on hdac2 hdaa2: <Realtek ALC1220 Audio Function Group> at nid 1 on hdacc2 The only problem with this is that I no longer have audio on this machine. I do see audio activity being redirected to my monitor's HDMI connection but my monitor doesn't have any speakers, but just writing this for completeness. If there is anything I can provide to someone to further debug this page fault, please lmk.
Huh, what a strange problem. Nice work narrowing it down. The ALC1220 is a very common chipset, I wonder if you've stumbled upon a weird edge case. Can you compile and boot a MINIMAL config 13 STABLE kernel and then load snd_hda by hand with kldload? (make sure you disable the sysctl that disables hdac2) Does the system still panic?
Hey John,

Thanks for that. I got some interesting results! It's been a few months since my last post and since then I've reinstalled FreeBSD; it's currently on 13.2-STABLE (stable/13-n254729-3912f99ecae6/GENERIC). There is nothing in /etc/sysctl.conf at the moment. So let's begin from that perspective.

I first compiled the /usr/src/sys/amd64/conf/MINIMAL kernel and rebooted. The first time I did this the system locked up since it couldn't find my root filesystem, which is on ZFS on an NVMe drive. After some digging, I added a few options (not the minimum options needed, but I cast a wide enough net within reason to allow the system to boot). After I got it booting successfully, I wasn't able to type anything. Makes sense.. MINIMAL has no USB support lol. I added those in as well, so I ended up with a MINIMAL config with the following extra info:

device crypto
device acpi
device nvme
device nvd
options ZSTDIO
device uhci
device ohci
device ehci
device xhci
device usb
device hid

---------

Now that I was in the system successfully, we can notice that the system didn't crash. I did a 'poweroff' as well to get the system back to the cold state which causes it to crash on boot (first time; once it's hot it won't crash). I did the 'kldload snd_hda' and the system immediately crashed, and I noticed some messages regarding 'drm-510-kmod'. I thought, ah! yea, I forgot I needed to comment out the kld_list in my /etc/rc.conf since I have 'amdgpu vboxdrv' in there. So I was thinking the AMD Radeon XT 6900 (sienna_cichlid) and snd_hda may be having a conflict. I commented out the kld_list line and did a 'poweroff' again. I turned the machine back on immediately and booted up. I did another 'kldload snd_hda' and the system didn't crash! I was like yea .. maybe there is a conflict between those two drivers. But I was skeptical.

I decided to do another 'poweroff' and wait 5 seconds before continuing, to give any internal system components time to properly reset themselves, just in case. After the 5 seconds, I turned it back on and booted. I did another 'kldload snd_hda', and the system crashed again! This time with no 'drm-510-kmod' messages, just a clean dump. So that makes me think that there potentially could be two issues here, or it could just be one underlying issue (the page fault) that's causing it to appear in two places. I also did a final test with 'vboxdrv' loaded and re-tested; that driver didn't conflict, I just crashed with the same scenario as just mentioned (without amdgpu loaded.. our clean dump).

I've attached the following crash dumps for inspection:

- 1_with_amdgpu_core.txt
- 2_clean_core.txt
- 3_clean_core.txt (this is the third run that has vboxdrv loaded but same info as without vboxdrv.. so vboxdrv doesn't seem to cause an issue).

Thank you!
Created attachment 240519 [details] 1_with_amdgpu_core.txt
Created attachment 240520 [details] 2_clean_core.txt
Created attachment 240521 [details] 3_clean_core.txt
I forgot to mention, I also tested this on 13.1-RELEASE and it also happens, so it's not something that was introduced after 13.1-RELEASE and is an existing bug.
Created attachment 240647 [details] debugging printfs for hda driver
(In reply to Jonathan Vasquez from comment #14)

Hi Jonathan,

All of the crash dump reports you posted appear to have panicked in the hdac_rirb_flush function, specifically when dereferencing the `rirb` pointer on line 968 of `sys/dev/sound/pci/hda/hdac.c`.

I'm thinking there *may* be a bug in calculating the `rirb` read pointer in the code just above the dereference. Would you patch your kernel hda driver (should work with stable 13) with the attached patch and post the results? It just adds a few debug prints to the hda driver. You'll also need to update a sysctl variable to ensure the debug prints actually show up. So the procedure is:

1) sysctl debug.bootverbose=1
2) patch hda driver and recompile
3) unload snd_hda and reload

Here's an example of what it looks like on a bhyve VM:

root@fbsd-current:~ # sysctl debug.bootverbose=1
root@fbsd-current:~ # kldload snd_hda
pci0: driver added
found-> vendor=0x8086, dev=0x27d8, revid=0x00
domain=0, bus=0, slot=6, func=0
class=04-03-00, hdrtype=0x00, mfdev=0
cmdreg=0x0406, statreg=0x0000, cachelnsz=0 (dwords)
lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
intpin=a, irq=18
pci0:0:6:0: reprobing on driver added
hdac0: <Intel 82801G HDA Controller> mem 0xc0004000-0xc0007fff irq 18 at device 6.0 on pci0
hdac0: PCI card vendor: 0x0000, device: 0x0000
hdac0: HDA Driver Revision: 20120126_0002
hdac0: Config options: on=0x00000000 off=0x00000000
ioapic0: routing intpin 18 (PCI IRQ 18) to lapic 2 vector 51
hdac0: Caps: OSS 4, ISS 4, BSS 0, NSDO 1, 64bit, CORB 256, RIRB 256
hdac0: rirb_base 0xfffffe01205ff000
hdac0: rirb_size 256
hdac0: sc->rirb_rp 1
hdac0: rirb address 0xfffffe01205ff008
hdac0: response 00008086
hdac0: rirb_base 0xfffffe01205ff000
hdac0: rirb_size 256
hdac0: sc->rirb_rp 2
hdac0: rirb address 0xfffffe01205ff010
hdac0: response 0000ffff
hdacc0: <Generic (0x8086) HDA CODEC> at cad 0 on hdac0
hdac0: rirb_base 0xfffffe01205ff000
hdac0: rirb_size 256
hdac0: sc->rirb_rp 3
hdac0: rirb address 0xfffffe01205ff018
hdac0: response 00010001
hdac0: rirb_base 0xfffffe01205ff000
hdac0: rirb_size 256
hdac0: sc->rirb_rp 4
...
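The read-pointer arithmetic visible in the bhyve log above (rirb_rp 1 maps to base + 0x8, rirb_rp 2 to base + 0x10, and so on) can be sketched in userland C. This is only an illustration, not the driver's code: `struct sketch_rirb`, `sketch_rirb_next`, and `sketch_rirb_slot_addr` are hypothetical names, and the 8-byte entry layout (two 32-bit words) is an assumption inferred from the 8-byte stride in the log.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for one RIRB entry. Assumed layout: two 32-bit
 * words, which matches the 8-byte stride between addresses in the log.
 */
struct sketch_rirb {
	uint32_t response;
	uint32_t response_ex;
};

/* Advance the read pointer by one slot, wrapping at the ring size. */
static unsigned
sketch_rirb_next(unsigned rp, unsigned ring_size)
{
	assert(ring_size != 0);
	return ((rp + 1) % ring_size);
}

/* Address of slot rp for a ring that starts at base. */
static uint64_t
sketch_rirb_slot_addr(uint64_t base, unsigned rp)
{
	return (base + (uint64_t)rp * sizeof(struct sketch_rirb));
}
```

With base 0xfffffe01205ff000 and a ring size of 256, slot 1 lands at 0xfffffe01205ff008 and slot 3 at 0xfffffe01205ff018, as in the log. If the base were NULL or the read pointer started from a stale value, the same arithmetic would produce an address outside any allocated ring, which is the kind of failure the debug prints are meant to expose.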
Hey John, Thanks for that. I've attached the crash dump with the patch applied to my MINIMAL config for both a cold boot and my dmesg output for a hot boot.
Created attachment 240800 [details] 2023-03-12-1625 - cold boot (with print patch applied)
Created attachment 240801 [details] 2023-03-12-1625 - hot boot dmesg (with print patch applied)
This bug appears to be a duplicate of https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264305
Hi Jonathan, Is the hot boot dmesg output truncated or was that all of it? From the output it appears hdac2 isn't properly initialized. The ring buffer read pointer (rirb_rp) starts at 189 instead of 1 like hdac_attach hasn't been run. There's no 'hdac2: <AMD Raven HDA Controller> mem 0xfc980000-0xfc987fff at device 0.6 on pci19' line. Does the hdac2 device work? Is it assigned an IRQ? sysctl hw.intrs | grep hdac
The hot boot logs are truncated from /var/log/messages directly at the point of BOOT. Everything else before that is from a previous boot instance. The 'dmesg' output with the print patch immediately overflows the dmesg ring buffer, so I can't get the beginning of the boot session through it after doing 'kldload snd_hda'. Regarding your comment about the other bug: yes, it seems to be similar if not identical.
(In reply to Jonathan Vasquez from comment #22) The duplicate comment was just to alert FreeBSD devs that this issue may be affecting others with the same hardware. No problem about the dmesg buffer, you can increase the buffer size to 1M by adding the following line to /boot/loader.conf: kern.msgbufsize=1048576 Thanks for the help debugging!
Hey John, No, thank you for helping me look into this :). I've attached a new cold core dump and a hot dmesg output that has an increased kernel buffer size. Now we can see everything. The interesting part starts at line 230 when the kldload snd_hda happened (snd_unit_init() u=0x00ff8000 [512] d=0x00007c00 [32] c=0x000003ff [1024]). - Jonathan
Created attachment 240909 [details] 2023-03-16-1919-cold-core
Created attachment 240910 [details] 2023-03-16-1919-hot-dmesg
Created attachment 241040 [details] lock hdac during init I have a theory that an interrupt is being triggered before the dma memory is allocated correctly for hdac2. I've included a patch that locks the hdac driver initialization sequence which should block any interrupts (that call hdac_rirb_flush) before the initialization completes. This patch is just for testing. You should see some WITNESS debugger complaints about bouncing off the lock. If the system actually boots and doesn't crash, will you send the dmesg?
Hey John,

That sounds good, will do. Do you want me to run this with `snd_hda` automatically being loaded as normal, or from the MINIMAL kernel? I'm guessing I'll need to enable WITNESS and other debugging features for this to be displayed? Could you give me a list of your suggested debugging options so I can add them to my MINIMAL kernel config? Thank you!

I also noticed something today (an anomaly, but still interesting). When I booted my machine from a cold boot (for the first time this morning), the system didn't crash. I suspect this may be because yesterday when I powered off my machine, FreeBSD didn't actually shut down completely when I ran the `poweroff` command. Usually when I run that command (which is my normal way of shutting down the machine for the most part), it runs its normal shutdown sequence, says "all buffers synced" at the end, and then powers off the machine completely. Yesterday it displayed everything up to "all buffers synced", but it got stuck and didn't shut down (almost as if it did a `halt` instead of a `poweroff`). After that, I hard powered off the machine by holding the power button, and that was the end of that until this morning. I wonder if some type of state remained even though it was powered off completely. Doing another `poweroff` this morning correctly shut down the machine, and turning it back on allowed the crash to re-occur. Weird, but worth mentioning.

- Jonathan
(In reply to Jonathan Vasquez from comment #28) Yes, compile using the GENERIC kernel config, it includes the witness debugger. Then boot from the kernel as normal and see if the system still crashes. There *may* be a race within the driver which only crashes given specific parameters. From the data you've already provided, it appears an interrupt handler is called before memory is initialized for it. Hopefully, locking the initialization functions will give us a clue to what's going on in your system. Intermittent bugs are the most annoying to debug because they can sometimes be difficult to replicate. We just have to keep poking at it to see if we can find a solution. :)
Hey John, I just got some time to try out the latest lock patch. I also updated to stable/13-n255040-535fc5f75e20 as well. Using the lock patch, the system didn't crash at boot up and we got some nice messages: hdac2: <AMD Raven HDA Controller> mem 0xfc980000-0xfc987fff at device 0.6 on pci19 hdac2: Unexpected unsolicited response from address 0: 00000000 ... pcib20: <ACPI PCI-PCI bridge>hdac2: Unexpected unsolicited response from address 0: 00000000 ... hdac2: Unexpected unsolicited response from address 0: 00000000 at device 8.3 on pci0 pci20: <ACPI PCI bus> on pcib20 xhci4: <XHCI (generic) USB 3.0 controller> mem 0xfcd00000-0xfcdfffff at device 0.0 on pci20 ... When in this state the system starts up without crashing, but the audio device is missing (which makes sense). Once I do another reboot (hot start), no more error messages show up and the audio devices are available. I've attached the hot/cold.txt labeled with "2023-04-09-2125". Thank you! Jonathan
Created attachment 241393 [details] cold-2023-04-09-2125.txt
Created attachment 241394 [details] hot-2023-04-09-2125.txt
Interesting! Okay, I have a couple of patches for you to try. The first patch reverts a commit to hdac.c that may have caused the problem to occur. The second patch introduces a delay before the driver attaches to the device. Thanks!
Created attachment 241460 [details] revert commit that coincides with occurance of problem
Created attachment 241461 [details] delay driver attach by 10 ms
Thanks John! I'll try those patches independently when I get a chance and report back.
Hey John, I just tested both patches (separately) on stable/13-n255158 and unfortunately no effect. The first one (revert) didn't work, and the second one I thought there may have been something to it since I didn't see the second freebsd logo (1st one is initial power on, second one shows up since the machine crashed), I was thinking maybe it was working, but there is a chance it wasn't since the monitor lost signal twice between booting up / delay / and re-presenting to the screen. So I compiled the kernel a few times with 1000, 5000, and again with 10000. The 1000 showed the second freebsd logo, 5000 also showed it, and then 10000 showed it when I tried it again. - Jonathan
Hi Jonathan, You have a common sound chipset that comes with a lot of recent AMD motherboards (within last couple of years). Here's a link to systems that have run a BSD hardware probe that have your same chipset: https://bsd-hardware.info/?view=search&vendorid=1022&deviceid=15e3&typeid=sound&page=1#list I'd expect to see other folks running into the same issue as they upgrade to their AMD systems. Have you booted other OS's on this system? Like Windows or Linux? Are there any problems with the sound on them? Is the BIOS upgraded with the latest revision? Thanks! John
Hey John, I've just uploaded my hw-probe just in case: https://bsd-hardware.info/?probe=d3373e972b I believe I ran Linux (and Windows 10) for maybe a few days, but I've been mostly on FreeBSD on this machine. I don't remember having any audio issues or reboot issues at start up, which is what caught me off guard in the beginning. I'm going to re-check my BIOS firmware now and update/re-test if anything. Will report back soon. - Jonathan
(will report back in a bit after the update, but there have been new updates since I bought/built this machine):

Current BIOS Version: 0821 (2022/11/15 reported on computer. Website says 2022/11/25).

Previous BIOSes (I already have these as part of 0821 but listed for the record)

- 0613 (2022/09/26) No patch notes other than: Before running the USB BIOS Flashback tool, please rename the BIOS file (TX670ELW.CAP) using BIOSRenamer.
- 0809 (2022/10/25) "1. Update AGESA version to ComboAM5PI 1.0.0.3 patch A 2. Improve system performance and stability 3. Improve GPU compatibility for GeForce RTX 40 series"

------------

BIOS updates released since (I don't have these yet but will update. I will be applying patch 1409, the latest version that isn't beta).

- 1222 (2023/02/24) "1. Update AGESA version to ComboAM5PI 1.0.0.5 patch C 2. Improve better performance for AMD new CPUs Before running the USB BIOS Flashback tool, please rename the BIOS file (TX670ELW.CAP) using BIOSRenamer."
- 1223 (2023/03/20) "Improve memory compatibility Before running the USB BIOS Flashback tool, please rename the BIOS file (TX670ELW.CAP) using BIOSRenamer."
- 1406 (2023/04/07) "1. Update AGESA version to ComboAM5PI 1.0.0.6 2. Please make sure to update to BIOS 1406 for better compatibility with the Ryzen™ 7000X3D series processor. 3. TPM 2.0 security update"
- 1408 (2023/04/13) "Improve system performance and stabilize AMD Ryzen 7000 X3D series processors. Before running the USB BIOS Flashback tool, please rename the BIOS file (TX670ELW.CAP) using BIOSRenamer."
- 1409 (2023/04/21) "1. Update AGESA version to ComboAM5PI 1.0.0.6 2. TPM 2.0 security update 3. Recommended for optimum performance with AMD Ryzen™ 7000X3D series processors Before running the USB BIOS Flashback tool, please rename the BIOS file (TX670ELW.CAP) using BIOSRenamer."
- 1410 (BETA - 2023/04/14) "Beta BIOS 1. Update AGESA version to ComboAM5PI 1.0.0.6 2. Supports high density DDR5 module 3. TPM 2.0 security update Please note that this is a beta BIOS version of the motherboard which is still undergoing final testing before its official release. The UEFI, its firmware and all content found on it are provided on an “as is” and “as available” basis. ASUS does not give any warranties, whether express or limited, as to the suitability, compatibility, or usability of the UEFI, its firmware or any of its content. Except as provided in the Product warranty and to the maximum extent permitted by law, ASUS is not responsible for direct, special, incidental or consequential damages resulting from using this beta BIOS. Before running the USB BIOS Flashback tool, please rename the BIOS file (TX670ELW.CAP) using BIOSRenamer."
Alright, I'm back and updated on the latest stable bios 1409. Unfortunately, this did not fix the issue. But I have some very interesting information to report. 1. I'm able to reproduce the error by booting from the FreeBSD 13.2 USB from a cold state. 2. Booting from a Fedora 37 USB from a cold boot worked properly. 3. This is the interesting part................ If I boot from Fedora 37 from a cold boot, and then I completely power off the machine, and then boot up FreeBSD from a cold boot, the machine will _NOT_ crash. I tried this entire cycle twice and both times it worked fine. Booting FreeBSD and then shutting it down completely, and then starting it up again will cause the problem to arise. This makes me think that there is a driver issue in either the FreeBSD start up sequence, or most specifically, the shutdown sequence, where some flags on the hardware are set to an invalid state, which causes FreeBSD to freak out during a subsequent cold boot. The Linux side must be "correcting" this issue by either clearing out some flags on the chip, or just overriding the flags already on the chip with the correct ones in a way that doesn't cause a crash. I've attached the Fedora 37 cold boot dmesg just in case it may provide anything (it is just a normal dmesg, not a verbose one).
Created attachment 241709 [details] Fedora 37 dmesg (2023-04-24)
I vaguely remember booting from the FreeBSD 13.0 (or 13.1) USB on this machine and remembering it not crashing, but I couldn't remember correctly if this was the case. The above situation would make me believe that my memory wasn't failing me and it was just the sequence of events I did things in when I had Linux on the machine (or Windows) and then was installing FreeBSD. If it was a Linux/Windows -> FreeBSD migration, it makes sense that from that state, the machine would be "good" so FreeBSD wouldn't freak out. But I could have also been booting from a hot boot if I did something like "boot machine, backup files, reboot to FreeBSD USB". Hard to know at this point lol. Another thing was that the UEFI BIOS Firmware file on my UEFI USB for my old BIOS firmware (0821) had a last modified timestamp of 2022/12/05, this is probably the date I downloaded the firmware and upgraded my BIOS from whatever version it came with from the store, to this one.
(In reply to Jonathan Vasquez from comment #43) Huh, likely a driver problem then. Have you tested booting a 13.0 or 12.x kernel on this system? I'm wondering if the problem was introduced in 13.1.
It existed throughout all of the 13.X releases I've used on this machine; I haven't tried 12.X on it. But I think I'm calling it quits on this one. It's getting close to the half-year mark and I don't want to keep putting more energy into this. Thanks for helping debug this, John, I really appreciated it. Hopefully this ticket can be resolved eventually for some future people. I'll see what I'll do about continued FreeBSD usage on my desktops/laptops.
Thanks for the bug report. Hopefully someone finds it helpful in the future. Sorry we didn't get it fixed!
I did a quick search online to see if anyone else was having this issue, it seems I found someone else with a different machine but seems to be the same driver: https://www.reddit.com/r/freebsd/comments/vwkx2w/kernel_paniccrash_every_boot_after_a_shutdown/ https://forum.opnsense.org/index.php?topic=29276.msg141378#msg141378 We can see from their screenshot (https://imgur.com/a/0UGkRta) that they also happen to be using an "AMD Raven HDA Controller". Their workaround is to disable the HD Audio Controller in the BIOS (which we already tested as a workaround that stops the crashing, but then I have no audio, given that this is for desktop use and not server, I can't really go with this). However, I'm thinking it may just be easier to disable it and buy a better supported Audio card, avoiding the integrated one completely. I'll need to research for better supported audio cards for FreeBSD and see what happens.
Hey all,

So I spent a few hours today debugging this issue on 13.2-RELEASE and I have interesting stuff to report.

TLDR:

1. There definitely seems to be a race condition somewhere with how either the AMD Raven HDA Controller is being enumerated, or how it's being accessed.
2. I was able to build on John's idea regarding the delays and come up with something that seems to no longer crash my system. I don't think it would be an acceptable solution, though, since it would introduce a delay to all "hdac_intr_handler()" calls for any device that uses that function. But I'll keep testing it locally to see if I notice any new types of weirdness (outside of any known ones that I've experienced before this patch), and also because I don't want my system to keep crashing. A side note is that I ordered 2 PCIe sound cards that I want to check for FreeBSD compatibility, which would help mitigate this issue if anything. Best case scenario, we fix this issue, and I also end up having a better sounding sound card that's not the on-board sound :).
3. We can experience different severity levels depending on the length of the delay.

-----

So this is what the patch looks like in order to allow my system to no longer crash on first boot:

diff --git a/sys/dev/sound/pci/hda/hdac.c b/sys/dev/sound/pci/hda/hdac.c
index 9aa0e4bffdc8..e9d581a422cb 100644
--- a/sys/dev/sound/pci/hda/hdac.c
+++ b/sys/dev/sound/pci/hda/hdac.c
@@ -378,6 +378,11 @@ hdac_one_intr(struct hdac_softc *sc, uint32_t intsts)
 static void
 hdac_intr_handler(void *context)
 {
+	/*
+	 * Add slight delay to avoid crashes with AMD Raven HDA Controllers
+	 */
+	DELAY(5000);
+
 	struct hdac_softc *sc;
 	uint32_t intsts;

-----

- If there is no DELAY (the default), the system will crash.
- If there is a DELAY of 1000, the system won't crash, but we will see access errors! Which is revealing. Example:

hdac2: <AMD Raven HDA Controller> mem 0xfc980000-0xfc987fff at device 0.6 on pci19
hdac2: Unexpected unsolicited response from address 0: 00000000
hdac2: Unexpected unsolicited response from address 0: 00000000
hdac2: Unexpected unsolicited response from address 0: 00000000
hdac2: Unexpected unsolicited response from address 0: 00000000

- If there is a DELAY of 5000, the system won't crash, and we no longer see any errors.

In the situations where I don't use delays (and leading up to this reduced solution), I was able to have the machine stop crashing if I added at least 4 printf statements lol. If I used 3 printf, it would crash. I suppose 4 printf is roughly equal to a DELAY of 5000 for me.

As stated before, with the above patch, the machine no longer crashes for me on a cold boot. I was also able to access and use my pcm8 device immediately and sound worked. This is progress.

I've attached the following files:

- bad.0.txt - Shows the access errors with a delay of 1000 with my previous expanded debug messages.
- good.0.txt - Shows a good cold boot with a delay of 5000 with my previous expanded debug messages.
- bad.1.txt - Shows the access errors with a delay of 1000 (minimal logging).

root@weshly:/usr/src # uname -a
FreeBSD weshly 13.2-RELEASE-p1 FreeBSD 13.2-RELEASE-p1 #23 releng/13.2-n254621-08b87f63a046-dirty: Thu Jul 6 21:22:10 EDT 2023 root@weshly:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

debugging on:

commit 08b87f63a046bd966bd0ed548211ae98ff50e638 (HEAD -> releng/13.2, origin/releng/13.2)
Author: Gordon Tetlow <gordon@FreeBSD.org>
Date: Tue Jun 20 22:40:02 2023 -0700

Add UPDATING entries and bump version.

Approved by: so
Created attachment 243289 [details] bad.0.txt
Created attachment 243290 [details] bad.1.txt
Created attachment 243291 [details] good.0.txt
Just to add some more info and subscribe myself to this issue: I am having the same problem on 14.0-CURRENT (main-n263810-eb550615eff7) using Tuxedo Pulse 15 Gen1 laptop, https://bsd-hardware.info/?probe=6f779d5170. If I reboot the system from FreeBSD boot menu on the first start, it boots successfully on the second attempt. No issues with Ubuntu 20/22 on the same machine.
For convenience (In reply to Jonathan Vasquez from comment #39) > … my hw-probe … Three hdac detected lines under <https://bsd-hardware.info/?probe=d3373e972b#pci:1002-1640-1043-8877>. Condensed: 1002:1640:1043:8877 AMD Rembrandt Radeon High Definition Audio Controller 1002:ab28:1002:ab28 AMD Navi 21/23 HDMI/DP Audio Controller 1022:15e3:1043:886d AMD Family 17h/19h HD Audio Controller (In reply to Alexander Sherikov from comment #52) Two hdac detected lines under <https://bsd-hardware.info/?probe=6f779d5170#pci:1002-1637-1d05-109f>. Condensed: 1002:1637:1d05:109f AMD Renoir Radeon High Definition Audio Controller 1022:15e3:1d05:109f AMD Family 17h/19h HD Audio Controller
So what happens if you just revert the locking part of the commit that created the problem? E.g., add just the 0xffffffff check?
*** Bug 272878 has been marked as a duplicate of this bug. ***
*** Bug 273151 has been marked as a duplicate of this bug. ***
(In reply to Warner Losh from comment #54)

You mean like this?

diff --git a/sys/dev/sound/pci/hda/hdac.c b/sys/dev/sound/pci/hda/hdac.c
index 79ab71516cd9..78c99db8e813 100644
--- a/sys/dev/sound/pci/hda/hdac.c
+++ b/sys/dev/sound/pci/hda/hdac.c
@@ -393,13 +393,13 @@ hdac_intr_handler(void *context)
 	 * re-examine GIS then we can leave it set and never get an interrupt
 	 * again.
 	 */
-	hdac_lock(sc);
 	intsts = HDAC_READ_4(&sc->mem, HDAC_INTSTS);
 	while (intsts != 0xffffffff && (intsts & HDAC_INTSTS_GIS) != 0) {
+		hdac_lock(sc);
 		hdac_one_intr(sc, intsts);
+		hdac_unlock(sc);
 		intsts = HDAC_READ_4(&sc->mem, HDAC_INTSTS);
 	}
-	hdac_unlock(sc);
 }
(In reply to Warner Losh from comment #54) I just removed intsts != 0xffffffff && from while (intsts != 0xffffffff && (intsts & HDAC_INTSTS_GIS) != 0) { to get while ((intsts & HDAC_INTSTS_GIS) != 0) { and no more hangs on cold boot. Sound also works. Reverting https://reviews.freebsd.org/D34117 should also fix the issue.
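To make the role of the `intsts != 0xffffffff` test concrete, here is a minimal userland sketch of the loop's exit condition. It is not the driver code: `sketch_count_serviced` and `SKETCH_INTSTS_GIS` are hypothetical names, the GIS bit is assumed to be bit 31, and successive register reads are simulated with an array instead of `HDAC_READ_4()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: the global interrupt status bit is bit 31. */
#define SKETCH_INTSTS_GIS 0x80000000u

/*
 * Count how many loop iterations would service an interrupt, given a
 * sequence of simulated INTSTS reads. The loop stops when GIS is clear
 * or, if check_all_ones is set, when the register reads back as all ones.
 * The array length bounds the simulation; a real loop has no such bound.
 */
static unsigned
sketch_count_serviced(const uint32_t *reads, size_t n, bool check_all_ones)
{
	unsigned serviced = 0;

	assert(reads != NULL);
	for (size_t i = 0; i < n; i++) {
		uint32_t intsts = reads[i];

		if (check_all_ones && intsts == 0xffffffffu)
			break;
		if ((intsts & SKETCH_INTSTS_GIS) == 0)
			break;
		serviced++;
	}
	return (serviced);
}
```

A register that keeps reading as 0xffffffff always has bit 31 set, so without the all-ones test the simulated loop only ends when the array runs out. That is one plausible reading of why dropping the test could hang a machine whose controller is not responding, though the thread does not confirm the mechanism.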
(In reply to Oleh Hushchenkov from comment #58) We tested reverting the patch completely in comment #34 and Jonathan reported that the sound didn't work but the system booted. Removing 'intsts != 0xffffffff &&' will cause my laptop to freeze after waking from suspend! :)
(In reply to John Grafton from comment #59) > We tested reverting the patch completely in comment #34 and Jonathan reported that the sound didn't work but the system booted. I see. > Removing 'intsts != 0xffffffff &&' will cause my laptop to freeze after waking from suspend! :) I didn't notice because I don't use suspend. So for me it's a viable workaround.
To be clear, the DELAY(5000) that I mention in comment 48 (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=268393#c48) allowed the system to boot fine and sound worked (without reverting any current code), but I don't know if adding a generic DELAY like that is good since every interrupt call will have that delay.. atm we only know that some AMD specific audio chips would require that workaround. However, if reverting existing code can help fix the issue and doesn't cause issues for others, that would be much better than the DELAY.
https://reviews.freebsd.org/D34117 is the best candidate :) It landed after the new year and before my report: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264305 and I do not see other relevant changes in this time frame. I do not like DELAY() in interrupt handler code; it could break something else. I will test reverting D34117 in the near future.
Why has no one suggested a patch like this?

diff --git a/sys/dev/sound/pci/hda/hdac.c b/sys/dev/sound/pci/hda/hdac.c
index 82b1baacfa9..f7be436aef6 100644
--- a/sys/dev/sound/pci/hda/hdac.c
+++ b/sys/dev/sound/pci/hda/hdac.c
@@ -956,6 +956,8 @@ hdac_rirb_flush(struct hdac_softc *sc)
 	int ret;

 	rirb_base = (struct hdac_rirb *)sc->rirb_dma.dma_vaddr;
+	if (rirb_base == NULL)
+		return (0);
 	rirbwp = HDAC_READ_1(&sc->mem, HDAC_RIRBWP);
 	bus_dmamap_sync(sc->rirb_dma.dma_tag, sc->rirb_dma.dma_map,
 	    BUS_DMASYNC_POSTREAD);
@@ -965,6 +967,8 @@ hdac_rirb_flush(struct hdac_softc *sc)
 		sc->rirb_rp++;
 		sc->rirb_rp %= sc->rirb_size;
 		rirb = &rirb_base[sc->rirb_rp];
+		if (rirb == NULL)
+			break;
 		resp = le32toh(rirb->response);
 		resp_ex = le32toh(rirb->response_ex);
 		cad = HDAC_RIRB_RESPONSE_EX_SDATA_IN(resp_ex);
(In reply to Ivan Rozhuk from comment #63) This one also avoids crash on first boot.
Created attachment 244703 [details] patch
(In reply to Oleh Hushchenkov from comment #64)
Can you please test the attached patch? (It uses continue instead of break on a NULL pointer.)
(In reply to Ivan Rozhuk from comment #66)
> Can you please test attached patch?
> (it use continue instead break on NULL pointer)

Attached patch also works for me. No panic on first boot, sound works.
(In reply to Jonathan Vasquez from comment #18) Looking at this log, it seems that we're getting an interrupt before hdac2 has finished initializing itself. In particular, hdac_attach() allocates an MSI vector prior to allocating the RIRB DMA buffer. I'm not particularly sure why my commit would introduce a problem there. Could anyone affected by the problem please test moving the hdac_irq_alloc() call in hdac_attach() to just after the hdac_rirb_init() call? I can't easily provide a patch at the moment but it should be easy to do.
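To illustrate the ordering problem described above, here is a minimal userspace sketch in C. It is not the real driver code: struct fake_softc, fake_intr(), fake_attach() and fake_detach() are hypothetical names invented for this example, and the "interrupt" is just a function call made after every attach step to model a spurious interrupt.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver state; not the real struct hdac_softc. */
struct fake_softc {
	unsigned int *rirb;	/* response ring, NULL until allocated */
	bool irq_registered;	/* true once the handler can be called */
	int faults;		/* handler runs that found rirb == NULL */
};

/* Simulated interrupt: only reaches the handler body once it is registered. */
static void
fake_intr(struct fake_softc *sc)
{
	if (!sc->irq_registered)
		return;
	if (sc->rirb == NULL) {
		sc->faults++;	/* a real handler would dereference NULL here */
		return;
	}
	sc->rirb[0] = 0;	/* touch the ring, as a real handler would */
}

/*
 * Attach while a spurious interrupt fires after every step.
 * irq_first = true models registering the handler before the ring exists.
 * Returns the number of faulting handler runs, or -1 if allocation failed.
 */
static int
fake_attach(struct fake_softc *sc, bool irq_first)
{
	sc->rirb = NULL;
	sc->irq_registered = false;
	sc->faults = 0;

	if (irq_first) {
		sc->irq_registered = true;
		fake_intr(sc);
	}
	sc->rirb = calloc(256, sizeof(*sc->rirb));
	if (sc->rirb == NULL)
		return (-1);
	fake_intr(sc);
	if (!irq_first) {
		sc->irq_registered = true;
		fake_intr(sc);
	}
	return (sc->faults);
}

/* Undo fake_attach(): stop delivering interrupts, then free the ring. */
static void
fake_detach(struct fake_softc *sc)
{
	sc->irq_registered = false;
	free(sc->rirb);
	sc->rirb = NULL;
}
```

In this model, fake_attach(&sc, true) reports one faulting handler run, while fake_attach(&sc, false) reports none, which is the effect that moving the interrupt allocation after the buffer allocation is meant to achieve.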
(In reply to Mark Johnston from comment #68)
Just moved hdac_irq_alloc() call in hdac_attach() to be after the hdac_rirb_init() call and rebuilt the kernel. It fixed the issue for me.

Now I'm wondering which patch should I use?
1. Removing "intsts != 0xffffffff" from "while (intsts != 0xffffffff && (intsts & HDAC_INTSTS_GIS) != 0)" in "hdac_intr_handler()".
2. Adding "rirb_base == NULL" and "rirb == NULL" checks in "hdac_rirb_flush()".
3. Moving "hdac_irq_alloc()" to be after "hdac_rirb_init()" in "hdac_attach()".
(In reply to Mark Johnston from comment #68)
> I'm not particularly sure why my commit would introduce a problem there.

I'm not familiar with the code base, but to me it looks like D34117 added "intsts != 0xffffffff" to the "while()" condition and thus changed some timings. The important part is that removing "intsts != 0xffffffff" from the "while()" condition fixes the issue, but suspend stops working again.
This and Mark's comment lead me to believe we are getting stray interrupts of various flavors. The 0xffffffff hack will keep things working while we're accessing the device in an ISR with the card unmapped or asleep. I think this also suggests some sloppiness with establishing or tearing down interrupts. Mark's case of getting an interrupt before things are complete is another example.
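A rough userspace model of why the 0xffffffff check matters is sketched below. It assumes that a read from an unmapped or sleeping device returns all ones, and that the global interrupt status flag is the top bit of the register; struct fake_dev, FAKE_INTSTS_GIS, fake_read_intsts() and fake_intr_loop() are hypothetical names for this example, and the max_iter cap exists only so the demonstration terminates.

```c
#include <stdint.h>

/* Assumed for this sketch: global interrupt status is the top bit. */
#define FAKE_INTSTS_GIS	0x80000000u

/* Hypothetical device model; not the real hdac register interface. */
struct fake_dev {
	int powered;		/* 0 models a suspended or unmapped device */
	uint32_t intsts;	/* value returned while powered */
};

/* A read from a device that is not there is modelled as all ones. */
static uint32_t
fake_read_intsts(const struct fake_dev *dev)
{
	return (dev->powered ? dev->intsts : 0xffffffffu);
}

/*
 * Service interrupts until GIS clears. With use_guard set, an all-ones
 * read ends the loop immediately. Returns the number of iterations,
 * capped at max_iter so that the unguarded case cannot spin forever.
 */
static int
fake_intr_loop(struct fake_dev *dev, int use_guard, int max_iter)
{
	uint32_t intsts;
	int n = 0;

	intsts = fake_read_intsts(dev);
	while ((!use_guard || intsts != 0xffffffffu) &&
	    (intsts & FAKE_INTSTS_GIS) != 0 && n < max_iter) {
		dev->intsts = 0;	/* servicing clears status on a live device */
		n++;
		intsts = fake_read_intsts(dev);
	}
	return (n);
}
```

With a powered device the loop runs once either way. With an unpowered device the guarded loop exits at once, while the unguarded loop only stops at the cap, which is consistent with the freeze after resume reported earlier in the thread, though the sketch does not prove that this is the exact mechanism.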
(In reply to Oleh Hushchenkov from comment #69) I suggest committing options 2 and 3 together to make sure that we never see this panic again.
(In reply to Warner Losh from comment #71) So how should it be fixed the right way? The issue affects many different laptops/computers, apparently AMD-based only. That said, we can't blame buggy hardware/firmware: there are too many different devices from different vendors. At least my laptop is running the latest BIOS/UEFI. At the same time, Linux works well on such hardware.
(In reply to Oleh Hushchenkov from comment #69) The right way is option number 3. I'll get a patch along those lines reviewed and committed later this week.
(In reply to Mark Johnston from comment #74) I got it. Thank you Mark.
https://reviews.freebsd.org/D41883 Any additional testing would be welcome.
(In reply to Mark Johnston from comment #76) Why not add the additional checks from https://bugs.freebsd.org/bugzilla/attachment.cgi?id=244703&action=diff as well?
(In reply to Ivan Rozhuk from comment #77) Because they do not fix the underlying problem, and the second check is incorrect.
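The thread does not spell out why the second check is incorrect. One likely reading, offered here as an assumption, is that the address of an element in a valid array is never NULL, so a test on the result of &rirb_base[...] cannot detect anything once rirb_base itself is known to be non-NULL. A small C sketch with hypothetical names (struct fake_rirb, entry_is_null()) shows this:

```c
#include <stddef.h>

/* Hypothetical ring entry; it only mirrors the two fields used in the patch. */
struct fake_rirb {
	unsigned int response;
	unsigned int response_ex;
};

/*
 * Takes the address of base[idx] the same way the patched loop does and
 * reports whether it is NULL. The caller must pass a valid array and an
 * in-range index.
 */
static int
entry_is_null(struct fake_rirb *base, size_t idx)
{
	struct fake_rirb *rirb = &base[idx];

	return (rirb == NULL);
}
```

For any valid ring and in-range index, entry_is_null() returns 0, so the added check would be dead code in that situation.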
(In reply to Mark Johnston from comment #76) Works for me on Tuxedo laptop (https://bsd-hardware.info/?probe=6f779d5170#pci:1002-1637-1d05-109f), FreeBSD 15.0-CURRENT #2 main-n265359-d643925a79ca-dirty.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=015daf5221f7588b9258fe0242cee09bde39fe21

commit 015daf5221f7588b9258fe0242cee09bde39fe21
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2023-09-27 12:23:58 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-09-27 12:28:27 +0000

    hdac: Defer interrupt allocation in hdac_attach()

    hdac_attach() registers an interrupt handler before allocating various
    driver resources which are accessed by the interrupt handler. On some
    platforms we observe what appear to be spurious interrupts upon a cold
    boot, resulting in panics. Partially work around the problem by
    deferring irq allocation until after other resources are allocated. I
    think this is not a complete solution, but is correct and sufficient to
    work around the problems reported in the PR.

    PR:             268393
    Tested by:      Alexander Sherikov <asherikov@yandex.com>
    Tested by:      Oleh Hushchenkov <o.hushchenkov@gmail.com>
    MFC after:      1 week
    Differential Revision:  https://reviews.freebsd.org/D41883

 sys/dev/sound/pci/hda/hdac.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
*** Bug 264305 has been marked as a duplicate of this bug. ***
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=32ca3712a8c4295d14770ad9cc55c1c837d834ad

commit 32ca3712a8c4295d14770ad9cc55c1c837d834ad
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2023-09-27 12:23:58 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-04 13:43:14 +0000

    hdac: Defer interrupt allocation in hdac_attach()

    hdac_attach() registers an interrupt handler before allocating various
    driver resources which are accessed by the interrupt handler. On some
    platforms we observe what appear to be spurious interrupts upon a cold
    boot, resulting in panics. Partially work around the problem by
    deferring irq allocation until after other resources are allocated. I
    think this is not a complete solution, but is correct and sufficient to
    work around the problems reported in the PR.

    PR:             268393
    Tested by:      Alexander Sherikov <asherikov@yandex.com>
    Tested by:      Oleh Hushchenkov <o.hushchenkov@gmail.com>
    MFC after:      1 week
    Differential Revision:  https://reviews.freebsd.org/D41883

    (cherry picked from commit 015daf5221f7588b9258fe0242cee09bde39fe21)

 sys/dev/sound/pci/hda/hdac.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=1e8737f4e884fdf4b966056662e4e6003d3379d9

commit 1e8737f4e884fdf4b966056662e4e6003d3379d9
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2023-09-27 12:23:58 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-04 13:41:52 +0000

    hdac: Defer interrupt allocation in hdac_attach()

    hdac_attach() registers an interrupt handler before allocating various
    driver resources which are accessed by the interrupt handler. On some
    platforms we observe what appear to be spurious interrupts upon a cold
    boot, resulting in panics. Partially work around the problem by
    deferring irq allocation until after other resources are allocated. I
    think this is not a complete solution, but is correct and sufficient to
    work around the problems reported in the PR.

    PR:             268393
    Tested by:      Alexander Sherikov <asherikov@yandex.com>
    Tested by:      Oleh Hushchenkov <o.hushchenkov@gmail.com>
    MFC after:      1 week
    Differential Revision:  https://reviews.freebsd.org/D41883

    (cherry picked from commit 015daf5221f7588b9258fe0242cee09bde39fe21)

 sys/dev/sound/pci/hda/hdac.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
The patch will appear in the next 14.0 pre-release build.
Thanks for all the hard work everyone on this. Much appreciated!
A commit in branch releng/14.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=901d81c3e0f43cb0e4e10bb42ab9f0a71cfcda0a

commit 901d81c3e0f43cb0e4e10bb42ab9f0a71cfcda0a
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2023-09-27 12:23:58 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-05 14:01:19 +0000

    hdac: Defer interrupt allocation in hdac_attach()

    hdac_attach() registers an interrupt handler before allocating various
    driver resources which are accessed by the interrupt handler. On some
    platforms we observe what appear to be spurious interrupts upon a cold
    boot, resulting in panics. Partially work around the problem by
    deferring irq allocation until after other resources are allocated. I
    think this is not a complete solution, but is correct and sufficient to
    work around the problems reported in the PR.

    Approved by:    re (gjb)
    PR:             268393
    Tested by:      Alexander Sherikov <asherikov@yandex.com>
    Tested by:      Oleh Hushchenkov <o.hushchenkov@gmail.com>
    MFC after:      1 week
    Differential Revision:  https://reviews.freebsd.org/D41883

    (cherry picked from commit 015daf5221f7588b9258fe0242cee09bde39fe21)
    (cherry picked from commit 1e8737f4e884fdf4b966056662e4e6003d3379d9)

 sys/dev/sound/pci/hda/hdac.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)