Summary: | panic: Unhandled external data abort - handle_el1h_sync - --- exception, esr 0x96000410 - wait_fw_init - mlx5_load_one | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Dave Cottlehuber <dch> | ||||||||||
Component: | kern | Assignee: | John Baldwin <jhb> | ||||||||||
Status: | Closed FIXED | ||||||||||||
Severity: | Affects Only Me | CC: | glebius, jhb, tuexen | ||||||||||
Priority: | --- | Keywords: | crash | ||||||||||
Version: | 15.0-CURRENT | ||||||||||||
Hardware: | arm64 | ||||||||||||
OS: | Any | ||||||||||||
Attachments: |
|
^Triage: notify jhb of possible problem with his change. I also see it on my 32 core Ampere system: CPU 0: APM eMAG 8180 r3p2 affinity: 0 0 Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG> Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL> Instruction Set Attributes 1 = <> Instruction Set Attributes 2 = <> Trying to mount root from ufs:/dev/ada0p2 [rw]... Processor Features 0 = <GIC,AdvSIMD,FP,EL3,EL2,EL1 32,EL0 32> Processor Features 1 = <> Memory Model Features 0 = <TGran4,TGran64,TGran16,SNSMem,BigEnd,16bit ASID,4TB PA> Memory Model Features 1 = <8bit VMID> Memory Model Features 2 = <32bit CCIDX,48bit VA> Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8> Debug Features 1 = <> Auxiliary Features 0 = <> Auxiliary Features 1 = <> AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL> AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD> AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ> CPU 1: APM eMAG 8180 r3p2 affinity: 0 1 CPU 2: APM eMAG 8180 r3p2 affinity: 1 0 CPU 3: APM eMAG 8180 r3p2 affinity: 1 1 CPU 4: APM eMAG 8180 r3p2 affinity: 2 0 CPU 5: APM eMAG 8180 r3p2 affinity: 2 1 CPU 6: APM eMAG 8180 r3p2 affinity: 3 0 CPU 7: APM eMAG 8180 r3p2 affinity: 3 1 CPU 8: APM eMAG 8180 r3p2 affinity: 4 0 CPU 9: APM eMAG 8180 r3p2 affinity: 4 1 CPU 10: APM eMAG 8180 r3p2 affinity: 5 0 CPU 11: APM eMAG 8180 r3p2 affinity: 5 1 CPU 12: APM eMAG 8180 r3p2 affinity: 6 0 CPU 13: APM eMAG 8180 r3p2 affinity: 6 1 CPU 14: APM eMAG 8180 r3p2 affinity: 7 0 CPU 15: APM eMAG 8180 r3p2 affinity: 7 1 CPU 16: APM eMAG 8180 r3p2 affinity: 8 0 CPU 17: APM eMAG 8180 r3p2 affinity: 8 1 CPU 18: APM eMAG 8180 r3p2 affinity: 9 0 CPU 19: APM eMAG 8180 r3p2 affinity: 9 1 CPU 20: APM eMAG 8180 r3p2 affinity: 10 0 CPU 21: APM eMAG 8180 r3p2 affinity: 10 1 CPU 22: APM eMAG 8180 r3p2 affinity: 11 0 CPU 23: APM eMAG 8180 r3p2 affinity: 11 1 CPU 24: APM eMAG 8180 r3p2 affinity: 12 0 CPU 25: APM eMAG 8180 r3p2 affinity: 12 1 CPU 26: APM eMAG 8180 r3p2 affinity: 13 0 CPU 27: APM eMAG 8180 r3p2 affinity: 13 1 CPU 28: APM eMAG 8180 r3p2 affinity: 14 0 CPU 29: APM eMAG 8180 r3p2 affinity: 14 1 CPU 30: APM eMAG 8180 r3p2 affinity: 15 0 CPU 31: APM eMAG 8180 r3p2 affinity: 15 1 A version, which is running fine is: FreeBSD ampere32.nplab.de 15.0-CURRENT FreeBSD 15.0-CURRENT #76 main-n268036-d682c1eaa598-dirty: Sat Feb 3 23:32:31 CET 2024 root@ampere32.nplab.de:/usr/obj/usr/home/tuexen/freebsd-src/arm64.aarch64/sys/TCP arm64 Not sure which change introduced the problem. It there for about a week or so. From what I could tell when I looked at this before for Dave, this appears to be an issue specific to linuxkpi used by the mlx5 driver. In particular, it uses bus_translate_resource in its wrapper for pci_resource_start and I had asked Dave to boot with an additional patch to try to trace what is going on there to see if it is getting the wrong answer. The mlx5 driver assumes that pci_resource_start gives a valid physical address it can pass to ioremap to create a memory mapping of the BAR. The patch with the debugging trace is here: https://reviews.freebsd.org/P632 Ah, looks like the dmesg from Dave does actually include this patch as it has this line of output: mlx5_core0: translate 0x14082000000 -> 0x24082000000 That looks correct, but unfortunately, we only display the ranges in bootverbose for FDT, not ACPI. The patch below fixes the pcib driver to always log the ranges which would be useful to confirm the window: diff --git a/sys/dev/pci/pci_host_generic.c b/sys/dev/pci/pci_host_generic.c index 386b8411d29a..46b84ff3004b 100644 --- a/sys/dev/pci/pci_host_generic.c +++ b/sys/dev/pci/pci_host_generic.c @@ -83,6 +83,7 @@ pci_host_generic_core_attach(device_t dev) uint64_t phys_base; uint64_t pci_base; uint64_t size; + const char *range_descr; char buf[64]; int domain, error; int flags, rid, tuple, type; @@ -179,6 +180,7 @@ pci_host_generic_core_attach(device_t dev) switch (FLAG_TYPE(sc->ranges[tuple].flags)) { case FLAG_TYPE_PMEM: sc->has_pmem = true; + range_descr = "prefetch"; flags = RF_PREFETCHABLE; type = SYS_RES_MEMORY; error = rman_manage_region(&sc->pmem_rman, @@ -186,12 +188,14 @@ pci_host_generic_core_attach(device_t dev) break; case FLAG_TYPE_MEM: flags = 0; + range_descr = "memory"; type = SYS_RES_MEMORY; error = rman_manage_region(&sc->mem_rman, pci_base, pci_base + size - 1); break; case FLAG_TYPE_IO: flags = 0; + range_descr = "I/O port"; type = SYS_RES_IOPORT; error = rman_manage_region(&sc->io_rman, pci_base, pci_base + size - 1); @@ -219,6 +223,10 @@ pci_host_generic_core_attach(device_t dev) error = ENXIO; goto err_rman_manage; } + if (bootverbose) + device_printf(dev, + "PCI addr: 0x%jx, CPU addr: 0x%jx, Size: 0x%jx, Type: %s\n", + pci_base, phys_base, size, range_type); } return (0); That said, it seems like the translation is correct given the prefetch window used for the pcib1 bridge between pcib0 and the mlx5 device: pcib1: <PCI-PCI bridge> at device 0.0 on pci0 pcib1: domain 0 pcib1: secondary bus 1 pcib1: subordinate bus 1 pcib1: memory decode 0x30000000-0x301fffff pcib1: prefetched decode 0x14080000000-0x14083ffffff And this allocation of mlx5's BAR: map[10]: type Prefetchable Memory, range 64, base rx14082000000, size 25, enabled pcib1: allocated prefetch range (0x14082000000-0x14083ffffff) for rid 10 of pci0:1:0:0 It is odd for a register bar to be in a prefetch BAR. It might be good to see a verbose dmesg from before to see how the bridge and and mlx5 BAR were configured before. Created attachment 248803 [details]
verbose dmesg with range_descr debugging
thanks, attached dmesg after patch.
Created attachment 248817 [details]
dmesg after D44132
dmesg & panic after rebasing and applying D44132
Created attachment 248936 [details]
fix.patch
Please try this patch. Looking at the dmesg, the address was translated incorrectly. It matches this range which requires no translation:
pcib0: PCI addr: 0x10100000000, CPU addr: 0x10100000000, Size: 0x7f00000000, Type: memory
(PCI addr == CPU addr), but it was matching on the wrong range and translating the address as if it belonged to the first range:
pcib0: PCI addr: 0x0, CPU addr: 0x10010000000, Size: 0x10000, Type: I/O port
The code I changed in commit d79b6b8ec267 expected the end to be >= start, and the end value of 0 in the old code violated this assumption.
I tested that patch and can confirm that it fixes the issue on my ampere system. Thanks for the patch! Best regards Michael A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=332dbd3a2f08a887014a425d2532af93503588ce commit 332dbd3a2f08a887014a425d2532af93503588ce Author: John Baldwin <jhb@FreeBSD.org> AuthorDate: 2024-03-05 04:52:54 +0000 Commit: John Baldwin <jhb@FreeBSD.org> CommitDate: 2024-03-05 04:52:54 +0000 pci_host_generic: Pass a valid end address in generic_pcie_translate_resource The generic_pcie_containing_range helper added in commit d79b6b8ec267 assumed that the passed in (start, end) range used to locate the containing mapping range was a valid address range (with end >= start). The previous version of generic_pcie_translate_resource_common only used the start address to locate a mapping range, so the end address of 0 did not matter, but an end of 0 now causes the first range to match and an incorrect translation for resources using a later range. PR: 277211 Reported by: dch, tuexen Reviewed by: tuexen Fixes: d79b6b8ec267 pci_host_generic: Don't rewrite resource start address for translation Differential Revision: https://reviews.freebsd.org/D44205 sys/dev/pci/pci_host_generic.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) works on my machine :-) thanks!! |
Created attachment 248657 [details] dmesg + panic as of 58df49801d9d panic from 58df49801d9d & jhb's "acpi: Defer reserving resources for ACPI devices" patch.