Created attachment 256370 [details]
/var/run/dmesg.boot (verbose)

After upgrading to 14.2R from 14.1R, I noticed that listing all sysctls doesn't work anymore. It stops here:

<CUT>
dev.ahci.%parent:
dev.vgapci.1.%iommu: rid=0x10
dev.vgapci.1.%parent: pci0
dev.vgapci.1.%pnpinfo: vendor=0x8086 device=0x591b subvendor=0x1028 subdevice=0x07b1 class=0x030000
dev.vgapci.1.%location: slot=2 function=0 dbsf=pci0:0:2:0 handle=\_SB_.PCI0.GFX0
dev.vgapci.1.%driver: vgapci
dev.vgapci.1.%desc: VGA-compatible display
dev.vgapci.0.wake: 0
^C^C^C

Process enters R+, eats 100% CPU and becomes unkillable:

jason@jnb: [?:0] ~ $ ps auwwx | grep '[s]ysctl'
jason 25384 100.0 0.0 16888 4880 6 R+ 15:00 0:49.15 sysctl -a
jason@jnb: [?:0] ~ $ kill -9 25384
jason@jnb: [?:0] ~ $ ps auwwx | grep '[s]ysctl'
jason 25384 100.0 0.0 16888 4880 6 R+ 15:00 3:19.35 sysctl -a
jason@jnb: [?:0] ~ $ procstat -kk 25384
PID TID COMM TDNAME KSTACK
25384 101712 sysctl - pci_find_cap_method+0x11a iommu_get_requester+0x192 device_sysctl_handler+0x216 sysctl_root_handler_locked+0x8a sysctl_root+0x1fa userland_sysctl+0x115 sys___sysctl+0x60 amd64_syscall+0xed fast_syscall_common+0xf8

In this state, subsequent 'sysctl kern.geom' works fine, while 'sysctl dev.cpu' also hangs:

jason@jnb: [?:0] ~ $ ps auwwx | grep '[s]ysctl'
jason 25436 100.0 0.0 13816 2236 5 R+ 15:05 1:30.48 sysctl dev.cpu
jason 25384 100.0 0.0 16888 4880 6 R+ 15:00 6:31.35 sysctl -a
jason@jnb: [?:0] ~ $ procstat -kk 25436 25384
PID TID COMM TDNAME KSTACK
25436 101172 sysctl - sysctl_root_handler_locked+0x143 sysctl_root+0x1fa userland_sysctl+0x115 sys___sysctl+0x60 amd64_syscall+0xed fast_syscall_common+0xf8
25384 101712 sysctl - pci_find_cap_method+0x17a iommu_get_requester+0x192 device_sysctl_handler+0x216 sysctl_root_handler_locked+0x8a sysctl_root+0x1fa userland_sysctl+0x115 sys___sysctl+0x60 amd64_syscall+0xed fast_syscall_common+0xf8
jason@jnb: [?:0] ~ $

No changes have been made to system configuration during upgrade.
Created attachment 256371 [details] 'config -x /boot/kernel/kernel' output
Created attachment 256372 [details] 'kldstat -v' output
Narrowed down the issue to a single OID. I have the following GPU:

vgapci0@pci0:1:0:0: class=0x030000 rev=0xa1 hdr=0x00 vendor=0x10de device=0x1bb7 subvendor=0x1028 subdevice=0x07b1
    vendor = 'NVIDIA Corporation'
    device = 'GP104GLM [Quadro P4000 Mobile]'
    class = display
    subclass = VGA
    cap 01[60] = powerspec 3 supports D0 D3 current D0
    cap 05[68] = MSI supports 1 message, 64 bit
    cap 10[78] = PCI-Express 2 legacy endpoint max data 256(256) RO NS max read 512 link x16(x16) speed 8.0(8.0) ASPM L0s/L1(L0s/L1) ClockPM disabled
    ecap 0002[100] = VC 1 max VC0
    ecap 0018[250] = LTR 1
    ecap 0004[128] = Power Budgeting 1
    ecap 0001[420] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 000b[600] = Vendor [1] ID 0001 Rev 1 Length 36
        0b 00 01 90 01 00 41 02 02 00 41 01 01 18 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ecap 0019[900] = PCIe Sec 1 lane errors 0

It's unused in my system, so it is turned off by sysutils/acpi_call and xmj@'s turn_off_gpu.sh from TuningPowerConsumption [0]:

vgapci0@pci0:1:0:0: class=0x030000 rev=0xa1 hdr=0x00 vendor=0x10de device=0x1bb7 subvendor=0x1028 subdevice=0x07b1
    vendor = 'NVIDIA Corporation'
    device = 'GP104GLM [Quadro P4000 Mobile]'
    class = display
    subclass = VGA

(With '\_SB.PCI0.PEG0.PEGP._OFF' method.)

After the NVIDIA GPU is turned off, any sysctl call which tries to get dev.vgapci.X.%iommu hangs. Getting other OIDs, e.g. dev.vgapci.0.%driver, works fine.

NOTE: this isn't a DRM issue, as I don't have any NVIDIA-related modules loaded.

[0]: https://wiki.freebsd.org/TuningPowerConsumption
Compared 'vga'-containing commits between releng/14.2 and releng/14.1 -- identical. Then compared 'iommu'-containing commits (as this is the OID that hangs) and discovered ec8d60f0d9b762880482e39f567db552c152d3a2 by kib@, which exposes the value. This commit is only present in releng/14.2, so I believe it's the trigger. However, it is most likely no more than a trigger, not the root cause.
The iommu_get_requester() loops somewhere in the call to pci_find_cap_method(). The latter accesses the PCI config space directly, reading the header and iterating over the capability list, for instance to find the PCIe capability. To further diagnose the problem, you might try to instrument pci_find_cap_method() to see which registers it tries to read. My guess is that the capability read cycle gets something like 0xff as the offset of the next capability and then loops back.
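The guess above can be illustrated with a small user-space sketch. It is not the kernel's pci_find_cap_method() and not the change in D48348: the names read_live, read_dead, find_cap_bounded and the bound MAX_CAPS are hypothetical, chosen only for this illustration. It assumes a powered-down device answers every config read with 0xff, so the "next capability" pointer never becomes 0 and an unbounded walk never ends. The live layout mirrors the capabilities listed earlier in this report (01 at 0x60, 05 at 0x68, 10 at 0x78).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_PTR_REG  0x34 /* capabilities pointer in a type 0 header */
#define CAP_ID_OFF   0    /* capability ID byte within a capability */
#define CAP_NEXT_OFF 1    /* next-capability pointer byte */
#define MAX_CAPS     48   /* hypothetical bound: (256 - 64) / 4 dword slots */

typedef uint8_t (*cfg_read8_fn)(const void *dev, unsigned reg);

/* Healthy device: config space backed by a 256-byte array. */
static uint8_t
read_live(const void *dev, unsigned reg)
{
	const uint8_t *space = dev;

	return (reg < 256 ? space[reg] : 0xff);
}

/* Assumed behaviour of a powered-down device: all reads return 0xff. */
static uint8_t
read_dead(const void *dev, unsigned reg)
{
	(void)dev;
	(void)reg;
	return (0xff);
}

/*
 * Walk the capability list, giving up after MAX_CAPS entries.
 * Returns 0 and stores the offset in *capreg when found, -1 otherwise.
 * Without the cnt check, read_dead would keep ptr at 0xff forever.
 */
static int
find_cap_bounded(cfg_read8_fn rd, const void *dev, uint8_t cap, int *capreg)
{
	unsigned ptr = rd(dev, CAP_PTR_REG);
	int cnt;

	for (cnt = 0; ptr != 0 && cnt < MAX_CAPS; cnt++) {
		if (rd(dev, ptr + CAP_ID_OFF) == cap) {
			if (capreg != NULL)
				*capreg = (int)ptr;
			return (0);
		}
		ptr = rd(dev, ptr + CAP_NEXT_OFF);
	}
	return (-1);
}

/* Exercises both the healthy and the powered-down case. */
static void
self_check(void)
{
	uint8_t space[256] = { 0 };
	int reg = -1;

	space[CAP_PTR_REG] = 0x60;
	space[0x60] = 0x01; space[0x61] = 0x68; /* power management */
	space[0x68] = 0x05; space[0x69] = 0x78; /* MSI */
	space[0x78] = 0x10; space[0x79] = 0x00; /* PCI Express, end of list */

	assert(find_cap_bounded(read_live, space, 0x10, &reg) == 0);
	assert(reg == 0x78);
	/* The dead device terminates with "not found" instead of spinning. */
	assert(find_cap_bounded(read_dead, NULL, 0x10, &reg) == -1);
}
```

Calling self_check() from a test program exercises both paths. The bound of 48 is only one plausible choice (capabilities are dword-aligned and live above the 64-byte header); the actual limit used in the kernel may differ.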
Try https://reviews.freebsd.org/D48348
(In reply to Konstantin Belousov from comment #6)
Thanks for the prompt response. I can confirm that with the patch applied, getting the OID in question doesn't hang anymore -- I can see 'dev.vgapci.0.%iommu: rid=0x100' both before and after powering down the NVIDIA GPU.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=6ba2c036a0117ac02f9979b7dc49f15e9c1ea9c9

commit 6ba2c036a0117ac02f9979b7dc49f15e9c1ea9c9
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2025-01-06 23:29:18 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2025-01-07 15:34:59 +0000

    pci_find_cap_method(): limit number of iterations for finding a capability

    Powered down device might return 0xff of extended config registers
    reads, causing loop.

    PR:     283815
    Reviewed by:    imp
    Sponsored by:   The FreeBSD Foundation
    MFC after:      1 week
    Differential revision:  https://reviews.freebsd.org/D48348

 sys/dev/pci/pci.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)