Created attachment 246037 [details]
core.txt

Running devmatch (at boot time or after boot) with 10 Xen "xn" interfaces causes a kernel panic. With fewer interfaces it does not crash. Adding many interfaces while the system is online (without running devmatch) does not crash.

Tested on 13.2-p4 and 14.0-RC3; both are affected. This also causes problems for FreeBSD-based apps like OPNsense and PFsense.

Running on XCP 8.2.1 (Xen), with the FreeBSD amd64 GENERIC kernel.

```
FreeBSD 14.0-RC3 FreeBSD 14.0-RC3 #0 releng/14.0-n265368-c6cfdc130554: Fri Oct 27 05:57:28 UTC 2023 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

panic: page fault

Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff81008a64
stack pointer = 0x28:0xfffffe0076295b30
frame pointer = 0x28:0xfffffe0076295b30
code segment = base rx0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 3078 (devmatch)
rdi: 0000000000000000 rsi: 0000000000000000 rdx: 0000000000000018
rcx: 0000000000000000 r8: fefefefefefefeff r9: 8080808080808080
rax: fffff80006f99000 rbx: 0000000000000000 rbp: fffffe0076295b30
r10: 0000000000000000 r11: df9290bcccff8d9a r12: 0000000000000002
r13: 0000000000000002 r14: fffffe0076295b68 r15: fffff8007419c000
trap number = 12
panic: page fault
cpuid = 0
time = 1698809363
KDB: stack backtrace:
#0 0xffffffff80b9002d at kdb_backtrace+0x5d
#1 0xffffffff80b43132 at vpanic+0x132
#2 0xffffffff80b42ff3 at panic+0x43
#3 0xffffffff8100c85c at trap_fatal+0x40c
#4 0xffffffff8100c8af at trap_pfault+0x4f
#5 0xffffffff80fe3818 at calltrap+0x8
#6 0xffffffff80b9c515 at sbuf_cat+0x15
#7 0xffffffff80b83d55 at sysctl_devices+0x125
#8 0xffffffff80b54910 at sysctl_root_handler_locked+0x90
#9 0xffffffff80b53d41 at sysctl_root+0x241
#10 0xffffffff80b543c6 at userland_sysctl+0x176
#11 0xffffffff80b5420c at sys___sysctl+0x5c
#12 0xffffffff8100d119 at amd64_syscall+0x109
#13 0xffffffff80fe412b at fast_syscall_common+0xf8
Uptime: 2m6s
Dumping 268 out of 4059
```
It seems the driver->name is NULL.

https://cgit.freebsd.org/src/tree/sys/kern/subr_bus.c?h=releng/14.0#n5180

    if (dev->driver != NULL)
        sbuf_cat(&sb, dev->driver->name);

No idea how that happens.
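To make the suspected failure mode concrete, here is a minimal userland sketch of the same pattern. It separates a device with no driver (legitimate) from a driver whose name pointer is NULL (the invalid state suspected here). The `toy_driver` and `toy_device` types and the `toy_format_driver` helper are hypothetical stand-ins, not the kernel's structures or the proposed fix.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the kernel's driver/device types. */
struct toy_driver {
    const char *name;                 /* expected to be non-NULL */
};

struct toy_device {
    const struct toy_driver *driver;  /* NULL when no driver is attached */
};

/*
 * Copy the driver name of dev into buf.
 * Returns 0 on success (empty string if no driver is attached) and -1 if
 * the device has a driver whose name is NULL, instead of dereferencing it.
 */
static int
toy_format_driver(const struct toy_device *dev, char *buf, size_t len)
{
    if (len == 0)
        return (-1);
    buf[0] = '\0';
    if (dev->driver == NULL)
        return (0);                   /* unattached device: nothing to print */
    if (dev->driver->name == NULL)
        return (-1);                  /* invalid state: report it, do not read it */
    snprintf(buf, len, "%s", dev->driver->name);
    return (0);
}
```

A check like this only makes the bad state visible; it does not explain how a driver ends up with a NULL name in the first place.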
CC imp@.
Driver name being null is the bug. That's forbidden.
Otherwise I need a reproducer. I don't have the time to guess, or to waste if the guess that the 10 xn devices are causing it turns out to be wrong. It seems weird on its surface: we have systems at work with 30 nvme drives that work just fine.
Will take a look tomorrow, might be caused by the xenbus code.
Created attachment 246070 [details]
Patch 1/2

I'm attaching a proposal for a fix. I've been able to reproduce the issue locally, and the attached patch fixes it for me. Can you please check whether it also fixes the issue on your side?
YES, that patch fixes the kernel crash problem with >10 interfaces (tested on 14.0-RC3; I would expect it to apply to 13 as well). THANKS!

But it exposes a new problem. Having Xen allocate many more interfaces (>15 total) now produces an error:

"xn15: failed to allocate tx grant refs"

That would be fine, but it causes a problem on reboot when recognizing the 16th interface:

"run_interrupt_driven_hooks: still waiting after 60 seconds for xenbusb_nop_confighook_cb"

It seems the system partially allocates a new "xn15" interface (the 16th) and then has problems working with it. The kernel should simply refuse to create/allocate an interface that is beyond the supported limit.
(In reply to Andrew Lindh from comment #7)

Grant allocation is not fully static. In other words, a PV device might use a high number of grants for a short period in order to cope with a data transfer spike and then release them, which is why most code is designed to wait for grants to be released.

That being said, when a device is attached, being unable to obtain the number of grants required for its static setup should just cause an error, and the device shouldn't be initialized. I will see about providing a patch to do that.
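The behaviour described above (fail the attach and leave the device uninitialized when the static grants cannot be reserved) can be sketched with a toy model. This is not the netfront code or the attached patch: `toy_pool`, `toy_nic`, `TOY_POOL_SIZE` and the helper functions are all hypothetical names used only for illustration.

```c
#include <stdbool.h>

#define TOY_POOL_SIZE 8     /* hypothetical total number of grant references */

struct toy_pool {
    int free_refs;          /* references still available */
};

struct toy_nic {
    int tx_refs;
    int rx_refs;
    bool attached;
};

/* Reserve n references, all or nothing. */
static bool
toy_reserve(struct toy_pool *p, int n)
{
    if (p->free_refs < n)
        return (false);
    p->free_refs -= n;
    return (true);
}

/* Return n references to the pool. */
static void
toy_release(struct toy_pool *p, int n)
{
    p->free_refs += n;
}

/*
 * Attach a toy interface: reserve its static tx and rx references.
 * On any failure, undo the partial reservation and leave the interface
 * uninitialized rather than half set up. Returns 0 on success, -1 on error.
 */
static int
toy_attach(struct toy_pool *p, struct toy_nic *nic, int tx, int rx)
{
    nic->attached = false;
    nic->tx_refs = 0;
    nic->rx_refs = 0;
    if (!toy_reserve(p, tx))
        return (-1);
    if (!toy_reserve(p, rx)) {
        toy_release(p, tx); /* roll back the tx reservation */
        return (-1);
    }
    nic->tx_refs = tx;
    nic->rx_refs = rx;
    nic->attached = true;
    return (0);
}
```

The point of the sketch is the rollback path: a failed attach returns the pool to its previous state, so later devices and the boot sequence are not left waiting on a half-initialized interface.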
Created attachment 246078 [details]
Patch 2/2

Attaching a second patch to prevent the kernel getting stuck with > 15 PV network interfaces. It should be applied on top of the previous one.

Let me know if that fixes the issue; it does seem to do the job for me.
Works well enough for me. No crash. No lockup. Reboots work. Tested on 14.0-RC3 and 13.2-p4 on XCP 8.2.1... I did not test 12...

It does have an odd side effect of allowing excess "xn" interfaces (>15) to be created with a failure warning, but they don't actually function in FreeBSD. I don't know if it's worth the time to look into it any further, either to make more interfaces work or to stop them from being created.

The issues started with PFsense/OPNsense, which use FreeBSD 13.2 and soon 14.0. I hope this patch passes any additional testing and makes it into the kernels.

Thanks for your time! (Olivier appreciates it too)
(In reply to Andrew Lindh from comment #10)

Oh, I didn't realize the interfaces were actually created; that's kind of weird. I think that's because by the time the error happens the driver is already attached, and hence, in order to remove the device, the driver would need to be detached from that specific instance.

Thanks for the testing, Roger.