Basic setup: Host running bhyve vms. VMs use dumpon over IP to write crash dumps to the VM host. This used to work ok in 13.1 (booting these machines using a 13.1 kernel still works). With 13.2, dumpon (ironically) causes a panic on boot. Configuration: ifconfig_vtnet0_name="uplink" ifconfig_uplink="inet 192.168.255.3/24" defaultrouter="192.168.255.1" vlans_uplink="vlanone vlantwo" create_args_vlanone="vlan 1001" ifconfig_vlanone="inet 10.1.1.1/24" create_args_vlantwo="vlan 1002" ifconfig_vlantwo="inet 10.1.2.1/24" dumpdev="uplink" dumpon_flags="-k /root/netdump_public.pem -i 0 -c 192.168.255.3 -s 192.168.255.1 -g 192.168.255.1" Panic: Fatal trap 12: page fault while in kernel mode cpuid = 0; apic id = 00 fault virtual address = 0x500 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff80deea11 stack pointer = 0x28:0xfffffe00f7fda9a0 frame pointer = 0x28:0xfffffe00f7fda9d0 code segment = base rx0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 26 (dumpon) trap number = 12 panic: page fault cpuid = 0 time = 1694419470 KDB: stack backtrace: #0 0xffffffff80c53e45 at kdb_backtrace+0x65 #1 0xffffffff80c067c1 at vpanic+0x151 #2 0xffffffff80c06663 at panic+0x43 #3 0xffffffff810b1fa7 at trap_fatal+0x387 #4 0xffffffff810b1fff at trap_pfault+0x4f #5 0xffffffff81088ed8 at calltrap+0x8 #6 0xffffffff80def0c6 at netdump_ioctl+0x4c6 #7 0xffffffff80a9d736 at devfs_ioctl+0xc6 #8 0xffffffff80cf91d4 at vn_ioctl+0x1a4 #9 0xffffffff80a9ddee at devfs_ioctl_f+0x1e #10 0xffffffff80c748fd at kern_ioctl+0x26d #11 0xffffffff80c745e0 at sys_ioctl+0x100 #12 0xffffffff810b289c at amd64_syscall+0x10c #13 0xffffffff810897eb at fast_syscall_common+0xf8 Uptime: 1s When commenting out dumpon in /etc/rc.conf, the vm boots just fine.
does it work if you use vtnet0?
(In reply to Mina Galić from comment #1) It boots okay when using vtnet0, when calling "service dumpon restart" after boot, the machine immediately reboots though. So it seems like something is wrong with aliases. When removing device aliases completely (and vlans, but I think these are unrelated), it boots okay and can be reconfigured (via the service command) after boot as well. Note that this is what it looks like: # service dumpon restart kernel dumps on priority: device 0: /dev/null server address: 192.168.255.1 client address: 192.168.255.3 gateway address: 192.168.255.1 dumpon: getifaddrs: Operation not permitted /etc/rc.d/dumpon: WARNING: unable to specify vtnet0 as a dump device Despite the warning, netdump works correctly (tested via sysctl debug.kdb.panic=1).
this, intuitively, makes sense, as we don't track renames down to the ioctl level. see https://reviews.freebsd.org/D39713 and related bugs and reviews adding zlei@ to CC
I'm not sure if https://reviews.freebsd.org/D39713 helps. @Michael Gmelin But you can try it out. The panic should be easy to repeat. I'll have a look at it when I have bandwidth.
(In reply to Zhenlei Huang from comment #4) yeah, no. that was just the first "related" Diff i could pick out from memory. thanks for looking into this
Presumably the cause of the panic is that netdump_configure() doesn't check the return value of ifunit_ref() for null?
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=d94d07d58141dcff48f01c6b3e5a31de9d7a7938 commit d94d07d58141dcff48f01c6b3e5a31de9d7a7938 Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2023-10-02 12:08:20 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2023-10-02 12:09:26 +0000 netdump: Check the return value of ifunit_ref() We may fail to match if the specific interface doesn't exist or was renamed. PR: 273715 Reported by: grembo MFC after: 1 week sys/netinet/netdump/netdump_client.c | 2 ++ 1 file changed, 2 insertions(+)
(In reply to Mark Johnston from comment #6) Hi, Thank you for the fix. I just checked, in 13.1 rcorder was: /etc/rc.d/sysctl /etc/rc.d/natd /etc/rc.d/dhclient /etc/rc.d/dumpon while on 13.2 rcorder is: /etc/rc.d/sysctl /etc/rc.d/natd /etc/rc.d/dumpon /etc/rc.d/dhclient So on 13.1, dumpon was called after interface renaming takes, while on 13.2, device renaming happens after dumpon is started (which exposed the bug). Without checking in more detail, this sounds like a plausible explanation. It also means, that - even after applying the fix - it's not possible to have a dumpon configuration in /etc/rc.conf that will work during runtime (service dumpon start) and also on boot, in case a renamed interface is used. This seems like something that would be worth addressing.
> So on 13.1, dumpon was called after interface renaming takes, while on 13.2, device renaming happens after dumpon is started (which exposed the bug). How did you come to that conclusion? Does "dhclient" somehow cause interfaces to be renamed? That seems surprising given that your rc.conf doesn't enable DHCP on anything. > This seems like something that would be worth addressing. Certainly.
(In reply to Mark Johnston from comment #9) > How did you come to that conclusion? By the art of baseless speculation. I compared the actual 13.1 and 13.2 boot sequences, and on both the interfaces are renamed after dumpon is called. So forget that. I spent the time to actually read your patch the previous change that lead to it, so this all makes sense now (Captain Obvious reiterating): https://cgit.freebsd.org/src/commit/sys/netinet/netdump/netdump_client.c?id=38a36057ae56c8023878f3c3c2185bafc2896964 introduced a check if the device supports debugnet. This check couldn't deal with non-existing interfaces, which is the bug you fixed. Before the check was introduced, setting the renamed interface name was fine, since at the point the panic I tested with happened - which is after device renaming - it just worked(tm). If there was a panic between dumpon and device renaming, netdump would have failed of course (but that's a rare thing to happen). Now with the check in place, it first crashed (due to the bug reported) and now - with the fix - will not catch on, as an interfaces named "untrusted" cannot be found. I see various options how to resolve this: 1. Always allow using the physical device name (so it would somehow poke around and find it) 2. Add a flag to dumpon (e.g., `-f` for force) to skip the device exists/device supports debugnet check 3. Allow specifying multiple devices netdump_client to try when dumping 4. Determine based on the device name if it can exist early (feels wonky) For my setup, 2. would be sufficient and easy to apply (as this means that all I have to do to adapting my base rc.conf for a new host is modify the ifconfig_<devname>_name line in rc.conf). It would also be quite easy to implement I assume (DIOCGKERNELDUMP would need to be extended though - unless some encoding in one of the existing parameters is done, like "dumpon -s 192.168.1.1 -s 192.168.1.2 untrusted:nocheck"). Alternatively this could maybe be handled using a sysctl(?).
A minimal snip to repeat this bug: ``` # dumpon -i 0 -c 192.168.255.3 -g 192.168.255.1 -s 192.168.255.1 nonexist ``` > I see various options how to resolve this: > 1. Always allow using the physical device name (so it would somehow poke around and find it) > 2. Add a flag to dumpon (e.g., `-f` for force) to skip the device exists/device supports debugnet check > 3. Allow specifying multiple devices netdump_client to try when dumping > 4. Determine based on the device name if it can exist early (feels wonky) The `dumpon` start quite early, even before `FILESYSTEMS`. At that time `netif` (interface renaming) has no chances to run. I guess it is intended to let dumpon run as early as possible so that almost all kernel panics can be caught rightly. Well for netdump(4), if a pubkey is requested, then dumpon requires `FILESYSTEMS`. I guess we want to defer the setup of netdump(4), until `FILESYSTEMS` (probably also `netif`) is/are ready. Or we can split the rc.d/dumpon into `rc.d/dumpon` and `rc.d/netdump`.
Oh, I see. It turns out that the ifnet pointer is optional, debugnet_connect() will derive the correct ifnet using other netdump parameters, plus the routing table. So to provide the old behaviour we should continue if ifunit_ref() returns NULL (and maybe print a warning...?), or look up ifnets by device name as well. But this also means that to configure netdump, you don't need an interface name at all. But the rc.d/dumpon script requires it (you need to set "dumpdev" to something). So yes, maybe we should have a rc.d/netdump script instead...
A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=d08ef6f7da3943dc49999988e6c2ac03644bdaf1 commit d08ef6f7da3943dc49999988e6c2ac03644bdaf1 Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2023-10-02 12:08:20 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2023-10-09 00:41:42 +0000 netdump: Check the return value of ifunit_ref() We may fail to match if the specific interface doesn't exist or was renamed. PR: 273715 Reported by: grembo MFC after: 1 week (cherry picked from commit d94d07d58141dcff48f01c6b3e5a31de9d7a7938) sys/netinet/netdump/netdump_client.c | 2 ++ 1 file changed, 2 insertions(+)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=f093d21e081b1eed9844c28af605b5277e76094c commit f093d21e081b1eed9844c28af605b5277e76094c Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2023-10-02 12:08:20 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2023-10-09 00:42:37 +0000 netdump: Check the return value of ifunit_ref() We may fail to match if the specific interface doesn't exist or was renamed. PR: 273715 Reported by: grembo MFC after: 1 week (cherry picked from commit d94d07d58141dcff48f01c6b3e5a31de9d7a7938) sys/netinet/netdump/netdump_client.c | 2 ++ 1 file changed, 2 insertions(+)
A commit in branch releng/14.0 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=862483bd3500aacc1546d6dfaa940a43e23e096b commit 862483bd3500aacc1546d6dfaa940a43e23e096b Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2023-10-02 12:08:20 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2023-10-09 18:07:08 +0000 netdump: Check the return value of ifunit_ref() We may fail to match if the specific interface doesn't exist or was renamed. Approved by: re (gjb) PR: 273715 Reported by: grembo MFC after: 1 week (cherry picked from commit d94d07d58141dcff48f01c6b3e5a31de9d7a7938) (cherry picked from commit d08ef6f7da3943dc49999988e6c2ac03644bdaf1) sys/netinet/netdump/netdump_client.c | 2 ++ 1 file changed, 2 insertions(+)
@Mark Linimon The issue that renamed interfaces are not supported properly still exists though. I'll create a new bug for it.