Bug 273715 - dumpon: Kernel panic on boot when enabling dumpon over IP
Summary: dumpon: Kernel panic on boot when enabling dumpon over IP
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 13.2-RELEASE
Hardware: Any Any
: --- Affects Only Me
Assignee: Mark Johnston
URL:
Keywords: crash
Depends on:
Blocks:
 
Reported: 2023-09-11 09:32 UTC by Michael Gmelin
Modified: 2024-12-03 16:31 UTC (History)
4 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Gmelin freebsd_committer freebsd_triage 2023-09-11 09:32:34 UTC
Basic setup:

Host running bhyve vms. VMs use dumpon over IP to write crash dumps to the VM host. This used to work ok in 13.1 (booting these machines using a 13.1 kernel still works). With 13.2, dumpon (ironically) causes a panic on boot.

Configuration:

ifconfig_vtnet0_name="uplink"
ifconfig_uplink="inet 192.168.255.3/24"
defaultrouter="192.168.255.1"

vlans_uplink="vlanone vlantwo"

create_args_vlanone="vlan 1001"
ifconfig_vlanone="inet 10.1.1.1/24"

create_args_vlantwo="vlan 1002"
ifconfig_vlantwo="inet 10.1.2.1/24"

dumpdev="uplink"
dumpon_flags="-k /root/netdump_public.pem -i 0 -c 192.168.255.3 -s 192.168.255.1 -g 192.168.255.1"


Panic:


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x500
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80deea11
stack pointer           = 0x28:0xfffffe00f7fda9a0
frame pointer           = 0x28:0xfffffe00f7fda9d0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 26 (dumpon)
trap number             = 12
panic: page fault
cpuid = 0
time = 1694419470
KDB: stack backtrace:
#0 0xffffffff80c53e45 at kdb_backtrace+0x65
#1 0xffffffff80c067c1 at vpanic+0x151
#2 0xffffffff80c06663 at panic+0x43
#3 0xffffffff810b1fa7 at trap_fatal+0x387
#4 0xffffffff810b1fff at trap_pfault+0x4f
#5 0xffffffff81088ed8 at calltrap+0x8
#6 0xffffffff80def0c6 at netdump_ioctl+0x4c6
#7 0xffffffff80a9d736 at devfs_ioctl+0xc6
#8 0xffffffff80cf91d4 at vn_ioctl+0x1a4
#9 0xffffffff80a9ddee at devfs_ioctl_f+0x1e
#10 0xffffffff80c748fd at kern_ioctl+0x26d
#11 0xffffffff80c745e0 at sys_ioctl+0x100
#12 0xffffffff810b289c at amd64_syscall+0x10c
#13 0xffffffff810897eb at fast_syscall_common+0xf8
Uptime: 1s

When commenting out dumpon in /etc/rc.conf, the vm boots just fine.
Comment 1 Mina Galić freebsd_triage 2023-09-11 09:44:28 UTC
does it work if you use vtnet0?
Comment 2 Michael Gmelin freebsd_committer freebsd_triage 2023-09-11 09:57:25 UTC
(In reply to Mina Galić from comment #1)

It boots okay when using vtnet0, when calling "service dumpon restart" after boot, the machine immediately reboots though. So it seems like something is wrong with aliases.

When removing device aliases completely (and vlans, but I think these are unrelated), it boots okay and can be reconfigured (via the service command) after boot as well.

Note that this is what it looks like:


    # service dumpon restart
    kernel dumps on priority: device
    0: /dev/null
    server address: 192.168.255.1
    client address: 192.168.255.3
    gateway address: 192.168.255.1
    dumpon: getifaddrs: Operation not permitted
    /etc/rc.d/dumpon: WARNING: unable to specify vtnet0 as a dump device

Despite the warning, netdump works correctly (tested via sysctl debug.kdb.panic=1).
Comment 3 Mina Galić freebsd_triage 2023-09-11 10:02:12 UTC
this, intuitively, makes sense, as we don't track renames down to the ioctl level.
see https://reviews.freebsd.org/D39713 and related bugs and reviews

adding zlei@ to CC
Comment 4 Zhenlei Huang freebsd_committer freebsd_triage 2023-09-11 16:39:25 UTC
I'm not sure if https://reviews.freebsd.org/D39713 helps. @Michael Gmelin But you can try it out.

The panic should be easy to repeat. I'll have a look at it when I have bandwidth.
Comment 5 Mina Galić freebsd_triage 2023-09-11 17:11:21 UTC
(In reply to Zhenlei Huang from comment #4)
yeah, no. that was just the first "related" Diff i could pick out from memory.

thanks for looking into this
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2023-10-02 12:06:23 UTC
Presumably the cause of the panic is that netdump_configure() doesn't check the return value of ifunit_ref() for null?
Comment 7 commit-hook freebsd_committer freebsd_triage 2023-10-02 12:10:18 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d94d07d58141dcff48f01c6b3e5a31de9d7a7938

commit d94d07d58141dcff48f01c6b3e5a31de9d7a7938
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2023-10-02 12:08:20 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-02 12:09:26 +0000

    netdump: Check the return value of ifunit_ref()

    We may fail to match if the specific interface doesn't exist or was
    renamed.

    PR:             273715
    Reported by:    grembo
    MFC after:      1 week

 sys/netinet/netdump/netdump_client.c | 2 ++
 1 file changed, 2 insertions(+)
Comment 8 Michael Gmelin freebsd_committer freebsd_triage 2023-10-02 13:13:01 UTC
(In reply to Mark Johnston from comment #6)

Hi,

Thank you for the fix.

I just checked, in 13.1 rcorder was:

  /etc/rc.d/sysctl
  /etc/rc.d/natd
  /etc/rc.d/dhclient
  /etc/rc.d/dumpon

while on 13.2 rcorder is:

  /etc/rc.d/sysctl
  /etc/rc.d/natd
  /etc/rc.d/dumpon
  /etc/rc.d/dhclient

So on 13.1, dumpon was called after interface renaming takes, while on 13.2, device renaming happens after dumpon is started (which exposed the bug).

Without checking in more detail, this sounds like a plausible explanation.

It also means, that - even after applying the fix - it's not possible to have a dumpon configuration in /etc/rc.conf that will work during runtime (service dumpon start) and also on boot, in case a renamed interface is used. This seems like something that would be worth addressing.
Comment 9 Mark Johnston freebsd_committer freebsd_triage 2023-10-02 13:23:27 UTC
> So on 13.1, dumpon was called after interface renaming takes, while on 13.2, device renaming happens after dumpon is started (which exposed the bug).

How did you come to that conclusion?  Does "dhclient" somehow cause interfaces to be renamed?  That seems surprising given that your rc.conf doesn't enable DHCP on anything.

> This seems like something that would be worth addressing.

Certainly.
Comment 10 Michael Gmelin freebsd_committer freebsd_triage 2023-10-02 14:10:36 UTC
(In reply to Mark Johnston from comment #9)

> How did you come to that conclusion?

By the art of baseless speculation.

I compared the actual 13.1 and 13.2 boot sequences, and on both the interfaces are renamed after dumpon is called. So forget that.

I spent the time to actually read your patch the previous change that lead to it, so this all makes sense now (Captain Obvious reiterating):

https://cgit.freebsd.org/src/commit/sys/netinet/netdump/netdump_client.c?id=38a36057ae56c8023878f3c3c2185bafc2896964

introduced a check if the device supports debugnet. This check couldn't deal with non-existing interfaces, which is the bug you fixed.

Before the check was introduced, setting the renamed interface name was fine, since at the point the panic I tested with happened - which is after device renaming - it just worked(tm). If there was a panic between dumpon and device renaming, netdump would have failed of course (but that's a rare thing to happen).

Now with the check in place, it first crashed (due to the bug reported) and now - with the fix - will not catch on, as an interfaces named "untrusted" cannot be found.

I see various options how to resolve this:

1. Always allow using the physical device name (so it would somehow poke around and find it)
2. Add a flag to dumpon (e.g., `-f` for force) to skip the device exists/device supports debugnet check
3. Allow specifying multiple devices netdump_client to try when dumping
4. Determine based on the device name if it can exist early (feels wonky)

For my setup, 2. would be sufficient and easy to apply (as this means that all I have to do to adapting my base rc.conf for a new host is modify the ifconfig_<devname>_name line in rc.conf). It would also be quite easy to implement I assume (DIOCGKERNELDUMP would need to be extended though - unless some encoding in one of the existing parameters is done, like "dumpon -s 192.168.1.1 -s 192.168.1.2 untrusted:nocheck"). Alternatively this could maybe be handled using a sysctl(?).
Comment 11 Zhenlei Huang freebsd_committer freebsd_triage 2023-10-02 16:40:45 UTC
A minimal snip to repeat this bug:

```
# dumpon -i 0 -c 192.168.255.3 -g 192.168.255.1 -s 192.168.255.1 nonexist
```

> I see various options how to resolve this:

> 1. Always allow using the physical device name (so it would somehow poke around and find it)
> 2. Add a flag to dumpon (e.g., `-f` for force) to skip the device exists/device supports debugnet check
> 3. Allow specifying multiple devices netdump_client to try when dumping
> 4. Determine based on the device name if it can exist early (feels wonky)

The `dumpon` start quite early, even before `FILESYSTEMS`. At that time `netif` (interface renaming) has no chances to run.

I guess it is intended to let dumpon run as early as possible so that almost all kernel panics can be caught rightly.

Well for netdump(4), if a pubkey is requested, then dumpon requires `FILESYSTEMS`.

I guess we want to defer the setup of netdump(4), until `FILESYSTEMS` (probably also `netif`) is/are ready.

Or we can split the rc.d/dumpon into `rc.d/dumpon` and `rc.d/netdump`.
Comment 12 Mark Johnston freebsd_committer freebsd_triage 2023-10-04 14:30:25 UTC
Oh, I see.  It turns out that the ifnet pointer is optional, debugnet_connect() will derive the correct ifnet using other netdump parameters, plus the routing table.  So to provide the old behaviour we should continue if ifunit_ref() returns NULL (and maybe print a warning...?), or look up ifnets by device name as well.

But this also means that to configure netdump, you don't need an interface name at all.  But the rc.d/dumpon script requires it (you need to set "dumpdev" to something).  So yes, maybe we should have a rc.d/netdump script instead...
Comment 13 commit-hook freebsd_committer freebsd_triage 2023-10-09 00:59:52 UTC
A commit in branch stable/14 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=d08ef6f7da3943dc49999988e6c2ac03644bdaf1

commit d08ef6f7da3943dc49999988e6c2ac03644bdaf1
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2023-10-02 12:08:20 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-09 00:41:42 +0000

    netdump: Check the return value of ifunit_ref()

    We may fail to match if the specific interface doesn't exist or was
    renamed.

    PR:             273715
    Reported by:    grembo
    MFC after:      1 week

    (cherry picked from commit d94d07d58141dcff48f01c6b3e5a31de9d7a7938)

 sys/netinet/netdump/netdump_client.c | 2 ++
 1 file changed, 2 insertions(+)
Comment 14 commit-hook freebsd_committer freebsd_triage 2023-10-09 00:59:56 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=f093d21e081b1eed9844c28af605b5277e76094c

commit f093d21e081b1eed9844c28af605b5277e76094c
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2023-10-02 12:08:20 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-09 00:42:37 +0000

    netdump: Check the return value of ifunit_ref()

    We may fail to match if the specific interface doesn't exist or was
    renamed.

    PR:             273715
    Reported by:    grembo
    MFC after:      1 week

    (cherry picked from commit d94d07d58141dcff48f01c6b3e5a31de9d7a7938)

 sys/netinet/netdump/netdump_client.c | 2 ++
 1 file changed, 2 insertions(+)
Comment 15 commit-hook freebsd_committer freebsd_triage 2023-10-09 18:13:46 UTC
A commit in branch releng/14.0 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=862483bd3500aacc1546d6dfaa940a43e23e096b

commit 862483bd3500aacc1546d6dfaa940a43e23e096b
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2023-10-02 12:08:20 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2023-10-09 18:07:08 +0000

    netdump: Check the return value of ifunit_ref()

    We may fail to match if the specific interface doesn't exist or was
    renamed.

    Approved by:    re (gjb)
    PR:             273715
    Reported by:    grembo
    MFC after:      1 week

    (cherry picked from commit d94d07d58141dcff48f01c6b3e5a31de9d7a7938)
    (cherry picked from commit d08ef6f7da3943dc49999988e6c2ac03644bdaf1)

 sys/netinet/netdump/netdump_client.c | 2 ++
 1 file changed, 2 insertions(+)
Comment 16 Michael Gmelin freebsd_committer freebsd_triage 2024-12-03 16:25:08 UTC
@Mark Linimon The issue that renamed interfaces are not supported properly still exists though. I'll create a new bug for it.