Bug 250561 - net/dhcpcd: (9.2.0) endless CPU spinning and logspam 'ps_bpf_recvbpf: Device not configured' after Mellanox NIC reinitializes after suspend
Summary: net/dhcpcd: (9.2.0) endless CPU spinning and logspam 'ps_bpf_recvbpf: Device ...
Status: Closed FIXED
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s) (show other bugs)
Version: Latest
Hardware: Any Any
: --- Affects Only Me
Assignee: Ben Woods
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-10-23 15:55 UTC by Greg V
Modified: 2020-11-06 12:16 UTC (History)
2 users (show)

See Also:
bugzilla: maintainer-feedback? (woodsb02)


Attachments
Remove interface on ENXIO (4.70 KB, patch)
2020-10-25 10:18 UTC, roy
roy: maintainer-approval+
Details | Diff
Close BPF socket on ENXIO (913 bytes, patch)
2020-10-25 15:00 UTC, roy
no flags Details | Diff
Close BPF socket on ENXIO (1.23 KB, patch)
2020-10-25 15:06 UTC, roy
roy: maintainer-approval+
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Greg V 2020-10-23 15:55:29 UTC
mlx4 NIC resets when coming back from S3 sleep.
dhcpcd does not handle this well.
Apparently its reaction to getting a 'Device not configured' error from ps_bpf_recvbpf is to just log and continue like nothing happened?
So the end result is, both dhcpcd and syslogd use a whole CPU core each, purely on spamming 'Device not configured' log messages :(

Oct 23 13:54:31 ruunvald acpi[6066]: resumed at 20201023 13:54:31
[..snip: USB devices..]
Oct 23 13:54:52 ruunvald dhcpcd[5451]: mlxen0: no IPv6 Routers available
Oct 23 13:55:21 ruunvald kernel: [47193] mlx4_core0: command 0x49 timed out (go bit not cleared)
Oct 23 13:55:21 ruunvald kernel: [47193] mlx4_core0: device is going to be reset
Oct 23 13:55:22 ruunvald kernel: [47194] mlx4_core0: device was reset successfully
Oct 23 13:55:22 ruunvald kernel: [47194] mlx4_en mlx4_core0: Internal error detected, restarting device
Oct 23 13:55:25 ruunvald kernel: [47196] mlx4_core0: Internal error mark was detected on device
Oct 23 13:55:25 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:25 ruunvald syslogd: last message repeated 1340 times
Oct 23 13:55:25 ruunvald avahi-daemon[47574]: IPV6_DROP_MEMBERSHIP failed: Can't assign requested address
Oct 23 13:55:25 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:25 ruunvald syslogd: last message repeated 2172 times
Oct 23 13:55:25 ruunvald dhcpcd[5451]: mlxen0: if_getmtu: Device not configured
Oct 23 13:55:25 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:25 ruunvald syslogd: last message repeated 396 times
Oct 23 13:55:25 ruunvald dhcpcd[5451]: ipv6nd_applyra: Device not configured
Oct 23 13:55:25 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:28 ruunvald syslogd: last message repeated 486197 times
Oct 23 13:55:28 ruunvald kernel: [47200] mlx4_core0: Old device ETS support detected
Oct 23 13:55:28 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:28 ruunvald syslogd: last message repeated 63 times
Oct 23 13:55:28 ruunvald kernel: [47200] mlx4_core0: Consider upgrading device FW.
Oct 23 13:55:28 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:29 ruunvald syslogd: last message repeated 208939 times
Oct 23 13:55:29 ruunvald kernel: [47201] mlx4_core0: Unable to determine PCI device chain minimum BW
Oct 23 13:55:29 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:29 ruunvald syslogd: last message repeated 11613 times
Oct 23 13:55:29 ruunvald kernel: [47202] mlx4_en mlx4_core0: Activating port:1
Oct 23 13:55:29 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:29 ruunvald syslogd: last message repeated 516 times
Oct 23 13:55:29 ruunvald kernel: [47202] mlxen0: Ethernet address: 00:02:c9:[REDACTED]
Oct 23 13:55:29 ruunvald kernel: [47202] <4>mlx4_en: mlx4_core0: Port 1: Using 16 TX rings
Oct 23 13:55:29 ruunvald kernel: [47202] mlxen0: link state changed to DOWN
Oct 23 13:55:29 ruunvald kernel: [47202] <4>mlx4_en: mlx4_core0: Port 1: Using 16 RX rings
Oct 23 13:55:29 ruunvald kernel: [47202] <4>mlx4_en: mlxen0: Using 16 TX rings
Oct 23 13:55:29 ruunvald kernel: [47202] <4>mlx4_en: mlxen0: Using 16 RX rings
Oct 23 13:55:29 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:29 ruunvald syslogd: last message repeated 49 times
Oct 23 13:55:29 ruunvald kernel: [47202] <4>mlx4_en: mlxen0: Initializing port
Oct 23 13:55:29 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:29 ruunvald syslogd: last message repeated 415 times
Oct 23 13:55:29 ruunvald kernel: [47202] mlx4_core0: mlx4_restart_one was ended, ret=0
Oct 23 13:55:29 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:32 ruunvald syslogd: last message repeated 422257 times
Oct 23 13:55:32 ruunvald kernel: [47204] <6>mlx4_en: mlxen0: Link Up
Oct 23 13:55:32 ruunvald kernel: [47204] mlxen0: link state changed to UP
Oct 23 13:55:32 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:52 ruunvald syslogd: last message repeated 3708975 times
Oct 23 13:55:52 ruunvald kernel: nfs server fdc0:ffee::1:/solitude<118>[47225] ps_bpf_recvbpf: Device not configured
Oct 23 13:55:52 ruunvald kernel: -bolt/greg: not responding
Oct 23 13:55:52 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:52 ruunvald syslogd: last message repeated 69 times
[^^nice^^]
Oct 23 13:55:52 ruunvald kernel: nfs server fdc0:ffee::1:/solitude-bolt/greg: is a<118>[47225] ps_bpf_recvbpf: Device not configured
Oct 23 13:55:52 ruunvald kernel: live again
[^^nfs is up means network is okay now, but dhcpcd does not recover^^]
Oct 23 13:55:52 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:56:22 ruunvald syslogd: last message repeated 5428175 times
Oct 23 13:58:23 ruunvald syslogd: last message repeated 21508608 times
Oct 23 14:08:24 ruunvald syslogd: last message repeated 107135631 times
Oct 23 14:18:25 ruunvald syslogd: last message repeated 108124940 times
Oct 23 14:23:38 ruunvald syslogd: last message repeated 56390531 times
Comment 1 roy 2020-10-25 10:18:44 UTC
Created attachment 219054 [details]
Remove interface on ENXIO

Pretty hard for me to test this sadly.

It looks like your interface does a hardware reset and this confuses dhcpcd.
Hopefully the attached patch helps.

When this happens, the interface is clearly not available by name because if_getmtu fails. So it won't be reported by ifconfig(8).
This at least will stop the constant log spam.
However if dhcpcd is started for a single interface it will exit. You will need to run dhcpcd on all interfaces (ie dhcpcd_enable=YES in rc.conf rather than ifconfig_foo=DHCP).
If dhcpcd does not recover then it's either that the kernel does not send RTM_IFANNOUNCE messages for this which is very bad or they got lost as route(4) overflowed.

I can't really help for the former, but for the latter there is a kernel patch here so that dhcpcd can spot route(4) overflows and recover: https://reviews.freebsd.org/D26652
Comment 2 roy 2020-10-25 15:00:01 UTC
Created attachment 219060 [details]
Close BPF socket on ENXIO

Actually this patch is all that's needed for dhcpcd.
Comment 3 roy 2020-10-25 15:06:06 UTC
Created attachment 219061 [details]
Close BPF socket on ENXIO

Test if BPF socket has been closed earlier if resending.
Comment 4 Greg V 2020-11-06 12:16:29 UTC
Updated to 9.3.2, which includes the fix. Looks like it's good now.

Nov  6 15:00:12 ruunvald kernel: [2049] mlx4_core0: device was reset successfully
Nov  6 15:00:15 ruunvald dhcpcd[57525]: mlxen0: ps_bpf_recvbpf: Device not configured
Nov  6 15:00:15 ruunvald avahi-daemon[82553]: IPV6_DROP_MEMBERSHIP failed: Can't assign requested address
Nov  6 15:00:15 ruunvald dhcpcd[55689]: mlxen0: if_getmtu: Device not configured
Nov  6 15:00:15 ruunvald dhcpcd[55689]: ipv6nd_applyra: Device not configured
[..mlx stuff..]
Nov  6 15:00:21 ruunvald kernel: [2059] mlxen0: link state changed to UP

and no leftover BPF ARP process spinning on cpu. Thanks!