Bug 250561

Summary: net/dhcpcd: (9.2.0) endless CPU spinning and logspam 'ps_bpf_recvbpf: Device not configured' after Mellanox NIC reinitializes after suspend
Product: Ports & Packages Reporter: Val Packett <val>
Component: Individual Port(s)Assignee: Ben Woods <woodsb02>
Status: Closed FIXED    
Severity: Affects Only Me CC: roy, val
Priority: --- Flags: bugzilla: maintainer-feedback? (woodsb02)
Version: Latest   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
Remove interface on ENXIO
roy: maintainer-approval+
Close BPF socket on ENXIO
none
Close BPF socket on ENXIO roy: maintainer-approval+

Description Val Packett 2020-10-23 15:55:29 UTC
mlx4 NIC resets when coming back from S3 sleep.
dhcpcd does not handle this well.
Apparently its reaction to getting a 'Device not configured' error from ps_bpf_recvbpf is to just log and continue like nothing happened?
So the end result is, both dhcpcd and syslogd use a whole CPU core each, purely on spamming 'Device not configured' log messages :(

Oct 23 13:54:31 ruunvald acpi[6066]: resumed at 20201023 13:54:31
[..snip: USB devices..]
Oct 23 13:54:52 ruunvald dhcpcd[5451]: mlxen0: no IPv6 Routers available
Oct 23 13:55:21 ruunvald kernel: [47193] mlx4_core0: command 0x49 timed out (go bit not cleared)
Oct 23 13:55:21 ruunvald kernel: [47193] mlx4_core0: device is going to be reset
Oct 23 13:55:22 ruunvald kernel: [47194] mlx4_core0: device was reset successfully
Oct 23 13:55:22 ruunvald kernel: [47194] mlx4_en mlx4_core0: Internal error detected, restarting device
Oct 23 13:55:25 ruunvald kernel: [47196] mlx4_core0: Internal error mark was detected on device
Oct 23 13:55:25 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:25 ruunvald syslogd: last message repeated 1340 times
Oct 23 13:55:25 ruunvald avahi-daemon[47574]: IPV6_DROP_MEMBERSHIP failed: Can't assign requested address
Oct 23 13:55:25 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:25 ruunvald syslogd: last message repeated 2172 times
Oct 23 13:55:25 ruunvald dhcpcd[5451]: mlxen0: if_getmtu: Device not configured
Oct 23 13:55:25 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:25 ruunvald syslogd: last message repeated 396 times
Oct 23 13:55:25 ruunvald dhcpcd[5451]: ipv6nd_applyra: Device not configured
Oct 23 13:55:25 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:28 ruunvald syslogd: last message repeated 486197 times
Oct 23 13:55:28 ruunvald kernel: [47200] mlx4_core0: Old device ETS support detected
Oct 23 13:55:28 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:28 ruunvald syslogd: last message repeated 63 times
Oct 23 13:55:28 ruunvald kernel: [47200] mlx4_core0: Consider upgrading device FW.
Oct 23 13:55:28 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:29 ruunvald syslogd: last message repeated 208939 times
Oct 23 13:55:29 ruunvald kernel: [47201] mlx4_core0: Unable to determine PCI device chain minimum BW
Oct 23 13:55:29 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:29 ruunvald syslogd: last message repeated 11613 times
Oct 23 13:55:29 ruunvald kernel: [47202] mlx4_en mlx4_core0: Activating port:1
Oct 23 13:55:29 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:29 ruunvald syslogd: last message repeated 516 times
Oct 23 13:55:29 ruunvald kernel: [47202] mlxen0: Ethernet address: 00:02:c9:[REDACTED]
Oct 23 13:55:29 ruunvald kernel: [47202] <4>mlx4_en: mlx4_core0: Port 1: Using 16 TX rings
Oct 23 13:55:29 ruunvald kernel: [47202] mlxen0: link state changed to DOWN
Oct 23 13:55:29 ruunvald kernel: [47202] <4>mlx4_en: mlx4_core0: Port 1: Using 16 RX rings
Oct 23 13:55:29 ruunvald kernel: [47202] <4>mlx4_en: mlxen0: Using 16 TX rings
Oct 23 13:55:29 ruunvald kernel: [47202] <4>mlx4_en: mlxen0: Using 16 RX rings
Oct 23 13:55:29 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:29 ruunvald syslogd: last message repeated 49 times
Oct 23 13:55:29 ruunvald kernel: [47202] <4>mlx4_en: mlxen0: Initializing port
Oct 23 13:55:29 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:29 ruunvald syslogd: last message repeated 415 times
Oct 23 13:55:29 ruunvald kernel: [47202] mlx4_core0: mlx4_restart_one was ended, ret=0
Oct 23 13:55:29 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:32 ruunvald syslogd: last message repeated 422257 times
Oct 23 13:55:32 ruunvald kernel: [47204] <6>mlx4_en: mlxen0: Link Up
Oct 23 13:55:32 ruunvald kernel: [47204] mlxen0: link state changed to UP
Oct 23 13:55:32 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:52 ruunvald syslogd: last message repeated 3708975 times
Oct 23 13:55:52 ruunvald kernel: nfs server fdc0:ffee::1:/solitude<118>[47225] ps_bpf_recvbpf: Device not configured
Oct 23 13:55:52 ruunvald kernel: -bolt/greg: not responding
Oct 23 13:55:52 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:55:52 ruunvald syslogd: last message repeated 69 times
[^^nice^^]
Oct 23 13:55:52 ruunvald kernel: nfs server fdc0:ffee::1:/solitude-bolt/greg: is a<118>[47225] ps_bpf_recvbpf: Device not configured
Oct 23 13:55:52 ruunvald kernel: live again
[^^nfs is up means network is okay now, but dhcpcd does not recover^^]
Oct 23 13:55:52 ruunvald dhcpcd[77694]: ps_bpf_recvbpf: Device not configured
Oct 23 13:56:22 ruunvald syslogd: last message repeated 5428175 times
Oct 23 13:58:23 ruunvald syslogd: last message repeated 21508608 times
Oct 23 14:08:24 ruunvald syslogd: last message repeated 107135631 times
Oct 23 14:18:25 ruunvald syslogd: last message repeated 108124940 times
Oct 23 14:23:38 ruunvald syslogd: last message repeated 56390531 times
Comment 1 roy 2020-10-25 10:18:44 UTC
Created attachment 219054 [details]
Remove interface on ENXIO

Pretty hard for me to test this sadly.

It looks like your interface does a hardware reset and this confuses dhcpcd.
Hopefully the attached patch helps.

When this happens, the interface is clearly not available by name because if_getmtu fails. So it won't be reported by ifconfig(8).
This at least will stop the constant log spam.
However if dhcpcd is started for a single interface it will exit. You will need to run dhcpcd on all interfaces (ie dhcpcd_enable=YES in rc.conf rather than ifconfig_foo=DHCP).
If dhcpcd does not recover then it's either that the kernel does not send RTM_IFANNOUNCE messages for this which is very bad or they got lost as route(4) overflowed.

I can't really help for the former, but for the latter there is a kernel patch here so that dhcpcd can spot route(4) overflows and recover: https://reviews.freebsd.org/D26652
Comment 2 roy 2020-10-25 15:00:01 UTC
Created attachment 219060 [details]
Close BPF socket on ENXIO

Actually this patch is all that's needed for dhcpcd.
Comment 3 roy 2020-10-25 15:06:06 UTC
Created attachment 219061 [details]
Close BPF socket on ENXIO

Test if BPF socket has been closed earlier if resending.
Comment 4 Val Packett 2020-11-06 12:16:29 UTC
Updated to 9.3.2, which includes the fix. Looks like it's good now.

Nov  6 15:00:12 ruunvald kernel: [2049] mlx4_core0: device was reset successfully
Nov  6 15:00:15 ruunvald dhcpcd[57525]: mlxen0: ps_bpf_recvbpf: Device not configured
Nov  6 15:00:15 ruunvald avahi-daemon[82553]: IPV6_DROP_MEMBERSHIP failed: Can't assign requested address
Nov  6 15:00:15 ruunvald dhcpcd[55689]: mlxen0: if_getmtu: Device not configured
Nov  6 15:00:15 ruunvald dhcpcd[55689]: ipv6nd_applyra: Device not configured
[..mlx stuff..]
Nov  6 15:00:21 ruunvald kernel: [2059] mlxen0: link state changed to UP

and no leftover BPF ARP process spinning on cpu. Thanks!