We have faced this problem several times, but it is not easy to reproduce.
We use several mce(4) cards in one or two LACP lagg(4) interfaces. The system is mostly a router with a firewall. Each lagg(4) has several vlans. vlanhwfilter is enabled by default. Each IRQ is bound to a dedicated CPU core, but the mce* interfaces share CPU cores with each other. The number of queues is limited to 26; the last two are dedicated to user processes.
The problem can occur when management software does some network reconfiguration, e.g. creates new vlan interfaces or destroys old ones.
It looks like the system stops receiving any packets. If we turn off vlanhwfilter via the IPMI console, it starts working again. At the moment the problem occurs, the network load is about 500-800 kpps.
It seems this cannot be reproduced with a single card, i.e. without a LACP lagg.
We have already seen this problem in 11-CURRENT, 12-CURRENT, and 12-STABLE. But since our configuration is very specific, with patches to the driver, this report is mostly JFYI. For now we just disable vlanhwfilter by default.
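For reference, the run-time workaround can be sketched as below. This is only an illustration: the port names mce0..mce3 are placeholders rather than the real configuration, and it assumes the flag is toggled on the mce(4) ports themselves (the report does not say whether the lagg interface is used instead). By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: turn off hardware VLAN filtering on a set of mce(4) ports.
# The port names are placeholders, not taken from the report.
PORTS="${PORTS:-mce0 mce1 mce2 mce3}"
# With DRY_RUN=1 (the default here) the commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

disable_vlanhwfilter() {
    for ifn in $PORTS; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ifconfig $ifn -vlanhwfilter"
        else
            # -vlanhwfilter clears the capability on the interface
            ifconfig "$ifn" -vlanhwfilter
        fi
    done
}

disable_vlanhwfilter
```

Running it with DRY_RUN=0 as root applies the change. Making it the default across reboots would presumably mean adding -vlanhwfilter to the matching ifconfig_<ifn> lines in rc.conf.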
Are there any syndrome messages printed in dmesg?
There are no strange messages in the logs. Due to the packet loss, the BGP daemon closes connections and unsuccessfully tries to reconnect.
Are you using so-called prio-tagged traffic on any non-vlan interfaces? Refer to pcp option as printed by ifconfig.
In the case the RX queue gets full, there is a receive path watchdog which should ensure that the RX queue doesn't stall. You have not removed this watchdog in your patches?
What firmware revision are you using?
(In reply to Hans Petter Selasky from comment #3)
> Are you using so-called prio-tagged traffic on any non-vlan interfaces?
> Refer to pcp option as printed by ifconfig.
> In the case the RX queue gets full, there is a receive path watchdog which
> should ensure that the RX queue doesn't stall. You have not removed this
> watchdog in your patches?
No, we use patches mostly to automatically bind to CPU cores depending on the configuration specified in loader.conf. As I said, the problem disappeared when vlanhwfilter was disabled on the interfaces at run time.
> What firmware revision are you using?
> sysctl dev.mlx5_core.<N>.hw.fw_version
I think the version may differ. The last machine where the problem happened has this