Bug 237010 - mlx5en with LACP can stop work with vlanhwfilter enabled
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-STABLE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
Reported: 2019-04-04 10:38 UTC by Andrey V. Elsukov
Modified: 2019-05-25 05:24 UTC

Description Andrey V. Elsukov freebsd_committer freebsd_triage 2019-04-04 10:38:06 UTC
We have faced this problem several times, but it is not easy to reproduce.

We use several mce(4) cards in one or two LACP lagg(4) interfaces. The system is mostly a router with a firewall. Each lagg(4) has several vlans. vlanhwfilter is enabled by default. Each IRQ is bound to a dedicated CPU core, but the mce* interfaces share CPU cores with each other. The number of queues is limited to 26; the last two are dedicated to user processes.

The problem can occur when management software does some network reconfiguration, i.e. creates new vlan interfaces or destroys old ones.
It looks like the system stops receiving any packets. If we turn off vlanhwfilter via the IPMI console, it starts working again. At the moment the problem occurs, the network load is about 500-800 kpps.
It seems this cannot be reproduced with a single card, i.e. without a LACP lagg.

We have already seen this problem in 11-CURRENT, 12-CURRENT and 12-STABLE. But since our configuration is very specific, with patches to the driver, this report is mostly JFYI. For now we just disable vlanhwfilter by default.
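
A minimal sketch of the workaround described above, assuming the lagg members are named mce0 and mce1 (placeholder names, not taken from the affected machines). The helper function is hypothetical; the only real command involved is ifconfig(8) with its -vlanhwfilter option. The function prints the commands rather than running them, so they can be reviewed before being applied as root.

```shell
# Print (do not run) the ifconfig commands that clear the vlanhwfilter
# capability on every interface named on the command line.
# The interface names passed in are placeholders for the real lagg members.
vlanhwfilter_off_cmds() {
    for ifn in "$@"; do
        printf 'ifconfig %s -vlanhwfilter\n' "$ifn"
    done
}

# Dry run: review the output, then pipe it to sh as root to apply it.
vlanhwfilter_off_cmds mce0 mce1
```

Whether the flag has to be cleared on every member port, and whether the lagg or vlan interfaces need the same treatment, is not stated in this report, so treat the interface list as something to adapt.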
Comment 1 Hans Petter Selasky freebsd_committer freebsd_triage 2019-04-09 08:19:22 UTC
Are there any syndrome messages printed in dmesg?
Comment 2 Andrey V. Elsukov freebsd_committer freebsd_triage 2019-04-09 09:27:17 UTC
There are no strange messages in the logs. Due to the packet loss, the BGP daemon closes its connections and tries unsuccessfully to reconnect.
Comment 3 Hans Petter Selasky freebsd_committer freebsd_triage 2019-04-09 09:48:24 UTC
Are you using so-called prio-tagged traffic on any non-vlan interfaces? Refer to pcp option as printed by ifconfig.

In the case the RX queue gets full, there is a receive path watchdog which should ensure that the RX queue doesn't stall. You have not removed this watchdog in your patches?

What firmware revision are you using?

sysctl dev.mlx5_core.<N>.hw.fw_version
Comment 4 Andrey V. Elsukov freebsd_committer freebsd_triage 2019-04-09 09:56:52 UTC
(In reply to Hans Petter Selasky from comment #3)
> Are you using so-called prio-tagged traffic on any non-vlan interfaces?
> Refer to pcp option as printed by ifconfig.
> 
> In the case the RX queue gets full, there is a receive path watchdog which
> should ensure that the RX queue doesn't stall. You have not removed this
> watchdog in your patches?

No, our patches mostly implement automatic binding to CPU cores based on the configuration specified in loader.conf. As I said, the problem disappeared when vlanhwfilter was disabled on the interfaces at run time.
 
> What firmware revision are you using?
> 
> sysctl dev.mlx5_core.<N>.hw.fw_version

I think the version may differ. The last machine where the problem happened has this:
dev.mlx5_core.0.hw.fw_version: 14.17.2032
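
Since the firmware version may differ between cards, a small filter along the following lines could list it for every mlx5_core instance at once. This is only a sketch: the function name is hypothetical, and the sample input reuses the value quoted above plus one unrelated sysctl line to show that other lines are dropped.

```shell
# Keep only the firmware-version lines from `sysctl dev.mlx5_core` output
# supplied on stdin; every other sysctl line is discarded.
fw_versions() {
    grep -E '^dev\.mlx5_core\.[0-9]+\.hw\.fw_version:'
}

# On the affected host this would be: sysctl dev.mlx5_core | fw_versions
# Sample input using the value reported in this bug:
printf '%s\n' \
    'dev.mlx5_core.0.hw.fw_version: 14.17.2032' \
    'dev.mlx5_core.0.%driver: mlx5_core' | fw_versions
```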