Bug 240825 - Possible race between vlan interfaces and lagg(4) w/ em0/em1 post-EPOCH
Summary: Possible race between vlan interfaces and lagg(4) w/ em0/em1 post-EPOCH
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-net (Nobody)
URL:
Keywords: needs-qa, regression
Depends on:
Blocks:
 
Reported: 2019-09-25 22:12 UTC by Yaroslav Shvets
Modified: 2024-10-08 03:45 UTC
CC List: 10 users

See Also:


Attachments
network configuration in rc.conf (876 bytes, text/plain)
2019-09-25 22:12 UTC, Yaroslav Shvets
console.log (11.44 KB, text/plain)
2019-09-25 22:15 UTC, Yaroslav Shvets

Description Yaroslav Shvets 2019-09-25 22:12:51 UTC
Created attachment 207816
network configuration in rc.conf

After upgrading from 11.3-RELEASE-p3 to 12.0-RELEASE-p10, the VLAN interfaces on lagg0 (em0, em1) stopped working.
The VLAN interfaces appear to be up, but they do not pass traffic.
Network configuration: em0 and em1 are aggregated into lagg0, and the VLAN interfaces are created on top of lagg0.
The other end is a Cisco switch (a port channel of two Ethernet ports in LACP mode).

This network configuration worked on 9.x-releng, 10.x-releng, 11.2-releng and 11.3-releng,
but does not work on 12.0-releng, 12-stable or 12.1-PRERELEASE.
The lagg0 interface itself works, but the VLANs on lagg0 do not.
It looks like untagged Ethernet frames pass to and from lagg0, but tagged Ethernet frames do not.
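For context on "tagged" versus "untagged" frames: an 802.1Q tag is 4 bytes inserted after the source MAC address, made of the TPID 0x8100 followed by a 16-bit TCI that packs the priority (PCP, 3 bits), the DEI bit and the 12-bit VLAN ID. The C sketch below builds that tag for a given VLAN ID and priority. The names TPID_8021Q, VLAN_TAG_LEN and build_vlan_tag are illustrative only and are not taken from the FreeBSD sources.

```c
#include <assert.h>
#include <stdint.h>

/* Tag Protocol Identifier for an 802.1Q (customer VLAN) tag. */
#define TPID_8021Q   0x8100u
/* Size of the tag inserted after the source MAC address. */
#define VLAN_TAG_LEN 4

/*
 * Build the 4-byte 802.1Q tag (hypothetical helper).
 * TCI layout, most significant bit first: PCP (3) | DEI (1) | VID (12).
 * All bytes are written in network (big-endian) order.
 */
static void build_vlan_tag(uint16_t vid, uint8_t pcp, int dei,
                           uint8_t out[VLAN_TAG_LEN])
{
    assert(vid <= 0x0FFF);  /* VID is a 12-bit field */
    assert(pcp <= 7);       /* PCP is a 3-bit field */

    uint16_t tci = (uint16_t)((pcp << 13) | ((dei ? 1 : 0) << 12) | vid);

    out[0] = (uint8_t)(TPID_8021Q >> 8);    /* 0x81 */
    out[1] = (uint8_t)(TPID_8021Q & 0xFF);  /* 0x00 */
    out[2] = (uint8_t)(tci >> 8);
    out[3] = (uint8_t)(tci & 0xFF);
}
```

For lagg0.11 (shown by ifconfig as "vlan: 11 vlanpcp: 0") the tag bytes are 81 00 00 0b. These 4 extra bytes are also why a vlan interface may end up with an MTU 4 bytes smaller than its parent.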

After downgrading to 11.3-releng, the VLANs on lagg0 work again.

The system's kernel is based on GENERIC, i.e. "device vlan" is in the kernel configuration.

See the rc.conf and console.log attachments.
Comment 1 Yaroslav Shvets 2019-09-25 22:15:31 UTC
Created attachment 207817
console.log
Comment 2 Eugene Grosbein freebsd_committer freebsd_triage 2019-09-26 04:20:49 UTC
From console.log at boot time:

lagg0.11: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1496
 	options=403<RXCSUM,TXCSUM,LRO>
 	ether 00:e0:81:ba:ad:90
 	inet xx.xx.170.82 netmask 0xfffffff0 broadcast xx.xx.170.95
 	groups: vlan
 	vlan: 11 vlanpcp: 0 parent interface: lagg0
 	media: Ethernet autoselect
 	status: active
 	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
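The options= field in this output is a hexadecimal bitmask of interface capability flags, and ifconfig prints the names of the set bits between angle brackets. The sketch below decodes such a mask with a small, deliberately incomplete table whose bit values mirror the IFCAP_* constants in FreeBSD's <net/if.h>; cap_table and decode_options are illustrative names, not part of any FreeBSD API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Partial capability table; values mirror IFCAP_* in FreeBSD's <net/if.h>.
   Entries are in ascending bit order, the order in which ifconfig prints them. */
struct cap_name {
    unsigned bit;
    const char *name;
};

static const struct cap_name cap_table[] = {
    { 0x00001u, "RXCSUM" },
    { 0x00002u, "TXCSUM" },
    { 0x00008u, "VLAN_MTU" },
    { 0x00010u, "VLAN_HWTAGGING" },
    { 0x00400u, "LRO" },
};

/*
 * Write the names of the bits set in opts, comma-separated, into buf.
 * Bits missing from cap_table are skipped. If buf is too small the
 * output is truncated but always NUL-terminated.
 */
static void decode_options(unsigned opts, char *buf, size_t len)
{
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof cap_table / sizeof cap_table[0]; i++) {
        if ((opts & cap_table[i].bit) == 0)
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used == 0 ? "" : ",", cap_table[i].name);
        if (n < 0 || (size_t)n >= len - used)
            return;  /* out of space: keep the truncated, terminated string */
        used += (size_t)n;
    }
}
```

Decoding 0x403 yields "RXCSUM,TXCSUM,LRO", matching the printed options=403<RXCSUM,TXCSUM,LRO>.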

An MTU of 1496 for lagg0.11 (while lagg0 has an MTU of 1500) means that the vlan was created while lagg0 had no registered members with hardware VLAN support, so the 4 bytes of the 802.1Q tag were subtracted from the parent's MTU. This is very strange.
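The arithmetic in Comment 2 can be sketched as a simplified model under two assumptions: lagg(4) enables only the capabilities shared by all of its member ports (none at all when it has no members), and vlan(4) subtracts the 4-byte tag from the parent MTU unless the parent has VLAN_MTU enabled. The struct and function names below are hypothetical, CAP_VLAN_MTU mirrors the value of IFCAP_VLAN_MTU in FreeBSD's <net/if.h>, and this is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define CAP_VLAN_MTU   0x00008u  /* mirrors IFCAP_VLAN_MTU in <net/if.h> */
#define VLAN_ENCAP_LEN 4         /* bytes added by an 802.1Q tag */

/* Hypothetical stand-in for a lagg member port (e.g. em0, em1). */
struct port {
    unsigned capenable;  /* enabled capability bits of the member */
};

/* Model: the lagg enables the bitwise AND of its members' capabilities,
   or nothing at all when it has no members yet. */
static unsigned lagg_capenable(const struct port *ports, size_t nports)
{
    unsigned ena = ~0u;

    for (size_t i = 0; i < nports; i++)
        ena &= ports[i].capenable;
    return nports == 0 ? 0u : ena;
}

/* Model: a vlan keeps the parent MTU only if the parent can carry
   full-size tagged frames (VLAN_MTU); otherwise it loses the tag size. */
static int vlan_mtu(int parent_mtu, unsigned parent_capenable)
{
    assert(parent_mtu > VLAN_ENCAP_LEN);
    return (parent_capenable & CAP_VLAN_MTU) ? parent_mtu
                                             : parent_mtu - VLAN_ENCAP_LEN;
}
```

In this model, vlan_mtu(1500, lagg_capenable(NULL, 0)) returns 1496, the value seen on lagg0.11, while members that both have VLAN_MTU enabled give 1500. If the MTU is only computed when the vlan is attached to its parent (as the hypothesis in Comment 2 implies), the result depends on whether the laggport additions happen before or after the vlan is created.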

It looks like a race condition at boot time between lagg's internal configuration and another ifconfig process creating the vlan on top of lagg.

Adding to CC some people who have worked on the lagg(4) code recently.
Comment 3 Yaroslav Shvets 2019-09-26 11:39:17 UTC
However, if I destroy the broken vlan interface and the lagg interface after FreeBSD has booted,
and create them again, the vlan interface works.
Comment 4 Hans Petter Selasky freebsd_committer freebsd_triage 2019-09-26 11:45:31 UTC
Does this also happen if you disable devd? When network interfaces are created, devd receives an event and does some configuration in the background. This will typically race with netstart.
Comment 5 Eugene Grosbein freebsd_committer freebsd_triage 2019-11-12 05:05:02 UTC
Feedback timeout over 6 weeks. Feel free to re-open the PR if you have additional information.
Comment 6 Yaroslav Shvets 2019-11-12 13:49:20 UTC
As you can see in the original console.log, devd was started:

> Sep 25 18:39:00 <console.info> gw1 kernel: Starting devd.
Comment 7 Yaroslav Shvets 2019-11-12 19:08:16 UTC
VLANs over lagg0 do not work either with devd or without it (devd_enable="NO" in /etc/rc.conf).
Comment 8 Yaroslav Shvets 2019-11-12 19:19:15 UTC
with devd:
lagg0.11: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1496
        options=403<RXCSUM,TXCSUM,LRO>
        ether 00:e0:81:ba:ad:90
        inet xx.xx.170.82 netmask 0xfffffff0 broadcast xx.xx.170.95
        groups: vlan
        vlan: 11 vlanpcp: 0 parent interface: lagg0
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

without devd (devd_enable="NO"):
lagg0.11: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1496
        options=403<RXCSUM,TXCSUM,LRO>
        ether 00:e0:81:ba:ad:90
        inet xx.xx.170.82 netmask 0xfffffff0 broadcast xx.xx.170.95
        groups: vlan
        vlan: 11 vlanpcp: 0 parent interface: lagg0
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
Comment 9 Yaroslav Shvets 2019-11-17 19:37:01 UTC
I just updated the system to 12.1-RELEASE.
The problem still exists.
After a reboot, the interface lagg0.11 does not work.
When the interfaces are created manually (ifconfig lagg0 create, etc.),
the interface works.
Comment 10 commit-hook freebsd_committer freebsd_triage 2020-01-09 11:58:41 UTC
A commit references this bug:

Author: eugen
Date: Thu Jan  9 11:58:26 UTC 2020
New revision: 356551
URL: https://svnweb.freebsd.org/changeset/base/356551

Log:
  arp(8): avoid segfaulting due to out-of-bounds memory access

  Fix obvious mistake that sometimes results in reading memory
  past end of an array.

  PR:		240825
  MFC after:	1 week

Changes:
  head/usr.sbin/arp/arp.c
Comment 11 Mark Linimon freebsd_committer freebsd_triage 2024-10-04 11:16:53 UTC
^Triage: assign to the committer who resolved this back in 2020.
Comment 12 Eugene Grosbein freebsd_committer freebsd_triage 2024-10-04 20:44:21 UTC
The attribution is wrong: it comes from my mistake in the commit log for Subversion revision 356551 back in 2020, which had no connection to this PR. Undoing the last change to it.
Comment 13 Mark Linimon freebsd_committer freebsd_triage 2024-10-08 03:45:53 UTC
^Triage: turn off spurious mfc-stable12 flag.

(I thought this had been previously obviated by a commit, but I was wrong.)