Bug 191786

Summary: [bce] [patch] bce link state changes to same state are ignored by lagg
Product: Base System Reporter: Kajetan Staszkiewicz <vegeta>
Component: kernAssignee: freebsd-net (Nobody) <net>
Status: New ---    
Severity: Affects Only Me CC: vegeta
Priority: --- Keywords: patch
Version: 9.2-RELEASE   
Hardware: amd64   
OS: Any   
Attachments:
Description Flags
Check for proper phy messages, never check phy status in bce_tick() none

Description Kajetan Staszkiewicz 2014-07-10 17:03:01 UTC
I have a Dell PowerEdge m610 system with 6 bce NICs. BladeCenter is equipped with M6348 switches in fabric A and pass-through modules in fabrics B and C. bce[01] are connected to switch stack via backplane and form a lagg with LACP, bce[23] are connected via pass-through module to copper ports in the same switch stack and create another lagg with LACP. bce[45] are connected elsewhere without LAG.

As far as I understand an interface can change its state in lagg when it goes down or up, thanks to if_link_state_change and do_link_state_change in sys/net/if.c. if_link_state_change is equipped with a check for "changing" into the same state.

I've added following printfs to see what exactly happens:

--- a/sys/dev/bce/if_bce.c
+++ b/sys/dev/bce/if_bce.c
@@ -6460,11 +6460,23 @@ bce_phy_intr(struct bce_softc *sc)
                        if (new_link_state) {
                                if (bootverbose)
                                        if_printf(sc->bce_ifp, "link UP\n");
+                               printf("%s: state change to UP; from 0x%x to 0x%x, calling if_link_state_change(0x%x, 0x%x)\n",
+                                               sc->bce_ifp->if_xname,
+                                               old_link_state,
+                                               new_link_state,
+                                               sc->bce_ifp->if_link_state,
+                                               LINK_STATE_UP);
                                if_link_state_change(sc->bce_ifp,
                                    LINK_STATE_UP);
                        } else {
                                if (bootverbose)
                                        if_printf(sc->bce_ifp, "link DOWN\n");
+                               printf("%s: state change to DOWN; from 0x%x to 0x%x, calling if_link_state_change(0x%x, 0x%x)\n",
+                                               sc->bce_ifp->if_xname,
+                                               old_link_state,
+                                               new_link_state,
+                                               sc->bce_ifp->if_link_state,
+                                               LINK_STATE_DOWN);
                                if_link_state_change(sc->bce_ifp,
                                    LINK_STATE_DOWN);
                        }
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -1900,9 +1900,12 @@ void     *(*vlan_cookie_p)(struct ifnet *);
 void
 if_link_state_change(struct ifnet *ifp, int link_state)
 {
+       log(LOG_NOTICE, "%s: link state changed to %s, scheduling do_link_state_change\n", ifp->if_xname, (link_state == LINK_STATE_UP) ? "UP" : "DOWN" );
        /* Return if state hasn't changed. */
-       if (ifp->if_link_state == link_state)
+       if (ifp->if_link_state == link_state) {
+               log(LOG_NOTICE, "%s: not really changed, skipping\n", ifp->if_xname);
                return;
+       }
 
        ifp->if_link_state = link_state;
 
@@ -1921,6 +1924,8 @@ do_link_state_change(void *arg, int pending)
        if (ifp->if_vlantrunk != NULL)
                (*vlan_link_state_p)(ifp);
 
+       log(LOG_NOTICE, "%s: link state changed to %s, calling hooks\n", ifp->if_xname, (link_state == LINK_STATE_UP) ? "UP" : "DOWN" );
+
        if ((ifp->if_type == IFT_ETHER || ifp->if_type == IFT_L2VLAN) &&
            IFP2AC(ifp)->ac_netgraph != NULL)
                (*ng_ether_link_state_p)(ifp, link_state);
@@ -1928,8 +1933,10 @@ do_link_state_change(void *arg, int pending)
                (*carp_linkstate_p)(ifp);
        if (ifp->if_bridge)
                (*bridge_linkstate_p)(ifp);
-       if (ifp->if_lagg)
+       if (ifp->if_lagg) {
+               log(LOG_NOTICE, "%s: hook link_state\n", ifp->if_xname);
                (*lagg_linkstate_p)(ifp, link_state);
+       }
 
        if (IS_DEFAULT_VNET(curvnet))
                devctl_notify("IFNET", ifp->if_xname,


As howtos and manuals suggest, my rc.conf contains ifconfig_bce[01234]="up", this causes the following thing to be logged:

Jul 10 17:18:55 aw19lb2 kernel: bce0: state change to UP; from 0x0 to 0x1, calling if_link_state_change(0x0, 0x2)
Jul 10 17:18:55 aw19lb2 kernel: bce1: state change to UP; from 0x0 to 0x1, calling if_link_state_change(0x0, 0x2)
Jul 10 17:18:55 aw19lb2 kernel: bce2: state change to UP; from 0x0 to 0x1, calling if_link_state_change(0x0, 0x2)
Jul 10 17:18:55 aw19lb2 kernel: bce3: state change to UP; from 0x0 to 0x1, calling if_link_state_change(0x0, 0x2)


Later on laggs get their ports added:

ifconfig_lagg1_name="internal"
ifconfig_internal="laggproto lacp laggport bce2 laggport bce3"

This causes following lacp debug messages:

Jul 10 17:18:55 aw19lb2 kernel: bce2: media changed 0x0 -> 0x22, ether = 1, fdx = 0, link = 1
Jul 10 17:18:55 aw19lb2 kernel: bce2: partner timeout changed
Jul 10 17:18:55 aw19lb2 kernel: bce2: LACP_STATE_AGGREGATION==0, key is 8003
Jul 10 17:18:55 aw19lb2 kernel: bce2: -> UNSELECTED

Jul 10 17:18:55 aw19lb2 kernel: bce3: media changed 0x0 -> 0x22, ether = 1, fdx = 0, link = 1
Jul 10 17:18:55 aw19lb2 kernel: bce3: partner timeout changed
Jul 10 17:18:55 aw19lb2 kernel: bce3: LACP_STATE_AGGREGATION==0, key is 8004
Jul 10 17:18:55 aw19lb2 kernel: bce3: -> UNSELECTED

Ports are added, but they are missing Full Duplex for some reason, so they don't participate in lagg yet. This gets fixed later on, but not for bce3:

Jul 10 17:18:57 aw19lb2 kernel: bce3: state change to UP; from 0x0 to 0x1, calling if_link_state_change(0x2, 0x2)
Jul 10 17:18:57 aw19lb2 kernel: bce3: link state changed to UP, scheduling do_link_state_change
Jul 10 17:18:57 aw19lb2 kernel: bce3: not really changed, skipping

bce3 performs change from down to up, but there was never change to down, so if_link_state_change ignores the change and lagg stuff is never called.

For bce2 this looks different, there is a real change to down, and then up:

Jul 10 17:18:59 aw19lb2 kernel: bce2: state change to DOWN; from 0x1 to 0x0, calling if_link_state_change(0x2, 0x1)
Jul 10 17:18:59 aw19lb2 kernel: bce2: link state changed to DOWN, scheduling do_link_state_change
Jul 10 17:18:59 aw19lb2 kernel: bce2: link state changed to DOWN, calling hooks
Jul 10 17:18:59 aw19lb2 kernel: bce2: hook link_state
Jul 10 17:18:59 aw19lb2 kernel: bce2: media changed 0x22 -> 0x100630, ether = 1, fdx = 1, link = 0
Jul 10 17:18:59 aw19lb2 kernel: bce2: LACP_STATE_AGGREGATION==0, key is 8003
Jul 10 17:18:59 aw19lb2 kernel: bce2: link state changed to DOWN
# end of do_link_state_change
Jul 10 17:18:59 aw19lb2 kernel: bce2: state change to UP; from 0x0 to 0x1, calling if_link_state_change(0x1, 0x2)
Jul 10 17:18:59 aw19lb2 kernel: bce2: link state changed to UP, scheduling do_link_state_change
Jul 10 17:18:59 aw19lb2 kernel: bce2: link state changed to UP, calling hooks
Jul 10 17:18:59 aw19lb2 kernel: bce2: hook link_state
Jul 10 17:18:59 aw19lb2 kernel: bce2: media changed 0x100630 -> 0x100630, ether = 1, fdx = 1, link = 1
Jul 10 17:18:59 aw19lb2 kernel: bce2: key is 20b
Jul 10 17:18:59 aw19lb2 kernel: bce2: -> UNSELECTED
Jul 10 17:18:59 aw19lb2 kernel: bce2: link state changed to UP
Jul 10 17:18:59 aw19lb2 kernel: bce2: lacpdu receive
Jul 10 17:18:59 aw19lb2 kernel: bce2: lacp_sm_rx_update_ntt: assert ntt
Jul 10 17:18:59 aw19lb2 kernel: bce2: old pstate 38<SYNC,COLLECTING,DISTRIBUTING>
Jul 10 17:18:59 aw19lb2 kernel: bce2: new pstate c5<ACTIVITY,AGGREGATION,DEFAULTED,EXPIRED>
Jul 10 17:18:59 aw19lb2 kernel: bce2: lacpdu transmit

In the meantime there are multiple messages "bce[23]: lacpdu receive", so both links are really up and are receiving packets from switch.

My system is patched with patch from http://lists.freebsd.org/pipermail/freebsd-net/2013-February/034649.html, otherwise there are even more problems, as bce[01] are show to be connected to different media, even though both of them are connected directly to switch via Blade Centre's backplane. Generally so strict checking for media type forbids creating a 4-port lagg for such network configuration as I have, where some ports would use direct backplane connection and some extra copper cables, but this is yet another topic...

[root@aw19lb2 ~]# ifconfig -m bce0
bce0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=800bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LINKSTATE>
        capabilities=800bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LINKSTATE>
        ether c8:1f:66:9d:1b:88
        media: Ethernet autoselect (1000baseSX <full-duplex,rxpause,txpause>)
        status: active
        supported media:
                media autoselect
                media 1000baseSX mediaopt full-duplex
                media 1000baseSX
[root@aw19lb2 ~]# 
[root@aw19lb2 ~]# ifconfig -m bce1
bce1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=800bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LINKSTATE>
        capabilities=800bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LINKSTATE>
        ether c8:1f:66:9d:1b:88
        media: Ethernet autoselect (1000baseT <full-duplex,rxpause,txpause>)
        status: active
        supported media:
                media autoselect
                media 1000baseSX mediaopt full-duplex
                media 1000baseSX
Comment 1 Kajetan Staszkiewicz 2014-07-14 12:01:40 UTC
Created attachment 144650 [details]
Check for proper phy messages, never check phy status in bce_tick()
Comment 2 Kajetan Staszkiewicz 2014-07-14 12:18:24 UTC
Comment on attachment 144650 [details]
Check for proper phy messages, never check phy status in bce_tick()

I've checked what Linux bce driver does and what the NIC really sends. This patch changes the following things:

- Remove handling of PHY state from bce_tick() as this function might catch some not fully "stable" state changes (like IFM_ETHER|IFM_NONE) and set interface up before link reaches full link state (IFM_ETHER|IFM_FDX|IFM_1000_T|rxpause|txpause) making the link unusable for lagg.

- sc->status_block->status_attn_bits(_ack) are not really link status. Those bits can change from down to up and then from down to up again without going from up to down. Call if_link_state_change() only on changes of sc->bce_link_up which is set only via bce_ifmedia_sts_rphy().

- Distinguish between STATUS_ATTN_BITS_LINK_STATE and STATUS_ATTN_BITS_TIMER_ABORT. Timer abort messages are in fact sent even if only link state is subscribed for. The old code would then put the link up with IFM_ETHER|IFM_NONE state making the link unusable in lagg.

- Sent bce_pulse when on STATUS_ATTN_BITS_TIMER_ABORT with event_code == BCE_FW_EVT_CODE_SW_TIMER_EXPIRE_EVENT and also on normal remote PHY events. Without this a message "Warning: bootcode thinks driver is absent." was seen on system boot.
Comment 3 Mark Linimon freebsd_committer freebsd_triage 2014-07-20 18:41:03 UTC
Over to maintainers.
Comment 4 Graham Perrin freebsd_committer freebsd_triage 2022-10-17 12:39:05 UTC
Keyword: 

    patch
or  patch-ready

– in lieu of summary line prefix: 

    [patch]

* bulk change for the keyword
* summary lines may be edited manually (not in bulk). 

Keyword descriptions and search interface: 

    <https://bugs.freebsd.org/bugzilla/describekeywords.cgi>