Bug 156226 - [lagg]: failover does not announce the failover to switch
Summary: [lagg]: failover does not announce the failover to switch
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: Unspecified
Hardware: Any Any
Importance: Normal Affects Some People
Assignee: Steven Hartland
URL: https://reviews.freebsd.org/D4111
Keywords: feature, needs-qa, patch
Depends on:
Blocks:
 
Reported: 2011-04-06 16:30 UTC by pvz
Modified: 2019-03-14 00:14 UTC
CC List: 8 users

See Also:
koobs: mfc-stable10?
koobs: mfc-stable9?


Attachments
Send gratuitous ARP on primary port status change (4.97 KB, patch)
2014-12-20 04:17 UTC, Kubilay Kocak
no flags

Description pvz 2011-04-06 16:30:09 UTC
I have an environment where I want failover between two links to the same subnet on two different switches (HP ProCurve switches of some kind; I don't have the exact model numbers handy).

The switch itself has not been configured to treat the ports specially.

When the link goes down, lagg switches its ACTIVE interface properly; however, no gratuitous ARP is sent out on the new link. This means that the CAM table on the switch is never updated.

More information is available in a thread on the FreeBSD forums: https://forums.freebsd.org/showthread.php?p=102093 - This problem has been known since at least FreeBSD 8.0. It does not seem to be specific to the em driver; it seems to happen with bce as well.

Fix: 

Make the lagg driver send out a gratuitous ARP in case of a failover. This will update the CAM tables and prevent this problem from occurring.
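To illustrate what such an announcement looks like on the wire, here is a minimal sketch in Python that builds a gratuitous ARP request as a raw Ethernet frame. It is not the lagg driver code: the function names, the MAC address and the IP address are made up for the example, and actually transmitting the frame (for example through bpf(4)) is left out.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to its 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError("MAC address must be 6 bytes")
    return raw


def build_gratuitous_arp(src_mac: str, ip: str) -> bytes:
    """Build a broadcast ARP request in which sender IP == target IP."""
    mac = mac_to_bytes(src_mac)
    addr = socket.inet_aton(ip)
    # Ethernet header: destination, source, EtherType.
    eth_header = BROADCAST_MAC + mac + struct.pack("!H", ETHERTYPE_ARP)
    # ARP fixed part: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac + addr            # sender hardware and protocol address
    arp += b"\x00" * 6 + addr    # target hardware (unknown) and protocol address
    return eth_header + arp


# Hypothetical values: a locally administered MAC and a documentation-range IP.
frame = build_gratuitous_arp("02:00:00:00:00:01", "192.0.2.10")
```

Because the frame is a broadcast with the host's MAC as the Ethernet source, every switch that floods it learns the MAC on the port it arrived on, which is the CAM update the failover needs.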
How-To-Repeat:

Get two switches, hook them together.

On a FreeBSD 8.2 host with two NICs, configure them to use lagg failover. Plug the two NICs into the two different switches.

Yank out the cable to simulate a network failure of the currently-active network card, while running a ping from another host to that host.

Observe that the affected host stops responding to ping.

Send some random traffic from the affected host, and observe that it now starts responding to ping again. Alternatively, wait until the CAM tables in the switch time out.
Comment 1 Remko Lodder freebsd_committer 2011-04-06 21:38:58 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-net

Reassign to -net 

http://www.freebsd.org/cgi/query-pr.cgi?pr=156226 

Date: Mon, 12 Mar 2012 11:51:42 +1100
Comment 2 Ryan Steinmetz freebsd_committer freebsd_triage 2012-10-23 02:12:40 UTC
This isn't a solution, but is a workaround that I've been using for a
bit:
http://people.freebsd.org/~zi/ping

You can drop the file in /usr/local/etc/rc.d and then add
ping_enable="YES" to /etc/rc.conf

Basically, it sends an ICMP echo request to your default gateway every 5
seconds by default, which forces the switch to update its FIB.  This
means that within at most 5 seconds after lagg completes an
interface failover, you should regain network connectivity.

-r
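
The idea behind this workaround can be sketched in Python as follows. This is not the rc.d script linked above, whose contents are not reproduced here; the function names, the injectable run and sleep parameters and the gateway address 192.0.2.1 are assumptions made for the example. It relies only on ping(8) accepting -c for the packet count.

```python
import subprocess
import time


def ping_command(gateway, count=1):
    """Argument list for a single ping(8) run against the gateway."""
    return ["ping", "-c", str(count), gateway]


def keep_switch_table_fresh(gateway, interval=5.0, rounds=None,
                            run=subprocess.call, sleep=time.sleep):
    """Ping the gateway every `interval` seconds.

    Any outgoing frame lets the switch relearn the host MAC on the
    currently active port. `rounds=None` means loop forever; `run` and
    `sleep` are injectable so the loop can be exercised without a network.
    Returns the number of pings issued.
    """
    sent = 0
    while rounds is None or sent < rounds:
        run(ping_command(gateway),
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        sent += 1
        sleep(interval)
    return sent

# Example (hypothetical gateway address, runs until interrupted):
#   keep_switch_table_fresh("192.0.2.1")
```

The worst-case outage after a failover is then roughly the chosen interval, at the cost of constant background traffic.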
Comment 3 Kubilay Kocak freebsd_committer freebsd_triage 2014-12-20 04:17:09 UTC
Created attachment 150794 [details]
Send gratuitous ARP on primary port status change

Add patch posted to freebsd-net [1] by Tushar Mulkar from Sandvine in February 2012.

[1] https://lists.freebsd.org/pipermail/freebsd-net/2012-February/031328.html
Comment 4 weberge42 2015-10-02 17:12:36 UTC
Any news on this?
I have the same problem - lagg is just not usable without notifying the switch.
Comment 5 Eugene Grosbein 2015-10-03 06:28:27 UTC
(In reply to weberge42 from comment #4)

AFAIK, link aggregation (lagg) is not supposed to be used this way, without special switch configuration. And lagg IS usable if you do things right: configure your switch(es) so that they know which ports are aggregated and can manage their forwarding databases to do the right thing.

This PR should be closed as pilot error, I guess.
Comment 6 weberge42 2015-10-03 21:22:10 UTC
(In reply to eugen from comment #5)

I know that link aggregation needs to be configured on the switch(es) too. But I'm not talking about link aggregation - I (and the opener) am talking about the simple failover supported by lagg(4):

 failover     Sends traffic only through the active port.  If the master
              port becomes unavailable, the next active port is used.  The
              first interface added is the master port; any interfaces
              added after that are used as failover devices.


Without sending traffic from the host, traffic only flows again once the switch(es) update their tables.
Comment 7 Eugene Grosbein 2015-10-03 21:40:13 UTC
(In reply to weberge42 from comment #6)

In case of a real failure, the physical link goes down and the switch updates its table at once. There are patches for the em(4)/igb(4) drivers that bring the link down in case of a manual "ifconfig em0 down" command:

http://www.grosbein.net/freebsd/patches/em_sysctl-9.3S.diff.gz
http://www.grosbein.net/freebsd/patches/igb_sysctl-9.3S.diff.gz
Comment 8 weberge42 2015-10-03 21:59:25 UTC
(In reply to eugen from comment #7)

Yes, but if you have the following setup:

Server NIC1 - Switch 1
Server NIC2 - Switch 2

If NIC1 is active and Switch 1 fails, NIC2 becomes active and it takes ages to see traffic flowing again.

I can't remember whether the standby link is up with no traffic flowing or whether it is really down (down as in disabled).

I will try to investigate this a little further when I have access to the switches again (tried with a FortiSwitch 548D, latest firmware).
Comment 9 Kubilay Kocak freebsd_committer freebsd_triage 2015-10-04 02:12:49 UTC
If lagg(4) doesn't currently send a gratuitous ARP on failover/failback, then it probably should. I was under the impression that it did, but that may (still) not be the case.

Unless your switches have portfast enabled, failover/failback scenarios can take a long time to 'recover' and start passing traffic.

We need someone to review this and get it committed.
Comment 10 weberge42 2015-10-04 06:43:30 UTC
Maybe the smaller patch from https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=201916 is sufficient.
Comment 11 Eugene Grosbein 2015-10-05 07:28:55 UTC
(In reply to weberge42 from comment #8)

Each switch has its own "MAC aging" time, so one may configure the switch so that it does not "take ages to see traffic flowing again".

And yes, lagg's failover needs the NIC drivers to bring the link down in case of a voluntary "ifconfig down".
Comment 12 weberge42 2015-10-05 07:52:19 UTC
Yes, but each switch has different default and minimum values.
FortiSwitch: 10s, HP ProCurve: 60s, don't know about Cisco.

So that would mean XXs of no service in case of a failover, which may be acceptable if it's short enough. But what if you can't (or are not allowed to) change these settings?

I don't quite understand what you mean by:

> And yes, lagg's failover needs that NIC drivers bring link down in case of voluntary "ifconfig down"

Who issues the ifconfig down? I just don't see any relationship between a manual ifconfig and a failover.
Can you please elaborate on this a bit more? Maybe I'm missing something.
Comment 13 Eugene Grosbein 2015-10-05 08:36:00 UTC
(In reply to weberge42 from comment #12)

> each switch has different default and minimum values

Yes, and one should not depend on default values but configure switches as required.

> But what if you can't (or are not allowed to) change these settings?

Too bad for you if you have manageable equipment but can't manage it.

> I just don't see any relationship between a manual ifconfig and a failover. Can you please elaborate on this a bit more? Maybe I'm missing something.

In case of a "non-manual" failure, the physical link generally goes down and the corresponding switch changes its FIB at once.

If one connects redundant layer-2 links to different switches, one should use some kind of signalling protocol like RSTP.
Comment 14 pvz 2015-10-05 09:32:49 UTC
The patch could probably be simplified. I suggested when I created the case four years ago that the solution might be to send out a gratuitous ARP.

However, that would not seem to be a correct generic solution, in that you're adding coupling to IPv4 and IPv6, where the problem actually occurs further down the network stack.

It would be entirely appropriate to send out *any* type of broadcast Ethernet frame. For example, VMware ESXi uses RARP in this scenario: http://rickardnobel.se/vswitch-notify-switches-setting/

This would be a valid approach no matter whether the switches in question are connected to an IPv4 network, an IPv6-only network, or even, God forbid, IPX or PPPoE or anything else that might run over Ethernet (not necessarily IP) ... The only goal is to update the MAC forwarding tables in the switches; what the actual payload in the Ethernet frame is doesn't matter. You might even be able to send out a completely empty broadcast frame.

This is of course then also complicated by the hypothetical but plausible scenario where you might have VLANs configured on top of the lagg, where these "notify packets" would have to be sent for each VLAN, because switches typically have per-VLAN forwarding tables.
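
As a rough illustration of the "any broadcast frame, once per VLAN" idea, the following Python sketch builds a RARP request frame and, optionally, 802.1Q-tagged copies of it. It is only a sketch under assumptions: the function names, the MAC address and the VLAN IDs are invented for the example, the exact frame VMware ESXi emits is not reproduced here, and in practice one would more likely send an untagged frame on each vlan(4) interface and let the kernel add the tag.

```python
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_VLAN = 0x8100   # 802.1Q tag protocol identifier
ETHERTYPE_RARP = 0x8035


def build_rarp_notify(src_mac: bytes, vlan_id=None) -> bytes:
    """Broadcast RARP request whose only purpose is MAC learning."""
    if len(src_mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    header = BROADCAST_MAC + src_mac
    if vlan_id is not None:
        if not 1 <= vlan_id <= 4094:
            raise ValueError("VLAN ID must be in 1..4094")
        # 802.1Q tag: TPID followed by TCI (priority 0, DEI 0, VLAN ID).
        header += struct.pack("!HH", ETHERTYPE_VLAN, vlan_id)
    header += struct.pack("!H", ETHERTYPE_RARP)
    # Same layout as ARP; opcode 3 is "request reverse".
    body = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
    body += src_mac + b"\x00" * 4   # sender hardware / protocol address
    body += src_mac + b"\x00" * 4   # target hardware / protocol address
    return header + body


def notify_frames(src_mac: bytes, vlan_ids=()):
    """One untagged frame plus one tagged frame per configured VLAN."""
    frames = [build_rarp_notify(src_mac)]
    frames += [build_rarp_notify(src_mac, vid) for vid in vlan_ids]
    return frames


# Hypothetical MAC and VLAN IDs.
frames = notify_frames(bytes.fromhex("020000000001"), vlan_ids=(10, 20))
```

Nothing in the frame depends on an IP address, so the same notification works on IPv4, IPv6-only or non-IP segments.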
Comment 15 Eugene Grosbein 2015-10-05 09:45:17 UTC
(In reply to pvz from comment #14)

> the hypothetical but plausible scenario

That's a completely real scenario. My FreeBSD-based PPPoE BRASes run hundreds of VLANs over lagg port-channels, and each VLAN carries PPPoE frames only, no IP traffic at all. Of course, I use LACP for failover (and load balancing too).
And LACP does not send its signalling traffic over each VLAN (and should not).

Why don't people want to use the layer-2 signalling protocols that have already been invented, instead of trying to reinvent the wheel every time?
Comment 16 pvz 2015-10-05 13:23:49 UTC
In this case it's not even a case of a protocol though, it's rather just some functionality to prod even the dumbest stack of switches that are smart enough to know about MAC learning to do the right thing.

I think though that this could well be made into a really simple user-space daemon with a few hundred lines of code and no configuration file, though it would be "cleaner" (for a regular end user) for it to just be built into lagg.

The code would be something like: every now and then, iterate over all laggs. If the master has changed, send out an ethernet frame on the interface, and on any VLAN interfaces on top of the lagg. Would that be something that could be included in the base system as a feature rather than just adding the functionality to lagg?
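
The detection half of such a daemon might look like the Python sketch below. It assumes that `ifconfig laggN` prints one "laggport: NAME flags=N<...>" line per member port with ACTIVE set on the port currently in use; the sample output strings, the function names and the regular expression are illustrative assumptions, and sending the actual notification frame is left out.

```python
import re

# Matches lines such as "laggport: em0 flags=5<MASTER,ACTIVE>".
LAGGPORT_RE = re.compile(
    r"^\s*laggport:\s+(\S+)\s+flags=[0-9a-fA-F]+<([^>]*)>", re.MULTILINE)


def active_port(ifconfig_output: str):
    """Return the name of the port flagged ACTIVE, or None if there is none."""
    for name, flags in LAGGPORT_RE.findall(ifconfig_output):
        if "ACTIVE" in flags.split(","):
            return name
    return None


def detect_failover(previous, ifconfig_output: str):
    """Return (current_active_port, changed_since_previous)."""
    current = active_port(ifconfig_output)
    changed = (previous is not None and current is not None
               and current != previous)
    return current, changed


# Assumed sample output before and after pulling the cable on em0.
BEFORE = """lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tlaggproto failover
\tlaggport: em0 flags=5<MASTER,ACTIVE>
\tlaggport: em1 flags=0<>
"""
AFTER = """lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tlaggproto failover
\tlaggport: em0 flags=1<MASTER>
\tlaggport: em1 flags=4<ACTIVE>
"""

port, _ = detect_failover(None, BEFORE)
new_port, changed = detect_failover(port, AFTER)
```

A polling loop would call detect_failover() periodically and, whenever changed is true, send one broadcast frame on the lagg and on each VLAN interface configured on top of it.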
Comment 17 weberge42 2015-10-05 14:06:43 UTC
> Yes, and one should not depend on default values but configure switches as required.

Bad luck if the MINIMUM value is 60s.

> If one connects redundant layer-2 links to different switches, one should use some kind of signalling protocol like RSTP.

Would not be needed if lagg announced the change. Simpler than fiddling around with protocols. It's also not guaranteed that the infrastructure one is using supports/allows other protocols.

> Too bad for you if you have manageable equipment but can't manage it.
One does not always have the access / permissions to do so. Company rules or whatever, and the people in charge of the network are not always cooperative. But this is another point and not part of THIS problem.

As for the rest. I'm glad it works for you.
We have no need for PPPoE. Just for IPv4. Announcing the changes is the simplest solution.

If failover using lagg with different switches is not supported or is considered exotic, the feature should be removed. The docs do not mention that this is the case.

> I think though that this could well be made into a really simple user-space daemon 

Would be possible, I think, but I would vote against it.
IMHO this is clearly a task for the driver.
Comment 18 Eugene Grosbein 2015-10-05 14:46:22 UTC
(In reply to weberge42 from comment #17)

> If failover using lagg with different switches is not supported or is considered exotic, the feature should be removed. The docs do not mention that this is the case.

lagg failover works just fine at present with links connected to switch(es) that share a single FIB (just one switch or a stack of switches) and should not be removed. I agree that the documentation may need some warnings against unsupported configurations, but it cannot foresee all kinds of network setups built on wrong assumptions. OTOH, it can give a hint towards other known ways to build failure-resistant setups, like RSTP etc. A short hint only, because a man page is not a textbook.
Comment 19 weberge42 2015-10-05 21:20:23 UTC
What would be the correct setup using 1 server, 2 NICs and 2 non-stacked switches to achieve a basic failover scenario for IPv4? RSTP (as STP usually is) sounds too complicated for this.

The description used in the manpage is basically the same as for Linux bonding in active-passive mode. That setup works fine across 2 single, non-stacked switches. I guess the driver sends a gratuitous ARP (or RARP, like VMware) to the switches.
Comment 20 Eugene Grosbein 2015-10-06 06:40:17 UTC
RSTP is a very simple thing. And it is a general solution suitable for IPv4, IPv6, VLAN trunks, PPPoE, IPX or anything else, because it deals with links at layer 2.
Comment 21 pvz 2015-10-06 10:26:32 UTC
I would not call RSTP "simple" compared to sending a single broadcast frame (of any protocol) out on the new active NIC on a failover, especially not once you consider that to use it you have to bridge the two NICs together and make sure to configure your costs appropriately so that you don't accidentally turn your server into an unintentional bridge in your network, cutting off the link you actually do want active.

There's a reason other vendors have implemented this simple and elegant solution for this problem. It is because it works, and will work in any correctly functioning Ethernet network (with the exception of possible security features).

Implementing this feature in the lagg failover mode does not preclude sysadmins who dislike this mode from implementing other solutions such as LACP (with the other end being an MLAG) or bridging with RSTP. It does make FreeBSD usable in scenarios where sysadmins might not have much say about the network operations of the environment at large.
Comment 22 Eugene Grosbein 2015-10-06 12:36:00 UTC
(In reply to pvz from comment #21)

A "single broadcast frame" won't work in all cases, e.g. multiple VLANs over lagg, but RSTP will. And there is no rocket science in creating a bridge, as RSTP runs by default on bridges under FreeBSD.
Comment 23 Steven Hartland freebsd_committer 2015-11-17 14:43:28 UTC
Review of patch for HEAD can be found here:
https://reviews.freebsd.org/D4111
Comment 24 commit-hook freebsd_committer 2015-12-15 16:03:02 UTC
A commit references this bug:

Author: smh
Date: Tue Dec 15 16:02:12 UTC 2015
New revision: 292275
URL: https://svnweb.freebsd.org/changeset/base/292275

Log:
  Fix lagg failover due to missing notifications

  When using lagg failover mode neither Gratuitous ARP (IPv4) or Unsolicited
  Neighbour Advertisements (IPv6) are sent to notify other nodes that the
  address may have moved.

  This results in slow failover, dropped packets and network outages for the
  lagg interface when the primary link goes down.

  We now use the new if_link_state_change_cond with the force param set to
  allow lagg to force through link state changes and hence fire a
  ifnet_link_event which are now monitored by rip and nd6.

  Upon receiving these events each protocol trigger the relevant
  notifications:
  * inet4 => Gratuitous ARP
  * inet6 => Unsolicited Neighbour Announce

  This also fixes the carp IPv6 NA's that stopped working after r251584 which
  added the ipv6_route__llma route.

  The new behaviour can be controlled using the sysctls:
  * net.link.ether.inet.arp_on_link
  * net.inet6.icmp6.nd6_on_link

  Also removed unused param from lagg_port_state and added descriptions for the
  sysctls while here.

  PR:		156226
  MFC after:	1 month
  Sponsored by:	Multiplay
  Differential Revision:	https://reviews.freebsd.org/D4111

Changes:
  head/sys/net/if.c
  head/sys/net/if_lagg.c
  head/sys/net/if_lagg.h
  head/sys/net/if_var.h
  head/sys/netinet/if_ether.c
  head/sys/netinet/if_ether.h
  head/sys/netinet/in_var.h
  head/sys/netinet/ip_carp.c
  head/sys/netinet6/in6.c
  head/sys/netinet6/in6_var.h
  head/sys/netinet6/nd6.c
  head/sys/netinet6/nd6.h
  head/sys/netinet6/nd6_nbr.c
Comment 25 Steven Hartland freebsd_committer 2015-12-17 22:44:57 UTC
Fix reverted, due to more work needed.
Comment 26 Bryan Drewery freebsd_committer 2019-03-14 00:09:27 UTC
Steven, is there a status update here?
Comment 27 Bryan Drewery freebsd_committer 2019-03-14 00:14:01 UTC
Discussion that led to the revert: https://lists.freebsd.org/pipermail/svn-src-all/2015-December/115255.html