I have an environment where I want failover between two links to the same subnet on two different switches (some kind of HP ProCurve switches; I don't have the exact model number handy). The switches have not been configured to treat the ports specially.

When the link goes down, lagg switches its ACTIVE interface properly; however, no gratuitous ARP is sent out on the link. This means that the CAM table on the switch is never updated.

More information is available in a thread on the FreeBSD forums: https://forums.freebsd.org/showthread.php?p=102093

This problem has been known since at least FreeBSD 8.0. It does not seem to be specific to the em driver; it seems to happen with bce as well.

Fix:
Make the lagg driver send out a gratuitous ARP on failover. This will update the CAM tables and prevent this problem from occurring.

How-To-Repeat:
Get two switches and hook them together. On a FreeBSD 8.2 host with two NICs, configure them to use lagg failover. Plug the two NICs into the two different switches. While running a ping from another host to that host, yank out the cable of the currently active network card to simulate a network failure. Observe that the affected host stops responding to ping. Send some random traffic from the affected host, and observe that it now starts responding to ping again. Alternatively, wait until the CAM tables in the switch time out.
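For illustration only, here is a minimal Python sketch of the frame the proposed fix would put on the wire: a broadcast ARP request in which the sender and target protocol addresses are both the host's own IP. The function name `build_gratuitous_arp` and the example addresses are assumptions made for this sketch; it is not the lagg patch, and it only builds the bytes. Injecting them on FreeBSD would typically go through bpf(4), which is left out here.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_gratuitous_arp(src_mac: str, ip: str) -> bytes:
    """Build a broadcast Ethernet frame carrying a gratuitous ARP request.

    Hypothetical helper for illustration: sender IP == target IP, so every
    switch on the path sees src_mac arriving on the newly active port.
    """
    mac = mac_to_bytes(src_mac)
    addr = socket.inet_aton(ip)
    eth_header = BROADCAST_MAC + mac + struct.pack("!H", ETHERTYPE_ARP)
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, op=1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac + addr           # sender hardware / protocol address
    arp += b"\x00" * 6 + addr   # target hardware unknown, target IP = own IP
    return eth_header + arp


frame = build_gratuitous_arp("00:11:22:33:44:55", "192.0.2.10")
```

The switch only cares about the source MAC in the Ethernet header; the ARP payload additionally refreshes the ARP caches of neighbouring hosts.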
Responsible Changed
From-To: freebsd-bugs->freebsd-net

Reassign to -net

http://www.freebsd.org/cgi/query-pr.cgi?pr=156226
Date: Mon, 12 Mar 2012 11:51:42 +1100
This isn't a solution, but it is a workaround that I've been using for a bit: http://people.freebsd.org/~zi/ping

You can drop the file in /usr/local/etc/rc.d and then add ping_enable="YES" to /etc/rc.conf.

Basically, it sends an ICMP echo request to your default gateway every 5 seconds by default, which forces the switch to update its FIB. This means that you should regain network connectivity at most 5 seconds after lagg completes an interface failover.

-r
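The idea behind the workaround can be sketched in a few lines of Python. This is not the contents of the linked rc.d script (which is not reproduced in this PR); the function names, the injectable `run` and `sleep` parameters, and the gateway address are assumptions made for illustration.

```python
import subprocess
import time


def build_ping_command(gateway: str) -> list:
    """One ICMP echo request; -c 1 makes ping exit after a single packet."""
    return ["ping", "-c", "1", gateway]


def keepalive(gateway, interval=5.0, iterations=None,
              run=subprocess.run, sleep=time.sleep):
    """Send one echo request every `interval` seconds.

    Any outgoing frame re-teaches the switch which port the host's MAC
    lives on, so the outage after a failover is bounded by `interval`.
    `iterations=None` loops forever; a number is handy for testing.
    """
    sent = 0
    while iterations is None or sent < iterations:
        run(build_ping_command(gateway),
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
        sent += 1
        sleep(interval)
    return sent


if __name__ == "__main__":
    keepalive("192.0.2.1")  # example gateway address (assumption)
```

The reply does not matter here; the workaround depends only on the host transmitting something on the newly active port.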
Created attachment 150794
Send gratuitous ARP on primary port status change

Add patch posted to freebsd-net [1] by Tushar Mulkar from Sandvine in February 2012.

[1] https://lists.freebsd.org/pipermail/freebsd-net/2012-February/031328.html
Any news on this? I have the same problem - lagg is just not usable without notifying the switch.
(In reply to weberge42 from comment #4)

AFAIK, link aggregation (lagg) is not supposed to be used this way, without special switch configuration. And lagg IS usable if you do things right: configure your switch(es) so that they know which ports are aggregated and can manage their information databases to do the right thing.

This PR should be closed as pilot error, I guess.
(In reply to eugen from comment #5)

I know that link aggregation needs to be configured on the switch(es) too. But I'm not talking about link aggregation - I (and the opener) am talking about the simple failover supported by lagg(4):

     failover     Sends traffic only through the active port. If the master port becomes unavailable, the next active port is used. The first interface added is the master port; any interfaces added after that are used as failover devices.

Unless the host sends traffic itself, traffic flow only works again once the switch(es) update their tables.
(In reply to weberge42 from comment #6)

In case of a real failure, the physical link goes down and the switch updates its table at once. There are patches for the em(4)/igb(4) drivers that bring the link down in case of a manual "ifconfig em0 down" command:

http://www.grosbein.net/freebsd/patches/em_sysctl-9.3S.diff.gz
http://www.grosbein.net/freebsd/patches/igb_sysctl-9.3S.diff.gz
(In reply to eugen from comment #7)

Yes, but consider the following setup:

Server NIC1 - Switch 1
Server NIC2 - Switch 2

If NIC1 is active and Switch 1 fails, NIC2 becomes active and it takes ages to see traffic flowing again. I can't remember whether the standby link is up with no traffic flowing or whether it is really down (down as in disabled). I will try to investigate this a little further when I have access to the switches again (tried with a FortiSwitch 548D, latest firmware).
If lagg(4) doesn't currently send a gratuitous ARP on failover/failback, then it probably should. I was under the impression that it did, but that may (still) not be the case.

Unless your switches have portfast enabled, failover/failback scenarios can take a long time to 'recover' and start passing traffic.

We need someone to review this and get it committed.
Maybe the smaller patch from https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=201916 is sufficient.
(In reply to weberge42 from comment #8)

Each switch has its own "MAC aging" time, so one may configure the switch so that it does not "take ages to see traffic flowing again".

And yes, lagg's failover needs the NIC drivers to bring the link down in case of a voluntary "ifconfig down".
Yes, but each switch has different default and minimum values. FortiSwitch: 10s, HP ProCurve: 60s, don't know about Cisco. So that would mean XX seconds without service in case of a failover, which may be acceptable if it's short enough. But what if you can't (or are not allowed to) change these settings?

I don't quite understand what you mean by:

> And yes, lagg's failover needs that NIC drivers bring link down in case of
> voluntary "ifconfig down"

Who issues the ifconfig down? I just don't see any relationship between a manual ifconfig and a failover. Can you please elaborate on this a bit more? Maybe I'm missing something.
(In reply to weberge42 from comment #12)

> each switch has different default and minimum values

Yes, and one should not depend on default values but configure switches as required.

> But what if you can't (or not allowed) to change this settings ?

Too bad for you if you have manageable equipment but can't manage it.

> I just don't see any relationship between manual ifconfig in case of a failover. Can you please elaborate this a bit more ? Maybe i'm missing something.

In case of a "non-manual" failure, the physical link generally goes down and the corresponding switch changes its FIB at once.

If one connects redundant layer-2 links to different switches, one should use some kind of signalling protocol like RSTP.
The patch could probably be simplified.

I suggested when I created the case four years ago that the solution might be to send out a gratuitous ARP. However, that would not seem to be a correct generic solution, in that you're adding coupling to IPv4 and IPv6, where the problem actually occurs further down the network stack.

It would be entirely appropriate to send out *any* type of broadcast Ethernet frame. For example, VMware ESXi uses RARP in this scenario: http://rickardnobel.se/vswitch-notify-switches-setting/

This would be a valid approach no matter whether the switches in question are connected to an IPv4 network, an IPv6-only network, or even, God forbid, IPX or PPPoE or anything else that might run over Ethernet (not necessarily IP) ... The only goal is to update the MAC forwarding tables in the switches; what the actual payload of the Ethernet frame is doesn't matter. You might even be able to send out a completely empty broadcast frame.

This is of course then also complicated by the hypothetical but plausible scenario where you might have VLANs configured on top of the lagg, where these "notify packets" would have to be sent for each VLAN, because switches typically have per-VLAN forwarding tables.
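As a rough sketch of the protocol-independent variant, the following Python builds a broadcast RARP frame and, optionally, an 802.1Q-tagged copy per VLAN. The helper names (`build_rarp_notify`, `frames_for_lagg`), the example MAC, and the VLAN IDs are assumptions for illustration; the exact payload VMware sends is not specified in this PR, and a real implementation might instead transmit on each vlan(4) interface and let the stack add the tag.

```python
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_RARP = 0x8035
ETHERTYPE_VLAN = 0x8100  # 802.1Q tag protocol identifier


def build_rarp_notify(src_mac: bytes, vlan_id=None) -> bytes:
    """Build a broadcast RARP request whose only job is MAC learning.

    Hypothetical helper: no IP address is needed, so the frame is equally
    valid on IPv4, IPv6-only, or non-IP segments.
    """
    header = BROADCAST_MAC + src_mac
    if vlan_id is not None:
        if not 1 <= vlan_id <= 4094:
            raise ValueError("VLAN ID must be in 1..4094")
        # TCI with priority 0 and DEI 0 is just the 12-bit VLAN ID.
        header += struct.pack("!HH", ETHERTYPE_VLAN, vlan_id)
    header += struct.pack("!H", ETHERTYPE_RARP)
    # htype=1 (Ethernet), ptype=0x0800, hlen=6, plen=4, op=3 (reverse request)
    body = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
    body += src_mac + b"\x00" * 4   # sender MAC, sender IP unknown
    body += src_mac + b"\x00" * 4   # target MAC, target IP unknown
    return header + body


def frames_for_lagg(src_mac: bytes, vlan_ids) -> list:
    """One untagged frame plus one tagged frame per VLAN on top of the lagg."""
    return [build_rarp_notify(src_mac)] + [
        build_rarp_notify(src_mac, vid) for vid in vlan_ids
    ]


mac = bytes.fromhex("001122334455")
frames = frames_for_lagg(mac, [10, 20])
```

Sending one frame per VLAN reflects the point above that switches typically keep per-VLAN forwarding tables.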
(In reply to pvz from comment #14)

> the hypothetical but plausible scenario

That's a completely real scenario. My FreeBSD-based PPPoE BRASes run hundreds of VLANs over lagg port-channels, and each VLAN carries PPPoE frames only, no IP traffic at all. Of course, I use LACP for failover (and load balancing too). And LACP does not send its signalling traffic over each VLAN (and should not).

Why don't people want to use the layer-2 signalling protocols that have already been invented, instead of trying to reinvent the wheel every time?
In this case it's not even a case of a protocol though, it's rather just some functionality to prod even the dumbest stack of switches that are smart enough to know about MAC learning to do the right thing.

I think though that this could well be made into a really simple user-space daemon with a few hundred lines of code and no configuration file, though it would be "cleaner" (for a regular end user) for it to just be built into lagg. The code would be something like: every now and then, iterate over all laggs. If the master has changed, send out an Ethernet frame on the interface, and on any VLAN interfaces on top of the lagg.

Would that be something that could be included in the base system as a feature rather than just adding the functionality to lagg?
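The polling part of such a daemon might look roughly like the Python sketch below. It assumes that `ifconfig laggN` lists member ports as `laggport: em0 flags=5<MASTER,ACTIVE>` lines; the function names and the `notify` callback (which would send the broadcast frame) are hypothetical and exist only for this illustration.

```python
import re
import subprocess
import time

# Matches lines such as "laggport: em0 flags=5<MASTER,ACTIVE>".
LAGGPORT_RE = re.compile(
    r"^\s*laggport:\s+(\S+)\s+flags=[0-9a-fA-F]+<([^>]*)>", re.MULTILINE
)


def active_port(ifconfig_output: str):
    """Return the name of the port flagged ACTIVE, or None if there is none."""
    for name, flags in LAGGPORT_RE.findall(ifconfig_output):
        if "ACTIVE" in flags.split(","):
            return name
    return None


def failover_happened(previous, current) -> bool:
    """A failover is a change to a different, existing active port."""
    return current is not None and current != previous


def read_ifconfig(lagg: str) -> str:
    """Fetch the current interface description from ifconfig(8)."""
    result = subprocess.run(["ifconfig", lagg], capture_output=True,
                            text=True, check=True)
    return result.stdout


def watch(lagg, notify, interval=1.0, rounds=None,
          read=read_ifconfig, sleep=time.sleep):
    """Poll one lagg and call notify(lagg, port) whenever the active port moves."""
    previous = active_port(read(lagg))
    done = 0
    while rounds is None or done < rounds:
        sleep(interval)
        current = active_port(read(lagg))
        if failover_happened(previous, current):
            notify(lagg, current)
        if current is not None:
            previous = current
        done += 1
```

Polling keeps the sketch short; a real daemon could presumably react to link-state events instead, which would shorten the window between failover and notification.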
> Yes, and one should not depend on default values but configure switches as required.

Bad luck if the MINIMUM value is 60s.

> If one connects redundant layer-2 links to different switches, one should use some kind of signalling protocol like RSTP.

That would not be needed if lagg announced the change. It is simpler than fiddling around with protocols. It's also not guaranteed that the infrastructure one is using supports/allows other protocols.

> Too bad for you if you have manageable equipment but can't manage it.

One does not always have the access / permissions to do so. Company rules or whatever, and the people in charge of the network are not always cooperative. But this is another point and not part of THIS problem.

As for the rest: I'm glad it works for you. We have no need for PPPoE, just for IPv4. Announcing the changes is the simplest solution.

If failover using lagg with different switches is not supported or is considered exotic, the feature should be removed. The docs do not mention that this is the case.

> I think though that this could well be made into a really simple user-space daemon

That would be possible, I think, but I would vote against it. IMHO this is clearly a task for the driver.
(In reply to weberge42 from comment #17)

> If failover using lagg with different switches is not supported or is considered exotic, the feature should be removed. The docs do not mention that this is the case.

lagg failover works just fine at present using links connected to switch(es) with a single FIB (just one switch or a stack of switches) and should not be removed.

I agree that the documentation may need some warnings against unsupported configurations, but it cannot foretell all kinds of network setups built on wrong assumptions. OTOH, it can give a hint towards other known ways to build failure-resistant setups, like RSTP etc. Just a short hint, because a man page is not a textbook.
What would be the correct setup using 1 server, 2 NICs and 2 non-stacked switches to achieve a basic failover scenario for IPv4? RSTP (as STP usually is) sounds too complicated for this.

The description used in the man page is basically the same as for Linux bonding in active-passive mode. That setup works fine across 2 single, non-stacked switches. I guess the driver sends a gratuitous ARP (or RARP, like VMware) to the switches.
RSTP is a very simple thing. And it is a general solution suitable for IPv4, IPv6, VLAN trunks, PPPoE, IPX or anything else, because it deals with links at layer 2.
I would not call RSTP "simple" compared to sending a single broadcast frame (of any protocol) out on the new active NIC on a failover, especially not once you consider that to use it you have to bridge together the two NICs and make sure to configure your costs appropriately, so that you don't accidentally make your server an unintentional bridge in your network, cutting off the link you actually do want active.

There's a reason other vendors have implemented this simple and elegant solution for this problem. It is because it works, and will work in any correctly functioning Ethernet network (with the exception of possible security features).

Implementing this feature in the lagg failover mode does not preclude sysadmins who dislike this mode from implementing other solutions such as LACP (with the other end being an MLAG) or bridging with RSTP. It does make FreeBSD usable in scenarios where sysadmins might not have much say about the network operations of the environment at large.
(In reply to pvz from comment #21)

A "single broadcast frame" won't work for all cases, e.g. multiple VLANs over lagg, but RSTP will. And there is no rocket science in creating a bridge, as RSTP runs by default on bridges under FreeBSD.
Review of patch for HEAD can be found here: https://reviews.freebsd.org/D4111
A commit references this bug:

Author: smh
Date: Tue Dec 15 16:02:12 UTC 2015
New revision: 292275
URL: https://svnweb.freebsd.org/changeset/base/292275

Log:
  Fix lagg failover due to missing notifications

  When using lagg failover mode neither Gratuitous ARP (IPv4) or Unsolicited Neighbour Advertisements (IPv6) are sent to notify other nodes that the address may have moved.

  This results is slow failover, dropped packets and network outages for the lagg interface when the primary link goes down.

  We now use the new if_link_state_change_cond with the force param set to allow lagg to force through link state changes and hence fire a ifnet_link_event which are now monitored by rip and nd6.

  Upon receiving these events each protocol trigger the relevant notifications:
  * inet4 => Gratuitous ARP
  * inet6 => Unsolicited Neighbour Announce

  This also fixes the carp IPv6 NA's that stopped working after r251584 which added the ipv6_route__llma route.

  The new behavour can be controlled using the sysctls:
  * net.link.ether.inet.arp_on_link
  * net.inet6.icmp6.nd6_on_link

  Also removed unused param from lagg_port_state and added descriptions for the sysctls while here.

  PR: 156226
  MFC after: 1 month
  Sponsored by: Multiplay
  Differential Revision: https://reviews.freebsd.org/D4111

Changes:
  head/sys/net/if.c
  head/sys/net/if_lagg.c
  head/sys/net/if_lagg.h
  head/sys/net/if_var.h
  head/sys/netinet/if_ether.c
  head/sys/netinet/if_ether.h
  head/sys/netinet/in_var.h
  head/sys/netinet/ip_carp.c
  head/sys/netinet6/in6.c
  head/sys/netinet6/in6_var.h
  head/sys/netinet6/nd6.c
  head/sys/netinet6/nd6.h
  head/sys/netinet6/nd6_nbr.c
Fix reverted, due to more work needed.
Steven, is there a status update here?
Discussion that led to the revert: https://lists.freebsd.org/pipermail/svn-src-all/2015-December/115255.html
Triage:
* set the importance that's appropriate for a feature
* clear flags for merge targets that are end of life.