Bug 269908 - CARP feature breaks the network
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: freebsd-net (Nobody)
Reported: 2023-03-02 07:38 UTC by franklin.suvi@gmail.com
Modified: 2024-03-19 18:28 UTC

Attachments
snapshot of Wireshark screen (74.39 KB, image/png)
2023-03-02 07:38 UTC, franklin.suvi@gmail.com
tcpdump files from Machine1 (406.50 KB, application/vnd.tcpdump.pcap)
2023-03-03 04:13 UTC, franklin.suvi@gmail.com
tcpdump files from Machine2 (396.28 KB, application/vnd.tcpdump.pcap)
2023-03-03 04:14 UTC, franklin.suvi@gmail.com

Description franklin.suvi@gmail.com 2023-03-02 07:38:06 UTC
Created attachment 240523
snapshot of Wireshark screen

When a CARP interface fails over from MASTER to BACKUP, it responds to the gratuitous ARP (GARP) announcement of the incoming master with a "Duplicate use of <virtual ip> detected" GARP packet.

In the attached snapshot, the machine with physical MAC VMware_a7:e3:41 is the incoming-master node and the machine with physical MAC VMware_a7:0f:7f is the going-to-be-backup node. 

When the going-to-be-backup node responds, the Sender MAC address is its physical MAC address and the Sender IP address is the virtual IP address.
Comment 1 franklin.suvi@gmail.com 2023-03-02 07:39:28 UTC
This issue affects the learning of Cisco ACI devices.
Comment 2 Zhenlei Huang freebsd_committer freebsd_triage 2023-03-02 16:55:08 UTC
It is not easy to analyze this from the screenshot alone. Can you please provide tcpdump captures?

> This issue affects the learning of Cisco ACI devices.
I do not quite understand how it affects Cisco ACI. What behavior do you expect?
Comment 3 franklin.suvi@gmail.com 2023-03-03 04:12:31 UTC
Details:
Machine 1: 
    Physical MAC: 00:50:56:a7:0f:7f
    IP Address:   10.10.4.17
Machine 2: 
    Physical MAC: 00:50:56:a7:e3:41
    IP Address:   10.10.4.18
CARP: 
    Virtual MAC: 00:00:5e:00:01:01
    Virtual IP: 10.10.4.19

Steps followed:
1.  Configure CARP on Machine 1. 
    ifconfig nic0 vhid 1 pass testing alias 10.10.4.19/28 advskew 10
    This box becomes the MASTER
2.  Configure CARP on Machine 2. 
    ifconfig nic0 vhid 1 pass testing alias 10.10.4.19/28 advskew 20
    This box becomes the BACKUP
3.  Re-configure CARP on Machine 1, to trigger a failover.
    ifconfig nic0 vhid 1 pass testing alias 10.10.4.19/28 advskew 30
    Since the advskew value of Machine 1 is now higher than that of Machine 2, Machine 1 will become the BACKUP and Machine 2 will become the MASTER.
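
For reference, the expected outcome of step 3 can be checked with a small sketch. carp(4) describes advskew as measured in 1/256 of a second and added to the base advertisement interval, and the host that advertises more frequently is preferred as MASTER. The sketch assumes the default advbase of 1 second on both machines (the report does not state advbase). The function name is illustrative only and is not a FreeBSD API.

```python
from fractions import Fraction


def advertisement_interval(advbase: int, advskew: int) -> Fraction:
    """Seconds between CARP advertisements: advbase + advskew/256."""
    return advbase + Fraction(advskew, 256)


# Assumption: both machines use the default advbase of 1 second.
ADVBASE = 1

# advskew values after step 3 of the report.
machine1 = advertisement_interval(ADVBASE, 30)
machine2 = advertisement_interval(ADVBASE, 20)

# The host with the shorter interval advertises more often and is preferred.
preferred = "Machine 1" if machine1 < machine2 else "Machine 2"
print(preferred)  # Machine 2
```

Whether a BACKUP actually takes over also depends on settings such as net.inet.carp.preempt, which the report does not mention.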

Observation / Failure. 
At step 3, the moment Machine 2 becomes the MASTER, it makes the ARP announcement.
Machine 1, which is now in the BACKUP state and is supposed to stay quiet, responds to this announcement with a "Duplicate use of <ip> detected" GARP message. Interestingly, in this response the source MAC address is the physical MAC address while the source IP address is the virtual IP address. Please find attached the tcpdump files captured on both machines.

Due to this error, the Cisco ACI endpoint table gets messed up and routes traffic to the wrong device.
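
The anomaly described above can be expressed as a simple check, sketched here under stated assumptions. The IPv4 CARP virtual MAC is taken to be 00:00:5e:00:01:<vhid>, which matches the vhid 1 / 00:00:5e:00:01:01 pair listed in the details. The function names and the idea of checking single ARP fields are illustrative only and are not taken from the kernel code or from any capture tool.

```python
def carp_virtual_mac(vhid: int) -> str:
    """Virtual MAC for an IPv4 CARP vhid (assumed layout 00:00:5e:00:01:<vhid>)."""
    if not 1 <= vhid <= 255:
        raise ValueError("vhid must be in the range 1..255")
    return f"00:00:5e:00:01:{vhid:02x}"


def is_suspicious_announcement(sender_mac: str, sender_ip: str,
                               virtual_ip: str, vhid: int) -> bool:
    """True if an ARP frame claims the virtual IP with a MAC other than the virtual MAC."""
    return sender_ip == virtual_ip and sender_mac.lower() != carp_virtual_mac(vhid)


# Frame as described for Machine 1: physical MAC combined with the virtual IP.
print(is_suspicious_announcement("00:50:56:a7:0f:7f", "10.10.4.19", "10.10.4.19", 1))  # True

# Hypothetical frame that carries the virtual MAC for the same virtual IP.
print(is_suspicious_announcement("00:00:5e:00:01:01", "10.10.4.19", "10.10.4.19", 1))  # False
```

A frame of the first kind is what can lead a switch or fabric to associate the virtual IP with a physical MAC.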
Comment 4 franklin.suvi@gmail.com 2023-03-03 04:13:42 UTC
Created attachment 240549
tcpdump files from Machine1
Comment 5 franklin.suvi@gmail.com 2023-03-03 04:14:41 UTC
Created attachment 240550
tcpdump files from Machine2
Comment 6 franklin.suvi@gmail.com 2023-03-03 04:26:38 UTC
There appears to be another race condition, which could be similar to one of the issues fixed earlier:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=191832
Comment 7 franklin.suvi@gmail.com 2023-03-07 10:12:25 UTC
https://lists.freebsd.org/pipermail/freebsd-net/2006-November/012476.html

This mailing list thread seems to discuss the same problem.
Comment 8 Zhenlei Huang freebsd_committer freebsd_triage 2023-03-13 16:48:06 UTC
(In reply to franklin.suvi@gmail.com from comment #3)
> Steps followed:
> 1.  Configure CARP on Machine 1. 
>     ifconfig nic0 vhid 1 pass testing alias 10.10.4.19/28 advskew 10
>     This box becomes the MASTER
> 2.  Configure CARP on Machine 2. 
>     ifconfig nic0 vhid 1 pass testing alias 10.10.4.19/28 advskew 20
>     This box becomes the BACKUP
> 3.  Re-configure CARP on Machine 1, to trigger a failover.
>     ifconfig nic0 vhid 1 pass testing alias 10.10.4.19/28 advskew 30
>     Since now the advskew value of Machine 1 is higher than the Machine 2's value, Machine 1 will become the BACKUP and Machine 2 will become the MASTER.

I'm able to reproduce this on 13.1-RELEASE.
Comment 9 franklin.suvi@gmail.com 2023-04-13 09:17:55 UTC
Any updates on the fix?
Comment 10 Zhenlei Huang freebsd_committer freebsd_triage 2023-04-13 10:29:41 UTC
(In reply to franklin.suvi@gmail.com from comment #9)

Sorry, I'm busy working on some bugs related to VLAN PCP.
I'll re-check this PR this weekend.
Comment 11 Zhenlei Huang freebsd_committer freebsd_triage 2023-04-15 17:18:20 UTC
While testing CARP, I found multiple issues. A fix will not come immediately, so I suggest you try the following to see if it helps.

1. Although the example in the man page sets different advskew values on hosts A and B, I recommend against that; set the same advskew on both hosts instead. That way you can change the vhid state on either host.
2. The preferred way to make a host master or backup is `ifconfig nic0 vhid 1 state master` or `ifconfig nic0 vhid 1 state backup`.
3. For aliases, the recommended prefixlen / netmask is 32 / 255.255.255.255.
4. If the Cisco ACI endpoint table gets messed up, can you try setting only the virtual IP 10.10.4.19 on both hosts and see whether it helps? I.e., only `ifconfig nic0 vhid 1 advskew 20 10.10.4.19/28`.

Obviously, the fourth suggestion has a drawback: you lose the ability to reach a specific host via its own host IP (as opposed to the virtual one).