Bug 219453 - TCP MD5 broken for vlan interfaces created on lagg
Summary: TCP MD5 broken for vlan interfaces created on lagg
Status: Closed Works As Intended
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 11.0-STABLE
Hardware: amd64 Any
: --- Affects Some People
Assignee: Alexander Motin
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2017-05-22 10:46 UTC by Marek Zarychta
Modified: 2017-07-12 13:19 UTC (History)
5 users (show)

See Also:


Attachments
ifconfig output (3.68 KB, text/plain)
2017-05-25 05:52 UTC, Marek Zarychta
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Marek Zarychta 2017-05-22 10:46:28 UTC
After upgrade from 11.0-STABLE r318137 to 11.1-PRERELEASE TCP MD5 signatures cannot be verified, so bird session cannot be established.
Neither ISP, nor our side changed the configuration. Bird-1.6.3_1 was recompiled from port, but it doesn't fix the trouble.

# cat /etc/ipsec.conf 
flush ;

add x.x.x.y x.x.x.x tcp 0x1000 -A tcp-md5 "Password1234" ;
add x.x.x.x x.x.x.y tcp 0x1001 -A tcp-md5 "Password1234" ;

# setkey -D
x.x.x.x x.x.x.y
	tcp mode=any spi=4097(0x00001001) reqid=0(0x00000000)
	A: tcp-md5  3647334d 72483753 4c4d5733
	seq=0x00000000 replay=0 flags=0x00000040 state=mature 
	created: May 22 12:25:03 2017	current: May 22 12:35:06 2017
	diff: 603(s)	hard: 0(s)	soft: 0(s)
	last: May 22 12:25:09 2017	hard: 0(s)	soft: 0(s)
	current: 6016(bytes)	hard: 0(bytes)	soft: 0(bytes)
	allocated: 94	hard: 0	soft: 0
	sadb_seq=1 pid=37398 refcnt=1
x.x.x.y x.x.x.x
	tcp mode=any spi=4096(0x00001000) reqid=0(0x00000000)
	A: tcp-md5  3647334d 72483753 4c4d5733
	seq=0x00000000 replay=0 flags=0x00000040 state=mature 
	created: May 22 12:25:03 2017	current: May 22 12:35:06 2017
	diff: 603(s)	hard: 0(s)	soft: 0(s)
	last: May 22 12:25:08 2017	hard: 0(s)	soft: 0(s)
	current: 5680(bytes)	hard: 0(bytes)	soft: 0(bytes)
	allocated: 71	hard: 0	soft: 0
	sadb_seq=0 pid=37398 refcnt=1

# netstat -sp tcp | grep signature
	0 packets with matching signature received
	4601 packets with bad signature received
	42 times failed to make signature due to no SA
	0 times unexpected signature received
	30 times no signature provided by segment
Comment 1 dgilbert 2017-05-22 20:03:45 UTC
This also affects quagga.  AFAICT, we're not sending MD5, either.
Comment 2 dgilbert 2017-05-22 20:04:46 UTC
Oh-and ... this all worked fine in 11.0-RELEASE (save the ipv6 bug ... which was supposedly patched in 11.0-STABLE (nice if that stays "fixed" too)).
Comment 3 Marek Zarychta 2017-05-23 01:51:52 UTC
FYI something was broken between 11.0-STABLE r318137 and 11.1-PRERELEASE r318590.
In our setup, TCP MD5 signed packets originate from vlan interface created on lagg interface, so the issue could be VLAN or LAGG specific. 
I don't know if freshest PRERELEASE works fine. For sure r318137 recompiled from sources works correctly, although just after the bootup, some packets with bad signatures occurred:

# netstat -sp tcp | grep signature
	8794 packets with matching signature received
	18 packets with bad signature received
	0 times failed to make signature due to no SA
	1 time unexpected signature received
	1 time no signature provided by segment
Comment 4 Andrey V. Elsukov freebsd_committer freebsd_triage 2017-05-23 08:39:28 UTC
(In reply to Marek Zarychta from comment #0)
> After upgrade from 11.0-STABLE r318137 to 11.1-PRERELEASE TCP MD5 signatures
> cannot be verified, so bird session cannot be established.
> Neither ISP, nor our side changed the configuration. Bird-1.6.3_1 was
> recompiled from port, but it doesn't fix the trouble.
> # netstat -sp tcp | grep signature
> 	0 packets with matching signature received
> 	4601 packets with bad signature received
> 	42 times failed to make signature due to no SA
> 	0 times unexpected signature received
> 	30 times no signature provided by segment

There were no changes in stable/11 in TCP-MD5 code. So if it worked in r318137, it should work. Do you use bird's "password" option to set SAs or are they set via setkey(8)? There is patch for bird in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218907
I sent it to bird's developer and port maintainer, but seems it is not committed.
Comment 5 Marek Zarychta 2017-05-23 11:06:35 UTC
(In reply to Andrey V. Elsukov from comment #4)
SAs are they set via setkey(8)from ipsec.conf config file.
Comment 6 Marek Zarychta 2017-05-23 15:03:31 UTC
(In reply to Andrey V. Elsukov from comment #4)
My previous reply was mistaken, so let me explain once again. SAs are set from /etc/ipsec.conf and since the new IPSEC code was committed to STABLE I am following a branch. A couple of successful world updates has been provided. tcpmd5.ko module is loaded from /boot/loader.conf as well as some other modules: 
# kldstat 
Id Refs Address            Size     Name
 1   38 0xffffffff80200000 1f32ac8  kernel
 2    1 0xffffffff82134000 313338   zfs.ko
 3    2 0xffffffff82448000 cb38     opensolaris.ko
 4    1 0xffffffff82455000 12530    carp.ko
 5    1 0xffffffff82468000 161c8    if_lagg.ko
 6    1 0xffffffff8247f000 18d0     accf_dns.ko
 7    1 0xffffffff82481000 66f0     ichwd.ko
 8    1 0xffffffff82488000 40c0     tcpmd5.ko
 9    1 0xffffffff8248d000 2af28    ipsec.ko
10    1 0xffffffff82621000 106c5    geom_eli.ko
11    1 0xffffffff82632000 58de     fdescfs.ko
12    1 0xffffffff82638000 2839     pflog.ko
13    1 0xffffffff8263b000 34c2c    pf.ko
14    1 0xffffffff82670000 a0d      green_saver.ko
15    1 0xffffffff82671000 837a     autofs.ko

Kernel config is quite simple:

include GENERIC
ident   MAXDATA

options		EM_MULTIQUEUE
options		IPSEC_SUPPORT
nooptions	IPSEC

This is a production machine and updates will be stalled, at least till 11.1 will be released.
Comment 7 dgilbert 2017-05-23 17:22:45 UTC
What is the branch you are following?  I'm a little stuck here.  11-0-RELEASE is broken because TCP_MD5 on IPv6 panics... so I can't go back to that (misplaced patch that fixes it).  11-1-PRERELEASE is also broken.  At least it doesn't panic.

I also use TCP-MD5 over vlans.
Comment 8 Marek Zarychta 2017-05-23 19:27:50 UTC
(In reply to dgilbert from comment #7)
I am using TCP-MD5 signatures only for IPv4 with vlans created atop of lagg interface following 11.0-STABLE branch since march, just after r315514 when ae@ MFCed new IPSEC code.  The kernels had been always built with IPSEC_SUPPORT to allow load TCP-MD5 as a module.
Comment 9 Marek Zarychta 2017-05-24 11:05:37 UTC
(In reply to Andrey V. Elsukov from comment #4)
I have tested on the spare machine, that r318329 commit broke TCP-MD5 for vlan interfaces created atop of lagg. Vlan interfaces created on raw network adapters are not affected.
Comment 10 Alexander Motin freebsd_committer freebsd_triage 2017-05-25 00:12:39 UTC
Can somebody hint me how TCP MD5 supposed to coexist with TSO and LRO?  Because it is the only way I can guess how my changes may affect TCP MD% signatures, calculated in my understanding completely in software.

Marek, don't you have ifconfig outputs for working and not working cases?  Are the interface options/capabilities you see there different?
Comment 11 Marek Zarychta 2017-05-25 05:52:20 UTC
Created attachment 182877 [details]
ifconfig output
Comment 12 Marek Zarychta 2017-05-25 05:53:36 UTC
Only one of adapters (msk0) belonging to lagg0 is TSO capable, but it is brought up with "-tso" in both machines. Lagg interfaces are brought up with "-lacp_strict" option. The promiscuous mode in working is caused by other child VLAN interface running this mode. I have attached output of ifconfig -v command limited to considered interfaces only, IP addresses are masked.
Comment 13 Alexander Motin freebsd_committer freebsd_triage 2017-05-25 18:59:43 UTC
I see one problem with my change: when it tries to disable RXCSUM for em0, since msk0 does not support it, em0 also disables TXCSUM, which lagg does not notice.  But that should not affect vlan operation, since due to lack of VLAN_HWCSUM on msk0 none of those flags propagated to vlan0 either before my change or after.

I'd appreciate some comments here from people knowing how TCP MD5 is working.

Meanwhile, Marek, could you try to manually synchronize the msk0 and em0 interface capabilities (disable rxcsuma and txcsum) to see whether it change anything.
Comment 14 Marek Zarychta 2017-05-26 00:14:16 UTC
(In reply to Alexander Motin from comment #13)
I have manually synchronized (disabled) TX/RXcsum options after lagg0 creation. It has no impact on TCP-MD5 packet signing on affected VLAN interface.
Comment 15 commit-hook freebsd_committer freebsd_triage 2017-05-26 20:16:11 UTC
A commit references this bug:

Author: mav
Date: Fri May 26 20:15:33 UTC 2017
New revision: 318966
URL: https://svnweb.freebsd.org/changeset/base/318966

Log:
  Improve applying unified capabilities to the lagg ports.

  Some NICs have some capabilities dependent, so that disabling one require
  disabling some other (TXCSUM/RXCSUM on em).  This code tries to reach the
  consensus more insistently.

  PR:		219453
  MFC after:	1 week

Changes:
  head/sys/net/if_lagg.c
Comment 16 Marek Zarychta 2017-06-08 19:03:07 UTC
(In reply to commit-hook from comment #15)

Thank you for MFCing this to STABLE. Now options/capabilities of LAGG interface is a subset of options/capabilities of interfaces belonging to this LAGG. But this change looks like irrelevant to TCP-MD5 packet signing on child VLAN interface which still seems to be broken:

# netstat -sp tcp | grep signature
	0 packets with matching signature received
	239 packets with bad signature received
	0 times failed to make signature due to no SA
	0 times unexpected signature received
	0 times no signature provided by segment
Comment 17 Mark Linimon freebsd_committer freebsd_triage 2017-06-26 02:49:23 UTC
Over to committer.
Comment 18 Marek Zarychta 2017-07-04 10:10:00 UTC
The TCP-MD5 signatures received directly on IPv4 enabled LAGG interface are also incorrect. That makes the issue strictly LAGG related.

Curiously the packets originating from the affected interface are reported by the receiving (unaffected) peer as correct. So the problem is in checking TCP-MD5 signatures of incoming packets on the affected interface, not packet signing originating from this interface, if it at all could be considered separately.

Any chance to make it work again before 11.1 is released?
Comment 19 Andrey V. Elsukov freebsd_committer freebsd_triage 2017-07-04 10:36:22 UTC
(In reply to Marek Zarychta from comment #18)
> The TCP-MD5 signatures received directly on IPv4 enabled LAGG interface are
> also incorrect. That makes the issue strictly LAGG related.
> 
> Curiously the packets originating from the affected interface are reported
> by the receiving (unaffected) peer as correct. So the problem is in checking
> TCP-MD5 signatures of incoming packets on the affected interface, not packet
> signing originating from this interface, if it at all could be considered
> separately.
> 
> Any chance to make it work again before 11.1 is released?

I don't know how lagg or vlan can affect TCP-MD5 calculation.

There are several counters that could be changed during processing inbound signed packet:
1. "no signature provided by segment" is incremented when signed TCP segment received, but socket is not configured to receive them.
2. "failed to make signature due to no SA" is incremented when received signed TCP segment, but there is no SA for given addresses. 
3. "packets with bad signature received" is incremented when signed TCP segment received, SA is found, but calculated signature for given addresses and password from SA doesn't match to expected value.

There are many users who have tested the TCP-MD5 code and it works for both IPv4 and IPv6. According to your last `netstat` output I can suggest to check that password is the same on both hosts in all SAs.

Use tcpdump with -M flag on both hosts to check that sent and received packets are correct.
Comment 20 Marek Zarychta 2017-07-04 14:23:17 UTC
(In reply to Andrey V. Elsukov from comment #19)
Received packets are correct. 
One more notice here: the router with LAGG interface was always the initiator of BGP session, even before disastrous 318329 commit the router never responded to BGP session initiated by the peer, but attempted to initiate session itself.
Comment 21 Marek Zarychta 2017-07-05 09:27:55 UTC
It is not bug, it is a feature.

May be it should better documented, that TCP-MD5 signatures work with interfaces and VLAN subinterfaces created on them if the interface supports TX/RX checksums generation/checking in hardware. This trait applies not only to LAGG interfaces.

Thank all, especially ae@ for help in this weird case.