Bug 207598 - pf adds icmp unreach on gre/ipsec somehow
Summary: pf adds icmp unreach on gre/ipsec somehow
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 10.2-STABLE
Hardware: Any Any
: --- Affects Some People
Assignee: freebsd-pf (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-02-29 18:52 UTC by emz
Modified: 2016-08-17 09:25 UTC (History)
3 users (show)

See Also:


Attachments
dumps (1.07 KB, application/octet-stream)
2016-05-24 08:25 UTC, Max
no flags Details
pf_frag_pass patch (455 bytes, patch)
2016-05-28 11:17 UTC, Kristof Provost
no flags Details | Diff
pf error returns (1.47 KB, patch)
2016-06-10 14:35 UTC, Kristof Provost
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description emz 2016-02-29 18:52:43 UTC
FreeBSD:

FreeBSD moscow-alpha 10.2-STABLE FreeBSD 10.2-STABLE #0 r286954: Fri Aug 21 08:33:14 MSK 2015     emz@moscow-alpha:/usr/obj/usr/src/sys/MOSCOW  amd64

Network scheme:

(FreeBSD A) <---(gre inside ipsec)---> (FreeBSD B) <---gre inside ipsec---> (FreeBSD C)

(uname taken from B)

Issue:

PF is on
A pings B with icmp packets < gre MTU = everything is OK
A pings C with icmp packets < gre MTU = everything is OK

A pings B with icmp packets > gre MTU = everything is OK
A pings C with icmp packets > gre MTU = got two answers, a normal ICMP reply from C, and an ICMP unreach from B:

[emz@big-cherkiz5-1:~]# ping -s 4096 192.168.7.127
PING 192.168.7.127 (192.168.7.127): 4096 data bytes
36 bytes from 172.16.5.214: Destination Host Unreachable
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 055c a28a   0 0000  40  01 3908 172.16.5.215  192.168.7.127 

4104 bytes from 192.168.7.127: icmp_seq=0 ttl=61 time=62.119 ms
^C
--- 192.168.7.127 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 62.119/62.119/62.119/0.000 ms

Workaround: disable pf on B. With pf disabled on B, situation resolves back to normal.
The issue was first seen somewhere on 10-STABLE, didn't resolve so far. I've talk with tough guys, like ae@, he told me to report it, since it cannot be explained by configuration errors.
Comment 1 Kristof Provost freebsd_committer freebsd_triage 2016-03-13 05:31:14 UTC
It doesn't look like a very simple setup, so ideally I'd like you to try
to simplify it a bit. For example, is the ipsec bit required, or does it
happen with just the GRE tunnels as well?
It'd also be interesting to know if it happens both with and without
'scrub fragment reassemble'.

Do you have anything 'special' in your pf.conf? (That is, route-to,
dup-to, set state-policy, ... basically anything not pass or block.)

(PS: I seem to be unable to send e-mail to you. Your mail server keeps telling me '452 4.7.1 Try again later')
Comment 2 emz 2016-05-19 05:19:38 UTC
Sorry it took that long (I was kinda overwhelmed by the amount of work).

So, same setup: A <---gre/ipsec---> B <---gre/ipsec---> C.

1) ipsec removed between A and B. The issue persists.
2) pf disabled on B. The issue is no more.
3) ipsec added on B, pf still disabled. The issue is no more.
4) ipsec still on, pf enabled on B. The issue is back.
5) ipsec enabled, pf enabled, the following line removed from pf on B:

scrub on $oif from !<voippbxes> fragment reassemble

The issue persists.

6) Line from previous point added back, removed the line

scrub on gre0 max-mss 1360

where gre0 is the B <---> C tunnel

and the issue is gone.

But I don't understand how the MSS enforcing can affect the ICMP packets, while it should only affect TCP.
Comment 3 Max 2016-05-23 18:20:18 UTC
I have reproduced the problem.
I think we shouldn't use scrub rule without "in" option. I.e. rule should be
scrub *in* on gre0 ...
Without "in" this rule is triggered twice ("B" <--> "C"): for outgoing *fragmented* echo request and for incoming fragmented echo reply. As a result, the length of the received echo request exceeds the MTU on "C" box. I think it is not good.
PF.CONF(5): "Traffic normalization is used to sanitize packet content in such a way that there are no ambiguities in packet interpretation on the receiving side. The normalizer does IP fragment reassembly to prevent attacks that confuse intrusion detection systems by sending overlapping IP fragments."
Do we really need "max-mss 1360" on outgoing flow?
However, appearance of "Destination Host Unreachable" remains unclear to me. It is routing stuff. Need to do some research.
Comment 4 Max 2016-05-23 18:42:24 UTC
It seems that scrubbing on both tunnels eliminates the problem.
scrub on gre0 max-mss 1360
scrub on gre1 max-mss 1360
Comment 5 Kristof Provost freebsd_committer freebsd_triage 2016-05-24 06:49:39 UTC
(In reply to Max from comment #3)
Scrubbing in both directions should be safe, even with fragment reassemble.
In IPv4 it's OK for a frame to not fit in the MTU. The router will fragment.
(There's special casing in pf to handle the IPv6 scenario, but that doesn't seem to be relevant here.)

It's also very strange that the mss setting has an influence on ICMP packets. I'd only expect that to affect TCP streams.

It'd be interesting to get packet captures here (tcpdump -n -i <interface> -s0 -w output.pcap) of both the ICMP echo request and the ICMP error packets.
Ideally capture on an interface outside the GRE tunnel (so we get the GRE headers too).
Comment 6 Max 2016-05-24 08:25:00 UTC
Created attachment 170591 [details]
dumps

(In reply to Kristof Provost from comment #5)
Host "A" config:

cloned_interfaces="gre0"
ifconfig_em0="inet 192.168.10.1/24"
defaultrouter="192.168.10.254"
ifconfig_gre0="inet 10.10.1.1 10.10.2.1 tunnel 192.168.10.1 192.168.10.254"

static_routes="rb rc"
route_rb="10.10.2.0/24 10.10.2.1"
route_rc="10.10.3.0/24 10.10.2.1"

pf disabled.


Host "C" config:

cloned_interfaces="gre0"
ifconfig_em0="inet 192.168.30.1/24"
defaultrouter="192.168.30.254"
ifconfig_gre0="inet 10.10.3.1 10.10.2.1 tunnel 192.168.30.1 192.168.30.254"

static_routes="ra rb"
route_ra="10.10.1.0/24 10.10.2.1"
route_rb="10.10.2.0/24 10.10.2.1"

pf disabled.


Host "B" config:

cloned_interfaces="gre0 gre1"
ifconfig_em0="inet 192.168.10.254/24"
ifconfig_em2="inet 192.168.30.254/24"
ifconfig_gre0="inet 10.10.2.1 10.10.1.1 tunnel 192.168.10.254 192.168.10.1"
ifconfig_gre1="inet 10.10.2.1 10.10.3.1 tunnel 192.168.30.254 192.168.30.1"

static_routes="ra rc"
route_ra="10.10.1.0/24 10.10.1.1"
route_rc="10.10.3.0/24 10.10.3.1"

pf.conf:

set skip on lo

#scrub on gre0 max-mss 1360
scrub on gre1 max-mss 1360

pass all


pfctl -x misc


gre MTU is 1476. So, 1476-28=1448 bytes sholud fit MTU.

First, on host "A":
ping -s 1450 -c 1 10.10.3.1
Then, on host "C":
ping -s 1450 -c 1 10.10.1.1

Kernel log on host "B":
May 24 10:57:38 isp kernel: em0: promiscuous mode enabled
May 24 10:57:39 isp kernel: em2: promiscuous mode enabled
May 24 10:58:13 isp kernel: pf_normalize_ip: reass frag 56321 @ 0-1456
May 24 10:58:13 isp kernel: pf_fillup_fragment: reass frag 56321 @ 0-1456pf_normalize_ip: reass frag 56321 @ 1456-1458
May 24 10:58:13 isp kernel: pf_fillup_fragment: reass frag 56321 @ 1456-1458pf_isfull_fragment: 1458 < 1458?pf_reassemble: complete: 0xfffff8001f4aa300(1478)
May 24 10:58:13 isp kernel: pf_normalize_ip: reass frag 30208 @ 0-1456
May 24 10:58:13 isp kernel: pf_fillup_fragment: reass frag 30208 @ 0-1456pf_normalize_ip: reass frag 30208 @ 1456-1458
May 24 10:58:13 isp kernel: pf_fillup_fragment: reass frag 30208 @ 1456-1458pf_isfull_fragment: 1458 < 1458?pf_reassemble: complete: 0xfffff8001f662000(1478)
May 24 10:58:39 isp kernel: pf_normalize_ip: reass frag 30464 @ 0-1456
May 24 10:58:39 isp kernel: pf_fillup_fragment: reass frag 30464 @ 0-1456pf_normalize_ip: reass frag 30464 @ 1456-1458
May 24 10:58:39 isp kernel: pf_fillup_fragment: reass frag 30464 @ 1456-1458pf_isfull_fragment: 1458 < 1458?pf_reassemble: complete: 0xfffff8001f661d00(1478)
May 24 10:58:39 isp kernel: pf_normalize_ip: reass frag 57601 @ 0-1456
May 24 10:58:39 isp kernel: pf_fillup_fragment: reass frag 57601 @ 0-1456pf_normalize_ip: reass frag 57601 @ 1456-1458
May 24 10:58:39 isp kernel: pf_fillup_fragment: reass frag 57601 @ 1456-1458pf_isfull_fragment: 1458 < 1458?pf_reassemble: complete: 0xfffff8001f4aa100(1478)
May 24 10:58:56 isp kernel: em0: promiscuous mode disabled
May 24 10:58:57 isp kernel: em2: promiscuous mode disabled
Comment 7 emz 2016-05-24 13:26:12 UTC
I confirm, adding "in" on scrubbing for TCP MSS fixes the issue. Although the relation between TCP MSS fixing and the ICMP still seems bogus to me.
Comment 8 Max 2016-05-24 13:54:34 UTC
(In reply to emz from comment #7)
fragment reassemble
	   Using scrub rules, fragments	can be reassembled by normalization.
	   In this case, fragments are buffered	until they form	a complete
	   packet, and only the	completed packet is passed on to the filter.
	   ...  This is
	   the default behavior	of a scrub rule	if no fragmentation modifier
	   is supplied.

Thus the rule 
1. reassembles and fixes mss of tcp packets
2. reassembles other packets

I think "scrub in proto tcp ..." should affect tcp packets only.
Comment 9 Max 2016-05-24 18:23:10 UTC
pflog output:

21:20:52.521232 rule 0..16777216/0(match): pass in on em0: 192.168.10.1 > 192.168.10.254: GREv0, proto IPv4 (0x0800), length 1480: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 24585, seq 0, length 1456
21:20:52.521241 rule 0..16777216/0(match): pass in on gre0: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 24585, seq 0, length 1456
21:20:52.521256 rule 0..16777216/0(match): pass out on gre0: 10.10.2.1 > 10.10.1.1: ICMP host 10.10.3.1 unreachable, length 36
21:20:52.521262 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 60: 10.10.2.1 > 10.10.1.1: ICMP host 10.10.3.1 unreachable, length 36
21:20:52.521282 rule 0..16777216/0(match): pass in on em0: 192.168.10.1 > 192.168.10.254: GREv0, proto IPv4 (0x0800), length 26: 10.10.1.1 > 10.10.3.1: ip-proto-1
21:20:52.521286 rule 0..16777216/0(match): pass in on gre0: 10.10.1.1 > 10.10.3.1: ip-proto-1
21:20:52.521288 rule 0..16777216/0(match): pass out on gre1: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 24585, seq 0, length 1458
21:20:52.521316 rule 0..16777216/0(match): pass out on em2: 192.168.30.254 > 192.168.30.1: GREv0, proto IPv4 (0x0800), length 1482: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 24585, seq 0, length 1458
21:20:52.521598 rule 0..16777216/0(match): pass in on em2: 192.168.30.1 > 192.168.30.254: GREv0, proto IPv4 (0x0800), length 1480: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 24585, seq 0, length 1456
21:20:52.521614 rule 0..16777216/0(match): pass in on em2: 192.168.30.1 > 192.168.30.254: GREv0, proto IPv4 (0x0800), length 26: 10.10.3.1 > 10.10.1.1: ip-proto-1
21:20:52.521619 rule 0..16777216/0(match): pass in on gre1: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 24585, seq 0, length 1458
21:20:52.521624 rule 0..16777216/0(match): pass out on gre0: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 24585, seq 0, length 1458
21:20:52.521630 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 1480: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 24585, seq 0, length 1456
21:20:52.521646 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 26: 10.10.3.1 > 10.10.1.1: ip-proto-1
Comment 10 Max 2016-05-24 18:38:28 UTC
scrub on gre1 proto tcp max-mss 1360 (there is no "host unreachable" message).


21:28:54.220629 rule 0..16777216/0(match): pass in on em0: 192.168.10.1 > 192.168.10.254: GREv0, proto IPv4 (0x0800), length 1480: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 30473, seq 0, length 1456
21:28:54.220641 rule 0..16777216/0(match): pass in on gre0: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 30473, seq 0, length 1456
21:28:54.220650 rule 0..16777216/0(match): pass out on gre1: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 30473, seq 0, length 1456
21:28:54.220656 rule 0..16777216/0(match): pass out on em2: 192.168.30.254 > 192.168.30.1: GREv0, proto IPv4 (0x0800), length 1480: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 30473, seq 0, length 1456
21:28:54.220700 rule 0..16777216/0(match): pass in on em0: 192.168.10.1 > 192.168.10.254: GREv0, proto IPv4 (0x0800), length 26: 10.10.1.1 > 10.10.3.1: ip-proto-1
21:28:54.220704 rule 0..16777216/0(match): pass in on gre0: 10.10.1.1 > 10.10.3.1: ip-proto-1
21:28:54.220710 rule 0..16777216/0(match): pass out on gre1: 10.10.1.1 > 10.10.3.1: ip-proto-1
21:28:54.220716 rule 0..16777216/0(match): pass out on em2: 192.168.30.254 > 192.168.30.1: GREv0, proto IPv4 (0x0800), length 26: 10.10.1.1 > 10.10.3.1: ip-proto-1
21:28:54.220824 rule 0..16777216/0(match): pass in on em2: 192.168.30.1 > 192.168.30.254: GREv0, proto IPv4 (0x0800), length 1480: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 30473, seq 0, length 1456
21:28:54.220829 rule 0..16777216/0(match): pass in on gre1: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 30473, seq 0, length 1456
21:28:54.220835 rule 0..16777216/0(match): pass out on gre0: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 30473, seq 0, length 1456
21:28:54.220840 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 1480: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 30473, seq 0, length 1456
21:28:54.220880 rule 0..16777216/0(match): pass in on em2: 192.168.30.1 > 192.168.30.254: GREv0, proto IPv4 (0x0800), length 26: 10.10.3.1 > 10.10.1.1: ip-proto-1
21:28:54.220886 rule 0..16777216/0(match): pass in on gre1: 10.10.3.1 > 10.10.1.1: ip-proto-1
21:28:54.220892 rule 0..16777216/0(match): pass out on gre0: 10.10.3.1 > 10.10.1.1: ip-proto-1
21:28:54.220899 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 26: 10.10.3.1 > 10.10.1.1: ip-proto-1
Comment 11 Max 2016-05-24 19:14:10 UTC
Could it be gre issue (when packets dropped by kernel)?
Comment 12 Max 2016-05-24 19:29:42 UTC
And the last one... I'm sorry for these listings...

scrub on gre0 max-mss 1360
scrub on gre1 max-mss 1360
(there is no "host unreachable")

22:22:56.728057 rule 0..16777216/0(match): pass in on em0: 192.168.10.1 > 192.168.10.254: GREv0, proto IPv4 (0x0800), length 1480: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 60681, seq 0, length 1456
22:22:56.728069 rule 0..16777216/0(match): pass in on em0: 192.168.10.1 > 192.168.10.254: GREv0, proto IPv4 (0x0800), length 26: 10.10.1.1 > 10.10.3.1: ip-proto-1
22:22:56.728078 rule 0..16777216/0(match): pass in on gre0: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 60681, seq 0, length 1458
22:22:56.728089 rule 0..16777216/0(match): pass out on gre1: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 60681, seq 0, length 1458
22:22:56.728097 rule 0..16777216/0(match): pass out on em2: 192.168.30.254 > 192.168.30.1: GREv0, proto IPv4 (0x0800), length 1480: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 60681, seq 0, length 1456
22:22:56.728140 rule 0..16777216/0(match): pass out on em2: 192.168.30.254 > 192.168.30.1: GREv0, proto IPv4 (0x0800), length 26: 10.10.1.1 > 10.10.3.1: ip-proto-1
22:22:56.728266 rule 0..16777216/0(match): pass in on em2: 192.168.30.1 > 192.168.30.254: GREv0, proto IPv4 (0x0800), length 1480: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 60681, seq 0, length 1456
22:22:56.728273 rule 0..16777216/0(match): pass in on em2: 192.168.30.1 > 192.168.30.254: GREv0, proto IPv4 (0x0800), length 26: 10.10.3.1 > 10.10.1.1: ip-proto-1
22:22:56.728278 rule 0..16777216/0(match): pass in on gre1: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 60681, seq 0, length 1458
22:22:56.728283 rule 0..16777216/0(match): pass out on gre0: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 60681, seq 0, length 1458
22:22:56.728288 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 1480: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 60681, seq 0, length 1456
22:22:56.728325 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 26: 10.10.3.1 > 10.10.1.1: ip-proto-1
Comment 13 Kristof Provost freebsd_committer freebsd_triage 2016-05-25 11:30:43 UTC
I'm wondering if the 'max-mss' thing isn't a red herring.
Can you try 'scrub on gre0', 'scrub on gre1' (so without the max-mss)?
Comment 14 Max 2016-05-25 11:40:16 UTC
scrub on gre1

14:35:43.641169 rule 0..16777216/0(match): pass in on em0: 192.168.10.1 > 192.168.10.254: GREv0, proto IPv4 (0x0800), length 1480: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 44806, seq 0, length 1456
14:35:43.641178 rule 0..16777216/0(match): pass in on gre0: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 44806, seq 0, length 1456
14:35:43.641194 rule 0..16777216/0(match): pass out on gre0: 10.10.2.1 > 10.10.1.1: ICMP host 10.10.3.1 unreachable, length 36
14:35:43.641200 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 60: 10.10.2.1 > 10.10.1.1: ICMP host 10.10.3.1 unreachable, length 36
14:35:43.641218 rule 0..16777216/0(match): pass in on em0: 192.168.10.1 > 192.168.10.254: GREv0, proto IPv4 (0x0800), length 26: 10.10.1.1 > 10.10.3.1: ip-proto-1
14:35:43.641223 rule 0..16777216/0(match): pass in on gre0: 10.10.1.1 > 10.10.3.1: ip-proto-1
14:35:43.641230 rule 0..16777216/0(match): pass out on gre1: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 44806, seq 0, length 1458
14:35:43.641237 rule 0..16777216/0(match): pass out on em2: 192.168.30.254 > 192.168.30.1: GREv0, proto IPv4 (0x0800), length 1482: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 44806, seq 0, length 1458
14:35:43.641421 rule 0..16777216/0(match): pass in on em2: 192.168.30.1 > 192.168.30.254: GREv0, proto IPv4 (0x0800), length 1480: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 44806, seq 0, length 1456
14:35:43.641428 rule 0..16777216/0(match): pass in on em2: 192.168.30.1 > 192.168.30.254: GREv0, proto IPv4 (0x0800), length 26: 10.10.3.1 > 10.10.1.1: ip-proto-1
14:35:43.641434 rule 0..16777216/0(match): pass in on gre1: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 44806, seq 0, length 1458
14:35:43.641439 rule 0..16777216/0(match): pass out on gre0: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 44806, seq 0, length 1458
14:35:43.641479 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 1480: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 44806, seq 0, length 1456
14:35:43.641497 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 26: 10.10.3.1 > 10.10.1.1: ip-proto-1
Comment 15 Max 2016-05-25 11:41:00 UTC
scrub on gre0
scrub on gre1

14:39:39.127649 rule 0..16777216/0(match): pass in on em0: 192.168.10.1 > 192.168.10.254: GREv0, proto IPv4 (0x0800), length 1480: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 47622, seq 0, length 1456
14:39:39.127666 rule 0..16777216/0(match): pass in on em0: 192.168.10.1 > 192.168.10.254: GREv0, proto IPv4 (0x0800), length 26: 10.10.1.1 > 10.10.3.1: ip-proto-1
14:39:39.127672 rule 0..16777216/0(match): pass in on gre0: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 47622, seq 0, length 1458
14:39:39.127681 rule 0..16777216/0(match): pass out on gre1: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 47622, seq 0, length 1458
14:39:39.127689 rule 0..16777216/0(match): pass out on em2: 192.168.30.254 > 192.168.30.1: GREv0, proto IPv4 (0x0800), length 1480: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 47622, seq 0, length 1456
14:39:39.127734 rule 0..16777216/0(match): pass out on em2: 192.168.30.254 > 192.168.30.1: GREv0, proto IPv4 (0x0800), length 26: 10.10.1.1 > 10.10.3.1: ip-proto-1
14:39:39.127857 rule 0..16777216/0(match): pass in on em2: 192.168.30.1 > 192.168.30.254: GREv0, proto IPv4 (0x0800), length 1480: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 47622, seq 0, length 1456
14:39:39.127865 rule 0..16777216/0(match): pass in on em2: 192.168.30.1 > 192.168.30.254: GREv0, proto IPv4 (0x0800), length 26: 10.10.3.1 > 10.10.1.1: ip-proto-1
14:39:39.127869 rule 0..16777216/0(match): pass in on gre1: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 47622, seq 0, length 1458
14:39:39.127875 rule 0..16777216/0(match): pass out on gre0: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 47622, seq 0, length 1458
14:39:39.127880 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 1480: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 47622, seq 0, length 1456
14:39:39.127917 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 26: 10.10.3.1 > 10.10.1.1: ip-proto-1
Comment 16 Kristof Provost freebsd_committer freebsd_triage 2016-05-25 11:44:45 UTC
So if I understand this correctly the problem is still there with only 'scrub on gre1' (so without the MSS clamping), but it's not there if scrubbing is done on both interfaces?
Comment 17 Max 2016-05-25 11:51:33 UTC
(In reply to Kristof Provost from comment #16)
Absolutely.

ICMP-unreach generated when the first fragment of echo request is dropped by pf, I think.
Comment 18 Max 2016-05-25 12:03:24 UTC
no scrub on gre1 proto icmp
scrub on gre1
There is no "host unreachable".

14:58:34.741461 rule 0..16777216/0(match): pass in on em0: 192.168.10.1 > 192.168.10.254: GREv0, proto IPv4 (0x0800), length 1480: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 59142, seq 0, length 1456
14:58:34.741471 rule 0..16777216/0(match): pass in on gre0: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 59142, seq 0, length 1456
14:58:34.741479 rule 0..16777216/0(match): pass out on gre1: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 59142, seq 0, length 1456
14:58:34.741486 rule 0..16777216/0(match): pass out on em2: 192.168.30.254 > 192.168.30.1: GREv0, proto IPv4 (0x0800), length 1480: 10.10.1.1 > 10.10.3.1: ICMP echo request, id 59142, seq 0, length 1456
14:58:34.741542 rule 0..16777216/0(match): pass in on em0: 192.168.10.1 > 192.168.10.254: GREv0, proto IPv4 (0x0800), length 26: 10.10.1.1 > 10.10.3.1: ip-proto-1
14:58:34.741571 rule 0..16777216/0(match): pass in on gre0: 10.10.1.1 > 10.10.3.1: ip-proto-1
14:58:34.741576 rule 0..16777216/0(match): pass out on gre1: 10.10.1.1 > 10.10.3.1: ip-proto-1
14:58:34.741580 rule 0..16777216/0(match): pass out on em2: 192.168.30.254 > 192.168.30.1: GREv0, proto IPv4 (0x0800), length 26: 10.10.1.1 > 10.10.3.1: ip-proto-1
14:58:34.741648 rule 0..16777216/0(match): pass in on em2: 192.168.30.1 > 192.168.30.254: GREv0, proto IPv4 (0x0800), length 1480: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 59142, seq 0, length 1456
14:58:34.741654 rule 0..16777216/0(match): pass in on gre1: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 59142, seq 0, length 1456
14:58:34.741659 rule 0..16777216/0(match): pass out on gre0: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 59142, seq 0, length 1456
14:58:34.741665 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 1480: 10.10.3.1 > 10.10.1.1: ICMP echo reply, id 59142, seq 0, length 1456
14:58:34.741682 rule 0..16777216/0(match): pass in on em2: 192.168.30.1 > 192.168.30.254: GREv0, proto IPv4 (0x0800), length 26: 10.10.3.1 > 10.10.1.1: ip-proto-1
14:58:34.741686 rule 0..16777216/0(match): pass in on gre1: 10.10.3.1 > 10.10.1.1: ip-proto-1
14:58:34.741691 rule 0..16777216/0(match): pass out on gre0: 10.10.3.1 > 10.10.1.1: ip-proto-1
14:58:34.741696 rule 0..16777216/0(match): pass out on em0: 192.168.10.254 > 192.168.10.1: GREv0, proto IPv4 (0x0800), length 26: 10.10.3.1 > 10.10.1.1: ip-proto-1
Comment 19 Max 2016-05-25 18:54:47 UTC
I've never read FreeBSD sources, except pf's last week... probably I'm wrong.
ip_input()->ip_forward()->ip_output()->ip_output_pfil()->pfil_run_hooks()->pf_test().
If ip_output() returns any error, then in ip_forward():
	error = ip_output(...);
	...
	switch (error) {
	case 0:				/* forwarded, but need redirect */
		/* type, code set above */
		break;
	...
	default:
		type = ICMP_UNREACH;
		code = ICMP_UNREACH_HOST;
		break;
	...
	icmp_error(...);
So, we have incoming fragment of echo request. There are two options:
1. pf returns PF_PASS -> ip_output() returns 0 -> everything is OK
2. pf returns PF_DROP -> ip_output() returns nonzero value -> we have icmp-unreach message.
pf returns PF_DROP when we have (implicit) "scrub out on...".

Please, correct me if I missing something.
Comment 20 Kristof Provost freebsd_committer freebsd_triage 2016-05-28 11:17:13 UTC
Created attachment 170747 [details]
pf_frag_pass patch

(In reply to Max from comment #19)
You may be on to something there.

pf_reassemble() actually returns PF_PASS, but it's turned back into PF_DROP later on.

It actually looks like this'd be a problem for IPv6 too.

Can you give the attached patch a try? I'm not completely happy with it, but it should fix the problem.
Comment 21 Max 2016-05-28 13:01:45 UTC
(In reply to Kristof Provost from comment #20)
Hello, Kristof.

This patch works.
But there is another problem:

pass log (all) all
block drop out log (all) on gre1 proto icmp

In this case icmp-unreach still presents.
Comment 22 Kristof Provost freebsd_committer freebsd_triage 2016-05-28 13:20:23 UTC
(In reply to Max from comment #21)
Yeah, I guess that makes sense. After all, the rules tell PF to drop the ICMP packet, which it does. It tells the network stack that the packet was dropped, so it generates an 'ICMP destination unreachable' error.

In this case that's correct, because the destination really is unreachable.
Arguably that error should be under the control of the firewall, but I'm not sure this is really wrong.
Comment 23 Max 2016-05-28 13:36:13 UTC
(In reply to Kristof Provost from comment #22)
I agree. But should we send any ICMP if we have "block drop ..." rule, not "block return ..."?
Comment 24 Kristof Provost freebsd_committer freebsd_triage 2016-05-28 13:59:34 UTC
(In reply to Max from comment #23)
Yeah, that's certainly a valid point.

Arguably the network stack shouldn't send errors if the firewall drops a packet, instead leaving it to the firewall to send an error.
Or perhaps we should extend the netpfil interface to support both scenarios.

Either way, this change will affect more than just pf, so it'd have to be done very carefully.
Comment 25 Max 2016-05-28 14:32:23 UTC
(In reply to Kristof Provost from comment #24)
That's why I've reviewed ip_input.c and ip_output.c. It's not just a routing issue...
Can you discuss this problem with responsible developers?
Comment 26 Max 2016-05-28 21:40:22 UTC
Does it look reasonable? We should use consistent return values in pf_reassemble(), I think.

--- pf_norm.c.orig      2016-05-28 23:40:52.171196000 +0300
+++ pf_norm.c   2016-05-28 23:50:39.912093000 +0300
@@ -623,7 +623,7 @@ pf_reassemble(struct mbuf **m0, struct i
        m = *m0 = NULL;

        if (!pf_isfull_fragment(frag))
-               return (PF_PASS);  /* drop because *m0 is NULL, no error */
+               return (PF_DROP);

        /* We have all the data */
        frent = TAILQ_FIRST(&frag->fr_queue);
@@ -1284,8 +1284,6 @@ pf_normalize_ip(struct mbuf **m0, int di
                        return (PF_DROP);

                m = *m0;
-               if (m == NULL)
-                       return (PF_DROP);

                /* use mtag from concatenated mbuf chain */
                pd->pf_mtag = pf_find_mtag(m);

IPv6 versions should be fixed too.
Comment 27 Kristof Provost freebsd_committer freebsd_triage 2016-05-29 08:35:39 UTC
(In reply to Max from comment #26)
I think what we need to do is very carefully go through all the return paths in pf.

There's basically three scenarios:
 * Accept packet (modified or not)
   => return PF_PASS *m0 is the mbuf
 * Reject the packet (i.e tell the stack it couldn't be sent)
   => return PF_DROP
 * Drop the packet (i.e. it vanishes, do not tell the stack it couldn't be sent)
   => return PF_PASS, *m0 is NULL.

The pf_isfull_fragment() check needs to return PF_PASS (with *m0 == NULL), because the packet was just buffered until we have the full (reassembled) packet. It's not been rejected.

The if (m == NULL) check is really required, because we do set *m0 to NULL during reassembly.
Comment 28 Max 2016-05-29 09:14:13 UTC
(In reply to Kristof Provost from comment #27)
Hello, Kristof.
Thank you for your reply. I understand the logic of current implementation of pf_reassemble(). But it does not return a value directly to network stack. I think it could return PF_PASS only in single case: the packet is fully reassembled. Instead, pf_normalize_ip() does it: immediately returns PF_DROP if pf_reassemble() == PF_PASS && *m0 == NULL. I think, it is confusing a bit...
In any way, this is just a suggestion. (:
Comment 29 Kristof Provost freebsd_committer freebsd_triage 2016-06-10 14:35:24 UTC
Created attachment 171268 [details]
pf error returns

Hmm. I might be making this harder than it needs to be.
If the netpfil hook returns EACCESS ip_forward() won't actually generate an ICMP error message.

The problem is that PF returns PF_PASS, PF_DROP, ... instead of the error codes the stack expects.

Can you test this patch?

It's interesting that this doesn't seem to be as big a problem on CURRENT, because the fast forwarding code (ip_tryforward()) doesn't generate ICMP errors for netpfil() errors.
Comment 30 Max 2016-06-10 16:51:28 UTC
(In reply to Kristof Provost from comment #29)
It seems the patch works, except one thing: "block return ..." does not generate ICMP message.
Comment 31 Max 2016-06-10 17:42:04 UTC
(In reply to Max from comment #30)
But it looks like this is not a patch problem.
Comment 32 Kristof Provost freebsd_committer freebsd_triage 2016-06-10 17:49:51 UTC
(In reply to Max from comment #31)
Okay, let's track that in a new bug. Can you create one?
Comment 33 Kristof Provost freebsd_committer freebsd_triage 2016-06-10 18:15:14 UTC
(In reply to Max from comment #30)
Was this with TCP/UDP or ICMP?

Note that pf doesn't generate ICMP error messages for anything other than UDP (and TCP RST for TCP):

           return    A TCP RST is returned for blocked TCP packets, an ICMP
                     UNREACHABLE is returned for blocked UDP packets, and all
                     other packets are silently dropped.
Comment 34 Max 2016-06-10 18:57:06 UTC
(In reply to Kristof Provost from comment #33)
Yeah, that's my fault... It is ICMP. But man pf.conf says
return
		 This causes a TCP RST to be returned for tcp(4) packets and
		 an ICMP UNREACHABLE for UDP and other packets.


(In reply to Kristof Provost from comment #32)
I'm trying to understand what's happening... Without the patch:

ruleset 1:
scrub on gre1
pass log (all) all
block return out log (all) on gre1 proto icmp

ICMP-unreach exists.

ruleset 2:
scrub on gre1
pass log (all) all
block return in log (all) on gre0 proto icmp

ICMP-unreach doesn't exist. Should it?

ruleset 3:
scrub on gre0
scrub on gre1
pass log (all) all
block return in log (all) on gre0 proto icmp

ICMP-unreach doesn't exist.

I've rebuilt the kernel... again... the patched version.
There is no ICMP-unreach at all. So, the first case is relevant to patch, I think...
Comment 35 Max 2016-06-10 19:26:40 UTC
(In reply to Kristof Provost from comment #33)
Now I see that...
		} else if (pd->proto != IPPROTO_ICMP && af == AF_INET &&
		    r->return_icmp)
			pf_send_icmp(m, r->return_icmp >> 8,
			    r->return_icmp & 255, af, r);
		else if (pd->proto != IPPROTO_ICMPV6 && af == AF_INET6 &&
		    r->return_icmp6)
			pf_send_icmp(m, r->return_icmp6 >> 8,
			    r->return_icmp6 & 255, af, r);

block return ... proto tcp
works as expected.
Comment 36 Kristof Provost freebsd_committer freebsd_triage 2016-06-10 19:54:43 UTC
(In reply to Max from comment #35)
Yeah, we may want to clarify the man page somewhat, and explicitly state that there are no ICMP error messages for ICMP packets.
Comment 37 commit-hook freebsd_committer freebsd_triage 2016-07-09 12:17:43 UTC
A commit references this bug:

Author: kp
Date: Sat Jul  9 12:17:01 UTC 2016
New revision: 302497
URL: https://svnweb.freebsd.org/changeset/base/302497

Log:
  pf: Map hook returns onto the correct error values

  pf returns PF_PASS, PF_DROP, ... in the netpfil hooks, but the hook callers
  expect to get E<foo> error codes.
  Map the returns values. A pass is 0 (everything is OK), anything else means
  pf ate the packet, so return EACCES, which tells the stack not to emit an ICMP
  error message.

  PR:	207598

Changes:
  head/sys/netpfil/pf/pf_ioctl.c
Comment 38 commit-hook freebsd_committer freebsd_triage 2016-08-17 09:24:32 UTC
A commit references this bug:

Author: kp
Date: Wed Aug 17 09:23:41 UTC 2016
New revision: 304282
URL: https://svnweb.freebsd.org/changeset/base/304282

Log:
  MFC r302497:

  pf: Map hook returns onto the correct error values

  pf returns PF_PASS, PF_DROP, ... in the netpfil hooks, but the hook callers
  expect to get E<foo> error codes.
  Map the returns values. A pass is 0 (everything is OK), anything else means
  pf ate the packet, so return EACCES, which tells the stack not to emit an ICMP
  error message.

  PR:     207598

Changes:
_U  stable/11/
  stable/11/sys/netpfil/pf/pf_ioctl.c
Comment 39 commit-hook freebsd_committer freebsd_triage 2016-08-17 09:25:37 UTC
A commit references this bug:

Author: kp
Date: Wed Aug 17 09:24:46 UTC 2016
New revision: 304283
URL: https://svnweb.freebsd.org/changeset/base/304283

Log:
  MFC r302497:

  pf: Map hook returns onto the correct error values

  pf returns PF_PASS, PF_DROP, ... in the netpfil hooks, but the hook callers
  expect to get E<foo> error codes.
  Map the returns values. A pass is 0 (everything is OK), anything else means
  pf ate the packet, so return EACCES, which tells the stack not to emit an ICMP
  error message.

  PR:     207598

Changes:
_U  stable/10/
  stable/10/sys/netpfil/pf/pf_ioctl.c