Bug 254675 - ICMP Unreach needfrag is broken in 13.0-RC4
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: Any Any
Importance: --- Affects Many People
Assignee: freebsd-net (Nobody)
Reported: 2021-03-31 11:17 UTC by Aleksandr Fedorov
Modified: 2021-03-31 21:03 UTC
Description Aleksandr Fedorov freebsd_committer 2021-03-31 11:17:06 UTC
Hello.

I have the following setup with two VMs:

<public net> --- [ FreeBSD 13.0 RC4 GW_VM + NAT ] --- <private net> --- [Linux VM]

GW_VM:

Interfaces:
vtnet1 <public ip>
vtnet2 192.168.1.1/24

net.inet.ip.forwarding=1

NAT pf.conf:
nat on vtnet1 from 192.168.1.0/24 to any -> vtnet1

Linux VM:
enp0s2 192.168.1

When I run iperf3 from the Linux VM to a public host:
[  4] local 192.168.1.4 port 49412 connected to <PUBLIC_HOST> port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.01   sec   263 KBytes  2.14 Mbits/sec   45   5.66 KBytes       
[  4]   1.01-2.00   sec   156 KBytes  1.28 Mbits/sec   32   5.66 KBytes       
[  4]   2.00-3.00   sec   156 KBytes  1.27 Mbits/sec   26   5.66 KBytes       

The low upload speed is expected because the virtio-net offloads are enabled.
What I did not expect was the absence of the ICMP needfrag packet.

I set up 12.2-RELEASE with the same configuration, and there the needfrag packets are sent:

root@edge-12:~ # tcpdump -i vtnet2 proto ICMP
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vtnet2, link-type EN10MB (Ethernet), capture size 262144 bytes
14:07:09.803538 IP 192.168.1.1 > 192.168.1.4: ICMP 10.78.28.17 unreachable - need to frag (mtu 1500), length 176
14:07:09.803581 IP 192.168.1.1 > 192.168.1.4: ICMP 10.78.28.17 unreachable - need to frag (mtu 1500), length 176
14:07:09.803605 IP 192.168.1.1 > 192.168.1.4: ICMP 10.78.28.17 unreachable - need to frag (mtu 1500), length 176
14:07:09.806829 IP 192.168.1.1 > 192.168.1.4: ICMP 10.78.28.17 unreachable - need to frag (mtu 1500), length 176
14:07:09.806856 IP 192.168.1.1 > 192.168.1.4: ICMP 10.78.28.17 unreachable - need to frag (mtu 1500), length 176
14:07:09.810143 IP 192.168.1.1 > 192.168.1.4: ICMP 10.78.28.17 unreachable - need to frag (mtu 1500), length 176
14:07:09.810172 IP 192.168.1.1 > 192.168.1.4: ICMP 10.78.28.17 unreachable - need to frag (mtu 1500), length 176


Using the following DTrace script: dtrace -n 'fbt:kernel:icmp_error:entry { stack(); printf("type: %d code: %d", arg1, arg2);}'

12.2-RELEASE works as expected: ip_forward() calls ip_output(), which returns EMSGSIZE, and an ICMP unreach needfrag is generated.

  0  53981                 icmp_error:entry 
              kernel`ip_forward+0x5c4
              kernel`ip_input+0x7a7
              kernel`netisr_dispatch_src+0xca
              kernel`ether_demux+0x138
              kernel`ether_nh_input+0x33b
              kernel`netisr_dispatch_src+0xca
              kernel`ether_input+0x4b
              kernel`vtnet_rxq_eof+0x7a5
              kernel`vtnet_rx_vq_process+0xb7
              kernel`ithread_loop+0x23c
              kernel`fork_exit+0x7e
              kernel`0xffffffff81067f6e
type: 3 code: 4
  0  53981                 icmp_error:entry 
              kernel`ip_forward+0x5c4
              kernel`ip_input+0x7a7
              kernel`netisr_dispatch_src+0xca
              kernel`ether_demux+0x138
              kernel`ether_nh_input+0x33b
              kernel`netisr_dispatch_src+0xca
              kernel`ether_input+0x4b
              kernel`vtnet_rxq_eof+0x7a5
              kernel`vtnet_rx_vq_process+0xb7
              kernel`ithread_loop+0x23c
              kernel`fork_exit+0x7e
              kernel`0xffffffff81067f6e
type: 3 code: 4
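
The 12.2 path described above can be sketched in userland C roughly as follows. This is a simplified illustration, not the kernel code: `sketch_output()` and `sketch_icmp_for_errno()` are hypothetical names, and the sketch assumes the output step rejects a packet that is larger than the outgoing MTU and has DF set. The numeric values correspond to the `type: 3 code: 4` printed by the DTrace script.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* ICMP values from RFC 792 (the SK_ names are local to this sketch). */
#define SK_ICMP_UNREACH          3  /* type: destination unreachable */
#define SK_ICMP_UNREACH_HOST     1  /* code: host unreachable */
#define SK_ICMP_UNREACH_NEEDFRAG 4  /* code: fragmentation needed and DF set */

struct sk_icmp {
	int type;
	int code;
};

/*
 * Hypothetical stand-in for the output step: returns 0 on success, or
 * EMSGSIZE when the packet exceeds the MTU and must not be fragmented.
 */
static int
sketch_output(size_t pkt_len, size_t mtu, int df_set)
{
	if (pkt_len > mtu && df_set)
		return (EMSGSIZE);
	return (0);
}

/*
 * Hypothetical mapping from the output error to an ICMP error.
 * Returns 1 and fills *out when an error should be sent, 0 otherwise.
 */
static int
sketch_icmp_for_errno(int error, struct sk_icmp *out)
{
	assert(out != NULL);
	switch (error) {
	case EMSGSIZE:
		out->type = SK_ICMP_UNREACH;
		out->code = SK_ICMP_UNREACH_NEEDFRAG;
		return (1);
	case EHOSTUNREACH:
		out->type = SK_ICMP_UNREACH;
		out->code = SK_ICMP_UNREACH_HOST;
		return (1);
	default:
		return (0);
	}
}
```

Under these assumptions, a 1600-byte packet with DF set on a 1500-byte MTU link yields EMSGSIZE, which maps to type 3 code 4.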

13-RC4:
  0  54326                 icmp_error:entry
              kernel`ip_tryforward+0x730
              kernel`ip_input+0x356
              kernel`netisr_dispatch_src+0xca
              kernel`ether_demux+0x148
              kernel`ether_nh_input+0x34c
              kernel`netisr_dispatch_src+0xca
              kernel`ether_input+0x69
              kernel`vtnet_rxq_eof+0x7d4
              kernel`vtnet_rx_vq_process+0xb7
              kernel`ithread_loop+0x24d
              kernel`fork_exit+0x7e
              kernel`0xffffffff810625ae
type: 3 code: 4
  1  54326                 icmp_error:entry
              kernel`ip_forward+0x9c
              kernel`ip_input+0x6cc
              kernel`swi_net+0x12b
              kernel`ithread_loop+0x24d
              kernel`fork_exit+0x7e
              kernel`0xffffffff810625ae
type: 3 code: 1

So, as I understand it, ip_tryforward() tries to generate an ICMP needfrag, but after that an ICMP_UNREACH_HOST is generated.
Comment 1 Aleksandr Fedorov freebsd_committer 2021-03-31 13:05:23 UTC
This is very funny:

root@GW_13RC4:~ # tcpdump -i lo0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo0, link-type NULL (BSD loopback), capture size 262144 bytes
15:32:30.655851 IP localhost > <GW_13RC4 public IP>: ICMP <remote public host> unreachable - need to frag (mtu 1500), length 576
15:32:30.693492 IP localhost > <GW_13RC4 public IP>: ICMP <remote public host> unreachable - need to frag (mtu 1500), length 576
15:32:30.713231 IP localhost > <GW_13RC4 public IP>: ICMP <remote public host> unreachable - need to frag (mtu 1500), length 576

So, ICMP packets were sent, but from localhost to localhost.

It seems that 12.2-RELEASE checks the packet size before NAT, but 13.0-RC4 checks it after NAT.
Comment 2 Marek Zarychta 2021-03-31 20:04:41 UTC
It looks like PF's behaviour has changed with regard to loopback interfaces. Could this observation[1] be relevant to the breakage reported in this PR?

[1] https://lists.freebsd.org/pipermail/freebsd-pf/2021-February/009390.html
Comment 3 Alexander V. Chernikov freebsd_committer 2021-03-31 20:35:44 UTC
For context, we have switched fastforwarding on by default: https://cgit.freebsd.org/src/commit/?id=8ad114c082a159c0dde95aa35d2e3e108aa30a75

In 12.2 the codepath was ip_input() -> ip_forward() -> ip_output(), where ip_forward() created an mbuf copy for the purpose of generating various ICMP messages.

The fastforward code currently doesn't do this for performance reasons, except for the redirect use case.

As a result, we use the (possibly altered) packet to generate the redirect.
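
To illustrate why using the altered packet matters, here is a minimal userland sketch, not the kernel code. It assumes that source NAT rewrites the packet's source address in place and that the ICMP error is addressed to whatever source address the packet carries when the error is built. All names (`sk_pkt`, `sk_nat_out()`, `sk_icmp_dst()`, the two `sk_error_dst_*()` helpers) are hypothetical, and any addresses passed in are placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a host-order IPv4 address from four octets (sketch helper). */
#define SK_IP(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Minimal stand-in for a forwarded packet: only the addresses matter here. */
struct sk_pkt {
	uint32_t src;
	uint32_t dst;
};

/* Assumed NAT behaviour: rewrite the source address in place. */
static void
sk_nat_out(struct sk_pkt *p, uint32_t public_ip)
{
	assert(p != NULL);
	p->src = public_ip;
}

/* An ICMP error is sent back to the source of the offending packet. */
static uint32_t
sk_icmp_dst(const struct sk_pkt *p)
{
	assert(p != NULL);
	return (p->src);
}

/* Copy taken before output/NAT: the error targets the original sender. */
static uint32_t
sk_error_dst_with_copy(struct sk_pkt p, uint32_t public_ip)
{
	struct sk_pkt saved = p;	/* snapshot of the unmodified packet */

	sk_nat_out(&p, public_ip);
	return (sk_icmp_dst(&saved));
}

/* No copy: the error is built from the already translated packet. */
static uint32_t
sk_error_dst_without_copy(struct sk_pkt p, uint32_t public_ip)
{
	sk_nat_out(&p, public_ip);
	return (sk_icmp_dst(&p));
}
```

For a packet from 192.168.1.4, the first helper returns 192.168.1.4, while the second returns the gateway's own public address. Under the stated assumptions, that would be consistent with the lo0 capture in comment 1, where the needfrag messages are addressed to the gateway's public IP.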