Bug 236999 - vmx driver stops sending network packets and resets connections (TCP) but allows ICMP
Summary: vmx driver stops sending network packets and resets connections (TCP) but allows ICMP
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.0-RELEASE
Hardware: amd64 Any
Importance: --- Affects Many People
Assignee: Vincenzo Maffione
URL:
Keywords: regression
Duplicates: 242070 243869
Depends on:
Blocks:
 
Reported: 2019-04-04 01:52 UTC by Shirkdog
Modified: 2021-03-07 15:06 UTC
CC: 20 users

See Also:
Flags: koobs: mfc-stable12+


Description Shirkdog 2019-04-04 01:52:28 UTC
Tested on 13-CURRENT, with a FreeBSD VM running under ESXi (I am not sure of the versions; I actually do not have console access at this point).

I did an upgrade from 12-ALPHA4 (or thereabouts) to 13-CURRENT, and after it completed, users were unable to download files over 1MB from this server. I verified that I could SSH into the system and navigate the simple website that was serving files, but whenever I tried to download a 40MB file, the transfer would stop and the connection would reset.

I verified the same thing was happening with SFTP: navigating to the file worked, but starting the transfer caused a connection reset.

I enabled ICMP so I could ping this system continuously, noticed nothing was dropping out, and verified that nothing else was going on (that I am aware of) with any kind of security device, packet shaping, etc.

But with ICMP working fine while TCP connections were being reset, I went to look at the NIC offloading features, and all of them were turned on.

I ran the following to disable all of them on the vmx driver, and all of the problems went away; file transfers started to work:

ifconfig vmx0 -rxcsum -txcsum -tso4 -tso6 -lro -rxcsum6 -txcsum6 -vlanhwcsum -vlanhwtso

I tried flipping each one off and on to see whether it was one particular option, and it appears TSO4/TSO6 is where the problem resides.



vmx0@pci0:3:0:0:        class=0x020000 card=0x07b015ad chip=0x07b015ad rev=0x01 hdr=0x00
    vendor     = 'VMware'
    device     = 'VMXNET3 Ethernet Controller'
    class      = network
    subclass   = ethernet

[1] Hypervisor: Origin = "VMwareVMware"
[1] vmx0: <VMware VMXNET3 Ethernet Adapter> port 0x4000-0x400f mem 0xfd5fc000-0xfd5fcfff,0xfd5fd000-0xfd5fdfff,0xfd5fe000-0xfd5fffff irq 18 at device 0.0 on pci3
[1] vmx0: Using 512 tx descriptors and 256 rx descriptors
[1] vmx0: Using 2 rx queues 2 tx queues
[1] vmx0: failed to allocate 3 MSI-X vectors, err: 6 - using MSI
[1] vmx0: Using an MSI interrupt
[1] vmx0: Ethernet address: 00:50:56:8f:25:15
[1] vmx0: netmap queues/slots: TX 1/512, RX 1/512
Comment 1 Patrick Kelsey freebsd_committer 2019-04-04 17:40:58 UTC
I tried to reproduce using the latest snapshot on ESXi 6.7, but so far I have been unable to.  I was able to repeatedly scp a 100 MiB file from the machine under test without issue (~line rate at 1G each time).

You said all NIC offloads were enabled.  lro isn't enabled by default, which suggests there may be other NIC-related settings that are not at defaults.  If so, please share them.

Details from the system I used:

Client version:      1.25.0
Client build number: 7872652
ESXi version:        6.7.0
ESXi build number:   8169922

FreeBSD vmx-debug 13.0-CURRENT FreeBSD 13.0-CURRENT r345863 GENERIC  amd64

vmx0: link state changed to UP
vmx0: <VMware VMXNET3 Ethernet Adapter> port 0x5000-0x500f mem 0xfd3fc000-0xfd3fcfff,0xfd3fd000-0xfd3fdfff,0xfd3fe000-0xfd3fffff irq 19 at device 0.0 on pci4
vmx0: Using 512 tx descriptors and 256 rx descriptors
vmx0: Using 6 rx queues 6 tx queues
vmx0: failed to allocate 7 MSI-X vectors, err: 6 - using MSI
vmx0: Using an MSI interrupt
vmx0: Ethernet address: 00:53:00:00:00:00
vmx0: netmap queues/slots: TX 1/512, RX 1/512
vmx0: link state changed to UP


vmx0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:53:00:00:00:00
        inet 192.0.2.1 netmask 0xffffff00 broadcast 192.0.2.255
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
Comment 2 Shirkdog 2019-04-04 17:59:41 UTC
I will try to get the specifics. I was unable to access the console where this VM resides, but my guess is an older version of ESXi.
Comment 3 Shirkdog 2019-04-09 17:47:19 UTC
Information I have:

Protected by VMware HA, managed by vCenter Server.
Compatibility: ESXi 6.0 update 3 host (VM version 11)
Comment 4 Sascha Klauder 2019-11-22 12:46:59 UTC
I can confirm this on ESXi 6.0 and FreeBSD 12.1-RELEASE-p1.  Network traffic capture shows a lot of TCP retransmissions, and iperf3 measures a mere ~25 kbit/s of bandwidth.  Turning off TSO restores expected performance.
Comment 5 Dmitry Petrov 2019-12-26 14:20:38 UTC
Same problem on multiple hosts with VMware 6.5 with FreeBSD 12-stable guests.

It usually manifests itself after vMotion.
Comment 6 Brendan Shanks 2019-12-27 19:52:19 UTC
I also hit this problem with VMware ESXi 6.0u3 after upgrading from 12.0 to 12.1-RELEASE-p1. No vCenter or vMotion going on, just a standalone ESXi install. Disabling TSO fixed the problem for me ('ifconfig vmx0 -tso'). I can test any fixes if it'll help.
Comment 7 Patrick Kelsey freebsd_committer 2019-12-28 04:52:38 UTC
Repeating here what I just said on -net (basically, I think there is at least this bug in the TSO path of the current vmxnet3 driver):

I am not able to test this at the moment, nor likely in the very near future, but I did have a few minutes to do some code reading and now believe that the following is part of the problem, if not the entire problem.  Using r353803 as a reference, I believe line 1323 in sys/dev/vmware/vmxnet3/if_vmx.c (in vmxnet3_isc_txd_encap()) should be:

sop->hlen = hdrlen + ipi->ipi_tcp_hlen;

instead of the current:

sop->hlen = hdrlen;

This can be seen by going back to r333813 and examining the CSUM_TSO case of vmxnet3_txq_offload_ctx().  The final increment of *start in that case is what was literally lost in translation when converting the driver to iflib.
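
To make the size of the error concrete, here is a small standalone illustration (not an excerpt from the driver; the header sizes are the usual minimums for an untagged IPv4 TCP packet without options, and hdrlen stands in for the Ethernet + IP header length the driver already has at that point) of what the descriptor's hlen works out to with and without the TCP header length:

/*
 * Standalone illustration of the hlen miscalculation described above.
 * This is not driver code: the constants are the usual minimum header
 * sizes, and hdrlen stands in for the Ethernet + IP header length.
 */
#include <stdio.h>

int
main(void)
{
        int eth_hlen = 14;      /* Ethernet header */
        int ip_hlen = 20;       /* IPv4 header, no options */
        int tcp_hlen = 20;      /* TCP header, no options */

        int hdrlen = eth_hlen + ip_hlen;

        int hlen_broken = hdrlen;               /* what r343291 stored in sop->hlen */
        int hlen_fixed = hdrlen + tcp_hlen;     /* what the fix stores: L2 + L3 + L4 */

        printf("broken hlen = %d bytes, fixed hlen = %d bytes\n",
            hlen_broken, hlen_fixed);
        return (0);
}

For this minimum-size header stack it prints 34 vs. 54 bytes, i.e. the descriptor understates the header length by exactly the size of the TCP header.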
Comment 8 commit-hook freebsd_committer 2020-01-13 21:27:18 UTC
A commit references this bug:

Author: vmaffione
Date: Mon Jan 13 21:26:17 UTC 2020
New revision: 356703
URL: https://svnweb.freebsd.org/changeset/base/356703

Log:
  vmx: fix initialization of TSO related descriptor fields

  Fix a mistake introduced by r343291, which ported the vmx(4)
  driver to iflib.
  In case of TSO, the hlen field of the (first) tx descriptor must
  be initialized to the cumulative length of Ethernet, IP and TCP
  headers. The length of the TCP header was missing.

  PR:		236999
  Reported by:	pkelsey
  Reviewed by:	avg
  MFC after:	1 week
  Differential Revision:	https://reviews.freebsd.org/D22967

Changes:
  head/sys/dev/vmware/vmxnet3/if_vmx.c
Comment 9 SW@FL 2020-01-14 14:41:16 UTC
We also ran into this problem last year during an upgrade test (to 12.0), and because I did not know about this bug report until now, I did some testing last week.

First I upgraded a VM to the latest 11.3 patch level. There, too, the vmx driver has been removed from open-vm-tools and included in the kernel. A simple test for the wrong behavior is, for example, to connect via SSH to the VM and run

  find /usr/share/doc -name "*.html" | xargs cat

Under 11.3-RELEASE-p5 everything works well. In the next step the VM was upgraded to 12.1-RELEASE-p1, and now the problem appears: the output of the command stalls for some time, and after a few minutes an SSH timeout appears.

I have compared the ifconfig output of both FreeBSD releases:

 # diff -w vmx1.vmkernel-11.3 vmx1.vmkernel-12.1
2c2
<       options=60039b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,TSO6,RXCSUM_IPV6,TXCSUM_IPV6>
---
>       options=e403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
4d3
<       hwaddr 00:50:56:80:94:72
6d4
<       nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
8a7
>       nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

There are two additional capabilities shown under 12: JUMBO_MTU and VLAN_HWTSO. For the second there is an ifconfig parameter to disable it (-vlanhwtso), but that does not solve the problem. For the first I cannot find any way to disable it, nor any documentation (I know what jumbo frames are and what happens when one side uses such frames and the switches are not configured for those MTUs). Another posting on the internet discusses problems with a pfSense VM using a passthrough network card and the new kernel vmx driver; they had to enable TSO and LRO to solve their performance problems. So I tried disabling TSO and, voila, the problem disappears.
But I wonder: if this coding bug was only found just yesterday, why does the problem not appear in 11.3?
Comment 10 commit-hook freebsd_committer 2020-01-20 22:16:07 UTC
A commit references this bug:

Author: vmaffione
Date: Mon Jan 20 22:15:33 UTC 2020
New revision: 356932
URL: https://svnweb.freebsd.org/changeset/base/356932

Log:
  MFC r356703

  vmx: fix initialization of TSO related descriptor fields

  Fix a mistake introduced by r343291, which ported the vmx(4)
  driver to iflib.
  In case of TSO, the hlen field of the (first) tx descriptor must
  be initialized to the cumulative length of Ethernet, IP and TCP
  headers. The length of the TCP header was missing.

  PR:             236999
  Reported by:    pkelsey
  Reviewed by:    avg
  Differential Revision:  https://reviews.freebsd.org/D22967

Changes:
_U  stable/12/
  stable/12/sys/dev/vmware/vmxnet3/if_vmx.c
Comment 11 Mark Peek freebsd_committer 2020-02-08 15:54:06 UTC
*** Bug 243869 has been marked as a duplicate of this bug. ***
Comment 12 Kubilay Kocak freebsd_committer freebsd_triage 2020-02-21 04:40:16 UTC
^Triage: 

- Track earliest (reported) version affected
- Track stable merge
- Assign to committer resolving
Comment 13 kgc 2020-03-25 00:47:24 UTC
I can confirm it resolved the issue on my VMware cluster as well.  It would be great if this could get into 12.1-RELEASE-p4. Anything I can do to help?
Comment 14 Eugene Grosbein freebsd_committer 2020-04-19 17:20:40 UTC
*** Bug 242070 has been marked as a duplicate of this bug. ***
Comment 15 info 2020-05-19 16:29:27 UTC
Will this fix be released as a patch to 12.1-RELEASE, or will it come with 12.2-RELEASE?
Thanks
Comment 16 Gleb Popov freebsd_committer 2020-06-30 13:24:51 UTC
Just bumped into this on 2 production servers. What needs to be done to get it as a patch for 12.1?
Comment 17 Mark Johnston freebsd_committer 2020-07-20 18:19:38 UTC
We submitted an erratum notice request to secteam@, so this should be fixed in the next round of patches for 12.1 (no ETA yet).
Comment 18 michele 2020-08-05 18:27:33 UTC
(In reply to Mark Johnston from comment #17)
Looks like "the next round of patches for 12.1" has just happened. Including the vmx patch.
Comment 19 Mark Johnston freebsd_committer 2020-08-05 18:30:33 UTC
(In reply to michele from comment #18)
Indeed, thanks.  I will resolve the bug then.  There are some other iflib/vmx patches recently merged to stable/12 that may be candidates for future ENs.
Comment 20 dote 2021-01-15 16:38:30 UTC
1. Change the /boot/loader.conf

# File: /boot/loader.conf

# Reduce boot-time delay.
autoboot_delay="-1"

# Disable beastie logo.
beastie_disable="YES"

# Intel(R) PRO/1000 Gigabit Ethernet adapter driver, preload.
if_em_load="YES"

# Advanced Host Controller Interface (AHCI).
ahci_load="YES"

# H-TCP Congestion Control for a more aggressive increase in speed on higher
# latency, high bandwidth networks with some packet loss.
cc_htcp_load="YES"

# hostcache cachelimit is the number of ip addresses in the hostcache list.
# Setting the value to zero(0) stops any ip address connection information from
# being cached and negates the need for "net.inet.tcp.hostcache.expire". We
# find disabling the hostcache increases burst data rates by 2x if a subnet was
# incorrectly graded as slow on a previous connection. A host cache entry is
# the client's cached tcp connection details and metrics (TTL, SSTRESH and
# VARTTL) the server can use to improve future performance of connections
# between the same two hosts. When a tcp connection is completed, our server
# will cache information about the connection until an expire timeout. If a new
# connection between the same client is initiated before the cache has expired,
# the connection will use the cached connection details to setup the
# connection's internal variables. This pre-cached setup allows the client and
# server to reach optimal performance significantly faster because the server
# will not need to go through the usual steps of re-learning the optimal
# parameters for the connection. To view the current host cache stats use
# "sysctl net.inet.tcp.hostcache.list"
net.inet.tcp.hostcache.cachelimit="0"

# Change the zfs pool output to show the GPT id for each drive instead of the
# gptid or disk identifier. The gpt id will look like "ada0p2"; the gpt id
# "ada0" of the first drive found on the AHCI SATA / SAS / SCSI chain and
# partition string "p2".  Use "gpart list" to see all drive identifiers and
# "zpool status" to see the chosen id through ZFS.
kern.geom.label.disk_ident.enable="0"
kern.geom.label.gpt.enable="1"
kern.geom.label.gptid.enable="0"

# Interface Maximum Queue Length: common recommendations are to set the interface
# buffer size to the number of packets the interface can transmit (send) in 50
# milliseconds _OR_ 256 packets times the number of interfaces in the machine;
# whichever value is greater. To calculate a size of a 50 millisecond buffer
# for a 60 megabit network take the bandwidth in megabits divided by 8 bits
# divided by the MTU times 50 millisecond times 1000, 60/8/1460*50*1000=256.84
# packets in 50 milliseconds. OR, if the box has two(2) interfaces take 256
# packets times two(2) NICs to equal 512 packets. 512 is greater than 256.84 so
# set to 512.
#
# Our preference is to define the interface queue length as two(2) times the
# value set in the interface transmit descriptor ring, "hw.igb.txd". If
# hw.igb.txd="1024" then set the net.link.ifqmaxlen="2048".
#
# An indirect result of increasing the interface queue is the buffer acts like
# a large TCP initial congestion window (init_cwnd) by allowing a network stack
# to burst packets at the start of a connection. Do not set to zero(0) or
# the network will stop working due to "no network buffers" available. Do not
# set the interface buffer ludicrously large to avoid buffer bloat.
net.link.ifqmaxlen="2048"  # (default 50)

# Enable the optimized version of soreceive() for stream (TCP) sockets.
# soreceive_stream() only does one sockbuf unlock/lock per receive independent
# of the length of data to be moved into the uio compared to soreceive() which
# unlocks/locks per *mbuf*. soreceive_stream() can significantly reduce CPU
# usage and lock contention when receiving fast TCP streams. Additional gains
# are obtained when the receiving application is using SO_RCVLOWAT to batch up
# some data before a read (and wakeup) is done.
net.inet.tcp.soreceive_stream="1"  # (default 0)

2. Change the /etc/rc.conf

# RC.conf - System configuration information.

# Machine name.
hostname="orion"

# Network.
ifconfig_em0="DHCP -lro -tso"

# Enable ZFS filesystem.
zfs_enable="YES"

# Disable update motd.
update_motd="NO"

# Clear /tmp on boot.
clear_tmp_enable="YES"

# Allow packets to pass between interfaces.
gateway_enable="YES"

# Keyboard delay to 250 ms and repeat to 34 cps.
keyrate="250.34"

# Daemons disabled.
dumpdev="NO"
sendmail_enable="NONE"

# Daemons enabled.
#ntpdate_enable="YES"
#ntpdate_hosts="c.ntp.br"
#ntpd_flags="-g"
sshd_enable="YES"
syslogd_flags="-ss"

3. Change the /etc/sysctl.conf

# Firewall: Ip Forwarding to allow packets to traverse between interfaces and
# is used for firewalls, bridges and routers. When fast IP forwarding is also
# enabled, IP packets are forwarded directly to the appropriate network
# interface with direct processing to completion, which greatly improves the
# throughput. All packets for local IP addresses, non-unicast, or with IP
# options are handled by the normal IP input processing path. All features of
# the normal (slow) IP forwarding path are supported by fast forwarding
# including firewall (through pfil(9) hooks) checking, except ipsec tunnel
# brokering. The IP fast forwarding path does not generate ICMP redirect or
# source quench messages though. Compared to normal IP forwarding, fast
# forwarding can give a speedup of 40 to 60% in packet forwarding performance
# which is great for interactive connections like online games or VOIP where
# low latency is critical.
net.inet.ip.forwarding=1                   # (default 0)

# H-TCP congestion control: The Hamilton TCP (HighSpeed-TCP) algorithm is a
# packet loss based congestion control and is more aggressive pushing up to max
# bandwidth (total BDP) and favors hosts with lower TTL / VARTTL than the
# default "newreno". Understand "newreno" works well in most conditions and
# enabling HTCP may only gain you a few percentage points of throughput.
# http://www.sigcomm.org/sites/default/files/ccr/papers/2008/July/1384609-1384613.pdf
# make sure to also add 'cc_htcp_load="YES"' to /boot/loader.conf then check
# available congestion control options with "sysctl net.inet.tcp.cc.available"
net.inet.tcp.cc.algorithm=htcp  # (default newreno)

# H-TCP congestion control: adaptive back off will increase bandwidth
# utilization by adjusting the additive-increase/multiplicative-decrease (AIMD)
# backoff parameter according to the amount of buffers available on the path.
# adaptive backoff ensures no queue along the path will remain completely empty
# after a packet loss event which increases buffer efficiency.
net.inet.tcp.cc.htcp.adaptive_backoff=1  # (default 0 ; disabled)

# H-TCP congestion control: RTT scaling will increase the fairness between
# competing TCP flows traversing different RTT paths through a common
# bottleneck. rtt_scaling increases the Congestion Window Size (CWND)
# independent of path round-trip time (RTT) leading to lower latency for
# interactive sessions when the connection is saturated by bulk data transfers.
# Default is 0 (disabled)
net.inet.tcp.cc.htcp.rtt_scaling=1  # (default 0 ; disabled)

# RFC 6675 increases the accuracy of TCP Fast Recovery when combined with
# Selective Acknowledgement (net.inet.tcp.sack.enable=1). TCP loss recovery is
# enhanced by computing "pipe", a sender side estimation of the number of bytes
# still outstanding on the network. Fast Recovery is augmented by sending data
# on each ACK as necessary to prevent "pipe" from falling below the slow-start
# threshold (ssthresh). The TCP window size and SACK-based decisions are still
# determined by the congestion control algorithm; H-TCP if enabled, newreno by
# default.
net.inet.tcp.rfc6675_pipe=1  # (default 0)

# maximum segment size (MSS) specifies the largest payload of data in a single
# IPv4 TCP segment. RFC 6691 states the maximum segment size should equal the
# effective MTU minus the fixed IP and TCP headers, but without subtracting IP
# or TCP options. To construct the MSS, start with the interface MTU of 1500
# bytes and subtract 20 bytes for the IP header and 20 bytes for the TCP header
# to equal 1460 bytes. An MSS of 1460 bytes has a 97% packet efficiency
# (1460/1500=0.97) Note: with net.inet.tcp.rfc1323 enabled, hosts can negotiate
# the tcp timestamps option which reduces the packet payload by 12 bytes and
# the MSS is automatically reduced from 1460 bytes to 1448 bytes total. An MSS
# of 1448 bytes has a 96.5% packet efficiency (1448/1500=0.965) WARNING: if you
# are using PF with an outgoing scrub rule then PF will re-package the packet
# using an MTU of 1460 by default, thus overriding this mssdflt setting and
# possibly wasting CPU time.
net.inet.tcp.mssdflt=1460  # (default 536)

# minimum, maximum segment size (mMSS) specifies the smallest payload of data
# in a single IPv4 TCP segment our system will agree to send when negotiating
# with the client. RFC 6691 states that a minimum MTU frame size of 576 bytes
# must be supported and the MSS option should equal the effective MTU minus the
# fixed IP and TCP headers, but without subtracting IP or TCP options. To
# construct the minimum MSS, start with the minimum recommended MTU size of 576
# bytes and subtract 20 bytes for the IP header and 20 bytes for the TCP header
# to equal 536 bytes. An mMSS of 536 bytes should allow our server to forward
# data across any network without being fragmented and still preserve an
# overhead to data ratio of 93% packet efficiency (536/576=0.93). The default
# mMSS is only 84% efficient (216/256=0.84).
net.inet.tcp.minmss=536  # (default 216)

# Reduce the amount of SYN/ACKs the server will re-transmit to an ip address
# that did not respond to the first SYN/ACK. On a client's initial connection
# our server will always send a SYN/ACK in response to the client's initial
# SYN. Limiting retransmitted SYN/ACKs reduces local syn cache size and a "SYN
# flood" DoS attack's collateral damage by not sending SYN/ACKs back to spoofed
# ips, multiple times. If we do continue to send SYN/ACKs to spoofed IPs they
# may send RST's back to us and an "amplification" attack would begin against
# our host. If you do not wish to send retransmits at all then set to zero(0)
# especially if you are under a SYN attack. If our first SYN/ACK gets dropped
# the client will re-send another SYN if they still want to connect. Also set
# "net.inet.tcp.msl" to two(2) times the average round trip time of a client,
# but no lower than 2000ms (2s). Test with "netstat -s -p tcp" and look under
# syncache entries. http://www.ouah.org/spank.txt
# http://people.freebsd.org/~jlemon/papers/syncache.pdf
net.inet.tcp.syncache.rexmtlimit=0  # (default 3)

# IP fragments require CPU processing time and system memory to reassemble. Due
# to multiple attacks vectors ip fragmentation can contribute to and that
# fragmentation can be used to evade packet inspection and auditing, we will
# not accept ipv4 fragments. Comment out these directives when supporting
# traffic which generates fragments by design; like NFS and certain
# preternatural functions of the Sony PS4.
# https://en.wikipedia.org/wiki/IP_fragmentation_attack
net.inet.ip.maxfragpackets=0     # (default 13687)
net.inet.ip.maxfragsperpacket=0  # (default 16)

# TCP Slow start gradually increases the data send rate until the TCP
# congestion algorithm (HTCP) calculates the networks maximum carrying capacity
# without dropping packets. TCP Congestion Control with Appropriate Byte
# Counting (ABC) allows our server to increase the maximum congestion window
# exponentially by the amount of data ACKed, but limits the maximum increment
# per ACK to (abc_l_var * maxseg) bytes. An abc_l_var of 44 times a maxseg of
# 1460 bytes would allow slow start to increase the congestion window by more
# than 64 kilobytes per step; 65535 bytes is the TCP receive buffer size of
# most hosts without TCP window scaling.
net.inet.tcp.abc_l_var=44  # (default 2)

# Initial Congestion Window (initcwnd) limits the amount of segments that TCP
# can send onto the network before receiving an ACK from the other machine.
# Increasing the TCP Initial Congestion Window will reduce data transfer
# latency during the slow start phase of a TCP connection. The initial
# congestion window should be increased to speed up short, burst connections
# in order to send the most data in the shortest time frame without overloading
# any network buffers. Google's study reported sixteen(16) segments as showing
# the lowest latency initial congestion window. Also test 44 segments which is
# 65535 bytes, the TCP receive buffer size of most hosts without TCP window
# scaling. https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf
net.inet.tcp.initcwnd_segments=44             # (default 10 for FreeBSD 11.0)
#net.inet.tcp.experimental.initcwnd10=1       # (default  1 for FreeBSD 10.1)
#net.inet.tcp.experimental.initcwnd10=1       # (default  0 for FreeBSD  9.2)
#net.inet.tcp.local_slowstart_flightsize=44   # (default  4 for FreeBSD  9.1)
#net.inet.tcp.slowstart_flightsize=44         # (default  4 for FreeBSD  9.1)

# TCP Receive Window: The throughput of a connection is limited by two windows: the
# (Initial) Congestion Window and the TCP Receive Window (wsize). The Congestion
# Window avoids exceeding the capacity of the network (H-TCP congestion
# control); and the Receive Window avoids exceeding the capacity of the
# receiver to process data (flow control). When our server is able to process
# packets as fast as they are received we want to allow the remote sending
# host to send data as fast as the network, Congestion Window, will allow.
# Increase the Window Scaling Factor (wsize) to fourteen(14), which allows
# our server to receive 2^14 x 65,535 bytes = 1,073,725,440 bytes (~1 GB) on
# the network before requiring an ACK packet.
#
# maxsockbuf:   2MB  wsize:  6  2^ 6*65KB =    4MB (FreeBSD default)
# maxsockbuf: 600MB  wsize: 14  2^14*65KB = 1064MB
kern.ipc.maxsockbuf=614400000  # (wsize 14)

# Syncookies have advantages and disadvantages. Syncookies are useful if you
# are being DoS attacked as this method helps filter the proper clients from
# the attack machines. But, since the TCP options from the initial SYN are not
# saved in syncookies, the tcp options are not applied to the connection,
# precluding use of features like window scale, timestamps, or exact MSS
# sizing. As the returning ACK establishes the connection, it may be possible
# for an attacker to ACK flood a machine in an attempt to create a connection.
# Another benefit to overflowing to the point of getting a valid SYN cookie is
# the attacker can include data payload. Now that the attacker can send data to
# a FreeBSD network daemon, even using a spoofed source IP address, they can
# have FreeBSD do processing on the data which is not something the attacker
# could do without having SYN cookies. Even though syncookies are helpful
# during a DoS, we are going to disable syncookies at this time.
net.inet.tcp.syncookies=0  # (default 1)

# TCP segmentation offload (TSO), also called large segment offload (LSO),
# should be disabled on NAT firewalls and routers. TSO/LSO works by queuing up
# large buffers and letting the network interface card (NIC) split them into
# separate packets. The problem is the NIC can build a packet that is the wrong
# size and would be dropped by a switch or the receiving machine, like for NFS
# fragmented traffic. If the packet is dropped the overall sending bandwidth is
# reduced significantly. You can also disable TSO in /etc/rc.conf using the
# "-tso" directive after the network card configuration; for example,
# ifconfig_igb0="inet 10.10.10.1 netmask 255.255.255.0 -tso". Verify TSO is off
# on the hardware by making sure TSO4 and TSO6 are not seen in the "options="
# section using ifconfig.
# http://www.peerwisdom.org/2013/04/03/large-send-offload-and-network-performance/
net.inet.tcp.tso=0  # (default 1)

# Fortuna pseudorandom number generator (PRNG) maximum event size is also
# referred to as the minimum pool size. Fortuna has a main generator which
# supplies the OS with PRNG data. The Fortuna generator is seeded by 32
# separate 'Fortuna' accumulation pools which each have to be filled with at
# least 'minpoolsize' bytes before being able to seed the generator. On
# FreeBSD, the default 'minpoolsize' of 64 bytes is an estimate of how many
# bytes a new pool should contain to provide at least 128 bits of entropy.
# After a pool is used in a generator reseed, it is reset to an empty string
# and must reach 'minpoolsize' bytes again before being used as a seed. By
# increasing the 'minpoolsize' we allow higher entropy into the accumulation
# pools before being assimilated by the generator. 256 bytes will provide an
# absolute minimum of 512 bits of entropy, but realistically closer to 2048
# bits of entropy, for each of the 32 accumulation pools. Values between 64
# bytes and 256 bytes are reasonable, but higher values like 1024 bytes are
# also acceptable when coupled with a dedicated hardware based PRNG like the
# fast source Intel Secure Key RNG.
kern.random.fortuna.minpoolsize=2048  # (default 64)

# Initial Sequence Numbers (ISN) refer to the unique 32-bit sequence number
# assigned to each new Transmission Control Protocol (TCP) connection. The TCP
# protocol assigns an ISN to each new byte, beginning with 0 and incrementally
# adding a secret number every four seconds until the limit is exhausted. In
# continuous communication all available ISN options could be used up in a few
# hours. Normally a new secret number is only chosen after the ISN limit has
# been exceeded. In order to defend against Sequence Number Attacks the ISN
# secret key should not be used sufficiently often that it would be regarded as
# insecure or predictable. Reseeding will break TIME_WAIT recycling for a few
# minutes. BUT, for the more paranoid, simply choose a random number of seconds
# in which a new ISN secret should be generated.
# https://tools.ietf.org/html/rfc6528
net.inet.tcp.isn_reseed_interval=4500  # (default 0, disabled)

#
# HardenedBSD and DoS mitigation
#
hw.kbd.keymap_restrict_change=4    # disallow keymap changes for non-privileged users
kern.ipc.shm_use_phys=1            # lock shared memory into RAM and prevent it from being paged out to swap (default 0, disabled)
kern.msgbuf_show_timestamp=1       # display timestamp in msgbuf (default 0)
kern.randompid=7657                # calculate PIDs by the modulus of the integer given, choose a random int (default 0)
net.inet.icmp.drop_redirect=1      # no redirected ICMP packets (default 0)
net.inet.ip.check_interface=1      # verify packet arrives on correct interface (default 0)
net.inet.ip.portrange.first=1024   # use ports 1024 to portrange.last for outgoing connections (default 10000)
net.inet.ip.portrange.randomcps=999 # use random port allocation if less than this many ports per second are allocated (default 10)
net.inet.ip.random_id=1            # assign a random IP id to each packet leaving the system (default 0)
net.inet.ip.redirect=0             # do not send IP redirects (default 1)
net.inet.sctp.blackhole=2          # drop sctp packets destined for closed ports (default 0)
net.inet.tcp.always_keepalive=0    # disable tcp keep alive detection for dead peers, keepalive can be spoofed (default 1)
net.inet.tcp.blackhole=2           # drop tcp packets destined for closed ports (default 0)
net.inet.tcp.drop_synfin=1         # SYN/FIN packets get dropped on initial connection (default 0)
net.inet.tcp.ecn.enable=0          # Explicit Congestion Notification disabled unless proper active queue management is verified (default 2)
net.inet.tcp.fast_finwait2_recycle=1 # recycle FIN/WAIT states quickly, helps against DoS, but may cause false RST (default 0)
net.inet.tcp.finwait2_timeout=5000 # TCP FIN_WAIT_2 timeout waiting for client FIN packet before state close (default 60000, 60 sec)
net.inet.tcp.icmp_may_rst=0        # icmp may not send RST to avoid spoofed icmp/udp floods (default 1)
net.inet.tcp.keepinit=5000         # establish connection in five(5) seconds or abort attempt (default 75000, 75 secs)
net.inet.tcp.msl=2500              # Maximum Segment Lifetime, time the connection spends in TIME_WAIT state (default 30000, 2*MSL = 60 sec)
net.inet.tcp.nolocaltimewait=1     # remove TIME_WAIT states for the loopback interface (default 0)
net.inet.tcp.path_mtu_discovery=0  # disable MTU discovery since many hosts drop ICMP type 3 packets (default 1)
net.inet.tcp.rexmit_slop=70        # reduce the TCP retransmit timer, min+slop=100ms (default 200ms)
net.inet.udp.blackhole=1           # drop udp packets destined for closed sockets (default 0)
security.bsd.hardlink_check_gid=1  # unprivileged processes may not create hard links to files owned by other groups (default 0)
security.bsd.hardlink_check_uid=1  # unprivileged processes may not create hard links to files owned by other users (default 0)
security.bsd.see_other_gids=0      # groups only see their own processes. root can see all (default 1)
security.bsd.see_other_uids=0      # users only see their own processes. root can see all (default 1)
security.bsd.stack_guard_page=1    # stack smashing protection (SSP), ProPolice, defence against buffer overflows (default 0)
security.bsd.unprivileged_proc_debug=0 # unprivileged processes may not use process debugging (default 1)
security.bsd.unprivileged_read_msgbuf=0 # unprivileged processes may not read the kernel message buffer (default 1)
Comment 21 Apkarc 2021-03-07 15:06:12 UTC
MARKED AS SPAM