OS: 12.2-STABLE stable/12-c243266-gd970a8218e16

In the kernel configuration:

makeoptions WITH_EXTRA_TCP_STACKS=1
options TCPHPTS

"netstat -m" after a few weeks of usage under high network load:

78365279/16/78365295 mbufs in use (current/cache/total)
43996/18104/62100/12244576 mbuf clusters in use (current/cache/total/max)
2429/16 mbuf+clusters out of packet secondary zone in use (current/cache)
12584/2975/15559/6122288 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/1814011 9k jumbo clusters in use (current/cache/total/max)
0/0/0/1020381 16k jumbo clusters in use (current/cache/total/max)
19729647K/48112K/19777759K bytes allocated to network (current/cache/total)
42375955/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
404032/0/33003 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
1899771 sendfile syscalls
399 sendfile syscalls completed without I/O request
1004979 requests for I/O initiated by sendfile
7900311 pages read by sendfile as part of a request
399 pages were valid at time of a sendfile request
0 pages were valid and substituted to bogus page
0 pages were requested for read ahead by applications
0 pages were read ahead by sendfile
0 times sendfile encountered an already busy page
0 requests for sfbufs denied
0 requests for sfbufs delayed

As you can see, almost 20 GB of RAM is allocated to the network. Eventually the server becomes unreachable:

[zone: mbuf] kern.ipc.nmbufs limit reached

Additional info and counters: https://forums.freebsd.org/threads/mbufs-leaks.78480/
I doubt that it is caused by compiling with the options mentioned in the title of this bug; I guess it is related to actually using the RACK stack. Without the above options the RACK stack can't be used at all. So could you test this by compiling a kernel with

options TCPHPTS
makeoptions WITH_EXTRA_TCP_STACKS=1

but not using the RACK stack? This means you don't make RACK the default stack by setting

net.inet.tcp.functions_default=rack

in /etc/sysctl.conf. Please comment out this line in /etc/sysctl.conf. If this results in no mbuf leak, then the problem is caused by using RACK. If the mbuf leak persists, the cause really is compiling in the TCPHPTS system.

Could you also provide the output of

sockstat -sSPtcp

from the system when mbufs are leaked?
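A minimal sketch of how to verify the resulting setup after reboot (just a suggestion; output formats may differ slightly between versions):

# if tcp_rack is loaded it should show up here, but "freebsd" should remain the default stack
sysctl net.inet.tcp.functions_available
sysctl net.inet.tcp.functions_default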
It's not easy to test because the bug appears only on the production system. I'll run tests over the next few days and provide detailed feedback.
(In reply to iron.udjin from comment #2) OK, thanks a lot. Can you describe the load on the server? Maybe I'm able to reproduce the issue locally.
(In reply to Michael Tuexen from comment #3)
An nginx web server with 100 Mbit/s of traffic at peak; most of the traffic is small static files. Here are the metrics: https://prnt.sc/xgetl4
The network driver is ixl. Please let me know if you need any additional info.
P.S. I'll rebuild the kernel in ~8 hours.
(In reply to Michael Tuexen from comment #3)
You were right: the mbuf leak is caused by using RACK (please rename the bug if possible).

I have written a simple script which shows statistics about TCP connection states and calculates the mbuf diff every 10 seconds:

#!/usr/local/bin/bash
OLD="0"
while sleep 10; do
    MBUF=`vmstat -z | grep 'mbuf:' | awk -F\, '{print $3}' | awk '{print $1}'`
    echo -n `date "+%H:%M:%S"`: `expr $MBUF - $OLD` mbufs: $MBUF "| "
    netstat -a -n -ptcp | awk '{print $6}' | sort | uniq -c | egrep -v '^ 1 $|been|Foreign' | awk '{print $2": "$1" | "}' | tr -d '\n'
    echo
    OLD=$MBUF
done

In the attachments you'll find 4 files with the script output:

no_options.txt - kernel built without the options
enabled_options.txt - kernel built with the options but RACK not enabled
rack_enabled.txt - after I manually loaded the tcp_rack module and set sysctl net.inet.tcp.functions_default=rack
rack_enabled_and_restarted.txt - with the module loaded from loader.conf and net.inet.tcp.functions_default=rack set at boot time, so all connections used RACK

"sockstat -sSPtcp" contains private data (my server IPs and visitors), so I can't share it publicly. I'll send it to your email in two files:

sockstat.txt - when I enabled RACK via sysctl, so it was only partially in use
sockstat_restart.txt - after a restart with RACK enabled at boot time, so all connections use RACK

P.S.: at night there is little traffic on the site. In the daytime, with a large number of visitors, mbufs leak at about 1000/s.
Created attachment 221821 [details] no_options.txt
Created attachment 221822 [details] enabled_options.txt
Created attachment 221823 [details] rack_enabled.txt
Created attachment 221824 [details] rack_enabled_and_restarted.txt
Thanks for testing. I changed the title of the bug to reflect that using RACK causes the mbuf leak. Let me try to reproduce this locally.
Ok, please let me know in case you need any additional info. Here is "netstat -m" after ~9 hours of uptime:

1676964/24726/1701690 mbufs in use (current/cache/total)
26555/28149/54704/12244576 mbuf clusters in use (current/cache/total/max)
1937/12737 mbuf+clusters out of packet secondary zone in use (current/cache)
1047/16401/17448/6122288 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/1814011 9k jumbo clusters in use (current/cache/total/max)
0/0/0/1020381 16k jumbo clusters in use (current/cache/total/max)
476539K/128083K/604622K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 sendfile syscalls
0 sendfile syscalls completed without I/O request
0 requests for I/O initiated by sendfile
0 pages read by sendfile as part of a request
0 pages were valid at time of a sendfile request
0 pages were valid and substituted to bogus page
0 pages were requested for read ahead by applications
0 pages were read ahead by sendfile
0 times sendfile encountered an already busy page
0 requests for sfbufs denied
0 requests for sfbufs delayed

465 MB are allocated to the network with RACK versus 60 MB without it.
I can trigger a leak locally for stable/12, but not for head.
(In reply to Michael Tuexen from comment #12)
Glad to hear that. Does that mean that tcp_rack in 13-STABLE doesn't have this bug?
(In reply to iron.udjin from comment #13) Correct. At least not the mbuf leak I'm able to reproduce.
(In reply to iron.udjin from comment #13)
I have to correct myself: I can't reproduce that locally. I just did not wait long enough for all resources to be freed again. Can you remove all traffic from one of your production machines and let it sit for a while to see whether all the TCP connections disappear and all mbufs are freed? Can you also provide the output of netstat -sptcp? Maybe that gives a hint about what is happening and how to trigger the leak.
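A rough sketch of what I mean (the service name and the interval are just placeholders for whatever fits your setup):

# stop accepting new traffic
service nginx stop
# then watch whether the connection count and the mbuf usage drain back down
while sleep 60; do
    date "+%H:%M:%S"
    netstat -an -ptcp | grep -c tcp4
    netstat -m | head -1
done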
Created attachment 221992 [details] netstat-sptcp
Today I discovered another issue: some TCP sessions are looping in a TCP retransmission state. Here is what I captured with tcpdump and dumped with tshark:

31880 32.587768 XXX.XXX.XXX.XXX → YYY.YYY.YYY.YYY TCP 1514 [TCP Retransmission] 443 → 56467 [ACK] Seq=1 Ack=1 Win=17 Len=1460
31881 32.588789 XXX.XXX.XXX.XXX → YYY.YYY.YYY.YYY TCP 1514 [TCP Retransmission] 443 → 56467 [ACK] Seq=1 Ack=1 Win=17 Len=1460
31882 32.589809 XXX.XXX.XXX.XXX → YYY.YYY.YYY.YYY TCP 1514 [TCP Retransmission] 443 → 56467 [ACK] Seq=1 Ack=1 Win=17 Len=1460
31883 32.590829 XXX.XXX.XXX.XXX → YYY.YYY.YYY.YYY TCP 1514 [TCP Retransmission] 443 → 56467 [ACK] Seq=1 Ack=1 Win=17 Len=1460
31884 32.591849 XXX.XXX.XXX.XXX → YYY.YYY.YYY.YYY TCP 1514 [TCP Retransmission] 443 → 56467 [ACK] Seq=1 Ack=1 Win=17 Len=1460
31885 32.592870 XXX.XXX.XXX.XXX → YYY.YYY.YYY.YYY TCP 1514 [TCP Retransmission] 443 → 56467 [ACK] Seq=1 Ack=1 Win=17 Len=1460
31886 32.593891 XXX.XXX.XXX.XXX → YYY.YYY.YYY.YYY TCP 1514 [TCP Retransmission] 443 → 56467 [ACK] Seq=1 Ack=1 Win=17 Len=1460
31887 32.594911 XXX.XXX.XXX.XXX → YYY.YYY.YYY.YYY TCP 1514 [TCP Retransmission] 443 → 56467 [ACK] Seq=1 Ack=1 Win=17 Len=1460
31888 32.595931 XXX.XXX.XXX.XXX → YYY.YYY.YYY.YYY TCP 1514 [TCP Retransmission] 443 → 56467 [ACK] Seq=1 Ack=1 Win=17 Len=1460

XXX.XXX.XXX.XXX - my IP
YYY.YYY.YYY.YYY - client IP

These TCP sessions produce 10-100 Mbit/s of traffic from my server to the clients. Please check the attached screenshot.
Created attachment 221993 [details] looping_sessions_traffic.png
(In reply to Michael Tuexen from comment #15)
> Can you remove all traffic from one of your production machines
Unfortunately I cannot stop the web server for a long time, but I can make a short test. Would it be a correct test case if I stop nginx and kill all established TCP sessions with tcpdrop, roughly as sketched below?

Off-topic: please check PR 253066. I'm trying to test RACK on 13-STABLE and hit a bug which blocks me.
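A rough sketch of what I have in mind (untested as written; XXX.XXX.XXX.XXX stands for my server IP, as above):

service nginx stop
# drop every remaining session terminating on port 443
netstat -an -ptcp | grep "XXX.XXX.XXX.XXX.443" | awk '{print $5}' | \
    awk -F\. '{print $1"."$2"."$3"."$4" "$5}' | \
    xargs -L1 tcpdrop XXX.XXX.XXX.XXX 443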
(In reply to iron.udjin from comment #17)
Can you privately provide a .pcap file? I would like to look at some fields not shown in the ASCII output. Also, is there really no traffic in the other direction, or did you use some sort of capture filter?
(In reply to Michael Tuexen from comment #20) Just sent to your email.
(In reply to iron.udjin from comment #21) Received, thanks a lot. I'm looking at it.
OK. The tracefile you provided shows that:
* the TCP connection does not use timestamps
* the peer is not reachable anymore, since every 3 seconds there are 3 ICMP messages indicating that the host is not reachable
* the RACK stack sends out the same TCP segment once per ms

The question is why the stack gets into this condition. Two questions:

1. When the stack has been in such a condition, can you provide the output of

sysctl net.inet.tcp.rack

Maybe the stats counters there provide some hints.

2. The graphic you provided seems to indicate that the traffic volume increases when you enable RACK. I don't see a reason for a substantial increase, so I guess that some connections get into the state we are interested in soon after you enable RACK. Could you therefore start without enabling RACK, start capturing traffic with tcpdump, enable RACK, make sure that some connection hits the bug, and then stop the capturing? It would be great if I could get the .pcap file for one of the connections getting into the buggy state. Based on the .pcap file I can write a packetdrill script to try to reproduce what is happening.
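For the capture, something along these lines should do (the interface name ixl0 is just a guess based on the ixl driver you mentioned, and the file path is a placeholder; adjust the filter if the file gets too large):

tcpdump -i ixl0 -s 0 -w /var/tmp/rack-enable.pcap 'tcp port 443'

Start it before enabling RACK and stop it once a connection is stuck in the retransmission loop; I only need the packets of one affected connection.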
(In reply to Michael Tuexen from comment #23)
1. For such looping TCP connections, on Nov 16 2020 I wrote a script which checks whether Send-Q is greater than 1700000 and drops those connections with tcpdrop:

netstat -p tcp -n | awk '{print $3" "$5" "$6}' | egrep "^[0-9]" | grep -v '^0\ ' | while read line; do
    SENDQ=`echo $line | awk '{print $1}'`
    if [ "$SENDQ" -gt "1700000" ]; then
        echo $line | awk '{print $2}' | awk -F\. '{print $1"."$2"."$3"."$4" "$5}' | xargs tcpdrop XXX.XXX.XXX.XXX 443 > /dev/null
        echo -n "`date`: " >> /root/check_drop.log
        echo "$line - DROPPED" >> /root/check_drop.log
    fi
done

# cat /root/check_drop.log | wc -l
460

Sometimes the script kills 5-10 connections per hour, sometimes nothing for a whole day. I'll disable this script, and when I catch such a connection I'll post the sysctl net.inet.tcp.rack output.

> Could you therefore start without enabling RACK, start capturing traffic with tcpdump, enable RACK, make sure that some connection hits the bug, and then stop the capturing?

2. That's impossible, because only a very low percentage of connections hit such a looping state, and it doesn't happen right after enabling RACK. In the picture I wanted to show how the looping connections increase the traffic volume. I assume a connection has to hit a few conditions to start looping.

I can write a script which collects network variables at a certain frequency and writes the output to a log file; see the sketch below. Tell me the frequency and the variables you need to see, and I'll share the logs via a private link on the web server (I'll email you the credentials).
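For example, a rough collector like this (interval, variables, and log path would be adjusted to whatever you tell me):

#!/usr/local/bin/bash
LOG=/root/rack_stats.log
while sleep 60; do
    {
        date "+%Y-%m-%d %H:%M:%S"
        netstat -m | head -7
        sysctl net.inet.tcp.rack
        netstat -an -ptcp | awk '{print $6}' | sort | uniq -c
        echo "----"
    } >> $LOG
done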
(In reply to iron.udjin from comment #24) OK. It would be great to have a look at the rack stats from the sysctl output. Can you also check in which TCP state the connection is when it is in the buggy state?
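For example, the state and the stack of a suspect connection can be checked with something like this (AAA.AAA.AAA.AAA being the client's address):

sockstat -sSPtcp | grep AAA.AAA.AAA.AAA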
(In reply to Michael Tuexen from comment #25)
I just sent you check_drop.log, which contains the connection state and Send-Q for each dropped connection.
(In reply to iron.udjin from comment #26)
Assuming that a large enough send buffer is an indication of the problem, the TCP connections are in all possible states...
(In reply to Michael Tuexen from comment #27)
Yes.

net.inet.tcp.sendspace=65536
net.inet.tcp.sendbuf_inc=32768
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.sendbuf_auto=1
net.local.stream.sendspace=163840

With these settings the HTML page load time is lower than with the default values.
After 10 hours of uptime I have 3 looping connections which produce 30 Mbit/s of traffic. I cannot say exactly when they started looping because the traffic graph grows quite linearly: https://prnt.sc/xv6ldt

netstat -m:

265055/26920/291975 mbufs in use (current/cache/total)
130635K/124086K/254721K bytes allocated to network (current/cache/total)

# sysctl net.inet.tcp.functions_available net.inet.tcp.rack
net.inet.tcp.functions_available:
Stack                           D Alias                            PCB count
freebsd                           freebsd                          3
rack                            * rack                             4107
net.inet.tcp.rack.clear: 0
net.inet.tcp.rack.tlp_nada: 0
net.inet.tcp.rack.idle_reduce_oninput: 0
net.inet.tcp.rack.prog_drops: 0
net.inet.tcp.rack.runtsacks: 854985
net.inet.tcp.rack.hit_tlp_method2: 43498272
net.inet.tcp.rack.hit_tlp_method: 26318731
net.inet.tcp.rack.tlp_calc_entered: 143382781
net.inet.tcp.rack.sack_short: 4099975
net.inet.tcp.rack.sack_restart: 0
net.inet.tcp.rack.sack_long: 466643
net.inet.tcp.rack.split_limited: 7165
net.inet.tcp.rack.alloc_limited_conns: 45
net.inet.tcp.rack.alloc_limited: 51035137
net.inet.tcp.rack.allocemerg: 0
net.inet.tcp.rack.allochard: 0
net.inet.tcp.rack.allocs: 107989742
net.inet.tcp.rack.saw_enetunreach: 0
net.inet.tcp.rack.saw_enobufs: 0
net.inet.tcp.rack.unpaced: 138312500
net.inet.tcp.rack.paced: 9822106
net.inet.tcp.rack.arm_tlp: 0
net.inet.tcp.rack.arm_rack: 0
net.inet.tcp.rack.rack_to_tot: 51807262
net.inet.tcp.rack.tlp_retran_fail: 43664
net.inet.tcp.rack.tlp_retran_bytes: 249582917
net.inet.tcp.rack.tlp_retran: 487864
net.inet.tcp.rack.tlp_new: 80270
net.inet.tcp.rack.tlp_to_total: 543162
net.inet.tcp.rack.reordering: 686406
net.inet.tcp.rack.findhigh: 26081
net.inet.tcp.rack.tsnf: 0
net.inet.tcp.rack.prrsndnew: 2614014
net.inet.tcp.rack.prrsndret: 55895077
net.inet.tcp.rack.badfr_bytes: 255611021
net.inet.tcp.rack.badfr: 241600
net.inet.tcp.rack.inc_var: 0
net.inet.tcp.rack.pktdelay: 1
net.inet.tcp.rack.reorder_fade: 60000
net.inet.tcp.rack.rtt_tlp_thresh: 1
net.inet.tcp.rack.reorder_thresh: 2
net.inet.tcp.rack.earlyrecovery: 1
net.inet.tcp.rack.earlyrecoveryseg: 6
net.inet.tcp.rack.minto: 1
net.inet.tcp.rack.prr_sendalot: 1
net.inet.tcp.rack.hptsi_seg_max: 40
net.inet.tcp.rack.hptsi_every_seg: 1
net.inet.tcp.rack.hptsi_reduces: 4
net.inet.tcp.rack.tlp_cwnd_flag: 0
net.inet.tcp.rack.recovery_prop: 10
net.inet.tcp.rack.recovery_loss_prop: 0
net.inet.tcp.rack.tlp_retry: 2
net.inet.tcp.rack.maxrto: 30000
net.inet.tcp.rack.minrto: 30
net.inet.tcp.rack.rack_tlimit: 0
net.inet.tcp.rack.rack_tlp_in_recovery: 1
net.inet.tcp.rack.send_oldest: 0
net.inet.tcp.rack.sblklimit: 128
net.inet.tcp.rack.precache: 1
net.inet.tcp.rack.tlpminto: 10
net.inet.tcp.rack.delayed_ack: 200
net.inet.tcp.rack.sackfiltering: 1
net.inet.tcp.rack.bb_verbose: 0
net.inet.tcp.rack.idle_reduce_high: 0
net.inet.tcp.rack.min_pace_segs: 6
net.inet.tcp.rack.min_pace_time: 0
net.inet.tcp.rack.tlpmethod: 2
net.inet.tcp.rack.data_after_close: 1
net.inet.tcp.rack.rate_sample_method: 1
net.inet.tcp.rack.map_splitlimit: 256
net.inet.tcp.rack.map_limit: 1024

# sockstat -sSPtcp | grep AAA.AAA.AAA.AAA
? ? ? ? tcp4 XXX.XXX.XXX.XXX:443 AAA.AAA.AAA.AAA:62799 LAST_ACK rack
# sockstat -sSPtcp | grep BBB.BBB.BBB.BBB
? ? ? ? tcp4 XXX.XXX.XXX.XXX:443 BBB.BBB.BBB.BBB:32440 LAST_ACK rack
# sockstat -sSPtcp | grep CCC.CCC.CCC.CCC
? ? ? ? tcp4 XXX.XXX.XXX.XXX:443 CCC.CCC.CCC.CCC:50608 CLOSING rack
(In reply to Michael Tuexen from comment #27)
Today I was lucky enough to capture the beginning of a connection which ends up looping in the LAST_ACK state.

Uptime is 10 days. RAM usage:

1732691K/66655K/1799346K bytes allocated to network (current/cache/total)

...even at night, with only ~500 connections in total.

# sysctl net.inet.tcp.rack
net.inet.tcp.rack.clear: 0
net.inet.tcp.rack.tlp_nada: 0
net.inet.tcp.rack.idle_reduce_oninput: 0
net.inet.tcp.rack.prog_drops: 0
net.inet.tcp.rack.runtsacks: 206499410
net.inet.tcp.rack.hit_tlp_method2: 1174282781
net.inet.tcp.rack.hit_tlp_method: 816793264
net.inet.tcp.rack.tlp_calc_entered: 4296358296
net.inet.tcp.rack.sack_short: 144131132
net.inet.tcp.rack.sack_restart: 0
net.inet.tcp.rack.sack_long: 14627478
net.inet.tcp.rack.split_limited: 155396
net.inet.tcp.rack.alloc_limited_conns: 1673
net.inet.tcp.rack.alloc_limited: 6547638618
net.inet.tcp.rack.allocemerg: 0
net.inet.tcp.rack.allochard: 0
net.inet.tcp.rack.allocs: 3101693560
net.inet.tcp.rack.saw_enetunreach: 0
net.inet.tcp.rack.saw_enobufs: 0
net.inet.tcp.rack.unpaced: 8990056544
net.inet.tcp.rack.paced: 328607481
net.inet.tcp.rack.arm_tlp: 0
net.inet.tcp.rack.arm_rack: 0
net.inet.tcp.rack.rack_to_tot: 4001642380
net.inet.tcp.rack.tlp_retran_fail: 1353037
net.inet.tcp.rack.tlp_retran_bytes: 7637629125
net.inet.tcp.rack.tlp_retran: 14327372
net.inet.tcp.rack.tlp_new: 2632317
net.inet.tcp.rack.tlp_to_total: 16766301
net.inet.tcp.rack.reordering: 21745075
net.inet.tcp.rack.findhigh: 480921
net.inet.tcp.rack.tsnf: 0
net.inet.tcp.rack.prrsndnew: 95096931
net.inet.tcp.rack.prrsndret: 6697135814
net.inet.tcp.rack.badfr_bytes: 8262477089
net.inet.tcp.rack.badfr: 7695368
net.inet.tcp.rack.inc_var: 0
net.inet.tcp.rack.pktdelay: 1
net.inet.tcp.rack.reorder_fade: 60000
net.inet.tcp.rack.rtt_tlp_thresh: 1
net.inet.tcp.rack.reorder_thresh: 2
net.inet.tcp.rack.earlyrecovery: 1
net.inet.tcp.rack.earlyrecoveryseg: 6
net.inet.tcp.rack.minto: 1
net.inet.tcp.rack.prr_sendalot: 1
net.inet.tcp.rack.hptsi_seg_max: 40
net.inet.tcp.rack.hptsi_every_seg: 1
net.inet.tcp.rack.hptsi_reduces: 4
net.inet.tcp.rack.tlp_cwnd_flag: 0
net.inet.tcp.rack.recovery_prop: 10
net.inet.tcp.rack.recovery_loss_prop: 0
net.inet.tcp.rack.tlp_retry: 2
net.inet.tcp.rack.maxrto: 30000
net.inet.tcp.rack.minrto: 30
net.inet.tcp.rack.rack_tlimit: 0
net.inet.tcp.rack.rack_tlp_in_recovery: 1
net.inet.tcp.rack.send_oldest: 0
net.inet.tcp.rack.sblklimit: 128
net.inet.tcp.rack.precache: 1
net.inet.tcp.rack.tlpminto: 10
net.inet.tcp.rack.delayed_ack: 200
net.inet.tcp.rack.sackfiltering: 1
net.inet.tcp.rack.bb_verbose: 0
net.inet.tcp.rack.idle_reduce_high: 0
net.inet.tcp.rack.min_pace_segs: 6
net.inet.tcp.rack.min_pace_time: 0
net.inet.tcp.rack.tlpmethod: 2
net.inet.tcp.rack.data_after_close: 1
net.inet.tcp.rack.rate_sample_method: 1
net.inet.tcp.rack.map_splitlimit: 256
net.inet.tcp.rack.map_limit: 1024

Michael, I just sent you the .pcap file. Please let me know in case you need any other info.
My memory is shaky (you might want to check with Johannes Lundberg), but I think we saw this at LLNW too with LAST_ACK a few years ago. I believe we added a send loop counter and an exit as a workaround, because we never root-caused it.
Please reopen if the problem still exists in main. The RACK stack has changed substantially.