Bug 256217 - [tcp] High system load because of interrupts with RACK
Summary: [tcp] High system load because of interrupts with RACK
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-RELEASE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: needs-qa, performance
Depends on:
Blocks:
 
Reported: 2021-05-28 10:57 UTC by Christos Chatzaras
Modified: 2021-07-17 02:26 UTC
CC List: 4 users

See Also:
koobs: mfc-stable13?
koobs: mfc-stable12-
koobs: mfc-stable11-


Attachments
tcp_hpts panic (373.20 KB, image/png)
2021-05-29 14:36 UTC, Christos Chatzaras

Description Christos Chatzaras 2021-05-28 10:57:00 UTC
After enabling RACK I notice high system load caused by a high interrupt rate.

I compiled a kernel with these options:

makeoptions   WITH_EXTRA_TCP_STACKS=1
options       TCPHPTS
options       RATELIMIT

I added this to /etc/sysctl.conf:

net.inet.tcp.functions_default=rack

and this to /boot/loader.conf:

kern.eventtimer.timer=HPET

After rebooting the servers, "top" shows 5% - 45% interrupts (some servers show more, some less).

The issue happens with "kern.eventtimer.timer=LAPIC" too.

When I then disable RACK, switch back to LAPIC and reboot, "top" shows 0.0% - 0.1% interrupts.

Switching to "net.inet.tcp.functions_default=freebsd" without a reboot does not reduce the interrupts.

Searching for similar issues I found this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=241958
Comment 1 Christos Chatzaras 2021-05-28 12:55:59 UTC
With both LAPIC and HPET there are many interrupts.

----

CPU 0: 39.0% user,  0.0% nice,  6.7% system, 47.2% interrupt,  7.1% idle
CPU 1: 40.6% user,  0.0% nice,  3.5% system, 46.1% interrupt,  9.8% idle
CPU 2: 44.5% user,  0.0% nice,  5.5% system, 43.7% interrupt,  6.3% idle
CPU 3: 42.5% user,  0.0% nice,  6.7% system, 42.9% interrupt,  7.9% idle
CPU 4: 38.2% user,  0.0% nice,  5.5% system, 48.4% interrupt,  7.9% idle
CPU 5: 40.2% user,  0.0% nice,  5.1% system, 47.2% interrupt,  7.5% idle
CPU 6: 42.5% user,  0.0% nice,  5.9% system, 44.1% interrupt,  7.5% idle
CPU 7: 44.5% user,  0.0% nice,  6.7% system, 44.5% interrupt,  4.3% idle
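
The per-CPU figures above can be condensed into a single number. The following is a minimal sketch and not part of the original report: `avg_interrupt_pct` is a hypothetical helper that reads per-CPU lines in the format shown above on standard input and prints the mean of the interrupt column.

```shell
# Hypothetical helper (not from the report): average the "interrupt"
# percentage over all "CPU n:" lines read from standard input.
avg_interrupt_pct() {
    awk '
        /^CPU [0-9]+:/ {
            for (i = 2; i <= NF; i++) {
                # The percentage is the field just before the word "interrupt".
                if ($i ~ /^interrupt/) { sum += $(i - 1) + 0; n++ }
            }
        }
        END { if (n > 0) printf "%.1f\n", sum / n }
    '
}
```

Saved per-CPU output from "top" can be piped into the function; for the eight lines above it prints 45.5.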


----

sysctl kern.eventtimer.timer=LAPIC

cpu0:timer                          3814       3810
cpu1:timer                          3868       3864
cpu2:timer                          3947       3943
cpu3:timer                          3916       3912
cpu4:timer                          3780       3776
cpu5:timer                          3917       3913
cpu6:timer                          3860       3856
cpu7:timer                          3814       3810

----

sysctl kern.eventtimer.timer=HPET

irq120: hpet0:t0                    3047       3047
irq121: hpet0:t1                    3053       3053
irq122: hpet0:t2                    3146       3146
irq123: hpet0:t3                    3099       3099
irq124: hpet0:t4                    2942       2942
irq125: hpet0:t5                    3120       3120
irq126: hpet0:t6                    3037       3037
irq127: hpet0:t7                    3049       3049
irq129: ahci0                        101        101
irq130: em0:irq0                     401        401
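
To compare the two listings above, the last column can be totalled. This is a rough sketch with a hypothetical helper name, `sum_timer_rates`. It assumes the last column is a per-second rate, which the comment does not state, and it only counts lines that mention `timer` or `hpet`.

```shell
# Hypothetical helper (not from the report): add up the last column of every
# timer-related line read from standard input. Lines for other devices
# (for example ahci0 or em0) are ignored.
sum_timer_rates() {
    awk '
        /timer|hpet/ { sum += $NF }
        END { print sum + 0 }
    '
}
```

Fed the eight `cpuN:timer` lines above it prints 30884, and fed the HPET listing it prints 24493.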
Comment 2 Michael Tuexen freebsd_committer freebsd_triage 2021-05-28 13:03:44 UTC
Are you experiencing this in a VM? If yes, what is the virtualiser and what is the host OS?

Do you also experience the load when using the kernel you described, but with:
* the RACK stack not loaded at all?
* the RACK stack loaded but not used at all (so without setting net.inet.tcp.functions_default=rack)?

We had a report earlier, but could never reproduce it.
Comment 3 Christos Chatzaras 2021-05-28 14:04:41 UTC
These are dedicated servers with no virtualisation.

----

NIC:

em0: <Intel(R) PRO/1000 Network Connection> mem 0xf7000000-0xf701ffff irq 16 at device 31.6 on pci0

----

1) tcp_rack_load="YES" && "net.inet.tcp.functions_default=freebsd" && reboot = problem does NOT exist

2) tcp_rack_load="NO" && "net.inet.tcp.functions_default=freebsd" && reboot = problem does NOT exist

----

3) tcp_rack_load="YES" && "net.inet.tcp.functions_default=rack" && reboot = the problem EXISTS.

If I then set "net.inet.tcp.functions_default=freebsd" without a reboot, the problem still EXISTS.

But if, after changing to "freebsd", I run "kldunload -f tcp_rack", the problem does NOT exist.
Comment 4 Michael Tuexen freebsd_committer freebsd_triage 2021-05-28 17:42:30 UTC
So loading the RACK module, but not using it, does not trigger the issue.

Getting rid of all RACK based TCP connections seems to resolve the issue.

Can you do the following experiment?

1) Load RACK and use net.inet.tcp.functions_default=freebsd at boot time. You should not experience the problem.
2) Switch the stack for new connections to RACK by using sysctl net.inet.tcp.functions_default=rack. You can check the stack being used by using sockstat -SPtcp. The problem should now show up once new connections are established.
3) Switch the stack for new connections to the base stack by using sysctl net.inet.tcp.functions_default=freebsd. Either wait until the connections using RACK have been closed or kill them by using tcpdrop -S rack. This should resolve the issue.

If the behaviour is as I think it is, the problem is only there if you have active RACK connections.
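
The three steps above can be sketched as a small script. This is only an illustration under stated assumptions: the function names and the `DRY_RUN` switch are hypothetical, and the commands themselves are the ones quoted in the steps. With `DRY_RUN=1` (the default here) the commands are printed rather than executed.

```shell
# Hypothetical wrapper around the commands from the steps above.
# DRY_RUN=1 (default) only prints each command; DRY_RUN=0 runs it,
# which needs root on a FreeBSD host with the RACK module loaded.
: "${DRY_RUN:=1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps 2 and 3: choose the stack used for new connections.
set_default_stack() {
    run sysctl "net.inet.tcp.functions_default=$1"
}

# Step 2: list TCP sockets together with the stack each one uses.
show_stacks() {
    run sockstat -SPtcp
}

# Step 3: drop the connections that still use RACK.
drop_rack() {
    run tcpdrop -S rack
}

# Whole sequence. In a real run, wait between the first two and the last two
# commands so that new connections can be established and observed.
rack_experiment() {
    set_default_stack rack
    show_stacks
    set_default_stack freebsd
    drop_rack
}
```

Calling `rack_experiment` in dry-run mode prints the four commands in order, which makes it easy to review the sequence before running anything on a production host.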
Comment 5 Christos Chatzaras 2021-05-28 19:00:15 UTC
To avoid a reboot I did the following, which I believe is equivalent:

sysctl net.inet.tcp.functions_default
net.inet.tcp.functions_default: freebsd

kldload tcp_rack

sysctl net.inet.tcp.functions_default=rack

I restart all services so new connections (mostly nginx) use "rack".

I wait a few minutes and interrupts increase to ~5%

sysctl net.inet.tcp.functions_default=freebsd

tcpdrop -S rack

Interrupts decrease to ~ 0.3% (looks like fewer connections = fewer interrupts)

New connections still use "rack", so I restart all services again.

Now all connections use "freebsd" and interrupts decrease to ~ 0.0%

----

I also notice that, while the issue is occurring, running "sysctl kern.eventtimer.timer=HPET" makes the interrupts increase immediately, and running "sysctl kern.eventtimer.timer=LAPIC" makes them decrease immediately.

----

So it looks like active rack connections cause the issue.
Comment 6 Michael Tuexen freebsd_committer freebsd_triage 2021-05-28 19:21:51 UTC
(In reply to Christos Chatzaras from comment #5)
Thanks for testing, I guess you confirmed what I assumed is causing the behaviour.

I have no idea why this happens and if it is sort of expected or not. I'll try to reproduce it locally and will report. I'll also bring it up on a biweekly transport call.
Comment 7 Christos Chatzaras 2021-05-29 09:52:34 UTC
I was able to reproduce the issue with a "test" server && 10 VPS running Linux. On each VPS I run wrk (benchmarking tool):

wrk -c 1000 -d 3600s http://url

This creates 10000 concurrent connections to "test" server.

Also in my nginx.conf I replace "keepalive_timeout 60;" with "keepalive_timeout 0;" so connections are not reused which helps to show more interrupts faster.

Then I kill all "wrk" and after 1 minute:

sockstat -sSPtcp | grep rack | wc -l
    7421

At that moment "top" shows 20% interrupts and "netstat 1" shows 1-10 packets / sec, which I believe is my ssh session, so there is no real activity. The "rack" connections were in FIN_WAIT_1 and CLOSING states at that time.

After a few minutes most "rack" connections close and "top" shows ~ 0.7% interrupts. At that moment I had 4 stuck connections in LAST_ACK state, which I dropped using "tcpdrop -s LAST_ACK", and finally "top" shows ~ 0.0% interrupts.
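
A per-state breakdown of the remaining "rack" connections can make this easier to follow. The following is a sketch only: `rack_states` is a hypothetical helper, and rather than assume a column order for "sockstat -sSPtcp" it scans each line for the word `rack` and for a TCP state name.

```shell
# Hypothetical helper (not from the report): count connections per TCP state
# for lines that mention the "rack" stack. Reads sockstat-style output on
# standard input and prints "<count> <STATE>" sorted by state name.
rack_states() {
    awk '
        {
            is_rack = 0; state = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "rack") is_rack = 1
                if ($i ~ /^(CLOSED|LISTEN|SYN_SENT|SYN_RCVD|ESTABLISHED|CLOSE_WAIT|FIN_WAIT_1|FIN_WAIT_2|CLOSING|LAST_ACK|TIME_WAIT)$/) state = $i
            }
            if (is_rack && state != "") count[state]++
        }
        END { for (s in count) print count[s], s }
    ' | sort -k2
}
```

On the test server this could be run as "sockstat -sSPtcp | rack_states" at intervals to watch the FIN_WAIT_1, CLOSING and LAST_ACK counts drain.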

If it helps I can give root access to the "test" server.
Comment 8 Michael Tuexen freebsd_committer freebsd_triage 2021-05-29 10:49:17 UTC
(In reply to Christos Chatzaras from comment #7)
According to rrs@ (who wrote the HPTS system), the interrupt issue is known. There is code that reduces the interrupt load, but it has not yet been committed to the tree.

Regarding stuck connections: Do they clear up after some time or do they stay around?
Comment 9 Christos Chatzaras 2021-05-29 11:45:15 UTC
HPTS system is "kern.eventtimer.timer=HPET", right? With HPET I see (with "top") ~ 2 times more interrupts in comparison with LAPIC.

I tried again to reproduce the issue with a STABLE/13 kernel and it exists too. Next I will try with a CURRENT kernel as I see some fixes for RACK.

Regarding the "stuck" connections in LAST_ACK state: someone is trying to brute-force SSH and "sshguard" blocks the connections (I have 4 "stuck" SSH connections at the moment). I also have 2 "stuck" connections on port 80 from the "wrk benchmark". I found this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=25986 which is an old PR, but @johalun replied that he saw this issue a few months ago. They are not talking about "rack", though, so this is not related. The only relation to "rack" is that these "stuck" connections create interrupts. At the moment these connections have been "stuck" for more than 35 minutes. I also see some "stuck" connections on other servers that use the "freebsd" stack.
Comment 10 Michael Tuexen freebsd_committer freebsd_triage 2021-05-29 12:25:49 UTC
(In reply to Christos Chatzaras from comment #9)
> HPTS system is "kern.eventtimer.timer=HPET", right? With HPET I see (with "top") ~ 2 times more interrupts in comparison with LAPIC.

No. HPTS is a system for high resolution timing, which can be used by TCP. The configuration parameters you are referring to are generic time sources.
Please note that when using HPTS with RACK, more events are generated and handled compared to the base stack. That is what HPTS is for. However, the interrupt load can be reduced by some optimisations. These are the optimisations rrs@ is referring to.

> I tried again to reproduce the issue with a STABLE/13 kernel and it exists too. Next I will try with a CURRENT kernel as I see some fixes for RACK.

releng/13 and stable/13 should be very similar with respect to RACK. All improvements
are only in current right now.

> Regarding the "stuck" connections in LAST_ACK state: someone is trying to brute-force SSH and "sshguard" blocks the connections (I have 4 "stuck" SSH connections at the moment). I also have 2 "stuck" connections on port 80 from the "wrk benchmark". I found this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=25986 which is an old PR, but @johalun replied that he saw this issue a few months ago. They are not talking about "rack", though, so this is not related. The only relation to "rack" is that these "stuck" connections create interrupts. At the moment these connections have been "stuck" for more than 35 minutes. I also see some "stuck" connections on other servers that use the "freebsd" stack.

I don't know how sshguard works, but I would be interested in understanding why that
happens...
Comment 11 Christos Chatzaras 2021-05-29 13:09:37 UTC
SSHGuard ( https://www.freshports.org/security/sshguard ) is a tool that checks /var/log/auth.log, /var/log/maillog, etc. and blocks IPs that try to brute-force passwords. When someone tries, for example, to log in over SSH multiple times with a wrong password, it blocks their IP using IPFW (or PF). As the remote host is no longer "reachable", we can't receive any packets from it, which I believe causes the connection to get stuck in LAST_ACK state.

I tried a CURRENT kernel with a 13.0 userland. I connected successfully using SSH, but after I ran the "wrk benchmark" the server hung. Maybe I have to build a CURRENT userland too. I asked the datacenter to connect a KVM to see if the monitor shows more information.
Comment 12 Christos Chatzaras 2021-05-29 14:36:55 UTC
Created attachment 225360
tcp_hpts panic

The CURRENT kernel (with 13.0 userland) panicked because of LRO. I disabled LRO and booted the server successfully. But during the "wrk benchmark" it panicked (tcp_hpts).
Comment 13 Michael Tuexen freebsd_committer freebsd_triage 2021-07-13 20:43:51 UTC
I have MFCed to stable/13 some performance improvements committed by rrs@ to main.

Can you retest to see if the load is now less than before?
Comment 14 Christos Chatzaras 2021-07-16 11:29:43 UTC
(In reply to Michael Tuexen from comment #13)

Thank you. At the moment I don't have a test system available. When I have one, I will redo the tests and report the results.