Our goal is to achieve the highest number of TCP queries per second (QPS) on given hardware. A TCP query is defined here as: establishing a TCP connection, sending a small request, reading the small response, and closing the connection.

By configuring a single NIC receive queue bound to a single CPU core, TCP performance results on FreeBSD are great: we got ~52k QPS before becoming CPU bound, and we achieved the same result on Linux. However, by configuring 4 NIC receive queues each bound to a different core of the same CPU, results are lower than expected: we got only ~56k QPS, whereas we reached ~200k QPS on Linux on the same hardware.

We investigated the cause of this poor performance scaling: PMC profiling showed that more than half of CPU time was spent in _rw_rlock() and _rw_wlock_hard(), and lock profiling then showed contention on the ipi_lock of the TCP pcbinfo structure (ipi_lock being acquired with the INP_INFO_*LOCK macros).

Below are our lock profiling results ordered by "total accumulated lock wait time" (lock profiling done on vanilla FreeBSD 9.2):

# sysctl debug.lock.prof.stats | head -2; sysctl debug.lock.prof.stats | sort -n -k 4 -r | head -20
debug.lock.prof.stats:
   max  wait_max     total  wait_total    count  avg  wait_avg  cnt_hold  cnt_lock  name
   265     39477   4669994    57027602   840000    5        67         0    780171  sys/netinet/tcp_usrreq.c:728 (rw:tcp)
   248     39225   9498849    39991390  2044168    4        19         0   1919503  sys/netinet/tcp_input.c:775 (rw:tcp)
   234     39474    589181    39241879   840000    0        46         0    702845  sys/netinet/tcp_usrreq.c:984 (rw:tcp)
   234     43262    807708    22694780   840000    0        27         0    814240  sys/netinet/tcp_usrreq.c:635 (rw:tcp)
   821     39218   8592541    22346613  1106252    7        20         0   1068157  sys/netinet/tcp_input.c:1019 (rw:tcp)
   995     37316   1210480     6822269   343692    3        19         0    324585  sys/netinet/tcp_input.c:962 (rw:tcp)

The top 6 lock profiling entries all relate to the same INP_INFO (rw:tcp) lock; more details below:

#1 In tcp_usr_shutdown()
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_usrreq.c#L728

#2 In tcp_input() for SYN/FIN/RST TCP packets
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_input.c#L775

#3 In tcp_usr_close()
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_usrreq.c#L984

#4 In tcp_usr_accept()
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_usrreq.c#L635

#5 In tcp_input() for incoming TCP packets when the corresponding connection is not in ESTABLISHED state; in general, the client ACK packet of the TCP three-way handshake that is going to create the connection.
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_input.c#L1019

#6 In tcp_input() for incoming TCP packets when the corresponding connection is in TIME_WAIT state
https://github.com/freebsd/freebsd/blob/releng/9.2/sys/netinet/tcp_input.c#L962

Our explanation for this lock contention is straightforward. Our typical workload entails this packet sequence:

Received TCP packets:      Sent TCP packets:
#1 SYN             ->
                   <-      SYN-ACK
#2 ACK             ->
#3 query data      ->
                   <-      ACK + response data
                   <-      FIN
#4 ACK             ->
#5 FIN             ->
                   <-      ACK

For received packets #1, #2, #4 and #5 the write lock on INP_INFO is required in tcp_input(); only #3 does not require this lock. This means that only 1/5th of all received packets can be processed in parallel by the entire TCP stack. Moreover, the lock is also required in all major TCP syscalls: tcp_usr_shutdown(), tcp_usr_close() and tcp_usr_accept().

We are aware that achieving a rate of 200k TCP connections per second is a specific goal, but better TCP connection setup/teardown scalability could benefit other TCP network services as well.

Fix:

This lock contention is tricky to fix. The main pain point is that this global TCP lock does not only protect globally shared data, it also creates a critical section for the whole TCP stack.
Restructuring TCP stack locking in one shot could therefore lead to complex race conditions and make tests and reviews impractical. Our current strategy to lower risk is to break the lock contention mitigation down into steps:

1. Remove the INP_INFO lock from locations where it is not actually required
2. Replace the INP_INFO lock with more specific locks where appropriate
3. Change the lock order from "INP_INFO lock (before) INP" to "INP lock (before) INP_INFO"
4. Then push the INP_INFO lock deeper into the stack where appropriate
5. Introduce an INP_HASH_BUCKET lock replacing INP_INFO where appropriate

Note: By "where appropriate" we mean TCP stack parts where INP_INFO is a proven major contention point _and_ the change's side effects are clear enough to be reviewed. The main goal is to ease test and review of each step.

Patch attached with submission follows:

How-To-Repeat:
Below are the details to reproduce this contention issue using open source software:

o Software used:
- TCP client: ab version 2.4
- TCP server: nginx version 1.4.2

o Software configurations:
- server: see attached nginx.conf
  Core binding on our 12-core server:
  - The 4 NIC receive queues are bound to cores 0, 1, 2 and 3.
  - The 4 nginx workers are bound to cores 4, 5, 6 and 7.
- client: launch:
  $ for i in $(seq 0 11); do \
      taskset -c $i ab -c 2000 -n 1000000 http://server/test.html & done
  Note: We use the same Linux load driver to load both Linux and FreeBSD; we did not try to launch ab from a FreeBSD box, sorry.
- 'test.html' HTML page is simply:
  <html><head><title>Title</title></head><body><p>Body</p></body></html>
  You should get:
  - TCP request size: 92 bytes
  - TCP response size: 206 bytes

o Tunables/sysctls parameters:
- Main tunables to tune:
  # We want 4 receive queues
  hw.ixgbe.num_queues=4
  # Other tunables
  kern.ipc.maxsockets
  kern.ipc.nmbclusters
  kern.maxfiles
  kern.maxfilesperproc
  net.inet.tcp.hostcache.hashsize
  net.inet.tcp.hostcache.cachelimit
  net.inet.tcp.hostcache.bucketlimit
  net.inet.tcp.tcbhashsize
  net.inet.tcp.syncache.hashsize
  net.inet.tcp.syncache.bucketlimit
- sysctls to tune:
  # Values to increase
  kern.ipc.maxsockbuf
  kern.ipc.somaxconn
  net.inet.tcp.maxtcptw

o Monitoring tools:
- We use i7z to check when the server is CPU bound (from the sysutils/i7z port), which should show 100% in C0 state ("Processor running without halting") on the cores the NIC receive queues are bound to:

----
True Frequency (without accounting Turbo) 3325 MHz

Socket [0] - [physical cores=6, logical cores=6, max online cores ever=6]
  CPU Multiplier 25x || Bus clock frequency (BCLK) 133.00 MHz
  TURBO ENABLED on 6 Cores, Hyper Threading OFF
  Max Frequency without considering Turbo 3458.00 MHz (133.00 x [26])
  Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 cores is 27x/27x/26x/26x/26x/26x
  Real Current Frequency 3373.71 MHz (Max of below)

  Core [core-id] :Actual Freq (Mult.)   C0%   Halt(C1)%   C3 %   C6 %   Temp
  Core 1 [0]:   3370.76 (25.34x)        103        0        0      0     43
  Core 2 [1]:   3361.13 (25.27x)        103        0        0      0     42
  Core 3 [2]:   3373.71 (25.37x)        105        0        0      0     43
  Core 4 [3]:   3339.75 (25.11x)        106        0        0      0     42
  Core 5 [4]:   3323.90 (24.99x)       65.9     34.1        0      0     42
  Core 6 [5]:   3323.90 (24.99x)       65.9     34.1        0      0     41

Socket [1] - [physical cores=6, logical cores=6, max online cores ever=6]
  CPU Multiplier 25x || Bus clock frequency (BCLK) 133.00 MHz
  TURBO ENABLED on 6 Cores, Hyper Threading OFF
  Max Frequency without considering Turbo 3458.00 MHz (133.00 x [26])
  Max TURBO Multiplier (if Enabled) with 1/2/3/4/5/6 cores is 27x/27x/26x/26x/26x/26x
  Real Current Frequency 3309.13 MHz (Max of below)

  Core [core-id] :Actual Freq (Mult.)   C0%   Halt(C1)%   C3 %   C6 %   Temp
  Core 1 [6]:   3309.13 (24.88x)       47.5     52.8        0      0     43
  Core 2 [7]:   3308.36 (24.87x)         48     52.3        0      0     42
  Core 3 [8]:   3266.36 (24.56x)          1     99.6        0      0     34
  Core 4 [9]:   3244.74 (24.40x)          1     99.6        0      0     33
  Core 5 [10]:  3274.51 (24.62x)          1     99.4        0      0     38
  Core 6 [11]:  3244.08 (24.39x)          1     99.5        0      0     36

C0 = Processor running without halting
C1 = Processor running with halts (States >C0 are power saver)
C3 = Cores running with PLL turned off and core cache turned off
C6 = Everything in C3 + core state saved to last level cache
----

o PMC profiling:
The flat profile of 'unhalted-cycles' of core 1 should look like:

  %   cumulative     self               self      total
 time    seconds    seconds    calls   ms/call   ms/call   name
 55.6  198867.00  198867.00   291942    681.19    681.43   _rw_wlock_hard [7]
  2.6  208068.00    9201.00     8961   1026.78   4849.93   tcp_do_segment [14]
  2.4  216592.00    8524.00       86  99116.28 102597.15   sched_idletd [26]
  2.3  224825.00    8233.00     8233   1000.00   1000.00   _rw_rlock [27]
  1.9  231498.00    6673.00    12106    551.21  27396.73   ixgbe_rxeof [2]
  1.4  236638.00    5140.00   310457     16.56   1004.01   tcp_input [6]
  1.2  241074.00    4436.00     5051    878.24   1000.00   in_pcblookup_hash_locked [36]
  1.2  245317.00    4243.00     4243   1000.00   1000.00   bcopy [39]
  1.1  249392.00    4075.00     2290   1779.48   3295.95   knote [30]
  1.0  252956.00    3564.00      366   9737.70  18562.04   ixgbe_mq_start [31]
  0.9  256274.00    3318.00     7047    470.84   3348.41   _syncache_add [16]
  0.8  259312.00    3038.00     3038   1000.00   1000.00   bzero [51]
  0.8  262269.00    2957.00     6253    472.89   2900.54   ip_output [18]
  0.8  264978.00    2709.00     3804    712.15   1009.00   callout_lock [42]
  0.6  267185.00    2207.00     2207   1000.00   1000.00   memcpy [64]
  0.6  269273.00    2088.00      365   5720.55   7524.50   ixgbe_xmit [56]
  0.6  271321.00    2048.00     2048   1000.00   1000.00   bcmp [67]
  0.6  273291.00    1970.00     1970   1000.00   1000.73   _rw_runlock [68]
  0.5  275188.00    1897.00     1897   1000.00   1000.00   syncache_lookup [71]

And the call graph profile of _rw_wlock_hard of 'unhalted-cycles' of core 1:

                 0.68        0.00       1/291942     tcp_slowtimo [331]
                10.22        0.00      15/291942     syncache_expand [17]
                35.42        0.01      52/291942     in_pcbdrop [77]
               126.70        0.04     186/291942     tcp_usr_detach [65]
               208.44        0.07     306/291942     tcp_usr_attach [34]
              2094.65        0.73    3075/291942     in_pcblookup_hash [22]
            196390.89       68.73  288307/291942     tcp_input [6]
[7]   55.6  198867.00       69.60  291942            _rw_wlock_hard [7]
                24.96       14.43      39/50         turnstile_trywait [216]
                 7.20        5.71      12/15         turnstile_cancel [258]
                 4.00        6.90       3/3          turnstile_wait [275]
                 3.02        0.00       3/1277       critical_enter [87]
                 2.13        0.25       2/2061       spinlock_exit <cycle 1> [94]
                 0.00        1.00       1/1          lockstat_nsecs [320]
Responsible Changed From-To: freebsd-bugs->freebsd-net

Over to networking group
Attached is a first patch that removes the INP_INFO lock from tcp_usr_accept(). This change simply follows the advice given in the corresponding code comment: "A better fix would prevent the socket from being placed in the listen queue until all fields are fully initialized." For more technical details, check the comment in the related change below:

http://svnweb.freebsd.org/base?view=revision&revision=175612

With this patch applied we see no regressions and a performance improvement of ~5%, i.e. with the 9.2 vanilla kernel: 52k TCP queries per second; with 9.2 + the attached patch: 55k TCP QPS.

-- Julien
Author: gnn
Date: Tue Jan 28 20:28:32 2014
New Revision: 261242
URL: http://svnweb.freebsd.org/changeset/base/261242

Log:
  Decrease lock contention within the TCP accept case by removing
  the INP_INFO lock from tcp_usr_accept.  As the PR/patch states
  this was following the advice already in the code.  See the PR
  below for a full discussion of this change and its measured effects.

  PR:           183659
  Submitted by: Julien Charbon
  Reviewed by:  jhb

Modified:
  head/sys/netinet/tcp_syncache.c
  head/sys/netinet/tcp_usrreq.c

Modified: head/sys/netinet/tcp_syncache.c
==============================================================================
--- head/sys/netinet/tcp_syncache.c	Tue Jan 28 19:12:31 2014	(r261241)
+++ head/sys/netinet/tcp_syncache.c	Tue Jan 28 20:28:32 2014	(r261242)
@@ -682,7 +682,7 @@ syncache_socket(struct syncache *sc, str
 	 * connection when the SYN arrived.  If we can't create
 	 * the connection, abort it.
 	 */
-	so = sonewconn(lso, SS_ISCONNECTED);
+	so = sonewconn(lso, 0);
 	if (so == NULL) {
 		/*
 		 * Drop the connection; we will either send a RST or
@@ -922,6 +922,8 @@ syncache_socket(struct syncache *sc, str
 	INP_WUNLOCK(inp);
 
+	soisconnected(so);
+
 	TCPSTAT_INC(tcps_accepts);
 	return (so);

Modified: head/sys/netinet/tcp_usrreq.c
==============================================================================
--- head/sys/netinet/tcp_usrreq.c	Tue Jan 28 19:12:31 2014	(r261241)
+++ head/sys/netinet/tcp_usrreq.c	Tue Jan 28 20:28:32 2014	(r261242)
@@ -610,13 +610,6 @@ out:
 /*
  * Accept a connection.  Essentially all the work is done at higher levels;
  * just return the address of the peer, storing through addr.
- *
- * The rationale for acquiring the tcbinfo lock here is somewhat complicated,
- * and is described in detail in the commit log entry for r175612.  Acquiring
- * it delays an accept(2) racing with sonewconn(), which inserts the socket
- * before the inpcb address/port fields are initialized.  A better fix would
- * prevent the socket from being placed in the listen queue until all fields
- * are fully initialized.
  */
 static int
 tcp_usr_accept(struct socket *so, struct sockaddr **nam)
@@ -633,7 +626,6 @@ tcp_usr_accept(struct socket *so, struct
 	inp = sotoinpcb(so);
 	KASSERT(inp != NULL, ("tcp_usr_accept: inp == NULL"));
-	INP_INFO_RLOCK(&V_tcbinfo);
 	INP_WLOCK(inp);
 	if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
 		error = ECONNABORTED;
@@ -653,7 +645,6 @@ tcp_usr_accept(struct socket *so, struct
 out:
 	TCPDEBUG2(PRU_ACCEPT);
 	INP_WUNLOCK(inp);
-	INP_INFO_RUNLOCK(&V_tcbinfo);
 	if (error == 0)
 		*nam = in_sockaddr(port, &addr);
 	return error;
State Changed From-To: open->patched

Patched with commit 261242
Just a follow-up that updates the lock profiling results with short-lived TCP connection traffic on FreeBSD-10.0 RELEASE (previous results were made on FreeBSD-9.2 RELEASE):

o FreeBSD-10 RELEASE:

# sysctl debug.lock.prof.stats | head -2; sysctl debug.lock.prof.stats | sort -n -k 4 -r | head -5
debug.lock.prof.stats:
   max  wait_max    total  wait_total   count  avg  wait_avg  cnt_hold  cnt_lock  name
    37    321900  3049892    13033648  610019    4        21         0    588013  sys/netinet/tcp_input.c:778 (rw:tcp)   tcp_input() (SYN|FIN|RST)
    51    115462  3240265    12270984  553157    5        22         0    545293  sys/netinet/tcp_input.c:1013 (rw:tcp)  tcp_input() (state != ESTABLISHED)
    29     62577  1170617     8754815  305885    3        28         0    296845  sys/netinet/tcp_usrreq.c:728 (rw:tcp)  tcp_usr_close()
     6     62645   146544     8548857  292058    0        29         0    283587  sys/netinet/tcp_usrreq.c:984 (rw:tcp)  tcp_usr_shutdown()
    11     62595   198811     6525067  309009    0        21         0    304522  sys/netinet/tcp_usrreq.c:635 (rw:tcp)  tcp_usr_accept()

- The lock contention spots moved a little between 9.2 and 10.0, but nothing major: the top 5 entries still all belong to the (rw:tcp) lock (a.k.a. TCP INP_INFO).

o FreeBSD-10 RELEASE + PCBGROUP kernel option (by popular demand):

# sysctl debug.lock.prof.stats | head -2; sysctl debug.lock.prof.stats | sort -n -k 4 -r | head -5
debug.lock.prof.stats:
   max  wait_max    total  wait_total   count  avg  wait_avg  cnt_hold  cnt_lock  name
    58     84250  2970633    13154832  622401    4        21         0    598964  sys/netinet/tcp_input.c:778 (rw:tcp)   tcp_input() (SYN|FIN|RST)
    47    224326  3375328    12945466  562451    6        23         0    554567  sys/netinet/tcp_input.c:1013 (rw:tcp)  tcp_input() (state != ESTABLISHED)
    22     84332  1193078     9693951  311555    3        31         0    302420  sys/netinet/tcp_usrreq.c:728 (rw:tcp)  tcp_usr_close()
     6     84307   151411     9137383  298120    0        30         0    289496  sys/netinet/tcp_usrreq.c:984 (rw:tcp)  tcp_usr_shutdown()
    15     84351   201705     6504520  314353    0        20         0    310270  sys/netinet/tcp_usrreq.c:635 (rw:tcp)  tcp_usr_accept()

- No changes at all in the first ranks when using the PCBGROUP option on FreeBSD-10 RELEASE.
I have indeed checked that PCBGROUP was in use, since at rank #36 there is the pcbgroup-specific lock:

    11         9   289817        4815  1505626    0         0         0     16054  sys/netinet/in_pcb.c:1530 (sleep mutex:pcbgroup)

o FreeBSD-10 RELEASE + current lock mitigation patches [1][2]:

# sysctl debug.lock.prof.stats | head -2; sysctl debug.lock.prof.stats | sort -n -k 4 -r | head -20
debug.lock.prof.stats:
   max  wait_max    total  wait_total   count  avg  wait_avg  cnt_hold  cnt_lock  name
    29       297  3781629    13476466  734686    5        18         0    715214  sys/netinet/tcp_input.c:778 (rw:tcp)   tcp_input() (SYN|FIN|RST)
    35       287  3817278    12301410  672907    5        18         0    669324  sys/netinet/tcp_input.c:1013 (rw:tcp)  tcp_input() (state != ESTABLISHED)
    18       170  1392058     2494823  367131    3         6         0    357888  sys/netinet/tcp_usrreq.c:719 (rw:tcp)  tcp_usr_shutdown()
     7       141   182209     2433120  350488    0         6         0    344878  sys/netinet/tcp_usrreq.c:975 (rw:tcp)  tcp_usr_close()
    10       259    26786      933073   38101    0        24         0     37624  sys/netinet/tcp_timer.c:493 (rw:tcp)   tcp_timer_rexmt()

- No more tcp_usr_accept() (expected)

o Global results, maximum short-lived TCP connection rate without dropping a single packet:

- FreeBSD 10.0 RELEASE:            40.0k
- FreeBSD 10.0 RELEASE + PCBGROUP: 40.0k
- FreeBSD 10.0 RELEASE + patches:  56.8k

[1] Decrease lock contention within the TCP accept case by removing the INP_INFO lock from tcp_usr_accept.
    http://svnweb.freebsd.org/base?view=revision&revision=261242
[2] tw-clock-v2.patch attached in:
    http://lists.freebsd.org/pipermail/freebsd-net/2014-March/038124.html

-- Julien
Further refinements to the original patch are currently in review. Those will need to go in before this is merged to 10.
A commit references this bug:

Author: jch
Date: Mon Aug 3 12:13:58 UTC 2015
New revision: 286227
URL: https://svnweb.freebsd.org/changeset/base/286227

Log:
  Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:

  - The existing TCP INP_INFO lock continues to protect the global inpcb list
    stability during full list traversal (e.g. tcp_pcblist()).
  - A new INP_LIST lock protects inpcb list actual modifications (inp
    allocation and free) and inpcb global counters.

  It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input())
  and INP_INFO_WLOCK only in occasional operations that walk all connections.

  PR:                    183659
  Differential Revision: https://reviews.freebsd.org/D2599
  Reviewed by:           jhb, adrian
  Tested by:             adrian, nitroboost-gmail.com
  Sponsored by:          Verisign, Inc.

Changes:
  head/sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c
  head/sys/dev/cxgb/ulp/tom/cxgb_listen.c
  head/sys/dev/cxgbe/tom/t4_connect.c
  head/sys/dev/cxgbe/tom/t4_cpl_io.c
  head/sys/dev/cxgbe/tom/t4_listen.c
  head/sys/netinet/in_pcb.c
  head/sys/netinet/in_pcb.h
  head/sys/netinet/tcp_input.c
  head/sys/netinet/tcp_subr.c
  head/sys/netinet/tcp_syncache.c
  head/sys/netinet/tcp_timer.c
  head/sys/netinet/tcp_timewait.c
  head/sys/netinet/tcp_usrreq.c
  head/sys/netinet/toecore.c
  head/sys/netinet6/in6_pcb.c
Fixed in 11.0-CURRENT
A commit references this bug:

Author: jch
Date: Mon Jul 18 08:20:32 UTC 2016
New revision: 302995
URL: https://svnweb.freebsd.org/changeset/base/302995

Log:
  MFC r261242:

  Decrease lock contention within the TCP accept case by removing
  the INP_INFO lock from tcp_usr_accept.  As the PR/patch states
  this was following the advice already in the code.  See the PR
  below for a full discussion of this change and its measured effects.

  PR:           183659
  Submitted by: Julien Charbon
  Reviewed by:  jhb

Changes:
  _U  stable/10/
  stable/10/sys/netinet/tcp_syncache.c
  stable/10/sys/netinet/tcp_usrreq.c
A commit references this bug:

Author: jch
Date: Thu Nov 24 14:48:47 UTC 2016
New revision: 309108
URL: https://svnweb.freebsd.org/changeset/base/309108

Log:
  MFC r286227, r286443:

  r286227:

  Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:

  - The existing TCP INP_INFO lock continues to protect the global inpcb list
    stability during full list traversal (e.g. tcp_pcblist()).
  - A new INP_LIST lock protects inpcb list actual modifications (inp
    allocation and free) and inpcb global counters.

  It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input())
  and INP_INFO_WLOCK only in occasional operations that walk all connections.

  PR:                    183659
  Differential Revision: https://reviews.freebsd.org/D2599
  Reviewed by:           jhb, adrian
  Tested by:             adrian, nitroboost-gmail.com
  Sponsored by:          Verisign, Inc.

  r286443:

  Fix a kernel assertion issue introduced with r286227:
  Avoid too strict INP_INFO_RLOCK_ASSERT checks due to
  tcp_notify() being called from in6_pcbnotify().

  Reported by:  Larry Rosenman <ler@lerctr.org>
  Submitted by: markj, jch

Changes:
  stable/10/sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c
  stable/10/sys/dev/cxgb/ulp/tom/cxgb_listen.c
  stable/10/sys/dev/cxgbe/tom/t4_connect.c
  stable/10/sys/dev/cxgbe/tom/t4_cpl_io.c
  stable/10/sys/dev/cxgbe/tom/t4_listen.c
  stable/10/sys/netinet/in_pcb.c
  stable/10/sys/netinet/in_pcb.h
  stable/10/sys/netinet/tcp_input.c
  stable/10/sys/netinet/tcp_subr.c
  stable/10/sys/netinet/tcp_syncache.c
  stable/10/sys/netinet/tcp_timer.c
  stable/10/sys/netinet/tcp_timewait.c
  stable/10/sys/netinet/tcp_usrreq.c
  stable/10/sys/netinet/toecore.c
  stable/10/sys/netinet6/in6_pcb.c