Summary: | Highly loaded web server cath race condition on _close () from /lib/libc.so.7 with accf_http | ||
---|---|---|---|
Product: | Base System | Reporter: | Alexey <ucu8u1b-ol> |
Component: | kern | Assignee: | Julien Charbon <jch> |
Status: | Closed FIXED | ||
Severity: | Affects Some People | CC: | chris, cstdenis, glebius, iam, jcharbon |
Priority: | --- | ||
Version: | 10.3-STABLE | ||
Hardware: | amd64 | ||
OS: | Any |
Description
Alexey
2016-09-23 10:10:04 UTC
I have the same problem on 11.0-RC3 if reset_timedout_connection is enabled in Nginx (accept filters are enabled too). Running truss on a worker stuck in soclos produces this output: close(49) ERR#9 'Bad file descriptor' write(36,"2016/09/19 18:16:03 [crit] 15495#0: close() socket 49 failed (9: Bad file descriptor)\n",86) = 86 (0x56) and then worker starts to process requests as usual. Such entries are written to log when reset_timedout_connection is disabled (workers do not stuck in this case): [info] 89888#0: close() socket 12 failed (54: Connection reset by peer) Problem appeared after moving from FreeBSD 9.2 to 11.0-RC3. Discussion on Nginx forum (in russian): https://forum.nginx.org/read.php?21,269501,269778#msg-269778 I checked some SVN versions between r302223 and r305091 (10-STABLE) problem started since r302995 https://svnweb.freebsd.org/base?view=revision&sortby=date&revision=302995 Also, if I remove accept_filter=httpready from listen in nginx or disbale reset_timedout_connection option in nginx, problem is not repeated with SVN verions after r302995. /Alexey Thanks for reporting this issue, this is quite interesting. #1. Could you revert r302995 and check that everything is fine without it? It will increase a bit the contention on accept() but at 2000 req/sec it is fine, it starts to matter around 30k accept/sec. #2. Then could you try this change with r302995 in place: diff --git a/sys/netinet/tcp_syncache.c b/sys/netinet/tcp_syncache.c index f71d028..6d7ab69 100644 --- a/sys/netinet/tcp_syncache.c +++ b/sys/netinet/tcp_syncache.c @@ -922,7 +922,9 @@ syncache_socket(struct syncache *sc, struct socket *lso, struct mbuf *m) INP_WUNLOCK(inp); - soisconnected(so); + if ((so->so_options & SO_ACCEPTFILTER) == 0) { + soisconnected(so); + } TCPSTAT_INC(tcps_accepts); return (so); I just smoke tested it but not validated it, I am currently setting up nginx and accept_filter to reproduce your issue. The rational is that r302995 and using accept_filter=httpready are quite symmetric but in r302995+accept_filter=httpready soisconnected() case soisconnected() is called sooner than later, and in a different locking context. Thank you for you answer Julien. I tested both r302995 and r302994. In r302995 problem exist and in r302994 is not. I'll compile r302995+your patch right now and will be back with results. If you want to test, please check you have "reset_timedout_connection on;" in nginx.conf also. As It OFF by default, and problem is not repeated without reset_timedout_connection=on. /Alexey 1-st result after 1 hour: everything works OK with r302995M (+patch). We'll test more. Ok, thanks for your time testing. Actually the more I look at the code, the more I think this soisconnected() call is useless in all cases (accept filter or not). If your test is ok I will create a review to see if others think the same. Thank you! Looks like patch solved our problem. 5 hours under test works OK (when problem exists, 1-st worker stopped after 5-10 minutes and all workers stopped no more then in 1 hour under same load.) Ok, I will push this exact same change in HEAD. And MFC it in stable/10 and stable/11. I was not able to reproduce your issue, but I did not see any changes with or without the patch. I will later create a review to completely remove this extraneous soisconnected() call. Thanks again. A commit references this bug: Author: jch Date: Thu Sep 29 11:18:48 UTC 2016 New revision: 306443 URL: https://svnweb.freebsd.org/changeset/base/306443 Log: Fix an issue with accept_filter introduced with r261242: As a side effect of r261242 when using accept_filter the first call to soisconnected() is done earlier in tcp_input() instead of tcp_do_segment() context. Restore the expected behaviour. Note: This call to soisconnected() seems to be extraneous in all cases (with or without accept_filter). Will be addressed in a separate commit. PR: 212920 Reported by: Alexey Tested by: Alexey, jch Sponsored by: Verisign, Inc. MFC after: 1 week Changes: head/sys/netinet/tcp_syncache.c The review for removing this extraneous soisconnected() call: Remove an extraneous call to soisconnected() in syncache_socket(), introduced with r261242. https://reviews.freebsd.org/D8072 A commit references this bug: Author: jch Date: Sun Oct 9 21:02:34 UTC 2016 New revision: 306922 URL: https://svnweb.freebsd.org/changeset/base/306922 Log: MFC r306443: Fix an issue with accept_filter introduced with r261242: As a side effect of r261242 when using accept_filter the first call to soisconnected() is done earlier in tcp_input() instead of tcp_do_segment() context. Restore the expected behaviour. Note: This call to soisconnected() seems to be extraneous in all cases (with or without accept_filter). Will be addressed in a separate commit. PR: 212920 Reported by: Alexey Tested by: Alexey, jch Sponsored by: Verisign, Inc. Changes: _U stable/11/ stable/11/sys/netinet/tcp_syncache.c A commit references this bug: Author: jch Date: Sun Oct 9 21:35:12 UTC 2016 New revision: 306925 URL: https://svnweb.freebsd.org/changeset/base/306925 Log: MFC r306443: Fix an issue with accept_filter introduced with r261242: As a side effect of r261242 when using accept_filter the first call to soisconnected() is done earlier in tcp_input() instead of tcp_do_segment() context. Restore the expected behaviour. Note: This call to soisconnected() seems to be extraneous in all cases (with or without accept_filter). Will be addressed in a separate commit. PR: 212920 Reported by: Alexey Tested by: Alexey, jch Sponsored by: Verisign, Inc. Changes: stable/10/sys/netinet/tcp_syncache.c Hello! After update to r306942 problem gone Thanks a lot! With best regards /Alexey Thanks Alexey for your time and clear bug report. Please push this fix out to 11.0-RELEASE, it is a problematic bug that has been fixed for months but never pushed to release versions. Sorry, we don't have time machine, thus we can't push it to 11.0-RELEASE, which was cut before the bug was fixed. The fix is in the stable/11 branch, so the future 11.1-RELEASE will be not affected. |