Created attachment 192211 [details] test program The attached program repeatedly spawns a child process that creates a socketpair. One socket is passed back to the parent. One byte is written into the other end. The parent should be able to read this byte, but sometimes 0 (EOF) is returned. It seems that the parent reuses stale socket/sockbuf memory from prior iterations. To simplify the fd passing I've used fd_send/fd_recv from libnv. Those use SCM_RIGHTS under the hood. The program must be compiled with 'cc -O3 fdpass.c -lnv'.
Created attachment 192213 [details] Dtrace script for debugging Here is a Dtrace script that could be helpful in tracing down the issue.
Created attachment 192214 [details] excerpt of Dtrace script output Here is part of the Dtrace debug log. When 'got err' is printed it means that the parent could not read the byte from the child. At this point the address of the receive sockbuf is 'fffff801071df148' and the SBS_CANTRCVMORE flag is set (CPU 1). However, further above, this sockbuf looks fine in the child (CPU 3) and there aren't any calls to socantrcvmore_locked() in the meantime. Even further above, this sockbuf was destroyed by the parent in a previous iteration (CPU 1) and therefore SBS_CANTRCVMORE is set. But this should not affect the current iteration. It looks like the memory from the child process on CPU 3 isn't made visible to CPU 1 properly.
Created attachment 192216 [details] test program (fixed) I've mistakenly uploaded the program with a workaround applied. Here is a fixed version with the workaround behind a ifdef.
To uncover the bug it helps to run multiple instances at the same time and to remove the debug output or to pipe it into /dev/null. After a few seconds, at least some instances should be dead. I've tested only amd64 for now. I could reproduce it on stock FreeBSD 10/11 systems: FreeBSD 10.3-RELEASE-p24 FreeBSD 10.3-RELEASE-p24 #0: Wed Nov 15 04:57:40 UTC 2017 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64 hw.model: Intel(R) Celeron(R) CPU N3150 @ 1.60GHz FreeBSD 11.1-RELEASE-p7 FreeBSD 11.1-RELEASE-p7 #0: Tue Mar 6 09:33:30 UTC 2018 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64 hw.model: Intel(R) Celeron(R) CPU G530 @ 2.40GHz
Created attachment 192350 [details] tar archive with test program, Dtrace script and log output I've attached a program that reproduces the bug faster. I've also updated the Dtrace script and added some new log output. The Dtrace script now prints all socantrcvmore() calls, not only those from "a.out". It turns out that the kernel sometimes closes the socket in the unp garbage collector (look for sockbuf fffff800adec3b50 in the debug log). So there probably is no cache issue after all. Investigating further...
Created attachment 192514 [details] test program (faster reproduction) Here is a better test program that should uncover the race in a few seconds at most. Compile with "cc fdpass-race-fast.c -pthread -lnv". I'm spawning another thread that just creates and closes socketpairs. This triggers the unp garbage collector constantly. I believe the race is in soreceive_generic() in sys/kern/uipc_socket.c. The "in flight" domain socket is taken off the so_rcv sockbuf, but it is not externalized yet (dom_externalize() hasn't been called). This leads to the GC thinking that the socket is still "in flight" but it doesn't actually get marked because it is off the so_rcv sockbuf. Now the GC thinks this socket is unreachable and closes it.
Created attachment 192515 [details] patch for a GC that detects cycles I've tried to modify the GC so that it detects cycles in all "in flight" sockets and only destroys sockets that are in such a cycle. With this patch the test program doesn't exit anymore. I wonder if there can be false negatives with a GC that just detects cycles. DragonFly has some interesting comments[1]. DragonFly still suffers from this race, though. It looks like Linux had the same race when they still used a mark & sweep GC instead of a cycle detecting one. Search for "rewrite AF_UNIX garbage collector". The BUGS section in the man page for send(2) looks related to this problem. This seems to be a long standing issue. [1]: https://github.com/DragonFlyBSD/DragonFlyBSD/blob/b9f81f6ad2f9ae32fb3ec268a1ea2edac10f373e/sys/kern/uipc_usrreq.c#L72
A commit references this bug: Author: pho Date: Tue May 15 15:56:52 UTC 2018 New revision: 333634 URL: https://svnweb.freebsd.org/changeset/base/333634 Log: Added a new socketpair() test scenario. PR: 227285 Obtained from: jan.kokemueller@gmail.com Changes: user/pho/stress2/misc/socketpair3.sh
Triage at 2019 Waterloo Hackathon, CC markj@ and kib@ for investigation
Modified version of patch for review at https://reviews.freebsd.org/D23142
*** Bug 239803 has been marked as a duplicate of this bug. ***
A commit references this bug: Author: jah Date: Sat Jan 25 08:57:27 UTC 2020 New revision: 357110 URL: https://svnweb.freebsd.org/changeset/base/357110 Log: Implement cycle-detecting garbage collector for AF_UNIX sockets The existing AF_UNIX socket garbage collector destroys any socket which may potentially be in a cycle, as indicated by its file reference count being equal to its enqueue count. However, this can produce false positives for in-flight sockets which aren't part of a cycle but are part of one or more SCM_RIGHTS mssages and which have been closed on the sending side. If the garbage collector happens to run at exactly the wrong time, destruction of these sockets will render them unusable on the receiving side, such that no previously-written data may be read. This change rewrites the garbage collector to precisely detect cycles: 1. The existing check of msgcount==f_count is still used to determine whether the socket is potentially in a cycle. 2. The socket is now placed on a local "dead list", which is used to reduce iteration time (and therefore contention on the global unp_link_rwlock). 3. The first pass through the dead list removes each potentially-dead socket's outgoing references from the graph of potentially-dead sockets, using a gc-specific copy of the original reference count. 4. The second series of passes through the dead list removes from the list any socket whose remaining gc refcount is non-zero, as this indicates the socket is actually accessible outside of any possible cycle. Iteration is repeated until no further sockets are removed from the dead list. 5. Sockets remaining in the dead list are destroyed as before. PR: 227285 Submitted by: jan.kokemueller@gmail.com (prior version) Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D23142 Changes: head/sys/kern/uipc_usrreq.c head/sys/sys/unpcb.h