Bug 227285 - File descriptor passing does not work reliably on SMP system
Summary: File descriptor passing does not work reliably on SMP system
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Some People
Assignee: Jason A. Harmening
Keywords: patch
: 239803 (view as bug list)
Depends on:
Reported: 2018-04-04 14:24 UTC by Jan Kokemüller
Modified: 2020-05-09 04:29 UTC (History)
7 users (show)

See Also:

test program (2.42 KB, text/plain)
2018-04-04 14:24 UTC, Jan Kokemüller
no flags Details
Dtrace script for debugging (2.47 KB, text/plain)
2018-04-04 14:27 UTC, Jan Kokemüller
no flags Details
excerpt of Dtrace script output (39.45 KB, text/plain)
2018-04-04 14:34 UTC, Jan Kokemüller
no flags Details
test program (fixed) (2.47 KB, text/plain)
2018-04-04 16:05 UTC, Jan Kokemüller
no flags Details
tar archive with test program, Dtrace script and log output (50.50 KB, application/x-tar)
2018-04-08 20:06 UTC, Jan Kokemüller
no flags Details
test program (faster reproduction) (2.64 KB, text/plain)
2018-04-14 19:42 UTC, Jan Kokemüller
no flags Details
patch for a GC that detects cycles (7.51 KB, patch)
2018-04-14 19:59 UTC, Jan Kokemüller
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jan Kokemüller 2018-04-04 14:24:45 UTC
Created attachment 192211 [details]
test program

The attached program repeatedly spawns a child process that creates a socketpair. One socket is passed back to the parent. One byte is written into the other end. The parent should be able to read this byte, but sometimes 0 (EOF) is returned.

It seems that the parent reuses stale socket/sockbuf memory from prior iterations.

To simplify the fd passing I've used fd_send/fd_recv from libnv. Those use SCM_RIGHTS under the hood. The program must be compiled with 'cc -O3 fdpass.c -lnv'.
Comment 1 Jan Kokemüller 2018-04-04 14:27:00 UTC
Created attachment 192213 [details]
Dtrace script for debugging

Here is a Dtrace script that could be helpful in tracing down the issue.
Comment 2 Jan Kokemüller 2018-04-04 14:34:51 UTC
Created attachment 192214 [details]
excerpt of Dtrace script output

Here is part of the Dtrace debug log.

When 'got err' is printed it means that the parent could not read the byte from the child. At this point the address of the receive sockbuf is 'fffff801071df148' and the SBS_CANTRCVMORE flag is set (CPU 1).

However, further above, this sockbuf looks fine in the child (CPU 3) and there aren't any calls to socantrcvmore_locked() in the meantime.

Even further above, this sockbuf was destroyed by the parent in a previous iteration (CPU 1) and therefore SBS_CANTRCVMORE is set. But this should not affect the current iteration. It looks like the memory from the child process on CPU 3 isn't made visible to CPU 1 properly.
Comment 3 Jan Kokemüller 2018-04-04 16:05:22 UTC
Created attachment 192216 [details]
test program (fixed)

I've mistakenly uploaded the program with a workaround applied. Here is a fixed version with the workaround behind a ifdef.
Comment 4 Jan Kokemüller 2018-04-05 06:17:26 UTC
To uncover the bug it helps to run multiple instances at the same time and to remove the debug output or to pipe it into /dev/null. After a few seconds, at least some instances should be dead.

I've tested only amd64 for now. I could reproduce it on stock FreeBSD 10/11 systems:

FreeBSD 10.3-RELEASE-p24 FreeBSD 10.3-RELEASE-p24 #0: Wed Nov 15 04:57:40 UTC 2017     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
hw.model: Intel(R) Celeron(R) CPU  N3150  @ 1.60GHz

FreeBSD 11.1-RELEASE-p7 FreeBSD 11.1-RELEASE-p7 #0: Tue Mar  6 09:33:30 UTC 2018     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
hw.model: Intel(R) Celeron(R) CPU G530 @ 2.40GHz
Comment 5 Jan Kokemüller 2018-04-08 20:06:01 UTC
Created attachment 192350 [details]
tar archive with test program, Dtrace script and log output

I've attached a program that reproduces the bug faster. I've also updated the Dtrace script and added some new log output. The Dtrace script now prints all socantrcvmore() calls, not only those from "a.out".

It turns out that the kernel sometimes closes the socket in the unp garbage collector (look for sockbuf fffff800adec3b50 in the debug log). So there probably is no cache issue after all.

Investigating further...
Comment 6 Jan Kokemüller 2018-04-14 19:42:45 UTC
Created attachment 192514 [details]
test program (faster reproduction)

Here is a better test program that should uncover the race in a few seconds at most. Compile with "cc fdpass-race-fast.c -pthread -lnv". I'm spawning another thread that just creates and closes socketpairs. This triggers the unp garbage collector constantly.

I believe the race is in soreceive_generic() in sys/kern/uipc_socket.c. The "in flight" domain socket is taken off the so_rcv sockbuf, but it is not externalized yet (dom_externalize() hasn't been called). This leads to the GC thinking that the socket is still "in flight" but it doesn't actually get marked because it is off the so_rcv sockbuf. Now the GC thinks this socket is unreachable and closes it.
Comment 7 Jan Kokemüller 2018-04-14 19:59:37 UTC
Created attachment 192515 [details]
patch for a GC that detects cycles

I've tried to modify the GC so that it detects cycles in all "in flight" sockets and only destroys sockets that are in such a cycle. With this patch the test program doesn't exit anymore.

I wonder if there can be false negatives with a GC that just detects cycles. DragonFly has some interesting comments[1]. DragonFly still suffers from this race, though.

It looks like Linux had the same race when they still used a mark & sweep GC instead of a cycle detecting one. Search for "rewrite AF_UNIX garbage collector".

The BUGS section in the man page for send(2) looks related to this problem. This seems to be a long standing issue.

[1]: https://github.com/DragonFlyBSD/DragonFlyBSD/blob/b9f81f6ad2f9ae32fb3ec268a1ea2edac10f373e/sys/kern/uipc_usrreq.c#L72
Comment 8 commit-hook freebsd_committer 2018-05-15 15:57:13 UTC
A commit references this bug:

Author: pho
Date: Tue May 15 15:56:52 UTC 2018
New revision: 333634
URL: https://svnweb.freebsd.org/changeset/base/333634

  Added a new socketpair() test scenario.

  PR:		227285
  Obtained from:	jan.kokemueller@gmail.com

Comment 9 Ed Maste freebsd_committer 2019-05-21 19:26:22 UTC
Triage at 2019 Waterloo Hackathon, CC markj@ and kib@ for investigation
Comment 10 Jason A. Harmening freebsd_committer 2020-01-16 16:58:17 UTC
Modified version of patch for review at https://reviews.freebsd.org/D23142
Comment 11 Jason A. Harmening freebsd_committer 2020-01-16 16:59:28 UTC
*** Bug 239803 has been marked as a duplicate of this bug. ***
Comment 12 commit-hook freebsd_committer 2020-01-25 08:57:59 UTC
A commit references this bug:

Author: jah
Date: Sat Jan 25 08:57:27 UTC 2020
New revision: 357110
URL: https://svnweb.freebsd.org/changeset/base/357110

  Implement cycle-detecting garbage collector for AF_UNIX sockets

  The existing AF_UNIX socket garbage collector destroys any socket
  which may potentially be in a cycle, as indicated by its file reference
  count being equal to its enqueue count. However, this can produce false
  positives for in-flight sockets which aren't part of a cycle but are
  part of one or more SCM_RIGHTS mssages and which have been closed
  on the sending side. If the garbage collector happens to run at
  exactly the wrong time, destruction of these sockets will render them
  unusable on the receiving side, such that no previously-written data
  may be read.

  This change rewrites the garbage collector to precisely detect cycles:

  1. The existing check of msgcount==f_count is still used to determine
     whether the socket is potentially in a cycle.
  2. The socket is now placed on a local "dead list", which is used to
     reduce iteration time (and therefore contention on the global
  3. The first pass through the dead list removes each potentially-dead
     socket's outgoing references from the graph of potentially-dead
     sockets, using a gc-specific copy of the original reference count.
  4. The second series of passes through the dead list removes from the
     list any socket whose remaining gc refcount is non-zero, as this
     indicates the socket is actually accessible outside of any possible
     cycle.  Iteration is repeated until no further sockets are removed
     from the dead list.
  5. Sockets remaining in the dead list are destroyed as before.

  PR:		227285
  Submitted by:	jan.kokemueller@gmail.com (prior version)
  Reviewed by:	markj
  Differential Revision:	https://reviews.freebsd.org/D23142