Bug 203162 - when close(fd) on a fifo fails with EINTR, the file descriptor is not really closed
Summary: when close(fd) on a fifo fails with EINTR, the file descriptor is not really ...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 10.1-RELEASE
Hardware: Any Any
: --- Affects Only Me
Assignee: Konstantin Belousov
URL:
Keywords: patch, patch-ready
Depends on:
Blocks:
 
Reported: 2015-09-16 22:34 UTC by Victor Stinner
Modified: 2015-12-15 13:24 UTC (History)
4 users (show)

See Also:
koobs: mfc-stable9?
koobs: mfc-stable10?


Attachments
Program to reproduce the bug (1.95 KB, application/gzip)
2015-09-16 22:34 UTC, Victor Stinner
no flags Details
Syscalls seen by "truss -f" when the bug occurs (3.42 KB, text/plain)
2015-09-16 22:54 UTC, Victor Stinner
no flags Details
A bit more fairness into the signal interruption of fifo open (695 bytes, patch)
2015-09-17 01:10 UTC, Konstantin Belousov
no flags Details | Diff
Ignore signal after a wakeup, if there was a complimentary open, even transient (1.90 KB, patch)
2015-09-17 15:34 UTC, Konstantin Belousov
no flags Details | Diff
Ignore signal after a wakeup, if there was a complimentary open, even transient v2 (1.91 KB, patch)
2015-09-18 23:17 UTC, Konstantin Belousov
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Victor Stinner 2015-09-16 22:34:14 UTC
Created attachment 161126 [details]
Program to reproduce the bug

tl;dr The close() syscall doesn't close correctly a FIFO file descriptor when close() is interrupted by a signal.

Hi,

I'm working on the Python project. Python 3.5 now retries syscalls when a syscall fails with EINTR. This change is described in https://www.python.org/dev/peps/pep-0475/

The associated unit test "test_eintr" hangs sometimes on our FreeBSD buildbots (FreeBSD 9 and 10). It took me some days to identify that test_open() of test_eintr hangs sometimes. It looks like the test hangs when the close() syscall fails with EINTR.

By the way, the BSD C library ignores EINTR in this case, so the caller of the close() function is not aware that the syscall failed with EINTR. In Python, it was also decided to ignore EINTR on close() and dup2() because the file descriptor is closed anyway. It's explained in the PEP 475 (see the link below).

The test ensures that Python retries correctly open() when the function fails with EINTR. The test uses two processes, I will call them the parent process and the child process. To get a EINTR on open(), the test uses a FIFO created by mkfifo(). The parent calls mkfifo() and immedialty tries to open the FIFO for writing: open() blocks until the child opens the FIFO for reading. Both processes uses setitimer() to inject SIGARLM signals every 10 ms. The child process sleeps 100 ms, opens the FIFO for reading and then close it.

Attached tarball contains a C program based on the Python unit test.

To reproduce the bug, run ./test.sh multiple times in different terminals, you have to pass a different number to each run (to name the truss log file): the program should hang after between 1 and 5 minutes.

You may have to stop/restart the script: truss creates a ghost process for the child process which becomes <defunct>, so quickly we will reach the number of processes limit.

I noticed two cases: the test hangs (no more output) or the test slowly fills the terminal with "@". The "@" character is written each time that open() fails with EINTR in the child process (only in the child process, this case doesn't produce output in the parent process).

I'm quite sure that truss has bugs and fails to log correctly syscalls in the parent and the child process. To workaround truss bugs, I wrote my program to ensure that open(path, O_WRONLY) returns the fd 3 in the parent process and open(path, O_RDONLY) returns the fd 4 in the child process. So depending on the fd number, you can check if it's the parent or the child process. It helps to workaround truss bugs.

When the close() syscall fails with EINTR: fstat(fd) fails with EBAD, so the file descriptor seems to be really closed.

Note: I reproduced the bug in a VM running FreeBSD 10.1-RELEASE-p6 with a single core (1 virtual CPU in fact).

Note: I'm following evolutions of the FreeBSD kernel from the Python test suite. I noticed that FreeBSD made *huge* progresses on handling threads and signals. Congrats :-)
Comment 1 Victor Stinner 2015-09-16 22:34:39 UTC
FYI related Python bug report: https://bugs.python.org/issue25122
Comment 2 Victor Stinner 2015-09-16 22:54:32 UTC
Created attachment 161127 [details]
Syscalls seen by "truss -f" when the bug occurs

Syscalls seen by "truss -f" when the bug occurs. Attached syscalls.txt is this command as a file, just if my comment is badly formatted in Bugzilla.

parent = 57517
child = 61173

57517: open("(null)",O_WRONLY,037777777777)	 ERR#4 'Interrupted system call'
57517: SIGNAL 14 (SIGALRM)
57517: sigreturn(0x7fffffffe080,0x10006,0x7fffffffe080,0x0,0xffff80000060386e,0x0) ERR#4 'Interrupted system call'

=> parent is blocked on open(fifo, O_WRONLY) because the child didn't open it
   yet, but the open() is interrupted by signals every 10 ms.

    61173: nanosleep({0.009149571 })		 = 0 (0x0)
    61173: SIGNAL 14 (SIGALRM)
    61173: sigreturn(0x7fffffffe040,0x10006,0x7fffffffe040,0x1ff3019bb4b18,0xffffffff8093b550,0xfffff8001fcf9980) ERR#4 'Interrupted system call'

    => child is sleeping, nanosleep() is interrupted every 10 ms by a signal

57517: open("(null)",O_WRONLY,037777777777)	 = 0 (0x0)

=> strange, truss says that the parent succeeded first to open a FIFO to
   writing, whereas the child didn't open it yet. It looks more like a bug or
   race condition in truss. I prefer to think that the parent syscall succeeded
   after the open() in the child process.

=> moreover, no, the parent failed to open the FIFO: see below, it still tries
   to open it!?

=> And what is this strange file descriptor 0? It is supposed to be already
   be used by stdin...

    61173: open("fifo_57517_465",O_RDONLY,042716203) = 4 (0x4)

    => child process opened the FIFO for reading, it should unblock open() in
       the parent.

    61173: close(4)					 ERR#4 'Interrupted system call'

    => oh oh, closing the FIFO in the child process failed with EINTR...
       Bad things happens since this point.

57517: SIGNAL 14 (SIGALRM)
57517: sigreturn(0x7fffffffe080,0x10006,0x7fffffffe080,0x0,0xffff80000060386e,0x0) = 0 (0x0)

=> open() succeeded or was interrupted by EINTR?

    61173: fstat(4,0x7fffffffe660)			 ERR#9 'Bad file descriptor'

    => fstat() confirms that the FIFO file descriptor is really closed in the
       child.

    61173: setitimer(0,{0.000000, 0.000000 },0x0)	 = 0 (0x0)

    61173: close(3)					 = 0 (0x0)

    => fd 3 is only used to workaround truss bugs, it's a copy of the stdout
       (fd 1).

    61173: sigprocmask(SIG_BLOCK,SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2,0x0) = 0 (0x0)
    61173: sigprocmask(SIG_SETMASK,0x0,0x0)		 = 0 (0x0)
    61173: sigprocmask(SIG_BLOCK,SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2,0x0) = 0 (0x0)
    61173: sigprocmask(SIG_SETMASK,0x0,0x0)		 = 0 (0x0)
    61173: sigprocmask(SIG_BLOCK,SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2,0x0) = 0 (0x0)
    61173: sigprocmask(SIG_SETMASK,0x0,0x0)		 = 0 (0x0)
    61173: process exit, rval = 0
     ERR#4 'Interrupted system call'

57517: open("(null)",O_WRONLY,037777777777)	 ERR#4 'Interrupted system call'
57517: SIGNAL 14 (SIGALRM)
57517: sigreturn(0x7fffffffe080,0x10006,0x7fffffffe080,0x0,0xffff80000060386e,0x0) ERR#4 'Interrupted system call'

=> open() still fails in the parent, whereas the child process succeeded to
   open it for reading *and* closed it. The parent open() is supposed to be
   unblocked since the child open() succeeded, no?

(... open sequence is repeated forever, it never succeed ...)
Comment 3 Konstantin Belousov freebsd_committer 2015-09-16 23:36:28 UTC
(In reply to Victor Stinner from comment #1)
I see the 'hang', where the parent is sleeping on open(O_WRONLY), and there is no child.  The situation is not too mysterious, see the explanation below.

Test sets up two processes, parent does open(fifo, O_WRONLY), child does open(fifo, O_RDONLY).  Either process can proceed further if other process enters open(), or a signal is delivered interrupting the sleep waiting for the residual open.  Assume that the parent sleeps in open waiting for child/reader, child enters open and since there is writer waiting, the open in child succeeded.  Child wakes up the parent to notify that open(O_WRONLY) can succeed as well (there is reader after all).  And, the crusial point, assume that the parent gets signal notification before it gets wakeup notice.

In this setup, open() in parent returns with EINTR.  On the other hand, child successfully opened fifo for read, and after some more work, it exits.  But the parent retries the open(O_WRONLY).  There is no child to open fifo, so open() cannot succeed, it only gets interrupted by signal periodically.  And, parent would not start another child until this open succeed.

Hopefully this can be followed.

FWIW, do not use truss, use kdump -i.
Comment 4 Konstantin Belousov freebsd_committer 2015-09-17 01:10:59 UTC
Created attachment 161136 [details]
A bit more fairness into the signal interruption of fifo open

The patch would prefer to return success over the EINTR when signal is delivered while other end of fifo becomes ready simultaneously.

Note that the patch still does not fix the fundamental underground issue of the troubles, which is in the test structure itself.  If child reader manages to close fifo before the parent writer sleeping in open has a chance to run, parent would deadlock, since nothing else would open the fifo for read.
Comment 5 Victor Stinner 2015-09-17 09:33:08 UTC
"The patch would prefer to return success over the EINTR when signal is delivered while other end of fifo becomes ready simultaneously."

Do you mean that the signal will be delivered later to the application? Ignoring completly a signal doesn't sound like a good idea :-)

"Note that the patch still does not fix the fundamental underground issue of the troubles, which is in the test structure itself."

I'm not sure that I understand: do you mean that the test has a race condition?

FYI the Python test suite is executed on a wide range of architectures and platforms: http://buildbot.python.org/all/waterfall?category=3.x.stable&category=3.x.unstable

The test_eintr test is executed and pass on Linux, Mac OS X, Solaris (and OpenIndiana), OpenBSD, etc. For FreeBSD, test_eintr hangs 50% of the time (raw estimation).

For OpenBSD, I'm not sure, the buildbot is currently disabled. For Mac OS X, I'm surprised, I thaught that Mac OS X stole a large part of FreeBSD code :-)

Well... I had a lot of trouble to write a reliable C program to reproduce the issue, so the test may hangs on any platform, but I didn't notice hang (I don't check _each_ build of each platforms, it would take too much time).

But at least, I can say that the bug fails frequently on FreeBSD, whereas I never saw it failing on Linux.

"If child reader manages to close fifo before the parent writer sleeping in open has a chance to run, parent would deadlock, since nothing else would open the fifo for read."

On Linux, the open(O_RDONLY) in the child blocks until the open(O_WRONLY) succeed. The Linux kernel looks to ensure that both open succeed "at the time same".


Short test in Python.

Parent:

$ python3
Python 3.4.2 (default, Jul  9 2015, 17:24:30) 
[GCC 5.1.1 20150618 (Red Hat 5.1.1-4)] on linux
>>> import os
>>> os.mkfifo("fifo")
>>> f=open("fifo", "wb")
^Z
[1]+  Stoppé                 python3 -q

Child:

haypo@smithers$ python3 
Python 3.4.2 (default, Jul  9 2015, 17:24:30) 
[GCC 5.1.1 20150618 (Red Hat 5.1.1-4)] on linux
>>> f=open("fifo", "rb")

I used CTRL+z to stop the parent process: the child process hangs until I restart the parent process (using "fg").
Comment 6 Konstantin Belousov freebsd_committer 2015-09-17 13:43:22 UTC
(In reply to Victor Stinner from comment #5)
FreeBSD also ensures that both open(O_RDONLY) and open(O_WRONLY) succeed simultaneously, but there is a race.

Assume that both threads doing the opens noticed each other presence, but right after that the thread doing open(O_RDONLY) continued execution, while the thread doing(O_WRONLY) was not put on the CPU immediately.  It was a scheduler decision, for whatever reasons.  Then, open(O_RDONLY) succeeds, the thread continues the execution and then it could even run as far as to execute close() on the fd returned by open.  After that, when open(O_WRONLY) thread eventually gets CPU, it notices that there is no readers (there was one, but it is gone now).  So it returns to the sleep waiting for a reader.
Comment 7 Konstantin Belousov freebsd_committer 2015-09-17 15:34:38 UTC
Created attachment 161150 [details]
Ignore signal after a wakeup, if there was a complimentary open, even transient

The new version of the patch is even more aggressive with preferring a normal return when there is both signal delivered and parallel open.  Now, I run several thousand iterations of your test without hang, while before it would reliably hang on first or second dozens.

Some clarification about signals.  Patch does not cause the signal to be lost.  Before the patch, you get EINTR error when signal was generated _and_ parallel complimentary type of open happen. Now, the syscall returns with success and file descriptor, but before the usermode continues, a signal frame is pushed on stack and signal handler run.  In other words, things are changed so that it _looks_ like the signal was delivered slightly after the wakeup from open.
Comment 8 Jilles Tjoelker freebsd_committer 2015-09-18 22:40:00 UTC
The patch "Ignore signal after a wakeup, if there was a complimentary open, even transient" looks good to me, except that fi_rgen and fi_wgen should have an unsigned type to avoid undefined behaviour due to overflow.
Comment 9 Konstantin Belousov freebsd_committer 2015-09-18 23:17:43 UTC
Created attachment 161179 [details]
Ignore signal after a wakeup, if there was a complimentary open, even transient v2

Thank Jilles, patch is updated to use unsigned type.  Also, I think there is no reason to use longs, int generation counter should be enough.

Probably, readers and writers counters should be also reduced to int, but this is for different change.
Comment 10 Baptiste Daroussin freebsd_committer 2015-09-19 10:52:50 UTC
After applying the patch I can confirm that now the issue reported is not appearing anymore.

On a box where the test did hang after few seconds I have been running the test 8h without any issue
Comment 11 commit-hook freebsd_committer 2015-09-20 21:19:30 UTC
A commit references this bug:

Author: kib
Date: Sun Sep 20 21:18:34 UTC 2015
New revision: 288044
URL: https://svnweb.freebsd.org/changeset/base/288044

Log:
  Ensure that when a blockable open of fifo returns success, a valid
  file descriptor opened for complimentary access exists as well.

  The implementation of the guarantee is done by counting the
  generations of readers and writers opens.  We return success and not
  EINTR or ERESTART error, when the sleep for complimentary opening is
  interrupted, but the generation was changed during the sleep.

  Longer explanation: assume there are two threads, A doing open("fifo",
  O_RDONLY) and B doing open("fifo", O_WRONLY), and no other threads
  either trying to open the fifo, nor there are any file descriptors
  referencing the fifo.  Before the change, it was possible e.g. for for
  thread A to return a valid file descriptor, while thread B returned
  EINTR if a signal to B was delivered simultaneously with the wakeup
  from A.  After the change, in this situation both A::open() and
  B::open() succeed and the signal is made "as if" it was noticed
  slightly later.  Note that the signal actual delivery is not changed,
  it is done by ast on syscall return path, so signal handler is still
  executed before first instruction after syscall.

  See PR for the code demonstrating the issue.

  PR:	203162
  Reported by:	Victor Stinner victor.stinner@gmail.com
  Reviewed by:	jilles
  Tested by:	bapt, pho
  Sponsored by:	The FreeBSD Foundation
  MFC after:	1 week

Changes:
  head/sys/fs/fifofs/fifo_vnops.c
Comment 12 commit-hook freebsd_committer 2015-09-21 08:45:19 UTC
A commit references this bug:

Author: pho
Date: Mon Sep 21 08:44:24 UTC 2015
New revision: 288058
URL: https://svnweb.freebsd.org/changeset/base/288058

Log:
  Added a regression test.

  PR:		203162
  In collaboration with:	 kib
  Sponsored by:	EMC / Isilon storage division

Changes:
  user/pho/stress2/misc/mkfifo3.sh
Comment 13 Kubilay Kocak freebsd_committer 2015-10-04 08:26:08 UTC
Please set mfc-* flags to "+" once MFC'd to stable/10 and stable/9. If not applicable to a particular branch, set to "-" with comment
Comment 14 Kubilay Kocak freebsd_committer 2015-12-15 10:40:17 UTC
Assign to committer that resolved.

Pending MFC