Bug 46036 - [RFE] select is unsuitable for implementing a periodic timeout + fix
Summary: [RFE] select is unsuitable for implementing a periodic timeout + fix
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 5.0-CURRENT
Hardware: Any Any
: Normal Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2002-12-06 20:20 UTC by Enache Adrian
Modified: 2018-05-20 23:50 UTC (History)
0 users

See Also:


Attachments
file.diff (3.39 KB, patch)
2002-12-06 20:20 UTC, Enache Adrian
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Enache Adrian 2002-12-06 20:20:01 UTC
	select(),nanosleep(),poll() sleep one tick longer, _except_
	when previous similar system call was interrupted by a signal
	or by i/o conditions on the polled file descriptors.

	I find this rather buggy and inconsequent.
	It badly breaks any code which tries to hook both timers
	and input polling on a select() loop and defeats benchmark
	programs which test for the accuracy of timeouts.

	This is due to tvtohz (sys/kern/kern_clock.c) which always
	adds one to its result to allow for the current tick to
	expire. Instead of adding one it should be better testing
	for a non-zero result.

	I modified like this my 4.4 kernel long time ago and have
	seen so far no trouble at all.Same thing more recently with 5.0.

Fix: patch to kern_clock.c and other places which assume that 
	tvtohz adds one follows.

	notes:
	tvtohz has been renamed to hzto in NetBSD but is the same.

	of course one can write
	ticks = (sec * hz) + (((unsigned long)usec + (tick - 1)) / tick) ?: 1;
	but this is a GCC extension.

-------------------------------8x-----------------------------------
How-To-Repeat: 	This little program prints the timeout accuracy of the 
	select() system call as "[+-] sec usec".It catches SIGQUIT
	and polls for console input.
	It use to give:

   unpatched (regular) kernel:
	+   0 s       9938 us

		--input
	+   0 s       9959 us
	+   0 s       9972 us
	+   0 s       9930 us
	^\      --interrupt
	+   0 s       9954 us
	+   0 s       9954 us

   patched kernel:
	-   0 s         44 us
	^\      --interrupt
	-   0 s          3 us
	-   0 s        100 us

		--input
	-   0 s         44 us
	-   0 s         47 us

-------------------------------8x-----------------------------------
#include <sys/types.h>
#include <sys/time.h>
#include <sys/sysctl.h>
#include <unistd.h>
#include <errno.h>
#include <signal.h>
#include <time.h>
#include <err.h>
#include <stdio.h>

fd_set rdfs;
int ms;

void sig(int unused) {}

int main(int argc,char **argv)
{
	int k,mib[2];
	struct timeval tv,savtv,otv;
	struct clockinfo clock;
	char buf[256];
	signal(SIGQUIT,sig);
	if (argc > 1) ms = atoi(argv[1]);
	if (!ms) ms = 500;
	savtv.tv_usec = ms % 1000 * 1000; savtv.tv_sec = ms / 1000;
	mib[0] = CTL_KERN; mib[1] = KERN_CLOCKRATE;
	k = sizeof clock;
	if (sysctl(mib,2,&clock,&k,NULL,0)) err(1,"sysctl");
	savtv.tv_usec -= savtv.tv_usec % clock.tick;
redo:
	tv = savtv; gettimeofday(&otv,NULL);
	/* a quick fix is to substract one tick from tv here
	   but this will break things when compiled on other
	   OS ( linux for instance ) */
	for(;;) {
		FD_SET(0,&rdfs);
		k = select(1,&rdfs,NULL,NULL,&tv);
		gettimeofday(&tv,NULL);
		timersub(&tv,&otv,&tv);
		switch(k) {
		case 0:
			if (timercmp(&tv,&savtv,<))
				{ timersub(&savtv,&tv,&tv);k = '-'; }
			else
				{ timersub(&tv,&savtv,&tv);k = '+'; }
			printf("%c%4d s %10d us\n",k,tv.tv_sec,tv.tv_usec);
			goto redo;
		case -1:
			if (errno != EINTR) err(1,"select");
			printf("\t--interrupt\n");
			break;
		default:
			read(0,buf,256);
			printf("\t--input\n");
			break;
		}
		timersub(&savtv,&tv,&tv);
		if (tv.tv_sec < 0) timerclear(&tv);
	}
}
-------------------------------8x-----------------------------------
Comment 1 Bruce Evans 2002-12-07 12:57:40 UTC
On Fri, 6 Dec 2002, Enache Adrian wrote:

> >Description:
> 	select(),nanosleep(),poll() sleep one tick longer, _except_
> 	when previous similar system call was interrupted by a signal
> 	or by i/o conditions on the polled file descriptors.
>
> 	I find this rather buggy and inconsequent.
> 	It badly breaks any code which tries to hook both timers
> 	and input polling on a select() loop and defeats benchmark
> 	programs which test for the accuracy of timeouts.
>
> 	This is due to tvtohz (sys/kern/kern_clock.c) which always
> 	adds one to its result to allow for the current tick to
> 	expire. Instead of adding one it should be better testing
> 	for a non-zero result.

tvtohz() actually adds 1 to ensure that the sleep time is at least the
specified time.  This gives a minimum sleep time of 0.0 ticks longer
than the specified sleep time (as required by standards) and a maximum
sleep time of about 1.0 tick longer than the specified sleep time,
with an average sleep time of about 0.5 ticks longer.  Not adding 1
would give a {min,max,avg} sleep time of about {-1.0,0.0,-0.5} ticks
longer than specified; i.e., it would be shorter than specified (broken)
in most cases.

Individual syscalls could handle this by sleeping for the shorter time
and checking whether the timeout has actually expired when they wake up,
but most don't, and this would be a pessimization in most cases.  Of
the above syscalls, only nanosleep() checks that the timeout has actually
expired.  It does this mainly for the technical reason that very large
timeouts (ones greater than INT_MAX ticks) are required to work and may
even be useful (especially if HZ is large).  Checking is a pessimization
in at least the following cases, because we will just have to wait again
after wasting time waking up to check:
- most short timeouts that are a multiple of `tick'.
- half or the cases for random timeouts of 0.5 ticks (since the current
  tick will expire before the timeout half the time), and proportionally
  for all random timeouts of 1 tick or shorter.

> >How-To-Repeat:
> 	This little program prints the timeout accuracy of the
> 	select() system call as "[+-] sec usec".It catches SIGQUIT
> 	and polls for console input.
> 	It use to give:
>
>    unpatched (regular) kernel:
> 	+   0 s       9938 us
>
> 		--input
> 	+   0 s       9959 us
> 	+   0 s       9972 us
> 	+   0 s       9930 us
> 	^\      --interrupt
> 	+   0 s       9954 us
> 	+   0 s       9954 us

This is the correct behaviour.  We are mostly in sync with the clock
interrupt, and the default timeout is a multiple of clock.tick, so we
must sleep for an extra tick (less epsilon) for the timeout to expire.
select() would have to go back to sleep for an extra tick if tvtohz()
didn't add 1.

>    patched kernel:
> 	-   0 s         44 us
> 	^\      --interrupt
> 	-   0 s          3 us
> 	-   0 s        100 us
>
> 		--input
> 	-   0 s         44 us
> 	-   0 s         47 us

This is incorrect.  The times are the values
((actual timeout) - (specified timeout)).  These should be >= 0.
The errors are small here because the timeout is a multiple of clock.tick
and we are mostly in sync with clock interrupts.  In general the error
from not adding 1 is up to 1 tick (see above).

> -------------------------------8x-----------------------------------
> >Fix:
> 	patch to kern_clock.c and other places which assume that
> 	tvtohz adds one follows.

This mainly undoes some FreeBSD fixes.

> 	notes:
> 	tvtohz has been renamed to hzto in NetBSD but is the same.

Actually, FreeBSD renamed hzto to tvtohz.

> redo:
> 	tv = savtv; gettimeofday(&otv,NULL);
> 	/* a quick fix is to substract one tick from tv here
> 	   but this will break things when compiled on other
> 	   OS ( linux for instance ) */
> 	for(;;) {
> 		FD_SET(0,&rdfs);
> 		k = select(1,&rdfs,NULL,NULL,&tv);
> 		gettimeofday(&tv,NULL);

I thought that Linux gets timeouts right -- at least sleep() used to
work except in very early versions of Linux-0.x.  Perhaps it uses
clock interrupts to get to get close to the timeout and then
udelay() to get closer if the residual timeout is small.  In general,
the residual could be up to `tick' usec, but in cases that are in sync
with the clock like your test program, the residual would be small
(about 44 usec) and udelay() could reasonably handle it.

Bruce
Comment 2 Bruce Evans 2002-12-08 08:06:45 UTC
On Sun, 8 Dec 2002, Enache Adrian wrote:

> On Sat, Dec 07, 2002 at 11:57:40PM +1100, Bruce Evans wrote:
> > tvtohz() actually adds 1 to ensure that the sleep time is at least the
> > specified time.  This gives a minimum sleep time of 0.0 ticks longer
> > than the specified sleep time (as required by standards) and a maximum
> > sleep time of about 1.0 tick longer than the specified sleep time,
> > with an average sleep time of about 0.5 ticks longer.  Not adding 1
> > would give a {min,max,avg} sleep time of about {-1.0,0.0,-0.5} ticks
> > longer than specified; i.e., it would be shorter than specified (broken)
> > in most cases.
> ...
> > and we are mostly in sync with clock interrupts.  In general the error
> > from not adding 1 is up to 1 tick (see above).
> The average sleep time seems to be with ~ 0.995 longer on an idle

I think you mean 0.009950.

> machine - much longer on a loaded one where ticks are lost.

Hopefully ticks won't be really lost (in the kernel), although they can
be and are lost due to bugs.  Ticks man be lost as far as applications
can see of course.

> For the error to be down to -1 tick with an average of -0.5 ticks
> (when not adding 1) the tick value must be very small - this doesn't
> seem to happen with the standard 100hz. Under high load the error
> tends to be positive.

It does (would) happen for random timeouts.  If you look at the clock
in a loop then you normally see timeouts of ((1 tick) - epsilon)
(when not adding 1) or ((2 ticks) - epsilon) (when adding 1) because
the first one syncs with the clock.  E.g., nanotime({1 nsec}) normally
returns at a time epsilon after a clock tick occurs, so a subsequent
nanotime({1 nsec}) will wait a full clock tick or two (less epsilon)
(if nanotome() is implemented using clock ticks even for short intervals).

> When the user has asked for something not multiple of tick the
> value will be correctly rounded up by tvtohz() and the error will
> be positive too.

No.  E.g., if the interval is 9999 usec then it will be rounded up
to 1 tick (10000 usec).  The next clock tick may occur after 1 usec,
so if 1 tick is not added then the timeout will return after 1 usec
(plus epsilon) instead of after the requested 9999 usec.

> > I thought that Linux gets timeouts right -- at least sleep() used to
> > work except in very early versions of Linux-0.x.  Perhaps it uses
> ...
> Only nanosleep() works like in FreeBSD ( one tick longer by just
> adding one - this is why many recommend to use select() instead
> of usleep() ...)

This is a bug in Linux :-).

> Does it exist a standard specification for this case ?
> A quick browsing through susv3 shows that only nanosleep()
> is required to do so:
> functions/select.html:
> 	"If the timeout parameter is not a  null pointer,
> 	 it specifies a maximum interval to wait for the
>           selection to complete."
> functions/nanosleep.html
> 	"But, except for the case of being interrupted by
> 	a signal, the suspension time shall not be less than
> 	the time specified by rqtp, as measured by the
> 	system clock CLOCK_REALTIME"

select() and its timeouts are now specified in POSIX, not quite
as clearly as this.  From draft7 of POSIX.1-200x:

%%%
31187              limitations on the granularity of timeout intervals. If the requested timeout interval requires a
31188              finer granularity than the implementation supports, the actual timeout interval shall be rounded
31189              up to the next supported value.
%%%

Bruce
Comment 3 Enache Adrian 2002-12-08 20:33:41 UTC
On Sat, Dec 07, 2002 at 11:57:40PM +1100, Bruce Evans wrote:
> tvtohz() actually adds 1 to ensure that the sleep time is at least the
> specified time.  This gives a minimum sleep time of 0.0 ticks longer
> than the specified sleep time (as required by standards) and a maximum
> sleep time of about 1.0 tick longer than the specified sleep time,
> with an average sleep time of about 0.5 ticks longer.  Not adding 1
> would give a {min,max,avg} sleep time of about {-1.0,0.0,-0.5} ticks
> longer than specified; i.e., it would be shorter than specified (broken)
> in most cases.
...
> and we are mostly in sync with clock interrupts.  In general the error
> from not adding 1 is up to 1 tick (see above).
The average sleep time seems to be with ~ 0.995 longer on an idle
machine - much longer on a loaded one where ticks are lost.

For the error to be down to -1 tick with an average of -0.5 ticks
(when not adding 1) the tick value must be very small - this doesn't
seem to happen with the standard 100hz. Under high load the error
tends to be positive.

When the user has asked for something not multiple of tick the
value will be correctly rounded up by tvtohz() and the error will
be positive too.

> I thought that Linux gets timeouts right -- at least sleep() used to
> work except in very early versions of Linux-0.x.  Perhaps it uses
> clock interrupts to get to get close to the timeout and then
> udelay() to get closer if the residual timeout is small.  In general,
> the residual could be up to `tick' usec, but in cases that are in sync
> with the clock like your test program, the residual would be small
> (about 44 usec) and udelay() could reasonably handle it.
Oh no.
Linux doesn't seem to call udelay() to handle the residual timeout.

Only nanosleep() works like in FreeBSD ( one tick longer by just
adding one - this is why many recommend to use select() instead
of usleep() ...)

from kernel/timer.c
	expire = timespec_to_jiffies(&t) + (t.tv_sec || t.tv_nsec);

The silly test program, when run - with some minor modifications
-----------------------
+#include <sys/param.h>
-       struct clockinfo clock;
-       mib[0] = CTL_KERN; mib[1] = KERN_CLOCKRATE;
-       k = sizeof clock;
-       if (sysctl(mib,2,&clock,&k,NULL,0)) err(1,"sysctl");
-       savtv.tv_usec -= savtv.tv_usec % clock.tick;
+       savtv.tv_usec -= savtv.tv_usec % (1000000 / HZ);
_______________________
on linux 2.4.18 says:
-   0 s        171 us
-   0 s        181 us
-   0 s        184 us

        --input
-   0 s        169 us
-   0 s        180 us

According to you it is broken to sleep less than specified.

Does it exist a standard specification for this case ?
A quick browsing through susv3 shows that only nanosleep()
is required to do so:
functions/select.html:
	"If the timeout parameter is not a  null pointer,
	 it specifies a maximum interval to wait for the
          selection to complete."
functions/nanosleep.html
	"But, except for the case of being interrupted by
	a signal, the suspension time shall not be less than
	the time specified by rqtp, as measured by the
	system clock CLOCK_REALTIME"

but I'll keep reading on ...

thank you & best regards
Adi
Comment 4 K. Macy freebsd_committer freebsd_triage 2007-11-16 07:03:16 UTC
State Changed
From-To: open->feedback


Please comment on whether or not this is in fact a bug and then throw back in to the pool. 
Thanks. 


Comment 5 K. Macy freebsd_committer freebsd_triage 2007-11-16 07:03:16 UTC
Responsible Changed
From-To: freebsd-bugs->bde


Please comment on whether or not this is in fact a bug and then throw back in to the pool.  
Thanks.
Comment 6 K. Macy freebsd_committer freebsd_triage 2007-11-16 20:31:16 UTC
State Changed
From-To: feedback->suspended


This constitutes a RFE and no one is available to work on it right now. 


Comment 7 K. Macy freebsd_committer freebsd_triage 2007-11-16 20:31:16 UTC
Responsible Changed
From-To: bde->freebsd-bugs


This constitutes a RFE and no one is available to work on it right now.
Comment 8 Eitan Adler freebsd_committer freebsd_triage 2018-05-20 23:50:58 UTC
For bugs matching the following conditions:
- Status == In Progress
- Assignee == "bugs@FreeBSD.org"
- Last Modified Year <= 2017

Do
- Set Status to "Open"