Bug 79339

Summary: [kernel] [patch] Kernel time code sync with improvements from DragonFly
Product: Base System
Reporter: Joshua Coombs <jcoombs>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: Open
Severity: Affects Only Me
Keywords: patch
Priority: Normal
Version: Unspecified
Hardware: Any
OS: Any

Description Joshua Coombs 2005-03-29 19:30:06 UTC
Based on: http://www.dragonflybsd.org/docs/nanosleep/

The improvements to kernel timekeeping appear to apply directly to FreeBSD 4.x based on my own quick testing.

Fix: 

/usr/src/sys/kern/kern_clock.c
325c325
<                       / tick + 1;
---
>                       / tick;
328c328
<                       + ((unsigned long)usec + (tick - 1)) / tick + 1;
---
>                       + ((unsigned long)usec + (tick - 1)) / tick;



/usr/src/sys/kern/kern_time.c
232c232
<       int error;
---
>       int error, sleepticks;
241a242
>                 sleepticks = tvtohz(&tv);
243c244
<                   tvtohz(&tv));
---
>                     (sleepticks < 1)? 1 : sleepticks);
252c253,254
<                               *rmt = ts;
---
>                                 rmt->tv_sec = ts.tv_sec;
>                                 rmt->tv_nsec = ts.tv_nsec;
258c260,261
<               ts3 = ts;
---
>                 ts3.tv_sec = ts.tv_sec;
>                 ts3.tv_nsec = ts.tv_nsec;
260a264,265
>                 if (tv.tv_sec == 0 && tv.tv_usec < tick)
>                         return (0);



/usr/src/sys/i386/isa/clock.c
113c113,114
< #define       TIMER_DIV(x) ((timer_freq + (x) / 2) / (x))
---
> #define TIMER_DIV(x) (timer_freq / (x))
> #define FRAC_ADJUST(x) (timer_freq - ((timer_freq / (x)) * (x)))
141a143
> u_int   timer0_frac_freq;
204a207,209
>         int phase;
>         int delta;
>
215a221,236
>
>         phase = 1000000 / timer0_frac_freq;
>         delta = timecounter->tc_microtime.tv_usec % phase;
> #if 1
>       disable_intr();
>         if (delta < (phase >> 1)) {
>                 outb(TIMER_CNTR0, timer0_max_count & 0xff);
>                 outb(TIMER_CNTR0, timer0_max_count >> 8);
>         } else {
>                 outb(TIMER_CNTR0, (timer0_max_count +1) & 0xff);
>                 outb(TIMER_CNTR0, (timer0_max_count +1) >> 8);
>                 ++i8254_offset;
>         }
>       enable_intr();
> #endif
>
236a258
>                 timer0_frac_freq = new_rate;
247,248c269,270
<               if ((timer0_prescaler_count += timer0_max_count)
<                   >= hardclock_max_count) {
---
>                 timer0_prescaler_count += timer0_max_count;
>                 if (timer0_prescaler_count >= hardclock_max_count) {
689a712
>         timer0_frac_freq = intr_freq;
1221c1244
<       count = timer0_max_count - ((high << 8) | low);
---
>         count = timer0_max_count + 1 - ((high << 8) | low);



Note: the diffs above contain only the code.  Proper credit must be given to Paul Herman and Matt Dillon.  In addition, the DFly patches listed at the source URL contain comments indicating what the code is doing.

Sample post patch test data:
0.000020
0.000226
0.000336
0.000284
0.000234
0.000187
0.000132
0.000082
0.000035
0.000242
0.000348
0.000295
0.000246
0.000192
0.000137
0.000090
0.000043
0.000252
0.000361
0.000307

A sawtooth is still present, but the accuracy is MUCH better.  I suspect my hack application of the PLL function isn't correct or my P133 is slow enough that I'm observing some other latencies.  I have observed occasional negative offsets, which according to the article are strictly forbidden by RFCs, so please check my work.  I believe they were the result of my playing with a hz value too high for the machine to reasonably handle, and are not occurring with saner values for hz.
How-To-Repeat:

/*
 * Copyright (c) 2003 Paul Herman
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR
 * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

/*
 * $DragonFly: site/data/docs/nanosleep/wakeup_latency.c,v 1.1 2004/01/22 21:55:58 justin Exp $
 */

#include <sys/time.h>
#include <sys/resource.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ONE_SECOND	1000000L

int count = 200;
int debug = 0;

int main (int ac, char **av) {
	long s;
	double diff;
	struct timeval tv1, tv2;

	if (ac > 1 && av[1])
		count = strtol(av[1], NULL, 10);

	while(count--) {
		gettimeofday(&tv1, NULL);
			/*
			 * Calculate the number of microseconds to sleep so we
			 * can wakeup right when the second hand hits zero.
			 *
			 * The latency for the following two statements is minimal.
			 * On a > 1.0GHz machine, the subtraction is done in a few
			 * nanoseconds, and the syscall to usleep/nanosleep is usually
			 * less than 800 ns or 0.8 us.
			 */
		s = ONE_SECOND - tv1.tv_usec;
		usleep(s);
		gettimeofday(&tv2, NULL);

		diff = (double)(tv2.tv_usec - (tv1.tv_usec + s))/1e6;
		diff += (double)(tv2.tv_sec - tv1.tv_sec);
		if (debug)
			printf("(%ld.%.6ld) ", tv2.tv_sec, tv2.tv_usec);
		printf("%.6f\n", diff);
	}
	return 0;
}

4.11 returns the following sample data:
0.016126
0.016146
0.016162
0.016181
0.016199
0.016218
0.016238
0.016259
0.016274
0.016292
0.016310
0.016342
0.016359
0.016366
Comment 1 Joshua Coombs 2005-03-30 15:33:59 UTC
Testing with wakeup_latency.c on a 5.3-Rel box shows the same symptom set. 
I've not yet tested the proposed fix on 5-x.  I will try duplicating this 
issue on 6-current as well to nail down the problem scope.
Comment 2 gemini 2005-03-31 07:44:55 UTC
Joshua Coombs wrote:
> The following reply was made to PR kern/79339; it has been noted by GNATS.
> 
> From: "Joshua Coombs" <jcoombs@gwi.net>
> To: <freebsd-gnats-submit@FreeBSD.org>,
> 	"Joshua Coombs" <jcoombs@gwi.net>
> Cc:  
> Subject: Re: kern/79339: [patch] Kernel time code sync with improvements from DragonFly
> Date: Wed, 30 Mar 2005 09:33:59 -0500
> 
>  Testing with wakeup_latency.c on a 5.3-Rel box shows the same symptom set. 
>  I've not yet tested the proposed fix on 5-x.  I will try duplicating this 
>  issue on 6-current as well to nail down the problem scope. 

Please also look at what's actually in DragonFly's CVS repository.  Your 
PR is based on the original patch, while the code in DragonFly is more 
sophisticated.  Namely, tvtohz() was split into two functions, 
tvtohz_low() and tvtohz_high(), which replace the original function 
depending on the context tvtohz() appears in.

From this I conclude that the original patch is insufficient (likely to 
break parts of the kernel), and that integrating this improvement into 
FreeBSD might not be as easy and straightforward as it appears to be at 
first glance.  On the other hand, with some effort it ought to be doable.

    Uwe
-- 
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini@geminix.org  |  http://www.escapebox.net
Comment 4 Bruce Evans 2005-03-31 11:50:50 UTC
On Thu, 31 Mar 2005, Uwe Doering wrote:

> Joshua Coombs wrote:
>>  Testing with wakeup_latency.c on a 5.3-Rel box shows the same symptom set. 
>> I've not yet tested the proposed fix on 5-x.  I will try duplicating this 
>> issue on 6-current as well to nail down the problem scope. 
>
> Please also look at what's actually in DragonFly's CVS repository.  Your PR 
> is based on the original patch, while the code in DragonFly is more 
> sophisticated.  Namely, tvtohz() was split into two functions, tvtohz_low() 
> and tvtohz_high(), which replace the original function depending on the 
> context tvtohz() appears in.
>
> From this I conclude that the original patch is insufficient (likely to break 
> parts of the kernel), and that integrating this improvement into FreeBSD 
> might not be as easy and straightforward as it appears to be at first glance. 
> On the other hand, with some effort it ought to be doable.

Indeed.

Here is a discussion of some of the bugs in the patch:

% >Fix:
% /usr/src/sys/kern/kern_clock.c
% 325c325
% <                       / tick + 1;
% ---
% >                       / tick;
% 328c328
% <                       + ((unsigned long)usec + (tick - 1)) / tick + 1;
% ---
% >                       + ((unsigned long)usec + (tick - 1)) / tick;

This breaks all callers of tvtohz() except the one that is changed in
the patch to expect this API change.  The comment before tvtohz() still
says that tvtohz() adds 1.
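
The effect of dropping the `+ 1` can be seen in a simplified user-space model of the conversion.  This is only a sketch: tv_to_ticks() and its parameters are hypothetical names, and the overflow and negative-value handling that a kernel routine would need is left out.

```c
#include <assert.h>

/*
 * Simplified model of a tvtohz()-style conversion: turn a (sec, usec)
 * interval into clock ticks, rounding up.  tick_us is the tick length in
 * microseconds (1000000 / hz).  If add_one is nonzero, one extra tick is
 * added, as the unpatched code does.  Hypothetical helper; no overflow
 * handling.
 */
long
tv_to_ticks(long sec, long usec, long tick_us, int add_one)
{
	long ticks;

	assert(sec >= 0 && usec >= 0 && tick_us > 0);
	ticks = (sec * 1000000L + usec + (tick_us - 1)) / tick_us;
	return (add_one ? ticks + 1 : ticks);
}
```

With hz = 100 (tick_us = 10000), a 1 usec request converts to 2 ticks with the extra tick and to 1 tick without it.  A caller that passes the smaller value straight to a timeout can be woken at the very next tick boundary, which may be almost immediately.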

% /usr/src/sys/kern/kern_time.c
% 232c232
% <       int error;
% ---
% >       int error, sleepticks;
% 241a242
% >                 sleepticks = tvtohz(&tv);
% 243c244
% <                   tvtohz(&tv));
% ---
% >                     (sleepticks < 1)? 1 : sleepticks);

This is more or less correct.  1 should be subtracted from tvtohz() in
callers that do a careful comparison of the times before and after
the sleep so that they can tell if the sleep time has completely
expired.

The function here (nanosleep1()) is not quite such a caller.  It does
a sloppy comparison of times, using getnanouptime() instead of
nanouptime().  getnanouptime() has a resolution of 1/ticktock_hz, where
ticktock_hz is approximately min(hz, 1000) (normally just hz), so there
is a possible error of 2/ticktock_hz in the comparison.  I think all
the errors go the same way, so the maximum error is 1/ticktock_hz.
The extra tick added by tvtohz() accidentally compensates for this
error.  Synchronization effects may reduce (or increase?) the error.
The first getnanouptime() is unsynchronized, but ones done just after
timeout returns are synced with clock interrupts, so they give a
fairly accurate time every hz/ticktock_hz hardclock interrupts.
Anyway, if 1 is subtracted from tvtohz(), then nanouptime() should
be used to avoid these errors.

There are many other callers like nanosleep1(): the ones for select(2),
poll(2) and setitimer(2).  These all depend on tvtohz() adding 1 to
ensure that they sleep for the specified interval, and they all do
sloppy comparisons like nanosleep1(), so they all need similar changes
if you want timeouts to be synchronized with 1/HZ second boundaries as
perfectly as possible.

% 252c253,254
% <                               *rmt = ts;
% ---
% >                                 rmt->tv_sec = ts.tv_sec;
% >                                 rmt->tv_nsec = ts.tv_nsec;
% 258c260,261
% <               ts3 = ts;
% ---
% >                 ts3.tv_sec = ts.tv_sec;
% >                 ts3.tv_nsec = ts.tv_nsec;

These changes just introduce style bugs.

% 260a264,265
% >                 if (tv.tv_sec == 0 && tv.tv_usec < tick)
% >                         return (0);

This can't be right.  We have just not-so-carefully checked whether
the time has expired, and only get here when it hasn't.
(tv.tv_sec == 0 && tv.tv_usec < tick) means that we would have preferred
the sleep time to be less than 1 tick.  We had to request a sleep of
exactly 1 tick because less than 1 is impossible (this is with 1
subtracted from tvtohz()).  Sleeping for exactly 1 tick is also
impossible, so we have woken up after an interval of anywhere between
0+epsilon and (1-epsilon+latency) seconds.  The interval may be
significantly smaller or larger than `tv' and we must go back to
sleep if it is smaller.  The above change breaks this.

I think the problem that this change is supposed to fix is related to
the tick frequency not being an exact multiple of 1/HZ.  Also, to avoid
sleeping longer than necessary, we should try to wake up 1 tick early
and then decide whether to sleep another tick or 2 to finish.  Note
that although tvtohz() always rounds up, physical sleep intervals are
always shorter than the specified timeout, so waking up 1 tick early
is very common for unsynchronized sleeps.  Thus if we subtract 1 from
tvtohz(), we often wake up 1 tick early as a side effect, which is what
we want, but there is a problem: suppose that everything is in
perfect sync, but the hardclock interrupt period is slightly less
than 1/HZ seconds.  Then we may wake up 5 usec or so early and decide
to go back to sleep, giving a large error.  Changes later in the patch
are related to this.  I think we shouldn't do anything special here
except possibly return early if `tv' is very small.
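
The wake-early-and-recheck behaviour described above can be sketched as a small simulation.  It assumes an idealized system: tick boundaries fall exactly on multiples of tick_us, wakeups have no latency, and the time comparison is exact.  All names are hypothetical and this is not the nanosleep1() code.

```c
#include <assert.h>

/* Time of the n-th tick boundary strictly after now_us. */
long
wake_time_us(long now_us, long nticks, long tick_us)
{
	return ((now_us / tick_us + nticks) * tick_us);
}

/*
 * Sleep in whole ticks until start_us + req_us has passed, going back to
 * sleep whenever a wakeup turns out to be early.  Returns how far past
 * the deadline the final wakeup lands.  add_one models the extra tick
 * added by the unpatched conversion.
 */
long
sleep_overshoot_us(long start_us, long req_us, long tick_us, int add_one)
{
	long deadline = start_us + req_us;
	long now = start_us;
	long ticks;

	assert(start_us >= 0 && req_us > 0 && tick_us > 0);
	while (now < deadline) {
		/* Remaining time rounded up to whole ticks (at least 1). */
		ticks = (deadline - now + tick_us - 1) / tick_us;
		if (add_one)
			ticks++;
		now = wake_time_us(now, ticks, tick_us);
	}
	return (now - deadline);
}
```

With 10000 usec ticks, a 10000 usec sleep that starts exactly on a tick boundary overshoots by 0 usec without the extra tick and by 10000 usec with it.  A sleep that starts 2500 usec into a tick wakes one tick early without the extra tick, goes around the loop once more, and ends at the same boundary as the unpatched version.  The model does not cover the case of an interrupt period slightly shorter than 1/HZ seconds.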

Going around the loop in nanosleep1() an extra time is a small
pessimization.  Using nanouptime() to get the decision of whether to
loop right is a pessimization too, but it is relatively small.

% /usr/src/sys/i386/isa/clock.c
% 113c113,114
% < #define       TIMER_DIV(x) ((timer_freq + (x) / 2) / (x))
% ---
% > #define TIMER_DIV(x) (timer_freq / (x))
% > #define FRAC_ADJUST(x) (timer_freq - ((timer_freq / (x)) * (x)))

Reducing TIMER_DIV() unconditionally would be harmless under FreeBSD.
Its rounding to nearest dates from when there was little more than hardclock
ticks for timekeeping.  Now HZ and the hardclock interrupt frequency
are almost unrelated to timekeeping.
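
To make the difference between the two TIMER_DIV() definitions concrete, here is a small sketch of both roundings as functions.  The function names are hypothetical, and the input clock of 1193182 Hz is an assumed nominal i8254 value, not something taken from the patch.

```c
#include <assert.h>

/* Assumed nominal i8254 input clock in Hz. */
#define TIMER_FREQ_HZ	1193182UL

/* Divisor rounded to nearest, as in the original TIMER_DIV(). */
unsigned long
timer_div_nearest(unsigned long freq, unsigned long hz)
{
	assert(hz > 0);
	return ((freq + hz / 2) / hz);
}

/* Divisor rounded down, as in the patched TIMER_DIV(). */
unsigned long
timer_div_trunc(unsigned long freq, unsigned long hz)
{
	assert(hz > 0);
	return (freq / hz);
}

/* Input clock cycles per second left over after truncation. */
unsigned long
frac_adjust(unsigned long freq, unsigned long hz)
{
	return (freq - timer_div_trunc(freq, hz) * hz);
}
```

For hz = 100 under that assumption, rounding to nearest gives a divisor of 11932 (about 99.998 interrupts per second), while truncation gives 11931 (about 100.007 per second) with 82 input cycles per second left over.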

% 141a143
% > u_int   timer0_frac_freq;
% 204a207,209
% >         int phase;
% >         int delta;
% >
% 215a221,236
% >
% >         phase = 1000000 / timer0_frac_freq;
% >         delta = timecounter->tc_microtime.tv_usec % phase;

tc_microtime.tv_usec is not quite the right thing to use here.  It is
updated every tick or two so it might be up to date, but it has
unnecessary jitter.  microtime() would give a more accurate timestamp.
I think microtime() and not microuptime() is the correct function to
use here, since we want to sync with the real time.  OTOH, nanosleep1()
and friends use the uptime, so they must be looked at some more to
determine the effects of using different time scales on syncing.  I
think the synchronization done here is honored by nanosleep1() despite
the different scales, and sync is only lost when the clock is changed
using settimeofday() (then everything gets out of sync).
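
The phase test in the quoted patch can be isolated into a pure function for experimenting with different timestamps.  This is a sketch only: next_reload_count() is a hypothetical name, the timestamp is passed in rather than read from the clock, and no hardware is touched.

```c
#include <assert.h>

/*
 * Pick the counter reload value for the next tick.  usec is the
 * microsecond part of the current time, intr_freq the interrupt rate in
 * Hz, and max_count the normal reload value.  If the current time is in
 * the first half of a nominal tick period the normal count is used;
 * otherwise the period is lengthened by one input clock cycle.
 */
unsigned int
next_reload_count(long usec, unsigned int intr_freq, unsigned int max_count)
{
	long phase, delta;

	assert(usec >= 0 && intr_freq > 0 && intr_freq <= 1000000);
	phase = 1000000L / intr_freq;	/* nominal tick length in usec */
	delta = usec % phase;		/* offset of "now" into that tick */
	return (delta < phase / 2 ? max_count : max_count + 1);
}
```

At 100 Hz, a timestamp with tv_usec = 123456 is 3456 usec into its tick and keeps the normal count, while tv_usec = 127000 is 7000 usec in and gets the lengthened count.  Where the timestamp comes from, and whether it is taken inside the critical section, is outside the scope of this sketch.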

% > #if 1
% >       disable_intr();

The clock should be read inside this critical section.

% >         if (delta < (phase >> 1)) {
% >                 outb(TIMER_CNTR0, timer0_max_count & 0xff);
% >                 outb(TIMER_CNTR0, timer0_max_count >> 8);
% >         } else {
% >                 outb(TIMER_CNTR0, (timer0_max_count +1) & 0xff);
% >                 outb(TIMER_CNTR0, (timer0_max_count +1) >> 8);
% >                 ++i8254_offset;
% >         }

I think i8254_offset needs to be reinitialized every time the maximum
count is reprogrammed.  This is not done in set_timer_freq(); however,
most callers of set_timer_freq() initialize or update the i8254
timecounter immediately after, and testing shows that this reduces
lost ticks to an acceptable value (usually, and hopefully always < 10).
Correctly reprogramming the i8254 on every interrupt is harder.  Losing
even 1 tick per interrupt is too much, but I think the above can
sometimes lose 100 (if clkintr() is delayed for that long, which can
easily happen especially in RELENG_4 since clkintr() is not a fast
interrupt handler there).  See nearby code that calls
i8254_get_timecount() inside a critical section for a way to reduce
the error to at most 5 ticks.  It takes about 5 ticks just to read the
counter.  This is still far too large to do on every clock tick.  All
of this only matters if the i8254 is used for timekeeping.

% >       enable_intr();
% > #endif
% >
% 236a258
% >                 timer0_frac_freq = new_rate;
% 247,248c269,270
% <               if ((timer0_prescaler_count += timer0_max_count)
% <                   >= hardclock_max_count) {
% ---
% >                 timer0_prescaler_count += timer0_max_count;
% >                 if (timer0_prescaler_count >= hardclock_max_count) {

This change is just to style.

% 689a712
% >         timer0_frac_freq = intr_freq;

The changes seem to be too simple to give a PLL.  I didn't check the details
for this.

% 1221c1244
% <       count = timer0_max_count - ((high << 8) | low);
% ---
% >         count = timer0_max_count + 1 - ((high << 8) | low);

Always adding 1 here seems to be wrong.  Shouldn't you only add 1 if
timer0_max_count isn't actually the max count, i.e., when the max count
has been programmed to be 1 more than usual?  All references to
timer0_max_count are potentially wrong when timer0_max_count isn't
actually the max count.  You add 1 to i8254_offset in the above; this
seems to be to adjust for 1 of the references being wrong, but it doesn't
seem to adjust for `count' being 1 too large.
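
One way to express the alternative raised here is to subtract from the value that was actually programmed for the current period instead of always adding 1.  The following is a sketch of that idea for a generic 16-bit down-counter; elapsed_count() and its reload parameter are hypothetical and do not appear in the patch.

```c
#include <assert.h>

/*
 * Input clock cycles elapsed since the counter was last reloaded.
 * high and low are the two latched counter bytes; reload is the value
 * that was actually written for the current period, so a period that
 * was lengthened by one cycle is accounted for without a fixed "+ 1".
 */
unsigned int
elapsed_count(unsigned int reload, unsigned char high, unsigned char low)
{
	unsigned int remaining = ((unsigned int)high << 8) | low;

	assert(remaining <= reload);
	return (reload - remaining);
}
```

With a latched value of 0x2e9b (11931), the function returns 0 for a reload of 11931 and 1 for a reload of 11932, so the adjustment only appears in periods that were really programmed one cycle longer.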

% A sawtooth is still present, but the accuracy is MUCH better.  I suspect my hack application of the PLL function isn't correct or my P133 is slow enough that I'm observing some other latencies.  I have observed occasional negative offsets, which according to the article are strictly forbidden by RFCs, so please check my work.  I believe they were the result of my playing with a hz value too high for the machine to reasonably handle, and are not occurring with saner values for hz.

I only agree with the non-hardware changes (don't sleep for an extra
tick in nanosleep1() and friends if this is easy to avoid).  All that
perfect sync of real time with the hardclock() clock gives is the
possibility of waking up on precisely 1/HZ boundaries relative to real
time (with whole seconds being boundaries).  System activity lengthens
sleeps by indeterminate amounts except on unloaded systems.  The average
error for a random sleep on an unloaded system would still be 0.5/HZ
(or 1.5/HZ without the nanosleep1() change).
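
Figures of this size can be reproduced with a short calculation under one particular set of assumptions: every sleep starts exactly on a tick boundary, wakeups happen only on tick boundaries with no latency, and the requested lengths are spread evenly over one tick.  This is one interpretation, not necessarily the reasoning behind the numbers above, and mean_overshoot_us() is a hypothetical name.

```c
#include <assert.h>

/*
 * Mean overshoot in microseconds for sleeps of 1..tick_us usec that all
 * start on a tick boundary.  add_one models the extra tick added by the
 * unpatched conversion.
 */
double
mean_overshoot_us(long tick_us, int add_one)
{
	long long sum = 0;
	long req, ticks;

	assert(tick_us > 0);
	for (req = 1; req <= tick_us; req++) {
		/* Round the request up to whole ticks (always 1 here). */
		ticks = (req + tick_us - 1) / tick_us;
		if (add_one)
			ticks++;
		/* Wakeup lands on the boundary 'ticks' periods later. */
		sum += (long long)ticks * tick_us - req;
	}
	return ((double)sum / (double)tick_us);
}
```

For 10000 usec ticks this gives 4999.5 usec without the extra tick and 14999.5 usec with it, which is close to 0.5 and 1.5 tick periods respectively.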

Bruce
Comment 5 Mark Linimon freebsd_committer freebsd_triage 2005-10-25 03:51:07 UTC
State Changed
From-To: open->suspended

Mark as suspended.  Followups to the original posting seem to indicate 
that the patches cannot be accepted as-is.
Comment 6 Eitan Adler freebsd_committer freebsd_triage 2018-05-20 23:52:53 UTC
For bugs matching the following conditions:
- Status == In Progress
- Assignee == "bugs@FreeBSD.org"
- Last Modified Year <= 2017

Do
- Set Status to "Open"
Comment 7 Andriy Gapon freebsd_committer freebsd_triage 2018-05-21 06:45:43 UTC
Is this still an issue?
Comment 8 Graham Perrin freebsd_committer freebsd_triage 2022-10-17 12:40:34 UTC
Keyword: 

    patch
or  patch-ready

– in lieu of summary line prefix: 

    [patch]

* bulk change for the keyword
* summary lines may be edited manually (not in bulk). 

Keyword descriptions and search interface: 

    <https://bugs.freebsd.org/bugzilla/describekeywords.cgi>