Bug 17842 - Erratic user time reports for long running processes
Summary: Erratic user time reports for long running processes
Status: Closed DUPLICATE of bug 76972
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 3.4-RELEASE
Hardware: Any Any
: Normal Affects Only Me
Assignee: freebsd-bugs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2000-04-07 11:50 UTC by Kees Hendrikse
Modified: 2018-04-22 23:25 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Kees Hendrikse 2000-04-07 11:50:01 UTC
	After a process has consumed more than (about) 379400 seconds of user
	time, ru.ru_utime.tv_sec rolls over to 0. After that, the counter is
	updated with twice the amount of time actually spent in user mode.
	The small program below shows this behaviour, after running for a
	couple of days:

	ce=379950 cu=378871
	ce=380079 cu=378999
	ce=380207 cu=379127
	ce=380336 cu=379255
	ce=380464 cu=379384
	ANOMALY: st=954617438 pe=380464 ce=380592 pu=379384 cu=237
	ANOMALY: st=954617438 pe=380592 ce=380721 pu=237 cu=493
	ANOMALY: st=954617438 pe=380721 ce=380849 pu=493 cu=749
	ANOMALY: st=954617438 pe=380849 ce=380978 pu=749 cu=1005
	ANOMALY: st=954617438 pe=380978 ce=381106 pu=1005 cu=1260
	ANOMALY: st=954617438 pe=381106 ce=381234 pu=1260 cu=1516
	ANOMALY: st=954617438 pe=381234 ce=381363 pu=1516 cu=1772
	ANOMALY: st=954617438 pe=381363 ce=381491 pu=1772 cu=2027

 	The same erratic numbers are reported via /proc/PID/status.

	This bug effects long running programs that consume lots of CPU time
	and use ru.ru_utime for internal statistics. For example, it causes
	the setiathome client to return results with negative amounts of
	CPU time spent, causing the results to be rejected.

How-To-Repeat: 
	Run this program for more than 105 hours (with nice -20):

/* pointless.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

main()
{
  struct rusage ru;
  struct timeval tv;
  struct timezone tz;
  clock_t start, pelap, celap, puser, cuser;
  long i;

  srandomdev();
  gettimeofday(&tv, &tz);
  getrusage(0, &ru);
  start = tv.tv_sec;
  celap = 0;
  cuser = ru.ru_utime.tv_sec;

  while (1) {
    for (i=0; i < 100000000; i++) random();
    gettimeofday(&tv, &tz);
    getrusage(0, &ru);
    pelap = celap;
    celap = tv.tv_sec - start;
    puser = cuser;
    cuser = ru.ru_utime.tv_sec;

    if (celap < pelap || cuser < puser || cuser - puser > celap - pelap + 1)
      fprintf(stderr, "ANOMALY: st=%ld pe=%ld ce=%ld pu=%ld cu=%ld\n",
              start, pelap, celap, puser, cuser);
    else fprintf(stderr, "ce=%ld cu=%ld\n", celap, cuser);
  }
}
Comment 1 Sheldon Hearn 2000-04-12 12:29:11 UTC
On Fri, 07 Apr 2000 12:42:59 +0200, Kees Hendrikse wrote:

> 	Tested on a stock 3.4-RELEASE, also tested on a stock 2.2.6-RELEASE.
> 	Probably a bug present in all releases of FreeBSD.

I've been running the program for more than 48 hours without an ANOMALY
lines being printed.  I'm using 5.0-CURRENT.

Does this happen when you _don't_ run it with a high nice scheduling
parameter?  If so, then I suspect this may be fixed in 4.0-RELEASE as
well.

Ciao,
Sheldon.
Comment 2 Kees Hendrikse 2000-04-24 01:12:02 UTC
Hi Sheldon,

> Excellent!  Let me know what you find.

FreeBSD 4.0 doesn't trigger the "ANOMALY" output lines, but it exhibits
an interesting variation of the same problem: it just stops increasing
ru.ru_utime.tv_sec:

(...)
ce=459822 cu=378572
ce=459938 cu=378688
ce=460054 cu=378803
ce=460170 cu=378919
ce=460285 cu=379034
ce=460401 cu=379150
ce=460517 cu=379265 <--- that's all folks....
ce=460633 cu=379265
ce=460748 cu=379265
ce=460864 cu=379265
(...)
ce=545783 cu=379265
ce=545899 cu=379265
ce=546015 cu=379265
ce=546130 cu=379265
ce=546246 cu=379265

The program is still running...


top shows:

last pid:  9456;  load averages:  1.00,  1.00,  1.00    up 6+07:57:54  01:42:24
15 processes:  2 running, 13 sleeping
CPU states:  0.4% user, 98.1% nice,  0.4% system,  1.2% interrupt,  0.0% idle
Mem: 4156K Active, 4072K Inact, 3500K Wired, 1576K Cache, 1023K Buf, 1052K Free
Swap: 64M Total, 60K Used, 64M Free

  PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
  162 root     103  20   804K   236K RUN    129.5H 97.22% 97.22% pointless
(...)


/proc/162/status confirms the output of "pointless":

# cd /proc/162
# for i in 1 2 3 4 5
> do
> cat status
> sleep 2
> done
pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16922,998963 nochan 0 0 0,0,0,2,3,4,5,20,31 -
pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16923,228881 nochan 0 0 0,0,0,2,3,4,5,20,31 -
pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16923,619831 nochan 0 0 0,0,0,2,3,4,5,20,31 -
pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16924,10718 nochan 0 0 0,0,0,2,3,4,5,20,31 -
pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16924,401572 nochan 0 0 0,0,0,2,3,4,5,20,31 -
                                                   ^^^^^^^^^^^^^


What baffles me most: why stop counting at 379265,687860 seconds? 

-- 
Kees Hendrikse                               | email:     kees@echelon.nl
                                             | web:        www.echelon.nl
ECHELON consultancy and software development | phone: +31 (0)53 48 36 585
PO Box 545, 7500AM Enschede, The Netherlands | fax:   +31 (0)53 43 36 222
Comment 3 Mike Barcroft freebsd_committer 2001-07-22 02:08:17 UTC
State Changed
From-To: open->feedback


Does this problem still occur in newer versions of FreeBSD, 
such as 4.3-RELEASE?
Comment 4 Sheldon Hearn freebsd_committer 2002-01-18 16:14:39 UTC
State Changed
From-To: feedback->closed

Automatic feedback timeout.  If additional feedback that warrants 
the re-opening of this PR is available but not included in the 
audit trail, please include the feedback in a reply to this message 
(preserving the Subject line) and ask that the PR be re-opened.
Comment 5 Bruce Evans freebsd_committer 2005-02-02 07:38:29 UTC
State Changed
From-To: closed->open

Closed in error.
Comment 6 Bruce Evans freebsd_committer 2005-02-02 07:48:07 UTC
State Changed
From-To: open->closed

Superseded by kern/76972, which has a detailed analysis and a partial 
fix (a multiplication overflows after about 105 hours of runtime if 
stathz has its normal value of 128...).
Comment 7 Bruce Evans 2005-02-02 09:56:50 UTC
On Wed, 26 Apr 2000, Bruce Evans meant to Cc this to freebsd-gnats-submit
but only Cced it to freebsd-bugs:

[The submitter of the PR wrote:]
> >  What baffles me most: why stop counting at 379265,687860 seconds?
>
> The multiplication in "uu = (tu * ut) / tt;" in kern_resource.c
> overflows near there.  Here tu is the total time in usec, ut is the
> user tick count and tt is the total tick count.  If all ticks are user
> ticks, then overflow occurs for tu = sqrt(2^64 / 10^6 / hz) = 429496
> seconds when hz = 100.  Normally there are some interrupt and system
> ticks, so overflow occurs for a somewhat larger tu and a somewhat
> smaller ut -- apparently about 379 / 429 times smaller in your case.
>
> 4.0 enforces monotonicity of the resource times, so truncated values
> resulting from the overflow don't cause the resource times to stick
                              ^^^^^ oops, delete "don't" here
> at the first value where overflow occurs.
>
> Bruce

The factor of 379 / 429 was because I got hz and stathz mixed up.
sqrt(2^64 / 10^6 / 128) is 379265.06...

See PR 76972 for more details, except about the stickyness of the value.
The value might not stick at the first overflowed value -- overflow just
gives garbage values, and monotonicity limits the jitter caused by the
garbage.

Bruce
Comment 8 Ed Maste freebsd_committer 2018-04-22 23:25:03 UTC

*** This bug has been marked as a duplicate of bug 76972 ***