After a process has consumed more than (about) 379400 seconds of user time, ru.ru_utime.tv_sec rolls over to 0. After that, the counter is updated with twice the amount of time actually spent in user mode. The small program below shows this behaviour, after running for a couple of days: ce=379950 cu=378871 ce=380079 cu=378999 ce=380207 cu=379127 ce=380336 cu=379255 ce=380464 cu=379384 ANOMALY: st=954617438 pe=380464 ce=380592 pu=379384 cu=237 ANOMALY: st=954617438 pe=380592 ce=380721 pu=237 cu=493 ANOMALY: st=954617438 pe=380721 ce=380849 pu=493 cu=749 ANOMALY: st=954617438 pe=380849 ce=380978 pu=749 cu=1005 ANOMALY: st=954617438 pe=380978 ce=381106 pu=1005 cu=1260 ANOMALY: st=954617438 pe=381106 ce=381234 pu=1260 cu=1516 ANOMALY: st=954617438 pe=381234 ce=381363 pu=1516 cu=1772 ANOMALY: st=954617438 pe=381363 ce=381491 pu=1772 cu=2027 The same erratic numbers are reported via /proc/PID/status. This bug effects long running programs that consume lots of CPU time and use ru.ru_utime for internal statistics. For example, it causes the setiathome client to return results with negative amounts of CPU time spent, causing the results to be rejected. How-To-Repeat: Run this program for more than 105 hours (with nice -20): /* pointless.c */ #include <stdio.h> #include <stdlib.h> #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> main() { struct rusage ru; struct timeval tv; struct timezone tz; clock_t start, pelap, celap, puser, cuser; long i; srandomdev(); gettimeofday(&tv, &tz); getrusage(0, &ru); start = tv.tv_sec; celap = 0; cuser = ru.ru_utime.tv_sec; while (1) { for (i=0; i < 100000000; i++) random(); gettimeofday(&tv, &tz); getrusage(0, &ru); pelap = celap; celap = tv.tv_sec - start; puser = cuser; cuser = ru.ru_utime.tv_sec; if (celap < pelap || cuser < puser || cuser - puser > celap - pelap + 1) fprintf(stderr, "ANOMALY: st=%ld pe=%ld ce=%ld pu=%ld cu=%ld\n", start, pelap, celap, puser, cuser); else fprintf(stderr, "ce=%ld cu=%ld\n", celap, cuser); } }
On Fri, 07 Apr 2000 12:42:59 +0200, Kees Hendrikse wrote: > Tested on a stock 3.4-RELEASE, also tested on a stock 2.2.6-RELEASE. > Probably a bug present in all releases of FreeBSD. I've been running the program for more than 48 hours without an ANOMALY lines being printed. I'm using 5.0-CURRENT. Does this happen when you _don't_ run it with a high nice scheduling parameter? If so, then I suspect this may be fixed in 4.0-RELEASE as well. Ciao, Sheldon.
Hi Sheldon, > Excellent! Let me know what you find. FreeBSD 4.0 doesn't trigger the "ANOMALY" output lines, but it exhibits an interesting variation of the same problem: it just stops increasing ru.ru_utime.tv_sec: (...) ce=459822 cu=378572 ce=459938 cu=378688 ce=460054 cu=378803 ce=460170 cu=378919 ce=460285 cu=379034 ce=460401 cu=379150 ce=460517 cu=379265 <--- that's all folks.... ce=460633 cu=379265 ce=460748 cu=379265 ce=460864 cu=379265 (...) ce=545783 cu=379265 ce=545899 cu=379265 ce=546015 cu=379265 ce=546130 cu=379265 ce=546246 cu=379265 The program is still running... top shows: last pid: 9456; load averages: 1.00, 1.00, 1.00 up 6+07:57:54 01:42:24 15 processes: 2 running, 13 sleeping CPU states: 0.4% user, 98.1% nice, 0.4% system, 1.2% interrupt, 0.0% idle Mem: 4156K Active, 4072K Inact, 3500K Wired, 1576K Cache, 1023K Buf, 1052K Free Swap: 64M Total, 60K Used, 64M Free PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND 162 root 103 20 804K 236K RUN 129.5H 97.22% 97.22% pointless (...) /proc/162/status confirms the output of "pointless": # cd /proc/162 # for i in 1 2 3 4 5 > do > cat status > sleep 2 > done pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16922,998963 nochan 0 0 0,0,0,2,3,4,5,20,31 - pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16923,228881 nochan 0 0 0,0,0,2,3,4,5,20,31 - pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16923,619831 nochan 0 0 0,0,0,2,3,4,5,20,31 - pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16924,10718 nochan 0 0 0,0,0,2,3,4,5,20,31 - pointless 162 1 162 0 12,0 noflags 955986493,95119 379265,687860 16924,401572 nochan 0 0 0,0,0,2,3,4,5,20,31 - ^^^^^^^^^^^^^ What baffles me most: why stop counting at 379265,687860 seconds? -- Kees Hendrikse | email: kees@echelon.nl | web: www.echelon.nl ECHELON consultancy and software development | phone: +31 (0)53 48 36 585 PO Box 545, 7500AM Enschede, The Netherlands | fax: +31 (0)53 43 36 222
State Changed From-To: open->feedback Does this problem still occur in newer versions of FreeBSD, such as 4.3-RELEASE?
State Changed From-To: feedback->closed Automatic feedback timeout. If additional feedback that warrants the re-opening of this PR is available but not included in the audit trail, please include the feedback in a reply to this message (preserving the Subject line) and ask that the PR be re-opened.
State Changed From-To: closed->open Closed in error.
State Changed From-To: open->closed Superseded by kern/76972, which has a detailed analysis and a partial fix (a multiplication overflows after about 105 hours of runtime if stathz has its normal value of 128...).
On Wed, 26 Apr 2000, Bruce Evans meant to Cc this to freebsd-gnats-submit but only Cced it to freebsd-bugs: [The submitter of the PR wrote:] > > What baffles me most: why stop counting at 379265,687860 seconds? > > The multiplication in "uu = (tu * ut) / tt;" in kern_resource.c > overflows near there. Here tu is the total time in usec, ut is the > user tick count and tt is the total tick count. If all ticks are user > ticks, then overflow occurs for tu = sqrt(2^64 / 10^6 / hz) = 429496 > seconds when hz = 100. Normally there are some interrupt and system > ticks, so overflow occurs for a somewhat larger tu and a somewhat > smaller ut -- apparently about 379 / 429 times smaller in your case. > > 4.0 enforces monotonicity of the resource times, so truncated values > resulting from the overflow don't cause the resource times to stick ^^^^^ oops, delete "don't" here > at the first value where overflow occurs. > > Bruce The factor of 379 / 429 was because I got hz and stathz mixed up. sqrt(2^64 / 10^6 / 128) is 379265.06... See PR 76972 for more details, except about the stickyness of the value. The value might not stick at the first overflowed value -- overflow just gives garbage values, and monotonicity limits the jitter caused by the garbage. Bruce
*** This bug has been marked as a duplicate of bug 76972 ***