Summary: | kern/kern_resource: Integer overflow in function calcru1 | ||
---|---|---|---|
Product: | Base System | Reporter: | WHR <msl0000023508> |
Component: | kern | Assignee: | freebsd-bugs (Nobody) <bugs> |
Status: | Open --- | ||
Severity: | Affects Some People | CC: | cem, jhb, koobs, msl0000023508, sjg |
Priority: | --- | Keywords: | needs-qa |
Version: | 12.3-STABLE | Flags: | koobs:
maintainer-feedback?
(sjg) koobs: maintainer-feedback? (cem) koobs: maintainer-feedback? (jhb) koobs: mfc-stable13? koobs: mfc-stable12? |
Hardware: | Any | ||
OS: | Any | ||
See Also: |
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197481 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=176991 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=76972 https://reviews.freebsd.org/D33902 |
I believe I've reproduced this on a 12.3-RELEASE-p5 GENERIC amd64 virtualbox virtual machine. After suspending the guest and upgrading virtualbox (on host) and resuming, the following was logged in /var/log/messages: Jul 28 00:34:10 123-RELEASE-p5-amd64-9e36 kernel: calcru: runtime went backwards from 5425487903787745 usec to 18518189771 usec for pid 11 (idle) Additionally, the following is observed in top output: PID JID USERNAME THR PRI NICE SIZE RES SWAP STATE C TIME WCPU COMMAND 11 0 root 2 155 ki31 0B 32K 0B RUN 0 ??? 199.70% [idle] Originally running `ps -p` on the idle process id showed time and system not changing: [koobs@123-RELEASE-p5-amd64-9e36:/usr/ports] ps -p 17087 -o comm,time,systime COMMAND TIME SYSTIME [koobs@123-RELEASE-p5-amd64-9e36:/usr/ports] ps -p 11 -o comm,time,systime COMMAND TIME SYSTIME idle 90424922:45.67 90424922:45.67 [koobs@123-RELEASE-p5-amd64-9e36:/usr/ports] ps -p 11 -o comm,time,systime COMMAND TIME SYSTIME idle 90424922:45.67 90424922:45.67 [koobs@123-RELEASE-p5-amd64-9e36:/usr/ports] ps -p 11 -o comm,time,systime COMMAND TIME SYSTIME idle 90424922:45.67 90424922:45.67 [koobs@123-RELEASE-p5-amd64-9e36:/usr/ports] ps -p 11 -o comm,time,systime COMMAND TIME SYSTIME idle 90424922:45.67 90424922:45.67 [koobs@123-RELEASE-p5-amd64-9e36:/usr/ports] ps -p 11 -o comm,time,systime COMMAND TIME SYSTIME idle 90424922:45.67 90424922:45.67 But after a small period of time started changing again: [koobs@123-RELEASE-p5-amd64-9e36:/usr/ports] ps -p 11 -o comm,time,systime COMMAND TIME SYSTIME idle 90424939:04.68 90424939:04.68 [koobs@123-RELEASE-p5-amd64-9e36:/usr/ports] ps -p 11 -o comm,time,systime COMMAND TIME SYSTIME idle 90424939:05.76 90424939:05.76 [koobs@123-RELEASE-p5-amd64-9e36:/usr/ports] ps -p 11 -o comm,time,systime COMMAND TIME SYSTIME idle 90424939:06.43 90424939:06.43 Looking at kern/kern_resource history: 1) Conrad (cem) resolved a calcru1 overflow in base 23e5e43ccd0f via bug 76972 (Bruce followed up with base 23e5e43ccd0f) 2) Simon (sjg) disabled the message for virtual machines in base bacb140f31aa via review D33902 (unsure whether this was eventually MFC'd or not, the commit didn't include MFC tag) There may be behaviour improvements possible in this case, particularly for the '???' reported times in top. In particular: 1) Had the message been disabled in my (VM) case, I would not have been able to correlate/isolate the cause via dmesg 2) What could we report for runtime in the event of overflow instead of '???' What do other OS's do? With dynamic frequency/power scaling being the norm these days, increasing heterogeneity in system hardware (big.little etc) in the future, and given this message appears to only be an (intended) 'report' or diagnostic message solely to indicate the state change (and not a fatal error) ... 3) What other ways might FreeBSD more broadly/gracefully handle timing changes going forward? ^Triage: Request feedback from Conrad re (remaining?) calcru1 overflow and Simon/John re virtual machine situation. I didn't see any kernel message printed in dmesg(8) when this happens, but the issue prevented the affected process being correctly shown in some 'top' implementations. (I uses nmon(1) with linprocfs(4)) |
The issue is from following code in function calcru1: su = (tu * st) / tt; the 'tu * st' is overflowed in my system, causing the 'p_rux.rux_su' is no longer growing for that process. For example, running ps(1) shows that 'systime' is no longer growing, but 'time' is not yet overflowed and keep growing $ ps -p 17087 -o comm,time,systime COMMAND TIME SYSTIME VBoxHeadless 8951:24.70 6339:45.52 $ ps -p 17087 -o comm,time,systime COMMAND TIME SYSTIME VBoxHeadless 8951:25.72 6339:45.52 $ ps -p 17087 -o comm,time,systime COMMAND TIME SYSTIME VBoxHeadless 8951:26.54 6339:45.52 $ ps -p 17087 -o comm,time,systime COMMAND TIME SYSTIME VBoxHeadless 8951:27.60 6339:45.52 Running kgdb(1) shows an overflow is occurred when 'cputick2usec(ruxp->rux_runtime)' multiplying '$p->p_rux.rux_sticks' (kgdb) p $p $3949 = (struct proc *) 0xfffff800464f5000 (kgdb) p $p->p_pid $3950 = 17087 (kgdb) p $p->p_comm $3951 = "VBoxHeadless\000\000\000\000\000\000\000" (kgdb) p $p->p_rux $3952 = {rux_runtime = 1888763265150967, rux_uticks = 866405, rux_sticks = 66653750, rux_iticks = 0, rux_uu = 6893394017, rux_su = 380385524075, rux_tu = 537211849594} (kgdb) p $p->p_rux $3953 = {rux_runtime = 1888767209058083, rux_uticks = 866408, rux_sticks = 66653898, rux_iticks = 0, rux_uu = 6893416864, rux_su = 380385524075, rux_tu = 537212971341} (kgdb) p $p->p_rux $3954 = {rux_runtime = 1888773035929970, rux_uticks = 866411, rux_sticks = 66654106, rux_iticks = 0, rux_uu = 6893440457, rux_su = 380385524075, rux_tu = 537214628650} (kgdb) p ((uint64_t)$p->p_rux.rux_runtime * 1000ll / ((cpu_ticks == tc_cpu_ticks ? timehands->th_counter->tc_frequency : cpu_tick_frequency) / 1000ll)) $3955 = 537220825631 (kgdb) p ((uint64_t)$p->p_rux.rux_runtime * 1000ll / ((cpu_ticks == tc_cpu_ticks ? timehands->th_counter->tc_frequency : cpu_tick_frequency) / 1000ll)) * $p->p_rux.rux_sticks / ($p->p_rux.rux_uticks + $p->p_rux.rux_sticks + $p->p_rux.rux_iticks) $3956 = 257136383464 (kgdb) p ((uint64_t)$p->p_rux.rux_runtime * 1000ll / ((cpu_ticks == tc_cpu_ticks ? timehands->th_counter->tc_frequency : cpu_tick_frequency) / 1000ll)) $3959 = 537288008712 (kgdb) p $p->p_rux.rux_sticks $3960 = 66663850 (kgdb) p ((uint64_t)$p->p_rux.rux_runtime * 1000ll / ((cpu_ticks == tc_cpu_ticks ? timehands->th_counter->tc_frequency : cpu_tick_frequency) / 1000ll)) * $p->p_rux.rux_sticks $3961 = 17371417884613589424 Showing the correct result of 'tu * st' in python(1): >>> 537288008712*66663850 35817687219575461200L This result is actually greater than ULLONG_MAX. Tested on kernel version 10.3-RELEASE-p20 and 11.1-RELEASE.