After updating from 13.1-RELEASE-p7 to 13.2-RELEASE-p0, I experienced a performance and power efficiency regression especially when running Java programs, in particular the Minecraft server. After some testing, this appears to be caused by commit e21c77f80c3b1cf2bd0a9f874e15e7a8e49f0dba [Improve timeout precision of pthread_cond_timedwait().]. When building the kernel just before this commit, this regression does not occur, while building with this commit (n253051-e21c77f80c3b), the regression occurs. The regression is still present in the 6 April 2023 snapshot kernel of 14.0-CURRENT. I have observed this behavior on two systems, a server with an Intel Xeon X3460 (Lynnfield, Nehalem) CPU, as well as a desktop PC with an Intel Core i7-6700K (Skylake) CPU. When running a kernel with this commit, the WCPU usage of the java process in top is considerably higher at idle than without the commit when running a Minecraft server (1 to 2% before, 10% to 30% after). In addition, the CPU spends a lot more time in less power efficient C-states (C1) where before this commit, the CPU spends much more time in power efficient C-states such as C3. On my server, the CPU used to spend 98% time in C3 while after the commit it spends only 30% time in C3, leading to higher power consumption, more heat, and faster fan speeds.
Depending on what the applicaiton/java actually does, I can easily believe that behavior could change after the mentioned commit, but it is not necessary an OS bug, it may be an application bug or a lack of optimization. Previous code always waited for at least 2ms in pthread_cond_timedwait(), no matter what timeout is requested, while after the patch it may wake up much sooner, just in few microseconds. It may be that application really requests wake ups that often, but previously it was not properly honored. For better understanding it would be good to collect information about the actual system calls and their arguments before and after, either with truss or may be dtrace, etc.
(In reply to Alexander Motin from comment #1) I ran truss for about 5 seconds against two java processes, one running UniFi Controller and another running Minecraft server. UniFi 13.1-RELEASE: syscall seconds calls errors thr_set_name 0.001021075 2 0 kevent 13.369956893 13 0 fstatat 0.001887292 5 5 getrusage 0.003297147 14 0 _umtx_op 51.382037072 210 111 ------------- ------- ------- 64.758199479 244 116 UniFi 13.2-RELEASE: syscall seconds calls errors __sysctlbyname 0.000062617 3 0 write 0.000054957 1 0 fstatat 0.000652984 6 6 getrusage 0.001700589 21 0 thr_set_name 0.000423376 4 0 kevent 24.622512777 24 0 _umtx_op 55.068817041 262 194 ------------- ------- ------- 79.694224341 321 200 Minecraft 13.1-RELEASE: syscall seconds calls errors getrusage 0.012089758 74 0 sched_yield 0.088819070 726 0 _umtx_op 22.515308049 838 838 ------------- ------- ------- 22.616216877 1638 838 Minecraft 13.2-RELEASE: syscall seconds calls errors getrusage 0.002642594 106 0 sched_yield 0.750417219 37623 0 _umtx_op 28.945530935 37797 37797 ------------- ------- ------- 29.698590748 75526 37797 A few lines from Minecraft on 13.1-RELEASE: _umtx_op(0x801066458,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdd2ac688) ERR#60 'Operation timed out' sched_yield() = 0 (0x0) _umtx_op(0x8010661d0,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdecebd68) ERR#60 'Operation timed out' _umtx_op(0x801066458,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdd2ac688) ERR#60 'Operation timed out' sched_yield() = 0 (0x0) _umtx_op(0x801066458,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdd2ac688) ERR#60 'Operation timed out' sched_yield() = 0 (0x0) A few lines from Minecraft on 13.2-RELEASE: _umtx_op(0x18defb81b470,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x18e0d4980728) ERR#60 'Operation timed out' sched_yield() = 0 (0x0) _umtx_op(0x18defb81b470,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x18e0d4980728) ERR#60 'Operation timed out' sched_yield() = 0 (0x0) _umtx_op(0x18defb81b470,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x18e0d4980728) ERR#60 'Operation timed out' sched_yield() = 0 (0x0) _umtx_op(0x18defb81b470,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x18e0d4980728) ERR#60 'Operation timed out' sched_yield() = 0 (0x0) This is my first time using truss, or doing any sort of "debugging" so I'm not very familiar, if I didn't do something right please tell me and I'll try do it again properly. I hope this helps with understanding.
So minecraft is generating many, many more calls to _umtx_op() than before. If the kernel commit is reverted, and nothing else changes, does the number of calls stay the same? If not, then there may indeed be a kernel bug. It would be useful to see the timeout being passed. Some output from a dtrace script like: # dtrace -n 'fbt::do_wait:entry /args[3] && pid == <pid of minecraft process>/{print(*args[3]);}' would be useful.
(In reply to Mark Johnston from comment #3) I cloned releng/13.2 and reverted commit e21c77f80c3b1cf2bd0a9f874e15e7a8e49f0dba. Reverting this commit fixes the regression. I also did the dtrace output as asked - thank you for providing the dtrace script. releng/13.2 (standard, same as RELEASE): syscall seconds calls errors __sysctlbyname 0.000650415 6 0 getrusage 0.008321625 116 0 sched_yield 1.287887942 29936 0 _umtx_op 41.138906134 30136 30132 ------------- ------- ------- 42.435766116 60194 30132 dtrace: description 'fbt::do_wait:entry ' matched 1 probe CPU ID FUNCTION:NAME 2 18111 do_wait:entry struct _umtx_time { struct timespec _timeout = { time_t tv_sec = 0xc2 long tv_nsec = 0x2ed64521 } __uint32_t _flags = 0x1 __uint32_t _clockid = 0x4 } 2 18111 do_wait:entry struct _umtx_time { struct timespec _timeout = { time_t tv_sec = 0xc2 long tv_nsec = 0x2ed7d390 } __uint32_t _flags = 0x1 __uint32_t _clockid = 0x4 } releng/13.2 (with commit e21c77f80c3b1cf2bd0a9f874e15e7a8e49f0dba reverted): syscall seconds calls errors getrusage 0.003569531 58 0 sched_yield 0.168773634 3781 0 _umtx_op 16.840977734 3873 3873 ------------- ------- ------- 17.013320899 7712 3873 dtrace: description 'fbt::do_wait:entry ' matched 1 probe CPU ID FUNCTION:NAME 4 18111 do_wait:entry struct _umtx_time { struct timespec _timeout = { time_t tv_sec = 0x6f long tv_nsec = 0xba6df4a } __uint32_t _flags = 0x1 __uint32_t _clockid = 0x4 } 3 18111 do_wait:entry struct _umtx_time { struct timespec _timeout = { time_t tv_sec = 0x6f long tv_nsec = 0x8bb061f } __uint32_t _flags = 0x1 __uint32_t _clockid = 0x4 } So it seems that reverting the commit fixes the issue. It also helps the UniFi Controller as well (although that wasn't impacted as drastically as Minecraft). This might be a kernel bug as you mentioned?
So it looks like userspace is always specifying an absolute time. The callout code will not specify a precision in that case, maybe that is relevant to the problem, or there is a bug in that code path.
Interestingly, there is a bug report for Minecraft that seems related: https://bugs.mojang.com/browse/MC-183518 It seems that an extremely low timeout is causing massive numbers of calls per second. This issue wasn't noticeable before, but now is, probably due to the increased precision of timing allowed with the commit. For Linux, changing the value in /proc/<pid>/timerslack_ns apparently helps, but it isn't there on FreeBSD. I noticed that the difference between "steps" of the tv_nsec value in the dtrace output was seemingly tied to kern.hz before the commit, so kern.hz=100 would give a 10ms step, and kern.hz=1000 would give a 1ms step, no matter what kern.eventtimer.periodic is set to, which I confirmed with dtrace. After the commit, kern.hz doesn't seem to have an impact when kern.eventtimer.periodic=0: the step is always 0.1ms. However, using kern.eventtimer.periodic=1 changes things: it becomes 0.5ms with kern.hz=1000 and 2.5ms with kern.hz=100. From my (limited) understanding, the commit improves the precision of timed waits, but the old behavior inadvertently helped with CPU utilization and power efficiency by preventing software from being too abusive with short timeouts and constantly waking the CPU. Please let me know if there is any other information I can provide.
Your analysis exactly matches my expectations. I'd say it is a Minecraft bug, not OS. May be we could introduce some tunable to limit minimal sleep time, but I hate tunables in general.
(In reply to Alexander Motin from comment #7) A minimal sleep time tunable would be helpful as Linux seems to have something similar(ish), in /proc/<pid>/timerslack_ns which I tested and confirmed to help. I think the ultimate solution would be for Mojang to adjust Minecraft server so that it doesn't call for such short sleep times, however, that bug report has been open for a long time (along with a few others) and the solution was basically "use /proc/<pid>/timerslack_ns". I tried to open a bug with Mojang for it hoping that they will try and fix it this time, however, their bug reporting platform refuses to allow me to create an account at this time for some reason. I'll try again soon. In any case, would a tunable for this be at all feasible?
Try this https://reviews.freebsd.org/D39584
Thank you, I applied the diff on a clean git clone on releng/13.2 and tested, but unfortunately it doesn't appear to have any effect. The sysctl is visible as kern.ipc.umtx_min_timeout. I set it to 10ms (10000000ns) by issuing "sysctl kern.ipc.umtx_min_timeout=10000000" but unfortunately there doesn't seem to be any effect. CPU usage stays the same. Here's the truss and dtrace output: syscall seconds calls errors __sysctlbyname 0.000986089 6 0 getrusage 0.008186177 106 0 sched_yield 1.172105081 28475 0 _umtx_op 30.594233465 28659 28654 ------------- ------- ------- 31.775510812 57246 28654 dtrace: description 'fbt::do_wait:entry ' matched 1 probe CPU ID FUNCTION:NAME 5 18113 do_wait:entry struct _umtx_time { struct timespec _timeout = { time_t tv_sec = 0x16a long tv_nsec = 0x279c2e2a } __uint32_t _flags = 0x1 __uint32_t _clockid = 0x4 } 6 18113 do_wait:entry struct _umtx_time { struct timespec _timeout = { time_t tv_sec = 0x16a long tv_nsec = 0x279dd9fd } __uint32_t _flags = 0x1 __uint32_t _clockid = 0x4 } It seems the high volume of syscalls stayed the same, with the same short 0.1ms delay. Assuming this can be made to work though, this tunable would be really greatly appreciated.
Yes, there was at least one conceptual bug in the patch which made it non-functional. Try the update please.
Thank you, I applied the updated patch and tested, now Java seems to hang after Minecraft server has started when the sysctl kern.ipc.umtx_min_timeout is set to anything except 0. I tried 1, 10, 100000, 10000000, etc... which all resulted in a hang, but 0 is perfectly fine. These are the last few lines from truss when Java hangs and the truss output stops. WCPU usage of java process drops to 0%. Only way to stop it at this point is kill -9. _umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0) _umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0) _umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0) _umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAKE2,0x0,0x0,0x0) = 0 (0x0) _umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0) _umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0) _umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAKE2,0x0,0x0,0x0) = 0 (0x0) _umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0) __sysctlbyname("vm.stats.vm.v_free_count",24,0x27d0328ecc04,0x27d0328ecbf8,0x0,0) = 0 (0x0) __sysctlbyname("vm.stats.vm.v_cache_count",25,0x27d0328ecc04,0x27d0328ecbf8,0x0,0) = 0 (0x0) __sysctlbyname("vm.stats.vm.v_inactive_count",28,0x27d0328ecc04,0x27d0328ecbf8,0x0,0) = 0 (0x0) mprotect(0x27ceb6394000,65536,PROT_READ|PROT_WRITE|PROT_EXEC) = 0 (0x0) __sysctlbyname("vm.stats.vm.v_free_count",24,0x27d047c69c04,0x27d047c69bf8,0x0,0) = 0 (0x0) __sysctlbyname("vm.stats.vm.v_cache_count",25,0x27d047c69c04,0x27d047c69bf8,0x0,0) = 0 (0x0) __sysctlbyname("vm.stats.vm.v_inactive_count",28,0x27d047c69c04,0x27d047c69bf8,0x0,0) = 0 (0x0) In addition, when the sysctl is set to anything except 0, dtrace shows this output from the moment I start Java/Minecraft to the moment it freezes: dtrace: description 'fbt::do_wait:entry ' matched 1 probe If I set the sysctl back to 0, kill and restart Java/Minecraft, then dtrace shows lots of output as expected. As soon as I set the sysctl again, then dtrace output immediately stops as well as truss output, leaving the above. It's as if it is not entering do_wait at all when the sysctl is set?
I misused abs time instead of the uptime for the timeout recalculation. Please try the updated patch.
Thank you, it works now. Setting kern.ipc.umtx_min_timeout to 10000000 (10ms) gives nice low CPU usage, and the CPU is spending much more time in low-power C-states. I'll keep an eye on it for a while more but for now it seems perfect. I really appreciate this patch, and I'm sure others will find this very helpful. May I ask if this patch will be included in an update or patch to FreeBSD?
Can we expect this patch in -STABLE? Thank you.
Can people retest with the updated patch from https://reviews.freebsd.org/D39584 ? Now the env var LIBPTHREAD_UMTX_MIN_TIMEOUT should be set per-process, instead of the global sysctl knob.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=93ca6ff2958cace3b7b883ed797ea6539881a6a2 commit 93ca6ff2958cace3b7b883ed797ea6539881a6a2 Author: Konstantin Belousov <kib@FreeBSD.org> AuthorDate: 2023-04-15 17:46:12 +0000 Commit: Konstantin Belousov <kib@FreeBSD.org> CommitDate: 2023-04-18 23:22:28 +0000 umtx: allow to configure minimal timeout (in nanoseconds) PR: 270785 Reviewed by: markj, mav Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39584 lib/libc/sys/_umtx_op.2 | 11 +++++++++++ lib/libthr/libthr.3 | 6 ++++++ lib/libthr/thread/thr_init.c | 11 +++++++++++ sys/kern/kern_thread.c | 4 ++-- sys/kern/kern_umtx.c | 45 ++++++++++++++++++++++++++++++++++++++++++++ sys/sys/proc.h | 1 + sys/sys/umtx.h | 2 ++ 7 files changed, 78 insertions(+), 2 deletions(-)
(In reply to Konstantin Belousov from comment #16) I tested it and it appears to work perfectly. No complaints here. Having it per-process is much appreciated. May I ask if it will be included in an update for 13.2-RELEASE? Or will it wait until 14.0?
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=68a69739cb902d083aacdb0974b9b957e0e1353b commit 68a69739cb902d083aacdb0974b9b957e0e1353b Author: Konstantin Belousov <kib@FreeBSD.org> AuthorDate: 2023-04-15 17:46:12 +0000 Commit: Konstantin Belousov <kib@FreeBSD.org> CommitDate: 2023-04-25 00:27:27 +0000 umtx: allow to configure minimal timeout (in nanoseconds) PR: 270785 (cherry picked from commit 93ca6ff2958cace3b7b883ed797ea6539881a6a2) lib/libc/sys/_umtx_op.2 | 11 +++++++++++ lib/libthr/libthr.3 | 6 ++++++ lib/libthr/thread/thr_init.c | 11 +++++++++++ sys/kern/kern_fork.c | 1 + sys/kern/kern_umtx.c | 45 ++++++++++++++++++++++++++++++++++++++++++++ sys/sys/proc.h | 1 + sys/sys/umtx.h | 2 ++ 7 files changed, 77 insertions(+)
FreeBSD 14.0-RELEASE includes the LIBPTHREAD_UMTX_MIN_TIMEOUT per-process tunable, and setting it like LIBPTHREAD_UMTX_MIN_TIMEOUT=10000000 before the java command to start a Minecraft server for example, greatly improves power efficiency. It is working great now. Thank you for this, and thank you to everyone working on FreeBSD.