Bug 270785 - Performance and power efficiency regression due to pthread_cond_timedwait() changes
Summary: Performance and power efficiency regression due to pthread_cond_timedwait() c...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 13.2-RELEASE
Hardware: Any Any
: --- Affects Some People
Assignee: freebsd-threads (Nobody)
URL:
Keywords: performance, regression
Depends on:
Blocks:
 
Reported: 2023-04-12 13:08 UTC by bllcha013
Modified: 2023-11-21 16:26 UTC (History)
10 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description bllcha013 2023-04-12 13:08:57 UTC
After updating from 13.1-RELEASE-p7 to 13.2-RELEASE-p0, I experienced a performance and power efficiency regression especially when running Java programs, in particular the Minecraft server.

After some testing, this appears to be caused by commit e21c77f80c3b1cf2bd0a9f874e15e7a8e49f0dba [Improve timeout precision of pthread_cond_timedwait().]. When building the kernel just before this commit, this regression does not occur, while building with this commit (n253051-e21c77f80c3b), the regression occurs. The regression is still present in the 6 April 2023 snapshot kernel of 14.0-CURRENT.

I have observed this behavior on two systems, a server with an Intel Xeon X3460 (Lynnfield, Nehalem) CPU, as well as a desktop PC with an Intel Core i7-6700K (Skylake) CPU.

When running a kernel with this commit, the WCPU usage of the java process in top is considerably higher at idle than without the commit when running a Minecraft server (1 to 2% before, 10% to 30% after). In addition, the CPU spends a lot more time in less power efficient C-states (C1) where before this commit, the CPU spends much more time in power efficient C-states such as C3. On my server, the CPU used to spend 98% time in C3 while after the commit it spends only 30% time in C3, leading to higher power consumption, more heat, and faster fan speeds.
Comment 1 Alexander Motin freebsd_committer freebsd_triage 2023-04-12 18:31:56 UTC
Depending on what the applicaiton/java actually does, I can easily believe that behavior could change after the mentioned commit, but it is not necessary an OS bug, it may be an application bug or a lack of optimization.  Previous code always waited for at least 2ms in pthread_cond_timedwait(), no matter what timeout is requested, while after the patch it may wake up much sooner, just in few microseconds.  It may be that application really requests wake ups that often, but previously it was not properly honored.  For better understanding it would be good to collect information about the actual system calls and their arguments before and after, either with truss or may be dtrace, etc.
Comment 2 bllcha013 2023-04-12 20:13:51 UTC
(In reply to Alexander Motin from comment #1)

I ran truss for about 5 seconds against two java processes, one running UniFi Controller and another running Minecraft server.

UniFi 13.1-RELEASE:

syscall                     seconds   calls  errors
thr_set_name            0.001021075       2       0
kevent                 13.369956893      13       0
fstatat                 0.001887292       5       5
getrusage               0.003297147      14       0
_umtx_op               51.382037072     210     111
                      ------------- ------- -------
                       64.758199479     244     116

UniFi 13.2-RELEASE:

syscall                     seconds   calls  errors
__sysctlbyname          0.000062617       3       0
write                   0.000054957       1       0
fstatat                 0.000652984       6       6
getrusage               0.001700589      21       0
thr_set_name            0.000423376       4       0
kevent                 24.622512777      24       0
_umtx_op               55.068817041     262     194
                      ------------- ------- -------
                       79.694224341     321     200

Minecraft 13.1-RELEASE:

syscall                     seconds   calls  errors
getrusage               0.012089758      74       0
sched_yield             0.088819070     726       0
_umtx_op               22.515308049     838     838
                      ------------- ------- -------
                       22.616216877    1638     838

Minecraft 13.2-RELEASE:

syscall                     seconds   calls  errors
getrusage               0.002642594     106       0
sched_yield             0.750417219   37623       0
_umtx_op               28.945530935   37797   37797
                      ------------- ------- -------
                       29.698590748   75526   37797

A few lines from Minecraft on 13.1-RELEASE:

_umtx_op(0x801066458,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdd2ac688) ERR#60 'Operation timed out'
sched_yield()					 = 0 (0x0)
_umtx_op(0x8010661d0,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdecebd68) ERR#60 'Operation timed out'
_umtx_op(0x801066458,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdd2ac688) ERR#60 'Operation timed out'
sched_yield()					 = 0 (0x0)
_umtx_op(0x801066458,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdd2ac688) ERR#60 'Operation timed out'
sched_yield()					 = 0 (0x0)

A few lines from Minecraft on 13.2-RELEASE:

_umtx_op(0x18defb81b470,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x18e0d4980728) ERR#60 'Operation timed out'
sched_yield()					 = 0 (0x0)
_umtx_op(0x18defb81b470,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x18e0d4980728) ERR#60 'Operation timed out'
sched_yield()					 = 0 (0x0)
_umtx_op(0x18defb81b470,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x18e0d4980728) ERR#60 'Operation timed out'
sched_yield()					 = 0 (0x0)
_umtx_op(0x18defb81b470,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x18e0d4980728) ERR#60 'Operation timed out'
sched_yield()					 = 0 (0x0)

This is my first time using truss, or doing any sort of "debugging" so I'm not very familiar, if I didn't do something right please tell me and I'll try do it again properly.

I hope this helps with understanding.
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2023-04-13 16:35:27 UTC
So minecraft is generating many, many more calls to _umtx_op() than before.  If the kernel commit is reverted, and nothing else changes, does the number of calls stay the same?  If not, then there may indeed be a kernel bug.

It would be useful to see the timeout being passed.  Some output from a dtrace script like:

# dtrace -n 'fbt::do_wait:entry /args[3] && pid == <pid of minecraft process>/{print(*args[3]);}'

would be useful.
Comment 4 bllcha013 2023-04-13 20:15:10 UTC
(In reply to Mark Johnston from comment #3)

I cloned releng/13.2 and reverted commit e21c77f80c3b1cf2bd0a9f874e15e7a8e49f0dba. Reverting this commit fixes the regression.

I also did the dtrace output as asked - thank you for providing the dtrace script.


releng/13.2 (standard, same as RELEASE):

syscall                     seconds   calls  errors
__sysctlbyname          0.000650415       6       0
getrusage               0.008321625     116       0
sched_yield             1.287887942   29936       0
_umtx_op               41.138906134   30136   30132
                      ------------- ------- -------
                       42.435766116   60194   30132

dtrace: description 'fbt::do_wait:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  2  18111                    do_wait:entry struct _umtx_time {
    struct timespec _timeout = {
        time_t tv_sec = 0xc2
        long tv_nsec = 0x2ed64521
    }
    __uint32_t _flags = 0x1
    __uint32_t _clockid = 0x4
}
  2  18111                    do_wait:entry struct _umtx_time {
    struct timespec _timeout = {
        time_t tv_sec = 0xc2
        long tv_nsec = 0x2ed7d390
    }
    __uint32_t _flags = 0x1
    __uint32_t _clockid = 0x4
}


releng/13.2 (with commit e21c77f80c3b1cf2bd0a9f874e15e7a8e49f0dba reverted):

syscall                     seconds   calls  errors
getrusage               0.003569531      58       0
sched_yield             0.168773634    3781       0
_umtx_op               16.840977734    3873    3873
                      ------------- ------- -------
                       17.013320899    7712    3873

dtrace: description 'fbt::do_wait:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  4  18111                    do_wait:entry struct _umtx_time {
    struct timespec _timeout = {
        time_t tv_sec = 0x6f
        long tv_nsec = 0xba6df4a
    }
    __uint32_t _flags = 0x1
    __uint32_t _clockid = 0x4
}
  3  18111                    do_wait:entry struct _umtx_time {
    struct timespec _timeout = {
        time_t tv_sec = 0x6f
        long tv_nsec = 0x8bb061f
    }
    __uint32_t _flags = 0x1
    __uint32_t _clockid = 0x4
}


So it seems that reverting the commit fixes the issue. It also helps the UniFi Controller as well (although that wasn't impacted as drastically as Minecraft). This might be a kernel bug as you mentioned?
Comment 5 Mark Johnston freebsd_committer freebsd_triage 2023-04-13 22:31:52 UTC
So it looks like userspace is always specifying an absolute time.  The callout code will not specify a precision in that case, maybe that is relevant to the problem, or there is a bug in that code path.
Comment 6 bllcha013 2023-04-14 15:39:42 UTC
Interestingly, there is a bug report for Minecraft that seems related: https://bugs.mojang.com/browse/MC-183518

It seems that an extremely low timeout is causing massive numbers of calls per second. This issue wasn't noticeable before, but now is, probably due to the increased precision of timing allowed with the commit. For Linux, changing the value in /proc/<pid>/timerslack_ns apparently helps, but it isn't there on FreeBSD.

I noticed that the difference between "steps" of the tv_nsec value in the dtrace output was seemingly tied to kern.hz before the commit, so kern.hz=100 would give a 10ms step, and kern.hz=1000 would give a 1ms step, no matter what kern.eventtimer.periodic is set to, which I confirmed with dtrace. After the commit, kern.hz doesn't seem to have an impact when kern.eventtimer.periodic=0: the step is always 0.1ms. However, using kern.eventtimer.periodic=1 changes things: it becomes 0.5ms with kern.hz=1000 and 2.5ms with kern.hz=100.

From my (limited) understanding, the commit improves the precision of timed waits, but the old behavior inadvertently helped with CPU utilization and power efficiency by preventing software from being too abusive with short timeouts and constantly waking the CPU.

Please let me know if there is any other information I can provide.
Comment 7 Alexander Motin freebsd_committer freebsd_triage 2023-04-14 15:48:04 UTC
Your analysis exactly matches my expectations.  I'd say it is a Minecraft bug, not OS.  May be we could introduce some tunable to limit minimal sleep time, but I hate tunables in general.
Comment 8 bllcha013 2023-04-15 17:18:48 UTC
(In reply to Alexander Motin from comment #7)

A minimal sleep time tunable would be helpful as Linux seems to have something similar(ish), in /proc/<pid>/timerslack_ns which I tested and confirmed to help.

I think the ultimate solution would be for Mojang to adjust Minecraft server so that it doesn't call for such short sleep times, however, that bug report has been open for a long time (along with a few others) and the solution was basically "use /proc/<pid>/timerslack_ns".

I tried to open a bug with Mojang for it hoping that they will try and fix it this time, however, their bug reporting platform refuses to allow me to create an account at this time for some reason. I'll try again soon.

In any case, would a tunable for this be at all feasible?
Comment 9 Konstantin Belousov freebsd_committer freebsd_triage 2023-04-15 17:49:14 UTC
Try this https://reviews.freebsd.org/D39584
Comment 10 bllcha013 2023-04-16 08:50:51 UTC
Thank you, I applied the diff on a clean git clone on releng/13.2 and tested, but unfortunately it doesn't appear to have any effect. The sysctl is visible as kern.ipc.umtx_min_timeout. I set it to 10ms (10000000ns) by issuing "sysctl kern.ipc.umtx_min_timeout=10000000" but unfortunately there doesn't seem to be any effect. CPU usage stays the same. Here's the truss and dtrace output:

syscall                     seconds   calls  errors
__sysctlbyname          0.000986089       6       0
getrusage               0.008186177     106       0
sched_yield             1.172105081   28475       0
_umtx_op               30.594233465   28659   28654
                      ------------- ------- -------
                       31.775510812   57246   28654

dtrace: description 'fbt::do_wait:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  5  18113                    do_wait:entry struct _umtx_time {
    struct timespec _timeout = {
        time_t tv_sec = 0x16a
        long tv_nsec = 0x279c2e2a
    }
    __uint32_t _flags = 0x1
    __uint32_t _clockid = 0x4
}
  6  18113                    do_wait:entry struct _umtx_time {
    struct timespec _timeout = {
        time_t tv_sec = 0x16a
        long tv_nsec = 0x279dd9fd
    }
    __uint32_t _flags = 0x1
    __uint32_t _clockid = 0x4
}

It seems the high volume of syscalls stayed the same, with the same short 0.1ms delay.

Assuming this can be made to work though, this tunable would be really greatly appreciated.
Comment 11 Konstantin Belousov freebsd_committer freebsd_triage 2023-04-16 10:27:04 UTC
Yes, there was at least one conceptual bug in the patch which made it
non-functional.  Try the update please.
Comment 12 bllcha013 2023-04-16 12:48:19 UTC
Thank you, I applied the updated patch and tested, now Java seems to hang after Minecraft server has started when the sysctl kern.ipc.umtx_min_timeout is set to anything except 0. I tried 1, 10, 100000, 10000000, etc... which all resulted in a hang, but 0 is perfectly fine. These are the last few lines from truss when Java hangs and the truss output stops. WCPU usage of java process drops to 0%. Only way to stop it at this point is kill -9.

_umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAKE2,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAKE2,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x27cea36b0188,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
__sysctlbyname("vm.stats.vm.v_free_count",24,0x27d0328ecc04,0x27d0328ecbf8,0x0,0) = 0 (0x0)
__sysctlbyname("vm.stats.vm.v_cache_count",25,0x27d0328ecc04,0x27d0328ecbf8,0x0,0) = 0 (0x0)
__sysctlbyname("vm.stats.vm.v_inactive_count",28,0x27d0328ecc04,0x27d0328ecbf8,0x0,0) = 0 (0x0)
mprotect(0x27ceb6394000,65536,PROT_READ|PROT_WRITE|PROT_EXEC) = 0 (0x0)
__sysctlbyname("vm.stats.vm.v_free_count",24,0x27d047c69c04,0x27d047c69bf8,0x0,0) = 0 (0x0)
__sysctlbyname("vm.stats.vm.v_cache_count",25,0x27d047c69c04,0x27d047c69bf8,0x0,0) = 0 (0x0)
__sysctlbyname("vm.stats.vm.v_inactive_count",28,0x27d047c69c04,0x27d047c69bf8,0x0,0) = 0 (0x0)

In addition, when the sysctl is set to anything except 0, dtrace shows this output from the moment I start Java/Minecraft to the moment it freezes:

dtrace: description 'fbt::do_wait:entry ' matched 1 probe

If I set the sysctl back to 0, kill and restart Java/Minecraft, then dtrace shows lots of output as expected. As soon as I set the sysctl again, then dtrace output immediately stops as well as truss output, leaving the above. It's as if it is not entering do_wait at all when the sysctl is set?
Comment 13 Konstantin Belousov freebsd_committer freebsd_triage 2023-04-16 16:19:51 UTC
I misused abs time instead of the uptime for the timeout recalculation.
Please try the updated patch.
Comment 14 bllcha013 2023-04-16 18:02:06 UTC
Thank you, it works now. Setting kern.ipc.umtx_min_timeout to 10000000 (10ms) gives nice low CPU usage, and the CPU is spending much more time in low-power C-states. I'll keep an eye on it for a while more but for now it seems perfect.

I really appreciate this patch, and I'm sure others will find this very helpful.

May I ask if this patch will be included in an update or patch to FreeBSD?
Comment 15 iron.udjin 2023-04-17 08:23:28 UTC
Can we expect this patch in -STABLE? Thank you.
Comment 16 Konstantin Belousov freebsd_committer freebsd_triage 2023-04-17 18:40:11 UTC
Can people retest with the updated patch from
https://reviews.freebsd.org/D39584 ?

Now the env var LIBPTHREAD_UMTX_MIN_TIMEOUT should be set per-process,
instead of the global sysctl knob.
Comment 17 commit-hook freebsd_committer freebsd_triage 2023-04-18 23:23:29 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=93ca6ff2958cace3b7b883ed797ea6539881a6a2

commit 93ca6ff2958cace3b7b883ed797ea6539881a6a2
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-04-15 17:46:12 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2023-04-18 23:22:28 +0000

    umtx: allow to configure minimal timeout (in nanoseconds)

    PR:     270785
    Reviewed by:    markj, mav
    Sponsored by:   The FreeBSD Foundation
    MFC after:      1 week
    Differential revision:  https://reviews.freebsd.org/D39584

 lib/libc/sys/_umtx_op.2      | 11 +++++++++++
 lib/libthr/libthr.3          |  6 ++++++
 lib/libthr/thread/thr_init.c | 11 +++++++++++
 sys/kern/kern_thread.c       |  4 ++--
 sys/kern/kern_umtx.c         | 45 ++++++++++++++++++++++++++++++++++++++++++++
 sys/sys/proc.h               |  1 +
 sys/sys/umtx.h               |  2 ++
 7 files changed, 78 insertions(+), 2 deletions(-)
Comment 18 bllcha013 2023-04-20 14:40:57 UTC
(In reply to Konstantin Belousov from comment #16)

I tested it and it appears to work perfectly. No complaints here. Having it per-process is much appreciated.

May I ask if it will be included in an update for 13.2-RELEASE? Or will it wait until 14.0?
Comment 19 commit-hook freebsd_committer freebsd_triage 2023-04-25 00:39:10 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=68a69739cb902d083aacdb0974b9b957e0e1353b

commit 68a69739cb902d083aacdb0974b9b957e0e1353b
Author:     Konstantin Belousov <kib@FreeBSD.org>
AuthorDate: 2023-04-15 17:46:12 +0000
Commit:     Konstantin Belousov <kib@FreeBSD.org>
CommitDate: 2023-04-25 00:27:27 +0000

    umtx: allow to configure minimal timeout (in nanoseconds)

    PR:     270785

    (cherry picked from commit 93ca6ff2958cace3b7b883ed797ea6539881a6a2)

 lib/libc/sys/_umtx_op.2      | 11 +++++++++++
 lib/libthr/libthr.3          |  6 ++++++
 lib/libthr/thread/thr_init.c | 11 +++++++++++
 sys/kern/kern_fork.c         |  1 +
 sys/kern/kern_umtx.c         | 45 ++++++++++++++++++++++++++++++++++++++++++++
 sys/sys/proc.h               |  1 +
 sys/sys/umtx.h               |  2 ++
 7 files changed, 77 insertions(+)
Comment 20 bllcha013 2023-11-21 16:26:52 UTC
FreeBSD 14.0-RELEASE includes the LIBPTHREAD_UMTX_MIN_TIMEOUT per-process tunable, and setting it like LIBPTHREAD_UMTX_MIN_TIMEOUT=10000000 before the java command to start a Minecraft server for example, greatly improves power efficiency. It is working great now.

Thank you for this, and thank you to everyone working on FreeBSD.