In libc we have: 195 /** 196 * Returns the current locale for this thread, or the global locale if none is 197 * set. The caller does not have to free the locale. The return value from 198 * this call is not guaranteed to remain valid after the locale changes. As 199 * such, this should only be called within libc functions. 200 */ 201 static inline locale_t __get_locale(void) 202 { 203 204 if (!__has_thread_locale) { 205 return (&__xlocale_global_locale); 206 } 207 return (__thread_locale ? __thread_locale : &__xlocale_global_locale); 208 } Here, __has_thread_locale and __xlocale_globale_locale are global variables. In the common case, !__has_thread_locale is true. __thread_locale is a thread-local variable. This function is called any time MB_CUR_MAX is loaded, which may happen frequently (see PR 255551 for example). On main, __get_locale() compiles to this: 0x000000080115e300 <+0>: push %rbp 0x000000080115e301 <+1>: mov %rsp,%rbp 0x000000080115e304 <+4>: push %rbx 0x000000080115e305 <+5>: push %rax 0x000000080115e306 <+6>: mov 0x113fbb(%rip),%rbx # 0x8012722c8 0x000000080115e30d <+13>: data16 lea 0x113fa3(%rip),%rdi # 0x8012722b8 0x000000080115e315 <+21>: data16 data16 rex.W call 0x8012654b0 <__tls_get_addr@plt> 0x000000080115e31d <+29>: mov (%rax),%rax 0x000000080115e320 <+32>: test %rax,%rax 0x000000080115e323 <+35>: mov 0x113e6e(%rip),%rcx # 0x801272198 0x000000080115e32a <+42>: cmove %rcx,%rax 0x000000080115e32e <+46>: cmpl $0x0,(%rbx) 0x000000080115e331 <+49>: cmove %rcx,%rax 0x000000080115e335 <+53>: mov 0x18(%rax),%rax 0x000000080115e339 <+57>: mov 0x70(%rax),%eax 0x000000080115e33c <+60>: add $0x8,%rsp 0x000000080115e340 <+64>: pop %rbx 0x000000080115e341 <+65>: pop %rbp 0x000000080115e342 <+66>: ret In particular, the address of __thread_locale is obtained even if it isn't going to be used because no threads have set a per-thread locale using uselocale(3). But to obtain this address we have to call into rtld, and the call has a significant cost: a program which performs the comparison MB_CUR_MAX == 1 500,000,000 times runs in about 2.7s on my workstation. With libc modified to split the test of __thread_locale into a separate function, the runtime is reduced to 1.0s. I'm not quite sure why clang compiles __get_locale() this way. I presume it's to avoid branches, but it's quite suboptimal.
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=71ec05a21257e159f40d54e26ad0011bb19b5134 commit 71ec05a21257e159f40d54e26ad0011bb19b5134 Author: Cyril Zhang <cyril@freebsdfoundation.org> AuthorDate: 2021-05-13 12:55:06 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2021-05-13 13:33:19 +0000 sort: Cache value of MB_CUR_MAX Every usage of MB_CUR_MAX results in a call to __mb_cur_max. This is inefficient and redundant. Caching the value of MB_CUR_MAX in a global variable removes these calls and speeds up the runtime of sort. For numeric sorting, runtime is almost halved in some tests. PR: 255551 PR: 255840 Reviewed by: markj MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30170 usr.bin/sort/bwstring.c | 54 ++++++++++++++++++++++++------------------------ usr.bin/sort/bwstring.h | 9 ++++---- usr.bin/sort/radixsort.c | 4 ++-- usr.bin/sort/sort.c | 6 +++++- usr.bin/sort/sort.h | 6 ++++++ 5 files changed, 45 insertions(+), 34 deletions(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=df40dcbf7c794f5448c13e23670466658a620933 commit df40dcbf7c794f5448c13e23670466658a620933 Author: Cyril Zhang <cyril@freebsdfoundation.org> AuthorDate: 2021-05-13 12:55:06 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2021-05-20 13:15:43 +0000 sort: Cache value of MB_CUR_MAX Every usage of MB_CUR_MAX results in a call to __mb_cur_max. This is inefficient and redundant. Caching the value of MB_CUR_MAX in a global variable removes these calls and speeds up the runtime of sort. For numeric sorting, runtime is almost halved in some tests. PR: 255551 PR: 255840 Reviewed by: markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30170 (cherry picked from commit 71ec05a21257e159f40d54e26ad0011bb19b5134) usr.bin/sort/bwstring.c | 54 ++++++++++++++++++++++++------------------------ usr.bin/sort/bwstring.h | 9 ++++---- usr.bin/sort/radixsort.c | 4 ++-- usr.bin/sort/sort.c | 6 +++++- usr.bin/sort/sort.h | 6 ++++++ 5 files changed, 45 insertions(+), 34 deletions(-)
Configuring -ftls-model=initial-exec in the libc Makefile seems to fix it. I'm not sure if there might be some undesirable side-effects of that change though. OTOH, it is already used in jemalloc.
https://reviews.freebsd.org/D31070
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=9c97062b620137a1f7cad4c6b3fb030a396b3266 commit 9c97062b620137a1f7cad4c6b3fb030a396b3266 Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2021-07-16 02:35:28 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2021-07-16 02:41:10 +0000 libc: Use the initial-exec TLS model This permits more efficient accesses of thread-local variables, which are heavily used at least by jemalloc and locale-aware code. Note that on amd64 and i386, jemalloc's thread-local variables already have their TLS model overridden by defining JEMALLOC_TLS_MODEL. For now the change is applied only to tested platforms, but should in principle be enabled everywhere. PR: 255840 Suggested by: jrtc27 Reviewed by: kib MFC after: 2 months Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31070 lib/libc/Makefile | 7 +++++++ 1 file changed, 7 insertions(+)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=ad9f3a91d52105564dc32e5c0132377c74c3a204 commit ad9f3a91d52105564dc32e5c0132377c74c3a204 Author: Mark Johnston <markj@FreeBSD.org> AuthorDate: 2021-07-16 02:35:28 +0000 Commit: Mark Johnston <markj@FreeBSD.org> CommitDate: 2021-09-14 12:50:53 +0000 libc: Use the initial-exec TLS model This permits more efficient accesses of thread-local variables, which are heavily used at least by jemalloc and locale-aware code. Note that on amd64 and i386, jemalloc's thread-local variables already have their TLS model overridden by defining JEMALLOC_TLS_MODEL. For now the change is applied only to tested platforms, but should in principle be enabled everywhere. PR: 255840 Suggested by: jrtc27 Reviewed by: kib Sponsored by: The FreeBSD Foundation (cherry picked from commit 9c97062b620137a1f7cad4c6b3fb030a396b3266) lib/libc/Makefile | 7 +++++++ 1 file changed, 7 insertions(+)