Bug 255840 - __get_locale() is inefficient
Summary: __get_locale() is inefficient
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: CURRENT
Hardware: amd64 Any
: --- Affects Some People
Assignee: Mark Johnston
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-05-13 13:25 UTC by Mark Johnston
Modified: 2021-09-14 12:53 UTC (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Johnston freebsd_committer 2021-05-13 13:25:20 UTC
In libc we have:

195 /**                                                                                                                                                                                                                                                                                                                       
196  * Returns the current locale for this thread, or the global locale if none is                                                                                                                                                                                                                                            
197  * set.  The caller does not have to free the locale.  The return value from                                                                                                                                                                                                                                              
198  * this call is not guaranteed to remain valid after the locale changes.  As                                                                                                                                                                                                                                              
199  * such, this should only be called within libc functions.                                                                                                                                                                                                                                                                
200  */                                                                                                                                                                                                                                                                                                                       
201 static inline locale_t __get_locale(void)                                                                                                                                                                                                                                                                                 
202 {                                                                                                                                                                                                                                                                                                                         
203                                                                                                                                                                                                                                                                                                                           
204         if (!__has_thread_locale) {                                                                                                                                                                                                                                                                                       
205                 return (&__xlocale_global_locale);                                                                                                                                                                                                                                                                        
206         }                                                                                                                                                                                                                                                                                                                 
207         return (__thread_locale ? __thread_locale : &__xlocale_global_locale);                                                                                                                                                                                                                                            
208 }

Here, __has_thread_locale and __xlocale_globale_locale are global variables.  In the common case, !__has_thread_locale is true.  __thread_locale is a thread-local variable.

This function is called any time MB_CUR_MAX is loaded, which may happen frequently (see PR 255551 for example).

On main, __get_locale() compiles to this:

   0x000000080115e300 <+0>:     push   %rbp
   0x000000080115e301 <+1>:     mov    %rsp,%rbp
   0x000000080115e304 <+4>:     push   %rbx
   0x000000080115e305 <+5>:     push   %rax
   0x000000080115e306 <+6>:     mov    0x113fbb(%rip),%rbx        # 0x8012722c8
   0x000000080115e30d <+13>:    data16 lea 0x113fa3(%rip),%rdi        # 0x8012722b8
   0x000000080115e315 <+21>:    data16 data16 rex.W call 0x8012654b0 <__tls_get_addr@plt>
   0x000000080115e31d <+29>:    mov    (%rax),%rax
   0x000000080115e320 <+32>:    test   %rax,%rax
   0x000000080115e323 <+35>:    mov    0x113e6e(%rip),%rcx        # 0x801272198
   0x000000080115e32a <+42>:    cmove  %rcx,%rax
   0x000000080115e32e <+46>:    cmpl   $0x0,(%rbx)
   0x000000080115e331 <+49>:    cmove  %rcx,%rax
   0x000000080115e335 <+53>:    mov    0x18(%rax),%rax
   0x000000080115e339 <+57>:    mov    0x70(%rax),%eax
   0x000000080115e33c <+60>:    add    $0x8,%rsp
   0x000000080115e340 <+64>:    pop    %rbx
   0x000000080115e341 <+65>:    pop    %rbp
   0x000000080115e342 <+66>:    ret    

In particular, the address of __thread_locale is obtained even if it isn't going to be used because no threads have set a per-thread locale using uselocale(3).  But to obtain this address we have to call into rtld, and the call has a significant cost: a program which performs the comparison MB_CUR_MAX == 1 500,000,000 times runs in about 2.7s on my workstation.  With libc modified to split the test of __thread_locale into a separate function, the runtime is reduced to 1.0s.

I'm not quite sure why clang compiles __get_locale() this way. I presume it's to avoid branches, but it's quite suboptimal.
Comment 1 commit-hook freebsd_committer 2021-05-13 13:35:56 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=71ec05a21257e159f40d54e26ad0011bb19b5134

commit 71ec05a21257e159f40d54e26ad0011bb19b5134
Author:     Cyril Zhang <cyril@freebsdfoundation.org>
AuthorDate: 2021-05-13 12:55:06 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-05-13 13:33:19 +0000

    sort: Cache value of MB_CUR_MAX

    Every usage of MB_CUR_MAX results in a call to __mb_cur_max.  This is
    inefficient and redundant.  Caching the value of MB_CUR_MAX in a global
    variable removes these calls and speeds up the runtime of sort.  For
    numeric sorting, runtime is almost halved in some tests.

    PR:             255551
    PR:             255840
    Reviewed by:    markj
    MFC after:      1 week
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D30170

 usr.bin/sort/bwstring.c  | 54 ++++++++++++++++++++++++------------------------
 usr.bin/sort/bwstring.h  |  9 ++++----
 usr.bin/sort/radixsort.c |  4 ++--
 usr.bin/sort/sort.c      |  6 +++++-
 usr.bin/sort/sort.h      |  6 ++++++
 5 files changed, 45 insertions(+), 34 deletions(-)
Comment 2 commit-hook freebsd_committer 2021-05-20 14:11:01 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=df40dcbf7c794f5448c13e23670466658a620933

commit df40dcbf7c794f5448c13e23670466658a620933
Author:     Cyril Zhang <cyril@freebsdfoundation.org>
AuthorDate: 2021-05-13 12:55:06 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-05-20 13:15:43 +0000

    sort: Cache value of MB_CUR_MAX

    Every usage of MB_CUR_MAX results in a call to __mb_cur_max.  This is
    inefficient and redundant.  Caching the value of MB_CUR_MAX in a global
    variable removes these calls and speeds up the runtime of sort.  For
    numeric sorting, runtime is almost halved in some tests.

    PR:             255551
    PR:             255840
    Reviewed by:    markj
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D30170

    (cherry picked from commit 71ec05a21257e159f40d54e26ad0011bb19b5134)

 usr.bin/sort/bwstring.c  | 54 ++++++++++++++++++++++++------------------------
 usr.bin/sort/bwstring.h  |  9 ++++----
 usr.bin/sort/radixsort.c |  4 ++--
 usr.bin/sort/sort.c      |  6 +++++-
 usr.bin/sort/sort.h      |  6 ++++++
 5 files changed, 45 insertions(+), 34 deletions(-)
Comment 3 Mark Johnston freebsd_committer 2021-05-21 21:58:39 UTC
Configuring -ftls-model=initial-exec in the libc Makefile seems to fix it.  I'm not sure if there might be some undesirable side-effects of that change though.  OTOH, it is already used in jemalloc.
Comment 4 Mark Johnston freebsd_committer 2021-07-05 20:51:17 UTC
https://reviews.freebsd.org/D31070
Comment 5 commit-hook freebsd_committer 2021-07-16 02:43:16 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=9c97062b620137a1f7cad4c6b3fb030a396b3266

commit 9c97062b620137a1f7cad4c6b3fb030a396b3266
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-07-16 02:35:28 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-07-16 02:41:10 +0000

    libc: Use the initial-exec TLS model

    This permits more efficient accesses of thread-local variables, which
    are heavily used at least by jemalloc and locale-aware code.  Note that
    on amd64 and i386, jemalloc's thread-local variables already have their
    TLS model overridden by defining JEMALLOC_TLS_MODEL.

    For now the change is applied only to tested platforms, but should in
    principle be enabled everywhere.

    PR:             255840
    Suggested by:   jrtc27
    Reviewed by:    kib
    MFC after:      2 months
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D31070

 lib/libc/Makefile | 7 +++++++
 1 file changed, 7 insertions(+)
Comment 6 commit-hook freebsd_committer 2021-09-14 12:52:32 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=ad9f3a91d52105564dc32e5c0132377c74c3a204

commit ad9f3a91d52105564dc32e5c0132377c74c3a204
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2021-07-16 02:35:28 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2021-09-14 12:50:53 +0000

    libc: Use the initial-exec TLS model

    This permits more efficient accesses of thread-local variables, which
    are heavily used at least by jemalloc and locale-aware code.  Note that
    on amd64 and i386, jemalloc's thread-local variables already have their
    TLS model overridden by defining JEMALLOC_TLS_MODEL.

    For now the change is applied only to tested platforms, but should in
    principle be enabled everywhere.

    PR:             255840
    Suggested by:   jrtc27
    Reviewed by:    kib
    Sponsored by:   The FreeBSD Foundation

    (cherry picked from commit 9c97062b620137a1f7cad4c6b3fb030a396b3266)

 lib/libc/Makefile | 7 +++++++
 1 file changed, 7 insertions(+)