Bug 272238 - False sharing with pthread rw and spin locks leads to severe perf degradation
Summary: False sharing with pthread rw and spin locks leads to severe perf degradation
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: threads
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Mateusz Guzik
URL:
Keywords: performance
Depends on:
Blocks:
 
Reported: 2023-06-26 23:41 UTC by Greg Becker
Modified: 2023-07-05 12:36 UTC (History)
4 users

See Also:


Attachments
Patch to ameliorate false sharing with pthread rw and spin locks. (1.10 KB, patch)
2023-06-26 23:41 UTC, Greg Becker
no flags
Patch to reduce false sharing with pthreads rw and spin locks (1.77 KB, patch)
2023-06-27 01:26 UTC, Greg Becker
no flags

Description Greg Becker 2023-06-26 23:41:16 UTC
Created attachment 243025 [details]
Patch to ameliorate false sharing with pthread rw and spin locks.

If an application allocates two or more pthread rwlocks (or spin locks) and then heavily accesses them from different CPUs, it is quite likely to experience severe performance degradation due to false sharing.
The problem is that these locks are small (36 and 32 bytes, respectively) and allocated on the heap via jemalloc(3).  Depending upon the state of the allocator, two locks may wind up within the same cache line, or straddling adjacent cache lines.
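The layout hazard can be sketched in a few lines of C (a toy illustration, not the reporter's benchmark; the 64-byte line size is an assumption that holds on typical amd64 parts):

```c
#include <stdint.h>

#define CACHE_LINE 64	/* assumed line size; common on amd64 */

/* Return 1 if two objects begin in the same cache line. */
static int
same_cache_line(const void *a, const void *b)
{
	return ((uintptr_t)a / CACHE_LINE) == ((uintptr_t)b / CACHE_LINE);
}

/*
 * Two 32-byte "locks" packed back to back, as a small-bin heap
 * allocator may place them, necessarily share one 64-byte line,
 * so writers on different CPUs ping-pong that line between caches.
 */
```

A thread hammering the lock at offset 0 and a thread hammering the lock at offset 32 then invalidate each other's cached copy on every store, even though they never touch the same lock.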

For example, if I initialize four rwlocks and hammer on them from four different CPUs (one lock per CPU, so that each lock is always uncontended), then on my dual E2697a 2.6GHz server I get about 10.52 million lock+inc+unlock calls per second.

With the attached patch, which rounds up the allocations to CACHE_LINE_SIZE, I get 47.68 million calls per second.  Similarly, for pthread spin locks I get about 4.53 and 50.94 million calls per second, respectively.

Overall, I am seeing roughly a 4.5x improvement with pthread rwlocks, and an 11.2x improvement with pthread spin locks.

The patch is very simple and ignores adjacent cacheline prefetch as seen on amd64 hardware.
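The fix is along these lines (a sketch under stated assumptions, not the committed diff; the helper name and the CACHE_LINE_SIZE value here are hypothetical): round the allocation size up to a whole number of cache lines and request matching alignment, so each lock owns its line. Per the commit message below, the committed version uses aligned_alloc(3) rather than plain malloc(3).

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64	/* assumption; FreeBSD defines this per-arch */

/*
 * Hypothetical helper: allocate an object padded out to whole cache
 * lines.  aligned_alloc(3) requires the size to be a multiple of the
 * alignment, so round the requested size up first.
 */
static void *
cacheline_alloc(size_t size)
{
	size_t rounded;

	rounded = (size + CACHE_LINE_SIZE - 1) &
	    ~(size_t)(CACHE_LINE_SIZE - 1);
	return (aligned_alloc(CACHE_LINE_SIZE, rounded));
}
```

With this, a 36-byte rwlock occupies a dedicated 64-byte line, so uncontended locks on different CPUs no longer invalidate each other's cache lines.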

Developed and tested on:

FreeBSD sm1.cc.codeconcepts.com 14.0-CURRENT FreeBSD 14.0-CURRENT #4 n263748-b95d2237af40: Mon Jun 26 17:08:50 CDT 2023     greg@sm1.cc.codeconcepts.com:/usr/obj/usr/src/amd64.amd64/sys/SM1 amd64

Passes the kyua test:

kyua test -k /usr/tests/lib/libthr/Kyuafile
Comment 1 Mina Galić freebsd_triage 2023-06-27 00:02:59 UTC
Hi Greg,

thanks for opening this issue.
We've been trying to discourage submitting patches through Bugzilla because they tend to get lost and forgotten.
This one is small and self-contained enough that, per our new contribution guidelines, it can be submitted via GitHub: https://docs.freebsd.org/en/articles/contributing/
Comment 2 Mateusz Guzik freebsd_committer freebsd_triage 2023-06-27 00:08:05 UTC
Huh, that code is pretty nasty all around -- even with just one lock in use, performance under contention is lower than it should be.

That said, your patch pushes it forward, even if one could raise some cosmetic concerns.

Preferably you would mail git format-patch output to me so that I can git am it; otherwise I'm going to write my own commit message and set you as the author.
Comment 3 Greg Becker 2023-06-27 01:26:59 UTC
Created attachment 243029 [details]
Patch to reduce false sharing with pthreads rw and spin locks
Comment 4 Greg Becker 2023-06-27 01:32:49 UTC
Mina, I can resubmit via github, thanks for the pointer!

Mateusz, I uploaded a git-format-patch and will email it to you as well.
Should I revert and go the github route, or just finish up with this one here on bugzilla?

Agreed, nasty.  I have a few more patches to improve mutex and overall performance of pthread locking.  Thought I would start with something simple first and get a feel for the process.
Comment 5 commit-hook freebsd_committer freebsd_triage 2023-06-27 11:57:07 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a6c0d801ca5934bb9b9cca6870ea7406d5db0641

commit a6c0d801ca5934bb9b9cca6870ea7406d5db0641
Author:     Greg Becker <becker.greg@att.net>
AuthorDate: 2023-06-27 01:08:29 +0000
Commit:     Mateusz Guzik <mjg@FreeBSD.org>
CommitDate: 2023-06-27 11:56:44 +0000

    libpthread: allocate rwlocks and spinlocks in dedicated cachelines

    Reduces severe performance degradation due to false-sharing. Note that this
    does not account for hardware which can perform adjacent cacheline prefetch.

    [mjg: massaged the commit message and the patch to use aligned_alloc
    instead of malloc]

    PR:     272238
    MFC after:      1 week

 lib/libthr/thread/thr_pspinlock.c | 3 ++-
 lib/libthr/thread/thr_rwlock.c    | 5 ++++-
 2 files changed, 6 insertions(+), 2 deletions(-)
Comment 6 commit-hook freebsd_committer freebsd_triage 2023-07-05 12:35:17 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=cbca929869a818c9b3353753df71210816a126d6

commit cbca929869a818c9b3353753df71210816a126d6
Author:     Greg Becker <becker.greg@att.net>
AuthorDate: 2023-06-27 01:08:29 +0000
Commit:     Mateusz Guzik <mjg@FreeBSD.org>
CommitDate: 2023-07-05 12:34:09 +0000

    libpthread: allocate rwlocks and spinlocks in dedicated cachelines

    Reduces severe performance degradation due to false-sharing. Note that this
    does not account for hardware which can perform adjacent cacheline prefetch.

    [mjg: massaged the commit message and the patch to use aligned_alloc
    instead of malloc]

    PR:     272238
    MFC after:      1 week

    (cherry picked from commit a6c0d801ca5934bb9b9cca6870ea7406d5db0641)

 lib/libthr/thread/thr_pspinlock.c | 3 ++-
 lib/libthr/thread/thr_rwlock.c    | 5 ++++-
 2 files changed, 6 insertions(+), 2 deletions(-)