Created attachment 236039 [details]
tarball of litmus tests and pdfs

1. smr_enter() ordering is incorrect on arm64 (and probably other weakly ordered systems)

```C
static inline void
smr_enter(smr_t smr)
{
	[...]
	/* This is an add because we do not have atomic_store_acq_int */
	atomic_add_acq_int(&smr->c_seq, smr_shared_current(smr->c_shared));
}
```

The comment here has a point: there is no atomic_store_acq_int because that would amount to a StoreLoad barrier, which `atomic_add_acq_int` isn't sufficient to provide. Here is a litmus test that shows why the code above isn't correct. Let's assume two threads running something akin to this:

start:
    global = 0
    rd_seq = 123
    wr_seq = 123

thread 0:
    store(&global, 2);
    smr_synchronize();

thread 1:
    smr_enter();
    load(&global);

It should NOT be possible for thread 1 to load `0` while smr_synchronize() concludes that thread 1 was not in a critical section pinning `123`.

Moreover, smr_poll() absolutely needs a full memory barrier on entry; the `atomic_load_acq_int` loads performed by smr_poll() aren't sufficient.

```
AArch64 MP
{
global = 0;
wr_seq = 123;
p1_rd_seq = 0;

0:x0 = global; 0:x1 = wr_seq; 0:x2 = p1_rd_seq;
1:x0 = global; 1:x1 = wr_seq; 1:x2 = p1_rd_seq;
}
 P0                  | P1            ;
 MOV X8, #2          | LDR X8, [X1]  ;
 STR X8, [X0]        | STR X8, [X2]  ;
 LDADDL X8, X9, [X1] | DMB SY        ;
 DMB SY              | LDR X10, [X0] ;
 LDR X10, [X2]       |               ;
exists (0:X10 = 0 /\ 1:X8 = 123 /\ 1:X10 = 0)
```

The litmus test above shows that the condition is unfortunately reachable; the one below passes (it adds a full barrier in smr_scan() and a full barrier in smr_enter()):

```
AArch64 MP
{
global = 0;
wr_seq = 123;
p1_rd_seq = 0;

0:x0 = global; 0:x1 = wr_seq; 0:x2 = p1_rd_seq;
1:x0 = global; 1:x1 = wr_seq; 1:x2 = p1_rd_seq;
}
 P0                  | P1            ;
 MOV X8, #2          | LDR X8, [X1]  ;
 STR X8, [X0]        | STR X8, [X2]  ;
 LDADDL X8, X9, [X1] | DMB SY        ;
 DMB SY              | LDR X10, [X0] ;
 LDR X10, [X2]       |               ;
exists (0:X10 = 0 /\ 1:X8 = 123 /\ 1:X10 = 0)
```

Note that I think the code works on Intel today.

2. Similarly, smr_poll_scan() needs a full barrier after the scan, _before_ it updates the global rd_seq. This is about serializing the fast path of smr_poll() with CPUs that weren't in a critical section (while the barrier on entry of smr_poll() is about synchronizing with the CPUs inside an active SMR critical section, as demonstrated in (1)). I confess that litmus test is much more painful to write and is left as an exercise to the reader ;)

3. smr_deferred_advance() is similarly incorrect, as it doesn't guarantee proper visibility in this case:

start:
    global = 0
    rd_seq = 123
    wr_seq = 123

thread 0:
    store(&global, 2);
    smr_deferred_advance(); // returns 125 but didn't update wr_seq

thread 1:
    smr_synchronize(); // 123 -> 125
    load(&global);

We should never hit a case where thread 1 didn't observe the global being 2 but thread 0's smr_deferred_advance() returned 125. Here is a model of this sequence, with the added `dmb sy` from (1) in smr_poll():

```
AArch64 MP
{
global = 0;
wr_seq = 123;

0:x0 = global; 0:x1 = wr_seq; 0:x2 = 2;
1:x0 = global; 1:x1 = wr_seq; 1:x2 = 2;
}
 P0             | P1                  ;
 STR X2, [X0]   | LDADDL X2, X9, [X1] ;
                | DMB SY              ;
 LDR X9, [X1]   | LDR X10, [X0]       ;
 ADD X9, X9, X2 |                     ;
exists (0:X9 = 125 /\ 1:X9 = 123 /\ 1:X10 = 0)
```

This unfortunately tells us that the condition is reachable, and intuitively it makes sense because there is nothing in P0 ordering anything.
The fix is that when a deferred advance is made, a full barrier must be issued (it has to be a StoreLoad barrier between the store to the global and the load of the write sequence), and indeed this litmus test passes:

```
AArch64 MP
{
global = 0;
wr_seq = 123;

0:x0 = global; 0:x1 = wr_seq; 0:x2 = 2;
1:x0 = global; 1:x1 = wr_seq; 1:x2 = 2;
}
 P0             | P1                  ;
 STR X2, [X0]   | LDADDL X2, X9, [X1] ;
 DMB SY         | DMB SY              ;
 LDR X9, [X1]   | LDR X10, [X0]       ;
 ADD X9, X9, X2 |                     ;
exists (0:X9 = 125 /\ 1:X9 = 123 /\ 1:X10 = 0)
```

Attached is a folder with the litmus tests and nice PDF renderings of why those barriers are needed on arm64.

The tentative fix IMO is something like this, though I don't use nor even know how to build FreeBSD, so it might have mistakes:

```diff
diff --git a/sys/kern/subr_smr.c b/sys/kern/subr_smr.c
index cbbf185fee79..f7c7ccef8d6e 100644
--- a/sys/kern/subr_smr.c
+++ b/sys/kern/subr_smr.c
@@ -307,8 +307,10 @@ static smr_seq_t
 smr_deferred_advance(smr_t smr, smr_shared_t s, smr_t self)
 {
-	if (++self->c_deferred < self->c_limit)
+	if (++self->c_deferred < self->c_limit) {
+		atomic_thread_fence_seq_cst();
 		return (smr_shared_current(s) + SMR_SEQ_INCR);
+	}
 	self->c_deferred = 0;
 	return (smr_default_advance(smr, s));
 }
@@ -427,6 +429,8 @@ smr_poll_scan(smr_t smr, smr_shared_t s, smr_seq_t s_rd_seq,
 	CRITICAL_ASSERT(curthread);
 	counter_u64_add_protected(poll_scan, 1);
 
+	atomic_thread_fence_seq_cst();
+
 	/*
 	 * The read sequence can be no larger than the write sequence at
 	 * the start of the poll.
@@ -450,6 +454,8 @@ smr_poll_scan(smr_t smr, smr_shared_t s, smr_seq_t s_rd_seq,
 		rd_seq = SMR_SEQ_MIN(rd_seq, c_seq);
 	}
 
+	atomic_thread_fence_seq_cst();
+
 	/*
 	 * Advance the rd_seq as long as we observed a more recent value.
 	 */
diff --git a/sys/sys/smr.h b/sys/sys/smr.h
index c110be9a66c2..3d543d1979ae 100644
--- a/sys/sys/smr.h
+++ b/sys/sys/smr.h
@@ -133,7 +133,12 @@ smr_enter(smr_t smr)
 	 * is detected and handled there.
 	 */
 	/* This is an add because we do not have atomic_store_acq_int */
+#if __x86_64__ || __i386__
 	atomic_add_acq_int(&smr->c_seq, smr_shared_current(smr->c_shared));
+#else
+	atomic_store_int(&smr->c_seq, smr_shared_current(smr->c_shared));
+	atomic_thread_fence_seq_cst();
+#endif
 }
 
 /*
```
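For readers who prefer C over herd syntax, the point-1 pairing can also be sketched as a standalone C11 program. The variable names, the reader()/writer() split, and the increment of 2 (standing in for SMR_SEQ_INCR) are illustrative assumptions only, not the actual subr_smr.c code:

```C
/*
 * Minimal C11 sketch of the point-1 scenario (illustrative names only,
 * not the real FreeBSD code).
 */
#include <stdatomic.h>

static _Atomic int global;		/* protected data, initially 0 */
static _Atomic int wr_seq = 123;	/* shared write sequence */
static _Atomic int c_seq;		/* the reader CPU's published sequence */

/* Roughly what smr_enter() followed by a protected load must guarantee. */
static int
reader(void)
{
	int s = atomic_load_explicit(&wr_seq, memory_order_relaxed);

	atomic_store_explicit(&c_seq, s, memory_order_relaxed);
	/*
	 * StoreLoad: without this fence the load of 'global' below can be
	 * satisfied before the store to c_seq becomes visible to the
	 * writer, which is exactly what the first litmus test exhibits.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	return (atomic_load_explicit(&global, memory_order_relaxed));
}

/* Roughly what a writer doing store + smr_synchronize() relies on. */
static int
writer(void)
{
	atomic_store_explicit(&global, 2, memory_order_relaxed);
	atomic_fetch_add_explicit(&wr_seq, 2, memory_order_release);
	atomic_thread_fence(memory_order_seq_cst);	/* barrier before the scan */
	/* Scan: c_seq == 0 is read as "this CPU pins no sequence". */
	return (atomic_load_explicit(&c_seq, memory_order_acquire));
}
```

With both seq_cst fences in place, the forbidden outcome of the litmus test (writer() observing c_seq == 0 while reader() published 123 and still read global == 0) cannot occur; drop the fence in reader() and it can.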
Whoops, the `smr_enter-bad.litmus` has a dmb sy that isn't there today; it should be:

```
AArch64 MP
{
global = 0;
wr_seq = 123;
p1_rd_seq = 0;

0:x0 = global; 0:x1 = wr_seq; 0:x2 = p1_rd_seq;
1:x0 = global; 1:x1 = wr_seq; 1:x2 = p1_rd_seq;
}
 P0                  | P1                  ;
 MOV X8, #2          | LDR X8, [X1]        ;
 STR X8, [X0]        |                     ;
 LDADDL X8, X9, [X1] | LDADDA X8, X9, [X2] ;
                     | LDR X10, [X0]       ;
 LDR X10, [X2]       |                     ;
exists (0:X10 = 0 /\ 1:X8 = 123 /\ 1:X10 = 0)
```

I'll update the tarball.
Created attachment 236040 [details]
proper litmus test
I believe you are right, but I would wait for the opinion of people who have spent time with the smr code.

That said, atomic_add_acq on x86 already has sequentially consistent semantics (which is why you said that it works on Intel, right?). So the #ifdef from the patch in smr_enter() is not needed; use a seq_cst fence for all arches.
I certainly agree with the comment about smr_enter(). On at least arm64, atomic_add_acq_int()'s acquire semantics apply only to the load; subsequent loads can be reordered with the store. I believe this is true with LSE atomics as well as the original LL/SC atomics. I also agree that the code is fine on Intel, since atomic instructions prevent store-load reordering. Our atomic(9) documentation doesn't really call out this subtlety. We might have similar bugs elsewhere.

I need to think about smr_poll() a bit more.

> Moreover, smr_poll() absolutely needs a full memory barrier on entry; the `atomic_load_acq_int` loads performed by smr_poll() aren't sufficient.

The suggested patch adds two barriers to smr_poll_scan() and none to smr_poll(). Is that intentional?
(In reply to Konstantin Belousov from comment #3)
> That said, atomic_add_acq on x86 already has sequentially consistent semantics (which is why you said that it works on Intel, right?). So the #ifdef from the patch in smr_enter() is not needed; use a seq_cst fence for all arches.

The ifdef makes some sense as an optimization. On x86 we can combine the store and the barrier into one instruction, so why not do that?
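Along those lines, here is a sketch of what that combined form might look like. The function name is hypothetical, it assumes c_seq is SMR_SEQ_INVALID (0) on entry so that an add amounts to a store, and it is not a claim about what will actually be committed:

```C
static inline void
smr_enter_sketch(smr_t smr)
{
#if defined(__amd64__) || defined(__i386__)
	/*
	 * On x86 a locked read-modify-write is a full barrier, so a single
	 * instruction both publishes the read sequence and orders the
	 * loads that follow it.
	 */
	atomic_add_int(&smr->c_seq, smr_shared_current(smr->c_shared));
#else
	atomic_store_int(&smr->c_seq, smr_shared_current(smr->c_shared));
	atomic_thread_fence_seq_cst();
#endif
}
```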
(In reply to Mark Johnston from comment #4)
> The suggested patch adds two barriers to smr_poll_scan() and none to smr_poll(). Is that intentional?

Yes, I cover this in my point (2):

> 2. Similarly, smr_poll_scan() needs a full barrier after the scan, _before_ it updates the global rd_seq. This is about serializing the fast path of smr_poll() with CPUs that weren't in a critical section (while the barrier on entry of smr_poll() is about synchronizing with the CPUs inside an active SMR critical section, as demonstrated in (1)).

Fundamentally, the first barrier is about sequencing the "scan" operation with the CPUs actively inside an SMR section (it basically pairs with their smr_enter()), while the second barrier is about all the sections we ignored because they weren't active, and pairs with the last `smr_leave()` that that core executed. Writing the litmus test to exhibit this problem was more work than I was willing to put in.

For full disclosure, I am the author of XNU's smr.[hc] and am reporting this bug as a courtesy, given that FreeBSD was the inspiration. In XNU I'm definitely going to add both barriers (the current open source drop doesn't have them, but as I was reasoning through this code recently I convinced myself they were needed).
(In reply to Pierre Habouzit from comment #6)
Oh, and sorry, maybe the barriers need to go into smr_poll() instead of smr_poll_scan(); the patch I pasted is not "serious", just a general outline of what I think the fix should look like.
(In reply to Pierre Habouzit from comment #6)
I'm sorry, but I still don't quite see the problem. I spent some time trying to write litmus tests to explore interactions between readers, writers, and smr_poll(), but didn't find anything surprising (yet). For now I'm ignoring the deferred and lazy modes of SMR since they're not used in FreeBSD currently.

I note that your litmus tests don't seem to take into account the fact that smr_advance() issues an atomic_thread_fence_rel() before updating the write sequence. (On arm64 this expands to DMB SY, the same as atomic_thread_fence_seq_cst().) With smr_enter() patched, both readers and writers issue a full memory barrier on arm64.

What I don't really understand is why readers need to synchronize with smr_poll() at all. Even with the barriers you proposed, it's possible for a reader to store a snapshot of wr_seq that is older than the global rd_seq. This could happen if an interrupt or vmexit occurs between the load of wr_seq and the store to per-CPU memory. So long as smr_poll() does not advance rd_seq beyond that of any CPU actively executing a read section, it's ok, and I don't see how that can happen. What am I missing?

Thanks for your patience and for taking the time to report these problems. I posted a first patch for review here: https://reviews.freebsd.org/D36370
`atomic_thread_fence_rel()` is a `dmb sy`? I would have expected that to be a `dmb st`... but indeed it shows I'm not familiar with FreeBSD:

```C
static __inline void
atomic_thread_fence_rel(void)
{
	dmb(sy);
}
```

And indeed, that makes the deferred advance correct. It also means smr_synchronize() already has stronger semantics than I expected, and it makes the barrier on entry of smr_poll() unnecessary, because the `atomic_load_acq_int()` in smr_poll() already gives the semantics I was after.

I think you might be right that the barrier on exit of smr_poll() isn't quite needed either. I think I just didn't think it through: I took `ck_epoch_synchronize_wait` as an example and figured "it needs one, so smr_scan does too", but trying to come up with a reason why it breaks yields nothing.
(In reply to Pierre Habouzit from comment #9) dmb st only orders stores, but that's not sufficient to implement a release fence. For what it's worth, LLVM on FreeBSD 14/arm64 compiles atomic_thread_fence(memory_order_release) to a "dmb ish", same as atomic_thread_fence(memory_order_seq_cst).
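A quick way to double-check this on any given toolchain is to compile a trivial translation unit and inspect the assembly; the file name and command below are just an example (e.g. `cc -O2 -S fences.c` on an arm64 machine):

```C
/* fences.c: see which dmb variant the compiler emits for each fence. */
#include <stdatomic.h>

void
rel_fence(void)
{
	atomic_thread_fence(memory_order_release);
}

void
seq_cst_fence(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}
```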
I do think the deferred advance might still be incorrect on platforms where FreeBSD's atomic_thread_fence_rel() does not provide a store/load barrier, such as POWER.
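For reference, a platform-independent variant of the deferred advance would be essentially the first hunk of the diff earlier in this bug, shown here in full. The function name is hypothetical and this is only a sketch, not a tested change:

```C
static smr_seq_t
smr_deferred_advance_sketch(smr_t smr, smr_shared_t s, smr_t self)
{

	if (++self->c_deferred < self->c_limit) {
		/*
		 * StoreLoad: order the caller's prior stores before the
		 * read of the current write sequence on all platforms,
		 * not only those where a release fence is a full barrier.
		 */
		atomic_thread_fence_seq_cst();
		return (smr_shared_current(s) + SMR_SEQ_INCR);
	}
	self->c_deferred = 0;
	return (smr_default_advance(smr, s));
}
```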
Wow, TIL. That's the same on Darwin, huh.
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=8694fd333556addb97acfff1feca6a1e389201ce

commit 8694fd333556addb97acfff1feca6a1e389201ce
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-09-24 13:18:04 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-09-24 13:18:04 +0000

    smr: Fix synchronization in smr_enter()

    smr_enter() must publish its observed read sequence number before
    issuing any subsequent memory operations. The ordering provided by
    atomic_add_acq_int() is insufficient on some platforms, at least on
    arm64, because it permits reordering of subsequent loads with the
    store to c_seq. Thus, use atomic_thread_fence_seq_cst() to issue a
    store-load barrier after publishing the read sequence number.

    On x86, take advantage of the fact that memory operations are not
    reordered with locked instructions to improve code density: we can
    store the observed read sequence and provide a store-load barrier
    with a single operation.

    Based on a patch from Pierre Habouzit <pierre@habouzit.net>.

    PR:             265974
    Reviewed by:    alc
    MFC after:      2 weeks
    Differential Revision:  https://reviews.freebsd.org/D36370

 sys/sys/smr.h | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=65641500dbb7a93de4a6195de90c507f9d629b84

commit 65641500dbb7a93de4a6195de90c507f9d629b84
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-09-24 13:18:04 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-10-11 13:37:15 +0000

    smr: Fix synchronization in smr_enter()

    smr_enter() must publish its observed read sequence number before
    issuing any subsequent memory operations. The ordering provided by
    atomic_add_acq_int() is insufficient on some platforms, at least on
    arm64, because it permits reordering of subsequent loads with the
    store to c_seq. Thus, use atomic_thread_fence_seq_cst() to issue a
    store-load barrier after publishing the read sequence number.

    On x86, take advantage of the fact that memory operations are not
    reordered with locked instructions to improve code density: we can
    store the observed read sequence and provide a store-load barrier
    with a single operation.

    Based on a patch from Pierre Habouzit <pierre@habouzit.net>.

    PR:             265974
    Reviewed by:    alc

    (cherry picked from commit 8694fd333556addb97acfff1feca6a1e389201ce)

 sys/sys/smr.h | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)
Thank you for the report and litmus tests.