Bug 265974 - SMR has several missing barriers
Summary: SMR has several missing barriers
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: arm64 Any
Importance: --- Affects Many People
Assignee: Mark Johnston
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-08-21 14:47 UTC by Pierre Habouzit
Modified: 2022-10-12 15:54 UTC
CC List: 6 users

See Also:


Attachments
tarball of litmus tests and pdfs (82.00 KB, application/x-tar)
2022-08-21 14:47 UTC, Pierre Habouzit
proper litmus test (77.00 KB, application/x-tar)
2022-08-21 14:55 UTC, Pierre Habouzit

Description Pierre Habouzit 2022-08-21 14:47:25 UTC
Created attachment 236039 [details]
tarball of litmus tests and pdfs

1. smr_enter() ordering is incorrect on arm64 (and probably other weak memory order systems)

```C
static inline void
smr_enter(smr_t smr)
{
[...]
	/* This is an add because we do not have atomic_store_acq_int */
	atomic_add_acq_int(&smr->c_seq, smr_shared_current(smr->c_shared));
}
```

The comment here has a point: there is no atomic_store_acq_int. What is actually required, though, is a StoreLoad barrier, and `atomic_add_acq_int` is not sufficient to provide one. Here is a litmus test that shows why the code above isn't correct. Let's assume two threads running something akin to this:

start:
    global = 0
    rd_seq = 123
    wr_seq = 123

thread 0:
    store(&global, 2);
    smr_synchronize();

thread 1:
    smr_enter();
    load(&global);


It should NOT be possible for thread 1 to load `0` from `global` while smr_synchronize() on thread 0 concludes that thread 1 was not in a critical section pinning `123`.

Moreover, smr_poll() absolutely needs a full memory barrier on entry; the `atomic_load_acq_int` calls performed by smr_poll() are not sufficient.


```
AArch64 MP
{
    global = 0;
    wr_seq = 123;
    p1_rd_seq = 0;

    0:x0 = global; 0:x1 = wr_seq; 0:x2 = p1_rd_seq;
    1:x0 = global; 1:x1 = wr_seq; 1:x2 = p1_rd_seq;
}
 P0                     | P1                         ;
 MOV      X8, #2        | LDR        X8, [X1]        ;
 STR      X8, [X0]      | STR        X8, [X2]        ;
 LDADDL   X8, X9, [X1]  | DMB        SY              ;
 DMB      SY            | LDR        X10, [X0]       ;
 LDR      X10, [X2]     |                            ;
exists (0:X10 = 0 /\ 1:X8 = 123 /\ 1:X10 = 0)
```

The litmus test above shows that the condition is unfortunately reachable; the one below, which adds a full barrier in smr_scan() and a full barrier in smr_enter(), passes:

```
AArch64 MP
{
    global = 0;
    wr_seq = 123;
    p1_rd_seq = 0;

    0:x0 = global; 0:x1 = wr_seq; 0:x2 = p1_rd_seq;
    1:x0 = global; 1:x1 = wr_seq; 1:x2 = p1_rd_seq;
}
 P0                     | P1                         ;
 MOV      X8, #2        | LDR        X8, [X1]        ;
 STR      X8, [X0]      | STR        X8, [X2]        ;
 LDADDL   X8, X9, [X1]  | DMB        SY              ;
 DMB      SY            | LDR        X10, [X0]       ;
 LDR      X10, [X2]     |                            ;
exists (0:X10 = 0 /\ 1:X8 = 123 /\ 1:X10 = 0)
```


Note that I think the code works on Intel today.
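
The ordering smr_enter() needs can be expressed with C11 atomics roughly as follows (a conceptual sketch only, not FreeBSD code; the names are illustrative): publish the observed write sequence, then issue a full StoreLoad fence before touching any protected data.

```C
#include <stdatomic.h>
#include <stdint.h>

static inline void
smr_enter_sketch(_Atomic uint32_t *c_seq, _Atomic uint32_t *wr_seq)
{
	uint32_t seq;

	/* Snapshot the global write sequence. */
	seq = atomic_load_explicit(wr_seq, memory_order_acquire);
	/* Publish it in per-CPU state... */
	atomic_store_explicit(c_seq, seq, memory_order_relaxed);
	/*
	 * ...and order that store before any load performed inside the
	 * critical section.  This is the StoreLoad ordering that an
	 * acquire add does not provide.
	 */
	atomic_thread_fence(memory_order_seq_cst);
}
```

On arm64 the fence costs a full `dmb`; on x86 the same ordering already falls out of a locked read-modify-write, which is why the existing `atomic_add_acq_int` happens to be enough there.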

2. Similarly, smr_poll_scan() needs a full barrier after the scan and _before_ it updates the global rd_seq. This barrier serializes the fast path of smr_poll() with CPUs that were not in a critical section, while the barrier on entry to smr_poll() synchronizes with the CPUs inside an active SMR critical section and was demonstrated in (1).

I confess that this litmus test is much more painful to write, so it is left as an exercise for the reader ;)


3. smr_deferred_advance() is similarly incorrect as it doesn't guarantee proper visibility in this case:

start:
    global = 0
    rd_seq = 123
    wr_seq = 123

thread 0:
    store(&global, 2);
    smr_deferred_advance(); // returns 125 but didn't update wr_seq

thread 1:
    smr_synchronize(); // 123 -> 125
    load(&global);

We should never hit a case where thread 1 didn't observe the global being 2 yet thread 0's deferred advance returned 125.

Here is a model of this sequence, with the `dmb sy` from (1) added in smr_poll():

```
AArch64 MP
{
    global = 0;
    wr_seq = 123;

    0:x0 = global; 0:x1 = wr_seq; 0:x2 = 2;
    1:x0 = global; 1:x1 = wr_seq; 1:x2 = 2;
}
 P0                     | P1                         ;
 STR      X2, [X0]      | LDADDL     X2, X9, [X1]    ;
                        | DMB        SY              ;
 LDR      X9, [X1]      | LDR        X10, [X0]       ;
 ADD      X9, X9, X2    |                            ;
exists (0:X9 = 125 /\ 1:X9 = 123 /\ 1:X10 = 0)
```

This unfortunately tells us that the condition is reachable, and intuitively that makes sense: nothing in "P0" orders anything. The fix is that when a deferred advance is made, a full barrier must be issued (it has to be a StoreLoad barrier between the store to the global and the load of the write sequence), and indeed this litmus test passes:

```
AArch64 MP
{
    global = 0;
    wr_seq = 123;

    0:x0 = global; 0:x1 = wr_seq; 0:x2 = 2;
    1:x0 = global; 1:x1 = wr_seq; 1:x2 = 2;
}
 P0                     | P1                         ;
 STR      X2, [X0]      | LDADDL     X2, X9, [X1]    ;
 DMB      SY            | DMB        SY              ;
 LDR      X9, [X1]      | LDR        X10, [X0]       ;
 ADD      X9, X9, X2    |                            ;
exists (0:X9 = 125 /\ 1:X9 = 123 /\ 1:X10 = 0)
```

Attached is a tarball with the litmus tests and PDF renderings of why those barriers are needed on arm64.

The tentative fix, IMO, is something like this, though I don't use (nor even know how to build) FreeBSD, so it might have mistakes:

```diff
diff --git a/sys/kern/subr_smr.c b/sys/kern/subr_smr.c
index cbbf185fee79..f7c7ccef8d6e 100644
--- a/sys/kern/subr_smr.c
+++ b/sys/kern/subr_smr.c
@@ -307,8 +307,10 @@ static smr_seq_t
 smr_deferred_advance(smr_t smr, smr_shared_t s, smr_t self)
 {
 
-	if (++self->c_deferred < self->c_limit)
+	if (++self->c_deferred < self->c_limit) {
+		atomic_thread_fence_seq_cst();
 		return (smr_shared_current(s) + SMR_SEQ_INCR);
+	}
 	self->c_deferred = 0;
 	return (smr_default_advance(smr, s));
 }
@@ -427,6 +429,8 @@ smr_poll_scan(smr_t smr, smr_shared_t s, smr_seq_t s_rd_seq,
 	CRITICAL_ASSERT(curthread);
 	counter_u64_add_protected(poll_scan, 1);
 
+	atomic_thread_fence_seq_cst();
+
 	/*
 	 * The read sequence can be no larger than the write sequence at
 	 * the start of the poll.
@@ -450,6 +454,8 @@ smr_poll_scan(smr_t smr, smr_shared_t s, smr_seq_t s_rd_seq,
 			rd_seq = SMR_SEQ_MIN(rd_seq, c_seq);
 	}
 
+	atomic_thread_fence_seq_cst();
+
 	/*
 	 * Advance the rd_seq as long as we observed a more recent value.
 	 */
diff --git a/sys/sys/smr.h b/sys/sys/smr.h
index c110be9a66c2..3d543d1979ae 100644
--- a/sys/sys/smr.h
+++ b/sys/sys/smr.h
@@ -133,7 +133,12 @@ smr_enter(smr_t smr)
 	 * is detected and handled there.
 	 */
 	/* This is an add because we do not have atomic_store_acq_int */
+#if __x86_64__ || __i386__
 	atomic_add_acq_int(&smr->c_seq, smr_shared_current(smr->c_shared));
+#else
+	atomic_store_int(&smr->c_seq, smr_shared_current(smr->c_shared));
+	atomic_thread_fence_seq_cst();
+#endif
 }
 
 /*
```
Comment 1 Pierre Habouzit 2022-08-21 14:55:29 UTC
Whoops, the `smr_enter-bad.litmus` test has a `dmb sy` that isn't in today's code; it should be:


```
AArch64 MP
{
    global = 0;
    wr_seq = 123;
    p1_rd_seq = 0;

    0:x0 = global; 0:x1 = wr_seq; 0:x2 = p1_rd_seq;
    1:x0 = global; 1:x1 = wr_seq; 1:x2 = p1_rd_seq;
}
 P0                     | P1                         ;
 MOV      X8, #2        | LDR        X8, [X1]        ;
 STR      X8, [X0]      |                            ;
 LDADDL   X8, X9, [X1]  | LDADDA     X8, X9, [X2]    ;
                        | LDR        X10, [X0]       ;
 LDR      X10, [X2]     |                            ;
exists (0:X10 = 0 /\ 1:X8 = 123 /\ 1:X10 = 0)
```

I'll update the tarball.
Comment 2 Pierre Habouzit 2022-08-21 14:55:56 UTC
Created attachment 236040 [details]
proper litmus test
Comment 3 Konstantin Belousov freebsd_committer freebsd_triage 2022-08-26 05:57:37 UTC
I believe you are right, but I would wait for an opinion from people who have
spent time with the smr code.

That said, atomic_add_acq on x86 already has sequentially consistent semantics
(which is why you said that it works on Intel, right?).  So the #ifdef from the
patch in smr_enter() is not needed; use a seq_cst fence for all arches.
Comment 4 Mark Johnston freebsd_committer freebsd_triage 2022-08-26 16:35:09 UTC
I certainly agree with the comment about smr_enter().  On at least arm64, atomic_add_acq_int()'s acquire semantics apply only to the load; subsequent loads can be reordered with the store.  I believe this is true with LSE atomics as well as the original LL/SC atomics.  I also agree that the code is fine on Intel since atomic instructions prevent store-load reordering.
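
In C11 terms, the problematic pattern looks roughly like this (a sketch with illustrative names, not the kernel code):

```C
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t c_seq;	/* per-CPU read sequence (illustrative) */
static _Atomic int global;	/* SMR-protected data (illustrative)    */

static int
reader_sketch(uint32_t observed_wr_seq)
{
	/*
	 * The acquire below attaches to the load half of the RMW; it does
	 * not order the store half against the later load of `global`.
	 * That load may therefore be satisfied before the updated c_seq
	 * becomes visible to the thread running smr_poll().
	 */
	atomic_fetch_add_explicit(&c_seq, observed_wr_seq,
	    memory_order_acquire);
	/* A seq_cst fence between the two would rule the reordering out. */
	return (atomic_load_explicit(&global, memory_order_relaxed));
}
```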

Our atomic(9) documentation doesn't really call out this subtlety.  We might have similar bugs elsewhere.

I need to think about smr_poll() a bit more.

> Moreover, smr_poll() absolutely needs a full memory barrier on entry; the `atomic_load_acq_int` calls performed by smr_poll() are not sufficient.

The suggested patch adds two barriers to smr_poll_scan() and none to smr_poll(), is that intentional?
Comment 5 Mark Johnston freebsd_committer freebsd_triage 2022-08-26 16:38:47 UTC
(In reply to Konstantin Belousov from comment #3)
> That said, atomic_add_acq on x86 already has sequentially consistent semantics
> (which is why you said that it works on Intel, right?).  So the #ifdef from the
> patch in smr_enter() is not needed; use a seq_cst fence for all arches.

The ifdef makes some sense as an optimization.  On x86 we can combine the store and barrier into one instruction, so why not do that?
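
Concretely, the tail of smr_enter() could then look something like the following (a sketch along the lines of the patch above, with illustrative arch guards; not necessarily the committed change):

```C
	/* This is an add because we do not have atomic_store_acq_int */
#if defined(__amd64__) || defined(__i386__)
	/*
	 * On x86, locked read-modify-write instructions already act as
	 * store-load barriers, so publishing the observed sequence and
	 * issuing the barrier is a single operation.
	 */
	atomic_add_acq_int(&smr->c_seq, smr_shared_current(smr->c_shared));
#else
	/*
	 * Elsewhere, publish the observed sequence with a plain atomic
	 * store and follow it with a full fence.
	 */
	atomic_store_int(&smr->c_seq, smr_shared_current(smr->c_shared));
	atomic_thread_fence_seq_cst();
#endif
```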
Comment 6 Pierre Habouzit 2022-08-26 21:40:19 UTC
(In reply to Mark Johnston from comment #4)

> The suggested patch adds two barriers to smr_poll_scan() and none to smr_poll(), is that intentional?

Yes, I cover this in my (2) paragraph:

> 2. Similarly, smr_poll_scan() needs a full barrier after the scan and _before_ it updates the global rd_seq. This barrier serializes the fast path of smr_poll() with CPUs that were not in a critical section, while the barrier on entry to smr_poll() synchronizes with the CPUs inside an active SMR critical section and was demonstrated in (1).

Fundamentally, the first barrier is about sequencing the "scan" operation with CPUs actively inside an SMR section (it basically pairs with their smr_enter()), while the second barrier is about all the sections we ignored because they weren't active, and pairs with the last `smr_leave()` that core executed.

Writing the litmus test to exhibit this problem was more work than I was willing to put in. For full disclosure, I am the author of XNU's smr.[hc] and am reporting this bug as a courtesy, given that FreeBSD was the inspiration. In XNU I'm definitely going to add both barriers (the current open source drop doesn't have them, but as I was reasoning through this code recently I convinced myself they were needed).
Comment 7 Pierre Habouzit 2022-08-26 21:41:49 UTC
(In reply to Pierre Habouzit from comment #6)

Oh, and sorry: maybe the barriers need to go in smr_poll() instead of smr_poll_scan(). The patch I pasted is not "serious", just a general outline of how I think it should look.
Comment 8 Mark Johnston freebsd_committer freebsd_triage 2022-08-27 00:13:30 UTC
(In reply to Pierre Habouzit from comment #6)
I'm sorry, but I still don't quite see the problem.  I spent some time trying to write litmus tests to explore interactions between readers, writers, and smr_poll(), but didn't find anything surprising (yet).  For now I'm ignoring the deferred and lazy modes of SMR since they're not used in FreeBSD currently.

I note that your litmus tests don't seem to take into account the fact that smr_advance() issues an atomic_thread_fence_rel() before updating the write sequence.  (On arm64 this expands to DMB SY, the same as atomic_thread_fence_seq_cst().)  With smr_enter() patched, both readers and writers issue a full memory barrier on arm64.

What I don't really understand is why readers need to synchronize with smr_poll() at all.  Even with the barriers you proposed, it's possible for a reader to store a snapshot of wr_seq that is older than the global rd_seq.  This could happen if an interrupt or vmexit occurs between the load of wr_seq and the store to per-CPU memory.  So long as smr_poll() does not advance rd_seq beyond that of any CPU actively executing a read section, it's ok, and I don't see how that can happen.  What am I missing?

Thanks for your patience and for taking the time to report these problems.  I posted a first patch for review here: https://reviews.freebsd.org/D36370
Comment 9 Pierre Habouzit 2022-08-28 21:30:39 UTC
`atomic_thread_fence_rel()` is a `dmb sy`? I would have expected that to be a `dmb st`...

But indeed it shows I'm not familiar with FreeBSD:

```C
static __inline void
atomic_thread_fence_rel(void)
{

	dmb(sy);
}
```

And indeed, that makes the deferred advance correct then.

It also means smr_synchronize() already has stronger semantics than I expected, and it makes the smr_poll() barrier on entry unnecessary, because the `atomic_load_acq_int()` in smr_poll() already gives the semantics I was after on entry.

I think you might be right that the barrier on exit of smr_poll() isn't quite needed either. I just didn't think it through: I took `ck_epoch_synchronize_wait` as an example and figured "it needs one, so smr_scan does too", but trying to come up with a reason why it would break yields nothing...
Comment 10 Mark Johnston freebsd_committer freebsd_triage 2022-08-28 23:01:33 UTC
(In reply to Pierre Habouzit from comment #9)
dmb st only orders stores, but that's not sufficient to implement a release fence.  For what it's worth, LLVM on FreeBSD 14/arm64 compiles atomic_thread_fence(memory_order_release) to a "dmb ish", same as atomic_thread_fence(memory_order_seq_cst).
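
For illustration, here is a minimal C11 sketch (with illustrative names) of why a release fence must order prior loads, and not just prior stores, against later stores:

```C
#include <stdatomic.h>

static int data;		/* plain data read under the lock (illustrative) */
static _Atomic int lock;	/* illustrative lock word */

static int
read_then_unlock(void)
{
	int v;

	/*
	 * The release fence must keep this load from being reordered after
	 * the store that releases the lock below; otherwise another thread
	 * could take the lock and modify `data` before the load executes,
	 * and we would read a value from outside our critical section.  A
	 * barrier that orders only stores ("dmb st") cannot prevent that.
	 */
	v = data;
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&lock, 0, memory_order_relaxed);
	return (v);
}
```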
Comment 11 Mark Johnston freebsd_committer freebsd_triage 2022-08-28 23:14:43 UTC
I do think the deferred advance might still be incorrect on platforms where FreeBSD's atomic_thread_fence_rel() does not provide a store/load barrier, such as POWER.
Comment 12 Pierre Habouzit 2022-08-28 23:30:02 UTC
wow TIL, that's the same on Darwin huh
Comment 13 commit-hook freebsd_committer freebsd_triage 2022-09-24 13:38:40 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=8694fd333556addb97acfff1feca6a1e389201ce

commit 8694fd333556addb97acfff1feca6a1e389201ce
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-09-24 13:18:04 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-09-24 13:18:04 +0000

    smr: Fix synchronization in smr_enter()

    smr_enter() must publish its observed read sequence number before
    issuing any subsequent memory operations.  The ordering provided by
    atomic_add_acq_int() is insufficient on some platforms, at least on
    arm64, because it permits reordering of subsequent loads with the store
    to c_seq.

    Thus, use atomic_thread_fence_seq_cst() to issue a store-load barrier
    after publishing the read sequence number.

    On x86, take advantage of the fact that memory operations are not
    reordered with locked instructions to improve code density: we can store
    the observed read sequence and provide a store-load barrier with a
    single operation.

    Based on a patch from Pierre Habouzit <pierre@habouzit.net>.

    PR:             265974
    Reviewed by:    alc
    MFC after:      2 weeks
    Differential Revision:  https://reviews.freebsd.org/D36370

 sys/sys/smr.h | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)
Comment 14 commit-hook freebsd_committer freebsd_triage 2022-10-11 14:11:19 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=65641500dbb7a93de4a6195de90c507f9d629b84

commit 65641500dbb7a93de4a6195de90c507f9d629b84
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-09-24 13:18:04 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-10-11 13:37:15 +0000

    smr: Fix synchronization in smr_enter()

    smr_enter() must publish its observed read sequence number before
    issuing any subsequent memory operations.  The ordering provided by
    atomic_add_acq_int() is insufficient on some platforms, at least on
    arm64, because it permits reordering of subsequent loads with the store
    to c_seq.

    Thus, use atomic_thread_fence_seq_cst() to issue a store-load barrier
    after publishing the read sequence number.

    On x86, take advantage of the fact that memory operations are not
    reordered with locked instructions to improve code density: we can store
    the observed read sequence and provide a store-load barrier with a
    single operation.

    Based on a patch from Pierre Habouzit <pierre@habouzit.net>.

    PR:             265974
    Reviewed by:    alc

    (cherry picked from commit 8694fd333556addb97acfff1feca6a1e389201ce)

 sys/sys/smr.h | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)
Comment 15 Mark Johnston freebsd_committer freebsd_triage 2022-10-12 15:54:44 UTC
Thank you for the report and litmus tests.