Bug 264867 - kevent EVFILT_TIMER, timer expiration with twice the period
Summary: kevent EVFILT_TIMER, timer expiration with twice the period
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.1-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: Mark Johnston
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-06-24 09:38 UTC by felice.tufo
Modified: 2022-08-01 14:21 UTC
CC List: 1 user

See Also:


Attachments
example reproducing the problem (4.04 KB, text/plain)
2022-06-24 09:38 UTC, felice.tufo
possible bug fix (708 bytes, patch)
2022-06-30 18:26 UTC, Mark Johnston
patched file (23.35 KB, text/plain)
2022-07-01 15:30 UTC, felice.tufo
13.1 patch (29.92 KB, patch)
2022-07-11 20:15 UTC, Mark Johnston

Description felice.tufo 2022-06-24 09:38:28 UTC
Created attachment 234909 [details]
example reproducing the problem

I'm trying to use kevent to implement a periodic (50 ms) event generator; the application can tolerate a 10% error on the programmed period. I know that FreeBSD is not a hard real-time OS, so I'm just trying to evaluate how well it fits my specific needs.

The problem is that every now and then (after some hours of running) the timer expires well after the programmed period, and what makes me think I'm hitting a bug is that the delay is *very* often exactly twice the programmed period (100 ms for a 50 ms period, 40 ms for a 20 ms period, and so on).

Bug 264131 seems to be something similar, but that was deterministic behaviour. Moreover I see the same problem on FreeBSD 12.2.

I attach the example I use to reproduce the problem (using multithreading to make the issue happen faster).
The example is compiled with clang on an up-to-date FreeBSD 13.1 machine, a Skylake platform with 8 cores and 32 GB RAM, with no other applications running (not even sendmail).
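
For reference, the core of the test is the usual periodic EVFILT_TIMER pattern; a minimal standalone sketch (not the attached program, with the 50 ms period hard-coded just for illustration) looks like this:

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <err.h>

int
main(void)
{
    struct kevent kev, ev;
    int kq;

    if ((kq = kqueue()) == -1)
        err(1, "kqueue");

    /* Arm a periodic 50 ms timer; it stays armed until deleted. */
    EV_SET(&kev, 1, EVFILT_TIMER, EV_ADD, 0, 50, NULL);
    if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
        err(1, "kevent register");

    for (;;) {
        /* Block until the next expiration; ev.data reports how many
           expirations occurred since the previous wakeup. */
        if (kevent(kq, NULL, 0, &ev, 1, NULL) == -1)
            err(1, "kevent wait");
        /* ... periodic work and timing measurements go here ... */
    }
}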

Thanks
Comment 1 Mark Johnston freebsd_committer freebsd_triage 2022-06-24 13:51:58 UTC
Have you been able to reproduce the problem on multiple different hardware systems?  What does "sysctl kern.eventtimer" show on an affected system?

I tried running it locally on a desktop running 14.0-CURRENT but did not see any problems after a few minutes.
Comment 2 felice.tufo 2022-06-24 14:02:57 UTC
Hi Mark

thanks for the answer.

The output of the sysctl command is as follows:

kern.eventtimer.choice: LAPIC(600) HPET(550) i8254(100) RTC(0)
kern.eventtimer.et.i8254.quality: 100
kern.eventtimer.et.i8254.frequency: 1193182
kern.eventtimer.et.i8254.flags: 1
kern.eventtimer.et.RTC.quality: 0
kern.eventtimer.et.RTC.frequency: 32768
kern.eventtimer.et.RTC.flags: 17
kern.eventtimer.et.HPET.quality: 550
kern.eventtimer.et.HPET.frequency: 24000000
kern.eventtimer.et.HPET.flags: 7
kern.eventtimer.et.LAPIC.quality: 600
kern.eventtimer.et.LAPIC.frequency: 2903994335
kern.eventtimer.et.LAPIC.flags: 7
kern.eventtimer.periodic: 0
kern.eventtimer.timer: LAPIC
kern.eventtimer.idletick: 0
kern.eventtimer.singlemul: 2

I did not try other systems because the hardware cannot be changed (the application has to run on that industrial PC), but I could run some tests on other machines just to see whether the problem is reproducible on other platforms as well.

I'm sorry, but a test of just a few minutes is not enough; I usually let it run for hours. In one run the issue appeared only after nearly a day and a half.

Also, I never tried 14.0-CURRENT, only 12.2 and 13.1.

Thanks,
Felice
Comment 3 Mark Johnston freebsd_committer freebsd_triage 2022-06-28 00:56:46 UTC
I'm able to reproduce what looks like a similar problem on a different test system.  In my case, the extra delay is always close to a multiple of 20ms (the kqueue timer period).  Usually 40ms or 60ms.  It often occurs soon after the program starts up.  It happens with or without setting the scheduling policy.

I suspect the problem is related to some kind of race in the EVFILT_TIMER code.  If I add EV_ONESHOT to the event, the problem is not reproducible anymore.  In the test case, when kevent() returns the timer is still armed, and it's supposed to be reconfigured by the next kevent() call.
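
With EV_ONESHOT, the test has to re-add the timer on every iteration instead of leaving it armed; roughly like this sketch (illustrative names, not the attached program):

/* Sketch: assumes kq is an existing kqueue descriptor and that
   <sys/event.h> and <err.h> are included. */
static void
oneshot_loop(int kq)
{
    struct kevent kev, ev;

    for (;;) {
        /* EV_ONESHOT disarms the timer after a single expiration,
           so it must be re-added explicitly on every iteration. */
        EV_SET(&kev, 1, EVFILT_TIMER, EV_ADD | EV_ONESHOT, 0, 50, NULL);
        if (kevent(kq, &kev, 1, &ev, 1, NULL) == -1)
            err(1, "kevent");
        /* ... periodic work ... */
    }
}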
Comment 4 felice.tufo 2022-06-28 09:11:21 UTC
Hi Mark,
I tried running my example with EV_ONESHOT set too; in my case the problem still showed up, but that could just be a matter of test duration.

I also tried writing a kernel module that uses the callout framework directly: I noticed small extra delays (1-2 ms) every now and then, but never large delays (twice the programmed period or more) as in userspace, so I agree that something "strange" could be going on in the kevent layer.
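
The module re-arms the callout from its own handler, roughly like this (a simplified sketch with illustrative names, not the actual module code):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/callout.h>

static struct callout tick_callout;

static void
tick_handler(void *arg)
{
    /* ... record the expiration time and compare with the previous one ... */

    /* Re-arm for another 50 ms (hz ticks per second / 20). */
    callout_reset(&tick_callout, hz / 20, tick_handler, NULL);
}

static void
tick_start(void)
{
    callout_init(&tick_callout, 1);  /* MP-safe callout */
    callout_reset(&tick_callout, hz / 20, tick_handler, NULL);
}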

I'm waiting for some hardware to become available so I can run the test on different systems.
Comment 5 Mark Johnston freebsd_committer freebsd_triage 2022-06-29 17:25:24 UTC
Are you able to reproduce the problem if all threads are pinned to a single CPU?  For instance,

# cpuset -l 10 ./a.out
Comment 6 Mark Johnston freebsd_committer freebsd_triage 2022-06-29 21:14:03 UTC
To expand on my question a bit, I think there is a race somewhere in the eventtimer code which occurs when a callout is scheduled on a different CPU.  In particular, when an EVFILT_TIMER event is created, it is "pinned" to the current CPU, and all future callouts will fire on that CPU.  I am not sure why kevent does this; it doesn't make much sense to me.

If I change filt_timertouch() to set kc->cpuid = PCPU_GET(cpuid) each time the event is re-added, then the bug is much harder to trigger.  Alternately, if the test is changed so that each thread pins itself to a different CPU, then the problem is not reproducible for me.
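
For reference, a thread can pin itself to a CPU from userspace along these lines (a sketch using cpuset_setaffinity(2); the CPU number is whatever the caller chooses):

#include <sys/param.h>
#include <sys/cpuset.h>
#include <err.h>

/* Pin the calling thread to the given CPU. */
static void
pin_to_cpu(int cpu)
{
    cpuset_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
        sizeof(set), &set) != 0)
        err(1, "cpuset_setaffinity");
}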

(In reply to felice.tufo from comment #4)
Does your test module specify the target CPU when arming a callout?
Comment 7 felice.tufo 2022-06-30 09:10:34 UTC
Hi Mark, unfortunately I have to share the system with other engineers, so I'll try to run the test with the cpuset command as soon as possible (hopefully during the weekend).
In the meantime I can add some info:

- I didn't specify the target cpu when arming a callout, nor when using kevent

- the original problem appeared in a multithreaded program, but only one of the threads was calling kevent(); the others were receiving data on sockets. The program ran at priority 0, not pinned to any CPU. I suspected that multithreading might trigger the problem more frequently, so I wrote the multithreaded test program attached to this bug report, but the original problem showed up while a single thread was calling kevent with a 50 ms period.
Comment 8 Mark Johnston freebsd_committer freebsd_triage 2022-06-30 18:26:47 UTC
Created attachment 235005 [details]
possible bug fix

Could you please give this patch a try?  It solves the problem for me.  I think testing this would be more useful than the other things I suggested so far (pinning threads, etc.).

I think it'll apply to 13.1 and 12.3 without any adjustment.

Basically, when scheduling a callout on a different CPU, there is a small race which can cause the event timer to be programmed incorrectly, resulting in delayed processing of events.  Your excellent test program triggers this readily on one of my systems with many CPUs; over time the scheduler migrates threads away from the "home" CPU selected during knote initialization, so the window for this race is opened.
Comment 9 felice.tufo 2022-07-01 09:03:30 UTC
Thank you for the patch, Mark. I'm going to apply it to the 13.1 kernel, launch the test, and keep it running for the whole weekend.

Fingers crossed... :-)
Comment 10 felice.tufo 2022-07-01 15:30:53 UTC
Created attachment 235023 [details]
patched file
Comment 11 felice.tufo 2022-07-01 15:36:01 UTC
Comment on attachment 235023 [details]
patched file

After some hours of running, the problem unfortunately fired again (programmed period 50 ms, actual sleep 100 ms):

Cicle 255600 - Thread 1240 OK
Cicle 255600 - Thread 1241 OK
Cicle 255600 - Thread 1236 OK
Cicle 255600 - Thread 1242 OK
Thread 1239
start sec 13352 nsec 453408127
stop sec 13352 nsec 553448989
wait sec 0 nsec 100040862

Could you please double check that the file I've just attached was patched correctly?
After patching, I just copied (and renamed) the GENERIC amd64 configuration, ran 'make buildkernel KERNELCONF=xxx' and 'make installkernel KERNELCONF=xxx', and rebooted.

I'm relaunching the test with all the threads pinned to a CPU... let's see if this helps.
Comment 12 Mark Johnston freebsd_committer freebsd_triage 2022-07-01 15:43:07 UTC
(In reply to felice.tufo from comment #11)
The variable is KERNCONF, not KERNELCONF, but I don't see why you would need to create a new kernel configuration for a patch.

The patched kern_clocksource.c looks right to me.
Comment 13 felice.tufo 2022-07-01 15:53:46 UTC
(In reply to Mark Johnston from comment #12)
Yes, you're right; I just followed this guide to make sure I did everything correctly (patching and building a FreeBSD kernel is not my daily job, sorry :-)):
https://docs.freebsd.org/en/books/handbook/kernelconfig/#kernelconfig-building

So now the test is running again on the patched system, but pinned to a cpu...
Comment 14 Mark Johnston freebsd_committer freebsd_triage 2022-07-01 19:06:13 UTC
(In reply to felice.tufo from comment #13)
Just to double check, what does "uname -a" report when running the patched kernel?
Comment 15 felice.tufo 2022-07-01 20:21:59 UTC
(In reply to Mark Johnston from comment #14)

$ uname -a
FreeBSD schrodinger 13.1-RELEASE FreeBSD 13.1-RELEASE FELIX amd64

FELIX is the name of the configuration used (a copy of the GENERIC configuration).

Also, the /boot dir looks OK, with the kernel subdir created this morning after compiling and installing the patched kernel:
 
$ ls -l /boot/
total 10045
[CUT]
drwxr-xr-x  2 root  wheel   18432 Jul  1 11:47 kernel
drwxr-xr-x  2 root  wheel   18432 Jun 15 12:03 kernel.old

After 5 hours, the test is still running fine (patched kernel and process pinned to cpu 1). I'll check it again tomorrow morning.
Comment 16 Mark Johnston freebsd_committer freebsd_triage 2022-07-02 17:21:07 UTC
Ok, I'm still able to reproduce the problem with my patch.  But, the frequency of the late wakeups is much much lower, and I'm fairly sure that the patch solves a legitimate problem on its own.

Specifically, getnextcpuevent() contains the following bit of code:

/* Handle callout events. */
if (event > state->nextcall)
    event = state->nextcall;

It gets called when a CPU goes idle, and the idle thread is scheduling the next timer interrupt.  "nextcall" is not volatile, and interrupts are enabled, so state->nextcall can be modified between the check and the load, and the CPU will end up using a stale value.

It might be that there is another, similar race in handleevents(), since a different CPU might be updating state->nextcall concurrently.  But I think the race is harmless since callout_process() should be able to find the next callout...
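
In simplified form, the race and the intended fix look something like this (illustrative names, not the actual kern_clocksource.c code):

/* Racy: with interrupts enabled and no lock held, state->nextcall can
 * change between the comparison and the assignment, so "event" may end
 * up holding a stale value. */
if (event > state->nextcall)
    event = state->nextcall;

/* Safer: take a single snapshot (or hold the timer-state lock) so the
 * comparison and the assignment see the same value. */
lock(&state->lock);                 /* hypothetical lock primitive */
next = state->nextcall;
if (event > next)
    event = next;
unlock(&state->lock);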
Comment 17 felice.tufo 2022-07-02 17:47:49 UTC
(In reply to Mark Johnston from comment #16)
I can confirm that, with your patch applied, it's harder to make the late wakeup fire on a virtual machine where I'm running the test with FreeBSD 12.2.

I've also just checked the industrial PC running FreeBSD 13.1: the test is still running with no late wakeups after nearly 27 hours, so your patch + cpuset probably did the trick.
I'll let it run until Monday and, if it's still OK, I'll stop the multithreaded test and start the original application that was failing, to see how it behaves with the patch + cpuset.

In the meantime, I'm fully available to run any test that may help find the culprit... thanks for your timely support, Mark!
Comment 18 felice.tufo 2022-07-04 08:35:30 UTC
Hi Mark, I just checked the test; this time it failed with a different pattern: it seems that after cycle 3865200 (roughly 53 hours from the start) all the timers failed with a delay of 400-450 ms (programmed period: 50 ms).

Cicle 3865200 - Thread 1239 OK
Cicle 3865200 - Thread 1242 OK
Cicle 3865200 - Thread 1234 OK
Cicle 3865200 - Thread 1230 OK
Cicle 3865200 - Thread 1244 OK
Cicle 3865200 - Thread 1236 OK
Cicle 3865200 - Thread 1240 OK
Cicle 3865200 - Thread 1232 OK
Cicle 3865200 - Thread 1235 OK
Cicle 3865200 - Thread 1238 OK
Cicle 3865200 - Thread 1243 OK
Cicle 3865200 - Thread 1237 OK
Cicle 3865200 - Thread 1231 OK
Cicle 3865200 - Thread 1233 OK
Cicle 3865200 - Thread 1241 OK

Thread 1242
start sec 213158 nsec 215834696
stop sec 213158 nsec 670142788
wait sec 0 nsec 454308092

Thread 1233
start sec 213158 nsec 265846605
stop sec 213158 nsec 670278496
wait sec 0 nsec 404431891

Thread 1232
start sec 213158 nsec 215851127
stop sec 213158 nsec 670164764
wait sec 0 nsec 454313637

Thread 1244
start sec 213158 nsec 215864424
stop sec 213158 nsec 670169963
wait sec 0 nsec 454305539

Thread 1238
start sec 213158 nsec 215877180
stop sec 213158 nsec 670174861
wait sec 0 nsec 454297681

Thread 1236
start sec 213158 nsec 215883699
stop sec 213158 nsec 670179996
wait sec 0 nsec 454296297

Thread 1243
start sec 213158 nsec 215889243
stop sec 213158 nsec 670184644
wait sec 0 nsec 454295401

Thread 1237
start sec 213158 nsec 215894354
stop sec 213158 nsec 670189292
wait sec 0 nsec 454294938

Thread 1240
start sec 213158 nsec 215899542
stop sec 213158 nsec 670194408
wait sec 0 nsec 454294866

Thread 1231
start sec 213158 nsec 215904826
stop sec 213158 nsec 670199160
wait sec 0 nsec 454294334

Thread 1241
start sec 213158 nsec 215910387
stop sec 213158 nsec 670203783
wait sec 0 nsec 454293396

Thread 1235
start sec 213158 nsec 265824915
stop sec 213158 nsec 670208342
wait sec 0 nsec 404383427

Thread 1239
start sec 213158 nsec 265854966
stop sec 213158 nsec 670213146
wait sec 0 nsec 404358180

Thread 1234
start sec 213158 nsec 265862375
stop sec 213158 nsec 670288335
wait sec 0 nsec 404425960

Thread 1230
start sec 213158 nsec 215839589
stop sec 213158 nsec 670157348
wait sec 0 nsec 454317759
Comment 19 Mark Johnston freebsd_committer freebsd_triage 2022-07-04 22:34:44 UTC
(In reply to felice.tufo from comment #17)
I've found a couple of other bugs, including one in the scheduler (but this one should only be in 13.1).  I'll post a patch once some more testing is done, probably tomorrow.  BTW, I don't think it's worth spending any more time testing with pinned threads; I'm sure pinning will hide all of these bugs.

(In reply to felice.tufo from comment #18)
Hum.  That sounds like an overflow bug, but I can't think of anything that'd overflow at 53 hours of uptime.  I haven't run the test program long enough to reproduce it, I'll try it later this week once the easier-to-reproduce issues are ironed out.
Comment 20 felice.tufo 2022-07-11 07:45:04 UTC
Hi Mark, any news? Can I help with some tests?
Comment 21 Mark Johnston freebsd_committer freebsd_triage 2022-07-11 16:03:46 UTC
(In reply to felice.tufo from comment #20)
Sorry for the delayed reply, this has turned into quite a rabbit hole.

I've posted patches for a few problems that I found:
https://reviews.freebsd.org/D35744
https://reviews.freebsd.org/D35736
https://reviews.freebsd.org/D35735
https://reviews.freebsd.org/D35738

but there is at least one other bug which causes test failures, and while I have a patch which fixes it, it also causes some weird, errant behaviour in the CPU idle loop that I don't yet understand.  Once that's sorted out, I *think* I'll have fixed all of the problems found with the test program and will post a combined patch for you to test.
Comment 22 commit-hook freebsd_committer freebsd_triage 2022-07-11 19:59:54 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a889a65ba36985dfb31111ac1607be35ca2b2c8c

commit a889a65ba36985dfb31111ac1607be35ca2b2c8c
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-06-30 18:27:07 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-07-11 19:58:43 +0000

    eventtimer: Fix several races in the timer reload code

    In handleevents(), lock the timer state before fetching the time for the
    next event.  A concurrent callout_cc_add() call might be changing the
    next event time, and the race can cause handleevents() to program an
    out-of-date time, causing the callout to run later (by an unbounded
    period, up to the idle hardclock period of 1s) than requested.

    In cpu_idleclock(), call getnextcpuevent() with the timer state mutex
    held, for similar reasons.  In particular, cpu_idleclock() runs with
    interrupts enabled, so an untimely timer interrupt can result in a stale
    next event time being programmed.  Further, an interrupt can cause
    cpu_idleclock() to use a stale value for "now".

    In cpu_activeclock(), disable interrupts before loading "now", so as to
    avoid going backwards in time when calling handleevents().  It's ok to
    leave interrupts enabled when checking "state->idle", since the race at
    worst will cause handleevents() to be called unnecessarily.  But use an
    atomic load to indicate that the test is racy.

    PR:             264867
    Reviewed by:    mav, jhb, kib
    MFC after:      2 weeks
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D35735

 sys/kern/kern_clocksource.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)
Comment 23 Mark Johnston freebsd_committer freebsd_triage 2022-07-11 20:01:18 UTC
(In reply to Mark Johnston from comment #21)
Ok, last patch posted here: https://reviews.freebsd.org/D35777

I'll try combining them into a patch against 13.1.
Comment 24 Mark Johnston freebsd_committer freebsd_triage 2022-07-11 20:15:40 UTC
Created attachment 235195 [details]
13.1 patch

Here's a combined diff against 13.1.  Please give it a try if you can, without any CPU pinning.

There is a lingering source of delayed scheduling, and that occurs when a callout takes "too long".  I fixed one instance of that in UMA (https://cgit.FreeBSD.org/src/commit/?id=93cd28ea82bb7b5e9489c8febfe1ae7ed2c044fb) but didn't include it here.  Such delays should only be by several milliseconds, though.
Comment 25 felice.tufo 2022-07-12 09:40:45 UTC
Thank you Mark, I've just patched the kernel and restarted the multithreaded test. Let's see what happens in the next 3-4 days...
Comment 26 commit-hook freebsd_committer freebsd_triage 2022-07-14 14:48:15 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0927ff78147b4d00a75054bbad299946208e1e91

commit 0927ff78147b4d00a75054bbad299946208e1e91
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-07-14 14:23:43 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-07-14 14:27:58 +0000

    sched_ule: Enable preemption of curthread in the load balancer

    The load balancer executes from statclock and periodically tries to move
    threads among CPUs in order to balance load.  It may move a thread to
    the current CPU (the loader balancer always runs on CPU 0).  When it
    does so, it may need to schedule preemption of the interrupted thread.
    Use sched_setpreempt() to do so, same as sched_add().

    PR:             264867
    Reviewed by:    mav, kib, jhb
    MFC after:      1 month
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D35744

 sys/kern/sched_ule.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)
Comment 27 commit-hook freebsd_committer freebsd_triage 2022-07-14 14:48:16 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=6d3f74a14a83b867c273c6be2599da182a9b9ec7

commit 6d3f74a14a83b867c273c6be2599da182a9b9ec7
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-07-14 14:21:28 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-07-14 14:27:51 +0000

    sched_ule: Fix racy loads of pc_curthread

    Thread switching used to be atomic with respect to the current CPU's
    tdq lock.  Since commit 686bcb5c14ab that is no longer the case.  Now
    sched_switch() does this:

    1.  lock tdq (might already be locked)
    2.  maybe put the current thread in the tdq, choose a new thread to run
    2a. update tdq_lowpri
    3.  unlock tdq
    4.  switch CPU context, update curthread

    Some code paths in ULE will load pc_curthread from a remote CPU with
    that CPU's tdq lock held, usually to inspect its priority.  But, as of
    the aforementioned commit this is racy.

    The problem I noticed is in tdq_notify(), which optionally sends an IPI
    to a remote CPU when a new thread is added to its runqueue.  If the new
    thread's priority is higher (lower) than the currently running thread's
    priority, then we deliver an IPI.  But inspecting
    pc_curthread->td_priority doesn't work, since pc_curthread might be
    between steps 3 and 4 above.  If pc_curthread's priority is higher than
    that of the newly added thread, but pc_curthread is switching to a
    lower-priority thread, then tdq_notify() might fail to deliever an IPI,
    leaving a high priority thread stuck on the runqueue for longer than it
    should.  This can cause multi-millisecond stalls in
    interactive/ithread/realtime threads.

    Fix this problem by modifying tdq_add() and tdq_move() to return the
    value of tdq_lowpri before the addition of the new thread.  This ensures
    that tdq_notify() has the correct priority value to compare against.

    The other two uses of pc_curthread are susceptible to the same race.  To
    fix the one in sched_rem()->tdq_setlowpri() we need to have an exact
    value for curthread.  Thus, introduce a new tdq_curthread field to the
    tdq which gets updated any time a new thread is selected to run on the
    CPU.  Because this field is synchronized by the thread lock, its
    priority reflects the correct lowpri value for the tdq.

    PR:             264867
    Fixes:          686bcb5c14ab ("schedlock 4/4")
    Reviewed by:    mav, kib, jhb
    MFC after:      1 month
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D35736

 sys/kern/sched_ule.c | 143 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 84 insertions(+), 59 deletions(-)
Comment 28 commit-hook freebsd_committer freebsd_triage 2022-07-14 14:48:17 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=03f868b163ad46d6f7cb03dc46fb83ca01fb8f69

commit 03f868b163ad46d6f7cb03dc46fb83ca01fb8f69
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-07-14 14:24:25 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-07-14 14:28:01 +0000

    x86: Add a required store-load barrier in cpu_idle()

    ULE's tdq_notify() tries to avoid delivering IPIs to the idle thread.
    In particular, it tries to detect whether the idle thread is running.
    There are two mechanisms for this:
    - tdq_cpu_idle, an MI flag which is set prior to calling cpu_idle().  If
      tdq_cpu_idle == 0, then no IPI is needed;
    - idle_state, an x86-specific state flag which is updated after
      cpu_idleclock() is called.

    The implementation of the second mechanism is racy; the race can cause a
    CPU to go to sleep with pending work.  Specifically, cpu_idle_*() set
    idle_state = STATE_SLEEPING, then check for pending work by loading the
    tdq_load field of the CPU's runqueue.  These operations can be reordered
    so that the idle thread observes tdq_load == 0, and tdq_notify()
    observes idle_state == STATE_RUNNING.

    Some counters indicate that the idle_state check in tdq_notify()
    frequently elides an IPI.  So, fix the problem by inserting a fence
    after the store to idle_state, immediately before idling the CPU.

    PR:             264867
    Reviewed by:    mav, kib, jhb
    MFC after:      1 month
    Sponsored by:   The FreeBSD Foundation
    Differential Revision:  https://reviews.freebsd.org/D35777

 sys/x86/x86/cpu_machdep.c | 103 ++++++++++++++++++++++++++++------------------
 1 file changed, 62 insertions(+), 41 deletions(-)
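
In abstract terms, the store-load ordering problem the commit describes, and the fence that fixes it, look like this C11 sketch (illustrative names, not the kernel code):

#include <stdatomic.h>

static _Atomic int idle_state;  /* 0 = running, 1 = sleeping (illustrative) */
static _Atomic int tdq_load;    /* pending-work counter (illustrative) */

static void
idle_path(void)
{
    atomic_store_explicit(&idle_state, 1, memory_order_relaxed);
    /* Store-load barrier: keep the load below from being reordered
       before the store above. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&tdq_load, memory_order_relaxed) == 0) {
        /* ... safe to actually idle the CPU ... */
    }
}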
Comment 29 Mark Johnston freebsd_committer freebsd_triage 2022-07-14 14:50:08 UTC
(In reply to felice.tufo from comment #25)
I am hoping that no news is good news. :)

In any case, I committed a number of bug fixes that I'm confident in.  It's certainly possible that I did not catch everything, but the test is very reliable for me now.
Comment 30 felice.tufo 2022-07-14 15:50:41 UTC
Hi Mark, the current test cycle is 3870000, more than 53 hours, the longest run without errors since we started evaluating FreeBSD, so your patch looks very promising.

I'll keep it running for the whole weekend and, on Monday, if everything is still OK, I'll replace the multithreaded test with the application that fired the issue the first time, so expect an update next Tuesday...

In the meantime let me give you some details about the application: it is a "2 out of 2" railway vital system, expected to run 24/7/365 (ideally for years) without problems; it is composed of multiple (multithreaded) processes communicating over sockets and shared memory; the clock of the industrial PC running FreeBSD is synchronized via the Precision Time Protocol (by the external ptpd daemon).
The kevent timer is used as a clock with a 50 ms period; we have different jobs running in each 50 ms phase and read CLOCK_MONOTONIC at the start and end of each phase to verify that no job lasted longer than 50 ms (see the sketch below).
I know this is a generic description, but can you foresee any problems in using FreeBSD for such an application? I mean, things like: "expect to reboot FreeBSD every 6 months because <reason>"
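
For reference, the per-phase check is essentially the following (a sketch; the 50 ms budget and names are illustrative):

#include <time.h>
#include <stdio.h>

/* Elapsed time between two CLOCK_MONOTONIC samples, in nanoseconds. */
static long long
elapsed_ns(const struct timespec *start, const struct timespec *stop)
{
    return (stop->tv_sec - start->tv_sec) * 1000000000LL +
        (stop->tv_nsec - start->tv_nsec);
}

static void
run_phase(void)
{
    struct timespec start, stop;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... run the jobs scheduled for this 50 ms phase ... */
    clock_gettime(CLOCK_MONOTONIC, &stop);

    if (elapsed_ns(&start, &stop) > 50 * 1000000LL)
        fprintf(stderr, "phase overran its 50 ms budget\n");
}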

Thanks a lot!
Comment 31 Mark Johnston freebsd_committer freebsd_triage 2022-07-14 16:12:16 UTC
(In reply to felice.tufo from comment #30)
There is no technical need to reboot FreeBSD periodically.  Of course, there is never a guarantee; bugs are always possible.
Comment 32 felice.tufo 2022-07-18 13:36:34 UTC
Hi Mark, good news from the test: it hasn't produced any errors in 6 days, the longest run ever.
Now I'm restarting the application that triggered the error in the first place... let's wait a few days and see what happens!
Comment 33 Mark Johnston freebsd_committer freebsd_triage 2022-07-18 13:41:27 UTC
(In reply to felice.tufo from comment #32)
Nice!  Glad to hear it.
Comment 34 felice.tufo 2022-07-21 08:43:49 UTC
Hi Mark, after several days of testing with no issues, I think we can reasonably say that this bug can be closed. Thanks for all of your support!
Comment 35 Mark Johnston freebsd_committer freebsd_triage 2022-07-21 13:19:22 UTC
(In reply to felice.tufo from comment #34)
Great, glad to hear it.  I'll keep this PR open until the patches are merged to stable/13.
Comment 36 commit-hook freebsd_committer freebsd_triage 2022-07-25 13:55:14 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=3be1eaa32b95857cf848295e10ffc25e2852d2e6

commit 3be1eaa32b95857cf848295e10ffc25e2852d2e6
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-06-30 18:27:07 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-07-25 13:45:51 +0000

    eventtimer: Fix several races in the timer reload code

    In handleevents(), lock the timer state before fetching the time for the
    next event.  A concurrent callout_cc_add() call might be changing the
    next event time, and the race can cause handleevents() to program an
    out-of-date time, causing the callout to run later (by an unbounded
    period, up to the idle hardclock period of 1s) than requested.

    In cpu_idleclock(), call getnextcpuevent() with the timer state mutex
    held, for similar reasons.  In particular, cpu_idleclock() runs with
    interrupts enabled, so an untimely timer interrupt can result in a stale
    next event time being programmed.  Further, an interrupt can cause
    cpu_idleclock() to use a stale value for "now".

    In cpu_activeclock(), disable interrupts before loading "now", so as to
    avoid going backwards in time when calling handleevents().  It's ok to
    leave interrupts enabled when checking "state->idle", since the race at
    worst will cause handleevents() to be called unnecessarily.  But use an
    atomic load to indicate that the test is racy.

    PR:             264867
    Reviewed by:    mav, jhb, kib
    Sponsored by:   The FreeBSD Foundation

    (cherry picked from commit a889a65ba36985dfb31111ac1607be35ca2b2c8c)

 sys/kern/kern_clocksource.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)
Comment 37 commit-hook freebsd_committer freebsd_triage 2022-08-01 14:19:43 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=20abb8075086dd31fae4ad39e84a44ce429a357e

commit 20abb8075086dd31fae4ad39e84a44ce429a357e
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-07-14 14:23:43 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-08-01 14:12:36 +0000

    sched_ule: Enable preemption of curthread in the load balancer

    The load balancer executes from statclock and periodically tries to move
    threads among CPUs in order to balance load.  It may move a thread to
    the current CPU (the loader balancer always runs on CPU 0).  When it
    does so, it may need to schedule preemption of the interrupted thread.
    Use sched_setpreempt() to do so, same as sched_add().

    PR:             264867
    Reviewed by:    mav, kib, jhb
    Sponsored by:   The FreeBSD Foundation

    (cherry picked from commit 0927ff78147b4d00a75054bbad299946208e1e91)

 sys/kern/sched_ule.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)
Comment 38 commit-hook freebsd_committer freebsd_triage 2022-08-01 14:19:44 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=84a0d34d10aca36b1e7f9d00d0c4883f3355883b

commit 84a0d34d10aca36b1e7f9d00d0c4883f3355883b
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-07-14 14:24:25 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-08-01 14:12:49 +0000

    x86: Add a required store-load barrier in cpu_idle()

    ULE's tdq_notify() tries to avoid delivering IPIs to the idle thread.
    In particular, it tries to detect whether the idle thread is running.
    There are two mechanisms for this:
    - tdq_cpu_idle, an MI flag which is set prior to calling cpu_idle().  If
      tdq_cpu_idle == 0, then no IPI is needed;
    - idle_state, an x86-specific state flag which is updated after
      cpu_idleclock() is called.

    The implementation of the second mechanism is racy; the race can cause a
    CPU to go to sleep with pending work.  Specifically, cpu_idle_*() set
    idle_state = STATE_SLEEPING, then check for pending work by loading the
    tdq_load field of the CPU's runqueue.  These operations can be reordered
    so that the idle thread observes tdq_load == 0, and tdq_notify()
    observes idle_state == STATE_RUNNING.

    Some counters indicate that the idle_state check in tdq_notify()
    frequently elides an IPI.  So, fix the problem by inserting a fence
    after the store to idle_state, immediately before idling the CPU.

    PR:             264867
    Reviewed by:    mav, kib, jhb
    Sponsored by:   The FreeBSD Foundation

    (cherry picked from commit 03f868b163ad46d6f7cb03dc46fb83ca01fb8f69)

 sys/x86/x86/cpu_machdep.c | 103 ++++++++++++++++++++++++++++------------------
 1 file changed, 62 insertions(+), 41 deletions(-)
Comment 39 commit-hook freebsd_committer freebsd_triage 2022-08-01 14:19:44 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a869b8c5f0ba01b8c7fbc4b486e7b8720b8525ee

commit a869b8c5f0ba01b8c7fbc4b486e7b8720b8525ee
Author:     Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-07-14 14:21:28 +0000
Commit:     Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-08-01 14:12:12 +0000

    sched_ule: Fix racy loads of pc_curthread

    Thread switching used to be atomic with respect to the current CPU's
    tdq lock.  Since commit 686bcb5c14ab that is no longer the case.  Now
    sched_switch() does this:

    1.  lock tdq (might already be locked)
    2.  maybe put the current thread in the tdq, choose a new thread to run
    2a. update tdq_lowpri
    3.  unlock tdq
    4.  switch CPU context, update curthread

    Some code paths in ULE will load pc_curthread from a remote CPU with
    that CPU's tdq lock held, usually to inspect its priority.  But, as of
    the aforementioned commit this is racy.

    The problem I noticed is in tdq_notify(), which optionally sends an IPI
    to a remote CPU when a new thread is added to its runqueue.  If the new
    thread's priority is higher (lower) than the currently running thread's
    priority, then we deliver an IPI.  But inspecting
    pc_curthread->td_priority doesn't work, since pc_curthread might be
    between steps 3 and 4 above.  If pc_curthread's priority is higher than
    that of the newly added thread, but pc_curthread is switching to a
    lower-priority thread, then tdq_notify() might fail to deliever an IPI,
    leaving a high priority thread stuck on the runqueue for longer than it
    should.  This can cause multi-millisecond stalls in
    interactive/ithread/realtime threads.

    Fix this problem by modifying tdq_add() and tdq_move() to return the
    value of tdq_lowpri before the addition of the new thread.  This ensures
    that tdq_notify() has the correct priority value to compare against.

    The other two uses of pc_curthread are susceptible to the same race.  To
    fix the one in sched_rem()->tdq_setlowpri() we need to have an exact
    value for curthread.  Thus, introduce a new tdq_curthread field to the
    tdq which gets updated any time a new thread is selected to run on the
    CPU.  Because this field is synchronized by the thread lock, its
    priority reflects the correct lowpri value for the tdq.

    PR:             264867
    Fixes:          686bcb5c14ab ("schedlock 4/4")
    Reviewed by:    mav, kib, jhb
    Sponsored by:   The FreeBSD Foundation

    (cherry picked from commit 6d3f74a14a83b867c273c6be2599da182a9b9ec7)

 sys/kern/sched_ule.c | 143 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 84 insertions(+), 59 deletions(-)
Comment 40 Mark Johnston freebsd_committer freebsd_triage 2022-08-01 14:21:21 UTC
Thanks again for the report and for your help tracking down these bugs.