Summary:   kevent EVFILT_TIMER, timer expiration with twice the period
Product:   Base System
Component: kern
Version:   13.1-STABLE
Hardware:  amd64
OS:        Any
Status:    Closed FIXED
Severity:  Affects Only Me
Priority:  ---
Reporter:  felice.tufo
Assignee:  Mark Johnston <markj>
CC:        markj
Description
felice.tufo
2022-06-24 09:38:28 UTC
Have you been able to reproduce the problem on multiple different hardware systems? What does "sysctl kern.eventtimer" show on an affected system? I tried running it locally on a desktop running 14.0-CURRENT but did not see any problems after a few minutes.

Hi Mark, thanks for the answer. The output of the sysctl command is as follows:

    kern.eventtimer.choice: LAPIC(600) HPET(550) i8254(100) RTC(0)
    kern.eventtimer.et.i8254.quality: 100
    kern.eventtimer.et.i8254.frequency: 1193182
    kern.eventtimer.et.i8254.flags: 1
    kern.eventtimer.et.RTC.quality: 0
    kern.eventtimer.et.RTC.frequency: 32768
    kern.eventtimer.et.RTC.flags: 17
    kern.eventtimer.et.HPET.quality: 550
    kern.eventtimer.et.HPET.frequency: 24000000
    kern.eventtimer.et.HPET.flags: 7
    kern.eventtimer.et.LAPIC.quality: 600
    kern.eventtimer.et.LAPIC.frequency: 2903994335
    kern.eventtimer.et.LAPIC.flags: 7
    kern.eventtimer.periodic: 0
    kern.eventtimer.timer: LAPIC
    kern.eventtimer.idletick: 0
    kern.eventtimer.singlemul: 2

I did not try other systems because the HW cannot be changed (the application has to work on that industrial PC), but I could run some tests on other systems just to see if it is reproducible on other platforms as well. I'm sorry, but a test of just a few minutes is not enough; I usually let it run for hours, and in one run the issue happened nearly a day and a half after the start. Also, I never tried 14.0-CURRENT, only 12.2 and 13.1. Thanks, Felice

I'm able to reproduce what looks like a similar problem on a different test system. In my case, the extra delay is always close to a multiple of 20ms (the kqueue timer period), usually 40ms or 60ms. It often occurs soon after the program starts up. It happens with or without setting the scheduling policy. I suspect the problem is related to some kind of race in the EVFILT_TIMER code. If I add EV_ONESHOT to the event, the problem is not reproducible anymore. In the test case, when kevent() returns the timer is still armed, and it's supposed to be reconfigured by the next kevent() call.

Hi Mark, I tried running my example setting EV_ONESHOT too; in my case the problem still showed up, but that could be only a matter of test duration. I also tried writing a kernel module that uses the callout framework directly: I noticed small extra delays (1-2 ms) every now and then, but never big delays (twice or more of the programmed period) as in userspace, so I agree that there could be something "strange" going on in the kevent layer. I'm waiting for some hardware to become available to launch the test on different systems.

Are you able to reproduce the problem if all threads are pinned to a single CPU? For instance:

    # cpuset -l 10 ./a.out

To expand on my question a bit, I think there is a race somewhere in the eventtimer code which occurs when a callout is scheduled on a different CPU. In particular, when an EVFILT_TIMER event is created, it is "pinned" to the current CPU, and all future callouts will fire on that CPU. I am not sure why kevent does this; it doesn't make much sense to me. If I change filt_timertouch() to set kc->cpuid = PCPU_GET(cpuid) each time the event is re-added, then the bug is much harder to trigger. Alternately, if the test is changed so that each thread pins itself to a different CPU, then the problem is not reproducible for me.

(In reply to felice.tufo from comment #4) Does your test module specify the target CPU when arming a callout?
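[Editor's note: to make the pattern under discussion concrete, here is a minimal sketch of the kind of periodic EVFILT_TIMER loop being described, with the actual wait measured via CLOCK_MONOTONIC. It is illustrative only and not the attached test program; the 50 ms period and the lateness threshold are assumptions, and the timer is left armed across kevent() calls (the non-EV_ONESHOT case discussed above).]

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <err.h>
    #include <stdio.h>
    #include <time.h>

    int
    main(void)
    {
            struct kevent kev, ev;
            struct timespec start, stop;
            long long waited;
            int kq;

            if ((kq = kqueue()) == -1)
                    err(1, "kqueue");

            /* Arm a periodic timer: ident 1, period 50 ms (data is in ms by default). */
            EV_SET(&kev, 1, EVFILT_TIMER, EV_ADD | EV_ENABLE, 0, 50, NULL);
            if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
                    err(1, "kevent (add)");

            for (;;) {
                    clock_gettime(CLOCK_MONOTONIC, &start);
                    if (kevent(kq, NULL, 0, &ev, 1, NULL) == -1)
                            err(1, "kevent (wait)");
                    clock_gettime(CLOCK_MONOTONIC, &stop);

                    /* Flag wakeups noticeably later than the programmed period. */
                    waited = (stop.tv_sec - start.tv_sec) * 1000000000LL +
                        (stop.tv_nsec - start.tv_nsec);
                    if (waited > 55 * 1000000LL)
                            printf("late wakeup: waited %lld ns\n", waited);
            }
    }

[With EV_ONESHOT set, the timer would instead have to be re-added before each wait; that is the variant Mark mentions above.]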
Hi Mark, unfortunately I have to share the system with other engineers, so I'll try to run the test with the cpuset command as soon as possible (hopefully during the weekend). In the meantime, I can add some info:

- I didn't specify the target CPU when arming a callout, nor when using kevent.
- The original problem fired in a multithreaded program, but only one of the threads was calling kevent(); the others were receiving data on sockets. The program ran at priority 0, but was not pinned to any CPU.

I somehow suspected that multithreading could make the problem fire more frequently, so I wrote the multithreaded test program I attached to this bug report, but the original problem fired while a single thread was calling kevent with a 50ms period.

Created attachment 235005 [details]
possible bug fix
Could you please give this patch a try? It solves the problem for me. I think testing this would be more useful than the other things I suggested so far (pinning threads, etc.).
I think it'll apply to 13.1 and 12.3 without any adjustment.
Basically, when scheduling a callout on a different CPU, there is a small race which can cause the event timer to be programmed incorrectly, resulting in delayed processing of events. Your excellent test program triggers this readily on one of my systems with many CPUs; over time the scheduler migrates threads away from the "home" CPU selected during knote initialization, so the window for this race is opened.
Thank you for the patch Mark, I'm going to apply it to the 13.1 kernel, launch the test and keep it running for the whole weekend. Fingers crossed... :-)

Created attachment 235023 [details]
patched file
Comment on attachment 235023 [details]
patched file
After some hours running, unfortunately the problem fired again (programmed 50ms period, slept 100ms)
Cicle 255600 - Thread 1240 OK
Cicle 255600 - Thread 1241 OK
Cicle 255600 - Thread 1236 OK
Cicle 255600 - Thread 1242 OK
Thread 1239
start sec 13352 nsec 453408127
stop sec 13352 nsec 553448989
wait sec 0 nsec 100040862
Could you please double check that the file I've just attached was patched correctly?
After patching, I just copied (renaming) the GENERIC amd64 configuration, did a 'make buildkernel KERNELCONF=xxx', 'make installkernel KERNELCONF=xxx' and rebooted.
I'm relaunching the test, pinning all the threads to a CPU... let's see if this helps.
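[Editor's note: as an aside on the pinning experiments Mark suggested (running the whole process under cpuset(1), or having each thread pin itself to a CPU), a thread can pin itself from inside the program via cpuset_setaffinity(2). A hedged sketch follows; the helper name and the choice of CPU are illustrative, not part of the attached test program.]

    #include <sys/param.h>
    #include <sys/cpuset.h>
    #include <err.h>

    /*
     * Pin the calling thread to a single CPU; roughly what "cpuset -l <cpu>"
     * does for the whole process, but applied per thread.
     */
    static void
    pin_self(int cpu)
    {
            cpuset_t mask;

            CPU_ZERO(&mask);
            CPU_SET(cpu, &mask);
            /* id == -1 with CPU_WHICH_TID means "the calling thread". */
            if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                sizeof(mask), &mask) != 0)
                    err(1, "cpuset_setaffinity");
    }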
(In reply to felice.tufo from comment #11) The variable is KERNCONF, not KERNELCONFIG, but I don't see why you would need to create a new kernel configuration for a patch. The patched kern_clocksource.c looks right to me.

(In reply to Mark Johnston from comment #12) Yes, you're right, I just followed this guide to be sure I was doing everything correctly (patching and building a FreeBSD kernel is not my daily job, sorry :-)): https://docs.freebsd.org/en/books/handbook/kernelconfig/#kernelconfig-building So now the test is running again on the patched system, but pinned to a CPU...

(In reply to felice.tufo from comment #13) Just to double check, what does "uname -a" report when running the patched kernel?

(In reply to Mark Johnston from comment #14)

    $ uname -a
    FreeBSD schrodinger 13.1-RELEASE FreeBSD 13.1-RELEASE FELIX amd64

FELIX is the name of the configuration used (a copy of the GENERIC configuration). Also the /boot dir seems to be ok, with the kernel subdir created this morning after compiling and installing the patched kernel:

    $ ls -l /boot/
    total 10045
    [CUT]
    drwxr-xr-x  2 root  wheel  18432 Jul  1 11:47 kernel
    drwxr-xr-x  2 root  wheel  18432 Jun 15 12:03 kernel.old

After 5 hours, the test is still running fine (patched kernel and process pinned to CPU 1). I'll check it again tomorrow morning.

Ok, I'm still able to reproduce the problem with my patch. But the frequency of the late wakeups is much, much lower, and I'm fairly sure that the patch solves a legitimate problem on its own. Specifically, getnextcpuevent() contains the following bit of code:

    /* Handle callout events. */
    if (event > state->nextcall)
            event = state->nextcall;

It gets called when a CPU goes idle, and the idle thread is scheduling the next timer interrupt. "nextcall" is not volatile, and interrupts are enabled, so state->nextcall can be modified between the check and the load, and the CPU will end up using a stale value. It might be that there is another, similar race in handleevents(), since a different CPU might be updating state->nextcall concurrently. But I think that race is harmless since callout_process() should be able to find the next callout...

(In reply to Mark Johnston from comment #16) I can confirm that, with your patch applied, it's harder to make the late wakeup fire on a virtual machine on which I'm running the test with FreeBSD 12.2. I've also just checked the industrial PC running FreeBSD 13.1: the test is still running with no late wakeups after nearly 27 hours, so probably your patch + cpuset did the trick. I'll let it run until Monday and, if it's still ok, I'll stop the multithreaded test and start the original application which was failing, to see how it behaves with the patch + cpuset. In the meantime, I'm fully available to run any test that may help to find the culprit... thanks for your timely support, Mark!

Hi Mark, I just checked the test; this time it failed with a different pattern: it seems that after cycle 3865200 (that is, exactly 53 hours from the start) all the timers failed with a delay of 400-450 ms (programmed period: 50ms).
Cicle 3865200 - Thread 1239 OK
Cicle 3865200 - Thread 1242 OK
Cicle 3865200 - Thread 1234 OK
Cicle 3865200 - Thread 1230 OK
Cicle 3865200 - Thread 1244 OK
Cicle 3865200 - Thread 1236 OK
Cicle 3865200 - Thread 1240 OK
Cicle 3865200 - Thread 1232 OK
Cicle 3865200 - Thread 1235 OK
Cicle 3865200 - Thread 1238 OK
Cicle 3865200 - Thread 1243 OK
Cicle 3865200 - Thread 1237 OK
Cicle 3865200 - Thread 1231 OK
Cicle 3865200 - Thread 1233 OK
Cicle 3865200 - Thread 1241 OK
Thread 1242
start sec 213158 nsec 215834696
stop sec 213158 nsec 670142788
wait sec 0 nsec 454308092
Thread 1233
start sec 213158 nsec 265846605
stop sec 213158 nsec 670278496
wait sec 0 nsec 404431891
Thread 1232
start sec 213158 nsec 215851127
stop sec 213158 nsec 670164764
wait sec 0 nsec 454313637
Thread 1244
start sec 213158 nsec 215864424
stop sec 213158 nsec 670169963
wait sec 0 nsec 454305539
Thread 1238
start sec 213158 nsec 215877180
stop sec 213158 nsec 670174861
wait sec 0 nsec 454297681
Thread 1236
start sec 213158 nsec 215883699
stop sec 213158 nsec 670179996
wait sec 0 nsec 454296297
Thread 1243
start sec 213158 nsec 215889243
stop sec 213158 nsec 670184644
wait sec 0 nsec 454295401
Thread 1237
start sec 213158 nsec 215894354
stop sec 213158 nsec 670189292
wait sec 0 nsec 454294938
Thread 1240
start sec 213158 nsec 215899542
stop sec 213158 nsec 670194408
wait sec 0 nsec 454294866
Thread 1231
start sec 213158 nsec 215904826
stop sec 213158 nsec 670199160
wait sec 0 nsec 454294334
Thread 1241
start sec 213158 nsec 215910387
stop sec 213158 nsec 670203783
wait sec 0 nsec 454293396
Thread 1235
start sec 213158 nsec 265824915
stop sec 213158 nsec 670208342
wait sec 0 nsec 404383427
Thread 1239
start sec 213158 nsec 265854966
stop sec 213158 nsec 670213146
wait sec 0 nsec 404358180
Thread 1234
start sec 213158 nsec 265862375
stop sec 213158 nsec 670288335
wait sec 0 nsec 404425960
Thread 1230
start sec 213158 nsec 215839589
stop sec 213158 nsec 670157348
wait sec 0 nsec 454317759

(In reply to felice.tufo from comment #17) I've found a couple of other bugs, including one in the scheduler (but this one should only be in 13.1). I'll post a patch once some more testing is done, probably tomorrow. BTW, I don't think it's worth spending any time testing with pinned threads anymore, I'm sure pinning will hide all of these bugs.

(In reply to felice.tufo from comment #18) Hum. That sounds like an overflow bug, but I can't think of anything that'd overflow at 53 hours of uptime. I haven't run the test program long enough to reproduce it, I'll try it later this week once the easier-to-reproduce issues are ironed out.

Hi Mark, any news? Can I help with some tests?

(In reply to felice.tufo from comment #20) Sorry for the delayed reply, this has turned into quite a rabbit hole. I've posted patches for a few problems that I found:

https://reviews.freebsd.org/D35744
https://reviews.freebsd.org/D35736
https://reviews.freebsd.org/D35735
https://reviews.freebsd.org/D35738

but there is at least one other bug which causes test failures, and while I have a patch which fixes it, it also causes some weird, errant behaviour in the CPU idle loop that I don't yet understand. Once that's sorted out, I *think* I'll have fixed all of the problems found with the test program and will post a combined patch for you to test.
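[Editor's note: the first commit below is the fix for the getnextcpuevent()/handleevents() race described a few comments up. As a rough, self-contained model of the shape of that race, using stand-in types and a pthread mutex in place of the kernel's timer-state lock (this is not the real kern_clocksource.c code):]

    #include <stdint.h>
    #include <pthread.h>

    /*
     * Stand-in for the per-CPU eventtimer state; nextcall models the time of
     * the next scheduled callout (sbintime_t in the kernel).
     */
    struct pcpu_state {
            int64_t nextcall;
    };

    static pthread_mutex_t timer_state_lock = PTHREAD_MUTEX_INITIALIZER;

    /*
     * Racy shape (pre-fix): nextcall is read without holding the timer-state
     * lock and with interrupts enabled, so a remote callout_cc_add() can
     * update it concurrently and a stale, too-late value gets programmed.
     */
    static int64_t
    next_event_racy(struct pcpu_state *state, int64_t event)
    {
            if (event > state->nextcall)            /* check */
                    event = state->nextcall;        /* use: may already be stale */
            return (event);
    }

    /*
     * Fixed shape, mirroring the committed change: compute the next event
     * while holding the timer-state lock.
     */
    static int64_t
    next_event_locked(struct pcpu_state *state, int64_t event)
    {
            pthread_mutex_lock(&timer_state_lock);
            if (event > state->nextcall)
                    event = state->nextcall;
            pthread_mutex_unlock(&timer_state_lock);
            return (event);
    }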
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a889a65ba36985dfb31111ac1607be35ca2b2c8c

commit a889a65ba36985dfb31111ac1607be35ca2b2c8c
Author: Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-06-30 18:27:07 +0000
Commit: Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-07-11 19:58:43 +0000

eventtimer: Fix several races in the timer reload code

In handleevents(), lock the timer state before fetching the time for the next event. A concurrent callout_cc_add() call might be changing the next event time, and the race can cause handleevents() to program an out-of-date time, causing the callout to run later (by an unbounded period, up to the idle hardclock period of 1s) than requested.

In cpu_idleclock(), call getnextcpuevent() with the timer state mutex held, for similar reasons. In particular, cpu_idleclock() runs with interrupts enabled, so an untimely timer interrupt can result in a stale next event time being programmed. Further, an interrupt can cause cpu_idleclock() to use a stale value for "now".

In cpu_activeclock(), disable interrupts before loading "now", so as to avoid going backwards in time when calling handleevents(). It's ok to leave interrupts enabled when checking "state->idle", since the race at worst will cause handleevents() to be called unnecessarily. But use an atomic load to indicate that the test is racy.

PR: 264867
Reviewed by: mav, jhb, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35735

sys/kern/kern_clocksource.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)

(In reply to Mark Johnston from comment #21) Ok, last patch posted here: https://reviews.freebsd.org/D35777 I'll try combining them into a patch against 13.1.

Created attachment 235195 [details]
13.1 patch

Here's a combined diff against 13.1. Please give it a try if you can, without any CPU pinning. There is a lingering source of delayed scheduling, and that occurs when a callout takes "too long". I fixed one instance of that in UMA (https://cgit.FreeBSD.org/src/commit/?id=93cd28ea82bb7b5e9489c8febfe1ae7ed2c044fb) but didn't include it here. Such delays should only be by several milliseconds, though.

Thank you Mark, I've just patched the kernel and restarted the multithreaded test. Let's see what happens in the next 3-4 days...

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0927ff78147b4d00a75054bbad299946208e1e91

commit 0927ff78147b4d00a75054bbad299946208e1e91
Author: Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-07-14 14:23:43 +0000
Commit: Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-07-14 14:27:58 +0000

sched_ule: Enable preemption of curthread in the load balancer

The load balancer executes from statclock and periodically tries to move threads among CPUs in order to balance load. It may move a thread to the current CPU (the loader balancer always runs on CPU 0). When it does so, it may need to schedule preemption of the interrupted thread. Use sched_setpreempt() to do so, same as sched_add().

PR: 264867
Reviewed by: mav, kib, jhb
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35744

sys/kern/sched_ule.c | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=6d3f74a14a83b867c273c6be2599da182a9b9ec7

commit 6d3f74a14a83b867c273c6be2599da182a9b9ec7
Author: Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-07-14 14:21:28 +0000
Commit: Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-07-14 14:27:51 +0000

sched_ule: Fix racy loads of pc_curthread

Thread switching used to be atomic with respect to the current CPU's tdq lock. Since commit 686bcb5c14ab that is no longer the case. Now sched_switch() does this:

1. lock tdq (might already be locked)
2. maybe put the current thread in the tdq, choose a new thread to run
2a. update tdq_lowpri
3. unlock tdq
4. switch CPU context, update curthread

Some code paths in ULE will load pc_curthread from a remote CPU with that CPU's tdq lock held, usually to inspect its priority. But, as of the aforementioned commit this is racy.

The problem I noticed is in tdq_notify(), which optionally sends an IPI to a remote CPU when a new thread is added to its runqueue. If the new thread's priority is higher (lower) than the currently running thread's priority, then we deliver an IPI. But inspecting pc_curthread->td_priority doesn't work, since pc_curthread might be between steps 3 and 4 above. If pc_curthread's priority is higher than that of the newly added thread, but pc_curthread is switching to a lower-priority thread, then tdq_notify() might fail to deliever an IPI, leaving a high priority thread stuck on the runqueue for longer than it should. This can cause multi-millisecond stalls in interactive/ithread/realtime threads.

Fix this problem by modifying tdq_add() and tdq_move() to return the value of tdq_lowpri before the addition of the new thread. This ensures that tdq_notify() has the correct priority value to compare against.

The other two uses of pc_curthread are susceptible to the same race. To fix the one in sched_rem()->tdq_setlowpri() we need to have an exact value for curthread. Thus, introduce a new tdq_curthread field to the tdq which gets updated any time a new thread is selected to run on the CPU. Because this field is synchronized by the thread lock, its priority reflects the correct lowpri value for the tdq.

PR: 264867
Fixes: 686bcb5c14ab ("schedlock 4/4")
Reviewed by: mav, kib, jhb
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35736

sys/kern/sched_ule.c | 143 ++++++++++++++++++++++++++++++---------------------
1 file changed, 84 insertions(+), 59 deletions(-)

A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=03f868b163ad46d6f7cb03dc46fb83ca01fb8f69

commit 03f868b163ad46d6f7cb03dc46fb83ca01fb8f69
Author: Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-07-14 14:24:25 +0000
Commit: Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-07-14 14:28:01 +0000

x86: Add a required store-load barrier in cpu_idle()

ULE's tdq_notify() tries to avoid delivering IPIs to the idle thread. In particular, it tries to detect whether the idle thread is running. There are two mechanisms for this:

- tdq_cpu_idle, an MI flag which is set prior to calling cpu_idle().
  If tdq_cpu_idle == 0, then no IPI is needed;
- idle_state, an x86-specific state flag which is updated after cpu_idleclock() is called.

The implementation of the second mechanism is racy; the race can cause a CPU to go to sleep with pending work. Specifically, cpu_idle_*() set idle_state = STATE_SLEEPING, then check for pending work by loading the tdq_load field of the CPU's runqueue. These operations can be reordered so that the idle thread observes tdq_load == 0, and tdq_notify() observes idle_state == STATE_RUNNING.

Some counters indicate that the idle_state check in tdq_notify() frequently elides an IPI. So, fix the problem by inserting a fence after the store to idle_state, immediately before idling the CPU.

PR: 264867
Reviewed by: mav, kib, jhb
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35777

sys/x86/x86/cpu_machdep.c | 103 ++++++++++++++++++++++++++++------------------
1 file changed, 62 insertions(+), 41 deletions(-)

(In reply to felice.tufo from comment #25) I am hoping that no news is good news. :) In any case, I committed a number of bug fixes that I'm confident in. It's certainly possible that I did not catch everything, but the test is very reliable for me now.

Hi Mark, the current test cycle is 3870000, more than 53 hours, the longest run without errors since we started evaluating FreeBSD, so your patch seems very promising. I'll keep it running for the whole weekend and, on Monday, if everything is still ok, I'll replace the multithreaded test with the application that fired the issue the first time, so let's have an update next Tuesday... In the meantime, let me give you some details about the application: it is a "2 out of 2" railway vital system, expected to run 24/7/365 (ideally for years) without problems; it is composed of multiple (multithreaded) processes communicating over sockets and shared memory; the clock of the industrial PC running FreeBSD is synchronized via the Precision Time Protocol (by the external ptpd daemon). kevent is used as a clock with a 50ms period: we have different jobs running in each 50ms phase and read CLOCK_MONOTONIC at the start and the end of each phase to verify that no job lasted longer than 50ms. I know it is a generic description, but could you foresee any problems in using FreeBSD for such an application? I mean, things like: "expect to reboot FreeBSD every 6 months because <reason>". Thanks a lot!

(In reply to felice.tufo from comment #30) There is no technical need to reboot FreeBSD periodically. Of course, there is never a guarantee; bugs are always possible.

Hi Mark, good news from the test: it hasn't fired any errors in 6 days, the longest run ever. Now I'm restarting the application which fired the error the first time... let's wait some days and see what happens!

(In reply to felice.tufo from comment #32) Nice! Glad to hear it.

Hi Mark, after some days of testing with no issues firing, I think we can reasonably say that this bug can be closed. Thanks for all of your support!

(In reply to felice.tufo from comment #34) Great, glad to hear it. I'll keep this PR open until the patches are merged to stable/13.
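[Editor's note: commit 03f868b163ad above describes a classic store-load ordering hazard: the store publishing the idle state and the subsequent load of the run-queue length must not be reordered. As a rough, portable C11 model of the fixed pattern (variable and function names are stand-ins; the real code is in sys/x86/x86/cpu_machdep.c and uses the kernel's atomic primitives):]

    #include <stdatomic.h>

    enum { STATE_RUNNING, STATE_SLEEPING };

    static _Atomic int idle_state = STATE_RUNNING; /* stand-in for the per-CPU idle flag */
    static _Atomic int tdq_load;                   /* stand-in for the run-queue length */

    static int
    idle_should_sleep(void)
    {
            /* Publish "this CPU is about to sleep"... */
            atomic_store_explicit(&idle_state, STATE_SLEEPING, memory_order_relaxed);

            /*
             * ...and force that store to become visible before the load below.
             * This full fence is what the fix adds; without it the load of
             * tdq_load may effectively be performed before the store, and the
             * CPU can go to sleep while work is pending.
             */
            atomic_thread_fence(memory_order_seq_cst);

            /* Sleep only if no work was queued in the meantime. */
            return (atomic_load_explicit(&tdq_load, memory_order_relaxed) == 0);
    }

[The remote side (tdq_notify()) does the mirror image: it publishes the new run-queue load and then checks idle_state to decide whether an IPI is needed, which is why the ordering on both sides matters.]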
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=3be1eaa32b95857cf848295e10ffc25e2852d2e6

commit 3be1eaa32b95857cf848295e10ffc25e2852d2e6
Author: Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-06-30 18:27:07 +0000
Commit: Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-07-25 13:45:51 +0000

eventtimer: Fix several races in the timer reload code

In handleevents(), lock the timer state before fetching the time for the next event. A concurrent callout_cc_add() call might be changing the next event time, and the race can cause handleevents() to program an out-of-date time, causing the callout to run later (by an unbounded period, up to the idle hardclock period of 1s) than requested.

In cpu_idleclock(), call getnextcpuevent() with the timer state mutex held, for similar reasons. In particular, cpu_idleclock() runs with interrupts enabled, so an untimely timer interrupt can result in a stale next event time being programmed. Further, an interrupt can cause cpu_idleclock() to use a stale value for "now".

In cpu_activeclock(), disable interrupts before loading "now", so as to avoid going backwards in time when calling handleevents(). It's ok to leave interrupts enabled when checking "state->idle", since the race at worst will cause handleevents() to be called unnecessarily. But use an atomic load to indicate that the test is racy.

PR: 264867
Reviewed by: mav, jhb, kib
Sponsored by: The FreeBSD Foundation
(cherry picked from commit a889a65ba36985dfb31111ac1607be35ca2b2c8c)

sys/kern/kern_clocksource.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=20abb8075086dd31fae4ad39e84a44ce429a357e

commit 20abb8075086dd31fae4ad39e84a44ce429a357e
Author: Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-07-14 14:23:43 +0000
Commit: Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-08-01 14:12:36 +0000

sched_ule: Enable preemption of curthread in the load balancer

The load balancer executes from statclock and periodically tries to move threads among CPUs in order to balance load. It may move a thread to the current CPU (the loader balancer always runs on CPU 0). When it does so, it may need to schedule preemption of the interrupted thread. Use sched_setpreempt() to do so, same as sched_add().

PR: 264867
Reviewed by: mav, kib, jhb
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 0927ff78147b4d00a75054bbad299946208e1e91)

sys/kern/sched_ule.c | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=84a0d34d10aca36b1e7f9d00d0c4883f3355883b

commit 84a0d34d10aca36b1e7f9d00d0c4883f3355883b
Author: Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-07-14 14:24:25 +0000
Commit: Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-08-01 14:12:49 +0000

x86: Add a required store-load barrier in cpu_idle()

ULE's tdq_notify() tries to avoid delivering IPIs to the idle thread. In particular, it tries to detect whether the idle thread is running. There are two mechanisms for this:

- tdq_cpu_idle, an MI flag which is set prior to calling cpu_idle(). If tdq_cpu_idle == 0, then no IPI is needed;
- idle_state, an x86-specific state flag which is updated after cpu_idleclock() is called.

The implementation of the second mechanism is racy; the race can cause a CPU to go to sleep with pending work.
Specifically, cpu_idle_*() set idle_state = STATE_SLEEPING, then check for pending work by loading the tdq_load field of the CPU's runqueue. These operations can be reordered so that the idle thread observes tdq_load == 0, and tdq_notify() observes idle_state == STATE_RUNNING.

Some counters indicate that the idle_state check in tdq_notify() frequently elides an IPI. So, fix the problem by inserting a fence after the store to idle_state, immediately before idling the CPU.

PR: 264867
Reviewed by: mav, kib, jhb
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 03f868b163ad46d6f7cb03dc46fb83ca01fb8f69)

sys/x86/x86/cpu_machdep.c | 103 ++++++++++++++++++++++++++++------------------
1 file changed, 62 insertions(+), 41 deletions(-)

A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a869b8c5f0ba01b8c7fbc4b486e7b8720b8525ee

commit a869b8c5f0ba01b8c7fbc4b486e7b8720b8525ee
Author: Mark Johnston <markj@FreeBSD.org>
AuthorDate: 2022-07-14 14:21:28 +0000
Commit: Mark Johnston <markj@FreeBSD.org>
CommitDate: 2022-08-01 14:12:12 +0000

sched_ule: Fix racy loads of pc_curthread

Thread switching used to be atomic with respect to the current CPU's tdq lock. Since commit 686bcb5c14ab that is no longer the case. Now sched_switch() does this:

1. lock tdq (might already be locked)
2. maybe put the current thread in the tdq, choose a new thread to run
2a. update tdq_lowpri
3. unlock tdq
4. switch CPU context, update curthread

Some code paths in ULE will load pc_curthread from a remote CPU with that CPU's tdq lock held, usually to inspect its priority. But, as of the aforementioned commit this is racy.

The problem I noticed is in tdq_notify(), which optionally sends an IPI to a remote CPU when a new thread is added to its runqueue. If the new thread's priority is higher (lower) than the currently running thread's priority, then we deliver an IPI. But inspecting pc_curthread->td_priority doesn't work, since pc_curthread might be between steps 3 and 4 above. If pc_curthread's priority is higher than that of the newly added thread, but pc_curthread is switching to a lower-priority thread, then tdq_notify() might fail to deliever an IPI, leaving a high priority thread stuck on the runqueue for longer than it should. This can cause multi-millisecond stalls in interactive/ithread/realtime threads.

Fix this problem by modifying tdq_add() and tdq_move() to return the value of tdq_lowpri before the addition of the new thread. This ensures that tdq_notify() has the correct priority value to compare against.

The other two uses of pc_curthread are susceptible to the same race. To fix the one in sched_rem()->tdq_setlowpri() we need to have an exact value for curthread. Thus, introduce a new tdq_curthread field to the tdq which gets updated any time a new thread is selected to run on the CPU. Because this field is synchronized by the thread lock, its priority reflects the correct lowpri value for the tdq.

PR: 264867
Fixes: 686bcb5c14ab ("schedlock 4/4")
Reviewed by: mav, kib, jhb
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 6d3f74a14a83b867c273c6be2599da182a9b9ec7)

sys/kern/sched_ule.c | 143 ++++++++++++++++++++++++++++++---------------------
1 file changed, 84 insertions(+), 59 deletions(-)

Thanks again for the report and for your help tracking down these bugs.