Created attachment 194354 [details]
code changes to greatly increase likelihood of the race

I must state upfront that I discovered this issue through code review and had to make special arrangements to provoke it.

The core of the issue is that intr_event_handle iterates the list of handlers, ie_handlers, without any protection whatsoever. Likewise, removal and installation of a filter-only handler makes no attempt to synchronize with intr_event_handle. As such, it is possible (although very improbable) that intr_event_handle iterates into an element just before it is removed and dereferences its pointer to the next element after the element has been freed and the pointer overwritten.

This problem affects only shared interrupts. When an interrupt is not shared, it should be disabled before its handler is torn down.

Here is a stack trace of the crash:

fault virtual address	= 0xffffffffffffffff
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80b64ff0
stack pointer		= 0x28:0xfffffe0000434970
frame pointer		= 0x28:0xfffffe00004349b0
code segment		= base rx0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= resume, IOPL = 0
current process		= 11 (idle: cpu2)
trap number		= 12
panic: page fault
cpuid = 2
time = 1529319165
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000434630
vpanic() at vpanic+0x1a3/frame 0xfffffe0000434690
panic() at panic+0x43/frame 0xfffffe00004346f0
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe0000434740
trap_pfault() at trap_pfault+0x62/frame 0xfffffe0000434790
trap() at trap+0x2ba/frame 0xfffffe00004348a0
calltrap() at calltrap+0x8/frame 0xfffffe00004348a0
--- trap 0xc, rip = 0xffffffff80b64ff0, rsp = 0xfffffe0000434970, rbp = 0xfffffe00004349b0 ---
intr_event_handle() at intr_event_handle+0xa0/frame 0xfffffe00004349b0
intr_execute_handlers() at intr_execute_handlers+0x58/frame 0xfffffe00004349e0
lapic_handle_intr() at lapic_handle_intr+0x6d/frame 0xfffffe0000434a20
Xapic_isr1() at Xapic_isr1+0xd0/frame 0xfffffe0000434a20
--- interrupt, rip = 0xffffffff80bd3b49, rsp = 0xfffffe0000434af0, rbp = 0xfffffe0000434bb0 ---
sched_idletd() at sched_idletd+0x4a9/frame 0xfffffe0000434bb0
fork_exit() at fork_exit+0x84/frame 0xfffffe0000434bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000434bf0

This is what I did to get the crash:
1. Found hardware with shared interrupts. Specifically, I had three USB OHCI controllers sharing PCI IRQ 18.
2. Modified the ohci driver so that it installed a dummy filter instead of its ithread handler. This made the driver non-functional, of course.
3. Modified the IO-APIC code so that it kept re-raising the interrupt, thus increasing the chances of hitting the race within a reasonable time frame.
4. Re-compiled kern_intr.c with QUEUE_MACRO_DEBUG_TRASH to make the race more probable by immediately corrupting a removed handler.
5. Triggered the interrupt storm for IRQ 18.
6. Ran a continuous loop of devctl detach followed by devctl attach for the ohci driver instances sharing the interrupt.

All the code modifications are in the attachment.
The devctl command line was:

while true ; do devctl detach ohci3 && devctl attach pci0:0:19:1 ; devctl detach ohci4 && devctl attach pci0:0:20:5 ; done

The rate of interrupts was about 570K per second:

569k ohci2 ohci

The stack trace in kgdb:

(kgdb) bt
#0  __curthread () at ./machine/pcpu.h:231
#1  doadump (textdump=1) at /usr/devel/svn/head/sys/kern/kern_shutdown.c:366
#2  0xffffffff80ba33e2 in kern_reboot (howto=260) at /usr/devel/svn/head/sys/kern/kern_shutdown.c:446
#3  0xffffffff80ba39c3 in vpanic (fmt=<optimized out>, ap=0xfffffe00004346d0) at /usr/devel/svn/head/sys/kern/kern_shutdown.c:863
#4  0xffffffff80ba3a13 in panic (fmt=<unavailable>) at /usr/devel/svn/head/sys/kern/kern_shutdown.c:790
#5  0xffffffff8107c6ff in trap_fatal (frame=0xfffffe00004348b0, eva=18446744073709551615) at /usr/devel/svn/head/sys/amd64/amd64/trap.c:892
#6  0xffffffff8107c772 in trap_pfault (frame=0xfffffe00004348b0, usermode=<optimized out>) at /usr/devel/svn/head/sys/amd64/amd64/trap.c:728
#7  0xffffffff8107bd7a in trap (frame=0xfffffe00004348b0) at /usr/devel/svn/head/sys/amd64/amd64/trap.c:427
#8  <signal handler called>
#9  intr_event_handle (ie=0xfffff80003349300, frame=0xfffffe0000434a30) at /usr/devel/svn/head/sys/kern/kern_intr.c:1180
#10 0xffffffff811f2118 in intr_execute_handlers (isrc=0xfffff800033845b0, frame=0xfffffe0000434a30) at /usr/devel/svn/head/sys/x86/x86/intr_machdep.c:285
#11 0xffffffff811f841d in lapic_handle_intr (vector=49, frame=0xfffffe0000434a30) at /usr/devel/svn/head/sys/x86/x86/local_apic.c:1270
#12 <signal handler called>
#13 sched_idletd (dummy=<optimized out>) at /usr/devel/svn/head/sys/kern/sched_ule.c:2803
#14 0xffffffff80b62204 in fork_exit (callout=0xffffffff80bd36a0 <sched_idletd>, arg=0x0, frame=0xfffffe0000434c00) at /usr/devel/svn/head/sys/kern/kern_fork.c:1039
This is an old race. I have a very old p4 branch that fixes it by dedicating a spin lock to protecting the list of handlers; that lock is then used as the thread_lock of idle interrupt threads, so the number of spin locks taken per interrupt stays the same.
Yes, it's an old one. I think that using a spinlock is perfectly fine. I also have a work in progress that tries to solve the problem with some atomic magic so that intr_event_handle is completely wait- and spin-free (provided the interrupt thread does not need to run, of course). I guess the difference between the two approaches would be negligible.
Just FYI for next time: if you're making code changes, fail_point(9) can make this kind of scenario much easier to hit. In particular, you can widen race windows by adding an empty KFAIL_POINT_CODE() and then inducing a delay with sysctl my.failpoint="50%delay(1000)" (i.e., DELAY(1000) on 50% of the times we hit that failpoint; the 50% can help avoid excessive foot-shooting).
Is this list a good candidate for epoch(9)-based reclamation? :-)
I suspect you cannot use epoch(9) in primary interrupt context (e.g. filters) just as you can't use regular mutexes there.
But a similar approach might work.
Where does epoch(9) take a regular mtx? I may just be missing it, but I see only spin mutexes.
(In reply to Conrad Meyer from comment #7)
I don't know the epoch(9) / ck_epoch implementation details, but given the constraints[*] of interrupt handling, I think that it could be overkill. I mean, it probably has some overhead that's avoidable, and for such latency-sensitive code that could matter.

[*] The interrupt handler list can be iterated only by a single thread / context at a time. Also, we reduce modifications to a single thread at a time using a mutex. Performance of adding and removing interrupt handlers is not critical.
(In reply to Andriy Gapon from comment #8) Yes, performance of add/remove doesn't matter. But performance of intr_event_handle() matters very much, and adding a contested spin lock to it seems like a bad idea. I suppose it only needs to be taken for shared interrupts, not MSI, so maybe the impact is not so bad.
To be clear, we are not talking about adding a new spin mutex on top of what is there. We are talking about replacing the sched_lock acquisition already used to schedule the ithread with a dedicated spin mutex that protects the list of handlers. Since the patch changes idle ithreads to set td_lock to this new mutex, it can also serve as the thread lock when scheduling the ithread. You can't get away from taking a mutex to call sched_add(), but this lets you change which lock is used for that, in essence.
Sure. It'd still be a single global spin lock, right?
No, it is kind of per-IRQ, not global.
Oh, that's much better.
I have created https://reviews.freebsd.org/D15905 It's a hybrid between the approach I mentioned in comment #2 and what John mentioned (comment #1 and others) and what the current code has.
A commit references this bug:

Author: avg
Date: Fri Aug 3 14:27:29 UTC 2018
New revision: 337255
URL: https://svnweb.freebsd.org/changeset/base/337255

Log:
  safer wait-free iteration of shared interrupt handlers

  The code that iterates a list of interrupt handlers for a (shared)
  interrupt, whether in the ISR context or in the context of an interrupt
  thread, does so in a lock-free fashion. Thus, the routines that modify
  the list need to take special steps to ensure that the iterating code
  has a consistent view of the list. Previously, those routines tried to
  play nice only with the code running in the ithread context. The
  iteration in the ISR context was left to chance.

  After commit r336635, atomic operations and memory fences are used to
  ensure that the ie_handlers list is always safe to navigate with
  respect to insertion and removal of list elements. There is still a
  question of when it is safe to actually free a removed element.

  The idea of this change is somewhat similar to the idea of epoch-based
  reclamation, with some simplifications compared to the general scheme.
  All writers are serialized using a mutex, so we do not need to worry
  about concurrent modifications. Also, all read accesses from the open
  context are serialized too. So, we can get away with just two epochs /
  phases. When a thread removes an element, it switches the global phase
  from the current phase to the other and then drains the previous
  phase. Only after the draining does the removed element actually get
  freed.

  The code that iterates the list in the ISR context takes a snapshot of
  the global phase and then increments the use count of that phase
  before iterating the list. The use count (of the same phase) is
  decremented after the iteration. This ensures that no iteration can
  still be in progress over the removed element when it gets freed.

  This commit also simplifies the coordination with the interrupt thread
  context.
  Now we always schedule the interrupt thread when removing one of the
  handlers for its interrupt. This makes the code both simpler and
  safer, as the interrupt thread masks the interrupt, thus ensuring that
  there is no interaction with the ISR context.

  P.S. This change matters only for shared interrupts and I realize that
  those are becoming a thing of the past (and quickly). I also
  understand that the problem that I am trying to solve is extremely
  rare.

  PR:		229106
  Reviewed by:	cem
  Discussed with:	Samy Al Bahra
  MFC after:	5 weeks
  Differential Revision:	https://reviews.freebsd.org/D15905

Changes:
  head/sys/kern/kern_intr.c
  head/sys/sys/interrupt.h