Created attachment 194354 [details]
code changes to greatly increase the likelihood of the race
I must state upfront that I discovered the issue through code review and I had to make special arrangements to provoke the problem.
The core of the issue is that intr_event_handle iterates the list of handlers, ie_handlers, without any protection whatsoever. Also, removal and installation of a filter-only handler makes no attempt to synchronize with intr_event_handle.
As such, it is possible (although very improbable) that intr_event_handle may iterate into an element just before it is removed and dereference its pointer to the next element after the former element has been freed and the pointer overwritten.
This problem affects only shared interrupts. When an interrupt is not shared, it should be disabled before its handler is torn down.
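The unsafe pattern can be sketched in a few lines. This is a hypothetical, userspace-simplified model (the real code uses TAILQ(3) lists in sys/kern/kern_intr.c); all names here are illustrative, not the actual kernel structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the real intr_handler / intr_event structures. */
struct intr_handler {
	int	(*ih_filter)(void *);
	void	*ih_arg;
	struct intr_handler *ih_next;
};

struct intr_event {
	struct intr_handler *ie_handlers;	/* head of the handler list */
};

static int
dummy_filter(void *arg)
{
	(void)arg;
	return (0);
}

/*
 * ISR-context iteration: runs with NO lock held.  If another CPU unlinks
 * and frees the element we are standing on, reading ih->ih_next is a
 * use-after-free.
 */
static void
intr_event_handle_sketch(struct intr_event *ie)
{
	struct intr_handler *ih;

	for (ih = ie->ie_handlers; ih != NULL; ih = ih->ih_next)  /* hazard */
		ih->ih_filter(ih->ih_arg);
}

/*
 * Removal path: unlinks and frees immediately, with no grace period for a
 * concurrent ISR-context iterator.
 */
static void
intr_event_remove_handler_sketch(struct intr_event *ie, struct intr_handler *ih)
{
	struct intr_handler **pp;

	for (pp = &ie->ie_handlers; *pp != NULL; pp = &(*pp)->ih_next) {
		if (*pp == ih) {
			*pp = ih->ih_next;
			/*
			 * QUEUE_MACRO_DEBUG_TRASH would poison ih_next here,
			 * turning the latent use-after-free into an
			 * immediate fault, which is what made the race easy
			 * to catch in the reproduction below.
			 */
			free(ih);	/* unsafe: an ISR may still hold ih */
			return;
		}
	}
}
```

Single-threaded the sketch is well defined; the bug only manifests when the iteration and the removal run concurrently on different CPUs.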
Here is a stack trace of the crash:
fault virtual address = 0xffffffffffffffff
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80b64ff0
stack pointer = 0x28:0xfffffe0000434970
frame pointer = 0x28:0xfffffe00004349b0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 11 (idle: cpu2)
trap number = 12
panic: page fault
cpuid = 2
time = 1529319165
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000434630
vpanic() at vpanic+0x1a3/frame 0xfffffe0000434690
panic() at panic+0x43/frame 0xfffffe00004346f0
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe0000434740
trap_pfault() at trap_pfault+0x62/frame 0xfffffe0000434790
trap() at trap+0x2ba/frame 0xfffffe00004348a0
calltrap() at calltrap+0x8/frame 0xfffffe00004348a0
--- trap 0xc, rip = 0xffffffff80b64ff0, rsp = 0xfffffe0000434970, rbp = 0xfffffe00004349b0 ---
intr_event_handle() at intr_event_handle+0xa0/frame 0xfffffe00004349b0
intr_execute_handlers() at intr_execute_handlers+0x58/frame 0xfffffe00004349e0
lapic_handle_intr() at lapic_handle_intr+0x6d/frame 0xfffffe0000434a20
Xapic_isr1() at Xapic_isr1+0xd0/frame 0xfffffe0000434a20
--- interrupt, rip = 0xffffffff80bd3b49, rsp = 0xfffffe0000434af0, rbp = 0xfffffe0000434bb0 ---
sched_idletd() at sched_idletd+0x4a9/frame 0xfffffe0000434bb0
fork_exit() at fork_exit+0x84/frame 0xfffffe0000434bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000434bf0
This is what I did to get the crash.
1. Found hardware with shared interrupts. Specifically, I had three USB OHCI controllers sharing PCI IRQ 18.
2. Modified the ohci driver, so that it installed a dummy filter instead of its ithread handler. This made the driver non-functional, of course.
3. Modified IO-APIC code, so that it kept re-raising the interrupt thus increasing the chances of getting the race within a reasonable time frame.
4. Re-compiled kern_intr.c with QUEUE_MACRO_DEBUG_TRASH to make the race more probable by immediately corrupting a removed handler.
5. Triggered the interrupt storm for IRQ 18.
6. Ran a continuous loop of devctl detach followed by devctl attach for ohci driver instances sharing the interrupt.
All the code modifications are in the attachment.
The devctl command line was:
while true ; do devctl detach ohci3 && devctl attach pci0:0:19:1 ; devctl detach ohci4 && devctl attach pci0:0:20:5 ; done
The rate of interrupts was about 570K per second:
569k ohci2 ohci
The stack trace in kgdb:
#0 __curthread () at ./machine/pcpu.h:231
#1 doadump (textdump=1) at /usr/devel/svn/head/sys/kern/kern_shutdown.c:366
#2 0xffffffff80ba33e2 in kern_reboot (howto=260) at /usr/devel/svn/head/sys/kern/kern_shutdown.c:446
#3 0xffffffff80ba39c3 in vpanic (fmt=<optimized out>, ap=0xfffffe00004346d0) at /usr/devel/svn/head/sys/kern/kern_shutdown.c:863
#4 0xffffffff80ba3a13 in panic (fmt=<unavailable>) at /usr/devel/svn/head/sys/kern/kern_shutdown.c:790
#5 0xffffffff8107c6ff in trap_fatal (frame=0xfffffe00004348b0, eva=18446744073709551615) at /usr/devel/svn/head/sys/amd64/amd64/trap.c:892
#6 0xffffffff8107c772 in trap_pfault (frame=0xfffffe00004348b0, usermode=<optimized out>) at /usr/devel/svn/head/sys/amd64/amd64/trap.c:728
#7 0xffffffff8107bd7a in trap (frame=0xfffffe00004348b0) at /usr/devel/svn/head/sys/amd64/amd64/trap.c:427
#8 <signal handler called>
#9 intr_event_handle (ie=0xfffff80003349300, frame=0xfffffe0000434a30) at /usr/devel/svn/head/sys/kern/kern_intr.c:1180
#10 0xffffffff811f2118 in intr_execute_handlers (isrc=0xfffff800033845b0, frame=0xfffffe0000434a30) at /usr/devel/svn/head/sys/x86/x86/intr_machdep.c:285
#11 0xffffffff811f841d in lapic_handle_intr (vector=49, frame=0xfffffe0000434a30) at /usr/devel/svn/head/sys/x86/x86/local_apic.c:1270
#12 <signal handler called>
#13 sched_idletd (dummy=<optimized out>) at /usr/devel/svn/head/sys/kern/sched_ule.c:2803
#14 0xffffffff80b62204 in fork_exit (callout=0xffffffff80bd36a0 <sched_idletd>, arg=0x0, frame=0xfffffe0000434c00) at /usr/devel/svn/head/sys/kern/kern_fork.c:1039
This is an old race. I have a very old p4 branch that fixes this by dedicating a spin lock to protecting the list of handlers (and then using that lock as the thread_lock of idle interrupt threads, so that the number of spin locks taken for an interrupt remains the same).
Yes, it's an old one.
I think that using a spinlock is perfectly fine.
I also have a work-in-progress that tries to solve the problem with some atomic magic, so that intr_event_handle is completely wait/spin free (if the interrupt thread does not need to run, of course). I guess that the difference between the two approaches would be negligible.
Just FYI for next time: if you're making code changes, fail_point(9) can make this kind of scenario much easier to hit. In particular, you can widen race windows by adding an empty KFAIL_POINT_CODE() and then inducing a delay with sysctl my.failpoint="50%delay(1000)" (i.e., DELAY(1000) 50% of the times we hit that failpoint; the 50% can help avoid excessive foot-shooting).
Is this list a good candidate for epoch(9)-based reclamation? :-)
I suspect you cannot use epoch(9) in primary interrupt context (e.g. filters) just as you can't use regular mutexes there.
But a similar approach might work.
Where does epoch(9) take a regular mtx? I may just be missing it, but I see only spin mutexes.
(In reply to Conrad Meyer from comment #7)
I don't know epoch(9) / ck_epoch implementation details, but given the constraints[*] of interrupt handling, I think that it could be overkill. I mean, it probably has some avoidable overhead, and for latency-sensitive code like this that could matter.
[*] The interrupt handler list can be iterated only by a single thread / context at a time. Also, we reduce modifications to a single thread at a time using a mutex. Performance of adding and removing interrupt handlers is not critical.
(In reply to Andriy Gapon from comment #8)
Yes, performance of add/remove doesn't matter. But performance of intr_event_handle() matters very much, and adding a contested spin lock to it seems like a bad idea. I suppose it only needs to be taken for shared interrupts, not MSI, so maybe the impact is not so bad.
To be clear, we are not talking about adding a new spin mutex. We are talking about replacing the sched_lock spin mutex already used to schedule the ithread with a dedicated spin mutex that protects the list of handlers. That mutex can also serve as the thread lock when scheduling the ithread, since in this patch idle ithreads set td_lock to the new mutex. You can't avoid taking a mutex to call sched_add(), but this lets you change which lock is used for that, in essence.
Sure. It'd still be a single global spin lock, right?
No, it is kind of per-IRQ, not global.
Oh, that's much better.
I have created https://reviews.freebsd.org/D15905
It's a hybrid of the approach I mentioned in comment #2, what John described (comment #1 and others), and what the current code does.
A commit references this bug:
Date: Fri Aug 3 14:27:29 UTC 2018
New revision: 337255
safer wait-free iteration of shared interrupt handlers
The code that iterates a list of interrupt handlers for a (shared)
interrupt, whether in the ISR context or in the context of an interrupt
thread, does so in a lock-free fashion. Thus, the routines that modify
the list need to take special steps to ensure that the iterating code
has a consistent view of the list. Previously, those routines tried to
play nice only with the code running in the ithread context. The
iteration in the ISR context was left to chance.
After commit r336635, atomic operations and memory fences are used to
ensure that the ie_handlers list is always safe to navigate with respect to
inserting and removal of list elements.
There is still a question of when it is safe to actually free a removed element.
The idea of this change is somewhat similar to the idea of the epoch
based reclamation. There are some simplifications compared to the
general epoch based reclamation. All writers are serialized using a
mutex, so we do not need to worry about concurrent modifications. Also,
all read accesses from the open context are serialized too.
So, we can get away with just two epochs / phases. When a thread removes an
element it switches the global phase from the current phase to the other
and then drains the previous phase. Only after the draining the removed
element gets actually freed. The code that iterates the list in the ISR
context takes a snapshot of the global phase and then increments the use
count of that phase before iterating the list. The use count (in the
same phase) is decremented after the iteration. This ensures that
there is no iteration over the removed element by the time it actually
gets freed.
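The two-phase scheme described here might be sketched as follows. This is a hypothetical userspace model using C11 atomics, not the actual kernel code; the real implementation in sys/kern/kern_intr.c is serialized by a mutex on the writer side and uses additional fencing and re-checks to close the window between a reader loading the phase and bumping its use count:

```c
#include <assert.h>
#include <stdatomic.h>

/* Two epochs/phases, each with its own use count. */
static _Atomic int global_phase;	/* 0 or 1 */
static _Atomic int phase_users[2];

/* ISR-context reader: pin the current phase around the list iteration. */
static int
phase_enter(void)
{
	int phase;

	phase = atomic_load(&global_phase);
	atomic_fetch_add(&phase_users[phase], 1);
	return (phase);		/* caller passes this back to phase_exit() */
}

static void
phase_exit(int phase)
{
	atomic_fetch_sub(&phase_users[phase], 1);
}

/*
 * Writer (serialized by a mutex in the real code): flip the global phase,
 * then drain the old phase.  Once the old phase's use count reaches zero,
 * no ISR-context iterator can still be standing on the removed element,
 * so only then is it actually freed.
 */
static void
phase_flip_and_drain(void)
{
	int old;

	old = atomic_load(&global_phase);
	atomic_store(&global_phase, old ^ 1);
	while (atomic_load(&phase_users[old]) != 0)
		;		/* bounded spin in practice */
	/* free(removed_element) would be safe here */
}
```

New readers entering after the flip pin the new phase, so the drain loop only has to wait out iterations that were already in flight.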
This commit also simplifies the coordination with the interrupt thread
context. Now we always schedule the interrupt thread when removing one
of the handlers for its interrupt. This makes the code both simpler and
safer as the interrupt thread masks the interrupt thus ensuring that
there is no interaction with the ISR context.
P.S. This change matters only for shared interrupts and I realize that
those are becoming a thing of the past (and quickly). I also understand
that the problem that I am trying to solve is extremely rare.
Reviewed by: cem
Discussed with: Samy Al Bahra
MFC after: 5 weeks
Differential Revision: https://reviews.freebsd.org/D15905