Bug 229106 - intr_event_handle is unsafe with respect to interrupt handler list
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Andriy Gapon
Depends on:
Reported: 2018-06-18 11:36 UTC by Andriy Gapon
Modified: 2019-07-18 11:15 UTC
CC: 4 users


code changes to greatly increase likelihood of the race (3.34 KB, patch)
2018-06-18 11:36 UTC, Andriy Gapon

Description Andriy Gapon freebsd_committer 2018-06-18 11:36:38 UTC
Created attachment 194354 [details]
code changes to greatly increase likelihood of the race

I must state upfront that I discovered the issue through code review and I had to make special arrangements to provoke the problem.
The core of the issue is that intr_event_handle iterates the list of handlers, ie_handlers, without any protection whatsoever.  Also, removal and installation of a filter-only handler makes no attempt to synchronize with intr_event_handle.

As such, it is possible (although very improbable) that intr_event_handle may iterate to an element just before it is removed and then dereference its pointer to the next element after that element has been freed and the pointer has been overwritten.

This problem affects only shared interrupts.  When an interrupt is not shared, it should be disabled before its handler is torn down.

Here is a stack trace of the crash:
fault virtual address   = 0xffffffffffffffff
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80b64ff0
stack pointer           = 0x28:0xfffffe0000434970
frame pointer           = 0x28:0xfffffe00004349b0
code segment            = base rx0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 11 (idle: cpu2)
trap number             = 12
panic: page fault
cpuid = 2
time = 1529319165
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000434630
vpanic() at vpanic+0x1a3/frame 0xfffffe0000434690
panic() at panic+0x43/frame 0xfffffe00004346f0
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe0000434740
trap_pfault() at trap_pfault+0x62/frame 0xfffffe0000434790
trap() at trap+0x2ba/frame 0xfffffe00004348a0
calltrap() at calltrap+0x8/frame 0xfffffe00004348a0
--- trap 0xc, rip = 0xffffffff80b64ff0, rsp = 0xfffffe0000434970, rbp = 0xfffffe00004349b0 ---
intr_event_handle() at intr_event_handle+0xa0/frame 0xfffffe00004349b0
intr_execute_handlers() at intr_execute_handlers+0x58/frame 0xfffffe00004349e0
lapic_handle_intr() at lapic_handle_intr+0x6d/frame 0xfffffe0000434a20
Xapic_isr1() at Xapic_isr1+0xd0/frame 0xfffffe0000434a20
--- interrupt, rip = 0xffffffff80bd3b49, rsp = 0xfffffe0000434af0, rbp = 0xfffffe0000434bb0 ---
sched_idletd() at sched_idletd+0x4a9/frame 0xfffffe0000434bb0
fork_exit() at fork_exit+0x84/frame 0xfffffe0000434bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000434bf0

This is what I did to get the crash.
1. Found hardware with shared interrupts.  Specifically, I had three USB OHCI controllers sharing PCI IRQ 18.
2. Modified the ohci driver so that it installed a dummy filter instead of its ithread handler.  This made the driver non-functional, of course.
3. Modified the IO-APIC code so that it kept re-raising the interrupt, thus increasing the chances of hitting the race within a reasonable time frame.
4. Re-compiled kern_intr.c with QUEUE_MACRO_DEBUG_TRASH to make the race more probable by immediately corrupting a removed handler.
5. Triggered an interrupt storm for IRQ 18.
6. Ran a continuous loop of devctl detach followed by devctl attach for the ohci driver instances sharing the interrupt.

All the code modifications are in the attachment.
The devctl command line was:
    while true ; do devctl detach ohci3 && devctl attach pci0:0:19:1 ; devctl detach ohci4 && devctl attach pci0:0:20:5 ; done

The rate of interrupts was about 570K per second:
    569k ohci2 ohci

The stack trace in kgdb:
(kgdb) bt
#0  __curthread () at ./machine/pcpu.h:231
#1  doadump (textdump=1) at /usr/devel/svn/head/sys/kern/kern_shutdown.c:366
#2  0xffffffff80ba33e2 in kern_reboot (howto=260) at /usr/devel/svn/head/sys/kern/kern_shutdown.c:446
#3  0xffffffff80ba39c3 in vpanic (fmt=<optimized out>, ap=0xfffffe00004346d0) at /usr/devel/svn/head/sys/kern/kern_shutdown.c:863
#4  0xffffffff80ba3a13 in panic (fmt=<unavailable>) at /usr/devel/svn/head/sys/kern/kern_shutdown.c:790
#5  0xffffffff8107c6ff in trap_fatal (frame=0xfffffe00004348b0, eva=18446744073709551615) at /usr/devel/svn/head/sys/amd64/amd64/trap.c:892
#6  0xffffffff8107c772 in trap_pfault (frame=0xfffffe00004348b0, usermode=<optimized out>) at /usr/devel/svn/head/sys/amd64/amd64/trap.c:728
#7  0xffffffff8107bd7a in trap (frame=0xfffffe00004348b0) at /usr/devel/svn/head/sys/amd64/amd64/trap.c:427
#8  <signal handler called>
#9  intr_event_handle (ie=0xfffff80003349300, frame=0xfffffe0000434a30) at /usr/devel/svn/head/sys/kern/kern_intr.c:1180
#10 0xffffffff811f2118 in intr_execute_handlers (isrc=0xfffff800033845b0, frame=0xfffffe0000434a30) at /usr/devel/svn/head/sys/x86/x86/intr_machdep.c:285
#11 0xffffffff811f841d in lapic_handle_intr (vector=49, frame=0xfffffe0000434a30) at /usr/devel/svn/head/sys/x86/x86/local_apic.c:1270
#12 <signal handler called>
#13 sched_idletd (dummy=<optimized out>) at /usr/devel/svn/head/sys/kern/sched_ule.c:2803
#14 0xffffffff80b62204 in fork_exit (callout=0xffffffff80bd36a0 <sched_idletd>, arg=0x0, frame=0xfffffe0000434c00) at /usr/devel/svn/head/sys/kern/kern_fork.c:1039
Comment 1 John Baldwin freebsd_committer freebsd_triage 2018-06-18 14:35:42 UTC
This is an old race.  I have a very old p4 branch that fixes this by dedicating a spin lock to protecting the list of handlers (and then using that lock as the thread_lock of idle interrupt threads, so that the number of spin locks taken for an interrupt remains the same).
Comment 2 Andriy Gapon freebsd_committer 2018-06-18 14:40:34 UTC
Yes, it's an old one.
I think that using a spinlock is perfectly fine.
I also have a work in progress that tries to solve the problem with some atomic magic so that intr_event_handle is completely wait/spin free (if the interrupt thread does not need to run, of course).  I guess that the difference between those two approaches would be negligible.
Comment 3 Conrad Meyer freebsd_committer 2018-06-18 15:37:00 UTC
Just FYI for next time: if you're making code changes, fail_point(9) can make this kind of scenario much easier to hit.  In particular, you can widen races by adding an empty KFAIL_POINT_CODE() and then inducing a delay with sysctl my.failpoint="50%delay(1000)" (i.e., DELAY(1000) 50% of the times we hit that fail point; the 50% can help avoid excessive foot-shooting).
Comment 4 Conrad Meyer freebsd_committer 2018-06-18 15:38:37 UTC
Is this list a good candidate for epoch(9)-based reclamation? :-)
Comment 5 John Baldwin freebsd_committer freebsd_triage 2018-06-18 17:10:14 UTC
I suspect you cannot use epoch(9) in primary interrupt context (e.g. filters) just as you can't use regular mutexes there.
Comment 6 Andriy Gapon freebsd_committer 2018-06-18 17:12:01 UTC
But a similar approach might work.
Comment 7 Conrad Meyer freebsd_committer 2018-06-18 17:43:22 UTC
Where does epoch(9) take a regular mtx?  I may just be missing it, but I see only spin mutexes.
Comment 8 Andriy Gapon freebsd_committer 2018-06-18 19:46:12 UTC
(In reply to Conrad Meyer from comment #7)
I don't know the epoch(9) / ck_epoch implementation details, but given the constraints[*] of interrupt handling, I think that it could be overkill.  I mean, it probably has some avoidable overhead, and for latency-sensitive code like this that could matter.

[*] The interrupt handler list can be iterated only by a single thread / context at a time.  Also, modifications are already reduced to a single thread at a time using a mutex.  Performance of adding and removing interrupt handlers is not critical.
Comment 9 Conrad Meyer freebsd_committer 2018-06-18 20:11:45 UTC
(In reply to Andriy Gapon from comment #8)
Yes, the performance of add/remove doesn't matter.  But the performance of intr_event_handle() matters very much, and adding a contested spin lock to it seems like a bad idea.  I suppose it only needs to be taken for shared interrupts, not MSI, so maybe the impact is not so bad.
Comment 10 John Baldwin freebsd_committer freebsd_triage 2018-06-18 22:56:10 UTC
To be clear, we are not talking about adding a spin mutex.  We are talking about replacing the existing sched_lock mutex, which is used to schedule the ithread anyway, with a dedicated spin mutex that protects the list of handlers.  That mutex can also serve as the thread lock when scheduling the ithread, since ithreads are changed in this patch to set td_lock to the new mutex when they are idle.  You can't get away from taking a mutex to call sched_add(), but this lets you change, in essence, which lock is used for that.
Comment 11 Conrad Meyer freebsd_committer 2018-06-18 23:20:12 UTC
Sure.  It'd still be a single global spin lock, right?
Comment 12 John Baldwin freebsd_committer freebsd_triage 2018-06-19 00:47:11 UTC
No, it is kind of per-IRQ, not global.
Comment 13 Conrad Meyer freebsd_committer 2018-06-19 01:48:10 UTC
Oh, that's much better.
Comment 14 Andriy Gapon freebsd_committer 2018-06-19 15:27:36 UTC
I have created https://reviews.freebsd.org/D15905
It's a hybrid of the approach I mentioned in comment #2, what John mentioned (comment #1 and others), and what the current code has.
Comment 15 commit-hook freebsd_committer 2018-08-03 14:27:40 UTC
A commit references this bug:

Author: avg
Date: Fri Aug  3 14:27:29 UTC 2018
New revision: 337255
URL: https://svnweb.freebsd.org/changeset/base/337255

  safer wait-free iteration of shared interrupt handlers

  The code that iterates a list of interrupt handlers for a (shared)
  interrupt, whether in the ISR context or in the context of an interrupt
  thread, does so in a lock-free fashion.   Thus, the routines that modify
  the list need to take special steps to ensure that the iterating code
  has a consistent view of the list.  Previously, those routines tried to
  play nice only with the code running in the ithread context.  The
  iteration in the ISR context was left to a chance.

  After commit r336635, atomic operations and memory fences are used to
  ensure that the ie_handlers list is always safe to navigate with
  respect to insertion and removal of list elements.

  There is still a question of when it is safe to actually free a removed
  element.

  The idea of this change is somewhat similar to the idea of the epoch
  based reclamation.  There are some simplifications comparing to the
  general epoch based reclamation.  All writers are serialized using a
  mutex, so we do not need to worry about concurrent modifications.  Also,
  all read accesses from the open context are serialized too.

  So, we can get away with just two epochs / phases.  When a thread
  removes an element, it switches the global phase from the current phase
  to the other one and then drains the previous phase.  Only after the
  draining does the removed element actually get freed.  The code that
  iterates the list in the ISR context takes a snapshot of the global
  phase and then increments the use count of that phase before iterating
  the list.  The use count (in the same phase) is decremented after the
  iteration.  This should ensure that there is no iteration over the
  removed element when it actually gets freed.
  This commit also simplifies the coordination with the interrupt thread
  context.  Now we always schedule the interrupt thread when removing one
  of handlers for its interrupt.  This makes the code both simpler and
  safer as the interrupt thread masks the interrupt thus ensuring that
  there is no interaction with the ISR context.

  P.S.  This change matters only for shared interrupts and I realize that
  those are becoming a thing of the past (and quickly).  I also understand
  that the problem that I am trying to solve is extremely rare.

  PR:		229106
  Reviewed by:	cem
  Discussed with:	Samy Al Bahra
  MFC after:	5 weeks
  Differential Revision: https://reviews.freebsd.org/D15905