Bug 200493 - Killing pid 11 (idle) wedges the disk IO
Summary: Killing pid 11 (idle) wedges the disk IO
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-05-28 14:25 UTC by Edward Tomasz Napierala
Modified: 2015-11-24 15:13 UTC (History)
2 users (show)

See Also:


Attachments
Do not bump idle thread priority (560 bytes, patch)
2015-05-29 09:58 UTC, Konstantin Belousov
no flags Details | Diff
Do not bump idle thread priority (883 bytes, patch)
2015-05-29 10:01 UTC, Konstantin Belousov
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Edward Tomasz Napierala freebsd_committer freebsd_triage 2015-05-28 14:25:07 UTC
Killing pid 11, as root, like this:

# kill -KILL 11

... results in wedged io, until reboot - "diskinfo -t /dev/ada0" shows access times from 0.1 msec immediately go to 44 msec.  Pid 11 is "idle".
Comment 1 Konstantin Belousov freebsd_committer freebsd_triage 2015-05-28 15:10:31 UTC
(In reply to Edward Tomasz Napierala from comment #0)
It is very _weird_ bug report from developer.  You did not invesigated what is the state of the idle (?) process after the kill, obviously pid value does not matter since kernel starts variable number of the kprocs during the boot (on my kernel idle has pid 10). And last and most important, kernel processes do not process signals: there is no place where cursig()/postsig() pair is called, since there is no return from kernel to user mode, and no ast handler called.

E.g. after I do kill -9 10 (pid 10 is my idle process), I see
sandy% sudo procstat -i 10
  PID COMM             SIG     FLAGS
  10 idle             KILL     P--
I.e. SIGKILL was put into the queue, but nothing processed it.  And I do not observe any weirdness in the system behaviour afterward.

If your system consumed the SIGKILL, there should be some code which called postsig() in the context of the idle threads.

FYI, the idle loop is sched_idletd(), private for the given scheduler implementation.
Comment 2 Edward Tomasz Napierala freebsd_committer freebsd_triage 2015-05-28 17:49:01 UTC
The report is weird, because the bug is.  I have no idea what could be causing this, and how to approach debugging it.  More data points: the slowdown is there also for md(4).  You can make things go faster by holding "enter" at the terminal; basically try this:

kill -KILL 11
mdconfig -s1g
while :; do dd if=/dev/md0 of=/dev/null bs=1m count=256; done

and then see what happens when you hold down the enter key, and otherwise.
Comment 3 Konstantin Belousov freebsd_committer freebsd_triage 2015-05-28 18:12:26 UTC
(In reply to Edward Tomasz Napierala from comment #2)
No, bug report is weird because it does not contain even an attempt to get an overview of the system state after it enters the un-understandable condition.  ps auxwwH, procstat, looking for the threads eating 100% cpus, stopped clocks, whatever.

As I said, I killed my idle process and did not see any change in the system behavior.

The 'enter' key ticking some processing might indicate problems with stopped event clocks or clock interrupts.  E.g. network interrupts coming in could have similar effect ?

Anyway, collect as much as possible of the overview information for the system before and after the killing, and see what changed.
Comment 4 Edward Tomasz Napierala freebsd_committer freebsd_triage 2015-05-29 09:38:05 UTC
The reason I didn't provide more information is that I believed it to be trivially reproducible - and also because I can't tell much about it.  That is, everything looks normal: the system is almost idle, vmstat/iostat/top etc don't show anything weird...  It's just that "diskinfo -t" shows values orders of magnitude worse than usual.

Actually, I've noticed something: idle threads priority.  Before:

# procstat -t 11
  PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN    
   11 100003 idle             idle: cpu0        -1  255 run     -         
   11 100004 idle             idle: cpu1         1  255 run     -  

# kill -KILL 11
[root@brick:/home/trasz]# procstat -t 11
  PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN    
   11 100003 idle             idle: cpu0         0  120 run     -         
   11 100004 idle             idle: cpu1        -1  255 run     -      

Bumping the idle thread's priority would explain what's happening.  But why does it get bumped?
Comment 5 Konstantin Belousov freebsd_committer freebsd_triage 2015-05-29 09:55:03 UTC
(In reply to Edward Tomasz Napierala from comment #4)
Great, you nailed it down.

Look at the tdsigwakeup(), which is called from the main code to send a signal to a thread tdsendsignal():

	/*
	 * Bring the priority of a thread up if we want it to get
	 * killed in this lifetime.
	 */
	if (action == SIG_DFL && (prop & SA_KILL) && td->td_priority > PUSER)
		sched_prio(td, PUSER);
Comment 6 Konstantin Belousov freebsd_committer freebsd_triage 2015-05-29 09:58:08 UTC
Created attachment 157247 [details]
Do not bump idle thread priority
Comment 7 Konstantin Belousov freebsd_committer freebsd_triage 2015-05-29 10:01:34 UTC
Created attachment 157248 [details]
Do not bump idle thread priority

Slight improvement, add a safety brake for the unreachable code.
Comment 8 Konstantin Belousov freebsd_committer freebsd_triage 2015-05-29 10:35:24 UTC
(In reply to Konstantin Belousov from comment #7)
And to express my opinion on the whole stuff.

The signal delivery to the kernel threads must be opt-in feature.  Kernel thread should explicitely declare the ability to handle signals directed to it.  E.g., nfsd threads check for signal as an indication of exit request.

Most threads do not handle signals at all, and queuing the signal to them causes odd side-effects.  Most innocent consequence is the memory leak due to queued ksiginfo, which is never deleted from the sigqueue.  Or, the priority bump for the idle thread, as you discovered.

Code to prevent even queuing signals to the kernel threads is trivial, but it requires careful examination of each call to kproc/kthread creation to decide should the signalling be allowed.  The patch I provided IMO is good stop-gap measure which fixes the immediate ugly case.
Comment 9 Ed Maste freebsd_committer freebsd_triage 2015-05-29 15:36:06 UTC
The patch looks good to me as an interim solution to avoid the bizarre behaviour. I think we should put a comment explaining it though (unless there's a plan to implement the opt-in signal delivery in the short term).
Comment 10 Edward Tomasz Napierala freebsd_committer freebsd_triage 2015-05-29 16:07:06 UTC
The patch fixes the problem, thanks!  I agree about the long-term strategy too.
Comment 11 commit-hook freebsd_committer freebsd_triage 2015-05-29 16:27:05 UTC
A commit references this bug:

Author: kib
Date: Fri May 29 16:26:08 UTC 2015
New revision: 283745
URL: https://svnweb.freebsd.org/changeset/base/283745

Log:
  When delivering a signal with default disposition to the thread,
  tdsigwakeup() increases the priority of the low-priority threads, to
  give them a chance to be terminated timely.  Also, kernel allows user
  to signal kernel processes.  The combined effect is that signalling
  idle process bump a priority of the selected delivery thread, which
  starts eating CPU.

  Check for the delivery thread be an idle thread and do not raise its
  priority then.

  The signal delivery to the kernel threads must be opt-in feature.
  Kernel thread should explicitely declare the ability to handle signals
  directed to it.  E.g., nfsd threads check for signal as an indication
  of exit request.

  Most threads do not handle signals at all, and queuing the signal to
  them causes odd side-effects.  Most innocent consequence is the memory
  leak due to queued ksiginfo, which is never deleted from the sigqueue.
  Code to prevent even queuing signals to the kernel threads is trivial,
  but it requires careful examination of each call to kproc/kthread
  creation to decide should the signalling be allowed.  The commit is a
  stop-gap measure which fixes the immediate case for now.

  PR:	200493
  Reported and tested by:	trasz
  Discussed with:	trasz, emaste
  Sponsored by:	The FreeBSD Foundation
  MFC after:	1 week

Changes:
  head/sys/kern/kern_sig.c