158802 – amd(8) ICMP storm and unkillable process.

Bug 158802 - amd(8) ICMP storm and unkillable process.

Summary: amd(8) ICMP storm and unkillable process.

Status:	Closed Unable to Reproduce

Alias:	None

Product:	Base System
Classification:	Unclassified
Component:	kern (show other bugs)
Version:	Unspecified
Hardware:	Any Any

Importance:	Normal Affects Only Me
Assignee:	freebsd-fs (Nobody)

URL:
Keywords:

Depends on:
Blocks:

Reported:	2011-07-11 15:00 UTC by Martin Birgmeier
Modified:	2014-08-27 18:39 UTC (History)
CC List:	3 users (show)

See Also:

Attachments
file.txt (36.52 KB, text/plain) 2011-07-11 15:00 UTC, Martin Birgmeier	no flags	Details
x2.txt (66.12 KB, text/plain) 2011-07-11 16:04 UTC, Martin Birgmeier	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Martin Birgmeier 2011-07-11 15:00:20 UTC

From time to time, when using the automount daemon amd(8), both the daemon and another process will get stuck in a tight loop and become (nearly) unkillable. In the log, the following message is printed:

Jul 11 14:45:26 testhost kernel: Limiting icmp unreach response from 24424 to 200 packets/sec
Jul 11 14:45:27 testhost kernel: Limiting icmp unreach response from 24218 to 200 packets/sec
Jul 11 14:45:28 testhost kernel: Limiting icmp unreach response from 24343 to 200 packets/sec
Jul 11 14:45:29 testhost kernel: Limiting icmp unreach response from 18753 to 200 packets/sec
Jul 11 14:45:30 testhost kernel: Limiting icmp unreach response from 19173 to 200 packets/sec

I managed to get a tcpdump of lo0 during this condition which is included below (in lieu of the patch file). For this, I set sysctl net.inet.icmp.icmplim=20000, making the above kernel log messages go away. In the log, one would still from time to time see messages like the following:

Jul 11 14:48:25 testhost kernel: nfs server pid2251@atpcdvvc:/users: not responding
Jul 11 14:48:29 testhost kernel: nfs server pid2251@atpcdvvc:/users: not responding
Jul 11 14:48:29 testhost kernel: nfs server pid2251@atpcdvvc:/users: is alive again
Jul 11 14:48:43 testhost kernel: nfs server pid2251@atpcdvvc:/users: not responding
Jul 11 14:48:43 testhost kernel: nfs server pid2251@atpcdvvc:/users: not responding
Jul 11 14:48:43 testhost kernel: nfs server pid2251@atpcdvvc:/d/auto: not responding
Jul 11 14:48:57 testhost kernel: nfs server pid2251@atpcdvvc:/users: not responding
Jul 11 14:49:01 testhost kernel: nfs server pid2251@atpcdvvc:/users: not responding
Jul 11 14:49:14 testhost kernel: nfs server pid2251@atpcdvvc:/d/auto: not responding
Jul 11 14:49:15 testhost kernel: nfs server pid2251@atpcdvvc:/users: not responding
Jul 11 14:49:15 testhost kernel: nfs server pid2251@atpcdvvc:/users: not responding
Jul 11 14:49:15 testhost kernel: nfs server pid2251@atpcdvvc:/users: is alive again

One more point: I managed to "/etc/rc.d/amd onestop" amd during one such excursion, which cleared the storm and brought down my KDE session but otherwise left the system running. After logging in again, the erroneous behavior started nearly immediately again.

(The following is copied from a mailing list mail to which I replied.)

On 07/06/11 13:50, Martin Birgmeier wrote:
> Hi Artem,
>
> I have exactly the same problem as you are describing below, also with quite
> a number of amd mounts.
>
> In addition to the scenario you describe, another way this happens here
> is when downloading a file via firefox to a directory currently open in
> dolphin (KDE file manager). This will almost surely trigger the symptoms
> you describe.
>
> I've had 7.4 running on the box before, now with 8.2 this has started to happen.
>
> Alas, I don't have a solution.
>
> We should probably file a PR, but I don't even know where to assign it to.
> Amd does not seem much maintained, it's probably using some old-style
> mounts (it never mounts anything via IPv6, for example).
>
> Regards,
>
> Martin
>
> > Hi,
> >
> > I wonder if someone else ran into this issue before and, maybe, have a solution.
> >
> > I've been running into a problem where access to filesystems mouted
> > with amd wedges processes in an unkillable state and produces ICMP
> > storm on loopback interface.I've managed to narrow down to NFS
> > reconnect, but that's when I ran out of ideas.
> >
> > Usually the problem happens when I abort a parallel build job in an
> > i386 jail on FreeBSD-8/amd64 (r223055). When the build job is killed
> > now and then I end up with one process consuming 100% of CPU time on
> > one of the cores. At the same time I get a lot of messages on the
> > console saying "Limiting icmp unreach response from 49837 to 200
> > packets/sec" and the loopback traffic goes way up.
> >
> > As far as I can tell here's what's happening:
> >
> > * My setup uses a lot of filesystems mounted by amd.
> > * amd itself pretends to be an NFS server running on the localhost and
> > serving requests for amd mounts.
> > * Now and then amd seems to change the ports it uses. Beats me why.
> > * the problem seems to happen when some process is about to access amd
> > mountpoint when amd instance disappears from the port it used to
> > listen on. In my case it does correlate with interrupted builds, but I
> > have no clue why.
> > * NFS client detects disconnect and tries to reconnect using the same
> > destination port.
> > * That generates ICMP response as port is unreachable and it reconnect
> > call returns almost immediatelly.
> > * We try to reconnect again, and again, and again....
> > * the process in this state is unkillable
> >
> > Here's what the stack of the 'stuck' process looks like in those rare
> > moments when it gets to sleep:
> > 18779 100511 collect2         -                mi_switch+0x176
> > turnstile_wait+0x1cb _mtx_lock_sleep+0xe1 sleepq_catch_signals+0x386
> > sleepq_timedwait_sig+0x19 _sleep+0x1b1 clnt_dg_call+0x7e6
> > clnt_reconnect_call+0x12e nfs_request+0x212 nfs_getattr+0x2e4
> > VOP_GETATTR_APV+0x44 nfs_bioread+0x42a VOP_READLINK_APV+0x4a
> > namei+0x4f9 kern_statat_vnhook+0x92 kern_statat+0x15
> > freebsd32_stat+0x2e syscallenter+0x23d
> >
> > * Usually some timeout expires in few minutes, the process dies, ICMP
> > storm stops and the system is usable again.
> > * On occasion the process is stuck forever and I have to reboot the box.
> >
> > I'm not sure who's to blame here.
> >
> > Is the automounter at fault for disappearing from the port it was
> > supposed to listen to?
> > If NFS guilty of trying blindly to reconnect on the same port and not
> > giving up sooner?
> > Should I flog the operator (ALA myself) for misconfiguring something
> > (what?) in amd or NFS?
> >
> > More importantly -- how do I fix it?
> > Any suggestions on fixing/debugging this issue?
> >
> > --Artem

Fix: Patch attached with submission follows:
How-To-Repeat: See above for a way to sometimes trigger the problem.

Comment 1 Martin Birgmeier 2011-07-11 16:04:13 UTC

I noticed that I cut the tcpdump output such that it does not include a 
single ICMP reply. Therefore, here another edit of this output.

There seem to be requests to amd at a rate of about once per 20 
microseconds. Assuming that amd listens on port "exp2" (1022 acc. to 
/etc/services), it seems to send replies back to the wrong port. This 
seems to be because for some reason, the NFS client (of amd) keeps 
sending away with ever changing new port numbers, and by the time amd 
replies it's already waiting for a reply at a completely different port 
number.

Regards,

Martin

p.s. Now, after a reboot, I have the following "lsof -p" for amd:

COMMAND  PID USER   FD   TYPE     DEVICE SIZE/OFF    NODE NAME
amd     2234 root  cwd   VDIR       0,89     1024       2 /
amd     2234 root  rtd   VDIR       0,89     1024       2 /
amd     2234 root  txt   VREG       0,91   162212 4182086 /usr/sbin/amd
amd     2234 root  txt   VREG       0,89   219900   11614 
/libexec/ld-elf.so.1
amd     2234 root  txt   VREG       0,91    28268 2062376 
/usr/lib/libwrap.so.6
amd     2234 root  txt   VREG       0,89  1151844   57937 /lib/libc.so.7
amd     2234 root    0u  VCHR       0,27      0t0      27 /dev/null
amd     2234 root    1u  VCHR       0,27      0t0      27 /dev/null
amd     2234 root    2u  VCHR       0,27      0t0      27 /dev/null
amd     2234 root    3u  unix 0xc7282000      0t0         ->0xc5cf4000
amd     2234 root    4u  IPv4 0xc5cf7570      0t0     UDP *:1023
amd     2234 root    5u  IPv4 0xc5cf7560      0t0     UDP *:exp2
amd     2234 root    6u  VBAD                             (revoked)
amd     2234 root    7u  IPv4 0xc60039e0      0t0     TCP *:882 (LISTEN)
amd     2234 root    8u  IPv4 0xc5cf7550      0t0     UDP *:sco-dtmgr
amd     2234 root    9u  IPv4 0xc5cf7530      0t0     UDP *:netconf-beep

So it seems to be listening on exp2 (= 1022) after all.

Comment 2 Mark Linimon freebsd_committer

2011-07-12 23:13:58 UTC

Responsible Changed
From-To: freebsd-bugs->freebsd-fs

assign.

Comment 3 Martin Birgmeier 2014-08-15 07:45:39 UTC

Please close this PR. It does not happen any more on 9.2 and 10.0.

-- Martin

Comment 4 John Baldwin freebsd_committer

2014-08-27 18:39:23 UTC

Submitter reports issue no longer occurs with 9.2 and 10.0.