Bug 46116

Summary: EHOSTDOWN when ARP replies get drowned by data following ARP pre-requests
Product: Base System Reporter: Diomidis Spinellis <dds>
Component: kernAssignee: orion
Status: Closed FIXED    
Severity: Affects Only Me CC: kouvelas
Priority: Normal    
Version: 4.7-RELEASE   
Hardware: Any   
OS: Any   

Description Diomidis Spinellis 2002-12-09 08:40:03 UTC
The change introduced by PR-25517
http://www.freebsd.org/cgi/query-pr.cgi?pr=25517
has arp_rtrequest (netinet/if_ether.c) start making ARP queries 
arpt_maxtries * arpt_down seconds before the ARP entry expiry time
in an effort to avoid the ARP entry's expiry.  In such a case the ARP reply
can get drowned by the actual data the host will send following the ARP
request.  The original change has the la_asked counter get incremented
after each pre-request.  As a result, when no reply is received
after all pre-requests are made, the host (unfairly) gets marked as down. 

As an example, a client playing MP3 music from an NFS server will pause 
its playback for 20 seconds (host_down_time) every 20 minutes (max_age).

Fix: The problem can be fixed in three different ways:
1. Remove the ARP pre-request code
2. Hold the packet in an mbuf when performing a pre-request
3. Do not mark a host that does not reply to ARP pre-requests as down

Solutions 1 and 2 can affect performance, so the third solution was adopted.
An additional counter la_preask now counts queries before the ARP entry expiry
time.  This counter is not used for marking a host down; the original
counter la_asked is now only used to count retries for expired ARP entries
and mark a host down if needed.  The fix also corrects the data type and
comment used for la_asked.

*** if_ether.c	2002/12/08 14:31:40	1.1
--- if_ether.c	2002/12/08 16:11:26
***************


The following packets demonstrate the fixed behavior:

19:05:20.022221 arp who-has client (88:2:8:1:3:f6) tell server
19:05:20.023796 server.nfs > client.1622712534: reply ok 1472 read (frag 10734:1480@0+)
19:05:20.025042 server > client: (frag 10734:1480@1480+)
// [...] no ARP reply
19:05:24.844843 arp who-has client (88:2:8:1:3:f6) tell server
// [...] no ARP reply
19:05:29.959532 arp who-has client (88:2:8:1:3:f6) tell server
// [...] no ARP reply
19:05:35.072951 arp who-has client (88:2:8:1:3:f6) tell server
// [...] no ARP reply
19:05:39.824958 arp who-has client (88:2:8:1:3:f6) tell server
// [...] no ARP reply

// ARP entry has expired; server now holds the packet and WAITS for a reply
19:05:43.988615 arp who-has client (88:2:8:1:3:f6) tell server
19:05:43.988966 arp reply client is-at 0:0:e8:32:81:e8

// Reply-received, communication resumes without interruption:
19:05:43.990042 server > client: (frag 10827:928@7400)
19:05:44.095164 client.1622712606 > server.nfs: 112 read [|nfs]
19:05:44.097995 server.nfs > client.1622712606: reply ok 1472 read (frag 10828:1480@0+)
19:05:44.099237 server > client: (frag 10828:1480@1480+)--1tYxvvFuBOQPwbx6IInMngCdnDafqg1Js31horQU2lUpXlOu
Content-Type: text/plain; name="file.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="file.diff"

*** 95,101 ****
  	LIST_ENTRY(llinfo_arp) la_le;
  	struct	rtentry *la_rt;
  	struct	mbuf *la_hold;		/* last packet until resolved/timeout */
! 	long	la_asked;		/* last time we QUERIED for this addr */
  #define la_timer la_rt->rt_rmx.rmx_expire /* deletion time in seconds */
  };
  
--- 95,102 ----
  	LIST_ENTRY(llinfo_arp) la_le;
  	struct	rtentry *la_rt;
  	struct	mbuf *la_hold;		/* last packet until resolved/timeout */
! 	int	la_preask;		/* #times we preQUERIED for this addr */
! 	int	la_asked;		/* #times we QUERIED for this addr */
  #define la_timer la_rt->rt_rmx.rmx_expire /* deletion time in seconds */
  };
  
***************
*** 438,450 ****
  		 * arpt_down interval.
  		 */
  		if ((rt->rt_expire != 0) &&
! 		    (time_second + (arp_maxtries - la->la_asked) * arpt_down >
  		     rt->rt_expire)) {
  			arprequest(ifp,
  				   &SIN(rt->rt_ifa->ifa_addr)->sin_addr,
  				   &SIN(dst)->sin_addr,
  				   IF_LLADDR(ifp));
! 			la->la_asked++;
  		} 
  
  		bcopy(LLADDR(sdl), desten, sdl->sdl_alen);
--- 439,451 ----
  		 * arpt_down interval.
  		 */
  		if ((rt->rt_expire != 0) &&
! 		    (time_second + (arp_maxtries - la->la_preask) * arpt_down >
  		     rt->rt_expire)) {
  			arprequest(ifp,
  				   &SIN(rt->rt_ifa->ifa_addr)->sin_addr,
  				   &SIN(dst)->sin_addr,
  				   IF_LLADDR(ifp));
! 			la->la_preask++;
  		} 
  
  		bcopy(LLADDR(sdl), desten, sdl->sdl_alen);
***************
*** 480,486 ****
  			else {
  				rt->rt_flags |= RTF_REJECT;
  				rt->rt_expire += arpt_down;
! 				la->la_asked = 0;
  			}
  
  		}
--- 481,487 ----
  			else {
  				rt->rt_flags |= RTF_REJECT;
  				rt->rt_expire += arpt_down;
! 				la->la_preask = la->la_asked = 0;
  			}
  
  		}
***************
*** 734,740 ****
  		if (rt->rt_expire)
  			rt->rt_expire = time_second + arpt_keep;
  		rt->rt_flags &= ~RTF_REJECT;
! 		la->la_asked = 0;
  		if (la->la_hold) {
  			(*ifp->if_output)(ifp, la->la_hold,
  				rt_key(rt), rt);
--- 735,741 ----
  		if (rt->rt_expire)
  			rt->rt_expire = time_second + arpt_keep;
  		rt->rt_flags &= ~RTF_REJECT;
! 		la->la_preask = la->la_asked = 0;
  		if (la->la_hold) {
  			(*ifp->if_output)(ifp, la->la_hold,
  				rt_key(rt), rt);
***************
*** 858,864 ****
  	if (rt->rt_refcnt > 0 && (sdl = SDL(rt->rt_gateway)) &&
  	    sdl->sdl_family == AF_LINK) {
  		sdl->sdl_alen = 0;
! 		la->la_asked = 0;
  		rt->rt_flags &= ~RTF_REJECT;
  		return;
  	}
--- 859,865 ----
  	if (rt->rt_refcnt > 0 && (sdl = SDL(rt->rt_gateway)) &&
  	    sdl->sdl_family == AF_LINK) {
  		sdl->sdl_alen = 0;
! 		la->la_preask = la->la_asked = 0;
  		rt->rt_flags &= ~RTF_REJECT;
  		return;
  	}
How-To-Repeat: To avoid waiting for 20 minutes and make the packets more observable set:

sysctl net.link.ether.inet.max_age=60
sysctl net.link.ether.inet.host_down_time=5

On a 10Mb/s ethernet have a 150MHz Pentium client running FreeBSD play
192Mb/s MP3 data from an NFS-mounted server file using mpg123.  Monitor
the packet exchange from a third host and install a kernel breakpoint in
in_arpinput to verify that the client does send ARP replies.

The problem will appear as follows:

16:05:02.953682 client.1622700147 > server.nfs: 112 read [|nfs]
16:05:02.955113 arp who-has client (88:2:8:1:3:f6) tell server
// Server drowns the client's reply with data it immediately sends
16:05:02.956720 server.nfs > client.1622700147: reply ok 1472 read (frag 54854:1480@0+)
16:05:02.957972 server > client: (frag 54854:1480@1480+)
// [...] no ARP reply
16:05:08.141564 arp who-has client (88:2:8:1:3:f6) tell server
// [...] no ARP reply
16:05:12.891851 arp who-has client (88:2:8:1:3:f6) tell server
// [...] no ARP reply
16:05:18.006188 arp who-has client (88:2:8:1:3:f6) tell server
// [...] no ARP reply
16:05:23.121545 arp who-has client (88:2:8:1:3:f6) tell server
// [...] no ARP reply
// ARP entry expires and arpt_maxtries is reached
// server logs /kernel: nfsd send error 64

// client stops getting replies from the server:
16:05:26.964593 client.1622700218 > server.nfs: 96 access [|nfs]
16:05:27.014233 client.1622700218 > server.nfs: 96 access [|nfs]
16:05:27.104235 client.1622700218 > server.nfs: 96 access [|nfs]
16:05:27.213220 client.1622700219 > server.nfs: 112 read [|nfs]
16:05:27.274168 client.1622700218 > server.nfs: 96 access [|nfs]
16:05:27.304173 client.1622700219 > server.nfs: 112 read [|nfs]
16:05:27.494200 client.1622700219 > server.nfs: 112 read [|nfs]
16:05:27.604169 client.1622700218 > server.nfs: 96 access [|nfs]
16:05:27.864192 client.1622700219 > server.nfs: 112 read [|nfs]
16:05:28.254128 client.1622700218 > server.nfs: 96 access [|nfs]
16:05:28.594132 client.1622700219 > server.nfs: 112 read [|nfs]
16:05:29.544090 client.1622700218 > server.nfs: 96 access [|nfs]
16:05:30.044069 client.1622700219 > server.nfs: 112 read [|nfs]
16:05:32.113932 client.1622700218 > server.nfs: 96 access [|nfs]
16:05:32.933903 client.1622700219 > server.nfs: 112 read [|nfs]

// After host_down seconds the ARP request-reply succeeds,
// the serves sends the - now mbuf-held - packet, and communication resumes:
16:05:32.935318 arp who-has client (88:2:8:1:3:f6) tell server
16:05:32.935651 arp reply client is-at 0:0:e8:32:81:e8
16:05:32.936735 server > client: (frag 54997:928@7400)
16:05:37.243660 client.1622700218 > server.nfs: 96 access [|nfs]
16:05:37.244381 server.nfs > client.1622700218: reply ok 120 access c 60b870bc
16:05:37.245177 client.1622700220 > server.nfs: 96 access [|nfs]
16:05:37.245883 server.nfs > client.1622700220: reply ok 120 access c e523ff20
16:05:38.703607 client.1622700219 > server.nfs: 112 read [|nfs]
16:05:38.707055 server.nfs > client.1622700219: reply ok 1472 read (frag 55020:1480@0+)
16:05:38.708308 server > client: (frag 55020:1480@1480+)
16:05:38.709566 server > client: (frag 55020:1480@2960+)
16:05:38.710804 server > client: (frag 55020:1480@4440+)

Repeating the above procedure with a breakpoint in in_arpinput will 
prove that the client does indeed send a reply, bu will not
result in a problem, since the network is clear by the time you type
continue in the kernel debugger.
Comment 1 orion freebsd_committer freebsd_triage 2002-12-09 16:34:42 UTC
Responsible Changed
From-To: freebsd-bugs->orion

Assignment to responsible party.
Comment 2 orion freebsd_committer freebsd_triage 2002-12-13 00:27:54 UTC
State Changed
From-To: open->closed

Patch applied with minor edits.  Thanks!