Bug 75122 - [netinet] [patch] Incorrect inflight bandwidth calculation on first packet
Summary: [netinet] [patch] Incorrect inflight bandwidth calculation on first packet
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: 5.3-STABLE
Hardware: Any Any
: Normal Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2004-12-15 18:30 UTC by Dan Nelson
Modified: 2015-11-14 02:20 UTC (History)
2 users (show)

See Also:


Attachments
file.diff (3.06 KB, patch)
2004-12-15 18:30 UTC, Dan Nelson
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dan Nelson 2004-12-15 18:30:31 UTC
Matt, I'm CC'ing you because it looks like the bug is also in
Dragonfly, so I think you'll want to commit something similar.

The inflight window scaling algorithm keeps a decaying average of a
socket's bandwidth, but on the first call to tcp_xmit_bandwidth_limit,
the sequence number of the previous packet is not known, so the
(ack_seq - tp->t_bw_rtseq) clause just gives you the raw sequence
number, resulting in a random value for the calculated bandwidth.

Until enough packets have traveled so the average has decayed to the
correct value, the calculated window is large enough that it's not even
used.  On a dialup, for example, it never gets a chance.

Fix: The minimal fix is to check for (tp->t_bw_rtttime == 0 || tp->t_bw_rtseq
== 0 || (int)bw < 0), and if it evalutes true, store the current values
and return.  I recommend committing the whole patch though.  It changed
the debugging output to print the instantaneous bw, and the current and
previous average bw.
How-To-Repeat: 	

Run net.inet.tcp.hostcache.list, and take a look at the BANDWIDTH
column.  Note that there is a secondary bug in this sysctl output that
multiplies the bandwidth by 8, even though the original value is
already in octets/sec.
Comment 1 Matthew Dillon 2004-12-16 03:33:43 UTC
   Hmm.  Well, looking at the original code it looks like t_bw_rtttime
   is properly initialized in tcp_newtcpcb(), so it will never be 0.
   t_bw_rtseq is initialized for outgoing connections, but not properly
   initialized for incoming connections.

   The original code was rather messy and really needs some reorganization,
   so I would recommend doing some reorganizing rather then just patching
   more mess on top.

   * Remove initialization of t_bw_rttime so it can be tested against 0
     in the bandwidth limiting code.

   * Auto-initialize t_bw_rtttime in tcp_xmit_bandwidth_limit by testing
     it against 0.

   * Add an idle check.

   * Assign an initial bandwidth of tcp_inflight_min so high speed
     connections do not have to ramp-up the exponential average all
     the way from 0.

   The patch below is against DragonFly and will not apply to FreeBSD, but
   is included to demonstrate my recommendation.

				-Matt
				Matthew Dillon 
				<dillon@backplane.com>
Comment 2 Dan Nelson 2004-12-16 19:06:08 UTC
Updated patch including Matt's recommended fix:

Index: tcp_hostcache.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_hostcache.c,v
retrieving revision 1.7
diff -u -p -r1.7 tcp_hostcache.c
--- tcp_hostcache.c	16 Aug 2004 18:32:07 -0000	1.7
+++ tcp_hostcache.c	15 Dec 2004 17:42:02 -0000
@@ -175,7 +175,7 @@ SYSCTL_INT(_net_inet_tcp_hostcache, OID_
      &tcp_hostcache.purgeall, 0, "Expire all entires on next purge run");
 
 SYSCTL_PROC(_net_inet_tcp_hostcache, OID_AUTO, list,
-	CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_SKIP, 0, 0,
+	CTLTYPE_STRING | CTLFLAG_RD, 0, 0,
 	sysctl_tcp_hc_list, "A", "List of all hostcache entries");
 
 
@@ -676,7 +676,7 @@ sysctl_tcp_hc_list(SYSCTL_HANDLER_ARGS)
 				(RTM_RTTUNIT / (hz * TCP_RTT_SCALE))),
 			    msec(hc_entry->rmx_rttvar *
 				(RTM_RTTUNIT / (hz * TCP_RTT_SCALE))),
-			    hc_entry->rmx_bandwidth * 8,
+			    hc_entry->rmx_bandwidth,
 			    hc_entry->rmx_cwnd,
 			    hc_entry->rmx_sendpipe,
 			    hc_entry->rmx_recvpipe,
Index: tcp_subr.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_subr.c,v
retrieving revision 1.201.2.4
diff -u -p -r1.201.2.4 tcp_subr.c
--- tcp_subr.c	1 Dec 2004 05:37:28 -0000	1.201.2.4
+++ tcp_subr.c	16 Dec 2004 06:38:48 -0000
@@ -631,7 +631,6 @@ tcp_newtcpcb(inp)
 	tp->snd_bwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT;
 	tp->snd_ssthresh = TCP_MAXWIN << TCP_MAX_WINSHIFT;
 	tp->t_rcvtime = ticks;
-	tp->t_bw_rtttime = ticks;
 	/*
 	 * IPv4 TTL initialization is necessary for an IPv6 socket as well,
 	 * because the socket may be bound to an IPv6 wildcard address,
@@ -1907,9 +1906,10 @@ tcp_twrespond(struct tcptw *tw, int flag
 void
 tcp_xmit_bandwidth_limit(struct tcpcb *tp, tcp_seq ack_seq)
 {
-	u_long bw;
+	u_long bw, avgbw;
 	u_long bwnd;
 	int save_ticks;
+	int delta_ticks;
 
 	/*
 	 * If inflight_enable is disabled in the middle of a tcp connection,
@@ -1922,29 +1922,42 @@ tcp_xmit_bandwidth_limit(struct tcpcb *t
 	}
 
 	/*
+	  * Validate the delta time.  If a connection is new or has been idle
+	  * a long time we have to reset the bandwidth calculator.
+	  */
+	save_ticks = ticks;
+	delta_ticks = save_ticks - tp->t_bw_rtttime;
+	if (tp->t_bw_rtttime == 0 || delta_ticks < 0 || delta_ticks > hz * 10) {
+		tp->t_bw_rtttime = ticks;
+		tp->t_bw_rtseq = ack_seq;
+		if (tp->snd_bandwidth == 0)
+			tp->snd_bandwidth = tcp_inflight_min;
+		return;
+	}
+	if (delta_ticks == 0)
+		return;
+	
+	/*
+	 * Sanity check, plus ignore pure window update acks.
+	 */
+	if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
+		return;
+	
+	/*
 	 * Figure out the bandwidth.  Due to the tick granularity this
 	 * is a very rough number and it MUST be averaged over a fairly
 	 * long period of time.  XXX we need to take into account a link
 	 * that is not using all available bandwidth, but for now our
 	 * slop will ramp us up if this case occurs and the bandwidth later
 	 * increases.
-	 *
-	 * Note: if ticks rollover 'bw' may wind up negative.  We must
-	 * effectively reset t_bw_rtttime for this case.
 	 */
-	save_ticks = ticks;
-	if ((u_int)(save_ticks - tp->t_bw_rtttime) < 1)
-		return;
 
-	bw = (int64_t)(ack_seq - tp->t_bw_rtseq) * hz /
-	    (save_ticks - tp->t_bw_rtttime);
+	bw = (int64_t)(ack_seq - tp->t_bw_rtseq) * hz / delta_ticks;
+
 	tp->t_bw_rtttime = save_ticks;
 	tp->t_bw_rtseq = ack_seq;
-	if (tp->t_bw_rtttime == 0 || (int)bw < 0)
-		return;
-	bw = ((int64_t)tp->snd_bandwidth * 15 + bw) >> 4;
 
-	tp->snd_bandwidth = bw;
+	avgbw = ((int64_t)tp->snd_bandwidth * 15 + bw) >> 4;
 
 	/*
 	 * Calculate the semi-static bandwidth delay product, plus two maximal
@@ -1975,22 +1988,25 @@ tcp_xmit_bandwidth_limit(struct tcpcb *t
 	 *	    no other choice.
 	 */
 #define USERTT	((tp->t_srtt + tp->t_rttbest) / 2)
-	bwnd = (int64_t)bw * USERTT / (hz << TCP_RTT_SHIFT) + tcp_inflight_stab * tp->t_maxseg / 10;
+	bwnd = (int64_t)avgbw * USERTT / (hz << TCP_RTT_SHIFT) + tcp_inflight_stab * tp->t_maxseg / 10;
 #undef USERTT
 
 	if (tcp_inflight_debug > 0) {
 		static int ltime;
 		if ((u_int)(ticks - ltime) >= hz / tcp_inflight_debug) {
 			ltime = ticks;
-			printf("%p bw %ld rttbest %d srtt %d bwnd %ld\n",
+			printf("%p bw %lu(%lu,%lu) rttbest %d srtt %d bwnd %ld\n",
 			    tp,
-			    bw,
+			    avgbw, tp->snd_bandwidth, bw,
 			    tp->t_rttbest,
 			    tp->t_srtt,
 			    bwnd
 			);
 		}
 	}
+	
+	tp->snd_bandwidth = avgbw;
+	
 	if ((long)bwnd < tcp_inflight_min)
 		bwnd = tcp_inflight_min;
 	if (bwnd > tcp_inflight_max)
Index: tcp_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_usrreq.c,v
retrieving revision 1.107
diff -u -p -r1.107 tcp_usrreq.c
--- tcp_usrreq.c	16 Aug 2004 18:32:07 -0000	1.107
+++ tcp_usrreq.c	16 Dec 2004 06:38:59 -0000
@@ -864,7 +864,6 @@ tcp_connect(tp, nam, td)
 	tp->t_state = TCPS_SYN_SENT;
 	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
 	tp->iss = tcp_new_isn(tp);
-	tp->t_bw_rtseq = tp->iss;
 	tcp_sendseqinit(tp);
 
 	/*
@@ -959,7 +958,6 @@ tcp6_connect(tp, nam, td)
 	tp->t_state = TCPS_SYN_SENT;
 	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
 	tp->iss = tcp_new_isn(tp);
-	tp->t_bw_rtseq = tp->iss;
 	tcp_sendseqinit(tp);
 
 	/*


-- 
	Dan Nelson
	dnelson@allantgroup.com
Comment 3 gemini 2004-12-18 12:09:14 UTC
Dan Nelson wrote:
> Updated patch including Matt's recommended fix:
> 
> [...]
> diff -u -p -r1.201.2.4 tcp_subr.c
> --- tcp_subr.c	1 Dec 2004 05:37:28 -0000	1.201.2.4
> +++ tcp_subr.c	16 Dec 2004 06:38:48 -0000
> [...]
> @@ -1922,29 +1922,42 @@ tcp_xmit_bandwidth_limit(struct tcpcb *t
>  	}
>  
>  	/*
> +	  * Validate the delta time.  If a connection is new or has been idle
> +	  * a long time we have to reset the bandwidth calculator.
> +	  */
> +	save_ticks = ticks;
> +	delta_ticks = save_ticks - tp->t_bw_rtttime;
> +	if (tp->t_bw_rtttime == 0 || delta_ticks < 0 || delta_ticks > hz * 10) {
> +		tp->t_bw_rtttime = ticks;
> +		tp->t_bw_rtseq = ack_seq;
> +		if (tp->snd_bandwidth == 0)
> +			tp->snd_bandwidth = tcp_inflight_min;
> +		return;
> +	}
> +	if (delta_ticks == 0)
> +		return;
> +	
> +	/*
> +	 * Sanity check, plus ignore pure window update acks.
> +	 */
> +	if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
> +		return;

I wonder, isn't there a flaw in the logic with regard to the sequence 
number handling?  If the sequence number wraps around 't_bw_rtseq' no 
longer gets set and therefore the bandwidth calculation stops until 
'ack_seq' either catches up with 't_bw_rtseq' again (which would take 
quite a while), or 'ticks' wraps around as well, or there is inactivity 
for more than 10 seconds.  This is probably not the intended behavior.

I think the correct approach would be to separate the handling of the 
two cases "sequence number wrap-around" and "window update acks".  I 
suggest to define another variable 'delta_bytes' and initialize it like 
this:

   int delta_bytes;
   [...]
   delta_bytes = ack_seq - tp->t_bw_rtseq;

Then compare it with zero where you compare 'delta_ticks' with zero, and 
check for less than zero (wrap-around) where you check 'delta_ticks' for 
less than zero.  Combine the respective test results with '||'.  Of 
course, you can then use 'delta_bytes' for the bandwidth calculation 
further down in the code, too.

    Uwe
-- 
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini@geminix.org  |  http://www.escapebox.net
Comment 4 Dan Nelson 2004-12-21 06:55:31 UTC
In the last episode (Dec 18), Uwe Doering said:
> Dan Nelson wrote:
> >Updated patch including Matt's recommended fix:
> >
> >+	/*
> >+	 * Sanity check, plus ignore pure window update acks.
> >+	 */
> >+	if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
> >+		return;
> 
> I wonder, isn't there a flaw in the logic with regard to the sequence
> number handling?  If the sequence number wraps around 't_bw_rtseq' no
> longer gets set and therefore the bandwidth calculation stops until
> 'ack_seq' either catches up with 't_bw_rtseq' again (which would take
> quite a while), or 'ticks' wraps around as well, or there is
> inactivity for more than 10 seconds.  This is probably not the
> intended behavior.

I think the code works as-is.  ack_seq and tp->t_bw_rtseq are both of
type "tcp_seq" which is a u_int32_t.  Wrap-around is handled
transparently when your variables are unsigned and your sequence space
covers all possible values.  It's the magic of mod(2^32) arithmetic :)
The (int) cast just makes the if simpler.  Without the cast it would
read

 if (ack_seq - tp->t_bw_rtseq > 2147483648U || ack_seq == tp->t_bw_rtseq)

The sanity check is probably not even necessary, as any really invalid
sequence numbers would have caused the packet to be dropped before it
got this far.


-- 
	Dan Nelson
	dnelson@allantgroup.com
Comment 5 gemini 2004-12-21 09:50:14 UTC
Dan Nelson wrote:
> In the last episode (Dec 18), Uwe Doering said:
>>Dan Nelson wrote:
>>
>>>Updated patch including Matt's recommended fix:
>>>
>>>+	/*
>>>+	 * Sanity check, plus ignore pure window update acks.
>>>+	 */
>>>+	if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
>>>+		return;
>>
>>I wonder, isn't there a flaw in the logic with regard to the sequence
>>number handling?  If the sequence number wraps around 't_bw_rtseq' no
>>longer gets set and therefore the bandwidth calculation stops until
>>'ack_seq' either catches up with 't_bw_rtseq' again (which would take
>>quite a while), or 'ticks' wraps around as well, or there is
>>inactivity for more than 10 seconds.  This is probably not the
>>intended behavior.
> 
> I think the code works as-is.  ack_seq and tp->t_bw_rtseq are both of
> type "tcp_seq" which is a u_int32_t.  Wrap-around is handled
> transparently when your variables are unsigned and your sequence space
> covers all possible values.  It's the magic of mod(2^32) arithmetic :)
> The (int) cast just makes the if simpler.  Without the cast it would
> read
> 
>  if (ack_seq - tp->t_bw_rtseq > 2147483648U || ack_seq == tp->t_bw_rtseq)
> 
> The sanity check is probably not even necessary, as any really invalid
> sequence numbers would have caused the packet to be dropped before it
> got this far.

You are correct.  Unlike 'ticks', which is of type 'int', a sequence 
number wrap-around has no effect on the calculation of the byte count 
due to using 'u_int32_t'.  Tricky. ;-)

Therefore, I propose testing for equality only (window update acks). 
Checking for byte counts of more than 2147483648U and declaring that 
bogus is a little arbitrary.  If we are concerned about invalid sequence 
numbers, they can result in a lower byte count as well.  The way it is 
now (with the patch installed) it looks too much like a wrap-around 
check and is therefore misleading.  So I suggest to either get rid of it 
or add a comment that explains the situation.

    Uwe
-- 
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini@geminix.org  |  http://www.escapebox.net
Comment 6 gemini 2004-12-21 11:20:28 UTC
Uwe Doering wrote:
> Dan Nelson wrote:
>> In the last episode (Dec 18), Uwe Doering said:
>>> Dan Nelson wrote:
>>>
>>>> Updated patch including Matt's recommended fix:
>>>>
>>>> +    /*
>>>> +     * Sanity check, plus ignore pure window update acks.
>>>> +     */
>>>> +    if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
>>>> +        return;
>>>
>>> I wonder, isn't there a flaw in the logic with regard to the sequence
>>> number handling?  If the sequence number wraps around 't_bw_rtseq' no
>>> longer gets set and therefore the bandwidth calculation stops until
>>> 'ack_seq' either catches up with 't_bw_rtseq' again (which would take
>>> quite a while), or 'ticks' wraps around as well, or there is
>>> inactivity for more than 10 seconds.  This is probably not the
>>> intended behavior.
>>
>> I think the code works as-is.  ack_seq and tp->t_bw_rtseq are both of
>> type "tcp_seq" which is a u_int32_t.  Wrap-around is handled
>> transparently when your variables are unsigned and your sequence space
>> covers all possible values.  It's the magic of mod(2^32) arithmetic :)
>> The (int) cast just makes the if simpler.  Without the cast it would
>> read
>>
>>  if (ack_seq - tp->t_bw_rtseq > 2147483648U || ack_seq == tp->t_bw_rtseq)
>>
>> The sanity check is probably not even necessary, as any really invalid
>> sequence numbers would have caused the packet to be dropped before it
>> got this far.
> 
> You are correct.  Unlike 'ticks', which is of type 'int', a sequence 
> number wrap-around has no effect on the calculation of the byte count 
> due to using 'u_int32_t'.  Tricky. ;-)
> 
> Therefore, I propose testing for equality only (window update acks). 
> Checking for byte counts of more than 2147483648U and declaring that 
> bogus is a little arbitrary.  If we are concerned about invalid sequence 
> numbers, they can result in a lower byte count as well.  The way it is 
> now (with the patch installed) it looks too much like a wrap-around 
> check and is therefore misleading.  So I suggest to either get rid of it 
> or add a comment that explains the situation.

On second thought, checking for less than zero (by means of the 'int' 
cast) might have its merits.  This may be a protection against 
out-of-order ACKs, which could in fact be valid, but calculating a byte 
count from an ACK for an earlier packet when we've already processed a 
later ACK would be bogus.

Hard to tell what Matt's intentions were.  In the end the patch appears 
to be okay as it is.  Nevertheless, my suggestion to comment this 
sufficiently still stands.

    Uwe
-- 
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini@geminix.org  |  http://www.escapebox.net
Comment 7 Dan Nelson 2004-12-22 22:24:52 UTC
In the last episode (Dec 21), Uwe Doering said:
> Uwe Doering wrote:
> >Dan Nelson wrote:
> >>In the last episode (Dec 18), Uwe Doering said:
> >>>Dan Nelson wrote:
> >>>
> >>>>Updated patch including Matt's recommended fix:
> >>>>
> >>>>+    /*
> >>>>+     * Sanity check, plus ignore pure window update acks.
> >>>>+     */
> >>>>+    if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
> >>>>+        return;
>
> On second thought, checking for less than zero (by means of the 'int' 
> cast) might have its merits.  This may be a protection against 
> out-of-order ACKs, which could in fact be valid, but calculating a byte 
> count from an ACK for an earlier packet when we've already processed a 
> later ACK would be bogus.

I am pretty sure only useful acks get to this point.  I've changed that
bit of code to just be

+       /*
+        * Ignore pure window update acks.
+        */
+       if (ack_seq == tp->t_bw_rtseq)
+               return;

and added another check for negative sequences that increments a sysctl
counter.  So far netstat -s has counted 567 out-of-order packets but my
counter is still at 0.

-- 
	Dan Nelson
	dnelson@allantgroup.com
Comment 8 Dan Nelson 2004-12-22 22:24:52 UTC
In the last episode (Dec 21), Uwe Doering said:
> Uwe Doering wrote:
> >Dan Nelson wrote:
> >>In the last episode (Dec 18), Uwe Doering said:
> >>>Dan Nelson wrote:
> >>>
> >>>>Updated patch including Matt's recommended fix:
> >>>>
> >>>>+    /*
> >>>>+     * Sanity check, plus ignore pure window update acks.
> >>>>+     */
> >>>>+    if ((int)(ack_seq - tp->t_bw_rtseq) <= 0)
> >>>>+        return;
>
> On second thought, checking for less than zero (by means of the 'int' 
> cast) might have its merits.  This may be a protection against 
> out-of-order ACKs, which could in fact be valid, but calculating a byte 
> count from an ACK for an earlier packet when we've already processed a 
> later ACK would be bogus.

I am pretty sure only useful acks get to this point.  I've changed that
bit of code to just be

+       /*
+        * Ignore pure window update acks.
+        */
+       if (ack_seq == tp->t_bw_rtseq)
+               return;

and added another check for negative sequences that increments a sysctl
counter.  So far netstat -s has counted 567 out-of-order packets but my
counter is still at 0.

-- 
	Dan Nelson
	dnelson@allantgroup.com
_______________________________________________
freebsd-bugs@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscribe@freebsd.org"
Comment 9 Dan Nelson 2005-02-08 07:15:22 UTC
In the last episode (Dec 22), Dan Nelson said:
> I am pretty sure only useful acks get to this point.  I've changed that
> bit of code to just be
> 
> +       /*
> +        * Ignore pure window update acks.
> +        */
> +       if (ack_seq == tp->t_bw_rtseq)
> +               return;
> 
> and added another check for negative sequences that increments a sysctl
> counter.  So far netstat -s has counted 567 out-of-order packets but my
> counter is still at 0.

And after 40 days of uptime on my mail/squid servers, the counter is
still at zero, so the code works perfectly for me.  It'd be nice if
this was committed before 5.4.  Final patch follows:

Index: tcp_hostcache.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_hostcache.c,v
retrieving revision 1.7.2.1
diff -u -p -r1.7.2.1 tcp_hostcache.c
--- tcp_hostcache.c	31 Jan 2005 23:26:36 -0000	1.7.2.1
+++ tcp_hostcache.c	8 Feb 2005 07:09:31 -0000
@@ -175,7 +175,7 @@ SYSCTL_INT(_net_inet_tcp_hostcache, OID_
      &tcp_hostcache.purgeall, 0, "Expire all entires on next purge run");
 
 SYSCTL_PROC(_net_inet_tcp_hostcache, OID_AUTO, list,
-	CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_SKIP, 0, 0,
+	CTLTYPE_STRING | CTLFLAG_RD, 0, 0,
 	sysctl_tcp_hc_list, "A", "List of all hostcache entries");
 
 
@@ -676,7 +676,7 @@ sysctl_tcp_hc_list(SYSCTL_HANDLER_ARGS)
 				(RTM_RTTUNIT / (hz * TCP_RTT_SCALE))),
 			    msec(hc_entry->rmx_rttvar *
 				(RTM_RTTUNIT / (hz * TCP_RTT_SCALE))),
-			    hc_entry->rmx_bandwidth * 8,
+			    hc_entry->rmx_bandwidth,
 			    hc_entry->rmx_cwnd,
 			    hc_entry->rmx_sendpipe,
 			    hc_entry->rmx_recvpipe,
Index: tcp_subr.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_subr.c,v
retrieving revision 1.201.2.12
diff -u -p -r1.201.2.12 tcp_subr.c
--- tcp_subr.c	31 Jan 2005 23:26:36 -0000	1.201.2.12
+++ tcp_subr.c	8 Feb 2005 07:09:33 -0000
@@ -628,7 +632,6 @@ tcp_newtcpcb(inp)
 	tp->snd_bwnd = TCP_MAXWIN << TCP_MAX_WINSHIFT;
 	tp->snd_ssthresh = TCP_MAXWIN << TCP_MAX_WINSHIFT;
 	tp->t_rcvtime = ticks;
-	tp->t_bw_rtttime = ticks;
 	/*
 	 * IPv4 TTL initialization is necessary for an IPv6 socket as well,
 	 * because the socket may be bound to an IPv6 wildcard address,
@@ -1934,9 +1937,10 @@ tcp_twrespond(struct tcptw *tw, int flag
 void
 tcp_xmit_bandwidth_limit(struct tcpcb *tp, tcp_seq ack_seq)
 {
-	u_long bw;
+	u_long bw, avgbw;
 	u_long bwnd;
 	int save_ticks;
+	int delta_ticks;
 
 	INP_LOCK_ASSERT(tp->t_inpcb);
 
@@ -1951,29 +1955,50 @@ tcp_xmit_bandwidth_limit(struct tcpcb *t
 	}
 
 	/*
+	  * Validate the delta time.  If a connection is new or has been idle
+	  * a long time we have to reset the bandwidth calculator.
+	  */
+	save_ticks = ticks;
+	delta_ticks = save_ticks - tp->t_bw_rtttime;
+	if (tp->t_bw_rtttime == 0 || delta_ticks < 0 || delta_ticks > hz * 10) {
+		tp->t_bw_rtttime = ticks;
+		tp->t_bw_rtseq = ack_seq;
+		if (tp->snd_bandwidth == 0)
+			tp->snd_bandwidth = tcp_inflight_min;
+		return;
+	}
+	if (delta_ticks == 0)
+		return;
+	
+	/*
+	 * Ignore pure window update acks.
+	 */
+	if (ack_seq == tp->t_bw_rtseq)
+		return;
+
+	/* Sanity check */
+	if (SEQ_LT(ack_seq, tp->t_bw_rtseq)) {
+		if (tcp_inflight_debug > 0) {
+			printf ("%p invalid sequence: %u vs %u",
+				tp, ack_seq, tp->t_bw_rtseq);
+		}
+	}
+	
+	/*
 	 * Figure out the bandwidth.  Due to the tick granularity this
 	 * is a very rough number and it MUST be averaged over a fairly
 	 * long period of time.  XXX we need to take into account a link
 	 * that is not using all available bandwidth, but for now our
 	 * slop will ramp us up if this case occurs and the bandwidth later
 	 * increases.
-	 *
-	 * Note: if ticks rollover 'bw' may wind up negative.  We must
-	 * effectively reset t_bw_rtttime for this case.
 	 */
-	save_ticks = ticks;
-	if ((u_int)(save_ticks - tp->t_bw_rtttime) < 1)
-		return;
 
-	bw = (int64_t)(ack_seq - tp->t_bw_rtseq) * hz /
-	    (save_ticks - tp->t_bw_rtttime);
+	bw = (int64_t)(ack_seq - tp->t_bw_rtseq) * hz / delta_ticks;
+
 	tp->t_bw_rtttime = save_ticks;
 	tp->t_bw_rtseq = ack_seq;
-	if (tp->t_bw_rtttime == 0 || (int)bw < 0)
-		return;
-	bw = ((int64_t)tp->snd_bandwidth * 15 + bw) >> 4;
 
-	tp->snd_bandwidth = bw;
+	avgbw = ((int64_t)tp->snd_bandwidth * 15 + bw) >> 4;
 
 	/*
 	 * Calculate the semi-static bandwidth delay product, plus two maximal
@@ -2004,22 +2030,25 @@ tcp_xmit_bandwidth_limit(struct tcpcb *t
 	 *	    no other choice.
 	 */
 #define USERTT	((tp->t_srtt + tp->t_rttbest) / 2)
-	bwnd = (int64_t)bw * USERTT / (hz << TCP_RTT_SHIFT) + tcp_inflight_stab * tp->t_maxseg / 10;
+	bwnd = (int64_t)avgbw * USERTT / (hz << TCP_RTT_SHIFT) + tcp_inflight_stab * tp->t_maxseg / 10;
 #undef USERTT
 
 	if (tcp_inflight_debug > 0) {
 		static int ltime;
 		if ((u_int)(ticks - ltime) >= hz / tcp_inflight_debug) {
 			ltime = ticks;
-			printf("%p bw %ld rttbest %d srtt %d bwnd %ld\n",
+			printf("%p bw %lu(%lu,%lu) rttbest %d srtt %d bwnd %ld\n",
 			    tp,
-			    bw,
+			    avgbw, tp->snd_bandwidth, bw,
 			    tp->t_rttbest,
 			    tp->t_srtt,
 			    bwnd
 			);
 		}
 	}
+	
+	tp->snd_bandwidth = avgbw;
+	
 	if ((long)bwnd < tcp_inflight_min)
 		bwnd = tcp_inflight_min;
 	if (bwnd > tcp_inflight_max)
Index: tcp_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_usrreq.c,v
retrieving revision 1.107.2.1
diff -u -p -r1.107.2.1 tcp_usrreq.c
--- tcp_usrreq.c	31 Jan 2005 23:26:37 -0000	1.107.2.1
+++ tcp_usrreq.c	8 Feb 2005 07:09:34 -0000
@@ -864,7 +864,6 @@ tcp_connect(tp, nam, td)
 	tp->t_state = TCPS_SYN_SENT;
 	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
 	tp->iss = tcp_new_isn(tp);
-	tp->t_bw_rtseq = tp->iss;
 	tcp_sendseqinit(tp);
 
 	/*
@@ -959,7 +958,6 @@ tcp6_connect(tp, nam, td)
 	tp->t_state = TCPS_SYN_SENT;
 	callout_reset(tp->tt_keep, tcp_keepinit, tcp_timer_keep, tp);
 	tp->iss = tcp_new_isn(tp);
-	tp->t_bw_rtseq = tp->iss;
 	tcp_sendseqinit(tp);
 
 	/*


-- 
	Dan Nelson
	dnelson@allantgroup.com
Comment 10 Robert Watson freebsd_committer freebsd_triage 2005-07-12 15:54:21 UTC
Responsible Changed
From-To: freebsd-bugs->silby

Assign ownership to silby since he's been spending quality time with TCP 
lately.
Comment 11 Andre Oppermann freebsd_committer freebsd_triage 2006-01-24 16:39:33 UTC
Responsible Changed
From-To: silby->andre

take over.
Comment 12 Nick Johnson 2007-04-04 01:22:27 UTC
I wonder if this could be related to why my 6.2-STABLE machine reports 
completely unreasonable numbers for BANDWIDTH, which often exceed the 
speed of the uplink on my machine, sometimes by an order of magnitude.

It looks like the patch still will apply cleanly to 6.2.  Once I can get a 
console on my box, I'll give it a spin.
Comment 13 K. Macy freebsd_committer freebsd_triage 2007-11-15 23:19:00 UTC
Responsible Changed
From-To: andre->silby


silby is currently the most active... 
silby - if you don't want to deal with it (or others) assign it  
to me rather than let it languish
Comment 14 Mark Linimon freebsd_committer freebsd_triage 2015-11-12 01:28:33 UTC
Reassign to the wild with permission of assignee.

To submitter: is this report still valid?
Comment 15 Dan Nelson 2015-11-12 03:03:04 UTC
It looks like the inflight bandwidth limiter and all the code calculating the bandwidth was removed back in 2010 ( base r212765 ), so there's no need for this bugreport any more.  I've closed it.
Comment 16 Hiren Panchasara freebsd_committer freebsd_triage 2015-11-12 09:56:27 UTC
As far as I can tell, a lot (almost all) of the code around the discussion has been change/deleted/reworked. 

Only weird thing I noticed was that "BANDWIDTH" column is pretty useless now. I think when r212765 removed snd_bwnd and snd_bandwidth, hc_entry->rmx_bandwidth should also have been removed but I guess it was kept for backwards compat was set to 0. Which is still true. I've tried a couple head and couple stable-10 boxes and bandwidth reported in sysctl net.inet.tcp.hostcache.list is always 0.

I am wondering if it's time for us to just delete that BANDWIDTH column.
Comment 17 Hiren Panchasara freebsd_committer freebsd_triage 2015-11-12 10:04:07 UTC
(In reply to Dan Nelson from comment #15)
Grr. totally didn't see this comment before my comment #16. Would have saved some archeology that I had to do to dig out the rev number :-)

Anyways. should we now just remove the BANDWIDTH column?
Comment 18 Dan Nelson 2015-11-12 15:53:44 UTC
I doubt anyone is trying to parse the output of net.inet.tcp.hostcache.list, so it could be removed.  At some point the spare fields in struct tcpcb could be removed, too.  There are still a lot of padding bytes left at the end of the struct, though, so there's no real hurry to reclaim space which would potentially break the ABI for modules using struct tcpcb.
Comment 19 Kubilay Kocak freebsd_committer freebsd_triage 2015-11-12 15:58:19 UTC
There's at least one "well known" script out there that does, and was used to optimize certain kernel parameters based on average RTT, etc values that came from hostcachelist. It doesn't appear to use the (unreliable?) BANDWIDTH column though:

http://spatula.net/blog/2007/04/freebsd-network-performance-tuning.html
Comment 20 Hiren Panchasara freebsd_committer freebsd_triage 2015-11-13 19:16:16 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204528 to take care of it.