Change the window size to a large value (for example, 128KB). Expect to get 40-80 Mb/s. Instead, FreeBSD yields about 1 MB/s. BSDI and other BSD or *ix flavors yield 40-80 Mb/s as expected.

A big part of the problem is the function tcp_mss in tcp_input.c, which sets the window size back to a small value.

Fix:

The email message below sums it up. I never did look into why increasing MCLSHIFT results in an unusable kernel, so I'm sending this in as one bug report. If I get a chance I'll look at the MCLSHIFT problem and also try to figure out why setting NMBCLUSTERS to 2048 or above was a problem.

You can do whatever you want with these patches. This is simply a performance issue.

Subject: FreeBSD performance problem solved
Date: Thu, 19 Feb 1998 22:23:07 -0500
From: Curtis Villamizar <curtis@brookfield.ans.net>

The FreeBSD performance problem we had run into previously has now been solved. It may not be the best way for the general FreeBSD audience, but it is completely solved for our purposes.

The executive summary is:

 - the kernel no longer resets the window size back to a small value for no apparent reason (see below)
 - we now can use just under a 1MB window (about the same as BSDI)
 - some kernel tuning (page buffer size, number of clusters) was done to make FDDI MTU work slightly faster
 - we get 20 Mb/s with 192 KB window and 70 msec RTT
 - we get 77 Mb/s with 896 KB window and 70 msec RTT (6.7 sec transfer)
 - we get 88 Mb/s with 896 KB window and 70 msec RTT (47 sec transfer)
 - we get 89 Mb/s with 896 KB window and 70 msec RTT (184 sec transfer)
 - these are slightly better than the BSDI figures (I think? Bill?)

The 2GB transfer in just over 3 minutes is getting quite close to FDDI line rate.

The gory details are listed below. I'll be sending separate bug reports to the FreeBSD team on the tcp_mss issue and the inability to change MCLSHIFT or increase NMBCLUSTERS to 2048.

Curtis

All the kernel stuff is in /sys, which is really a symbolic link to /usr/src/sys. Some of the key directories are netinet, where all the ip, udp, and tcp code is; kern, where all the socket code is; vm, where the virtual memory code is; and sys, where the system header files are.

The main culprit was the function tcp_mss in tcp_input.c. This function is called when a TCP SYN or SYN ACK arrives. Its purpose in life is to adjust the initial MSS and, when doing so, also adjust the buffer size if appropriate. One of the new "features" of tcp_mss is that it now looks up the route that would be used for the socket return path and unconditionally resets the send and recv buffer sizes if there is a sendspace or recvspace parameter on the route, even if the buffer sizes had been set by a setsockopt.

When I found this in the code my first reaction was to not touch the source and just explicitly set the sendspace or recvspace on the route to 10/8. This effort was foiled by the fact that tcp_mss seems to have picked up the wrong route. I then decided to get rid of the problem for good and just change the code so it will only increase the buffer sizes according to the route, but never decrease them. The patch (a diff against tcp_input.c) is included at the end of this report.

Another change is to SB_MAX (which can also be changed with sysctl). The change to the page size makes a full MTU packet fit within a page and allows the kernel code to do less copying.

    options NMBCLUSTERS=1024

We could increase this to something over 1024. At the POC lab it would take 2048. This is sort of odd since that would have only been 4 MB dedicated to clusters on a 64 MB machine. This could be a magical power of two boundary for some other reason that I wasn't able to locate in the source code.

I was never successful in increasing the cluster size from 2048 to 8192 (increasing MCLSHIFT from 11 to 13). Again, there are dependencies on the relative size of some things in the kernel that aren't documented (and might be regarded as bugs).

Increasing NMBCLUSTERS to 2048 or more, or increasing MCLSHIFT from 11 to 13, will have to be exercises for a later date. These are tuning beyond what we really need. Fooling with these latter optimizations gave us unusable kernels in the POC lab, so I didn't want to play with this unless I was within walking distance of the reset button and had a console and keyboard.

*** tcp_input.c.orig	Thu Feb 19 21:56:49 1998
--- tcp_input.c	Thu Feb 19 21:56:14 1998
***************
*** 2075,2080 ****
--- 2075,2082 ----
  		if ((bufsize = rt->rt_rmx.rmx_sendpipe) == 0)
  #endif
  			bufsize = so->so_snd.sb_hiwat;
+ 	if (bufsize < so->so_snd.sb_hiwat)
+ 		bufsize = so->so_snd.sb_hiwat;
  	if (bufsize < mss)
  		mss = bufsize;
  	else {
***************
*** 2089,2094 ****
--- 2091,2098 ----
  		if ((bufsize = rt->rt_rmx.rmx_recvpipe) == 0)
  #endif
  			bufsize = so->so_rcv.sb_hiwat;
+ 	if (bufsize < so->so_rcv.sb_hiwat)
+ 		bufsize = so->so_rcv.sb_hiwat;
  	if (bufsize > mss) {
  		bufsize = roundup(bufsize, mss);
  		if (bufsize > sb_max)

How-To-Repeat:

Run ttcp or netperf over a long-delay path with a 128KB window. Source for ttcp is freely available (source on request if you don't have it).

State Changed
From-To: open->closed

Jlemon will judge this one.