| Summary: | Network defaults are absurdly low. | | |
|---|---|---|---|
| Product: | Base System | Reporter: | Leo Bicknell <bicknell> |
| Component: | conf | Assignee: | freebsd-bugs (Nobody) <bugs> |
| Status: | Closed FIXED | | |
| Severity: | Affects Only Me | | |
| Priority: | Normal | | |
| Version: | 4.2-RELEASE | | |
| Hardware: | Any | | |
| OS: | Any | | |
Description
Leo Bicknell
2001-07-10 22:10:00 UTC
> kern.ipc.maxsockbuf=16777216
> net.inet.tcp.sendspace=4194304
> net.inet.tcp.recvspace=4194304
>
> I suspect the FreeBSD authors will want to be more conservative to
> allow lower memory machines to operate properly, so I suggest the
> following system defaults:
>
> kern.ipc.maxsockbuf=1048576
> net.inet.tcp.sendspace=524288
> net.inet.tcp.recvspace=524288

This is potentially a "Very Bad Thing" (tm). These numbers set the default
maximums of the socket buffers for every TCP socket on the system. With a
maxsockbuf set that high, a malicious user could consume all system memory
and most likely provoke a crash, by setting their sendspace to 16MB,
connecting to a slow host, and dumping 16MB of data into the kernel.

Even non-malicious uses can result in very bad behavior. For example,
anyone using those defaults on a system which opens any number of sockets
(an http server which could dump large files into the send buffer
unnecessarily, or an irc server with many thousands of ports) could
quickly find their machine crashing due to mbuf exhaustion.

Admins who tweak the global defaults for these settings do so at their own
risk, and it is assumed they understand that they will only be able to run
applications which use a very small number of high-bandwidth TCP streams.

A better solution for an application which will function under those
conditions is to raise the buffers on a per-socket basis with setsockopt,
which is why the default maxsockbuf is 256k (a reasonably high number which
supports a fairly large amount of bandwidth, and which can be tuned higher
if so desired). Your numbers are not an appropriate default for most users.

> The following sysctl variables show FreeBSD defaults:
>
> kern.ipc.maxsockbuf: 262144
> net.inet.tcp.sendspace: 16384
> net.inet.tcp.recvspace: 16384
>
> These are absurd. The tcp.sendspace/recvspace limit the window size
> of a TCP connection, which in turn limits throughput.
> On a 50ms coast to coast path, it imposes a limit of
> 16384 Bytes * 1000ms/sec / 50ms = 327 KBytes/sec. This is a third of
> what a 10Mbps cable modem should be able to deliver, to say nothing of
> a 100Meg FE connected host (e.g. a server) at an ISP. Go further, 155ms
> to Japan from the east coast of the US, and you're down to around
> 105 KBytes/sec, all due to a poor software limit.

Actually, you're working on the wrong cure for the right problem. TCP
window size does limit throughput over high-bandwidth, high-latency
connections, and under the current implementation the TCP window which is
advertised is limited by the size of the socket buffer. But the correct
solution is not to increase the socket buffer; it is to allocate buffers
dynamically based on the performance of the TCP session.

The socket buffers are not chunks of memory which are allocated when the
socket is created; they are a number which is checked to decide the
validity of appending a new mbuf with additional data. The advertised
receive window is used as a form of host-memory congestion control, and is
presently advertised based on the fixed number set for the socket buffer.
This is stupid and limiting, since a connection could be limited to 16KB
in flight when no congestion has been encountered and the system is
capable of receiving larger amounts of data.

This should be changed to be based on a global memory availability status,
most likely a reserved amount for all network socket buffers. The demand
on this global "pool" can be determined from each TCP session's actual
congestion window, improving the throughput of all sockets automatically
without the potential danger of impacting the entire system negatively.
This memory is actually only allocated during packet-loss recovery, when
you are holding data with gaps and waiting for retransmission, or holding
data which has not been acknowledged and must be retransmitted.
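The window/RTT arithmetic quoted above is easy to verify: with at most one
full window of unacknowledged data in flight per round trip, throughput is
bounded by window / RTT. A quick illustrative sketch (not part of the
original report):

```python
def max_throughput_bps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on TCP throughput: at most one full window of
    unacknowledged data can be in flight per round trip."""
    return window_bytes * 1000.0 / rtt_ms

# 16 KB default window on a 50 ms coast-to-coast path:
print(max_throughput_bps(16384, 50))   # 327680.0 bytes/sec (~327 KB/s)

# Same window at 155 ms to Japan: roughly 105 KB/s
print(max_throughput_bps(16384, 155))
```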
This makes it particularly tricky to reach optimum levels of performance
while still being able to gracefully handle a network situation which
introduces recovery on all network sessions simultaneously. Work is
currently being done on designing this system, which will result in
significantly better performance and stability than just "making the
numbers bigger".

> tcp_extensions="YES"
>
> should be the default, as they are highly unlikely to break anything in
> this day and age, and are necessary to use windows over 64K. An
> interesting compromise would be to only set the settings I suggest if
> tcp_extensions is turned on.

This was just recently made the default, in the hope that the ancient
terminal servers which broke when they received a TCP option they did not
understand have all been retired. Other possible options for improving the
versatility of this option (which Windows 2000 implements) are separation
of the two major components of RFC 1323 (timestamping and window scaling),
and a "passive" mode which would respond to TCP SYNs which include these
options, but would not initiate them on outbound connections.

--
Richard A Steenbergen <ras@e-gerbil.net>  http://www.e-gerbil.net/ras
PGP Key ID: 0x138EA177 (67 29 D7 BC E8 18 3E DA B2 46 B3 D8 14 36 FE B6)

On Tue, Jul 10, 2001 at 06:52:36PM -0400, Richard A. Steenbergen wrote:
> This is potentially a "Very Bad Thing" (tm). These numbers set the default
> maximums of the socket buffers for every TCP socket on the system. With a
> maxsockbuf set that high, a malicious user could consume all system memory
> and most likely provoke a crash, by setting their sendspace to 16MB,
> connecting to a slow host, and dumping 16MB of data into the kernel.

If filling the kernel buffers creates a crash, it will be easier to do
with smaller values, and there are lots of other bug reports to be opened
as to why the kernel can't allocate memory.
> Even non-malicious uses can result in very bad behavior, for example
> anyone using those defaults on a system which opens any number of sockets
> (an http server which could dump large files into the send buffer
> unnecessarily, or an irc server with many thousands of ports) could
> quickly find their machine crashing due to mbuf exhaustion.

This is incorrect. The system will only buffer the current window size in
memory. In order for the window to get to the full size, several things
must happen:

1) The sender must set higher values.
2) The receiver must set higher values.
3) Network conditions let the window scale that large.

It will result, on average, in more data being in the kernel, but that's
because on most systems today users are being severely rate-limited by
these values. If the buffers actually get to full size, it's because both
ends have tuned systems _and_ the network conditions permit.

> Admins who tweak the global defaults for these settings do so at their own
> risk, and it is assumed they understand that they will only be able to run
> applications which use a very small number of high-bandwidth TCP streams.
>
> A better solution for an application which will function under those
> conditions is to raise the buffers on a per-socket basis with setsockopt,
> which is why the default maxsockbuf is 256k (a reasonably high number which
> supports a fairly large amount of bandwidth, and this can be tuned higher
> if so desired).

I suggest that hard-coding large defaults into software, where admins
don't see them and can't change them easily, is a much worse problem. 256k
is far from practical as well. It's easy to show cross-country paths where
a user on a cable modem is rate-limited on, say, ftp downloads. That said,
today no application uses 256k, because applications don't, in general,
request larger sizes with setsockopt, so they are in fact rate-limited to
the 16k default, a pathetic number.
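The three conditions above amount to saying that the data a sender can
keep in flight is capped by the smallest of the relevant windows. A
minimal sketch of that relationship (names are illustrative, not kernel
APIs):

```python
def effective_window(send_buffer: int, advertised_window: int, cwnd: int) -> int:
    """Bytes a TCP sender can actually keep in flight: the smallest of
    its own send buffer, the window the receiver advertises (bounded by
    the receive buffer), and the congestion window the network allows."""
    return min(send_buffer, advertised_window, cwnd)

# A sender tuned to 4 MB gains nothing against an untuned 16 KB receiver:
print(effective_window(4 * 1024 * 1024, 16384, 1024 * 1024))  # 16384
```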
Note, under "real world" conditions you need 2 * bandwidth * delay to get
full bandwidth, which has been well documented by many research groups.
Using that metric, 16k doesn't even let you get full speed off a T1, or
T1-speed DSL, when going to a host across the country. I wonder how many
users out there think 80k/sec is a great transfer rate from a web server
just because neither system is tuned to let them get the 170k/sec they
should be able to get.

--
Leo Bicknell - bicknell@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org

Also, FWIW, there is a related bug that keeps Richard's suggestion from
working, it would appear:

http://www.FreeBSD.org/cgi/query-pr.cgi?pr=11966

So even if his idea is deemed the best solution, something else needs to
be fixed to get network performance up to par.

--
Leo Bicknell - bicknell@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org

On Tue, 10 Jul 2001, Leo Bicknell wrote:
> > Even non-malicious uses can result in very bad behavior, for example
> > anyone using those defaults on a system which opens any number of
> > sockets (an http server which could dump large files into the send
> > buffer unnecessarily, or an irc server with many thousands of ports)
> > could quickly find their machine crashing due to mbuf exhaustion.
>
> This is incorrect. The system will only buffer the current window
> size in memory.

No, the send buffer will fill to its maximum size if the user process
continues to write to it (which it will, because write events will be
generated). This makes it very dangerous to blindly increase the socket
buffers to the numbers necessary to get optimum performance. If your
socket buffer were based on the TCP congestion window (the 2x is covered
below), this problem would be solved.
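The 2 * bandwidth * delay rule mentioned above can be checked with a short
sketch. The T1 line rate is standard; the 70 ms cross-country round trip
is an assumed figure, not from the thread:

```python
def required_window_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """The 2 * bandwidth * delay rule of thumb: socket buffer needed to
    keep a path full even through loss recovery."""
    return 2 * (bandwidth_bps / 8) * rtt_seconds

# T1 (1.544 Mbit/s) with an assumed 70 ms cross-country round trip:
need = required_window_bytes(1.544e6, 0.070)
print(round(need))  # 27020 -- already well above the 16384-byte default
```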
On the receive side, the amount of memory advertised would also be based
on the TCP congestion window. This is used to obtain a reading of how much
this particular session "wants", and is compared against a fixed memory
limit (which can be raised on high-memory systems to intelligently improve
performance without crashing systems) to determine how much the socket
"gets".

> I suggest that hard-coding large defaults into software, where admins
> don't see them and can't change them easily, is a much worse problem.
> 256k is far from practical as well. It's easy to show cross-country
> paths where a user ftp'ing from a cable modem is rate-limited on, say,
> ftp downloads. That said, today no application uses 256k, because
> applications don't, in general, request larger sizes with setsockopt,
> so they are in fact rate-limited to the 16k default, a pathetic number.

I wouldn't suggest that either; it's a bad idea to hardcode anything like
that into the application. Maybe it would make sense for the ftp client to
be easily configurable to increase its window size, but at any rate that
is still a hack.

When you change the settings globally, there is no distinction between an
ftp client and a web server. What will make your ftp client faster for
your 5 ftp sessions on a fast 100Mbit link will make other people's web
servers, with hundreds of clients on dialups, keel over at significantly
lower loads. This is why the defaults are shipped the way they are. The
best solution is to design intelligence into the system, so you are not
taking guesses at your client type in order to best optimize performance.

> Note, under "real world" conditions you need 2 * bandwidth * delay to
> get full bandwidth, which has been well documented by many research
> groups. Using that metric, 16k doesn't even let you get full speed
> off a T1, or T1-speed DSL, when going to a host across the country.
> I wonder how many users out there think 80k/sec is a great transfer
> rate from a web server just because neither system is tuned to let
> them get the 170k/sec they should be able to get.

Correct, this is to accommodate maintaining the optimum window size
through the process of loss recovery (which happens in real-world
systems). Under conditions with any amount of loss, and depending upon the
type of TCP stack you are running, you will probably be better off using
multiple TCP sessions to gain higher throughput, even with infinite window
sizes.

--
Richard A Steenbergen <ras@e-gerbil.net>  http://www.e-gerbil.net/ras
PGP Key ID: 0x138EA177 (67 29 D7 BC E8 18 3E DA B2 46 B3 D8 14 36 FE B6)

On Tue, 10 Jul 2001, Leo Bicknell wrote:
> Also, FWIW, there is a related bug that keeps Richard's suggestion
> from working, it would appear:
>
> http://www.FreeBSD.org/cgi/query-pr.cgi?pr=11966
>
> So even if his idea is deemed the best solution, something else needs
> to be fixed to get network performance up to par.

Note the date and version of this PR: May 31 1999, FreeBSD 2.2.6.

--
Richard A Steenbergen <ras@e-gerbil.net>  http://www.e-gerbil.net/ras
PGP Key ID: 0x138EA177 (67 29 D7 BC E8 18 3E DA B2 46 B3 D8 14 36 FE B6)

On Tue, Jul 10, 2001 at 09:21:05PM -0400, Richard A. Steenbergen wrote:
> No, the send buffer will fill to its maximum size if the user process
> continues to write to it (which it will, because write events will be
> generated). This makes it very dangerous to blindly increase the socket
> buffers to the numbers necessary to get optimum performance. If your
> socket buffer were based on the TCP congestion window (the 2x is covered
> below), this problem would be solved.

This doesn't hold true, at least by vmstat's output. I created a 30 meg
file, and had 25 clients try to connect from a dial-up host with the
modifications in place. With a 4 meg max limit, one would expect 100 meg
of additional kernel memory usage.
However, vmstat reported only minor kernel increases, and no significant
change in available memory. The system was fine, and a 26th client could
connect from across the country and get 8 MBytes/sec on a single transfer.
Mbuf usage, per netstat -m, went up by about 150 (for the 25 clients),
which seems reasonable. Without spending hours looking at the source I
can't say my explanation is correct, but the result you suggest doesn't
happen either. I have in fact used my (admittedly large) values on some
busy web servers and noticed no significant memory usage increases, or
other performance degradation.

There is no reason for the kernel to unblock a process until it's ready to
send data. The kernel does not buffer data from a process to the kernel;
it simply must save the outgoing cwin until there is an ack. I can see no
way the kernel would let a process send data with a full cwin, or when
there are no available network resources.

> I wouldn't suggest that either; it's a bad idea to hardcode anything
> like that into the application. Maybe it would make sense for the ftp
> client to be easily configurable to increase its window size, but at
> any rate that is still a hack.

Ick, it can't be per client. Users will never remember "hey, that's a good
host, up the value"; it will make it into an alias for users who care
(with the maximum setting) instantly.

> When you change the settings globally, there is no distinction between
> an ftp client and a web server. What will make your ftp client faster
> for your 5 ftp sessions on a fast 100Mbit link will make other people's
> web servers, with hundreds of clients on dialups, keel over at
> significantly lower loads. This is why the defaults are shipped the way
> they are. The best solution is to design intelligence into the system,
> so you are not taking guesses at your client type in order to best
> optimize performance.
Again, I'm not seeing this on moderately loaded web servers (5-10
connections/sec peak usage); in fact I'm seeing no major changes, and can
now (where network and remote host allow) get much faster transfers.

Note that according to http://www.psc.edu/networking/perf_tune.html
FreeBSD has smaller defaults than Digital Unix, HP-UX, Linux, MacOS,
Netware, and IRIX. At least moving to values at the high end of the
current set of defaults would be a huge improvement.

--
Leo Bicknell - bicknell@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org

On Tue, Jul 10, 2001 at 09:22:01PM -0400, Richard A. Steenbergen wrote:
> Note the date and version of this PR: May 31 1999, FreeBSD 2.2.6.

Being that it's still open after a pr-closing spree, I would suggest it's
likely still broken, too.

--
Leo Bicknell - bicknell@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org

Ah ha, I believe the reason I had no mbuf problems is because I had
raised a number of other defaults, which may be a good idea anyway.
Currently the number of mbufs is dependent on the number of users, which
really makes no sense, as the maximum number needed can be calculated from
the socket size.
It also shows a minor misunderstanding in my original suggestion. So
I will revise it. Rather than:
kern.ipc.maxsockbuf=16777216
net.inet.tcp.sendspace=4194304
net.inet.tcp.recvspace=4194304
I now suggest _defaults_ of:
kern.ipc.maxsockbuf=8388608
net.inet.tcp.sendspace=419430
net.inet.tcp.recvspace=419430
There was no reason to let the max get that big. Doing calculations on the
normal defaults (as those who write programs to exceed them are on their
own), we get a worst case need of 419430 / 256, or 1638 mbufs per
connection. Considering a "busy" server is unlikely to exceed 100
connections open simultaneously (and the few news servers and irc servers
that exceed that are special cases; very few web servers or general
purpose boxes get to that number), 163800 mbufs maximum would be enough.
Of course, if they were all allocated for some reason, that default would
allow about 40 meg of memory to be allocated to the network, but then I
think it's reasonable to assume that a machine servicing 100 simultaneous
connections has 40 meg around to spare, in the worst cases.
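The arithmetic in the paragraph above, spelled out (assuming the 256-byte
mbuf size used in the message):

```python
MBUF_SIZE = 256        # bytes per mbuf, the figure used in the message
SENDSPACE = 419430     # proposed default socket buffer size
CONNECTIONS = 100      # the "busy server" assumption above

mbufs_per_conn = SENDSPACE // MBUF_SIZE       # 1638
total_mbufs = mbufs_per_conn * CONNECTIONS    # 163800
worst_case_mem = SENDSPACE * CONNECTIONS      # ~41.9 million bytes (~40 meg)

print(mbufs_per_conn, total_mbufs, worst_case_mem)
```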
The defaults now are based on:
#define NMBCLUSTERS (512 + MAXUSERS * 16)
TUNABLE_INT_DECL("kern.ipc.nmbclusters", NMBCLUSTERS, nmbclusters);
TUNABLE_INT_DECL("kern.ipc.nmbufs", NMBCLUSTERS * 4, nmbufs);
So you get 32768 mbufs today with maxusers at 512, so I'm suggesting that
number needs to be about 5x higher to make everything work right.
In reality, I'm betting about 3x-4x would be the right number for
real world situations.
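Plugging MAXUSERS = 512 into the NMBCLUSTERS macro above gives the supply
side of this estimate. A quick sketch (illustrative; the round figure of
32768 quoted in the message is in the same ballpark):

```python
def nmbclusters(maxusers: int) -> int:
    # #define NMBCLUSTERS (512 + MAXUSERS * 16)
    return 512 + maxusers * 16

def nmbufs(maxusers: int) -> int:
    # kern.ipc.nmbufs defaults to NMBCLUSTERS * 4
    return nmbclusters(maxusers) * 4

print(nmbclusters(512))  # 8704 clusters
print(nmbufs(512))       # 34816 mbufs
```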
So, mbuf limits need to be evaluated as well. As this is quickly getting
more complicated, I'll suggest that this pr might need to be split. I
still believe the defaults need to be higher to help the end users
installing the system who don't know how to tune this stuff. I also think
perhaps a new section of the handbook or something needs to address these
issues in more detail, with all the related gook that needs to be changed.
--
Leo Bicknell - bicknell@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org
People as long-winded as us shouldn't talk in PRs. :P

--
Richard A Steenbergen <ras@e-gerbil.net>  http://www.e-gerbil.net/ras
PGP Key ID: 0x138EA177 (67 29 D7 BC E8 18 3E DA B2 46 B3 D8 14 36 FE B6)

I think all the suggestions here are valid, however not for the default
FreeBSD install. Your efforts would be much better spent writing a section
of the handbook, or providing a patch that either documents or makes it
easy to set the values for higher performance via rc.conf.

--
-Alfred Perlstein [alfred@freebsd.org]
Ok, who wrote this damn function called '??'?
And why do my programs keep crashing in it?

On Tue, Jul 10, 2001 at 11:31:57PM -0500, Alfred Perlstein wrote:
> I think all the suggestions here are valid, however not for
> the default FreeBSD install. Your efforts would be much better
> spent writing a section of the handbook or providing a patch
> that either documents or makes it easy to set the values
> for higher performance via rc.conf.

My methods aside, I will respectfully disagree. Working for an ISP, I
explain to 2-3 customers per month why a user on a cable modem can't get
full speed to a server on our network, and every time turning up the tcp
sizes fixes it. 16k is too small. I will accept that maybe I'm way off
base, and that maybe 32k is all that's needed. I think it is reasonable
for an end user to expect that a system on a DSL/cable modem line can take
full advantage of it, and that some of us "smart people" should be able to
make that happen in some reasonable way.

A handbook section is definitely necessary though, as there are clearly a
number of special cases (in particular IRC and news servers, but also web
and ftp servers) that may need additional changes.

--
Leo Bicknell - bicknell@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org

This pr can be closed. There are other issues that must be solved before
the network defaults can be raised.
That work is happening elsewhere, and will be tracked on -hackers for the
time being.

--
Leo Bicknell - bicknell@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-list-request@tmbg.org, www.tmbg.org

State Changed From-To: open->closed

Closed at originator's request.