|Summary:||setsockopt() handling of kern.ipc.maxsockbuf limit|
|Product:||Base System||Reporter:||Cameron Sparr <cameronsparr>|
|Component:||kern||Assignee:||freebsd-net mailing list <net>|
|Severity:||Affects Many People||CC:||alfred, allanjude, cameronsparr, driesm.michiels, glebius, koobs, ngie, rwatson|
Description Cameron Sparr 2015-11-10 18:17:53 UTC
Hello, I'd like to put out a proposal for changing the way that the setsockopt() function handles the max socket buffer setting. Currently, it's a bit unintuitive that socket send and receive buffers (SO_RCVBUF & SO_SNDBUF) can not actually be set to the kern.ipc.maxsockbuf value, even though it is described as being possible in the man page: https://www.freebsd.org/cgi/man.cgi?query=setsockopt&sektion=2 This is because setsockopt() will error out if the value passed is over the _adjusted_ maximum, which on my system (amd64) turns out to be something like kern.ipc.maxsockbuf * 0.889 (kern.ipc.maxsockbuf * (1 << 11) / (256 + (1 << 11)) = kern.ipc.maxsockbuf * (2048) / (2304)). see here: https://github.com/freebsd/freebsd/blob/master/sys/kern/uipc_sockbuf.c#L420 and here: https://github.com/freebsd/freebsd/blob/master/sys/kern/uipc_sockbuf.c#L63-L64 I believe that the behavior could be made more intuitive by checking if the value passed is under the _actual_ max before erroring out, and if it is under the actual max, then set it to the adjusted max and continue the function, this would be a simple change and would look something like this (apologies for the whitespace diffs): https://github.com/sparrc/freebsd/commit/157f90c55d1d54d33f41c6f7517de1a9c5f5e229 FWIW, the linux kernel takes it a step further by never failing if setting the buffer over the maximum. It just takes whatever value it's given and sets it to min(given_value, max_value). see here: https://github.com/torvalds/linux/blob/master/net/core/sock.c#L762-L768. Not saying this is a good option, just pointing out that there is a precedent for doing something like this. I would be happy to make this change and submit a patch to phabricator if approved. Thanks all, Cameron
Comment 1 Allan Jude 2015-11-10 21:41:47 UTC
I have run into this, but never understood why it happened, just worked around it. in openssh-portable with HPN enabled, I tried to use '-oTcpRcvBuf=4M' to increase the socket buffer size to receive large ZFS replication streams more quickly. Of course, with kern.ipc.maxsockbuf set to 4M, this fails, I have to use 3.5M or something. This patch would seem to solve that. Thank you. The biggest question is, why can't I actually use a sockbuf for 4M?
Comment 2 Cameron Sparr 2015-11-10 21:55:09 UTC
I don't quite understand the purpose of the adjusted max either, but changing that could have much larger ramifications.
Comment 3 Enji Cooper 2015-11-10 22:47:08 UTC
Reassigning to -net@
Comment 4 Hiren Panchasara 2015-11-10 23:54:42 UTC
I am not sure about the linux way but the first part looks okay to me. I am no expert here so adding Robert and Gleb for their inputs.
Comment 5 Kubilay Kocak 2015-11-11 11:44:39 UTC
@Cameron, if you could get a changeset up on reviews. that would be great. Feel free to replace the URL field value in this issue with the review URL once it's up.
Comment 6 Cameron Sparr 2015-11-11 20:38:39 UTC
Created attachment 163023 [details] Patch of the diff in the URL section. Changes behavior of setsockopt This change makes it so that the setsockopt() function does not error when a user passes a value for SO_RCVBUF equal or under the maximim (kern.ipc.maxsockbuf). Currently the behavior is to error if the value is greater than the _adjusted_ max, which on amd64 turns out to be something like kern.ipc.maxsockbuf * 0.889 This is confusing for users, and this change will set the value passed to the adjusted max if the value is greater than the adjusted and less than the actual max.
Comment 7 Cameron Sparr 2015-11-11 20:41:17 UTC
@Kubilay I realized that I actually have no idea how to use svn/phabricator, so I'm just going to attach a patch to this case. Let me know how it looks, thank you.
Comment 8 Allan Jude 2015-11-11 21:29:10 UTC
There is a command line tool for Phabricator called 'arcanist' (devel/php5-arcanist) that simplifies posting patches to PHabriactor. It works with svn and git. Or, via the Phabricator web interface, you can post a .patch file (helps if you set it to 999999 lines of context)
Comment 9 Enji Cooper 2015-11-11 21:34:20 UTC
This wiki page describes how our current code review process works: https://wiki.freebsd.org/CodeReview
Comment 10 Cameron Sparr 2015-11-11 22:27:14 UTC
Got it, thanks for the link, I put up a review on phabricator. May I suggest you put a link to the wiki (https://wiki.freebsd.org/CodeReview) on the homepage of the phabricator site? Looks like you can edit it using the "dashboard" app but I don't have permissions, obviously: https://reviews.freebsd.org/dashboard/manage/2/
Comment 11 Kubilay Kocak 2015-11-12 06:37:49 UTC
That's perfectly fine Cameron, thank you Once a committer 'takes' /assigns this issue, they can put it up on reviews if necessary.
Comment 12 Kubilay Kocak 2015-11-12 06:40:30 UTC
Looks like I missed the review creation, ignore my last comment, except for the 'needs a committer/assignee' bit ;D
Comment 13 Alfred Perlstein 2015-11-12 18:20:08 UTC
I believe the reason for the adjustment is that "sb_max" is for "max kernel memory taken by the socketbuffer INCLUDING MBUFS THEMSELVES" So what is actually happening is that each MCLSIZE (size of cluster) of "data" is having MSIZE (size of mbuf) added to it. Why is it scaled up so? Because in reality one needs MSIZE actual memory for each cluster as metadata to point to it. And why is this done? So that you can actually trust "sb_max" to mean maximum kernel memory taken to support N bytes per socket. I didn't realize Linux silents truncates the requested amount, that's a little scary, however I'm learning to trust more and more what Linux does. What might make a bit more sense in the long run is actually to make: 1) kern.ipc.maxsockbuf == max number of bytes of DATA in each socketbuffer 2) kern.ipc.maxsockbufmeta = max number of bytes of DATA + METADATA required to be allocated.
Comment 14 Cameron Sparr 2015-11-12 19:08:51 UTC
@Alfred, thanks for the background info Re (1) do you mean that when kern.ipc.maxsockbuf gets set, scale the sb_max value up by the inverse of sb_max_adj? ie, kern.ipc.maxsockbuf * 2304 / 2048. In that case, setting kern.ipc.maxsockbuf to 1000000 bytes would actually set sb_max to 1125000 (and sb_max_adj to 1000000), making it possible to set the buffers to 1000000 bytes. Re (2), If kern.ipc.maxsockbufmeta were read/write, which of the settings would have precedence? Would you have to set maxsockbufmeta first anytime you wanted to set maxsockbuf? Or would setting maxsockbuf auto-set maxsockbufmeta?' Re Linux silent truncation: agreed, it's scary and gives the impression that setsockopt() worked, when really it didn't do anything at all
Comment 15 Alfred Perlstein 2015-11-12 19:33:04 UTC
> Re (1) do you mean that when kern.ipc.maxsockbuf gets set, scale the sb_max value up by the inverse of sb_max_adj? ie, kern.ipc.maxsockbuf * 2304 / 2048. In that case, setting kern.ipc.maxsockbuf to 1000000 bytes would actually set sb_max to 1125000 (and sb_max_adj to 1000000), making it possible to set the buffers to 1000000 bytes. I think so. However just to be clear, let's not have a sysctl that has the following behavior: % sysctl kern.ipc.maxsockbuf=1000000 % sysctl kern.ipc.maxsockbuf kern.ipc.maxsockbuf=1125000 $ We probably want this: % sysctl kern.ipc.maxsockbuf=1000000 % sysctl kern.ipc.maxsockbuf kern.ipc.maxsockbuf=1000000 $ sysctl kern.ipc.maxsockbufmeta kern.ipc.maxsockbufmeta=1125000 > Re (2), If kern.ipc.maxsockbufmeta were read/write, which of the settings would have precedence? Would you have to set maxsockbufmeta first anytime you wanted to set maxsockbuf? Or would setting maxsockbuf auto-set maxsockbufmeta?' I think having them auto-set each other would be the most user friendly option. > Re Linux silent truncation: agreed, it's scary and gives the impression that setsockopt() worked, when really it didn't do anything at all Yes. It would be a good idea to hop on a Linux kernel IRC channel/mailing list and ask them. Last time I did this I got some pretty useful information which changed my world view.
Comment 16 Cameron Sparr 2015-11-12 19:54:38 UTC
I forgot that it's actually spelled out pretty clearly in the comments of the code: /* Don't error on this BSD doesn't and if you think * about it this is right. Otherwise apps have to * play 'guess the biggest size' games. RCVBUF/SNDBUF * are treated in BSD as hints */ val = min_t(u32, val, sysctl_rmem_max); So essentially they are just treating those buffer sizes as "hints". It makes some sense, because they are correct about applications having no idea what size they can actually set it to (without root access to the system). So that would be another option, which is to just set the buffer to min(cc, sb_max_adj)
Comment 17 Alfred Perlstein 2015-11-13 22:42:41 UTC
That is interesting... how to applications warn "your system doesn't have sufficient buffering enabled for performance" ?
Comment 18 Cameron Sparr 2015-11-16 16:29:30 UTC
@Alfred, I guess on Linux there is no way to warn about that, users would have to check the buffer size limits themselves. In conclusion, should I change my patch to do what you originally suggested? Which is to have kern.ipc.maxsockbuf represent that maximum amount of DATA that the socket can handle?