I'd like to put out a proposal for changing the way that the setsockopt() function handles the max socket buffer setting.
Currently, it's a bit unintuitive that socket send and receive buffers (SO_RCVBUF & SO_SNDBUF) can not actually be set to the kern.ipc.maxsockbuf value, even though it is described as being possible in the man page: https://www.freebsd.org/cgi/man.cgi?query=setsockopt&sektion=2
This is because setsockopt() will error out if the value passed is over the _adjusted_ maximum, which on my system (amd64) turns out to be something like kern.ipc.maxsockbuf * 0.889 (kern.ipc.maxsockbuf * (1 << 11) / (256 + (1 << 11)) = kern.ipc.maxsockbuf * (2048) / (2304)).
I believe that the behavior could be made more intuitive by checking if the value passed is under the _actual_ max before erroring out, and if it is under the actual max, then set it to the adjusted max and continue the function, this would be a simple change and would look something like this (apologies for the whitespace diffs): https://github.com/sparrc/freebsd/commit/157f90c55d1d54d33f41c6f7517de1a9c5f5e229
FWIW, the linux kernel takes it a step further by never failing if setting the buffer over the maximum. It just takes whatever value it's given and sets it to min(given_value, max_value). see here: https://github.com/torvalds/linux/blob/master/net/core/sock.c#L762-L768. Not saying this is a good option, just pointing out that there is a precedent for doing something like this.
I would be happy to make this change and submit a patch to phabricator if approved.
I have run into this, but never understood why it happened, just worked around it.
in openssh-portable with HPN enabled, I tried to use '-oTcpRcvBuf=4M' to increase the socket buffer size to receive large ZFS replication streams more quickly. Of course, with kern.ipc.maxsockbuf set to 4M, this fails, I have to use 3.5M or something.
This patch would seem to solve that. Thank you.
The biggest question is, why can't I actually use a sockbuf for 4M?
I don't quite understand the purpose of the adjusted max either, but changing that could have much larger ramifications.
Reassigning to -net@
I am not sure about the linux way but the first part looks okay to me. I am no expert here so adding Robert and Gleb for their inputs.
@Cameron, if you could get a changeset up on reviews. that would be great. Feel free to replace the URL field value in this issue with the review URL once it's up.
Created attachment 163023 [details]
Patch of the diff in the URL section. Changes behavior of setsockopt
This change makes it so that the setsockopt() function does not error
when a user passes a value for SO_RCVBUF equal or under the maximim
Currently the behavior is to error if the value is greater than the
_adjusted_ max, which on amd64 turns out to be something like
kern.ipc.maxsockbuf * 0.889
This is confusing for users, and this change will set the value passed
to the adjusted max if the value is greater than the adjusted and less
than the actual max.
@Kubilay I realized that I actually have no idea how to use svn/phabricator, so I'm just going to attach a patch to this case. Let me know how it looks, thank you.
There is a command line tool for Phabricator called 'arcanist' (devel/php5-arcanist) that simplifies posting patches to PHabriactor. It works with svn and git.
Or, via the Phabricator web interface, you can post a .patch file (helps if you set it to 999999 lines of context)
This wiki page describes how our current code review process works: https://wiki.freebsd.org/CodeReview
Got it, thanks for the link, I put up a review on phabricator. May I suggest you put a link to the wiki (https://wiki.freebsd.org/CodeReview) on the homepage of the phabricator site? Looks like you can edit it using the "dashboard" app but I don't have permissions, obviously: https://reviews.freebsd.org/dashboard/manage/2/
That's perfectly fine Cameron, thank you
Once a committer 'takes' /assigns this issue, they can put it up on reviews if necessary.
Looks like I missed the review creation, ignore my last comment, except for the 'needs a committer/assignee' bit ;D
I believe the reason for the adjustment is that "sb_max" is for "max kernel memory taken by the socketbuffer INCLUDING MBUFS THEMSELVES"
So what is actually happening is that each MCLSIZE (size of cluster) of "data" is having MSIZE (size of mbuf) added to it.
Why is it scaled up so? Because in reality one needs MSIZE actual memory for each cluster as metadata to point to it.
And why is this done? So that you can actually trust "sb_max" to mean maximum kernel memory taken to support N bytes per socket.
I didn't realize Linux silents truncates the requested amount, that's a little scary, however I'm learning to trust more and more what Linux does.
What might make a bit more sense in the long run is actually to make:
1) kern.ipc.maxsockbuf == max number of bytes of DATA in each socketbuffer
2) kern.ipc.maxsockbufmeta = max number of bytes of DATA + METADATA required to be allocated.
@Alfred, thanks for the background info
Re (1) do you mean that when kern.ipc.maxsockbuf gets set, scale the sb_max value up by the inverse of sb_max_adj? ie, kern.ipc.maxsockbuf * 2304 / 2048. In that case, setting kern.ipc.maxsockbuf to 1000000 bytes would actually set sb_max to 1125000 (and sb_max_adj to 1000000), making it possible to set the buffers to 1000000 bytes.
Re (2), If kern.ipc.maxsockbufmeta were read/write, which of the settings would have precedence? Would you have to set maxsockbufmeta first anytime you wanted to set maxsockbuf? Or would setting maxsockbuf auto-set maxsockbufmeta?'
Re Linux silent truncation: agreed, it's scary and gives the impression that setsockopt() worked, when really it didn't do anything at all
> Re (1) do you mean that when kern.ipc.maxsockbuf gets set, scale the sb_max value up by the inverse of sb_max_adj? ie, kern.ipc.maxsockbuf * 2304 / 2048. In that case, setting kern.ipc.maxsockbuf to 1000000 bytes would actually set sb_max to 1125000 (and sb_max_adj to 1000000), making it possible to set the buffers to 1000000 bytes.
I think so. However just to be clear, let's not have a sysctl that has the following behavior:
% sysctl kern.ipc.maxsockbuf=1000000
% sysctl kern.ipc.maxsockbuf
We probably want this:
% sysctl kern.ipc.maxsockbuf=1000000
% sysctl kern.ipc.maxsockbuf
$ sysctl kern.ipc.maxsockbufmeta
> Re (2), If kern.ipc.maxsockbufmeta were read/write, which of the settings would have precedence? Would you have to set maxsockbufmeta first anytime you wanted to set maxsockbuf? Or would setting maxsockbuf auto-set maxsockbufmeta?'
I think having them auto-set each other would be the most user friendly option.
> Re Linux silent truncation: agreed, it's scary and gives the impression that setsockopt() worked, when really it didn't do anything at all
Yes. It would be a good idea to hop on a Linux kernel IRC channel/mailing list and ask them. Last time I did this I got some pretty useful information which changed my world view.
I forgot that it's actually spelled out pretty clearly in the comments of the code:
/* Don't error on this BSD doesn't and if you think
* about it this is right. Otherwise apps have to
* play 'guess the biggest size' games. RCVBUF/SNDBUF
* are treated in BSD as hints
val = min_t(u32, val, sysctl_rmem_max);
So essentially they are just treating those buffer sizes as "hints". It makes some sense, because they are correct about applications having no idea what size they can actually set it to (without root access to the system).
So that would be another option, which is to just set the buffer to min(cc, sb_max_adj)
That is interesting... how to applications warn "your system doesn't have sufficient buffering enabled for performance" ?
@Alfred, I guess on Linux there is no way to warn about that, users would have to check the buffer size limits themselves.
In conclusion, should I change my patch to do what you originally suggested? Which is to have kern.ipc.maxsockbuf represent that maximum amount of DATA that the socket can handle?
For now it makes sense to implement the sysctls as described in comment #15. changing the behavior at this point seems a bad idea.