Bug 204438 - setsockopt() handling of kern.ipc.maxsockbuf limit
Summary: setsockopt() handling of kern.ipc.maxsockbuf limit
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern (show other bugs)
Version: CURRENT
Hardware: Any Any
: --- Affects Many People
Assignee: freebsd-net mailing list
URL: https://reviews.freebsd.org/D4129
Keywords: needs-qa, patch
Depends on:
Blocks:
 
Reported: 2015-11-10 18:17 UTC by Cameron Sparr
Modified: 2018-08-20 17:41 UTC (History)
8 users (show)

See Also:
koobs: mfc-stable9?
koobs: mfc-stable10?


Attachments
Patch of the diff in the URL section. Changes behavior of setsockopt (1.16 KB, patch)
2015-11-11 20:38 UTC, Cameron Sparr
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Cameron Sparr 2015-11-10 18:17:53 UTC
Hello,

I'd like to put out a proposal for changing the way that the setsockopt() function handles the max socket buffer setting.

Currently, it's a bit unintuitive that socket send and receive buffers (SO_RCVBUF & SO_SNDBUF) can not actually be set to the kern.ipc.maxsockbuf value, even though it is described as being possible in the man page: https://www.freebsd.org/cgi/man.cgi?query=setsockopt&sektion=2

This is because setsockopt() will error out if the value passed is over the _adjusted_ maximum, which on my system (amd64) turns out to be something like kern.ipc.maxsockbuf * 0.889 (kern.ipc.maxsockbuf * (1 << 11) / (256 + (1 << 11)) = kern.ipc.maxsockbuf * (2048) / (2304)).

see here:
https://github.com/freebsd/freebsd/blob/master/sys/kern/uipc_sockbuf.c#L420
and here:
https://github.com/freebsd/freebsd/blob/master/sys/kern/uipc_sockbuf.c#L63-L64

I believe that the behavior could be made more intuitive by checking if the value passed is under the _actual_ max before erroring out, and if it is under the actual max, then set it to the adjusted max and continue the function, this would be a simple change and would look something like this (apologies for the whitespace diffs): https://github.com/sparrc/freebsd/commit/157f90c55d1d54d33f41c6f7517de1a9c5f5e229

FWIW, the linux kernel takes it a step further by never failing if setting the buffer over the maximum. It just takes whatever value it's given and sets it to min(given_value, max_value). see here: https://github.com/torvalds/linux/blob/master/net/core/sock.c#L762-L768. Not saying this is a good option, just pointing out that there is a precedent for doing something like this.

I would be happy to make this change and submit a patch to phabricator if approved.

Thanks all,
Cameron
Comment 1 Allan Jude freebsd_committer 2015-11-10 21:41:47 UTC
I have run into this, but never understood why it happened, just worked around it.

in openssh-portable with HPN enabled, I tried to use '-oTcpRcvBuf=4M' to increase the socket buffer size to receive large ZFS replication streams more quickly. Of course, with kern.ipc.maxsockbuf set to 4M, this fails, I have to use 3.5M or something.

This patch would seem to solve that. Thank you.

The biggest question is, why can't I actually use a sockbuf for 4M?
Comment 2 Cameron Sparr 2015-11-10 21:55:09 UTC
I don't quite understand the purpose of the adjusted max either, but changing that could have much larger ramifications.
Comment 3 Enji Cooper freebsd_committer 2015-11-10 22:47:08 UTC
Reassigning to -net@
Comment 4 Hiren Panchasara freebsd_committer 2015-11-10 23:54:42 UTC
I am not sure about the linux way but the first part looks okay to me. I am no expert here so adding Robert and Gleb for their inputs.
Comment 5 Kubilay Kocak freebsd_committer freebsd_triage 2015-11-11 11:44:39 UTC
@Cameron, if you could get a changeset up on reviews. that would be great. Feel free to replace the URL field value in this issue with the review URL once it's up.
Comment 6 Cameron Sparr 2015-11-11 20:38:39 UTC
Created attachment 163023 [details]
Patch of the diff in the URL section. Changes behavior of setsockopt

This change makes it so that the setsockopt() function does not error
when a user passes a value for SO_RCVBUF equal or under the maximim
(kern.ipc.maxsockbuf).

Currently the behavior is to error if the value is greater than the
_adjusted_ max, which on amd64 turns out to be something like
kern.ipc.maxsockbuf * 0.889

This is confusing for users, and this change will set the value passed
to the adjusted max if the value is greater than the adjusted and less
than the actual max.
Comment 7 Cameron Sparr 2015-11-11 20:41:17 UTC
@Kubilay I realized that I actually have no idea how to use svn/phabricator, so I'm just going to attach a patch to this case. Let me know how it looks, thank you.
Comment 8 Allan Jude freebsd_committer 2015-11-11 21:29:10 UTC
There is a command line tool for Phabricator called 'arcanist' (devel/php5-arcanist) that simplifies posting patches to PHabriactor. It works with svn and git.

Or, via the Phabricator web interface, you can post a .patch file (helps if you set it to 999999 lines of context)
Comment 9 Enji Cooper freebsd_committer 2015-11-11 21:34:20 UTC
This wiki page describes how our current code review process works: https://wiki.freebsd.org/CodeReview
Comment 10 Cameron Sparr 2015-11-11 22:27:14 UTC
Got it, thanks for the link, I put up a review on phabricator. May I suggest you put a link to the wiki (https://wiki.freebsd.org/CodeReview) on the homepage of the phabricator site? Looks like you can edit it using the "dashboard" app but I don't have permissions, obviously: https://reviews.freebsd.org/dashboard/manage/2/
Comment 11 Kubilay Kocak freebsd_committer freebsd_triage 2015-11-12 06:37:49 UTC
That's perfectly fine Cameron, thank you

Once a committer 'takes' /assigns this issue, they can put it up on reviews if necessary.
Comment 12 Kubilay Kocak freebsd_committer freebsd_triage 2015-11-12 06:40:30 UTC
Looks like I missed the review creation, ignore my last comment, except for the 'needs a committer/assignee' bit ;D
Comment 13 Alfred Perlstein freebsd_committer 2015-11-12 18:20:08 UTC
I believe the reason for the adjustment is that "sb_max" is for "max kernel memory taken by the socketbuffer INCLUDING MBUFS THEMSELVES"

So what is actually happening is that each MCLSIZE (size of cluster) of "data" is having MSIZE (size of mbuf) added to it.

Why is it scaled up so?  Because in reality one needs MSIZE actual memory for each cluster as metadata to point to it.

And why is this done?  So that you can actually trust "sb_max" to mean maximum kernel memory taken to support N bytes per socket.

I didn't realize Linux silents truncates the requested amount, that's a little scary, however I'm learning to trust more and more what Linux does.

What might make a bit more sense in the long run is actually to make:

1) kern.ipc.maxsockbuf == max number of bytes of DATA in each socketbuffer
2) kern.ipc.maxsockbufmeta = max number of bytes of DATA + METADATA required to be allocated.
Comment 14 Cameron Sparr 2015-11-12 19:08:51 UTC
@Alfred, thanks for the background info

Re (1) do you mean that when kern.ipc.maxsockbuf gets set, scale the sb_max value up by the inverse of sb_max_adj? ie, kern.ipc.maxsockbuf * 2304 / 2048. In that case, setting kern.ipc.maxsockbuf to 1000000 bytes would actually set sb_max to 1125000 (and sb_max_adj to 1000000), making it possible to set the buffers to 1000000 bytes.

Re (2), If kern.ipc.maxsockbufmeta were read/write, which of the settings would have precedence? Would you have to set maxsockbufmeta first anytime you wanted to set maxsockbuf? Or would setting maxsockbuf auto-set maxsockbufmeta?'

Re Linux silent truncation: agreed, it's scary and gives the impression that setsockopt() worked, when really it didn't do anything at all
Comment 15 Alfred Perlstein freebsd_committer 2015-11-12 19:33:04 UTC
> Re (1) do you mean that when kern.ipc.maxsockbuf gets set, scale the sb_max value up by the inverse of sb_max_adj? ie, kern.ipc.maxsockbuf * 2304 / 2048. In that case, setting kern.ipc.maxsockbuf to 1000000 bytes would actually set sb_max to 1125000 (and sb_max_adj to 1000000), making it possible to set the buffers to 1000000 bytes.

I think so.  However just to be clear, let's not have a sysctl that has the following behavior:
% sysctl kern.ipc.maxsockbuf=1000000
% sysctl kern.ipc.maxsockbuf
kern.ipc.maxsockbuf=1125000
$

We probably want this:
% sysctl kern.ipc.maxsockbuf=1000000
% sysctl kern.ipc.maxsockbuf
kern.ipc.maxsockbuf=1000000
$ sysctl kern.ipc.maxsockbufmeta
kern.ipc.maxsockbufmeta=1125000



> Re (2), If kern.ipc.maxsockbufmeta were read/write, which of the settings would have precedence? Would you have to set maxsockbufmeta first anytime you wanted to set maxsockbuf? Or would setting maxsockbuf auto-set maxsockbufmeta?'

I think having them auto-set each other would be the most user friendly option.

> Re Linux silent truncation: agreed, it's scary and gives the impression that setsockopt() worked, when really it didn't do anything at all

Yes.  It would be a good idea to hop on a Linux kernel IRC channel/mailing list and ask them.  Last time I did this I got some pretty useful information which changed my world view.
Comment 16 Cameron Sparr 2015-11-12 19:54:38 UTC
I forgot that it's actually spelled out pretty clearly in the comments of the code:

/* Don't error on this BSD doesn't and if you think
 * about it this is right. Otherwise apps have to
 * play 'guess the biggest size' games. RCVBUF/SNDBUF
 * are treated in BSD as hints
 */
val = min_t(u32, val, sysctl_rmem_max);

So essentially they are just treating those buffer sizes as "hints". It makes some sense, because they are correct about applications having no idea what size they can actually set it to (without root access to the system).

So that would be another option, which is to just set the buffer to min(cc, sb_max_adj)
Comment 17 Alfred Perlstein freebsd_committer 2015-11-13 22:42:41 UTC
That is interesting... how to applications warn "your system doesn't have sufficient buffering enabled for performance" ?
Comment 18 Cameron Sparr 2015-11-16 16:29:30 UTC
@Alfred, I guess on Linux there is no way to warn about that, users would have to check the buffer size limits themselves.

In conclusion, should I change my patch to do what you originally suggested? Which is to have kern.ipc.maxsockbuf represent that maximum amount of DATA that the socket can handle?
Comment 19 Alfred Perlstein freebsd_committer 2015-11-17 02:00:04 UTC
For now it makes sense to implement the sysctls as described in comment #15.  changing the behavior at this point seems a bad idea.