We have recently gone "live" with more NFS users banging on our FreeBSD-based file servers, and now something seems to have started triggering kernel panics. Since the major difference from before is the number of NFS users, that is the main suspect... We just caught a panic and got a screendump from the console; the stack traceback shows:

> Fatal trap 12: page fault while in kernel mode
> cpuid = 8; apic id = 08
> fault virtual address = 0x0
> fault code = supervisor read data, page not present
> instruction pointer = 0x20:0xffffffff82b578e9
> stack pointer = 0x20:0xfffffe3fdc627760
> code segment = base 0x0, limit 0xfffff, type 0x1b
>              = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags = interrupt enabled, resume, IOPL = 0
> current process = 2519 (nfsd: service)
> trap number = 12
> panic: page fault
> cpuid = 8
> KDB: stack backtrace:
> #0 0xffffffff80b3d577 at kdb_backtrace+0x67
> #1 0xffffffff80af6b17 at vpanic+0x177
> #2 0xffffffff80af6993 at panic+0x43
> #3 0xffffffff80f77fdf at trap_fatal+0x35f
> #4 0xffffffff80f78039 at trap_pfault+0x49
> #5 0xffffffff80f77807 at trap+0x2c7
> #6 0xffffffff80f56fbc at calltrap+0x8
> #7 0xffffffff82b5d4d2 at svc_rpc_gss+0x8f2
> #8 0xffffffff80d6c1b6 at svc_run_internal+0x726
> #9 0xffffffff80d6cd4b at svc_thread_start+0xb
> #10 0xffffffff80aba093 at fork_exit+0x8
> #11 0xffffffff80f48ede at fork_trampoline+0xe

(Unfortunately no kernel crash dump from this machine.)

The systems are Dell PowerEdge R730xd machines with 256GB RAM, HBA330 (LSI 3008) SAS controllers, ZFS storage and Intel X710 10GbE networking, running FreeBSD 11.2. No swap enabled, and the ZFS ARC is capped to 128GB. The clients use NFS v4.0 or v4.1 with sec=krb5:krb5i:krb5p security; most (if not all) are running Linux (CentOS or Ubuntu). Around 200 active NFS clients per server. (Most of our users are Windows users accessing the servers via SMB/Samba, though.)

We have enabled a crash dump device on a couple of the machines and are going to enable it on more, in order to try to get a crash dump when the next server panics...

Any ideas where this bug might be, or how we could work around it? (Disabling NFS is unfortunately not an option.)
Well, if you can somehow find out the source line number for "#7 0xffffffff82b5d4d2 at svc_rpc_gss+0x8f2", then I can take a look. It is pretty obviously a use of a NULL ptr, but without the source line number it's pretty hard to guess what it is. rick
This is a bit of a wild guess, but I was looking at the assembler code for the svc_rpc_gss function at around the offset from the backtrace (0x8f2 = 2290) and it looks like this:

    0xffffffff8286d4bc <svc_rpc_gss+2268>: callq  0xffffffff8286bc50 <rpc_gss_oid_to_mech>
    0xffffffff8286d4c1 <svc_rpc_gss+2273>: mov    0x78(%r14),%rsi
    0xffffffff8286d4c5 <svc_rpc_gss+2277>: lea    -0x38(%rbp),%rdi
    0xffffffff8286d4c9 <svc_rpc_gss+2281>: lea    -0x70(%rbp),%rdx
    0xffffffff8286d4cd <svc_rpc_gss+2285>: callq  0xffffffff828678b0 <gss_export_name>
    0xffffffff8286d4d2 <svc_rpc_gss+2290>: test   %eax,%eax
    0xffffffff8286d4d4 <svc_rpc_gss+2292>: je     0xffffffff8286d932 <svc_rpc_gss+3410>

Looking at the source code in svc_rpcsec_gss.c, this _might_ correspond to the code in svc_rpc_gss_accept_sec_context() at around line 941:

    client->cl_rawcred.version = RPCSEC_GSS_VERSION;
    rpc_gss_oid_to_mech(mech, &client->cl_rawcred.mechanism);
    maj_stat = gss_export_name(&min_stat, client->cl_cname,
        &export_name);
    if (maj_stat != GSS_S_COMPLETE) {
        rpc_gss_log_status("gss_export_name", client->cl_mech,
            maj_stat, min_stat);
        return (FALSE);
    }
    client->cl_rawcred.client_principal =
        mem_alloc(sizeof(*client->cl_rawcred.client_principal)
            + export_name.length);
    client->cl_rawcred.client_principal->len = export_name.length;
    memcpy(client->cl_rawcred.client_principal->name, export_name.value,
        export_name.length);
@Peter, you should be able to use 'l *(svc_rpc_gss+0x8f2)' in kgdb to figure out the line number directly.
The only two callers of that function appear to be rpc_gss_get_principal_name and svc_rpc_gss_accept_sec_context. The latter is called exclusively by svc_rpc_gss, so it might be a good candidate. The former appears to be dead code, so I think it's svc_rpc_gss -> svc_rpc_gss_accept_sec_context (l.1344) -> gss_export_name (l.957). (Line numbers from CURRENT-ish, I don't know what you're running.)
Well, if I understood the comments, that would suggest client->cl_cname is NULL. That is weird. I'm not a Kerberos guy, but it says that the server was able to handle the GSSAPI initialization token (basically a Kerberos session ticket plus some GSSAPI gobbledygook), yet the GSSAPI library doesn't return a client principal name from gss_accept_sec_context() even though it returns GSS_S_COMPLETE. What does this mean in Kerberos land? I have no idea.

I can see two ways to handle it:
1 - Consider it a failed authentication, OR
2 - Map it to "nobody". Basically, the principal name in client->cl_cname is looked up in the password database and, if it is not found, the credentials of "nobody" are used instead of the credentials for a user in the password database.
--> Since a principal with no entry in the password database gets "nobody", it seems that "no principal name" might get the same treatment?

I think I'll generate a patch for #2 above and attach it to this PR, although I have no way to test it.
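For illustration, a rough sketch of what option #2 might look like in svc_rpc_gss_accept_sec_context() (this is NOT the attached patch, just a sketch; it assumes that leaving client_principal NULL makes the later credential lookup fall back to "nobody", the same way an unknown principal name does):

    if (client->cl_cname != GSS_C_NO_NAME) {
        maj_stat = gss_export_name(&min_stat, client->cl_cname,
            &export_name);
        if (maj_stat != GSS_S_COMPLETE) {
            rpc_gss_log_status("gss_export_name", client->cl_mech,
                maj_stat, min_stat);
            return (FALSE);
        }
        client->cl_rawcred.client_principal =
            mem_alloc(sizeof(*client->cl_rawcred.client_principal) +
                export_name.length);
        client->cl_rawcred.client_principal->len = export_name.length;
        memcpy(client->cl_rawcred.client_principal->name,
            export_name.value, export_name.length);
        gss_release_buffer(&min_stat, &export_name);
    } else {
        /*
         * No client principal name was returned: leave
         * client_principal NULL so this client ends up being
         * treated as "nobody" (assumption, see above).
         */
        client->cl_rawcred.client_principal = NULL;
    }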
Created attachment 201858 [details]
fix srv side rpcsec_gss to handle NULL client principal name

This patch modifies the server side RPCSEC_GSS so that it handles the case where gss_accept_sec_context() works for a client's token and returns GSS_S_COMPLETE, but the client principal name is NULL. I have no idea what this means w.r.t. Kerberos, but since a principal name that cannot be found in the password database is authenticated as the user "nobody", this patch does the same for the case of "no principal name".

It is untested, but hopefully Peter can test it? (It assumes that the crash was caused by client->cl_cname == NULL.)
Created attachment 201862 [details]
updated fix for srv side rpcsec_gss NULL client principal

This is an update to the 201858 patch which adds a couple of printf()s instead of a KASSERT(). It uses a local cname variable instead of the one in client->cl_cname, so that any race caused by multiple RPCSEC_GSS_CONTINUE_INIT requests with the same handle will be handled. (gss_release_name() sets client->cl_cname NULL, and that could have resulted in the crash if multiple RPCs were received and handled concurrently as above.)

I didn't think that gss_accept_sec_context() ever returned GSS_S_CONTINUE_NEEDED for the Kerberos mechanism, but I could be wrong on this; if it does happen and the client erroneously sends the next token RPC twice, it could try to use client->cl_cname after it has been set NULL by the gss_release_name() for the previous RPC message.

The printf()s should tell us more about how the NULL cname happens.
I'll try out the proposed fixes. Sounds like it could be the culprit. A pity I haven't really found a way to force the issue yet (or even pinpointed which of the 100-200 clients caused it). But perhaps it's just a random thing caused by network retransmissions and load balancing over multiple LACP trunks...

A question - I enabled a lot of debugging output in order to try to see what's happening, and I noticed one strange thing in the "dmesg" output:

> rpcsec_gss: accepted context for \^D\^A (41 bytes) with <mech { 1 2 840 113554 1 2 2 }, qop 0, svc 1>

(I had modified the rpc_gss_log_debug() call to also print the length of the client_principal member.) I.e., this call around line 1100 or so:

    rpc_gss_log_debug("accepted context for %s (%d bytes) with "
        "<mech %.*s, qop %d, svc %d>",
        client->cl_rawcred.client_principal->name,
        client->cl_rawcred.client_principal->len,
        mechname.length, (char *)mechname.value,
        client->cl_qop, client->cl_rawcred.service);

The "client_principal->name" output looks garbled... (In /var/log/messages those characters get filtered out, so you won't find it there.) Perhaps cl_rawcred.client_principal isn't supposed to be printable at all.
The bogus client_principal name suggests that client->cl_cname was pointing at something bogus, because the client_principal is filled in from it. So, I think this suggests that client->cl_cname is sometimes bogus and sometimes NULL. (It's filled in via gss_accept_sec_context(), so that's in Kerberos land.) Hopefully the patch will help resolve this, rick.
Created attachment 201876 [details]
fix race in ref counting of svc_rpc_gss_client struct

Upon inspection I found two related races when the svc_rpc_gss_client structure is created/initialized:

1 - The cl_expiration field is set after the structure is linked into a list, making it visible to other threads while it is still 0. This could result in svc_rpc_gss_timeout_client() freeing it just after it has been linked into the list.

2 - Related to #1, cl_refs is initialized to 1 and then incremented by refcount_acquire() after svc_rpc_gss_create_client() returns it, leaving a window where cl_refs could be decremented to 0 and the structure free'd.

This patch makes two changes to svc_rpc_gss_create_client():
- Move the code that initializes cl_expiration to before where the structure is linked into the list, so that it is set before it becomes visible to other threads.
- Initialize cl_refs to 2, so a refcount_acquire() is not needed after the svc_rpc_gss_create_client() call.

These races could allow the structure to be free'd while it is still being used in svc_rpc_gss(). This patch can be applied separately from #201862.
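For reference, a minimal sketch of the ordering change described above (this is not the actual patch; cl_refs and cl_expiration are the real field names, while the list head, lock name, lifetime constant and other initialization details are placeholders):

    static struct svc_rpc_gss_client *
    svc_rpc_gss_create_client(void)
    {
        struct svc_rpc_gss_client *client;

        client = mem_alloc(sizeof(*client));
        /* ... other field initialization elided ... */

        /*
         * Two references from the start: one for the list and one for
         * the caller, so no refcount_acquire() is needed after this
         * function returns and cl_refs cannot reach zero while the
         * caller is still using the structure.
         */
        refcount_init(&client->cl_refs, 2);

        /*
         * Set the expiry time BEFORE the structure becomes visible on
         * the list; otherwise svc_rpc_gss_timeout_client() could see
         * cl_expiration == 0 and free it right away.
         */
        client->cl_expiration = time_uptime + CLIENT_LIFETIME; /* placeholder */

        /* Only now link it where other threads can find it. */
        sx_xlock(&svc_rpc_gss_lock);          /* lock name assumed */
        TAILQ_INSERT_HEAD(&svc_rpc_gss_clients, client, cl_link);
        sx_xunlock(&svc_rpc_gss_lock);

        return (client);
    }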
(In reply to Rick Macklem from comment #9)

I modified the debug printout a bit more so it would print all bytes in the cname output, and it looks like it's not a "pure" name but some packed 'struct' of some kind...

    rpcsec_gss: svc_rpc_gss_accept_sec_context: cl_cname = \x04\x01\x00\x0B\x06\x09*\xFFH\xFF\xFF\x12\x01\x02\x02\x00\x00\x00\x13ESPRESSO$@AD.LIU.SE (38 bytes), cl_sname = nfs@filur00.it.liu.se

I.e., the client principal is there, but at the end of the data (19 bytes into it). (\xNN = hexadecimal encoded character. Printable ASCII chars are printed as-is.)
Yes, that is correct. In the gssapi, the call gss_export_name() gets the plain ascii name out of the internal format. You'll notice that the client_principal that was printing out bogusly is a copy of the result of a gss_export_name() on cl_cname. (So, assuming your principal names are printable ascii, as this example appears to be, the result of a successful gss_export_name() will be printable.) Btw, I'm hoping that the second patch "fix race..." will fix this problem, since I doubt that the Kerberos/gssapi libraries are returning trash for the client principal name via gss_accept_sec_context().
I'm running a kernel with your patches now on our test server, with some added code to print the export_name derived from 'cname', showing non-printable characters as hexadecimal escapes. It really looks like some 'struct' ending with the length of the principal as a 4-byte int and then the printable principal name. Not many NFS clients connect to that server, so a race condition causing the "random" characters isn't really likely - especially not when it's the very first call that happens :-)

(Machine authentication)

    rpcsec_gss: svc_rpc_gss_accept_sec_context: cl_cname = \x04\x01\x00\x0B\x06\x09*\xFFH\xFF\xFF\x12\x01\x02\x02\x00\x00\x00\x16FILIFJONKAN$@AD.LIU.SE (41 bytes), cl_sname = nfs@filur00.it.liu.se
    rpcsec_gss: accepted context for (41 bytes) with <mech { 1 2 840 113554 1 2 2 }, qop 0, svc 1>

(User authentication)

    rpcsec_gss: svc_rpc_gss_accept_sec_context: cl_cname = \x04\x01\x00\x0B\x06\x09*\xFFH\xFF\xFF\x12\x01\x02\x02\x00\x00\x00\x12tesje148@AD.LIU.SE (37 bytes), cl_sname = nfs@filur00.it.liu.se
    rpcsec_gss: accepted context for (37 bytes) with <mech { 1 2 840 113554 1 2 2 }, qop 0, svc 1>

My quick hack to print the export_name:

    maj_stat = gss_export_name(&min_stat, cname, &export_name);
    if (maj_stat != GSS_S_COMPLETE) {
        rpc_gss_log_status("gss_export_name", client->cl_mech,
            maj_stat, min_stat);
        return (FALSE);
    }
    if (1) {    /* Debug printout */
        gss_buffer_desc tmp;
        char *src, *dst;
        int i;
        OM_uint32 dummy;

        dst = tmp.value = mem_alloc(tmp.length = export_name.length * 4 + 1);
        src = export_name.value;
        for (i = 0; i < export_name.length; i++) {
            if (*src < ' ' || *src > '~') {
                /* Note: *src is a (signed) char here, so bytes >= 0x80
                 * sign-extend and %02X emits more than the four
                 * characters budgeted per byte - hence the \xFF
                 * artifacts in the output above. */
                sprintf(dst, "\\x%02X", *src);
                dst += 4;
                src++;
            } else
                *dst++ = *src++;
        }
        *dst = '\0';
        rpc_gss_log_debug("svc_rpc_gss_accept_sec_context: "
            "cl_cname = %s (%d bytes), cl_sname = %s",
            tmp.value, export_name.length,
            client->cl_sname && client->cl_sname->sn_principal ?
            client->cl_sname->sn_principal : "<null>");
        gss_release_buffer(&dummy, &tmp);
    }
    client->cl_rawcred.client_principal =
        mem_alloc(sizeof(*client->cl_rawcred.client_principal) +
            export_name.length);
    client->cl_rawcred.client_principal->len = export_name.length;
    memcpy(client->cl_rawcred.client_principal->name, export_name.value,
        export_name.length);
    gss_release_buffer(&min_stat, &export_name);
Well, since that is the output of gss_export_name(), I suspect it won't be found in the password database and will map to "nobody". Note that the fact that it ends in "@AD.LIU.SE" suggests that gss_export_name() isn't doing what it needs to do. Normally the output of gss_export_name() should be:
- for a user principal, just the user, such as "tesje148";
- for a host principal, <name>@<host.domain>, such as "nfs@filur00.it.liu.se".
Unless there is an additional step in the GSSAPI to get to the "user" or "user@domain" name that I have forgotten. (I am not the author of this code, but I did write an RPCSEC_GSS implementation for the early NFSv4 code I did long ago.) Maybe gss_pname_to_unix_cred() does some additional conversion? (If you add a couple of printf()s in gss_pname_to_unix_cred(), you should be able to tell whether the cname is translating to a unix cred ok.)

You also might consider testing with just my second patch and not the first one that changes client->cl_cname to a local cname. (I may have screwed the patches up, since I can't test them.) I don't think this is related to the crash.
Oops, I was wrong. (Maybe Heimdal just returned the principal name 10+ years ago, but no longer.) If you look at (under your FreeBSD source tree):

    crypto/heimdal/lib/gssapi/krb5/export_name.c

you'll see that it puts a bunch of junk in front of the principal name. (It looks like it is always the same # of bytes, so for your case I'd guess it's always 19 bytes of junk.) I'd guess that krb5_aname_to_localname(), found in:

    crypto/heimdal/lib/krb5/aname_to_localname.c

knows how to get rid of the junk, but that isn't callable from the kernel, I'd guess? (I think the gssd daemon will only handle a subset of the gss_XXX() calls.)

So, I don't think this is a bug, and that leaves us with the crashes due to a NULL cl_cname; I'm hoping they were caused by the race fixed by the 2nd patch. Good luck with it, rick
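For what it's worth, the "junk" has a well-defined layout: it is the exported-name token from RFC 2743, section 3.2 - a 2-byte token ID (0x04 0x01), a 2-byte mechanism OID length, the DER-encoded mechanism OID, then a 4-byte big-endian name length followed by the name itself (2 + 2 + 11 + 4 = 19 header bytes for the krb5 OID, matching the dumps above). A small user-space sketch that peels the token apart (illustrative only; the kernel code does not do this):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Parse an RFC 2743 sec. 3.2 exported-name token:
     *   0x04 0x01 | OID len (2 bytes BE) | DER OID | name len (4 bytes BE) | name
     * Returns a pointer to the name bytes and stores their length in
     * *name_len, or NULL if the token is malformed.
     */
    static const unsigned char *
    exported_name_principal(const unsigned char *tok, size_t tok_len,
        uint32_t *name_len)
    {
        size_t oid_len;

        if (tok_len < 4 || tok[0] != 0x04 || tok[1] != 0x01)
            return (NULL);
        oid_len = ((size_t)tok[2] << 8) | tok[3];
        if (tok_len < 4 + oid_len + 4)
            return (NULL);
        *name_len = ((uint32_t)tok[4 + oid_len] << 24) |
            ((uint32_t)tok[5 + oid_len] << 16) |
            ((uint32_t)tok[6 + oid_len] << 8) |
            (uint32_t)tok[7 + oid_len];
        if (tok_len < 4 + oid_len + 4 + *name_len)
            return (NULL);
        return (tok + 4 + oid_len + 4);
    }

    int
    main(void)
    {
        /* The 41-byte token from the FILIFJONKAN$ example above. */
        unsigned char tok[41] = { 0x04, 0x01, 0x00, 0x0b, 0x06, 0x09,
            0x2a, 0x86, 0x48, 0x86, 0xf7, 0x12, 0x01, 0x02, 0x02,
            0x00, 0x00, 0x00, 0x16 };
        const unsigned char *name;
        uint32_t len;

        memcpy(tok + 19, "FILIFJONKAN$@AD.LIU.SE", 22);
        name = exported_name_principal(tok, sizeof(tok), &len);
        if (name != NULL)
            printf("%.*s\n", (int)len, (const char *)name);
        return (0);
    }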
I agree that the debug printout stuff is not the problem here. We are going to try out your second patch on our production servers this week. (Just had one of the servers (without your patch) crash and reboot twice again.)

Thankfully we found a way to massively speed up the mounting of our 20k ZFS filesystems, so a reboot now takes about 5 minutes instead of 40+ minutes... (A coworker wrote a script that does the ZFS mounts in parallel instead of sequentially - there is room for even more speed improvement though :-)

Anyway - a question about a potential workaround: would limiting the number of nfsd threads with "-n 1" "mask" this problem? Our servers have 32 CPU "cores", so nfsd tries to start up 256 kernel threads, which I'm guessing improves the chance of the bug being triggered.
As some background (not necessarily helpful, unfortunately) on GSSAPI names: they are actually fairly complicated objects. To start with, there are two broad types of name object, an "internal name" that is something of a generic name string plus a name type (e.g., "host@test.example.com" and GSS_NT_HOSTBASED_SERVICE), and a "mechanism name" (MN) that contains similar information but has been restricted to a single GSSAPI mechanism (e.g., for the krb5 mechanism, you'd get "host/test.example.com@REALM" for the above example). So the actual runtime object is going to have some OIDs associated with it, whether for the name type or the mechanism type, and perhaps other things.

The original worldview w.r.t. GSSAPI names was that an MN could be converted to an "export token" that would be treated as an opaque binary string and used for equality comparison in ACLs, and that there would be a separate "display name" version that would be used in logging but not for authentication decisions. (You can imagine how well that turned out.) So now we have more advanced functions like gss_localname() that map an MN to a local username.

I did not look particularly closely at the debugger dump of the cl_cname that had the "expected name" at the end of the buffer; it did look like there were probably some OIDs and maybe other data in front. Presumably one could pull the relevant struct definition from the Heimdal internal headers.
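A tiny user-space illustration of those name flavours (a sketch, not kernel code; the spelling of the krb5 mechanism OID constant varies by library - Heimdal uses GSS_KRB5_MECHANISM, MIT spells it gss_mech_krb5):

    #include <gssapi/gssapi.h>
    #include <gssapi/gssapi_krb5.h>
    #include <stdio.h>

    int
    main(void)
    {
        OM_uint32 maj, min;
        gss_buffer_desc in = { sizeof("nfs@filur00.it.liu.se") - 1,
            "nfs@filur00.it.liu.se" };
        gss_buffer_desc disp = GSS_C_EMPTY_BUFFER, exp = GSS_C_EMPTY_BUFFER;
        gss_name_t name = GSS_C_NO_NAME, mn = GSS_C_NO_NAME;

        /* "Internal name": a generic name string plus a name type. */
        maj = gss_import_name(&min, &in, GSS_C_NT_HOSTBASED_SERVICE, &name);

        /* "Mechanism name" (MN): the same name bound to the krb5 mech. */
        if (maj == GSS_S_COMPLETE)
            maj = gss_canonicalize_name(&min, name, GSS_KRB5_MECHANISM, &mn);

        if (maj == GSS_S_COMPLETE) {
            /* Display form: human readable, meant for logging only. */
            gss_display_name(&min, mn, &disp, NULL);
            /* Export token: opaque blob for ACL comparison; this is what
             * the debug printout in svc_rpcsec_gss.c was showing above. */
            gss_export_name(&min, mn, &exp);
            printf("display: %.*s, export token: %zu bytes\n",
                (int)disp.length, (char *)disp.value, (size_t)exp.length);
        }

        gss_release_buffer(&min, &disp);
        gss_release_buffer(&min, &exp);
        gss_release_name(&min, &mn);
        gss_release_name(&min, &name);
        return (0);
    }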
Yes, I suspect that "-n 1" on the nfsd startup would probably mask the problem, although I would not recommend it. (This assumes that I am correct that the race between threads is the culprit.)

Increasing the size of CLIENT_MAX (which recently became a tunable in head) might also mask it, and it will help w.r.t. performance if you have more than 128 users authenticating from the clients. (In this case, CLIENT refers to a user on an NFS mount and not an NFS mount from a client.) Interesting that the HASH table size is set to 256, which is twice the number of elements being kept in the hash lists. (Not sure why anyone would make it that way, but I'm not the author of this stuff. ;-)

I am really hoping that the 2nd patch "fix race..." will fix the problem. (The first patch would just be a "safety belt" in case the GSSAPI did return GSS_S_COMPLETE for gss_accept_sec_context() but somehow left the cname argument NULL.)
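For reference, the tunable ended up being named kern.rpc.gss.client_max (see the r344276 commit message below). On a kernel that has it, raising the limit would look something like the following; whether it is adjustable at runtime or only settable at boot is an assumption here, so check your version:

    # bump the RPCSEC_GSS credential cache limit (old default: 128)
    sysctl kern.rpc.gss.client_max=1024

    # and/or persist it across reboots in /etc/sysctl.conf
    kern.rpc.gss.client_max=1024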
(In reply to Rick Macklem from comment #18)

Hmm... I wonder what would happen if I set CLIENT_MAX to 1 or 2 on our test server and then had a couple of Linux clients do some concurrent access (as different users). Ah well, something to test tomorrow. We definitely have more than 128 different users accessing each server, so we will probably up the limits to 1024 or so later. (Looking for ways to more easily provoke the issue :-)

Anyway, we're now running our servers with that second patch. It will be interesting to see what happens tomorrow... (One of the servers rebooted twice today (without that patch).)
Well, if you have NFS client(s) mounted with more than CLIENT_MAX different users actively using the mount concurrently, that might tickle the race?
- Basically, this code (which handles creation of a new credential) runs when there is a miss in the cache of credentials (which is capped at CLIENT_MAX), where each distinct user (as in uid) needs a credential.
--> Conversely, making CLIENT_MAX >= the maximum # of different uids actively using the file system might reduce the likelihood of the crash, since this code would be executed less frequently.

Related to Ben's comment (thanks for the nice description of the name):
- It did tickle a rusty brain cell. I think I was confusing gss_export_name() with gss_display_name(). Unfortunately, this isn't very useful, since neither gss_display_name() nor gss_localname() is supported by the KGSSAPI. If you search for "_svc" in usr.sbin/gssd/gssd.c, you'll see the rather small list of gssapi functions supported by the KGSSAPI (unless I've misread the code).
- I think the structure is called "Principal" (also called "krb5_principal_data"). It seems to be defined in krb5_asn1.h, and that isn't in the kernel either.
--> I suspect this is why the logging code enabled by compiling with DEBUG defined just logs the output of gss_export_name() and doesn't try to extract the components of it?

Good luck with your testing, rick
Just a quick feedback/comment: we've now been running our production servers with a patched kernel as per the above since Monday night, and so far everything seems to be running correctly... No kernel panics so far - *fingers crossed* :-)

Just a quick wish for someone - would it be possible to have CLIENT_MAX and the hash table size increased a bit (like to 512/1024) in an "official" kernel patch for 11.2? 128/256 are very low numbers - especially since every client principal (HOST$@REALM) also uses up a slot in that table when a client connects... So basically it only takes 64 clients (with a different user on each client) to run into the CLIENT_MAX limit (which I'm guessing causes gssd extra work, so it's not a showstopper, just annoying :-). (There are a bit too many kernel-crashing problems reported with ZFS in 12.0 for us to upgrade to it yet.)

I know I can build my own kernel with those limits increased, but it would be nice to not have to do that (and distribute it to all our servers) whenever a new kernel/security patch arrives... :-)

/Lazy Bugger
I won't be able to do any commits until April, so by then we should know if the patch stops the crashes. (I've taken this bug, so that hopefully I will remember to look at this in April.) I have pinged trasz@ w.r.t. doing an MFC of his patch to head that makes CLIENT_MAX a sysctl you can tune. That would get it in 11.3. (I've left a decision w.r.t. doing an errata change of CLIENT_MAX to him, although I'll admit it doesn't seem serious enough for an errata to me.)
A commit references this bug:

Author: trasz
Date: Tue Feb 19 11:07:03 UTC 2019
New revision: 344276
URL: https://svnweb.freebsd.org/changeset/base/344276

Log:
  Bump the default kern.rpc.gss.client_max from 128 to 1024.

  The old value resulted in bad performance, with high kernel and gssd(8)
  load, with more than ~64 clients; it also triggered crashes, which are
  to be fixed by a different patch.

  PR:             235582
  Discussed with: rmacklem@
  MFC after:      2 weeks

Changes:
  head/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c
I've MFC-ed the patch, and also bumped the default in 13-CURRENT. As for errata - I'm not sure, to be honest. I think erratas are more about crasher bugs and regressions, while this one is neither.
A commit references this bug:

Author: rmacklem
Date: Tue Apr 2 23:51:09 UTC 2019
New revision: 345818
URL: https://svnweb.freebsd.org/changeset/base/345818

Log:
  Fix a race in the RPCSEC_GSS server code that caused crashes.

  When a new client structure was allocated, it was added to the list so
  that it was visible to other threads before the expiry time was
  initialized, with only a single reference count. The caller would
  increment the reference count, but it was possible for another thread
  to decrement the reference count to zero and free the structure before
  the caller incremented the reference count. This could occur because
  the expiry time was still set to zero when the new client structure was
  inserted in the list and the list was unlocked.
  This patch fixes the race by initializing the reference count to two
  and initializing all fields, including the expiry time, before
  inserting it in the list.

  Tested by:  peter@ifm.liu.se
  PR:         235582
  MFC after:  2 weeks

Changes:
  head/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c
The second patch seems to fix the crashes and has been committed to head.
A commit references this bug:

Author: rmacklem
Date: Wed Apr 3 03:50:16 UTC 2019
New revision: 345828
URL: https://svnweb.freebsd.org/changeset/base/345828

Log:
  Add a comment to the r345818 patch to explain why cl_refs is
  initialized to 2.

  PR:         235582
  MFC after:  2 weeks

Changes:
  head/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c
The patch that is believed to fix these crashes has been MFC'd.