We have recently gone "live" with more NFS users banging on our FreeBSD-based file servers, and now something seems to have started triggering kernel panics. Since the major difference from before is the number of NFS users, that is the main suspect... We just caught a panic and got a screendump from the console; the stack traceback shows:

> Fatal trap 12: page fault while in kernel mode
> cpuid = 8; apic id = 08
> fault virtual address = 0x0
> fault code = supervisor read data, page not present
> instruction pointer = 0x20:0xffffffff82b578e9
> stack pointer = 0x20:0xfffffe3fdc627760
> code segment = base 0x0, limit 0xfffff, type 0x1b
>              = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags = interrupt enabled, resume, IOPL = 0
> current process = 2519 (nfsd: service)
> trap number = 12
> panic: page fault
> cpuid = 8
> KDB: stack backtrace:
> #0 0xffffffff80b3d577 at kdb_backtrace+0x67
> #1 0xffffffff80af6b17 at vpanic+0x177
> #2 0xffffffff80af6993 at panic+0x43
> #3 0xffffffff80f77fdf at trap_fatal+0x35f
> #4 0xffffffff80f78039 at trap_pfault+0x49
> #5 0xffffffff80f77807 at trap+0x2c7
> #6 0xffffffff80f56fbc at calltrap+0x8
> #7 0xffffffff82b5d4d2 at svc_rpc_gss+0x8f2
> #8 0xffffffff80d6c1b6 at svc_run_internal+0x726
> #9 0xffffffff80d6cd4b at svc_thread_start+0xb
> #10 0xffffffff80aba093 at fork_exit+0x8
> #11 0xffffffff80f48ede at fork_trampoline+0xe

(Unfortunately no kernel crash dump from this machine.)

The systems are Dell PowerEdge R730xd machines with 256GB RAM, HBA330 (LSI 3008) SAS controllers, ZFS storage and Intel X710 10GbE networking, running FreeBSD 11.2. No swap enabled, and the ZFS ARC is capped to 128GB. The clients use NFS v4.0 or v4.1 with sec=krb5:krb5i:krb5p security; most (if not all) are running Linux (CentOS or Ubuntu). Around 200 active NFS clients per server. (Most of our users are Windows users accessing the servers via SMB/Samba, though.)

We have enabled a crash dump device on a couple of the machines and are going to enable it on more, in order to try to get a crash dump when the next server panics...

Any ideas where this bug might be, or how we could work around it? (Disabling NFS is unfortunately not an option.)
Well, if you can somehow find out the source line number for "#7 0xffffffff82b5d4d2 at svc_rpc_gss+0x8f2", then I can take a look. It is pretty obviously a use of a NULL ptr, but without the source line number it's pretty hard to guess what it is. rick
This is a bit of a wild guess, but I was looking at the assembler code for the svc_rpc_gss function at around the offset from the backtrace (0x8f2 = 2290) and it looks like this:

    0xffffffff8286d4bc <svc_rpc_gss+2268>: callq  0xffffffff8286bc50 <rpc_gss_oid_to_mech>
    0xffffffff8286d4c1 <svc_rpc_gss+2273>: mov    0x78(%r14),%rsi
    0xffffffff8286d4c5 <svc_rpc_gss+2277>: lea    -0x38(%rbp),%rdi
    0xffffffff8286d4c9 <svc_rpc_gss+2281>: lea    -0x70(%rbp),%rdx
    0xffffffff8286d4cd <svc_rpc_gss+2285>: callq  0xffffffff828678b0 <gss_export_name>
    0xffffffff8286d4d2 <svc_rpc_gss+2290>: test   %eax,%eax
    0xffffffff8286d4d4 <svc_rpc_gss+2292>: je     0xffffffff8286d932 <svc_rpc_gss+3410>

Looking at the source code in svc_rpcsec_gss.c, this _might_ correspond to the code in svc_rpc_gss_accept_sec_context() at around line 941:

    client->cl_rawcred.version = RPCSEC_GSS_VERSION;
    rpc_gss_oid_to_mech(mech, &client->cl_rawcred.mechanism);
    maj_stat = gss_export_name(&min_stat, client->cl_cname,
        &export_name);
    if (maj_stat != GSS_S_COMPLETE) {
        rpc_gss_log_status("gss_export_name", client->cl_mech,
            maj_stat, min_stat);
        return (FALSE);
    }
    client->cl_rawcred.client_principal =
        mem_alloc(sizeof(*client->cl_rawcred.client_principal)
            + export_name.length);
    client->cl_rawcred.client_principal->len = export_name.length;
    memcpy(client->cl_rawcred.client_principal->name, export_name.value,
        export_name.length);
@Peter, you should be able to use 'l *(svc_rpc_gss+0x8f2)' in kgdb to figure out the line number directly.
The only two callers of that function appear to be rpc_gss_get_principal_name and svc_rpc_gss_accept_sec_context. The latter is called exclusively by svc_rpc_gss, so it might be a good candidate. The former appears to be dead code, so I think it's svc_rpc_gss -> svc_rpc_gss_accept_sec_context (l.1344) -> gss_export_name (l.957). (Line numbers from CURRENT-ish, I don't know what you're running.)
Well, if I understood the comments, that would suggest client->cl_cname is NULL. That is weird. I'm not a Kerberos guy, but it says that the server was able to handle the GSSAPI initialization token (basically a Kerberos session ticket plus some GSSAPI gobbledygook), yet the GSSAPI library doesn't return a client principal name from gss_accept_sec_context() even though it returns GSS_S_COMPLETE. What does this mean in Kerberos land? I have no idea.

I can see two ways to handle it:
1 - Consider it a failed authentication, OR
2 - Map it to "nobody". Basically, the principal name in client->cl_cname is looked up in the password database and, if it is not found, the credentials of "nobody" are used instead of the credentials for a user in the password database.
--> Since a principal with no entry in the password database gets "nobody", it seems that "no principal name" might get the same treatment?

I think I'll generate a patch for #2 above and attach it to this PR, although I have no way to test it.
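For illustration, a rough sketch of what option #2 might look like in svc_rpc_gss_accept_sec_context() (this is NOT the attached patch, just a sketch; it assumes that leaving client_principal NULL makes the later credential lookup fall back to "nobody", the same way an unknown principal name does):

    if (client->cl_cname != GSS_C_NO_NAME) {
        maj_stat = gss_export_name(&min_stat, client->cl_cname,
            &export_name);
        if (maj_stat != GSS_S_COMPLETE) {
            rpc_gss_log_status("gss_export_name", client->cl_mech,
                maj_stat, min_stat);
            return (FALSE);
        }
        client->cl_rawcred.client_principal =
            mem_alloc(sizeof(*client->cl_rawcred.client_principal) +
                export_name.length);
        client->cl_rawcred.client_principal->len = export_name.length;
        memcpy(client->cl_rawcred.client_principal->name,
            export_name.value, export_name.length);
        gss_release_buffer(&min_stat, &export_name);
    } else {
        /*
         * No client principal name was returned: leave
         * client_principal NULL so this client ends up being
         * treated as "nobody" (assumption, see above).
         */
        client->cl_rawcred.client_principal = NULL;
    }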
Created attachment 201858 [details]
fix srv side rpcsec_gss to handle NULL client principal name

This patch modifies the server side RPCSEC_GSS so that it handles the case where gss_accept_sec_context() works for a client's token and returns GSS_S_COMPLETE, but the client principal name is NULL. I have no idea what this means w.r.t. Kerberos, but since a principal name that cannot be found in the password database is authenticated as the user "nobody", this patch does the same for the case of "no principal name".

It is untested, but hopefully Peter can test it? (It assumes that the crash was caused by client->cl_cname == NULL.)
Created attachment 201862 [details]
updated fix for srv side rpcsec_gss NULL client principal

This is an update to the 201858 patch which adds a couple of printf()s instead of a KASSERT(). It uses a local cname variable instead of the one in client->cl_cname, so that any race caused by multiple RPCSEC_GSS_CONTINUE_INIT requests with the same handle will be handled. (gss_release_name() sets client->cl_cname NULL, and that could have resulted in the crash if multiple RPCs were received and handled concurrently as above.)

I didn't think that gss_accept_sec_context() ever returned GSS_S_CONTINUE_NEEDED for the Kerberos mechanism, but I could be wrong on this; if it does happen and the client erroneously sends the next token RPC twice, it could try to use client->cl_cname after it has been set NULL by the gss_release_name() for the previous RPC message.

The printf()s should tell us more about how the NULL cname happens.
I'll try out the proposed fixes. Sounds like it could be the culprit. A pity I haven't really found a way to force the issue yet (or even pinpointed which of the 100-200 clients caused it). But perhaps it's just a random thing caused by network retransmissions and load balancing over multiple LACP trunks...

A question - I enabled a lot of debugging output in order to try to see what's happening, and I noticed one strange thing in the "dmesg" output:

> rpcsec_gss: accepted context for \^D\^A (41 bytes) with <mech { 1 2 840 113554 1 2 2 }, qop 0, svc 1>

(I had modified the rpc_gss_log_debug() call to also print the length of the client_principal member.) I.e., this call around line 1100 or so:

    rpc_gss_log_debug("accepted context for %s (%d bytes) with "
        "<mech %.*s, qop %d, svc %d>",
        client->cl_rawcred.client_principal->name,
        client->cl_rawcred.client_principal->len,
        mechname.length, (char *)mechname.value,
        client->cl_qop, client->cl_rawcred.service);

The "client_principal->name" output looks garbled... (In /var/log/messages those characters get filtered out, so you won't find it there.) Perhaps cl_rawcred.client_principal isn't supposed to be printable at all.
The bogus client_principal name suggests that client->cl_cname was pointing at something bogus, because the client_principal is filled in from it. So, I think this suggests that client->cl_cname is sometimes bogus and sometimes NULL. (It's filled in via gss_accept_sec_context(), so that's in Kerberos land.) Hopefully the patch will help resolve this, rick.
Created attachment 201876 [details]
fix race in ref counting of svc_rpc_gss_client struct

Upon inspection I found two related races when the svc_rpc_gss_client structure is created/initialized:

1 - The cl_expiration field is set after the structure is linked into a list, making it visible to other threads while it is still 0. This could result in svc_rpc_gss_timeout_client() freeing it just after it has been linked into the list.

2 - Related to #1, cl_refs is initialized to 1 and then incremented by refcount_acquire() after svc_rpc_gss_create_client() returns it, leaving a window where cl_refs could be decremented to 0 and the structure free'd.

This patch makes two changes to svc_rpc_gss_create_client():
- Move the code that initializes cl_expiration to before where the structure is linked into the list, so that it is set before it becomes visible to other threads.
- Initialize cl_refs to 2, so a refcount_acquire() is not needed after the svc_rpc_gss_create_client() call.

These races could allow the structure to be free'd while it is still being used in svc_rpc_gss(). This patch can be applied separately from #201862.
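For reference, a minimal sketch of the ordering change described above (this is not the actual patch; cl_refs and cl_expiration are the real field names, while the list head, lock name, lifetime constant and other initialization details are placeholders):

    static struct svc_rpc_gss_client *
    svc_rpc_gss_create_client(void)
    {
        struct svc_rpc_gss_client *client;

        client = mem_alloc(sizeof(*client));
        /* ... other field initialization elided ... */

        /*
         * Two references from the start: one for the list and one for
         * the caller, so no refcount_acquire() is needed after this
         * function returns and cl_refs cannot reach zero while the
         * caller is still using the structure.
         */
        refcount_init(&client->cl_refs, 2);

        /*
         * Set the expiry time BEFORE the structure becomes visible on
         * the list; otherwise svc_rpc_gss_timeout_client() could see
         * cl_expiration == 0 and free it right away.
         */
        client->cl_expiration = time_uptime + CLIENT_LIFETIME; /* placeholder */

        /* Only now link it where other threads can find it. */
        sx_xlock(&svc_rpc_gss_lock);          /* lock name assumed */
        TAILQ_INSERT_HEAD(&svc_rpc_gss_clients, client, cl_link);
        sx_xunlock(&svc_rpc_gss_lock);

        return (client);
    }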
(In reply to Rick Macklem from comment #9)

I modified the debug printout a bit more so it would print all bytes in the cname output, and it looks like it's not a "pure" name but some packed 'struct' of some kind...

    rpcsec_gss: svc_rpc_gss_accept_sec_context: cl_cname = \x04\x01\x00\x0B\x06\x09*\xFFH\xFF\xFF\x12\x01\x02\x02\x00\x00\x00\x13ESPRESSO$@AD.LIU.SE (38 bytes), cl_sname = nfs@filur00.it.liu.se

I.e., the client principal is there, but at the end of the data (19 bytes into it). (\xNN = hexadecimal encoded character. Printable ASCII chars are printed as-is.)
Yes, that is correct. In the gssapi, the call gss_export_name() gets the plain ascii name out of the internal format. You'll notice that the client_principal that was printing out bogusly is a copy of the result of a gss_export_name() on cl_cname. (So, assuming your principal names are printable ascii, as this example appears to be, the result of a successful gss_export_name() will be printable.) Btw, I'm hoping that the second patch "fix race..." will fix this problem, since I doubt that the Kerberos/gssapi libraries are returning trash for the client principal name via gss_accept_sec_context().
I'm running a kernel with your patches now on our test server, with some added code to print the export_name derived from 'cname', showing non-printable characters as hexadecimal escapes. It really looks like some 'struct' ending with the length of the principal as a 4-byte int and then the printable principal name. Not many NFS clients connect to that server, so a race condition causing the "random" characters isn't really likely - especially not when it's the very first call that happens :-)

(Machine authentication)

    rpcsec_gss: svc_rpc_gss_accept_sec_context: cl_cname = \x04\x01\x00\x0B\x06\x09*\xFFH\xFF\xFF\x12\x01\x02\x02\x00\x00\x00\x16FILIFJONKAN$@AD.LIU.SE (41 bytes), cl_sname = nfs@filur00.it.liu.se
    rpcsec_gss: accepted context for (41 bytes) with <mech { 1 2 840 113554 1 2 2 }, qop 0, svc 1>

(User authentication)

    rpcsec_gss: svc_rpc_gss_accept_sec_context: cl_cname = \x04\x01\x00\x0B\x06\x09*\xFFH\xFF\xFF\x12\x01\x02\x02\x00\x00\x00\x12tesje148@AD.LIU.SE (37 bytes), cl_sname = nfs@filur00.it.liu.se
    rpcsec_gss: accepted context for (37 bytes) with <mech { 1 2 840 113554 1 2 2 }, qop 0, svc 1>

My quick hack to print the export_name:

    maj_stat = gss_export_name(&min_stat, cname, &export_name);
    if (maj_stat != GSS_S_COMPLETE) {
        rpc_gss_log_status("gss_export_name", client->cl_mech,
            maj_stat, min_stat);
        return (FALSE);
    }
    if (1) {    /* Debug printout */
        gss_buffer_desc tmp;
        char *src, *dst;
        int i;
        OM_uint32 dummy;

        dst = tmp.value = mem_alloc(tmp.length = export_name.length * 4 + 1);
        src = export_name.value;
        for (i = 0; i < export_name.length; i++) {
            if (*src < ' ' || *src > '~') {
                /* Note: *src is a (signed) char here, so bytes >= 0x80
                 * sign-extend and %02X emits more than the four
                 * characters budgeted per byte - hence the \xFF
                 * artifacts in the output above. */
                sprintf(dst, "\\x%02X", *src);
                dst += 4;
                src++;
            } else
                *dst++ = *src++;
        }
        *dst = '\0';
        rpc_gss_log_debug("svc_rpc_gss_accept_sec_context: "
            "cl_cname = %s (%d bytes), cl_sname = %s",
            tmp.value, export_name.length,
            client->cl_sname && client->cl_sname->sn_principal ?
            client->cl_sname->sn_principal : "<null>");
        gss_release_buffer(&dummy, &tmp);
    }
    client->cl_rawcred.client_principal =
        mem_alloc(sizeof(*client->cl_rawcred.client_principal) +
            export_name.length);
    client->cl_rawcred.client_principal->len = export_name.length;
    memcpy(client->cl_rawcred.client_principal->name, export_name.value,
        export_name.length);
    gss_release_buffer(&min_stat, &export_name);
Well, since that is the output of gss_export_name(), I suspect it won't be found in the password database and will map to "nobody". Note that the fact that it ends in "@AD.LIU.SE" suggests that gss_export_name() isn't doing what it needs to do. Normally the output of gss_export_name() should be:
- for a user principal, just the user, such as "tesje148";
- for a host principal, <name>@<host.domain>, such as "nfs@filur00.it.liu.se".
Unless there is an additional step in the GSSAPI to get to the "user" or "user@domain" name that I have forgotten. (I am not the author of this code, but I did write an RPCSEC_GSS implementation for the early NFSv4 code I did long ago.) Maybe gss_pname_to_unix_cred() does some additional conversion? (If you add a couple of printf()s in gss_pname_to_unix_cred(), you should be able to tell whether the cname is translating to a unix cred ok.)

You also might consider testing with just my second patch and not the first one that changes client->cl_cname to a local cname. (I may have screwed the patches up, since I can't test them.) I don't think this is related to the crash.
Oops, I was wrong. (Maybe Heimdal just returned the principal name 10+ years ago, but no longer.) If you look at (under your FreeBSD source tree):

    crypto/heimdal/lib/gssapi/krb5/export_name.c

you'll see that it puts a bunch of junk in front of the principal name. (It looks like it is always the same # of bytes, so for your case I'd guess it's always 19 bytes of junk.) I'd guess that krb5_aname_to_localname(), found in:

    crypto/heimdal/lib/krb5/aname_to_localname.c

knows how to get rid of the junk, but that isn't callable from the kernel, I'd guess? (I think the gssd daemon will only handle a subset of the gss_XXX() calls.)

So, I don't think this is a bug, and that leaves us with the crashes due to a NULL cl_cname; I'm hoping they were caused by the race fixed by the 2nd patch. Good luck with it, rick
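For what it's worth, the "junk" has a well-defined layout: it is the exported-name token from RFC 2743, section 3.2 - a 2-byte token ID (0x04 0x01), a 2-byte mechanism OID length, the DER-encoded mechanism OID, then a 4-byte big-endian name length followed by the name itself (2 + 2 + 11 + 4 = 19 header bytes for the krb5 OID, matching the dumps above). A small user-space sketch that peels the token apart (illustrative only; the kernel code does not do this):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Parse an RFC 2743 sec. 3.2 exported-name token:
     *   0x04 0x01 | OID len (2 bytes BE) | DER OID | name len (4 bytes BE) | name
     * Returns a pointer to the name bytes and stores their length in
     * *name_len, or NULL if the token is malformed.
     */
    static const unsigned char *
    exported_name_principal(const unsigned char *tok, size_t tok_len,
        uint32_t *name_len)
    {
        size_t oid_len;

        if (tok_len < 4 || tok[0] != 0x04 || tok[1] != 0x01)
            return (NULL);
        oid_len = ((size_t)tok[2] << 8) | tok[3];
        if (tok_len < 4 + oid_len + 4)
            return (NULL);
        *name_len = ((uint32_t)tok[4 + oid_len] << 24) |
            ((uint32_t)tok[5 + oid_len] << 16) |
            ((uint32_t)tok[6 + oid_len] << 8) |
            (uint32_t)tok[7 + oid_len];
        if (tok_len < 4 + oid_len + 4 + *name_len)
            return (NULL);
        return (tok + 4 + oid_len + 4);
    }

    int
    main(void)
    {
        /* The 41-byte token from the FILIFJONKAN$ example above. */
        unsigned char tok[41] = { 0x04, 0x01, 0x00, 0x0b, 0x06, 0x09,
            0x2a, 0x86, 0x48, 0x86, 0xf7, 0x12, 0x01, 0x02, 0x02,
            0x00, 0x00, 0x00, 0x16 };
        const unsigned char *name;
        uint32_t len;

        memcpy(tok + 19, "FILIFJONKAN$@AD.LIU.SE", 22);
        name = exported_name_principal(tok, sizeof(tok), &len);
        if (name != NULL)
            printf("%.*s\n", (int)len, (const char *)name);
        return (0);
    }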
I agree that the debug printout stuff is not the problem here. We are going to try out your second patch on our production servers this week. (Just had one of the servers (without your patch) crash and reboot twice again.)

Thankfully we found a way to massively speed up the mounting of our 20k ZFS filesystems, so a reboot now takes about 5 minutes instead of 40+ minutes... (A coworker wrote a script that does the ZFS mounts in parallel instead of sequentially - there is room for even more speed improvement though :-)

Anyway - a question about a potential workaround: would limiting the number of nfsd threads with "-n 1" "mask" this problem? Our servers have 32 CPU "cores", so nfsd tries to start up 256 kernel threads, which I'm guessing improves the chance of the bug being triggered.
As some background (not necessarily helpful, unfortunately) on GSSAPI names: they are actually fairly complicated objects. To start with, there are two broad types of name object, an "internal name" that is something of a generic name string plus a name type (e.g., "host@test.example.com" and GSS_NT_HOSTBASED_SERVICE), and a "mechanism name" (MN) that contains similar information but has been restricted to a single GSSAPI mechanism (e.g., for the krb5 mechanism, you'd get "host/test.example.com@REALM" for the above example). So the actual runtime object is going to have some OIDs associated with it, whether for the name type or the mechanism type, and perhaps other things.

The original worldview w.r.t. GSSAPI names was that an MN could be converted to an "export token" that would be treated as an opaque binary string and used for equality comparison in ACLs, and that there would be a separate "display name" version that would be used in logging but not for authentication decisions. (You can imagine how well that turned out.) So now we have more advanced functions like gss_localname() that map an MN to a local username.

I did not look particularly closely at the debugger dump of the cl_cname that had the "expected name" at the end of the buffer; it did look like there were probably some OIDs and maybe other data in front. Presumably one could pull the relevant struct definition from the Heimdal internal headers.
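A tiny user-space illustration of those name flavours (a sketch, not kernel code; the spelling of the krb5 mechanism OID constant varies by library - Heimdal uses GSS_KRB5_MECHANISM, MIT spells it gss_mech_krb5):

    #include <gssapi/gssapi.h>
    #include <gssapi/gssapi_krb5.h>
    #include <stdio.h>

    int
    main(void)
    {
        OM_uint32 maj, min;
        gss_buffer_desc in = { sizeof("nfs@filur00.it.liu.se") - 1,
            "nfs@filur00.it.liu.se" };
        gss_buffer_desc disp = GSS_C_EMPTY_BUFFER, exp = GSS_C_EMPTY_BUFFER;
        gss_name_t name = GSS_C_NO_NAME, mn = GSS_C_NO_NAME;

        /* "Internal name": a generic name string plus a name type. */
        maj = gss_import_name(&min, &in, GSS_C_NT_HOSTBASED_SERVICE, &name);

        /* "Mechanism name" (MN): the same name bound to the krb5 mech. */
        if (maj == GSS_S_COMPLETE)
            maj = gss_canonicalize_name(&min, name, GSS_KRB5_MECHANISM, &mn);

        if (maj == GSS_S_COMPLETE) {
            /* Display form: human readable, meant for logging only. */
            gss_display_name(&min, mn, &disp, NULL);
            /* Export token: opaque blob for ACL comparison; this is what
             * the debug printout in svc_rpcsec_gss.c was showing above. */
            gss_export_name(&min, mn, &exp);
            printf("display: %.*s, export token: %zu bytes\n",
                (int)disp.length, (char *)disp.value, (size_t)exp.length);
        }

        gss_release_buffer(&min, &disp);
        gss_release_buffer(&min, &exp);
        gss_release_name(&min, &mn);
        gss_release_name(&min, &name);
        return (0);
    }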
Yes, I suspect that "-n 1" on the nfsd startup would probably mask the problem, although I would not recommend it. (This assumes that I am correct that the race between threads is the culprit.)

Increasing the size of CLIENT_MAX (which recently became a tunable in head) might also mask it, and it will help w.r.t. performance if you have more than 128 users authenticating from the clients. (In this case, CLIENT refers to a user on an NFS mount and not an NFS mount from a client.) Interesting that the HASH table size is set to 256, which is twice the number of elements being kept in the hash lists. (Not sure why anyone would make it that way, but I'm not the author of this stuff. ;-)

I am really hoping that the 2nd patch "fix race..." will fix the problem. (The first patch would just be a "safety belt" in case the GSSAPI did return GSS_S_COMPLETE for gss_accept_sec_context() but somehow left the cname argument NULL.)
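For reference, the tunable ended up being named kern.rpc.gss.client_max (see the r344276 commit message below). On a kernel that has it, raising the limit would look something like the following; whether it is adjustable at runtime or only settable at boot is an assumption here, so check your version:

    # bump the RPCSEC_GSS credential cache limit (old default: 128)
    sysctl kern.rpc.gss.client_max=1024

    # and/or persist it across reboots in /etc/sysctl.conf
    kern.rpc.gss.client_max=1024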
(In reply to Rick Macklem from comment #18)

Hmm... I wonder what would happen if I set CLIENT_MAX to 1 or 2 on our test server and then had a couple of Linux clients do some concurrent access (as different users). Ah well, something to test tomorrow. We definitely have more than 128 different users accessing each server, so we will probably up the limits to 1024 or so later. (Looking for ways to more easily provoke the issue :-)

Anyway, we're now running our servers with that second patch. It will be interesting to see what happens tomorrow... (One of the servers rebooted twice today (without that patch).)
Well, if you have NFS client(s) mounted with more than CLIENT_MAX different users actively using the mount concurrently, that might tickle the race?
- Basically, this code (which handles creation of a new credential) runs when there is a miss in the cache of credentials (which is capped at CLIENT_MAX), where each distinct user (as in uid) needs a credential.
--> Conversely, making CLIENT_MAX >= the maximum # of different uids actively using the file system might reduce the likelihood of the crash, since this code would be executed less frequently.

Related to Ben's comment (thanks for the nice description of the name):
- It did tickle a rusty brain cell. I think I was confusing gss_export_name() with gss_display_name(). Unfortunately, this isn't very useful, since neither gss_display_name() nor gss_localname() is supported by the KGSSAPI. If you search for "_svc" in usr.sbin/gssd/gssd.c, you'll see the rather small list of gssapi functions supported by the KGSSAPI (unless I've misread the code).
- I think the structure is called "Principal" (also called "krb5_principal_data"). It seems to be defined in krb5_asn1.h, and that isn't in the kernel either.
--> I suspect this is why the logging code enabled by compiling with DEBUG defined just logs the output of gss_export_name() and doesn't try to extract the components of it?

Good luck with your testing, rick
Just a quick feedback/comment: we've now been running our production servers with a patched kernel as per the above since Monday night, and so far everything seems to be running correctly... No kernel panics so far - *fingers crossed* :-)

Just a quick wish for someone - would it be possible to have CLIENT_MAX and the hash table size increased a bit (like to 512/1024) in an "official" kernel patch for 11.2? 128/256 are very low numbers - especially since every client principal (HOST$@REALM) also uses up a slot in that table when a client connects... So basically it only takes 64 clients (with a different user on each client) to run into the CLIENT_MAX limit (which I'm guessing causes gssd extra work, so it's not a showstopper, just annoying :-). (There are a bit too many kernel-crashing problems reported with ZFS in 12.0 for us to upgrade to it yet.)

I know I can build my own kernel with those limits increased, but it would be nice to not have to do that (and distribute it to all our servers) whenever a new kernel/security patch arrives... :-)

/Lazy Bugger
I won't be able to do any commits until April, so by then we should know if the patch stops the crashes. (I've taken this bug, so that hopefully I will remember to look at this in April.) I have pinged trasz@ w.r.t. doing an MFC of his patch to head that makes CLIENT_MAX a sysctl you can tune. That would get it in 11.3. (I've left a decision w.r.t. doing an errata change of CLIENT_MAX to him, although I'll admit it doesn't seem serious enough for an errata to me.)
A commit references this bug:

Author: trasz
Date: Tue Feb 19 11:07:03 UTC 2019
New revision: 344276
URL: https://svnweb.freebsd.org/changeset/base/344276

Log:
  Bump the default kern.rpc.gss.client_max from 128 to 1024.

  The old value resulted in bad performance, with high kernel and gssd(8)
  load, with more than ~64 clients; it also triggered crashes, which are
  to be fixed by a different patch.

  PR:             235582
  Discussed with: rmacklem@
  MFC after:      2 weeks

Changes:
  head/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c
I've MFC-ed the patch, and also bumped the default in 13-CURRENT. As for errata - I'm not sure, to be honest. I think erratas are more about crasher bugs and regressions, while this one is neither.
A commit references this bug:

Author: rmacklem
Date: Tue Apr 2 23:51:09 UTC 2019
New revision: 345818
URL: https://svnweb.freebsd.org/changeset/base/345818

Log:
  Fix a race in the RPCSEC_GSS server code that caused crashes.

  When a new client structure was allocated, it was added to the list so
  that it was visible to other threads before the expiry time was
  initialized, with only a single reference count. The caller would
  increment the reference count, but it was possible for another thread
  to decrement the reference count to zero and free the structure before
  the caller incremented the reference count. This could occur because
  the expiry time was still set to zero when the new client structure was
  inserted in the list and the list was unlocked.
  This patch fixes the race by initializing the reference count to two
  and initializing all fields, including the expiry time, before
  inserting it in the list.

  Tested by:  peter@ifm.liu.se
  PR:         235582
  MFC after:  2 weeks

Changes:
  head/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c
The second patch seems to fix the crashes and has been committed to head.
A commit references this bug:

Author: rmacklem
Date: Wed Apr 3 03:50:16 UTC 2019
New revision: 345828
URL: https://svnweb.freebsd.org/changeset/base/345828

Log:
  Add a comment to the r345818 patch to explain why cl_refs is
  initialized to 2.

  PR:         235582
  MFC after:  2 weeks

Changes:
  head/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c
The patch that is believed to fix these crashes has been MFC'd.