Summary: | Kernel panics with vfs.nfsd.enable_locallocks=1 and nfs clients doing hdf5 file operations | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Base System | Reporter: | Matthew L. Dailey <matthew.l.dailey> | ||||||
Component: | kern | Assignee: | Rick Macklem <rmacklem> | ||||||
Status: | Closed FIXED | ||||||||
Severity: | Affects Some People | CC: | cy, grahamperrin, rmacklem | ||||||
Priority: | --- | Keywords: | crash | ||||||
Version: | 15.0-CURRENT | Flags: | rmacklem:
mfc-stable14+
rmacklem: mfc-stable13+ |
||||||
Hardware: | amd64 | ||||||||
OS: | Any | ||||||||
Attachments: |
|
Description
Matthew L. Dailey
2024-08-21 15:00:46 UTC
Created attachment 252988 [details]
Steps to reproduce panics
Attached are instructions on how to set up a server and client and steps to reproduce the panics.
Just fyi, I have found a place where the code does an msleep() when vfs.nfs.enable_locallocks is 1. This is bad news, since the underlying linked list data structures could change during the msleep(). I haven't yet gotten to the point of coming up with a fix. Until then, I would recommend avoiding setting vfs.nfsd.enable_locallocks to 1. Created attachment 253528 [details]
Fix close handling when vfs.nfsd.enable_locallocks=1
I think this patch might fix the problem.
nfsrv_freeopen() was being called after the mutex
lock was released, making it possible for other
kernel threads to change the lists while nfsrv_freeopen()
took the nfsstateid out of the lists.
This patch moves the code around
"if (nfsrv_freeopen(stp, vp, 1 p) == 0) {"
into nfsrv_freeopen(), so that it can remove the nfsstateid
structure from all lists before unlocking the mutex.
This should avoid any race between CLOSE and other nfsd threads
updating the NFSv4 state.
The patch does not affect the semantics when vfs.nfsd.enable_locallocks=0.
If the reporter can test this patch, that would be great.
(In reply to Rick Macklem from comment #3) Thanks, Rick - this looks promising! I should be able to get a test started today with this patch and see how it goes. Since the panics can take hours or days to happen (sometimes weeks), it may be a while before I can say anything definitive. I'll update this report once I have any results. Sounds good. I won't be able to do commits for about the next 10 days anyhow and it will be a while before the 14.2 release cycle gets going, so this can wait for a few weeks for results. (If it doesn't fix your problem, I will probably still commit it unless you report that it has made things worse.) Thanks, rick Quick update on this. I have had a test system running with this patch now for just over 10 days with no panics or issues. The hdf5 workload from a linux client has been running 24/7 for this entire period. This is certainly not definitive in terms of the original bug, but it's looking good so far. And, it does not appear that this has introduced any new problems, at least in this limited testing. I'll update again in another week or two. (In reply to Matthew L. Dailey from comment #6) Thanks for the update. Given the 14.2 release schedule, I think I will commit the patch to main now and MFC it, given that I do believe it fixes a locking problem. I will leave this PR Open until you report again. A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=eb345e05ac6602eeef0c33fce9025bbc8ec44d0f commit eb345e05ac6602eeef0c33fce9025bbc8ec44d0f Author: Rick Macklem <rmacklem@FreeBSD.org> AuthorDate: 2024-09-30 22:49:57 +0000 Commit: Rick Macklem <rmacklem@FreeBSD.org> CommitDate: 2024-09-30 22:49:57 +0000 nfsd: Fix close handling when vfs.nfsd.enable_locallocks=1 nfsrv_freeopen() was being called after the mutex lock was released, making it possible for other kernel threads to change the lists while nfsrv_freeopen() took the nfsstateid out of the lists. This patch moves the code around "if (nfsrv_freeopen(stp, vp, 1 p) == 0) {" into nfsrv_freeopen(), so that it can remove the nfsstateid structure from all lists before unlocking the mutex. This should avoid any race between CLOSE and other nfsd threads updating the NFSv4 state. The patch does not affect semantics when vfs.nfsd.enable_locallocks=0. PR: 280978 Tested by: Matthew L. Dailey <matthew.l.dailey@dartmouth.edu> MFC after: 1 week sys/fs/nfsserver/nfs_nfsdstate.c | 62 +++++++++++++++++++--------------------- 1 file changed, 30 insertions(+), 32 deletions(-) A commit in branch stable/14 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=e7474c619270c5c325ce844b77eecc2dcf98147c commit e7474c619270c5c325ce844b77eecc2dcf98147c Author: Rick Macklem <rmacklem@FreeBSD.org> AuthorDate: 2024-09-30 22:49:57 +0000 Commit: Rick Macklem <rmacklem@FreeBSD.org> CommitDate: 2024-10-12 00:43:27 +0000 nfsd: Fix close handling when vfs.nfsd.enable_locallocks=1 nfsrv_freeopen() was being called after the mutex lock was released, making it possible for other kernel threads to change the lists while nfsrv_freeopen() took the nfsstateid out of the lists. This patch moves the code around "if (nfsrv_freeopen(stp, vp, 1 p) == 0) {" into nfsrv_freeopen(), so that it can remove the nfsstateid structure from all lists before unlocking the mutex. This should avoid any race between CLOSE and other nfsd threads updating the NFSv4 state. The patch does not affect semantics when vfs.nfsd.enable_locallocks=0. PR: 280978 Tested by: Matthew L. Dailey <matthew.l.dailey@dartmouth.edu> (cherry picked from commit eb345e05ac6602eeef0c33fce9025bbc8ec44d0f) sys/fs/nfsserver/nfs_nfsdstate.c | 62 +++++++++++++++++++--------------------- 1 file changed, 30 insertions(+), 32 deletions(-) A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=cc07d914bc80f0c644584de6eab2efd30e911d8d commit cc07d914bc80f0c644584de6eab2efd30e911d8d Author: Rick Macklem <rmacklem@FreeBSD.org> AuthorDate: 2024-09-30 22:49:57 +0000 Commit: Rick Macklem <rmacklem@FreeBSD.org> CommitDate: 2024-10-12 00:49:55 +0000 nfsd: Fix close handling when vfs.nfsd.enable_locallocks=1 nfsrv_freeopen() was being called after the mutex lock was released, making it possible for other kernel threads to change the lists while nfsrv_freeopen() took the nfsstateid out of the lists. This patch moves the code around "if (nfsrv_freeopen(stp, vp, 1 p) == 0) {" into nfsrv_freeopen(), so that it can remove the nfsstateid structure from all lists before unlocking the mutex. This should avoid any race between CLOSE and other nfsd threads updating the NFSv4 state. The patch does not affect semantics when vfs.nfsd.enable_locallocks=0. PR: 280978 Tested by: Matthew L. Dailey <matthew.l.dailey@dartmouth.edu> (cherry picked from commit eb345e05ac6602eeef0c33fce9025bbc8ec44d0f) sys/fs/nfsserver/nfs_nfsdstate.c | 62 +++++++++++++++++++--------------------- 1 file changed, 30 insertions(+), 32 deletions(-) The patch has now been MFC'd. I will leave this PR Open until the reporter reports back w.r.t. success of the patch. Our test system with this patch has now been running non-stop without any panics or errors for almost 24 days. This is well outside of any failure window we've seen in our prior testing, so I'm cautiously optimistic that this patch has fixed this issue. At this point, I think this can be closed. I'll reach out again if we see anything related to this in further testing. Huge thanks to Rick for fixing this! The reporter has had good test results when using the patch (which is now MFC'd to stable/13 and stable/14). |