| Summary: | ZFS/NFS: Intermittent hangs and crashes after a period of time in: nfscl_hasexpired || dbuf_write_done || zio_execute | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Base System | Reporter: | Pader <ypnow> | ||||||||||||
| Component: | kern | Assignee: | freebsd-fs (Nobody) <fs> | ||||||||||||
| Status: | Closed FIXED | ||||||||||||||
| Severity: | Affects Some People | CC: | fs, grahamperrin, rmacklem | ||||||||||||
| Priority: | --- | Keywords: | crash, needs-qa | ||||||||||||
| Version: | 13.0-RELEASE | Flags: | koobs: maintainer-feedback? (fs); koobs: maintainer-feedback? (rmacklem) | ||||||||||||
| Hardware: | amd64 | ||||||||||||||
| OS: | Any | ||||||||||||||
| Attachments: |
|
||||||||||||||
|
Description
Pader
2021-12-24 14:41:54 UTC
@Reporter Please include additional information:
- /var/run/dmesg.boot (as an attachment)
- pciconf -lv output (as an attachment)

I'll make a few random comments.
1 - Since the crashes occur at different places, I agree with
you that it might be a hardware problem.
It would be nice if you could set up another machine with
exactly the same software/usage and see if it crashes as well.
2 - You should never use "intr" nor "soft" on NFSv4 mounts. This is
mentioned in the BUGS section at the end of "man mount_nfs".
To get rid of a hung NFS mount, use "umount -N <mnt_path>".
(It can take a couple of minutes, but normally succeeds. Note
that any file writing that was in progress when you do this
will get lost.)
3 - If you can still log in and do these when "hung", capture the
output of:
# ps axHl
# procstat -a -kk
# netstat -a
# ping <nfs-server>
4 - nfscl_hasexpired() only gets called when the NFS client does not
get a response from the server for minutes and then gets a
NFSERR_EXPIRED reply from the server.
You did not mention what NFS server you are using. If it is a FreeBSD 13.0
server, then see PR#256280.
If you get the output from #3, please post it here.
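The commands in item 3 can be collected in one pass while the machine is still responsive. The following is only a sketch: the `capture` and `capture_nfs_diag` function names, the output directory, and the `-c 3` ping count are assumptions for illustration and are not part of the request above.

```sh
#!/bin/sh
# Sketch: save the output of each diagnostic command to its own file.
# OUTDIR is set by capture_nfs_diag (hypothetical helper, see below).

capture() {
    # $1 = base name of the output file, remaining args = command to run
    name="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$OUTDIR/$name.txt" 2>&1
    else
        # Record the gap instead of aborting, so the other outputs are kept.
        echo "$1: command not found" > "$OUTDIR/$name.txt"
    fi
}

capture_nfs_diag() {
    # $1 = directory for the output files, $2 = NFS server name or address
    OUTDIR="$1"
    server="$2"
    mkdir -p "$OUTDIR" || return 1
    capture ps       ps axHl
    capture procstat procstat -a -kk
    capture netstat  netstat -a
    capture ping     ping -c 3 "$server"
}
```

A call such as `capture_nfs_diag /var/tmp/nfs-hang <nfs-server>` would leave one `.txt` file per command, ready to attach to the PR. Writing to a local file system rather than to the hung NFS mount is the reason the directory is passed explicitly.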
Created attachment 230434 [details]
dmesg.boot file
Created attachment 230435 [details]
pciconf output
(In reply to Kubilay Kocak from comment #1)
Thanks, I have added those two files as attachments.

(In reply to Rick Macklem from comment #2)
The NFS servers are two Linux servers: one is an old server that has been running for almost 10 years, and the other is a new server we set up recently. I think the NFS servers are fine, because about 6 Linux servers mount these NFS servers and run the same program, and they have not frozen or crashed. The system happened to crash today, so I will attach the output files you asked for.
Created attachment 230436 [details]
ps axHl
Created attachment 230437 [details]
procstat -a -kk
Created attachment 230438 [details]
netstat -a
Would it be possible for you to upgrade your kernel from the sources for stable/13? There is a patch in stable/13 that fixed a problem where open/create would defer to an exclusive lock request and cause a deadlock. It is commit 701eb03cc0dc in stable/13, but I do not know whether it will apply cleanly to the 13.0 kernel sources.

I thought this could only occur when delegations were being issued by the server and, since you are not running nfscbd(8), delegations should never be issued to the client, even for non-FreeBSD servers.

You never mentioned what NFS server you are using. If it is a FreeBSD one, make sure delegations are not enabled. The sysctl vfs.nfsd.enable_delegations should be set to 0 on the server.

Ok, it does look like the commit in stable/13 would fix this hang. It happens because read calls nfscl_hasexpired(), which tries to acquire the exclusive lock (similar to delegation return cases).

However, calling nfscl_hasexpired() should *almost never* happen. It happens when the client has been partitioned from the NFSv4 server for at least a minute. For the FreeBSD NFSv4 server (which is what is called a courteous server), expiry only happens when a conflicting open/lock request is done by another client or when open/lock resources become exhausted. When recovery from "expired" is done, all byte range locks are lost, so getting the client/server into this state is to be avoided if at all possible.

If your NFSv4 server is a Linux one, expiry will happen when the client is network partitioned from the server for over 60sec.

In other words, I think you have some sort of network connectivity problem to the NFS server.

As an alternative to upgrading to stable/13, you could switch to using NFSv3 mounts to avoid the hang.

Oh, and try disabling TSO. It's often broken for various net drivers/chips.

(In reply to Rick Macklem from comment #11)
Thank you, I am now trying to mount the NFS shares as NFSv3. One of the servers mounts fine as NFSv3.
The other one reports:
    RPCPROG_NFS: RPC: Program not registered
Running showmount against that server gives:
    # showmount -e 192.168.1.2
    RPC: Program not registered
    showmount: can't do exports rpc
However, I do not depend heavily on this NFS share, so I do not need to mount it for now. I will continue to observe for a while.

If you are going to do an NFSv3 mount, the NFS server must be configured for that. I can't comment further, since I do not know what kind of NFS server you are using. For the client, rpcbind must be running for NFSv3 mounts.

(In reply to Rick Macklem from comment #10)
It has been 28 days since I switched to NFSv3, and everything is normal. The server is very stable. I think that was the problem.

It has been almost 49 days with NFSv3, and everything is normal.

Since I believe this is fixed in the 13.1 release, I am closing the PR. |
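The "RPC: Program not registered" errors reported earlier in the thread indicate that the server's rpcbind has no entry for the requested service. One way to confirm this before attempting an NFSv3 mount is to inspect `rpcinfo -p` output. The sketch below is illustrative only: the function names are hypothetical, and it assumes the usual `program vers proto port service` column layout of `rpcinfo -p`. 100003 is the RPC program number for NFS and 100005 is the one for mountd.

```sh
#!/bin/sh
# Sketch: read `rpcinfo -p <server>` output on stdin and succeed only if
# the given RPC program/version pair is listed.
rpc_registered() {
    # $1 = RPC program number, $2 = version
    awk -v prog="$1" -v vers="$2" \
        '$1 == prog && $2 == vers { found = 1 } END { exit !found }'
}

# Hypothetical wrapper: check the two services an NFSv3 mount relies on.
check_nfsv3_server() {
    # $1 = server name or address
    listing=$(rpcinfo -p "$1" 2>&1) || { echo "rpcinfo failed: $listing"; return 1; }
    echo "$listing" | rpc_registered 100003 3 || { echo "NFS v3 (100003) not registered"; return 1; }
    echo "$listing" | rpc_registered 100005 3 || { echo "mountd v3 (100005) not registered"; return 1; }
    echo "NFSv3 and mountd are registered on $1"
}
```

Calling `check_nfsv3_server 192.168.1.2` against the server from the showmount example would show which of the two services is missing. Keeping the parsing in `rpc_registered` separate from the network call makes it possible to check saved `rpcinfo` output as well.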