Bug 178231

Summary: [nfs] 8.3 nfsv4 client reports "nfsv4 client/server protocol prob err=10026"
Product:   Base System
Component: kern
Version:   Unspecified
Hardware:  Any
OS:        Any
Status:    Open
Severity:  Affects Only Me
Priority:  Normal
Reporter:  Alex Strange <astrange>
Assignee:  freebsd-bugs (Nobody) <bugs>
CC:        chris.stephan, ml

Description Alex Strange 2013-04-29 06:40:01 UTC
I have a busy web server which mounts two other systems over NFSv4 and writes files to them. All three systems are running FreeBSD 8.3 amd64.

dmesg frequently fills up with this:
nfsv4 client/server protocol prob err=10026
nfsv4 client/server protocol prob err=10026


which causes file writes to fail in specific worker processes. Killing the affected process while the logging is happening fixes the issue for a while, but doing so makes in-flight requests fail, so we'd prefer to avoid it.
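
For what it's worth, the numeric code can be looked up in the NFSv4 protocol header if the kernel sources are installed (path assumed from a stock src install); on our system 10026 appears to correspond to NFSERR_BADSEQID, i.e. a bad open/lock sequence number:

% grep -w 10026 /usr/src/sys/fs/nfs/nfsproto.h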

nfs activity:
% nfsstat -w 1 -e
        GtAttr Lookup Rdlink   Read  Write Rename Access  Rddir

Client:   1106    880      0      0    424     60    880      0
Server:      0      0      0      0      0      0      0      0

nfs related sysctls:
vfs.newnfs.downdelayinitial: 12
vfs.newnfs.downdelayinterval: 30
vfs.newnfs.keytab_enctype: 1
vfs.newnfs.skip_wcc_data_onerr: 1
vfs.newnfs.nfs3_jukebox_delay: 10
vfs.newnfs.reconnects: 0
vfs.newnfs.bufpackets: 4
vfs.newnfs.callback_addr: 
vfs.newnfs.nfs4acl_enable: 1
vfs.newnfs.realign_count: 0
vfs.newnfs.realign_test: 0
vfs.newnfs.nfs_directio_allow_mmap: 1
vfs.newnfs.nfs_directio_enable: 0
vfs.newnfs.clean_pages_on_close: 1
vfs.newnfs.commit_on_close: 0
vfs.newnfs.prime_access_cache: 0
vfs.newnfs.access_cache_timeout: 60
vfs.newnfs.diskless_rootpath: 
vfs.newnfs.diskless_valid: 0
vfs.newnfs.nfs_ip_paranoia: 1
vfs.newnfs.defect: 0
vfs.newnfs.iodmax: 4
vfs.newnfs.iodmin: 0
vfs.newnfs.iodmaxidle: 120
vfs.newnfs.disable_checkutf8: 0
vfs.newnfs.server_max_nfsvers: 4
vfs.newnfs.server_min_nfsvers: 2
vfs.newnfs.nfs_privport: 0
vfs.newnfs.enable_locallocks: 0
vfs.newnfs.issue_delegations: 0
vfs.newnfs.commit_miss: 0
vfs.newnfs.commit_blks: 0
vfs.newnfs.mirrormnt: 1
vfs.newnfs.minthreads: 1
vfs.newnfs.maxthreads: 4
vfs.newnfs.threads: 0
vfs.newnfs.request_space_used: 0
vfs.newnfs.request_space_used_highest: 0
vfs.newnfs.request_space_high: 47185920
vfs.newnfs.request_space_low: 31457280
vfs.newnfs.request_space_throttled: 0
vfs.newnfs.request_space_throttle_count: 0

Applicable fstab:
boards:/boards /www/boards nfs rw,noatime,nfsv4,async,nolockd,late,intr,bg,timeout=5 0 0
images:/images /www/images nfs rw,noatime,nfsv4,async,nolockd,late,intr,bg,timeout=5 0 0
images:/thumbs /www/thumbs nfs rw,noatime,nfsv4,async,nolockd,late,intr,bg,timeout=5 0 0

How-To-Repeat: Wait.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2013-05-20 04:29:21 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).
Comment 2 Rick Macklem freebsd_committer freebsd_triage 2013-05-20 22:06:30 UTC
State Changed
From-To: open->feedback


I have emailed the reporter and asked him to test without 
the "intr" mount option, which could cause this. The error 
indicates that the client got "out of sync" with the server 
while doing open/lock operations, which are serialized via 
a sequence number.
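As a sketch, the first fstab line from the report with only 
"intr" dropped (everything else left exactly as reported) 
would be: 

boards:/boards /www/boards nfs rw,noatime,nfsv4,async,nolockd,late,bg,timeout=5 0 0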
Comment 3 Eitan Adler freebsd_committer freebsd_triage 2017-12-31 07:59:45 UTC
For bugs matching the following criteria:

Status: In Progress Changed: (is less than) 2014-06-01

Reset to default assignee and clear in-progress tags.

Mail being skipped
Comment 4 ml 2020-11-21 15:56:54 UTC
Hello.

I've got a 12.2 client connecting to a 12.2 server and I'm seeing this message a few times a day.
I'm not sure it leads to consequences, though.

More rarely, I also see the same message but with error 10020, along with "nfsv4 expired locks lost".

I don't use the intr option.



I'd be glad to help investigate this, but I need guidance.
Comment 5 Chris Stephan 2021-10-02 01:39:48 UTC
Seeing same logs as OP: nfsv4 client/server protocol prob err=10026

In preparation for migrating our next production deployment to NFSv4 over TLS, we have built a development environment on 13.0-RELEASE. All hosts were built from scratch using the 13.0-RELEASE txz's, expanded into custom ZFS datasets on Supermicro X9 series Intel hardware. The NFS clients are diskless Dell workstations, booting via PXE from /srv/tftpboot, with root on MFS and NFS shares from the server mounted on /net. The MFS image includes most of base.txz; minor bits have been removed, mainly from /usr/share and /boot, that are better suited to tmpfs or NFS. /usr/bin and /usr/sbin are untouched. The build method is identical to what was done for our 12.2 deployment, except that the source is now the 13.0-RELEASE tarball and we use NFSv4 instead of NFSv3.

## /etc/fstab entries tried:
192.168.10.101:/   /net    nfs    nfsv4,rw,hard,tcp             0 0
192.168.10.101:/   /net    nfs    nfsv4,rw,soft,retrycnt=0,tcp  0 0

All works fine until one of the testers starts Chromium, which, after even minor browsing, causes the whole window system to freeze. Originally we thought this was a bug in Chromium or the X server. After trying to isolate it with dtrace for the last three days, we found we can trigger the event by waiting for the window system to freeze, attaching dtrace to the window manager (fluxbox), and right-clicking on the desktop to open a menu; that triggers the open() call that pulls the menu file from the user's home directory at:

`/net/home/<user>/.fluxbox/menu`
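
For anyone who wants to watch the same thing, this is roughly the kind of D script we attached (a sketch from memory, not the exact script we ran; the script name, the execname match, and the argument indices are my additions, and open(2) may show up as either the open or the openat syscall):

/* open-trace.d: print every path fluxbox tries to open */
syscall::open:entry
/execname == "fluxbox"/
{
        printf("open   %s", copyinstr(arg0));
}

syscall::openat:entry
/execname == "fluxbox"/
{
        printf("openat %s", copyinstr(arg1));
}

Run with:

% dtrace -s open-trace.d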

NFS never fulfills this request. It locks up the window manager, and I can switch to another VT and troubleshoot from there. On the NFS server, /usr/sbin/nfsdumpstate shows there are locks for each of the clients running Chromium. On the clients, /usr/bin/nfsstat shows thousands of timeouts and retries, but they have stopped incrementing by the time everything has locked up. When checking the stats even after hours of running (as long as Chromium is not started), the stats in question stay at 0. It seems apparent that the NFS client has seized at this point and cannot recover. Rebooting a client does not clear the locks on the server, and clear_locks does not appear to resolve the server side either (though I'm not sure clear_locks works with NFSv4).
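
One thing I have not tried yet, and am only guessing might help clear the stale server-side state for a rebooted client, is nfsrevoke(8) (syntax from memory; check the man page):

% nfsdumpstate
% nfsrevoke <client-host-or-IP>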

Any application, CLI or GUI, that accesses the NFS file system locks up and never returns. If I try to umount [-f] /net, the command locks the VT; if I try to read the subtree of /net, that command locks the VT as well.
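
If I remember the flag correctly, umount has an -N option for forcibly dismounting an unresponsive NFS mount without waiting on the server; I have not yet confirmed whether it gets a wedged client back here:

% umount -N /net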

Our previous setup uses NFSv3+KRB5i/p on 12.2-RELEASE-p10 and works flawlessly. 

We have also tried connecting a lab client from our 12.2 cluster as an NFSv4 client to the 13.0 server, and the same thing happens. I am not willing to connect a client via NFSv4 to our 12.2 cluster, because I really don't want to risk locking up that server while active testing is going on.

When we change the client mounts to NFSv3, all is well, so this definitely feels like a bug in the NFSv4 client or server.

Anyway, I'm at the point where I feel there are smarter folks than I who might be interested in looking at this. I have a relatively idle cluster ready for any testing anyone wants to throw at it.
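
If it saves a round trip: when someone does pick this up, I assume a packet trace of the hang is the first thing that will be wanted, something along the lines of (server address taken from the fstab entries above, file name arbitrary):

% tcpdump -s 0 -w /tmp/nfs4-hang.pcap host 192.168.10.101 and port 2049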