Bug 254590 - NFSv4.1 mounts from the Linux client get "stuck" with a partially closed TCP connection
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Rick Macklem
Reported: 2021-03-26 20:49 UTC by Rick Macklem
Modified: 2021-04-19 14:21 UTC (History)

Attachments
add soshutdown() calls to server side krpc for non-functional TCP conn (802 bytes, patch)
2021-03-26 21:01 UTC, Rick Macklem
enable the 6minute krpc timeout for NFSv4.1/4.2 client mounts (910 bytes, patch)
2021-04-05 02:04 UTC, Rick Macklem
enable the 6minute... for FreeBSD12 and FreeBSD13.0 (1.01 KB, patch)
2021-04-05 14:47 UTC, Rick Macklem

Description Rick Macklem freebsd_committer 2021-03-26 20:49:56 UTC

    
Comment 1 Rick Macklem freebsd_committer 2021-03-26 21:01:29 UTC
Created attachment 223621
add soshutdown() calls to server side krpc for non-functional TCP conn

Jason Breitman reported "stuck" Linux NFSv4.1 mounts
against a FreeBSD NFSv4.1 server.
Although the underlying cause is not known, the
TCP connection is in FIN_WAIT2 on the client
and CLOSE_WAIT on the server.

The server side TCP remains in CLOSE_WAIT because
the server's krpc cannot soclose() the socket until
the backchannel is re-assigned to another TCP connection.
This re-assignment happens when the client establishes
a new TCP connection and does a BindConnectionToSession
operation on the new connection.

This patch adds soshutdown(SHUT_WR) calls in the 3 places
where the server krpc knows that the TCP socket is no
longer usable.
--> I think this will allow the TCP connection to proceed
    past CLOSE_WAIT and allow the TCP connection closure
    to complete.
This will hopefully get the Linux mount "unstuck".
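
One way to look for the connection state described above is sketched here. It is not part of the patch. The helper name count_nfs_tcp_state is hypothetical, and the sketch assumes "netstat -an" style output in which the TCP state is the last column.

```shell
#!/bin/sh
# Sketch only: count TCP connections involving the NFS port (2049) that are
# in a given state, reading "netstat -an" style output on stdin.
# count_nfs_tcp_state is a hypothetical helper, not a FreeBSD or Linux tool.
count_nfs_tcp_state() {
    state="$1"
    awk -v state="$state" '
        # Assumed columns: proto recv-q send-q local-addr foreign-addr state
        $1 ~ /^tcp/ && $NF == state {
            # FreeBSD prints "addr.port"; Linux prints "addr:port".
            if ($4 ~ /[.:]2049$/ || $5 ~ /[.:]2049$/) n++
        }
        END { print n + 0 }
    '
}
```

Running "netstat -an | count_nfs_tcp_state CLOSE_WAIT" on the server would be expected to print a non-zero count while a connection is stuck. The state name is passed exactly as the local netstat prints it.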
Comment 2 Rick Macklem freebsd_committer 2021-03-26 21:07:24 UTC
I am hoping testers
Comment 3 Rick Macklem freebsd_committer 2021-03-26 21:11:43 UTC
I am hoping that testing will indicate that, at least,
the patch does not result in a regression.
If so, I will commit it.
(Unfortunately Jason cannot patch his production
 server.)

Jason does report that this script can be run on the
Linux client to "unstick" the mount.
(Essentially, it looks for the stuck TCP connection and then
 blocks network traffic to get the closure to complete.)
#!/bin/sh

progName="nfsClientFix"
delay=15
nfs_ip=NFS.Server.IP.X

# Returns 0 if a connection to the NFS server's port 2049 is in FIN_WAIT2.
nfs_fin_wait2_state() {
    /usr/bin/netstat -an | /usr/bin/grep -F "${nfs_ip}:2049" | /usr/bin/grep -q FIN_WAIT2
    return $?
}


nfs_fin_wait2_state
result=$?
if [ ${result} -eq 0 ] ; then
    /usr/bin/logger -s -i -p local7.error -t ${progName} "NFS Connection is in FIN_WAIT2!"
    /usr/bin/logger -s -i -p local7.error -t ${progName} "Enabling firewall to block ${nfs_ip}!"
    /usr/sbin/iptables -A INPUT -s ${nfs_ip} -j DROP

    while true
    do
        /usr/bin/sleep ${delay}
        nfs_fin_wait2_state
        result=$?
        if [ ${result} -ne 0 ] ; then
            /usr/bin/logger -s -i -p local7.notice -t ${progName} "NFS Connection is OK."
            /usr/bin/logger -s -i -p local7.error -t ${progName} "Disabling firewall to allow access to ${nfs_ip}!"
            /usr/sbin/iptables -D INPUT -s ${nfs_ip}  -j DROP
            break
        fi
    done
fi
Comment 4 Ryan Moeller freebsd_committer 2021-03-26 21:39:39 UTC
I've asked someone to help test this. I imagine there is no need for a setup with Kerberos?
Comment 5 Rick Macklem freebsd_committer 2021-03-26 23:35:43 UTC
You can certainly test it without Kerberos.
Since we do not know the underlying cause,
I do not know if the problem can occur on
non-Kerberized mounts.

If you look at the email here, you can see
that TCP window size adjustment might be at
least part of the underlying cause?
http://docs.FreeBSD.org/cgi/mid.cgi?YQXPR0101MB0968FB1FF0FC481CE37E9A81DD649
Comment 6 Jason 2021-03-31 19:12:42 UTC
I appreciate your work on the patch and testing.  
Are you able to provide me with a target date for the patch to be available via a standard package update?
Comment 7 Rick Macklem freebsd_committer 2021-03-31 22:35:40 UTC
The patch will probably be in FreeBSD 12.3 and 13.1,
whenever those releases occur (6+ months from now, I think).

Unless someone, such as yourself, can confirm that
it fixes the problem, I have no basis on which to
ask re@ to consider it for an errata fix.
(My testing can only try to confirm that it
 does not cause a regression, since I have no
 idea how to reproduce your issue.)

A tester has a problem (which I think is a
different one), but the patch did not fix the
problem for them.
Comment 8 Jason 2021-04-01 15:13:32 UTC
Are you able to provide me with a process to install and uninstall the patch?  Ultimately I would want a package that I could add and remove so that I have a rollback plan if the patch has a negative side effect.
Comment 9 Peter Eriksson 2021-04-01 15:38:12 UTC
This is how I would do it... But your mileage may vary :-)


Install:

> emacs /etc/freebsd-update.conf # Remove "kernel" from Components
> cd /usr/src
> svn checkout https://svn.freebsd.org/base/releng/12.2 .
> patch </PATH/TO/PATCH/FILE
> make buildkernel
> mv /boot/kernel /boot/kernel.ORIGINAL
> make installkernel
> reboot


Backout:

> mv /boot/kernel /boot/kernel.BACKOUT
> mv /boot/kernel.ORIGINAL /boot/kernel
> emacs /etc/freebsd-update.conf # Reinstall "kernel" in Components
# cp -r /boot/kernel /boot/kernel.pre-update
> freebsd-update fetch install # Optionally...
> reboot


It's all described in the FreeBSD handbook (somewhere).
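
The kernel directory renames in the steps above can be wrapped in a small guard so that a second run does not clobber the saved kernel. This is only a sketch: backup_kernel and restore_kernel are hypothetical helper names, and the directory names (kernel.ORIGINAL, kernel.BACKOUT) are taken from the steps above.

```shell
#!/bin/sh
# Sketch only: guarded versions of the "mv" steps above.
# backup_kernel and restore_kernel are hypothetical helpers.

# backup_kernel [BOOTDIR]: save BOOTDIR/kernel as BOOTDIR/kernel.ORIGINAL.
backup_kernel() {
    bootdir="${1:-/boot}"
    if [ ! -d "$bootdir/kernel" ]; then
        echo "no kernel directory in $bootdir" >&2
        return 1
    fi
    # Refuse to overwrite (or move into) an existing saved kernel.
    if [ -e "$bootdir/kernel.ORIGINAL" ]; then
        echo "$bootdir/kernel.ORIGINAL already exists" >&2
        return 1
    fi
    mv "$bootdir/kernel" "$bootdir/kernel.ORIGINAL"
}

# restore_kernel [BOOTDIR]: put kernel.ORIGINAL back, keeping the patched
# kernel (if any) as kernel.BACKOUT.
restore_kernel() {
    bootdir="${1:-/boot}"
    if [ ! -d "$bootdir/kernel.ORIGINAL" ]; then
        echo "no saved kernel in $bootdir" >&2
        return 1
    fi
    if [ -e "$bootdir/kernel.BACKOUT" ]; then
        echo "$bootdir/kernel.BACKOUT already exists" >&2
        return 1
    fi
    if [ -d "$bootdir/kernel" ]; then
        mv "$bootdir/kernel" "$bootdir/kernel.BACKOUT" || return 1
    fi
    mv "$bootdir/kernel.ORIGINAL" "$bootdir/kernel"
}
```

With these, "backup_kernel" would replace the mv before "make installkernel", and "restore_kernel" would replace the two mv commands in the backout. Both default to /boot and must be run as root.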
Comment 10 Rick Macklem freebsd_committer 2021-04-01 22:34:18 UTC
The only thing I'd add to what Peter said is that,
if the kernel won't boot for some reason after
doing "make installkernel", you can use "3" during
booting to get the boot prompt and then type:

boot kernel.ORIGINAL

Oh, and I'd NEVER use emacs;-)
Comment 11 Jason 2021-04-03 14:24:18 UTC
I was able to apply the patch today and will let you know if it resolves the issue.  We should gain confidence after 14 days without an issue and I believe we can say the patch was the solution after 21 days.  I used vi to edit the file based on your recommendation.  :)
Comment 12 Rick Macklem freebsd_committer 2021-04-05 02:04:04 UTC
Created attachment 223815
enable the 6minute krpc timeout for NFSv4.1/4.2 client mounts

The server side krpc has always had a 6 minute
"no activity" timeout for connections. Without
this patch, the timeout is applied only to TCP
connections that are not used for a back channel.
(NFSv3, NFSv4.0 mounts, plus FreeBSD NFSv4.1/4.2
mounts from clients not running the nfscbd(8)
daemon.)

The thinking w.r.t. not doing the timeout for
connections with a back channel was to avoid loss
of the backchannel. This is not a serious concern,
since a normal NFSv4.1/4.2 client will renew the
lease every minute or so and, as such, only a
network partitioning or similar will result in a
6 minute timeout.

However, I have been able to get a Linux NFSv4.1
mount "stuck" indefinitely after a 2 minute network
partitioning without the timeout.

So, this simple patch enables the 6 minute timeout
for all connections.
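
The reasoning above can be illustrated with a toy model of an idle check. This is not the kernel krpc code. The helper name idle_timed_out is hypothetical, and the sketch assumes that a timeout of 0 means the check is disabled for that connection.

```shell
#!/bin/sh
# Toy model of a "no activity" timeout check; NOT the kernel krpc code.
# Assumption: a timeout of 0 disables the check for that connection.

# idle_timed_out LAST_ACTIVE NOW TIMEOUT   (all values in seconds)
# Succeeds (exit status 0) when the connection has been idle longer than
# TIMEOUT; fails otherwise, or when TIMEOUT is 0.
idle_timed_out() {
    last_active="$1"
    now="$2"
    timeout="$3"
    [ "$timeout" -gt 0 ] && [ $(( $now - $last_active )) -gt "$timeout" ]
}
```

In this model a client that renews its lease every 60 seconds never trips a 360 second timeout ("idle_timed_out 0 60 360" fails), a 400 second partition does ("idle_timed_out 0 400 360" succeeds), and a connection with the timeout set to 0 is never closed for inactivity.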
Comment 13 Rick Macklem freebsd_committer 2021-04-05 02:09:13 UTC
Oh, and for older systems without the first patch
found in PR#254560, there is a third case of
  nd->nd_xprt->xp_idletimeout = 0;
that should be deleted.
(Or apply the patch in PR#254560 before "the 6minute" one here.)

This patch is only needed if the NFS server has Linux
NFSv4.1 or 4.2 mounts on it.
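
To see whether a source tree still contains assignments like the one quoted above, something along these lines could be used. It is a sketch: find_idletimeout_resets is a hypothetical helper, and the default path assumes the kernel sources are under /usr/src/sys.

```shell
#!/bin/sh
# Sketch only: list source lines that still set the idle timeout to 0.
# find_idletimeout_resets is a hypothetical helper; the default path
# assumes the kernel sources live under /usr/src/sys.
find_idletimeout_resets() {
    srcdir="${1:-/usr/src/sys}"
    # -r: recurse, -n: print line numbers, -F: match the text literally.
    # grep exits non-zero when nothing matches.
    grep -rnF 'xp_idletimeout = 0' "$srcdir"
}
```

Each line of output names a file and line number to inspect by hand. The sketch only locates candidates and does not edit anything.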
Comment 14 Rick Macklem freebsd_committer 2021-04-05 14:47:39 UTC
Created attachment 223831
enable the 6minute... for FreeBSD12 and FreeBSD13.0

Same patch as 223815, but for FreeBSD12 and FreeBSD13.0.
(223815 is for FreeBSD-current.)
Comment 15 Jason 2021-04-19 13:36:28 UTC
I wanted to provide you with an update.
It has been 14 days without issues which is a good sign.

It should be noted that I only applied patch 223621 - add soshutdown() calls to server side krpc for non-functional TCP conn for bug #254590.
I did not apply the other patch as it was posted after my maintenance window.

It should also be noted that my original kernel was 12.1.

I will apply the same patch to my other production server this coming weekend, provided that we continue without issues on the patched server.
Comment 16 Rick Macklem freebsd_committer 2021-04-19 13:52:42 UTC
Sounds good.

Actually, not applying the second patch
for testing was preferred.
I am still not 100% sure the timeout
should be enabled for NFSv4.1/4.2.

I put it here as "something to try"
if the first patch did not resolve the
problem.

If I recall, you felt that, if you
can run one more week without the
issue, then it can be considered
resolved. Is that correct?

Thanks for testing this.
Comment 17 Jason 2021-04-19 14:21:18 UTC
Correct.
21 days without an issue will be a strong indicator that the bug is resolved.
We were seeing 1 or more NFS hangs every 7 - 10 days for 4 weeks.