Bug 260011 - Unresponsive NFS mount on AWS EFS
Summary: Unresponsive NFS mount on AWS EFS
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-RELEASE
Hardware: Any
OS: Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-11-24 09:08 UTC by Alex Dupre
Modified: 2021-11-25 18:24 UTC
CC: 3 users

See Also:


Description Alex Dupre freebsd_committer 2021-11-24 09:08:04 UTC
I'm experiencing annoying issues with an AWS EFS mountpoint on FreeBSD 13 EC2 instances. The filesystem is mounted by 3 instances (2 with the same access patterns, 1 with a different one).

Initially I had the /etc/fstab entry configured with: 

`rw,nosuid,noatime,bg,nfsv4,minorversion=1,rsize=1048576,wsize=1048576,timeo=600,oneopenown`

and after a few days this led my Java application to have all of its threads blocked in `stat64` kernel calls that never returned, without even the ability to kill -9 the process.

After digging into it, this seems to be the normal behavior for hard mount points, even if I fail to understand why one would prefer to have the system completely frozen when the NFS mount point is not responding.

So I later changed the configuration to:

`rw,nosuid,noatime,bg,nfsv4,minorversion=1,intr,soft,retrans=2,rsize=1048576,wsize=1048576,timeo=600,oneopenown`

by adding `intr,soft,retrans=2`.
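
For reference, the full fstab entry then looks something like this (the hostname follows the pattern in the kernel messages below; the mount point path is just a placeholder):

```
# /etc/fstab -- hypothetical entry; filesystem ID and mount point are placeholders
fs-xxx.efs.us-east-1.amazonaws.com:/ /mnt/efs nfs rw,nosuid,noatime,bg,nfsv4,minorversion=1,intr,soft,retrans=2,rsize=1048576,wsize=1048576,timeo=600,oneopenown 0 0
```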

By the way, I think there is a typo in mount_nfs(8): it says to set `retrycnt` instead of `retrans` for the `soft` option. Can you confirm?

After the change `nfsstat -m` reports:

```
nfsv4,minorversion=1,oneopenown,tcp,resvport,soft,intr,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=16777216,timeout=120,retrans=2
```

I wonder why the `timeo`, `rsize`, and `wsize` options seem to have been ignored, but this is irrelevant to the issue.

After a few days, though, the application on the two similar EC2 instances stopped working again. No command accessing the mounted EFS filesystem completed in a reasonable time (ls, df, umount, etc.), but this time I could kill the processes. Still, the only way to recover was to reboot the instances.

On one of them I've seen the following kernel messages, but they were generated only when I tried to debug the issue hours later, and only on one EC2 instance, so I'm not sure whether they are relevant or helpful:

```
kernel: newnfs: server 'fs-xxx.efs.us-east-1.amazonaws.com' error: fileid changed. fsid 0:0: expected fileid 0x4d2369b89a58a920, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
kernel: nfs server fs-xxx.efs.us-east-1.amazonaws.com:/: not responding
```

The third EC2 instance survived and was still able to access the filesystem, but I think it wasn't accessing the filesystem at the time of the network/NFS issue that affected the other two.
Comment 1 Rick Macklem freebsd_committer 2021-11-24 15:44:42 UTC
Assorted comments:
- FreeBSD 13.0 shipped with a bug in the TCP stack which could
  result in a missed socket receive upcall.
  --> This normally causes problems for the server, but could
      cause problems for the client side as well.
  When you do a "netstat -a" on the hung system, this bug shows up
  as the mount's TCP connection stuck in ESTABLISHED with a non-zero Recv-Q.
  --> The fix is to upgrade to stable/13.
  There is more info on this in PR#254590.

- AWS uses a small, fixed number of open_owners, which is why
  "oneopenown" is needed. If it also uses a small, fixed number
  of lock_owners (for byte range locking instead of Windows Open locks),
  then you could run out of these.
  --> If so, you are SOL (Shit Outa Luck) and my only suggestion
      would be to try an nfsv3 mount with the "nolockd" mount option
      (sketched just below this list).
  "netstat -E -c" should show you how many lock_owners are allocated
  at the time of the hang.
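
A minimal sketch of that nfsv3 fallback (the mount point path is a placeholder, not from this PR):

```
# hypothetical nfsv3 mount; "nolockd" keeps byte-range locks local
# on the client instead of going over the wire via NLM
mount -t nfs -o nfsv3,nolockd,rw,nosuid,noatime fs-xxx.efs.us-east-1.amazonaws.com:/ /mnt/efs
```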

Other than that, if AWS has now added support for delegations, there
could be assorted breakage. Not running the nfscbd(8) daemon should
avoid the issuing of delegations, if that is the case.

If a soft mount fails a syscall, then the session slot is screwed up
and this makes the mount fail in weird ways.
"umount -N <mnt_path>" is a much better way to deal with hung mounts.

As a starting point, posting the output of the following commands, run on the client while it is hung, will give us more information:

```
ps axHl
procstat -kk
netstat -a
nfsstat -E -c
```
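
For example, to capture everything into files you can attach to this PR (the output file names are arbitrary):

```
# run on the hung client; the file names are just placeholders
ps axHl > ps.out
procstat -kk > procstat.out
netstat -a > netstat.out
nfsstat -E -c > nfsstat.out
```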

Good luck with it, rick
Comment 2 Colin Percival freebsd_committer 2021-11-24 16:46:17 UTC
No ideas about the hanging, unless it's networking related, but...

> expected fileid 0x4d2369b89a58a920, got 0x2

This looks suspiciously like ENOENT (errno 2) is getting turned into a fileid value somewhere.
Comment 3 Rick Macklem freebsd_committer 2021-11-24 22:45:10 UTC
Or the server has followed the *unix* tradition of
using fileno == 2 for the root directory of a file
system, but did not indicate that a server file system
boundary was crossed by changing the FSID (file system ID),
which is an attribute that defines which server file
system is which.

I have no idea whether Amazon's EFS uses multiple file systems.
Comment 4 Alex Dupre freebsd_committer 2021-11-25 10:15:20 UTC
Thanks for the debug suggestions, I'll run those commands next time it happens and report here.

For the record, I'm not running any additional NFS daemons and have nothing NFS-specific in rc.conf; it's just a plain mount, and it's not a heavily accessed file system either.

I see you recommend using `hard` mounts. I tried `soft` mounts to avoid the infinite hangs and unkillable processes, hoping that would help with recovery, but from what I understand now, even a hard mount point should recover when the NFS server comes back, so the problem is really a different one and affects both mount types.

Any idea why a few arguments seem to have been ignored? Does it make sense to set higher rsize/wsize on TCP endpoints? I see that the recent EFS automounter doesn't use any of them, so it's probably not worth it.
Comment 5 Rick Macklem freebsd_committer 2021-11-25 16:41:04 UTC
Mount options are "negotiated" with the NFS server and
other tunables in the system.
For example, to increase rsize/wsize to 128K, you must
set vfs.maxbcachebuf=131072 in /boot/loader.conf.

To increase rsize/wsize to 1Mbyte, you must
set vfs.maxbcachebuf=1048576 in /boot/loader.conf
and set kern.ipc.maxsockbuf=4737024 (or larger)
in /etc/sysctl.conf.
--> This assumes you have at least 4Gbytes of ram on the
    system.  The further you move away from defaults,
    the less widely tested your configuration is.
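
That is, for the 1Mbyte case (standard file locations; the loader tunable takes effect at the next boot):

```
# /boot/loader.conf -- read at boot
vfs.maxbcachebuf=1048576

# /etc/sysctl.conf -- or set live with sysctl(8)
kern.ipc.maxsockbuf=4737024
```
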
Also, in the case of rsize/wsize, the system will use the
largest size that is "negotiable" given other tuning.
The use of the rsize/wsize options is mainly to reduce
the size below the maximum negotiable.
--> From my limited testing, sizes above 256K do not
    perform better, but what works best for EFS?
    I have no idea.

If a server restarts, clients should recover. If a client
is hung like you describe, either due to an unresponsive server,
a broken server (that generates bogus replies or no replies to
certain RPCs) or a client bug:
# umount -N <mnt_path>
is your best bet at getting rid of the mount.
Comment 6 Rick Macklem freebsd_committer 2021-11-25 17:13:59 UTC
Oh, and my comment w.r.t. not running the nfscbd may
not be good advice. Here's a more complete answer:

The nfscbd(8) provides callback handling when it
is running.  Callbacks need to be working for the
server to issue delegations or layouts (the latter is
pNFS only).

When the nfscbd(8) is not running, the server should
work fine and never issue delegations or layouts.
The server should set a flag in the reply to the Sequence
operation (the first one in each compound RPC)
called SEQ4_STATUS_CB_PATH_DOWN. This is an FYI
for the client.

I found another round of bugs related to delegations
during a recent IETF NFSv4 testing event. These are
fixed in stable/13, but not 13.0.
--> As such, delegations are problematic and you don't
    want them being issued.
    --> Don't run the nfscbd(8) daemon.
Unfortunately Amazon does not attend these testing
events, so what their server does is ??? for me.

However, if it is known that the Amazon EFS never
issues delegations or layouts (I believe cpercival@
said that was the case three years ago), then the
server might be broken and get "perturbed" by the
callbacks not working.
--> In this case, you should run the nfscbd daemon
    by setting nfscbd_enable="YES" in your /etc/rc.conf
    or start it manually.
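
That is, in /etc/rc.conf:

```
nfscbd_enable="YES"
```

followed by `service nfscbd start` to start it without rebooting.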

And, given the above, I think you can see why my initial
advice was just "don't run it".
Comment 7 Colin Percival freebsd_committer 2021-11-25 18:24:04 UTC
FWIW, EFS is still documented as "does not support... Client delegation or callbacks of any type".