Bug 251347

Summary: NFS hangs on client side when mounted from outside in Jail Tree (BROKEN NFS SERVER OR MIDDLEWARE)
Product: Base System
Reporter: Arne Steinkamm <arne>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: New
Severity: Affects Only Me
CC: arne, julien, kib, rmacklem
Priority: ---
Version: 12.1-STABLE
Hardware: Any
OS: Any
Attachments:
tar file with all configs and command outputs of server, client and jail

Description Arne Steinkamm 2020-11-24 18:24:03 UTC
Created attachment 219934 [details]
tar file with all configs and command outputs of server, client and jail

Hi folks,
I ran into a really nasty problem after introducing FreeBSD to a really big company.

All machines are running FreeBSD 12.1-RELEASE and use 10 GBit/s Ethernet links with Cisco Nexus 55xx switches. Additional 1 GBit/s networks exist for admin- and security-related traffic.

I have this problem on all my NFS client machines running FreeBSD.

The NFS server (FreeBSD 12.1) exports a ZFS dataset of about 1.5 TB.

Linux clients mounting this dataset via NFSv3 over TCP have no problems.

FreeBSD hosts mount this dataset with NFSv3 over TCP.

Scenario 1)
FreeBSD client mounts share. No problems.

Scenario 2)
FreeBSD client mounts the share and relays it into VNET (if_bridge) jails using nullfs mounts.

Scenario 3)
FreeBSD client mounts the share twice. One mount point is located in its primary file system tree and the other one is inside the jail's subtree.

Accessing the share from inside the jail hangs the NFS mount after only a few minutes of usage.
In Scenario 3 only the mount inside the jail's directory subtree is blocked; the other one keeps working. Nevertheless, df & friends on the main host are blocked too, because they touch the jail's subtree on their traversal down.
The same happens in Scenario 2; there, the only NFS mount is the one that blocks.
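To make this concrete, this is roughly what the mounts look like in scenarios 2 and 3 (the export path and mount points below are illustrative placeholders, not the exact production paths):

  # Scenario 2: NFS mount on the host, relayed into the VNET jail via nullfs
  mount -t nfs -o nfsv3,tcp 192.168.67.38:/tank/share /l/share
  mount -t nullfs /l/share /l/prison/login1/l/share

  # Scenario 3: the same share mounted twice from the host,
  # once in the primary tree and once inside the jail's subtree
  mount -t nfs -o nfsv3,tcp 192.168.67.38:/tank/share /l/share
  mount -t nfs -o nfsv3,tcp 192.168.67.38:/tank/share /l/prison/login1/l/share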

There are no error messages at this point. No console messages about stale NFS handles etc. Nothing in the jail's console file.

umount of the mount blocks too, but the blocked mount's line vanishes from the mount(8) output.
The only way to unfreeze is to stop the jail in whose subtree the nullfs or NFS mount is located.

Stopping the jail immediately frees the hanging df and ls etc. processes.

After the umount -f or jail stop, I can read on the console:
newnfs: server '192.168.67.38' error: fileid changed. fsid 0:0: expected fileid 0x4, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)

I read the comments about this error message in /usr/src/sys/fs/nfsclient/nfs_clport.c. They do not match this setup. There is no broken middleware... only Cisco switches, an HP ProLiant Gen9 FreeBSD NFS server and a few Oracle SunFire-something x86_64 FreeBSD clients.

The NFS Server is running FreeBSD 12.1 and the network is as simple as a 10 GBit/s network can be.

Playing with enforce_statfs seems to have no effect.
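For reference, enforce_statfs was set per jail in jail.conf roughly like this (example values; the different settings made no visible difference):

  # /etc/jail.conf, inside the login1 jail block (illustrative)
  login1 {
      # ... other jail parameters ...
      enforce_statfs = 1;   # also tried 0 and 2
  }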

What's going wrong here?

Some details:
I use the jib script from /usr/src/share/examples/jails after adding two lines for MTU support to handle the jumbo-frame 10 Gig network.
Firewall support is enabled in the kernel, but to narrow down the problem everything is disabled or in pass-through configurations.
There are no other visible problems. 

I attach these files:

adm01:
total 240
-rw-r--r--  1 arne  ego  21833 Nov 24 18:24 NFS-Client-ADM001
-rw-------  1 arne  ego   3254 Nov 24 00:07 NFS-Client-devfs.rules
-rw-r-----  1 arne  ego  17288 Nov 24 18:57 NFS-Client-df
-rw-r--r--  1 arne  ego  19051 Nov 24 18:24 NFS-Client-dmesg.boot
-rw-r--r--  1 arne  ego    636 Nov 24 14:41 NFS-Client-fstab
-rw-r-----  1 arne  ego  25965 Nov 24 18:24 NFS-Client-ifconfig-a
-rw-r--r--  1 arne  ego   3600 Nov 24 18:33 NFS-Client-jail.conf
-rw-r--r--  1 arne  ego   1239 Nov 23 23:31 NFS-Client-loader.conf
-rw-r-----  1 arne  ego    522 Nov 24 18:58 NFS-Client-nfsstat-m
-rw-r-----  1 arne  ego  76519 Nov 24 03:23 NFS-Client-pf.conf
-rw-r--r--  1 arne  ego  25053 Nov 24 18:21 NFS-Client-rc.conf
-rw-r--r--  1 arne  ego   1084 Nov 23 23:33 NFS-Client-sysctl.conf

fs001:
total 124
-rw-r--r--  1 arne  ego  21767 Nov 24 19:05 Fileserver-FS001
-rw-r--r--  1 arne  ego  27395 Nov 24 19:06 Fileserver-dmesg.boot
-rw-r-----  1 arne  ego  14781 Nov 24 19:03 Fileserver-ifconfig-a
-rw-r--r--  1 arne  ego   1020 Nov 16 21:40 Fileserver-loader.conf
-rw-r--r--  1 arne  ego  14356 Nov 24 19:04 Fileserver-rc.conf
-rw-r--r--  1 arne  ego   1106 Nov 16 21:38 Fileserver-sysctl.conf
-rw-r-----  1 arne  ego   6398 Nov 24 18:14 Fileserver-zfs_properties

jail:
total 28
-rw-r-----  1 arne  ego  1671 Nov 24 19:01 inside-jail-ifconfig-a
-r--------  1 arne  ego     9 Nov 24 00:22 inside-jail-pf.conf
-rw-r--r--  1 arne  ego   636 Nov 24 19:00 inside-jail-rc.conf
-rw-r--r--  1 arne  ego   709 Nov 24 00:17 inside-jail-sysctl.conf

Thanks in advance,
.//. Arne
Comment 1 Conrad Meyer freebsd_committer 2020-11-24 19:42:46 UTC
It seems like the mount -> nullfs -> export setup is producing corrupt NFS RPC responses (i.e., "BROKEN NFS SERVER").
Comment 2 Arne Steinkamm 2020-11-24 20:32:32 UTC
export?

Only the fileserver is exporting. Do you mean the jails on the fileserver?

None of the NFS clients I have trouble with are exporting anything.

Thanks,
.//. Arne
Comment 3 Arne Steinkamm 2020-11-25 06:42:17 UTC
A few new findings:

Binding the portmapper, nfsd, mountd, et al. to one IP address on both server and client changed nothing. This is no surprise; I'm in the trial-and-error phase.
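For completeness, the binding was done via rc.conf flags roughly like this (the address is illustrative):

  # /etc/rc.conf on the file server (illustrative address)
  rpcbind_flags="-h 192.168.67.38"
  mountd_flags="-h 192.168.67.38"
  nfs_server_flags="-u -t -h 192.168.67.38"
  # client side: rpcbind_flags, rpc_lockd_flags and rpc_statd_flags take -h the same way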

Switching to NFSv4 makes no difference but gives a more detailed error message on the console every time I stop the jail associated with the nullfs mount.

newnfs: server '192.168.67.38' error: fileid changed. fsid deaf3afe:f75a86de: expected fileid 0x4, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)

This message is only on the console. There is nothing in dmesg output or log files.

Can someone enlighten me on how to decode the error message? I have gotten the exact same numbers twice now, so I'm curious.
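My understanding so far is that the fsid identifies the mount's file system and the fileid is the file's inode number on the server, so I assume something along these lines (export path is just a placeholder) would show which objects 0x4 and 0x2 actually are:

  # on the server; fileids should be inode numbers, 0x4 = 4, 0x2 = 2
  ls -di /tank/share
  find /tank/share -xdev \( -inum 2 -o -inum 4 \) -print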

Thanks in advance,
.//. Arne
Comment 4 Arne Steinkamm 2020-11-25 06:57:00 UTC
The FreeBSD-based ZFS NFS server does not seem to be the problem.
I now get exactly the same problem with another share mounted from a Linux server.
Comment 5 Konstantin Belousov freebsd_committer 2020-11-25 22:24:39 UTC
First, switch to 12.2.  There were fixes for some NFS client issues that might be
relevant (or might not be).

I am sure that the NFS client is not VNET-aware, i.e. it does not switch context
to the proper VNET as needed, especially when offloading async I/O to the nfsiod daemons.
In other words, I do not expect scenario 3 to work.

For the second scenario, a nullfs remount of the NFS mount, the VNET jail should be
irrelevant.  When the hang occurs, gather the procstat -kk -a output (on the
host).  Also it might be worth trying '-o nocache' for the nullfs mounts; I remember there were some problems with nullfs+nfs+caching.
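Something along these lines (paths are placeholders, adjust to your setup):

  # on the host, while the mount is hung
  procstat -kk -a > /var/tmp/procstat-kk.out

  # nullfs mount with caching disabled (placeholder paths)
  mount -t nullfs -o nocache /path/on/host /path/in/jail/tree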
Comment 6 Rick Macklem freebsd_committer 2020-11-26 00:31:54 UTC
I am not a jails guy, so I don't know if that was a factor.
W.r.t. "..BROKEN MIDDLEWARE..": it refers to a case where
something between the NFS server and client tries to cache
Getattr replies, but doesn't get it right.

Try testing where the NFS client connects to the server
without anything like a cisco switch in between. Just a wire
or a really dumb no frills switch.

Also, try an NFSv4 mount, since NFSv4 RPCs are much harder
for middleware to cache and generate replies for.
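For example, something like this (export path and mount point are placeholders):

  # NFSv4 mount over TCP
  mount -t nfs -o nfsv4,tcp 192.168.67.38:/share /mnt/share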

If the problem persists, add another comment to this bug
report and we can try some other stuff.
(The error is generated because a file's fileno has changed
 and that should never happen.)
My guess is that having the two mounts for the same file
system has somehow triggered the generation of a bogus cached
reply.
Comment 7 Arne Steinkamm 2020-11-26 00:48:17 UTC
(In reply to Rick Macklem from comment #6)

Thanks Rick,

as said before:
Using NFSv4 changes nothing :-(

I'm unable to change the wiring because the complete setup is in a computer center 600 km from here, but I have access to all the lights-out service processors, the switches, etc.

I have the same problem now with a Linux NFS server so I think the problem is not on the server side.

To make the picture clearer:

There are two main administrative servers: adm001 and adm002.
Both have a couple of VNET jails (if_bridge, not netgraph) running nameservers etc.
And both have so-called login server jails: just a jail people can ssh to and from which they jump (ssh) into the development networks.

Two file servers, one Linux and one FreeBSD, share the $HOMEs and a pool share.

These are mounted on adm001 and adm002 at /l/home and /l/pool.

nullfs mounts forward these into the two login jails:

/l/home --> /l/prison/login1/l/home and
/l/pool --> /l/prison/login1/l/pool.

Same procedure on adm002 with login002.
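In fstab terms the setup looks roughly like this (server name and export paths are placeholders):

  # /etc/fstab on adm001 (same pattern on adm002)
  fileserver:/export/home  /l/home                   nfs     rw,nfsv3,tcp  0  0
  fileserver:/export/pool  /l/pool                   nfs     rw,nfsv3,tcp  0  0
  /l/home                  /l/prison/login1/l/home   nullfs  rw            0  0
  /l/pool                  /l/prison/login1/l/pool   nullfs  rw            0  0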

After 30 seconds to 60 minutes one of the two NFS mounts freezes.
df, ls, umount etc. hang.

And now the strange part:

  "sh /etc/rc.d/jail stop login1"

immediately unfreezes the NFS mount and everything is working normal again but, of course, without the jail.

The second the jail is dead, the "BROKEN NFS..." text is displayed only on the console, not in the dmesg buffer or log files, and, strangely, only on adm001.
adm001 and adm002 are different hardware (different CPU) but they are clones, real clones made with zfs send/receive. Of course the hostids are different.

The workaround is to allow the users to use adm001 and adm002 directly. But this is not the way we want to have it.

When I don't use nullfs and instead make two independent NFS mounts at /l/home and /l/prison/login1/l/home, then only the second one freezes.

I guess this is not normal and a real bug... 

Any help?

Thanks in advance
.//. Arne
Comment 8 Rick Macklem freebsd_committer 2020-11-26 03:06:15 UTC
Ok, I took a quick look and I don't think
an NFS mount within a vnet jail is going to work.

When the kernel RPC does a socreate() it passes
the argument cred as curthread->cred.
If this is done by an nfsiod thread, the credential
won't be for the correct vnet.

This can be fixed by adding cred arguments to assorted
functions so that the credential used at mount time can
be passed in, but the fix is not trivial.

I've never used a vnet jail, so there might be other
issues.
Sorry, but the short answer is that you will need to
figure out a way to do what you are doing without an
NFS mount in a vnet jail, I think.

I'll take this bug, but don't expect a quick fix.
Comment 9 Arne Steinkamm 2020-11-26 03:32:41 UTC
(In reply to Rick Macklem from comment #8)

Yes, we know that NFS does not work from inside a jail.
This is why using nullfs has been best practice for years now to give a jail (VNET or not) access to an NFS mount point.

The real NFS mount point is outside the jail.
And nullfs should cover this. Shouldn't it?

And:
I have no idea what triggers the freeze...
I simulate heavy (!) load from inside the jail and it works for up to about an hour and many gigabytes without problems...

Really, the answer can't be to use a FUSE'd NTFS filesystem to share mount points between two FreeBSD servers...
Comment 10 Rick Macklem freebsd_committer 2020-11-26 23:04:46 UTC
Well, I took a closer look at the krpc code and it does
do TCP reconnects with the mount-time credentials.
Although I haven't tested it, I do know some people
use NFS mounts within jails, but I don't know
if anyone does an NFS mount within a VNET jail.
Comment 11 Julien Cigar 2020-11-27 09:27:41 UTC
Just to say that we have been using NFS (v4) mounts inside (non-VNET) jails for years and it works like a charm; we have never had an issue:

for example on one of our production server:
orval% mount -t nfs
filer.prod.lan:/pictures/collections on /usr/jails/www1/filer/pictures/collection (nfs, read-only, nfsv4acls)
filer.prod.lan:/webapps on /usr/jails/www1/filer/webapps (nfs, nfsv4acls)
filer.prod.lan:/documents on /usr/jails/www1/filer/documents (nfs, read-only, nfsv4acls)
filer.prod.lan:/geoserver on /usr/jails/java1/filer/geoserver (nfs, nfsv4acls)
filer.prod.lan:/pictures/collections on /usr/jails/j_www1/filer/pictures/collection (nfs, read-only, nfsv4acls)
filer.prod.lan:/webapps on /usr/jails/j_www1/filer/webapps (nfs, nfsv4acls)
filer.prod.lan:/documents on /usr/jails/j_www1/filer/documents (nfs, read-only, nfsv4acls)
filer.prod.lan:/apache on /usr/jails/j_www1/filer/apache (nfs, nfsv4acls)
filer.prod.lan:/pypi on /usr/jails/j_www1/filer/devpi (nfs, nfsv4acls)
filer.prod.lan:/webapps/phegea on /usr/jails/j_www1/filer/webapps/phegea (nfs, read-only, nfsv4acls)
filer.prod.lan:/pictures/collections on /usr/jails/j_www1_rb1/filer/pictures/collection (nfs, read-only, nfsv4acls)
filer.prod.lan:/webapps on /usr/jails/j_www1_rb1/filer/webapps (nfs, nfsv4acls)
filer.prod.lan:/ipt on /usr/jails/j_ipt1/filer/ipt (nfs, nfsv4acls)
filer.prod.lan:/geoserver on /usr/jails/geoserver1/filer/geoserver (nfs, nfsv4acls)

(BTW, I remember wondering some years ago what would be the best way of mounting several NFS shares in a jail: one mount on the HOST + nullfs, or multiple mounts of the same share in different jails; the latter turned out to be the best.)