Bug 249871 - NFSv4 faulty directory listings under heavy load
Summary: NFSv4 faulty directory listings under heavy load
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 12.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-09-25 02:51 UTC by Jason W. Bacon
Modified: 2020-10-25 01:25 UTC
CC List: 2 users

See Also:


Attachments
log readdirplus VOP_LOOKUP() failures (355 bytes, patch)
2020-10-03 02:40 UTC, Rick Macklem
small test program that reads a directory and prints out d_off (883 bytes, text/plain)
2020-10-22 00:11 UTC, Rick Macklem
allow readahead=0 NFS mount option (428 bytes, patch)
2020-10-25 01:21 UTC, Rick Macklem

Description Jason W. Bacon freebsd_committer 2020-09-25 02:51:44 UTC
I think I've discovered a peculiar bug in NFSv4.  When the server is under heavy load, directory listings sometimes show duplicate filenames and other times omit filenames.

This was discovered when running parallel jobs on a small HPC cluster, each running xzcat on an NFS-served file, dumping the uncompressed output to a local disk on the client, followed by some brief heavy computation and writing several small output files to the NFS server.  As shown below, 11,031 files were processed.  Parallel jobs were capped at between 50 and 150 at a time, and the problem occurred with any cap.

All files list-*.txt shown below were produced by

    ls | grep 'combined.*-ad\.vcf\.xz'

or

    find . -maxdepth 1 -name 'combined.*-ad.vcf.xz'

The file list-1.txt contains the correct directory listing.

list-100.txt, however, contains duplicate filenames, and list-1000.txt has both duplicate and missing filenames.

# sort list-1.txt | uniq -d

# sort list-100.txt | uniq -d
combined.NWD297242-ad.vcf.xz
combined.NWD745320-ad.vcf.xz
combined.NWD787696-ad.vcf.xz

# wc -l list-1.txt list-100.txt list-1000.txt
   11031 list-1.txt
   11034 list-100.txt
   11027 list-1000.txt
   33092 total

# diff list-1.txt list-100.txt
2404a2405
> combined.NWD297242-ad.vcf.xz
7856a7858
> combined.NWD745320-ad.vcf.xz
8391a8394
> combined.NWD787696-ad.vcf.xz

# diff list-1.txt list-1000.txt
153a154
> combined.NWD111306-ad.vcf.xz
170d170
< combined.NWD113182-ad.vcf.xz
512d511
[snip]

If I revert the mounts to NFSv3, the problem goes away (but performance suffers).

There are no apparent problems delivering file content, just directory listings.  Using this fact, I can work around the problem by writing the directory listing to a file beforehand, when the server is not under load:

    ls | grep 'combined.*-ad\.vcf\.xz' > VCF-list.txt

Reading this file under heavy load does not pose any problems.  Problems occur only when I generate a new directory listing with "ls" or "find".

The problem is consistently reproducible under heavy load and does not occur under light load.

/etc/exports:

V4: /

/etc/zfs/exports:

# !!! DO NOT EDIT THIS FILE MANUALLY !!!

/pxeserver/images	-alldirs -ro -network 192.168.0.0 -mask 255.255.128.0 
/raid-00	-maproot=root -network 192.168.0.0 -mask 255.255.128.0 
/sharedapps	-maproot=root -network 192.168.0.0 -mask 255.255.128.0 
/usr/home	-maproot=root -network 192.168.0.0 -mask 255.255.128.0 
/var/cache/pkg	-maproot=root -network 192.168.0.0 -mask 255.255.128.0 

/etc/fstab on the clients:

login:/usr/home         /usr/home       nfs     rw,bg,intr,noatime 0       0
login:/raid-00          /raid-00        nfs     rw,bg,intr,noatime 0       0
login:/sharedapps       /sharedapps     nfs     rw,bg,intr,noatime 0       0
login:/var/cache/pkg    /var/cache/pkg  nfs     rw,bg,intr,noatime 0       0
Comment 1 Rick Macklem freebsd_committer 2020-09-26 15:48:44 UTC
Yep, that is weird.
Assuming the clients are FreeBSD and not Linux,
the only thing I can think of to try is take
the "intr" option off the mounts.
The BUGS section of "man mount_nfs" notes it
should never be used.

If somehow a signal were to be posted to
the process on the client, that might
explain this, if a sleep() returns ERESTART
or something like that.
--> Anyhow, for reasons mostly related to
    sessions (or lock sequencing for NFSv4.0)
    you should never use "intr" nor "soft"
    on NFSv4 mounts.

I'll look through the code, but the NFSv3
and NFSv4 code is very similar for Readdir.
One more question:
- Are you using nfsuserd or uids in strings?
If the former, you could try setting
vfs.nfs.enable_uidtostring=1
vfs.nfsd.enable_stringtouid=1
and then don't run nfsuserd.
(When the uid<->name cache misses, the
 upcall to the nfsuserd could take a long
 time on the heavily loaded server.
 However, I cannot think why slow response
 could cause this unless it is related to
 using "intr" on the mounts.)

If the clients are Linux, then I vaguely
recall mention of problems with reading
large directories being discussed on
linux-nfs@vger.kernel.org.
Comment 2 Jason W. Bacon freebsd_committer 2020-10-01 18:41:39 UTC
Thanks for the suggestions.  I disabled intr and I'm still getting variable results.

Server and clients are all FreeBSD 12.1 and I am not using nfsuserd.

Server:

vfs.zfs.min_auto_ashift=12
net.inet.icmp.drop_redirect=1
net.inet.ip.sourceroute=0
net.inet.ip.accept_sourceroute=0
net.inet.tcp.rfc1323=0
vfs.nfs.enable_uidtostring=1
vfs.nfsd.enable_stringtouid=1
vfs.nfsd.tcphighwater=20000
vfs.nfsd.tcpcachetimeo=3600

Clients:

vfs.nfs.enable_uidtostring=1

Happy to try other suggestions if you have any.
Comment 3 Rick Macklem freebsd_committer 2020-10-03 02:40:54 UTC
Created attachment 218479 [details]
log readdirplus VOP_LOOKUP() failures

This little patch adds a printf() that logs VOP_LOOKUP()
errors, if any occur in the NFSv4 server's readdir.
--> VOP_LOOKUP() gets used instead of VFS_VGET() when
    the NFSv4 server is performing Readdir over ZFS,
    since VFS_VGET() is not reliable for ZFS.

    - If the VOP_LOOKUP() fails (I do not think it ever
      should), that would corrupt the Readdir reply.

If you can add this patch to your server and note if
the printf() happens (especially when "ls" fails),
please do so.
Comment 4 Rick Macklem freebsd_committer 2020-10-03 02:56:30 UTC
Another thing to try is disabling the name cache.
I think the sysctl is called debug.vfscache?
(For some reason it has disappeared from head/current,
 but should be in 12.1.)
--> Set it to 0 on the server and see if the
    problem still exists.
Comment 5 Jason W. Bacon freebsd_committer 2020-10-04 15:23:21 UTC
Disabling debug.vfscache didn't help.

Adding the patch will take a little more doing, as this is a production system.  I'll try reproducing the problem on a private test cluster first, maybe later this week.

A little more info: The problem seems to be limited to the directory being hammered by multiple readers.  Listing a sibling directory produces consistent and correct results.
Comment 6 Rick Macklem freebsd_committer 2020-10-04 23:34:18 UTC
One additional thing you could do is:
# vmstat -z | fgrep mbuf_cluster
on the server when it is under heavy load,
to see if it is running out of mbuf clusters.
(Theoretically the NFS server should keep
 working when mbuf clusters are exhausted,
 but the result would be a Readdir reply
 made up of a long list of regular mbufs.
 That could impact things like TSO, if the
 net interface on the server has that
 enabled.)

Basically, other than possible mbuf exhaustion,
I can't think of any way heavy load would
affect the NFS server code (except slower
response).

Since the name cache doesn't seem to be the
culprit, that leaves all the caching that
goes on inside ZFS.
--> If the readdir contents are somehow reordered
    by ZFS when the directory is under heavy
    readdir load or the directory offset cookies
    somehow change, that would explain the
    problem.

Yet one more thing that you could try is having
client mounts done with "nfsv3,rdirplus".
You mentioned that NFSv3 worked ok.
NFSv3 + rdirplus works more closely to NFSv4
in the server, and whether or not that fails
might be useful information.
Comment 7 Jason W. Bacon freebsd_committer 2020-10-07 15:29:01 UTC
Doesn't appear to be an mbuf issue.  Running the two scripts below simultaneously:

#!/bin/sh -e

cd AD-VCFs-whi/MAF-0.01-1000nt
while true; do
    ls | grep 'combined.*\.vcf\.xz' | wc -l
    sleep 2
done

   11035
   11033
   11031
   11031
   11027
   11030
   11031
   11031
   11038
   11031
   11032

#!/bin/sh -e

while true; do
    vmstat -z | egrep 'ITEM|mbuf_cluster'
    sleep 1
done

ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_cluster:          2048, 2039976,    7360,    3042,1297272200,   0,   0
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_cluster:          2048, 2039976,    7304,    3098,1297298159,   0,   0
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_cluster:          2048, 2039976,    7548,    2854,1297322887,   0,   0
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_cluster:          2048, 2039976,    7428,    2974,1297348314,   0,   0
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_cluster:          2048, 2039976,    7208,    3194,1297374731,   0,   0
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_cluster:          2048, 2039976,    7302,    3100,1297403922,   0,   0
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_cluster:          2048, 2039976,    7394,    3008,1297434774,   0,   0
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_cluster:          2048, 2039976,    7455,    2947,1297464218,   0,   0
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_cluster:          2048, 2039976,    7503,    2899,1297489544,   0,   0
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_cluster:          2048, 2039976,    7381,    3021,1297516540,   0,   0
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_cluster:          2048, 2039976,    7432,    2970,1297539052,   0,   0
Comment 8 Jason W. Bacon freebsd_committer 2020-10-07 16:00:16 UTC
I thought you were onto something with the rdirplus hypothesis, but after a few more trials, I'm seeing some inconsistency with NFSv3 even without rdirplus.

So we can scrap the notion that the problem is limited to NFSv4.
Comment 9 Rick Macklem freebsd_committer 2020-10-08 00:07:33 UTC
Well, if you can see the problem with vanilla NFSv3
(without rdirplus), then forget about the patch that
prints out an error return for VOP_LOOKUP().
--> It is never called for NFSv3 Readdir RPCs.

I have no idea, but it sounds like ZFS returns
different blocks of entries for VOP_READDIR()
when under heavy load or somehow messes up
the directory offset cookies?

It's weird that no one else has ever reported this?

All I can suggest is posting to a mailing list that
the ZFS folks read and see if they have any insight
into what might be occurring.
Comment 10 Jason W. Bacon freebsd_committer 2020-10-08 14:09:37 UTC
I suspect that listing directories like this is infrequently done and even less frequently verified.  I don't recall ever doing something like this before that depended on an accurate directory listing.  So the issue may have happened to others but likely went unnoticed.  Had I not had parallel processes needing the exact same file list in order to function properly, I never would have caught it.

I'll try and construct a minimal, reproducible test case so it can be explored more easily.
Comment 11 Rick Macklem freebsd_committer 2020-10-09 01:34:55 UTC
If it were possible for you to set up a server
using UFS instead of ZFS, that would be nice,
in that would tell us if the problem is in ZFS
or more generic code.
Comment 12 Rick Macklem freebsd_committer 2020-10-22 00:11:48 UTC
Created attachment 218960 [details]
small test program that reads a directory and prints out d_off

This small program can be used to read a directory,
checking that d_off is monotonically increasing and,
optionally print out d_off for each entry.
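
The core of such a check might be sketched as below. This is an illustrative sketch, not the actual attachment; the function name check_doff is made up here, and the d_off field of struct dirent is assumed (it exists on FreeBSD 12+ and Linux):

```c
#include <dirent.h>
#include <stdio.h>

/*
 * Read every entry of `path`, optionally printing each d_off, and
 * return 1 if the d_off values are strictly increasing (0 on error
 * or if an out-of-order d_off is seen).  A non-monotonic or changing
 * d_off sequence would break the directory offset cookies that the
 * NFS server hands to clients.
 */
static int
check_doff(const char *path, int verbose)
{
	DIR *dirp;
	struct dirent *dp;
	long prev = -1;
	int ok = 1;

	if ((dirp = opendir(path)) == NULL) {
		perror(path);
		return (0);
	}
	while ((dp = readdir(dirp)) != NULL) {
		if (verbose)
			printf("%ld\n", (long)dp->d_off);
		if ((long)dp->d_off <= prev) {
			printf("Not monotonically increasing\n");
			ok = 0;
		}
		prev = (long)dp->d_off;
	}
	closedir(dirp);
	return (ok);
}
```

Wrapping this in a main() that takes a directory argument and an optional -v flag would give the usage described in the next comment.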
Comment 13 Rick Macklem freebsd_committer 2020-10-22 00:17:30 UTC
I've put a small program called rdir.c in the
attachments. You could run this program on
the directory (locally on the NFS server)
as follows:
# rdir -v <directory>
It should print out a bunch of numbers, which
are the d_off value for each directory entry.
If it spits out "Not monotonically increasing",
I'm not sure if that is considered an error for
ZFS (I think they're supposed to be monotonically
increasing, but I'm not sure for ZFS.)

If you run it repeatedly (locally on the server,
not on an NFS mount) when the server is under heavy
load, the output should always be the same.
If the output changes, then that would explain
the problem, although I have no idea what the
fix might be (it's a ZFS issue).

If you can try it on the problematic directory,
it would be interesting to hear what the results are.
Comment 14 Jason W. Bacon freebsd_committer 2020-10-22 13:49:27 UTC
(In reply to Rick Macklem from comment #13)
Absolutely.  I'm a little swamped at the moment, but will give this a try ASAP.  Stand by...
Comment 15 Rick Macklem freebsd_committer 2020-10-23 01:49:53 UTC
Sure. No rush.
If the "Not monotonically increasing" prints out,
you can just comment out that bit of code and
run it again, although I think it is worth noting
if/when it happens. (It should not break NFS, but
having the list of d_offs change will definitely
break the NFS server.)

Thanks for your continuing work on this, rick
Comment 16 Jason W. Bacon freebsd_committer 2020-10-24 21:29:49 UTC
(In reply to Rick Macklem from comment #11)

I created an NFS share on a compute node using UFS and the problem *DOES* occur there as well.

Running the "ls" test from any NFS client to that node gives something like the following:

#!/bin/sh -e

cd AD-VCFs-whi/MAF-0.01-1000nt
while true; do
    ls | grep 'combined.*\.vcf\.xz' | wc -l
    sleep 2
done

   11031
   11031
   11030
   11031
   11031
   11031
   11031
   11031
   11029
   11031

Running the same test script on either the ZFS or NFS server gives consistently correct results (11031 every time).
Comment 17 Jason W. Bacon freebsd_committer 2020-10-25 00:14:10 UTC
(In reply to Rick Macklem from comment #13)

rdir.c isn't reporting any problems on the server at the same time the NFS client is malfunctioning badly...

On the server:

FreeBSD login.peregrine  bacon ~/Data/TOPMed/Jobs/Haplohseq-analysis 1030: ./test-rdir AD-VCFs-whi/MAF-0.01-1000nt/ |& tee rdir.out
............................................................................................................................................^C

FreeBSD login.peregrine  bacon ~/Data/TOPMed/Jobs/Haplohseq-analysis 1031: cat test-rdir 
#!/bin/sh -e

usage()
{
    printf "Usage: $0 dir\n"
    exit 1
}

if [ $# != 1 ]; then
    usage
fi

# AD-VCFs-whi/MAF-0.01-1000nt
while true; do
    ./rdir $1
    printf "."
    sleep 1
done

On a client:

FreeBSD compute-009.peregrine  bacon ~/Data/TOPMed/Jobs/Haplohseq-analysis 1002: ./test-ls AD-VCFs-whi/MAF-0.01-1000nt/

These should all read 11031:

   11029
   11024
   11035
   11031
   11029
   11034
   11028
   11032
   11031
Comment 18 Rick Macklem freebsd_committer 2020-10-25 01:21:13 UTC
Created attachment 219047 [details]
allow readahead=0 NFS mount option

For some reason, the NFS client code did not allow
readahead=0 to be set as a mount option to disable
read ahead.

This one line patch fixes it.
Comment 19 Rick Macklem freebsd_committer 2020-10-25 01:25:08 UTC
Well, I was going to suggest trying the "readahead=0" mount
option on a client (which disables read ahead, including for
directory blocks).
Then I discovered the code didn't allow that.
So, first you'll need to apply the simple patch
"allow readahead=0 NFS mount option" and
then you can try it on a client.

It seems more likely that the client is broken than the
server (especially since UFS and ZFS both exhibit the
problem). All that a client should see when the server
is under heavy load is slow response.
--> Maybe that causes problems when a directory block
    is read ahead?
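
With the patch applied, a client fstab entry to test might look like the following (a hypothetical example based on the reporter's fstab, with "intr" dropped per comment 1 and "readahead=0" added):

```shell
# /etc/fstab on a client: disable read-ahead to see whether the
# duplicated/missing directory entries still occur under load
login:/raid-00    /raid-00    nfs    rw,bg,noatime,readahead=0    0    0
```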