Summary:   NFSv4 faulty directory listings under heavy load
Product:   Base System
Component: kern
Status:    Closed Overcome By Events
Severity:  Affects Some People
Priority:  ---
Version:   12.1-RELEASE
Hardware:  amd64
OS:        Any
Reporter:  Jason W. Bacon <jwb>
Assignee:  freebsd-bugs (Nobody) <bugs>
CC:        emaste, rmacklem
Description
Jason W. Bacon
2020-09-25 02:51:44 UTC
Rick Macklem

Yep, that is weird. Assuming the clients are FreeBSD and not Linux, the only thing I can think of to try is taking the "intr" option off the mounts. The BUGS section of "man mount_nfs" notes it should never be used. If somehow a signal were to be posted to the process on the client, that might explain this, if a sleep() returns ERESTART or something like that.

--> Anyhow, for reasons mostly related to sessions (or lock sequencing for NFSv4.0), you should never use "intr" nor "soft" on NFSv4 mounts.

I'll look through the code, but the NFSv3 and NFSv4 code is very similar for Readdir.

One more question:
- Are you using nfsuserd or uids in strings? If the former, you could try setting

      vfs.nfs.enable_uidtostring=1
      vfs.nfsd.enable_stringtouid=1

  and then not run nfsuserd. (When the uid<->name cache misses, the upcall to the nfsuserd could take a long time on the heavily loaded server.) However, I cannot think why slow response could cause this unless it is related to using "intr" on the mounts.

If the clients are Linux, then I vaguely recall mention of problems with reading large directories being discussed on linux-nfs@vger.kernel.org.

Jason W. Bacon

Thanks for the suggestions. I disabled intr and I'm still getting variable results. Server and clients are all FreeBSD 12.1 and I am not using nfsuserd.

Server:

    vfs.zfs.min_auto_ashift=12
    net.inet.icmp.drop_redirect=1
    net.inet.ip.sourceroute=0
    net.inet.ip.accept_sourceroute=0
    net.inet.tcp.rfc1323=0
    vfs.nfs.enable_uidtostring=1
    vfs.nfsd.enable_stringtouid=1
    vfs.nfsd.tcphighwater=20000
    vfs.nfsd.tcpcachetimeo=3600

Clients:

    vfs.nfs.enable_uidtostring=1

Happy to try other suggestions if you have any.
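For concreteness, the changes discussed above amount to something like the following sketch; the server name, export path, and mount point are hypothetical, and the mount relies on the defaults (hard mount over TCP) once "intr" and "soft" are removed:

    # Client: remount without "intr" or "soft"; hard, TCP behavior is the default.
    umount /mnt/data
    mount -t nfs -o nfsv4 fileserver:/export/data /mnt/data

    # Server: map uids to/from strings directly instead of running nfsuserd.
    sysctl vfs.nfs.enable_uidtostring=1
    sysctl vfs.nfsd.enable_stringtouid=1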
Rick Macklem

Created attachment 218479 [details]
log readdirplus VOP_LOOKUP() failures
This little patch adds a printf() that logs VOP_LOOKUP()
errors, if any occur in the NFSv4 server's readdir.
--> VOP_LOOKUP() gets used instead of VFS_VGET() when
the NFSv4 server is performing Readdir over ZFS,
since VFS_VGET() is not reliable for ZFS.
- If the VOP_LOOKUP() fails (I do not think it ever
should), that would corrupt the Readdir reply.
If you can add this patch to your server and note if
the printf() happens (especially when "ls" fails),
please do so.
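If the patch is applied, the printf() output lands in the kernel message buffer and, under the stock syslog configuration, in /var/log/messages on the server. One hedged way to watch for it while the clients run the failing "ls" loops (the grep pattern is a guess at the message text, which depends on the patch):

    # On the server: watch for the diagnostic message during the test runs.
    tail -f /var/log/messages | grep -i vop_lookup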
Rick Macklem

Another thing to try is disabling the name cache. I think the sysctl is called debug.vfscache? (For some reason it has disappeared from head/current, but it should be in 12.1.)

--> Set it to 0 on the server and see if the problem still exists.

Jason W. Bacon

Disabling debug.vfscache didn't help. Adding the patch will take a little more doing, as this is a production system. I'll try reproducing the problem on a private test cluster first, maybe later this week.

A little more info: the problem seems to be limited to the directory being hammered by multiple readers. Listing a sibling directory produces consistent and correct results.

Rick Macklem

One additional thing you could do is run

    # vmstat -z | fgrep mbuf_cluster

on the server when it is under heavy load, to see if it is running out of mbuf clusters. (Theoretically the NFS server should keep working when mbuf clusters are exhausted, but the result would be a Readdir reply made up of a long list of regular mbufs. That could impact things like TSO, if the net interface on the server has that enabled.)

Basically, other than possible mbuf exhaustion, I can't think of any way heavy load would affect the NFS server code (except slower response). Since the name cache doesn't seem to be the culprit, that leaves all the caching that goes on inside ZFS.

--> If the readdir contents are somehow reordered by ZFS when the directory is under heavy readdir load, or the directory offset cookies somehow change, that would explain the problem.

Yet one more thing that you could try is having client mounts done with "nfsv3,rdirplus". You mentioned that NFSv3 worked OK. NFSv3 + rdirplus works more closely to NFSv4 in the server, and whether or not that fails might be useful information.

Jason W. Bacon

Doesn't appear to be an mbuf issue. Running the two scripts below simultaneously:

    #!/bin/sh -e

    cd AD-VCFs-whi/MAF-0.01-1000nt
    while true; do
        ls | grep 'combined.*\.vcf\.xz' | wc -l
        sleep 2
    done

    11035
    11033
    11031
    11031
    11027
    11030
    11031
    11031
    11038
    11031
    11032

    #!/bin/sh -e

    while true; do
        vmstat -z | egrep 'ITEM|mbuf_cluster'
        sleep 1
    done

    ITEM          SIZE   LIMIT    USED   FREE          REQ  FAIL  SLEEP
    mbuf_cluster: 2048, 2039976,  7360,  3042,  1297272200,    0,     0
    mbuf_cluster: 2048, 2039976,  7304,  3098,  1297298159,    0,     0
    mbuf_cluster: 2048, 2039976,  7548,  2854,  1297322887,    0,     0
    mbuf_cluster: 2048, 2039976,  7428,  2974,  1297348314,    0,     0
    mbuf_cluster: 2048, 2039976,  7208,  3194,  1297374731,    0,     0
    mbuf_cluster: 2048, 2039976,  7302,  3100,  1297403922,    0,     0
    mbuf_cluster: 2048, 2039976,  7394,  3008,  1297434774,    0,     0
    mbuf_cluster: 2048, 2039976,  7455,  2947,  1297464218,    0,     0
    mbuf_cluster: 2048, 2039976,  7503,  2899,  1297489544,    0,     0
    mbuf_cluster: 2048, 2039976,  7381,  3021,  1297516540,    0,     0
    mbuf_cluster: 2048, 2039976,  7432,  2970,  1297539052,    0,     0

I thought you were onto something with the rdirplus hypothesis, but after a few more trials, I'm seeing some inconsistency with NFSv3 even without rdirplus. So we can scrap the notion that the problem is limited to NFSv4.

Rick Macklem

Well, if you can see the problem with vanilla NFSv3 (without rdirplus), then forget about the patch that prints out an error return for VOP_LOOKUP().

--> It is never called for NFSv3 Readdir RPCs.
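For reference, the "nfsv3,rdirplus" comparison mount suggested above would be specified on a client roughly as follows; the server and mount point names are hypothetical:

    # NFSv3 with READDIRPLUS, which exercises a server code path
    # closer to the NFSv4 Readdir handling.
    mount -t nfs -o nfsv3,rdirplus fileserver:/export/data /mnt/data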
Rick Macklem

I have no idea, but it sounds like ZFS returns different blocks of entries for VOP_READDIR() when under heavy load, or somehow messes up the directory offset cookies? It's weird that no one else has ever reported this. All I can suggest is posting to a mailing list that the ZFS folks read and seeing if they have any insight into what might be occurring.

Jason W. Bacon

I suspect that listing directories like this is infrequently done and even less frequently verified. I don't recall ever doing something like this before that depended on an accurate directory listing. So the issue may have happened to others but likely went unnoticed. Had I not had parallel processes needing the exact same file list in order to function properly, I never would have caught it.

I'll try to construct a minimal, reproducible test case so it can be explored more easily.

Rick Macklem

If it were possible for you to set up a server using UFS instead of ZFS, that would be nice, in that it would tell us whether the problem is in ZFS or in more generic code.

Rick Macklem

Created attachment 218960 [details]
small test program that reads a directory and prints out d_off
This small program can be used to read a directory,
checking that d_off is monotonically increasing and,
optionally, printing out d_off for each entry.
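A possible way to use the program to check whether the d_off list is stable under load, assuming the attachment builds as a standalone C file (the directory path here is hypothetical):

    # Build the test program, then compare the d_off output of two runs
    # taken while the server is under heavy load; any difference would
    # indicate the directory offset cookies are changing.
    cc -O2 -o rdir rdir.c
    ./rdir -v /tank/problem-dir > run1.out
    ./rdir -v /tank/problem-dir > run2.out
    diff run1.out run2.out && echo "d_off lists identical"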
Rick Macklem

I've put a small program called rdir.c in the attachments. You could run this program on the directory (locally on the NFS server) as follows:

    # rdir -v <directory>

It should print out a bunch of numbers, which are the d_off values for each directory entry. If it spits out "Not monotonically increasing", I'm not sure if that is considered an error for ZFS (I think they're supposed to be monotonically increasing, but I'm not sure for ZFS).

If you run it repeatedly (locally on the server, not on an NFS mount) when the server is under heavy load, the output should always be the same. If the output changes, then that would explain the problem, although I have no idea what the fix might be (it's a ZFS issue).

If you can try it on the problematic directory, it would be interesting to hear what the results are.

Jason W. Bacon

(In reply to Rick Macklem from comment #13)

Absolutely. I'm a little swamped at the moment, but will give this a try ASAP. Stand by...

Rick Macklem

Sure. No rush. If the "Not monotonically increasing" prints out, you can just comment out that bit of code and run it again, although I think it is worth noting if/when it happens. (It should not break NFS, but having the list of d_offs change will definitely break the NFS server.)

Thanks for your continuing work on this, rick

Jason W. Bacon

(In reply to Rick Macklem from comment #11)

I created an NFS share on a compute node using UFS and the problem *DOES* occur there as well. Running the "ls" test from any NFS client to that node gives something like the following:

    #!/bin/sh -e

    cd AD-VCFs-whi/MAF-0.01-1000nt
    while true; do
        ls | grep 'combined.*\.vcf\.xz' | wc -l
        sleep 2
    done

    11031
    11031
    11030
    11031
    11031
    11031
    11031
    11031
    11029
    11031

Running the same test script locally on either the ZFS or the UFS server gives consistently correct results (11031 every time).

Jason W. Bacon

(In reply to Rick Macklem from comment #13)

rdir.c isn't reporting any problems on the server at the same time the NFS client is malfunctioning badly...

On the server:

    FreeBSD login.peregrine bacon ~/Data/TOPMed/Jobs/Haplohseq-analysis 1030: ./test-rdir AD-VCFs-whi/MAF-0.01-1000nt/ |& tee rdir.out
    ............................................................................................................................................^C
    FreeBSD login.peregrine bacon ~/Data/TOPMed/Jobs/Haplohseq-analysis 1031: cat test-rdir
    #!/bin/sh -e

    usage()
    {
        printf "Usage: $0 dir\n"
        exit 1
    }

    if [ $# != 1 ]; then
        usage
    fi

    # AD-VCFs-whi/MAF-0.01-1000nt
    while true; do
        ./rdir $1
        printf "."
        sleep 1
    done

On a client:

    FreeBSD compute-009.peregrine bacon ~/Data/TOPMed/Jobs/Haplohseq-analysis 1002: ./test-ls AD-VCFs-whi/MAF-0.01-1000nt/
    These should all read 11031:
    11029
    11024
    11035
    11031
    11029
    11034
    11028
    11032
    11031

Rick Macklem

Created attachment 219047 [details]
allow readahead=0 NFS mount option
For some reason, the NFS client code did not allow
readahead=0 to be set as a mount option to disable
read ahead.
This one-line patch fixes it.
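With the patch applied, disabling read ahead on a client would look something like the following sketch; the server name, export path, and mount point are hypothetical:

    # NFSv4 mount with read ahead disabled, to test whether
    # directory-block read ahead on the client causes the bad listings.
    mount -t nfs -o nfsv4,readahead=0 fileserver:/export/data /mnt/data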
Rick Macklem

Well, I was going to suggest trying the "readahead=0" mount option on a client (which disables read ahead, including for directory blocks). Then I discovered the code didn't allow that. So, first you'll need to apply the simple patch "allow readahead=0 NFS mount option", and then you can try it on a client.

It seems more likely that the client is broken than the server (especially since UFS and ZFS both exhibit the problem). All that a client should see when the server is under heavy load is slow response.

--> Maybe that causes problems when a directory block is read ahead?

Jason W. Bacon

Yet another problem solved by wu wei... I can no longer reproduce this since upgrading to 12.2-RELEASE. Guessing either there was a bug that was fortuitously resolved, or some bit rot in my system was overwritten. All's well that ends well. Thanks for your help investigating.

Rick Macklem

Good news. Hopefully it won't show up again.