Created attachment 222358 [details] test program demonstrating getdirentries's output. r318736, the 64-bit inode project, added a d_off field to struct dirent. According to the man page, each entry's d_off field can be used with lseek(2) to position the directory descriptor to the next entry. That works for UFS, devfs, ZFS, and tmpfs. But the NFS client always returns 0 for d_off. Secondly, getdirentries is supposed to return the length of valid data in the buffer. For UFS, devfs, ZFS, and tmpfs it does it correctly. But for NFS, it always returns the size of the buffer. Steps to Reproduce: 1) Compile the attached program with cc -o getdirentries -Wall getdirentries.c 2) sudo mount -t tmpfs tmpfs /mnt 3) cd /mnt 4) sudo touch foo bar baz 5) /path/to/getdirentries 0 len=160 name=. d_reclen=32 d_off=1 name=.. d_reclen=32 d_off=15397777 name=bar d_reclen=32 d_off=15397785 name=baz d_reclen=32 d_off=250872870 name=foo d_reclen=32 d_off=2 6) /path/to/getdirentries 1 name=.. d_reclen=32 d_off=15397777 name=bar d_reclen=32 d_off=15397785 name=baz d_reclen=32 d_off=250872870 name=foo d_reclen=32 d_off=2 Notice that we supplied the first entry's d_off value, and getdirentries returned a buffer starting with the second entry. 7) cd /path/to/some/NFS/mount 8) touch foo bar baz 9) /path/to/getdirentries 0 len=8192 name=. d_reclen=40 d_off=0 name=.. d_reclen=40 d_off=0 name=bar d_reclen=40 d_off=0 name=foo d_reclen=40 d_off=0 name=baz d_reclen=352 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 name= d_reclen=512 d_off=0 Notice the long tail of "" entries. That's because the returned length is equal to the size of the buffer. Also, notice that d_off is always 0. Obviously, it can't be supplied to lseek. Observed with NFS v4.2 on a FreeBSD 14.0-CURRENT amd64 client with a 13.0-ALPHA1 server.
I suppose the simple answer to this is "yes". The layer in NFS below the buffer cache does know "physical" (as returned by an NFS server) offset cookies, but then each buffer cache block is filled in (completely filled, which is why the full block is always returned) with "fake" UFS directory blocks. The d_off field could be filled in with the byte offset within the "logical block", but that would just break for lseek(), etc. The only case that work is an lseek() to the beginning of the block, because the NFS client maintains a cache of directory offset cookies, one for each block returned. I suppose the bug is in the man page, in that it does not state the NFS client as an exception. OpenBSD does not cache directory blocks in the server and, as such, can return d_off values that are the directory offset cookies that work on the NFS server. I have resisted making this change for FreeBSD, becuase it could have a major impact on performance in certain scenarios. The man page should probably note that only the libc calls opendir(3) should ever use getdirentries(2) for NFS mounts.
I would be fine with that man page change. I'm not really sure why anybody would choose to use getdirentries over readdir, anyway.
Ok. I think the best change would be to note that d_off == 0 can be generated by certain file systems. such as NFS. I think a value of 0 for d_off is always meaningless and should be ignored. An offset == 0 refers to beginning of directory and would never refer to a valid "next entry", I think? Feel free to update the man page, or I may get around to it someday. Now that I'm done with NFS over TLS, I am going to start to work on directory delegations. For the case of servers that present monotonically increasing offset cookies, more might be doable here. --> I plan on proposing an extension to NFSv4.2 which is a new attribute to indicate monotonically increasing directory offset cookies, because that is the only case where the NFS client can implement directory delegations. --> These allow a directory cache to be updated via callbacks with add/remove entry info. from a server. No one has implemented these, but they might be useful.
Isn't the real bug there, the long non-shortened read from getdirentries()? It probably somehow related to the EOF caching for NFS directories, but I did not analyzed it much.
I once committed a patch that got rid of the trailing empty dirent entries that fill out the block. (Each one is a strucr dirent of d_reclen == 512 and d_fileno == 0.) It broke the directory caching badly, such that there was a much higher miss ratio on the caching. I'll admit I cannot remember exactly how, but I do remember that I couldn't come up with a way to fix it, so I reverted the patch. It was bde@ who spotted the breakage.
Just fyi, what I am looking at for the case where the directory offset cookie returned by the server is monotonically increasing is... Replacing uses ig gbincore() with a similar call the uses PCTRIE_LOOKUP_LE() instead of PCTRIE_LOOKUP(). This would allow the directory offset cookies to be used as keys for the block instead of the "logical byte offset" of the converted to UFS-like directory blocks. The problem is that there is no way to know if the directory offset cookies are monotonically increasing until you see them. (Linux does something where the client code checks for a non-monotonically increasing offset then disables the caching optimization if this happens. I am not yet sure if doing this will be practical for what I am starting to code.) I am planning on proposing a new attribute for NFSv4.2 to indicate whether or not the offsets are monotonically increasing, but that will take a long time to get adopted. Once keyed on directory offset cookies, I think the "short block caching issue might get fixed, since I think it was related to the "logical byte offset" of the next block changing when short directory blocks were returned. (Maybe this should be moved to a mailing list?)
(In reply to Rick Macklem from comment #5) You mean, you trimmed the buffer itself? Could you please point out corresponding commits, the change that trimmed and the later revert? Can we keep the buffer alone, but stop returning too large result to userspace. I.e. avoid zero entries.
r283330, reverted by r291117 I believe the caching story was roughly (assuming an 8K block size). - NFS VOP_READDIR() would read the first block at logical offset 0 and then a read-ahead at logical offset 8K. --> If the readdir reply only filled 6K, it would set the b_bcount to 6K When getdents() or read() did the next read, it would ask for a block at 6K and the read-ahead block in the cache would be missed. Now if someone (I'll not volunteering right now) were to add code above the VOP_READDIR() in getdirentries()/getdents() that skipped over the dirents with d_fileno == 0 but still advanced f_offset past them, then I think that would be ok (no change in NFS behaviour). Also, you could get rid of the similar code in ufs_readdir(), which would be nice for the NFS server, since the behaviour of ufs_readdir() would return to "subsequent directory offsets do not change when entrie(s) are deleted. --> With this property, doing readdir(), unlink() in a loop works over NFS.
(In reply to Rick Macklem from comment #8) I think these are different things. In NFS nfsrpc_readdirplus(), there is special code just to fill the incoming buffer with empty dirents (now I found it). I mean the bloc near the end, under the comment 'Add extra empty records to any remaining DIRBLKSIZ chunks.' There is nothing comparable to that in UFS. As I understand, you are complaining about UFS code which skips empty dirents (by checking d_ino == 0 and jumping to nextentry). But there is no code to fabricate dummy dirents at the end to fill up to the end of uio. Why this code is needed for NFS? To fill the buffer, so that the situation you described with 6K/8K skew cannot occur? Also, I do not think that the code to skip empty dirents from UFS can be lifted to VFS layer. Problem is that VOP_READDIR() has to do uiomove()s itself, so the skip must occur in VOP. Would it be useful if UFS know that ufs_readdir() is called by NFS server and avoided skip?
(In reply to Konstantin Belousov from comment #9) Like this, untested. diff --git a/sys/ufs/ufs/ufs_vnops.c b/sys/ufs/ufs/ufs_vnops.c index 301c583291d1..bfdc16a1df72 100644 --- a/sys/ufs/ufs/ufs_vnops.c +++ b/sys/ufs/ufs/ufs_vnops.c @@ -2391,11 +2391,23 @@ ufs_readdir(ap) error = EIO; break; } - if (offset < startoffset || dp->d_ino == 0) + if (offset < startoffset) goto nextentry; + + /* + * In case of NFS server calling us, do not + * skip empty entries, since this keeps + * offsets consistent. + */ + if (cookies == NULL && dp->d_ino == 0) + goto nextentry; + dstdp.d_fileno = dp->d_ino; dstdp.d_reclen = GENERIC_DIRSIZ(&dstdp); - bcopy(dp->d_name, dstdp.d_name, dstdp.d_namlen); + if (dp->d_ino == 0) + bzero(dp->d_name, dstdp.d_namlen); + else + bcopy(dp->d_name, dstdp.d_name, dstdp.d_namlen); /* NOTE: d_off is the offset of the *next* entry. */ dstdp.d_off = offset + dp->d_reclen; dirent_terminate(&dstdp);
Ok, so now I see that stripping of d_ino == 0 entries must be done below the VOP_READDIR(). When I did r283330 to get rid of the creation of d_ino == 0 entries to fill up the block being filled in the buffer cache, I did not adjust uio_offset to indicate the offset of the next block. --> For my 6K/8K case, I could have advanced uio_offset by 8K even though it filled 6K into the buffer cache block. That might have made the caching work? I'll give that a try. --> It helps that read(2) is no longer expected to read directories. man getdirentries should be fixed for the d_off == 0 case. I'll come up with a man page patch and stick it on phabricator, because filling in d_off needs some thought for NFS. There may be some cases where the server's directory offset cookie can be used. Ignore trying to revert the change done to UFS in r252438, since the stripping cannot be done above the VOP_READDIR(). Unless both UFS and ZFS had the property that removal of an entry in a directory does not change other directory offsets, it is not worth worrying about. --> It also implies that d_ino == 0 dirents will be passed to user space which is not really intended now? At least that's my opinion. (And, yes, it was an aside and a separate issue.)
A commit in branch main references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=a0698341cd894ba4a640e9a9bb0f72c2133d1228 commit a0698341cd894ba4a640e9a9bb0f72c2133d1228 Author: Rick Macklem <rmacklem@FreeBSD.org> AuthorDate: 2021-02-15 02:16:58 +0000 Commit: Rick Macklem <rmacklem@FreeBSD.org> CommitDate: 2021-02-15 02:16:58 +0000 getdirentries.2: fix for NFS mounts It was reported that getdirentries(2) was returning dirents with d_off set to 0 for an NFS mount. This is believed to be correct behaviour at this time (it may change for some NFS mounts in the future), but is inconsistent with what the getdirentries(2) man page says. This patch fixes the man page. This is a content change. PR: 253428 Reviewed by: asomers MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D28664 lib/libc/sys/getdirentries.2 | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-)
A commit in branch stable/13 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=b96e66349ed80f2537777c0cc04266de1a3768b0 commit b96e66349ed80f2537777c0cc04266de1a3768b0 Author: Rick Macklem <rmacklem@FreeBSD.org> AuthorDate: 2021-02-15 02:16:58 +0000 Commit: Rick Macklem <rmacklem@FreeBSD.org> CommitDate: 2021-03-01 21:00:38 +0000 getdirentries.2: fix for NFS mounts It was reported that getdirentries(2) was returning dirents with d_off set to 0 for an NFS mount. This is believed to be correct behaviour at this time (it may change for some NFS mounts in the future), but is inconsistent with what the getdirentries(2) man page says. This patch fixes the man page. This is a content change. PR: 253428 (cherry picked from commit a0698341cd894ba4a640e9a9bb0f72c2133d1228) lib/libc/sys/getdirentries.2 | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-)
A commit in branch stable/12 references this bug: URL: https://cgit.FreeBSD.org/src/commit/?id=0dcfb6d761ccc8bd45b68231f9a5f4ff4c6d989f commit 0dcfb6d761ccc8bd45b68231f9a5f4ff4c6d989f Author: Rick Macklem <rmacklem@FreeBSD.org> AuthorDate: 2021-02-15 02:16:58 +0000 Commit: Rick Macklem <rmacklem@FreeBSD.org> CommitDate: 2021-03-04 01:30:23 +0000 getdirentries.2: fix for NFS mounts It was reported that getdirentries(2) was returning dirents with d_off set to 0 for an NFS mount. This is believed to be correct behaviour at this time (it may change for some NFS mounts in the future), but is inconsistent with what the getdirentries(2) man page says. This patch fixes the man page. This is a content change. PR: 253428 (cherry picked from commit a0698341cd894ba4a640e9a9bb0f72c2133d1228) lib/libc/sys/getdirentries.2 | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-)
Assigning to committer. Rick, can we close this bug?
With the man page commit that clarifies that d_off == 0 should be ignored, I believe the behaviour is correct. Things wandered slightly off-topic into a discussion w.r.t. stripping off DIRBLKSIZ blocks of no useful dir entries. This may be possible if/when directory readahead is disabled in the NFS client. (Directory readahead does not make much sense, since the readahead needs to wait for the previous block's reply, so that it has the required directory offset cookie. As such, the readahead cannot normally happen concurrently with the read, so very little parallelism occurs.) Close the bug at least for now.