Bug 253428

Summary: getdirentries does not work correctly on NFS mounts
Product: Base System Reporter: Alan Somers <asomers>
Component: kernAssignee: Rick Macklem <rmacklem>
Status: Closed FIXED    
Severity: Affects Some People CC: kib-bugs, kib, rmacklem
Priority: ---    
Version: CURRENT   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
test program demonstrating getdirentries's output. none

Description Alan Somers freebsd_committer freebsd_triage 2021-02-11 04:07:50 UTC
Created attachment 222358 [details]
test program demonstrating getdirentries's output.

r318736, the 64-bit inode project, added a d_off field to struct dirent.  According to the man page, each entry's d_off field can be used with lseek(2) to position the directory descriptor to the next entry.  That works for UFS, devfs, ZFS, and tmpfs.  But the NFS client always returns 0 for d_off.

Secondly, getdirentries is supposed to return the length of valid data in the buffer.  For UFS, devfs, ZFS, and tmpfs it does it correctly.  But for NFS, it always returns the size of the buffer.

Steps to Reproduce:
1) Compile the attached program with cc -o getdirentries -Wall getdirentries.c
2) sudo mount -t tmpfs tmpfs /mnt
3) cd /mnt
4) sudo touch foo bar baz
5) /path/to/getdirentries 0
  len=160
  name=. d_reclen=32 d_off=1
  name=.. d_reclen=32 d_off=15397777
  name=bar d_reclen=32 d_off=15397785
  name=baz d_reclen=32 d_off=250872870
  name=foo d_reclen=32 d_off=2
6) /path/to/getdirentries 1
  name=.. d_reclen=32 d_off=15397777
  name=bar d_reclen=32 d_off=15397785
  name=baz d_reclen=32 d_off=250872870
  name=foo d_reclen=32 d_off=2
Notice that we supplied the first entry's d_off value, and getdirentries returned a buffer starting with the second entry.
7) cd /path/to/some/NFS/mount
8) touch foo bar baz
9) /path/to/getdirentries 0
  len=8192
  name=. d_reclen=40 d_off=0
  name=.. d_reclen=40 d_off=0
  name=bar d_reclen=40 d_off=0
  name=foo d_reclen=40 d_off=0
  name=baz d_reclen=352 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
  name= d_reclen=512 d_off=0
Notice the long tail of "" entries.  That's because the returned length is equal to the size of the buffer.  Also, notice that d_off is always 0.  Obviously, it can't be supplied to lseek.

Observed with NFS v4.2 on a FreeBSD 14.0-CURRENT amd64 client with a 13.0-ALPHA1 server.
Comment 1 Rick Macklem freebsd_committer freebsd_triage 2021-02-11 23:13:17 UTC
I suppose the simple answer to this is "yes".

The layer in NFS below the buffer cache does know
"physical" (as returned by an NFS server) offset
cookies, but then each buffer cache block is filled
in (completely filled, which is why the full block
is always returned) with "fake" UFS directory blocks.

The d_off field could be filled in with the byte offset
within the "logical block", but that would just break
for lseek(), etc.

The only case that work is an lseek() to the beginning
of the block, because the NFS client maintains a cache
of directory offset cookies, one for each block returned.

I suppose the bug is in the man page, in that it does
not state the NFS client as an exception.

OpenBSD does not cache directory blocks in the server
and, as such, can return d_off values that are the
directory offset cookies that work on the NFS server.
I have resisted making this change for FreeBSD, becuase
it could have a major impact on performance in certain
scenarios.

The man page should probably note that only the libc
calls opendir(3) should ever use getdirentries(2)
for NFS mounts.
Comment 2 Alan Somers freebsd_committer freebsd_triage 2021-02-11 23:19:23 UTC
I would be fine with that man page change.  I'm not really sure why anybody would choose to use getdirentries over readdir, anyway.
Comment 3 Rick Macklem freebsd_committer freebsd_triage 2021-02-11 23:39:34 UTC
Ok. I think the best change would be to note that
d_off == 0 can be generated by certain file
systems. such as NFS.
I think a value of 0 for d_off is always meaningless
and should be ignored.

An offset == 0 refers to beginning of directory and would
never refer to a valid "next entry", I think?

Feel free to update the man page, or I may get around
to it someday.

Now that I'm done with NFS over TLS, I am going
to start to work on directory delegations.
For the case of servers that present monotonically
increasing offset cookies, more might be doable
here.
--> I plan on proposing an extension to NFSv4.2
    which is a new attribute to indicate monotonically
    increasing directory offset cookies, because that
    is the only case where the NFS client can implement
    directory delegations.
    --> These allow a directory cache to be updated via
        callbacks with add/remove entry info. from
        a server. No one has implemented these, but
        they might be useful.
Comment 4 Konstantin Belousov freebsd_committer freebsd_triage 2021-02-12 01:40:42 UTC
Isn't the real bug there, the long non-shortened read from getdirentries()?
It probably somehow related to the EOF caching for NFS directories, but I
did not analyzed it much.
Comment 5 Rick Macklem freebsd_committer freebsd_triage 2021-02-12 14:07:58 UTC
I once committed a patch that got rid of
the trailing empty dirent entries that
fill out the block. (Each one is a
strucr dirent of d_reclen == 512 and d_fileno == 0.)

It broke the directory caching badly,
such that there was a much higher miss
ratio on the caching.
I'll admit I cannot remember exactly how,
but I do remember that I couldn't come
up with a way to fix it, so I reverted
the patch. It was bde@ who spotted the
breakage.
Comment 6 Rick Macklem freebsd_committer freebsd_triage 2021-02-12 14:35:02 UTC
Just fyi, what I am looking at for the case
where the directory offset cookie returned
by the server is monotonically increasing is...

Replacing uses ig gbincore() with a similar
call the uses PCTRIE_LOOKUP_LE() instead
of PCTRIE_LOOKUP(). This would allow the
directory offset cookies to be used as keys
for the block instead of the "logical byte
offset" of the converted to UFS-like directory
blocks.

The problem is that there is no way to know
if the directory offset cookies are monotonically
increasing until you see them.
(Linux does something where the client code
 checks for a non-monotonically increasing
 offset then disables the caching optimization
 if this happens. I am not yet sure if doing this
 will be practical for what I am starting
 to code.)

I am planning on proposing a new attribute for
NFSv4.2 to indicate whether or not the offsets
are monotonically increasing, but that will
take a long time to get adopted.

Once keyed on directory offset cookies, I
think the "short block caching issue might
get fixed, since I think it was related to the
"logical byte offset" of the next block
changing when short directory blocks were
returned.
(Maybe this should be moved to a mailing list?)
Comment 7 Konstantin Belousov freebsd_committer freebsd_triage 2021-02-13 04:56:46 UTC
(In reply to Rick Macklem from comment #5)
You mean, you trimmed the buffer itself?  Could you please point out corresponding
commits, the change that trimmed and the later revert?

Can we keep the buffer alone, but stop returning too large result to userspace.
I.e. avoid zero entries.
Comment 8 Rick Macklem freebsd_committer freebsd_triage 2021-02-13 16:10:57 UTC
r283330, reverted by r291117
I believe the caching story was roughly
(assuming an 8K block size).
- NFS VOP_READDIR() would read the first
  block at logical offset 0
  and then a read-ahead at logical offset 8K.
--> If the readdir reply only filled 6K,
    it would set the b_bcount to 6K

When getdents() or read() did the next read,
it would ask for a block at 6K and the read-ahead
block in the cache would be missed.

Now if someone (I'll not volunteering right
now) were to add code above the VOP_READDIR()
in getdirentries()/getdents() that skipped
over the dirents with d_fileno == 0 but still
advanced f_offset past them, then I think
that would be ok (no change in NFS behaviour).

Also, you could get rid of the similar code
in ufs_readdir(), which would be nice for
the NFS server, since the behaviour of ufs_readdir()
would return to "subsequent directory offsets
do not change when entrie(s) are deleted.
--> With this property, doing readdir(), unlink()
    in a loop works over NFS.
Comment 9 Konstantin Belousov freebsd_committer freebsd_triage 2021-02-13 19:45:14 UTC
(In reply to Rick Macklem from comment #8)
I think these are different things.

In NFS nfsrpc_readdirplus(), there is special code just to fill the incoming
buffer with empty dirents (now I found it).  I mean the bloc near the end,
under the comment
'Add extra empty records to any remaining DIRBLKSIZ chunks.'

There is nothing comparable to that in UFS.  As I understand, you are complaining
about UFS code which skips empty dirents (by checking d_ino == 0 and jumping to
nextentry).  But there is no code to fabricate dummy dirents at the end to
fill up to the end of uio.

Why this code is needed for NFS?  To fill the buffer, so that the situation
you described with 6K/8K skew cannot occur?

Also, I do not think that the code to skip empty dirents from UFS can be lifted
to VFS layer.  Problem is that VOP_READDIR() has to do uiomove()s itself, so the
skip must occur in VOP.

Would it be useful if UFS know that ufs_readdir() is called by NFS server and
avoided skip?
Comment 10 Konstantin Belousov freebsd_committer freebsd_triage 2021-02-13 20:28:14 UTC
(In reply to Konstantin Belousov from comment #9)
Like this, untested.

diff --git a/sys/ufs/ufs/ufs_vnops.c b/sys/ufs/ufs/ufs_vnops.c
index 301c583291d1..bfdc16a1df72 100644
--- a/sys/ufs/ufs/ufs_vnops.c
+++ b/sys/ufs/ufs/ufs_vnops.c
@@ -2391,11 +2391,23 @@ ufs_readdir(ap)
 				error = EIO;
 				break;
 			}
-			if (offset < startoffset || dp->d_ino == 0)
+			if (offset < startoffset)
 				goto nextentry;
+
+			/*
+			 * In case of NFS server calling us, do not
+			 * skip empty entries, since this keeps
+			 * offsets consistent.
+			 */
+			if (cookies == NULL && dp->d_ino == 0)
+				goto nextentry;
+
 			dstdp.d_fileno = dp->d_ino;
 			dstdp.d_reclen = GENERIC_DIRSIZ(&dstdp);
-			bcopy(dp->d_name, dstdp.d_name, dstdp.d_namlen);
+			if (dp->d_ino == 0)
+				bzero(dp->d_name, dstdp.d_namlen);
+			else
+				bcopy(dp->d_name, dstdp.d_name, dstdp.d_namlen);
 			/* NOTE: d_off is the offset of the *next* entry. */
 			dstdp.d_off = offset + dp->d_reclen;
 			dirent_terminate(&dstdp);
Comment 11 Rick Macklem freebsd_committer freebsd_triage 2021-02-13 23:08:32 UTC
Ok, so now I see that stripping of d_ino == 0
entries must be done below the VOP_READDIR().
When I did r283330 to get rid of the creation
of d_ino == 0 entries to fill up the block
being filled in the buffer cache, I
did not adjust uio_offset to indicate the
offset of the next block.
--> For my 6K/8K case, I could have advanced
    uio_offset by 8K even though it filled
    6K into the buffer cache block.
    That might have made the caching work?
    I'll give that a try.
    --> It helps that read(2) is no longer
        expected to read directories.

man getdirentries should be fixed for the
d_off == 0 case. I'll come up with a man
page patch and stick it on phabricator,
because filling in d_off needs some thought
for NFS. There may be some cases where the
server's directory offset cookie can be
used.

Ignore trying to revert the change done to UFS in r252438,
since the stripping cannot be done above the VOP_READDIR().
Unless both UFS and ZFS had the property
that removal of an entry in a directory
does not change other directory offsets,
it is not worth worrying about.
--> It also implies that d_ino == 0
    dirents will be passed to user space
    which is not really intended now?
At least that's my opinion.
(And, yes, it was an aside and a separate issue.)
Comment 12 commit-hook freebsd_committer freebsd_triage 2021-02-15 02:19:30 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a0698341cd894ba4a640e9a9bb0f72c2133d1228

commit a0698341cd894ba4a640e9a9bb0f72c2133d1228
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2021-02-15 02:16:58 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2021-02-15 02:16:58 +0000

    getdirentries.2: fix for NFS mounts

    It was reported that getdirentries(2) was
    returning dirents with d_off set to 0 for an NFS
    mount.

    This is believed to be correct behaviour at
    this time (it may change for some NFS mounts
    in the future), but is inconsistent with what the
    getdirentries(2) man page says.

    This patch fixes the man page.

    This is a content change.

    PR:     253428
    Reviewed by:    asomers
    MFC after:      2 weeks
    Differential Revision:  https://reviews.freebsd.org/D28664

 lib/libc/sys/getdirentries.2 | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)
Comment 13 commit-hook freebsd_committer freebsd_triage 2021-03-01 21:06:41 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b96e66349ed80f2537777c0cc04266de1a3768b0

commit b96e66349ed80f2537777c0cc04266de1a3768b0
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2021-02-15 02:16:58 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2021-03-01 21:00:38 +0000

    getdirentries.2: fix for NFS mounts

    It was reported that getdirentries(2) was
    returning dirents with d_off set to 0 for an NFS
    mount.

    This is believed to be correct behaviour at
    this time (it may change for some NFS mounts
    in the future), but is inconsistent with what the
    getdirentries(2) man page says.

    This patch fixes the man page.

    This is a content change.

    PR:     253428

    (cherry picked from commit a0698341cd894ba4a640e9a9bb0f72c2133d1228)

 lib/libc/sys/getdirentries.2 | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)
Comment 14 commit-hook freebsd_committer freebsd_triage 2021-03-04 01:33:46 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=0dcfb6d761ccc8bd45b68231f9a5f4ff4c6d989f

commit 0dcfb6d761ccc8bd45b68231f9a5f4ff4c6d989f
Author:     Rick Macklem <rmacklem@FreeBSD.org>
AuthorDate: 2021-02-15 02:16:58 +0000
Commit:     Rick Macklem <rmacklem@FreeBSD.org>
CommitDate: 2021-03-04 01:30:23 +0000

    getdirentries.2: fix for NFS mounts

    It was reported that getdirentries(2) was
    returning dirents with d_off set to 0 for an NFS
    mount.

    This is believed to be correct behaviour at
    this time (it may change for some NFS mounts
    in the future), but is inconsistent with what the
    getdirentries(2) man page says.

    This patch fixes the man page.

    This is a content change.

    PR:     253428

    (cherry picked from commit a0698341cd894ba4a640e9a9bb0f72c2133d1228)

 lib/libc/sys/getdirentries.2 | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)
Comment 15 Alan Somers freebsd_committer freebsd_triage 2022-07-24 16:54:27 UTC
Assigning to committer.  Rick, can we close this bug?
Comment 16 Rick Macklem freebsd_committer freebsd_triage 2022-07-24 20:00:52 UTC
With the man page commit that clarifies that d_off == 0
should be ignored, I believe the behaviour is correct.

Things wandered slightly off-topic into a discussion
w.r.t. stripping off DIRBLKSIZ blocks of no useful
dir entries. This may be possible if/when directory
readahead is disabled in the NFS client.
(Directory readahead does not make much sense, since
 the readahead needs to wait for the previous block's
 reply, so that it has the required directory offset
 cookie. As such, the readahead cannot normally happen
 concurrently with the read, so very little parallelism
 occurs.)

Close the bug at least for now.