Bug 230491 - stat(1): Improve performance with getpwuid() and getgrgid() caching
Summary: stat(1): Improve performance with getpwuid() and getgrgid() caching
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: 11.2-STABLE
Hardware: Any Any
: --- Affects Only Me
Assignee: freebsd-bugs mailing list
URL:
Keywords: needs-qa, patch, performance
Depends on:
Blocks: 232940
  Show dependency treegraph
 
Reported: 2018-08-10 01:37 UTC by Thomas Hurst
Modified: 2018-11-03 16:37 UTC (History)
2 users (show)

See Also:


Attachments
stat(1) caching (applies to STABLE and CURRENT) (1.21 KB, patch)
2018-08-10 01:37 UTC, Thomas Hurst
no flags Details | Diff
Untested counter-proposal using existing pwcache(3) mechanism (1019 bytes, patch)
2018-08-11 02:42 UTC, Conrad Meyer
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Thomas Hurst 2018-08-10 01:37:06 UTC
Created attachment 196041 [details]
stat(1) caching (applies to STABLE and CURRENT)

This patch for stat(1) adds very simple caching to its calls to getpwuid() and getgrgid(), for a substantial performance boost when user/group is displayed.

With this test command on a cached ZFS dataset:

    time (find /usr/src -type f -print0 | xargs -0 stat >/dev/null)

I see run time reduce from 22 seconds to 8.5 seconds.  On the OpenBSD CVS repository with files owned by a normal user I see an even more pronounced difference: 55 seconds reduced to 14 - I guess these functions have runtime proportional to how far down the user/group lists they look?

sysutils/coreutils gnustat shows identical performance to stock FreeBSD stat(1), so could probably benefit from similar changes.

Thanks to Vall on FreeNode #freebsd for reporting the performance problem.
Comment 1 Conrad Meyer freebsd_committer 2018-08-10 04:22:25 UTC
Patch looks functionally ok to me (works because stat is a single-threaded program); has some style issues.

It's unclear to me if we want to add caching to every individual tool accessing these databases, or if the caching should just happen in libc.

One other concern that the thought of caching in libc raises is: how do we handle cache invalidation?  Obviously we don't care if a user/group change races a running but ultimately bounded program like stat(1) across many files, but we probably do care if a daemon never sees renamed or numbered ids.

Btw, it seems such caching is already implemented in libc via the NS_CACHING option.
Comment 2 Thomas Hurst 2018-08-10 12:03:43 UTC
The built-in caching doesn't seem useful for this case.  Enabling it results in a test run creating this unfortunate entry in top:

63957 root             9  52    0 24000K  4696K uwait  19   9:15 268.91% nscd

and actually *slows down* stat(1) in my tests - /usr/src goes from 22 seconds to almost 90, and my CVS repository scan goes from 55 seconds to 2 minutes.
Comment 3 Garance A Drosehn freebsd_committer 2018-08-11 01:11:39 UTC
Just out of curiosity:  Have you tried 'nscd' (name service caching daemon).  I have never had reason to try it on my freebsd systems, but I've used it on some linux systems where it can produce a huge performance boost.
Comment 4 Conrad Meyer freebsd_committer 2018-08-11 01:15:25 UTC
(In reply to Garance A Drosehn from comment #3)
See comment #2 :-).  It's understandable given IPC overhead vs accessing a static variable.
Comment 5 Garance A Drosehn freebsd_committer 2018-08-11 01:16:18 UTC
(In reply to Garance A Drosehn from comment #3)

Ugh.  I didn't notice that 'ncsd' was mentioned in the response right before mine!  The performance of nscd might depend on how it's configured in /etc/nscd.conf.

On the linux systems where I use this, we have over 11000 userids, and turning on 'nscd' does make a huge difference in 'ls' and 'stat'.  The resulting speed is *almost* as fast as doing 'ls -ln /sample/directory', for instance.
Comment 6 Garance A Drosehn freebsd_committer 2018-08-11 01:20:22 UTC
(In reply to Garance A Drosehn from comment #5)

for instance, on one sample directory:

time ls -ln <sample>
....
real	0m0.010s
user	0m0.001s
sys	0m0.007s

time ls -l <sample>   # on host with nscd
....
real	0m0.012s
user	0m0.002s
sys	0m0.006s

time ls -l <sample>   # on host without nscd
....
real	0m1.156s
user	0m0.003s
sys	0m0.009s
Comment 7 Garance A Drosehn freebsd_committer 2018-08-11 01:51:40 UTC
(In reply to Conrad Meyer from comment #4)

Just how much overhead is there in IPC?  It's easy to see that a local cache will beat the caching done by a separate nscd process, but why would using nscd be so much slower than doing a totally non-cached look up of the user and group names each time?  Doesn't that seem a little odd?

And the local-to-stat caching:  won't that only help if many commands are 'stat'-ed in a single command, as opposed to doing many stat-commands with one file per command?

Does 'ls' do local caching?  If there is a big benefit in doing local-caching in 'stat', should we also do it with 'ls'?  I know I do a lot more 'ls -l's than I do 'stat'-commands of any kind.
Comment 8 Conrad Meyer freebsd_committer 2018-08-11 02:35:19 UTC
(In reply to Garance A Drosehn from comment #5)
(In reply to Garance A Drosehn from comment #7)
> but why would using nscd be so much slower than doing a totally non-cached look
> up of the user and group names each time?  Doesn't that seem a little odd?

You're totally right, that does seem suspect.  Maybe our ncsd isn't the same one as on Linux, or is an older version, or has some other deficiency?  I admit total lack of familiarity with any version or use of ncsd. :-)

> And the local-to-stat caching:  won't that only help if many commands are
> 'stat'-ed in a single command, as opposed to doing many stat-commands with one
> file per command?

Correct.  That's the use case described in Thomas' description, although I don't know if it represents any realworld workflow or if it's just an arbitrary pessimistic microbenchmark.  If the latter, I don't think it's worth bringing in the stat(1) patch.  (Even if it is a real-world use, the non-general optimization is questionable and we need to investigate why ncsd doesn't solve the problem first.)

> Does 'ls' do local caching?  If there is a big benefit in doing local-caching
> in 'stat', should we also do it with 'ls'?  I know I do a lot more 'ls -l's
> than I do 'stat'-commands of any kind.

Another great question, and the answer is yes — through a 3rd method.  ls uses user_from_uid(), group_from_gid(), which caches any implementation of getpwuid() / getgrgid().  Fortunately, libc's default implementation of these routines already uses _nsdispatch(), so it is not limited to local /etc/passwd.

So tl;dr: Instead of this patch, stat(1) should just use user_from_uid(3) / group_from_gid(3) (pwcache(3) API).
Comment 9 Conrad Meyer freebsd_committer 2018-08-11 02:36:17 UTC
And, heh, the need to cache id->name resolution was recognized *long* ago:

HISTORY
     The user_from_uid() and group_from_gid() functions first appeared in
     4.4BSD.

:-D.
Comment 10 Conrad Meyer freebsd_committer 2018-08-11 02:42:27 UTC
Created attachment 196068 [details]
Untested counter-proposal using existing pwcache(3) mechanism

I have not tested this patch, other than to verify it compiles.
Comment 11 Conrad Meyer freebsd_committer 2018-08-11 02:47:03 UTC
Here's my non-scientific results for my proposed patch (on CURRENT GENERIC, i.e., WITNESS and INVARIANTS are on, which introduces wider error bars than without, but still):

BEFORE
conrad@n /u/s/u/stat ❯❯❯ time (find /usr/src -type f -print0 | xargs -0 stat >/dev/null)
( find /usr/src -type f -print0 | xargs -0 stat > /dev/null; )  3.62s user 5.23s system 102% cpu 8.655 total
conrad@n /u/s/u/stat ❯❯❯ time (find /usr/src -type f -print0 | xargs -0 stat >/dev/null)
( find /usr/src -type f -print0 | xargs -0 stat > /dev/null; )  3.47s user 5.38s system 102% cpu 8.647 total

AFTER
conrad@n /u/s/u/stat ❯❯❯ time (find /usr/src -type f -print0 | xargs -0 $(make -V .OBJDIR)/stat >/dev/null)
( find /usr/src -type f -print0 | xargs -0 $(make -V .OBJDIR)/stat > /dev/nul)  1.23s user 1.81s system 108% cpu 2.810 total
conrad@n /u/s/u/stat ❯❯❯ time (find /usr/src -type f -print0 | xargs -0 $(make -V .OBJDIR)/stat >/dev/null)
( find /usr/src -type f -print0 | xargs -0 $(make -V .OBJDIR)/stat > /dev/nul)  1.43s user 1.54s system 107% cpu 2.754 total

(I gave both programs a throw-away warm-up run first to make sure the disk caches were comparably primed, etc.)

Seems like a clear win.
Comment 12 commit-hook freebsd_committer 2018-08-11 02:57:45 UTC
A commit references this bug:

Author: cem
Date: Sat Aug 11 02:56:44 UTC 2018
New revision: 337600
URL: https://svnweb.freebsd.org/changeset/base/337600

Log:
  stat(1): cache id->name resolution

  When invoked on a large list of files, it is most common for a small number of
  uids/gids to own most of the results.

  Like ls(1), use pwcache(3) to avoid repeatedly looking up the same IDs.

  Example microbenchmark and non-scientific results:

  $ time (find /usr/src -type f -print0 | xargs -0 stat >/dev/null)

  BEFORE:
  3.62s user 5.23s system 102% cpu 8.655 total
  3.47s user 5.38s system 102% cpu 8.647 total

  AFTER:
  1.23s user 1.81s system 108% cpu 2.810 total
  1.43s user 1.54s system 107% cpu 2.754 total

  Does this microbenchmark have any real-world significance?  Until a use case
  is demonstrated otherwise, I doubt it.  Ordinarily I would be resistant to
  optimizing pointless microbenchmarks in base utilities (e.g., recent totally
  gratuitous changes to yes(1)).  However, the pwcache(3) APIs actually
  simplify stat(1) logic ever so slightly compared to the raw APIs they wrap,
  so I think this is at worst harmless.

  PR:		230491
  Reported by:	Thomas Hurst <tom AT hur.st>
  Discussed with:	gad@

Changes:
  head/usr.bin/stat/stat.c
Comment 13 Conrad Meyer freebsd_committer 2018-08-11 02:59:27 UTC
Thomas, please try the proposed patch and report back if it is adequate for your needs.  If it is still significantly worse than the simpler cache, I'd like your help to understand (1) the use case that benefits from this microbenchmark and (2) the magnitude of the difference.

Thanks again for the report and patch.

And of course, thanks to Garance as well for the productive discussion and suggestion to refer to ls(1).
Comment 14 Thomas Hurst 2018-08-11 03:08:18 UTC
11.2-STABLE, consistent across several runs:

Conrad's patch: 8.792 real, 4.506 user, 4.788 sys;  page: 0 hard/13520 soft, swap: 0, I/O: 8818/0

My patch: 8.577 real, 4.547 user, 4.525 sys;  page: 0 hard/13264 soft, swap: 0, I/O: 8818/0

Looks good, particularly given it makes the code more straight-forward.

Reducing thread count improves performance with nscd, but at "threads 1" it's still 52s vs 22s uncached.
Comment 15 Thomas Hurst 2018-08-11 03:24:40 UTC
As for the use-case, here's the initial report:

> I'm seeing really, *REALLY* bad performance from stat(1) on FreeNAS. Case in point: a `find . -print0 | xargs -0 shasum >file.sha` takes about 7.5h on dirtree with 1.9TiB and 24.5M
>
> while a `find . -print0 | xargs -0 stat -f '%d %i %#p %l %u %g %r %z %a %m %c %B %k %b %f %N%SY' >file.stat` takes 21 hours
>
> almost TRIPLE the time for reading just the metadata versus reading ALL THE DATA, seems really weird!

Seems like they were making a catalogue of the metadata of a large set of small files.
Comment 16 Conrad Meyer freebsd_committer 2018-08-11 03:47:40 UTC
(In reply to Thomas Hurst from comment #15)
Thanks, that helps.  It still seems like a weird way to go about it (smells like a hand-rolled backup solution or something) but it's more concrete.