Created attachment 196041 [details]
stat(1) caching (applies to STABLE and CURRENT)

This patch for stat(1) adds very simple caching to its calls to getpwuid() and getgrgid(), for a substantial performance boost when user/group is displayed.

With this test command on a cached ZFS dataset:

time (find /usr/src -type f -print0 | xargs -0 stat >/dev/null)

I see run time reduce from 22 seconds to 8.5 seconds.

On the OpenBSD CVS repository, with files owned by a normal user, I see an even more pronounced difference: 55 seconds reduced to 14. I guess these functions have runtime proportional to how far down the user/group lists they look?

sysutils/coreutils gnustat shows identical performance to stock FreeBSD stat(1), so it could probably benefit from similar changes.

Thanks to Vall on FreeNode #freebsd for reporting the performance problem.
The patch looks functionally OK to me (it works because stat is a single-threaded program), though it has some style issues.

It's unclear to me whether we want to add caching to every individual tool accessing these databases, or whether the caching should just happen in libc. One other concern that the thought of caching in libc raises is: how do we handle cache invalidation? Obviously we don't care if a user/group change races a running but ultimately bounded program like stat(1) across many files, but we probably do care if a daemon never sees renamed or renumbered ids.

Btw, it seems such caching is already implemented in libc via the NS_CACHING option.
The built-in caching doesn't seem useful for this case. Enabling it results in a test run creating this unfortunate entry in top:

63957 root 9 52 0 24000K 4696K uwait 19 9:15 268.91% nscd

and it actually *slows down* stat(1) in my tests: /usr/src goes from 22 seconds to almost 90, and my CVS repository scan goes from 55 seconds to 2 minutes.
Just out of curiosity: have you tried 'nscd' (the name service caching daemon)? I have never had reason to try it on my FreeBSD systems, but I've used it on some Linux systems, where it can produce a huge performance boost.
(In reply to Garance A Drosehn from comment #3)

See comment #2 :-). It's understandable given IPC overhead vs accessing a static variable.
(In reply to Garance A Drosehn from comment #3)

Ugh. I didn't notice that 'nscd' was mentioned in the response right before mine!

The performance of nscd might depend on how it's configured in /etc/nscd.conf. On the Linux systems where I use this, we have over 11000 userids, and turning on 'nscd' does make a huge difference in 'ls' and 'stat'. The resulting speed is *almost* as fast as doing 'ls -ln /sample/directory', for instance.
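For reference, FreeBSD's daemon is configured via nscd.conf(5). A minimal sketch of the relevant knobs, with illustrative values rather than recommendations (the attribute names are from the FreeBSD manual; defaults may differ):

```
# /etc/nscd.conf -- illustrative values, not tuned recommendations
enable-cache passwd yes
positive-time-to-live passwd 3600
enable-cache group yes
positive-time-to-live group 3600
# Worker thread count; lowering it was later tested in this thread.
threads 1
```

Whether any of these settings can close the gap against an in-process cache is exactly the question raised below.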
(In reply to Garance A Drosehn from comment #5)

For instance, on one sample directory:

time ls -ln <sample>
....
real 0m0.010s
user 0m0.001s
sys  0m0.007s

time ls -l <sample>   # on host with nscd
....
real 0m0.012s
user 0m0.002s
sys  0m0.006s

time ls -l <sample>   # on host without nscd
....
real 0m1.156s
user 0m0.003s
sys  0m0.009s
(In reply to Conrad Meyer from comment #4)

Just how much overhead is there in IPC? It's easy to see that a local cache will beat the caching done by a separate nscd process, but why would using nscd be so much slower than doing a totally non-cached lookup of the user and group names each time? Doesn't that seem a little odd?

And the local-to-stat caching: won't that only help if many files are 'stat'-ed in a single command, as opposed to running many stat commands with one file per command?

Does 'ls' do local caching? If there is a big benefit in doing local caching in 'stat', should we also do it with 'ls'? I know I do a lot more 'ls -l's than I do 'stat' commands of any kind.
(In reply to Garance A Drosehn from comment #5)
(In reply to Garance A Drosehn from comment #7)

> but why would using nscd be so much slower than doing a totally non-cached look
> up of the user and group names each time? Doesn't that seem a little odd?

You're totally right, that does seem suspect. Maybe our nscd isn't the same one as on Linux, or is an older version, or has some other deficiency? I admit total lack of familiarity with any version or use of nscd. :-)

> And the local-to-stat caching: won't that only help if many commands are
> 'stat'-ed in a single command, as opposed to doing many stat-commands with one
> file per command?

Correct. That's the use case described in Thomas' description, although I don't know if it represents any real-world workflow or if it's just an arbitrary pessimistic microbenchmark. If the latter, I don't think it's worth bringing in the stat(1) patch. (Even if it is a real-world use, the non-general optimization is questionable, and we need to investigate why nscd doesn't solve the problem first.)

> Does 'ls' do local caching? If there is a big benefit in doing local-caching
> in 'stat', should we also do it with 'ls'? I know I do a lot more 'ls -l's
> than I do 'stat'-commands of any kind.

Another great question, and the answer is yes, via a third mechanism. ls uses user_from_uid() / group_from_gid(), which cache the results of whatever getpwuid() / getgrgid() implementation is in use. Fortunately, libc's default implementation of these routines already uses _nsdispatch(), so it is not limited to the local /etc/passwd.

So, tl;dr: instead of this patch, stat(1) should just use user_from_uid(3) / group_from_gid(3) (the pwcache(3) API).
And, heh, the need to cache id->name resolution was recognized *long* ago:

HISTORY
     The user_from_uid() and group_from_gid() functions first appeared in
     4.4BSD.

:-D.
Created attachment 196068 [details]
Untested counter-proposal using existing pwcache(3) mechanism

I have not tested this patch, other than to verify it compiles.
Here are my non-scientific results for my proposed patch (on CURRENT GENERIC, i.e., WITNESS and INVARIANTS are on, which introduces wider error bars than without, but still):

BEFORE

conrad@n /u/s/u/stat ❯❯❯ time (find /usr/src -type f -print0 | xargs -0 stat >/dev/null)
( find /usr/src -type f -print0 | xargs -0 stat > /dev/null; )  3.62s user 5.23s system 102% cpu 8.655 total
conrad@n /u/s/u/stat ❯❯❯ time (find /usr/src -type f -print0 | xargs -0 stat >/dev/null)
( find /usr/src -type f -print0 | xargs -0 stat > /dev/null; )  3.47s user 5.38s system 102% cpu 8.647 total

AFTER

conrad@n /u/s/u/stat ❯❯❯ time (find /usr/src -type f -print0 | xargs -0 $(make -V .OBJDIR)/stat >/dev/null)
( find /usr/src -type f -print0 | xargs -0 $(make -V .OBJDIR)/stat > /dev/nul)  1.23s user 1.81s system 108% cpu 2.810 total
conrad@n /u/s/u/stat ❯❯❯ time (find /usr/src -type f -print0 | xargs -0 $(make -V .OBJDIR)/stat >/dev/null)
( find /usr/src -type f -print0 | xargs -0 $(make -V .OBJDIR)/stat > /dev/nul)  1.43s user 1.54s system 107% cpu 2.754 total

(I gave both programs a throw-away warm-up run first to make sure the disk caches were comparably primed, etc.)

Seems like a clear win.
A commit references this bug:

Author: cem
Date: Sat Aug 11 02:56:44 UTC 2018
New revision: 337600
URL: https://svnweb.freebsd.org/changeset/base/337600

Log:
  stat(1): cache id->name resolution

  When invoked on a large list of files, it is most common for a small
  number of uids/gids to own most of the results. Like ls(1), use
  pwcache(3) to avoid repeatedly looking up the same IDs.

  Example microbenchmark and non-scientific results:

  $ time (find /usr/src -type f -print0 | xargs -0 stat >/dev/null)

  BEFORE:
  3.62s user 5.23s system 102% cpu 8.655 total
  3.47s user 5.38s system 102% cpu 8.647 total

  AFTER:
  1.23s user 1.81s system 108% cpu 2.810 total
  1.43s user 1.54s system 107% cpu 2.754 total

  Does this microbenchmark have any real-world significance? Until a use
  case is demonstrated otherwise, I doubt it. Ordinarily I would be
  resistant to optimizing pointless microbenchmarks in base utilities
  (e.g., recent totally gratuitous changes to yes(1)). However, the
  pwcache(3) APIs actually simplify stat(1) logic ever so slightly
  compared to the raw APIs they wrap, so I think this is at worst
  harmless.

  PR: 230491
  Reported by: Thomas Hurst <tom AT hur.st>
  Discussed with: gad@

Changes:
  head/usr.bin/stat/stat.c
Thomas, please try the proposed patch and report back if it is adequate for your needs. If it is still significantly worse than the simpler cache, I'd like your help to understand (1) the use case that benefits from this microbenchmark and (2) the magnitude of the difference. Thanks again for the report and patch. And of course, thanks to Garance as well for the productive discussion and suggestion to refer to ls(1).
11.2-STABLE, consistent across several runs:

Conrad's patch: 8.792 real, 4.506 user, 4.788 sys; page: 0 hard/13520 soft, swap: 0, I/O: 8818/0
My patch:       8.577 real, 4.547 user, 4.525 sys; page: 0 hard/13264 soft, swap: 0, I/O: 8818/0

Looks good, particularly given it makes the code more straightforward.

Reducing thread count improves performance with nscd, but at "threads 1" it's still 52s vs 22s uncached.
As for the use-case, here's the initial report:

> I'm seeing really, *REALLY* bad performance from stat(1) on FreeNAS. Case in
> point: a `find . -print0 | xargs -0 shasum >file.sha` takes about 7.5h on
> dirtree with 1.9TiB and 24.5M
>
> while a `find . -print0 | xargs -0 stat -f '%d %i %#p %l %u %g %r %z %a %m %c %B %k %b %f %N%SY' >file.stat` takes 21 hours
>
> almost TRIPLE the time for reading just the metadata versus reading ALL THE
> DATA, seems really weird!

Seems like they were making a catalogue of the metadata of a large set of small files.
(In reply to Thomas Hurst from comment #15)

Thanks, that helps. It still seems like a weird way to go about it (smells like a hand-rolled backup solution or something), but it's more concrete.