The wc(1) command has several optimizations to run as fast as possible. However, in some use cases it is still slow, much slower than the GNU wc command.

I used the OpenStreetMap database dump planet-latest.osm.bz2 (from https://wiki.openstreetmap.org/wiki/Planet.osm), which is a 61 GB bzip'd XML file, and checked how large the uncompressed XML is, on a 32-CPU machine:

# FreeBSD wc
$ pbzip2 -dc planet-latest.osm.bz2 | time wc -c
 908171295050
     4729.53 real      4400.69 user       199.34 sys

The wc(1) command was running at 100% CPU time, and pbzip2 was using only 500% CPU time. I ran the tests again with GNU wc. This time wc was using only 20% CPU time, and pbzip2 around 3000%.

# GNU wc
$ pbzip2 -dc planet-latest.osm.bz2 | time gwc -c
 908171295050
     2003.15 real         8.86 user       355.53 sys

The FreeBSD wc(1) command uses roughly 500 times more user time than GNU wc (4400 s vs. 9 s), although somewhat less system time (199 s vs. 355 s). The bottleneck was not pbzip2, it was wc. We should check why the wc -c optimization is not applied when reading from stdin.
In case you don't want to download 61 GB for testing, here is a much simpler test case using dd(1):

# FreeBSD wc
$ dd if=/dev/zero of=/dev/stdout bs=1m count=2000 | time wc -c
2000+0 records in
2000+0 records out
2097152000 bytes transferred in 9.587758 secs (218732260 bytes/sec)
2097152000
        9.58 real         9.15 user         0.39 sys

# GNU wc
$ dd if=/dev/zero of=/dev/stdout bs=1m count=2000 | time gwc -c
2000+0 records in
2000+0 records out
2097152000 bytes transferred in 0.821867 secs (2551693015 bytes/sec)
2097152000
        0.82 real         0.04 user         0.40 sys
wc(1) uses a stack buffer of size MAXBSIZE (64 kB). Increasing this may help (and the larger buffer should move to the heap). Secondly, there is an optimization for counting lines, and that same loop also counts characters, but it is not used if wc is only asked to count characters! Silly. It's also not used if wc is asked to read stdin! Stupid.

Just fixing the stdin + character-count cases gives much better results, comparable to GNU wc:

2097152000
~/obj/usr/home/conrad/src/freebsd/amd64.amd64/usr.bin/wc/wc -c  0.01s user 0.43s system 45% cpu 0.964 total

Bumping the buffer size to 4 MB yields a big improvement in system time. (Note that the dd size was increased 10x.)

Before:
20971520000
~/obj/usr/home/conrad/src/freebsd/amd64.amd64/usr.bin/wc/wc -c  0.14s user 3.99s system 42% cpu 9.653 total

After:
20971520000
~/obj/usr/home/conrad/src/freebsd/amd64.amd64/usr.bin/wc/wc -c  0.12s user 1.90s system 40% cpu 4.954 total

GNU wc is actually worse:
20971520000
gwc -c  0.21s user 2.91s system 48% cpu 6.490 total

Here is the PoC patch (whitespace changes elided (-w) for legibility). Note that it leaks memory. 4 MB may be totally inappropriate for small devices, too.

--- a/usr.bin/wc/wc.c
+++ b/usr.bin/wc/wc.c
@@ -199,15 +199,17 @@ cnt(const char *file)
 	size_t clen;
 	short gotsp;
 	u_char *p;
-	u_char buf[MAXBSIZE];
+	u_char *buf;
 	wchar_t wch;
 	mbstate_t mbs;
 
+#define MY_BUF_SIZE (4 * 1024 * 1024)
+	buf = malloc(MY_BUF_SIZE);
+
 	linect = wordct = charct = llct = tmpll = 0;
 	if (file == NULL)
 		fd = STDIN_FILENO;
-	else {
-		if ((fd = open(file, O_RDONLY, 0)) < 0) {
+	else if ((fd = open(file, O_RDONLY, 0)) < 0) {
 		xo_warn("%s: open", file);
 		return (1);
 	}
@@ -218,8 +220,8 @@ cnt(const char *file)
 	 * lines than to get words, since the word count requires some
 	 * logic.
 	 */
-	if (doline) {
-		while ((len = read(fd, buf, MAXBSIZE))) {
+	if (doline || dochar) {
+		while ((len = read(fd, buf, MY_BUF_SIZE))) {
 			if (len == -1) {
 				xo_warn("%s: read", file);
 				(void)close(fd);
@@ -230,6 +232,7 @@ cnt(const char *file)
 				    llct);
 			}
 			charct += len;
+			if (doline) {
 			for (p = buf; len--; ++p)
 				if (*p == '\n') {
 					if (tmpll > llct)
@@ -239,7 +242,9 @@ cnt(const char *file)
 				} else
 					tmpll++;
 		}
+			}
 		reset_siginfo();
+		if (doline)
 		tlinect += linect;
 		if (dochar)
 			tcharct += charct;
@@ -270,13 +275,12 @@ cnt(const char *file)
 			return (0);
 		}
 	}
-	}
 	/* Do it the hard way... */
 word:	gotsp = 1;
 	warned = 0;
 	memset(&mbs, 0, sizeof(mbs));
-	while ((len = read(fd, buf, MAXBSIZE)) != 0) {
+	while ((len = read(fd, buf, MY_BUF_SIZE)) != 0) {
 		if (len == -1) {
 			xo_warn("%s: read", file != NULL ? file : "stdin");
 			(void)close(fd);
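For reference, the logic of the expanded fast path can be sketched as a standalone C function (illustrative only; the function name, the 4 MB constant, and the memchr-based scan are my own, not the committed wc.c code):

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_BUF_SIZE	(4 * 1024 * 1024)	/* hypothetical 4 MB read buffer */

/*
 * Fast-path sketch: count bytes and newlines in one pass over fd.
 * Byte counting is free (just sum the read(2) return values); line
 * counting only needs a scan for '\n', done here with memchr(3).
 * No wide-character conversion is involved.  Returns 0 on success,
 * -1 on read error.
 */
static int
count_stream(int fd, long long *bytes, long long *lines)
{
	unsigned char *buf, *p, *nl;
	ssize_t len;
	size_t left;

	if ((buf = malloc(SKETCH_BUF_SIZE)) == NULL)
		return (-1);
	*bytes = *lines = 0;
	while ((len = read(fd, buf, SKETCH_BUF_SIZE)) > 0) {
		*bytes += len;			/* the whole '-c' fast path */
		p = buf;
		left = (size_t)len;
		while ((nl = memchr(p, '\n', left)) != NULL) {
			(*lines)++;		/* the '-l' fast path */
			left -= (size_t)(nl - p) + 1;
			p = nl + 1;
		}
	}
	free(buf);
	return (len < 0 ? -1 : 0);
}
```

Driving this with fd = STDIN_FILENO and printing *bytes mimics what the patched wc -c does on a pipe.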
Hi Conrad,

thanks for your patch. It works great for me. Some notes:

I think that changing MAXBSIZE is a red herring. The real limit is in the kernel and the IPC/pipe system calls - not my department. We should not change the defaults until there is a good reason. I ran your patch with 64 kB, 1 MB and 4 MB buffer sizes and I do not see a difference at all.

You wrote that your patch "leaks memory". How did you measure this? I see that the memory usage goes up from 2 MB to 6 MB, but stays at that value. I'm using `time -l` to check the maximum resident set size:

dd if=/dev/zero of=/dev/stdout bs=1m count=30000 | time -l /usr/bin/wc -c
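For the record, the "maximum resident set size" that time -l prints comes from the ru_maxrss field of getrusage(2); a minimal sketch of reading the same number from inside a process (illustrative, not part of the patch):

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Return the process's maximum resident set size as reported by
 * getrusage(2), or -1 on error.  On FreeBSD and Linux, ru_maxrss is
 * in kilobytes; this is the value time -l reports as "maximum
 * resident set size".
 */
static long
max_rss(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return (-1);
	return (ru.ru_maxrss);
}
```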
(In reply to Wolfram Schneider from comment #3)

> thanks for your patch. It works great for me.

Good :-).

> I think that changing MAXBSIZE is a red herring. The real limit is in the
> kernel and IPC/pipe system call - not my department.

My reading of the kernel pipe code suggests larger transfers are possible and MAXBSIZE is not a kernel limit. If there is a kernel limit, I'm not sure what it is or where to find it.

> We should not change the
> defaults until there is a good reason. I run your patch with 64k, 1MB and 4MB
> bufsize and I do not see a difference at all.

I saw hugely better sys and total time with a larger buffer (on CURRENT), as noted in my comment:

> Before:
> wc -c 0.14s user 3.99s system 42% cpu 9.653 total
> After:
> wc -c 0.12s user 1.90s system 40% cpu 4.954 total
                   ^^^^^^^^^^^^             ^^^^^^^^^^^

However, this was tested on a GENERIC kernel, and perhaps all of that per-syscall overhead is due to INVARIANTS/WITNESS. The larger buffer just lowers the number of syscalls invoked. NODEBUG kernels will see a similar syscall reduction, but with less overhead per call it may be less noticeable. Note that I also bumped up the input size from dd by a factor of ten until there was an observable difference.

> You wrote that your patch "leaks memory". How did you measure this? I see that
> the memory usage goes up from 2MB to 6MB, but stays at this value.

No need to measure; it is a fact. The heap-allocated buffer "buf" is never released. This can be observed by invoking my patched wc against many input files: each input file gets a buffer allocated and never freed. For a single input file (or stdin) it doesn't make a difference, as process exit happens right after where the free() would have been. Good enough for a PoC, but it needs to be fixed before commit.
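One simple way to fix the leak described above (a hypothetical sketch, not the committed fix) is to allocate the buffer once and reuse it for every file cnt() processes:

```c
#include <stdlib.h>

#define SKETCH_BUF_SIZE	(4 * 1024 * 1024)	/* same hypothetical 4 MB */

/*
 * Lazily allocate one shared read buffer and return the same pointer
 * on every call, so processing many input files no longer leaks one
 * buffer per file.  (The alternative is a free(buf) on every exit
 * path of cnt(), which is more churn for the same effect.)
 */
static unsigned char *
get_read_buf(void)
{
	static unsigned char *buf = NULL;

	if (buf == NULL)
		buf = malloc(SKETCH_BUF_SIZE);
	return (buf);	/* NULL only if malloc(3) failed */
}
```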
A commit references this bug:

Author: cem
Date: Sat Dec 9 21:55:20 UTC 2017
New revision: 326736
URL: https://svnweb.freebsd.org/changeset/base/326736

Log:
  wc(1): Extend non-controversial optimizations to '-c' mode

  wc(1)'s slow path for counting words or multibyte characters requires
  conversion of the 8-bit input stream to wide characters. However, a
  faster path can be used for counting only lines ('-l' -- newlines have
  the same representation in all supported encodings) or bytes ('-c').

  The existing line count optimization was not used if the input was the
  implicit stdin. Additionally, it wasn't used if only byte counting was
  requested. This change expands the fast path to both of these
  scenarios.

  Expanding the buffer size from 64 kB helps reduce the number of
  read(2) calls needed, but exactly what impact that change has and what
  size to expand the buffer to are still under discussion.

  PR:           224160
  Tested by:    wosch (earlier version)
  Sponsored by: Dell EMC Isilon

Changes:
  head/usr.bin/wc/wc.c
I ran a test and I see no differences for the buffer size. dd(1) writes at 5 GB/s, and wc(1) can now read at 5 GB/s. The kernel is 100% busy with system time, 50% for each process. I don't see much room for improvement here.

My test results for 64k, 256k, 1m and 4m buffer sizes, for 10, 20, 50, 100 and 500 GB of data:

$ make -f Makefile.test clean all test
10000 wc-64k:     1.61 real  0.01 user  0.85 sys
10000 wc-256k:    1.62 real  0.03 user  0.85 sys
10000 wc-1m:      1.66 real  0.08 user  0.81 sys
10000 wc-4m:      1.66 real  0.03 user  0.84 sys
20000 wc-64k:     3.24 real  0.11 user  1.62 sys
20000 wc-256k:    3.30 real  0.16 user  1.58 sys
20000 wc-1m:      3.29 real  0.12 user  1.64 sys
20000 wc-4m:      3.27 real  0.05 user  1.70 sys
50000 wc-64k:     8.04 real  0.23 user  4.07 sys
50000 wc-256k:    8.38 real  0.22 user  4.22 sys
50000 wc-1m:      8.23 real  0.20 user  4.19 sys
50000 wc-4m:      8.13 real  0.30 user  4.07 sys
100000 wc-64k:   16.34 real  0.34 user  8.44 sys
100000 wc-256k:  16.33 real  0.39 user  8.42 sys
100000 wc-1m:    16.31 real  0.43 user  8.41 sys
100000 wc-4m:    16.37 real  0.43 user  8.28 sys
500000 wc-64k:   81.72 real  2.30 user 41.53 sys
500000 wc-256k:  83.30 real  2.25 user 42.07 sys
500000 wc-1m:    82.31 real  2.27 user 42.11 sys
500000 wc-4m:    80.62 real  2.46 user 40.72 sys
Created attachment 188685 [details] makefile for testing buffer size
PS: I ran the tests on FreeBSD 12.0-CURRENT with a GENERIC kernel (no debug).
A commit references this bug:

Author: cem
Date: Sun Dec 10 17:56:04 UTC 2017
New revision: 326749
URL: https://svnweb.freebsd.org/changeset/base/326749

Log:
  wc(1): Restore regular file char count fast path

  fstat(2) is going to be a lot faster than reading all of the bytes in
  a file, if we just need a character count for a regular file. This
  fast path was accidentally broken in r326736.

  PR:           224160
  Reported by:  bde
  Sponsored by: Dell EMC Isilon

Changes:
  head/usr.bin/wc/wc.c
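The restored fast path boils down to: if only a byte count is wanted and the descriptor refers to a regular file, take st_size from fstat(2) instead of reading any data. A minimal sketch (illustrative; not the committed wc.c code):

```c
#include <sys/stat.h>

/*
 * '-c' fast-path sketch: if fd is a regular file, its byte count is
 * available from fstat(2) without reading anything.  Returns the size,
 * or -1 if the caller must fall back to the read(2) loop (pipes,
 * sockets, character devices, ...).
 */
static long long
regular_file_size(int fd)
{
	struct stat sb;

	if (fstat(fd, &sb) != 0 || !S_ISREG(sb.st_mode))
		return (-1);
	return ((long long)sb.st_size);
}
```

This is why wc -c on a large regular file returns instantly, while the pbzip2 pipeline in this PR always has to take the read(2) path.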
(In reply to Wolfram Schneider from comment #6)

Ok, thanks. It must be the DEBUG stuff that is making such a difference for me. If we don't need to extend the buffer size, I think we're all set. We could still move the buffer off the stack, but I don't feel the need to do so now.

If you agree this is resolved, I'll close this as fixed.
Here are the updated test results for the patched wc. As you can see, the FreeBSD wc now performs slightly better than the GNU wc (less user and system time).

# GNU wc
$ nice -5 pbzip2 -dc planet-latest.osm.bz2 | time gwc -c
 927854543741
     1854.25 real         9.72 user       353.53 sys

# FreeBSD wc
$ nice -5 pbzip2 -dc planet-latest.osm.bz2 | time fwc -c
 927854543741
     1896.93 real         8.01 user       288.43 sys