The wc(1) command has several optimizations to run as fast as possible. However, in some use cases it is still slow, much slower than the GNU wc command.

I used the OpenStreetMap database dump planet-latest.osm.bz2 (from https://wiki.openstreetmap.org/wiki/Planet.osm), which is a 61 GB bzip'd XML file, and checked how large the uncompressed XML is, on a 32-CPU machine:

# FreeBSD wc
$ pbzip2 -dc planet-latest.osm.bz2 | time wc -c
 908171295050
     4729.53 real      4400.69 user       199.34 sys

The wc(1) command was running at 100% CPU time, and pbzip2 was using only 500% CPU time. I ran the tests again with GNU wc. This time wc was using only 20% CPU time, and pbzip2 around 3000%.

# GNU wc
$ pbzip2 -dc planet-latest.osm.bz2 | time gwc -c
 908171295050
     2003.15 real         8.86 user       355.53 sys

The FreeBSD wc(1) command uses roughly 500 times more user time than GNU wc (4400 s vs. 9 s), although somewhat less system time (199 s vs. 355 s). The bottleneck was not pbzip2, it was wc. We should check why the wc -c optimization is not applied when reading from stdin.
In case you don't want to download 61 GB for testing, here is a much simpler test case using dd(1):

# FreeBSD wc
$ dd if=/dev/zero of=/dev/stdout bs=1m count=2000 | time wc -c
2000+0 records in
2000+0 records out
2097152000 bytes transferred in 9.587758 secs (218732260 bytes/sec)
2097152000
        9.58 real         9.15 user         0.39 sys

# GNU wc
$ dd if=/dev/zero of=/dev/stdout bs=1m count=2000 | time gwc -c
2000+0 records in
2000+0 records out
2097152000 bytes transferred in 0.821867 secs (2551693015 bytes/sec)
2097152000
        0.82 real         0.04 user         0.40 sys
wc(1) uses a stack buffer of size MAXBSIZE (64 kB). Increasing this may help (and the larger buffer should move to the heap). Secondly, there is an optimization for counting lines, and that same loop also counts characters, but it is not used if wc is only asked to count characters! Silly. It's also not used if wc is asked to read stdin! Stupid.

Just fixing the stdin + character-count cases gives much better results, comparable to GNU wc:

2097152000
~/obj/usr/home/conrad/src/freebsd/amd64.amd64/usr.bin/wc/wc -c  0.01s user 0.43s system 45% cpu 0.964 total

Bumping the buffer size to 4 MB yields a big improvement in system time. (Note that the dd size was increased 10x.)

Before:
20971520000
~/obj/usr/home/conrad/src/freebsd/amd64.amd64/usr.bin/wc/wc -c  0.14s user 3.99s system 42% cpu 9.653 total

After:
20971520000
~/obj/usr/home/conrad/src/freebsd/amd64.amd64/usr.bin/wc/wc -c  0.12s user 1.90s system 40% cpu 4.954 total

GNU wc is actually worse:
20971520000
gwc -c  0.21s user 2.91s system 48% cpu 6.490 total

Here is the PoC patch (whitespace changes elided (-w) for legibility). Note that it leaks memory. 4 MB may be totally inappropriate for small devices, too.

--- a/usr.bin/wc/wc.c
+++ b/usr.bin/wc/wc.c
@@ -199,15 +199,17 @@ cnt(const char *file)
 	size_t clen;
 	short gotsp;
 	u_char *p;
-	u_char buf[MAXBSIZE];
+	u_char *buf;
 	wchar_t wch;
 	mbstate_t mbs;
 
+#define MY_BUF_SIZE (4 * 1024 * 1024)
+	buf = malloc(MY_BUF_SIZE);
+
 	linect = wordct = charct = llct = tmpll = 0;
 	if (file == NULL)
 		fd = STDIN_FILENO;
-	else {
-		if ((fd = open(file, O_RDONLY, 0)) < 0) {
+	else if ((fd = open(file, O_RDONLY, 0)) < 0) {
 		xo_warn("%s: open", file);
 		return (1);
 	}
@@ -218,8 +220,8 @@ cnt(const char *file)
 	 * lines than to get words, since the word count requires some
 	 * logic.
 	 */
-	if (doline) {
-		while ((len = read(fd, buf, MAXBSIZE))) {
+	if (doline || dochar) {
+		while ((len = read(fd, buf, MY_BUF_SIZE))) {
 			if (len == -1) {
 				xo_warn("%s: read", file);
 				(void)close(fd);
@@ -230,6 +232,7 @@ cnt(const char *file)
 				    llct);
 			}
 			charct += len;
+			if (doline) {
 			for (p = buf; len--; ++p)
 				if (*p == '\n') {
 					if (tmpll > llct)
@@ -239,7 +242,9 @@ cnt(const char *file)
 				} else
 					tmpll++;
 		}
+			}
 		reset_siginfo();
+		if (doline)
 		tlinect += linect;
 		if (dochar)
 			tcharct += charct;
@@ -270,13 +275,12 @@ cnt(const char *file)
 			return (0);
 		}
 	}
-	}
 	/* Do it the hard way... */
 word:	gotsp = 1;
 	warned = 0;
 	memset(&mbs, 0, sizeof(mbs));
-	while ((len = read(fd, buf, MAXBSIZE)) != 0) {
+	while ((len = read(fd, buf, MY_BUF_SIZE)) != 0) {
 		if (len == -1) {
 			xo_warn("%s: read", file != NULL ? file : "stdin");
 			(void)close(fd);
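For reference, the logic of the expanded fast path can be sketched as a standalone C function (illustrative only; the function name, the 4 MB constant, and the memchr-based scan are my own, not the committed wc.c code):

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_BUF_SIZE	(4 * 1024 * 1024)	/* hypothetical 4 MB read buffer */

/*
 * Fast-path sketch: count bytes and newlines in one pass over fd.
 * Byte counting is free (just sum the read(2) return values); line
 * counting only needs a scan for '\n', done here with memchr(3).
 * No wide-character conversion is involved.  Returns 0 on success,
 * -1 on read error.
 */
static int
count_stream(int fd, long long *bytes, long long *lines)
{
	unsigned char *buf, *p, *nl;
	ssize_t len;
	size_t left;

	if ((buf = malloc(SKETCH_BUF_SIZE)) == NULL)
		return (-1);
	*bytes = *lines = 0;
	while ((len = read(fd, buf, SKETCH_BUF_SIZE)) > 0) {
		*bytes += len;			/* the whole '-c' fast path */
		p = buf;
		left = (size_t)len;
		while ((nl = memchr(p, '\n', left)) != NULL) {
			(*lines)++;		/* the '-l' fast path */
			left -= (size_t)(nl - p) + 1;
			p = nl + 1;
		}
	}
	free(buf);
	return (len < 0 ? -1 : 0);
}
```

Driving this with fd = STDIN_FILENO and printing *bytes mimics what the patched wc -c does on a pipe.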
Hi Conrad,

thanks for your patch. It works great for me. Some notes:

I think that changing MAXBSIZE is a red herring. The real limit is in the kernel and the IPC/pipe system calls - not my department. We should not change the defaults until there is a good reason. I ran your patch with 64 kB, 1 MB and 4 MB buffer sizes and I do not see a difference at all.

You wrote that your patch "leaks memory". How did you measure this? I see that the memory usage goes up from 2 MB to 6 MB, but stays at that value. I'm using `time -l` to check the maximum resident set size:

dd if=/dev/zero of=/dev/stdout bs=1m count=30000 | time -l /usr/bin/wc -c
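For the record, the "maximum resident set size" that time -l prints comes from the ru_maxrss field of getrusage(2); a minimal sketch of reading the same number from inside a process (illustrative, not part of the patch):

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Return the process's maximum resident set size as reported by
 * getrusage(2), or -1 on error.  On FreeBSD and Linux, ru_maxrss is
 * in kilobytes; this is the value time -l reports as "maximum
 * resident set size".
 */
static long
max_rss(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return (-1);
	return (ru.ru_maxrss);
}
```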
(In reply to Wolfram Schneider from comment #3)

> thanks for your patch. It works great for me.

Good :-).

> I think that changing MAXBSIZE is a red herring. The real limit is in the
> kernel and IPC/pipe system call - not my department.

My reading of the kernel pipe code suggests larger transfers are possible and MAXBSIZE is not a kernel limit. If there is a kernel limit, I'm not sure what it is or where to find it.

> We should not change the
> defaults until there is a good reason. I run your patch with 64k, 1MB and 4MB
> bufsize and I do not see a difference at all.

I saw hugely better sys and total time with a larger buffer (on CURRENT), as noted in my comment:

> Before:
> wc -c 0.14s user 3.99s system 42% cpu 9.653 total
> After:
> wc -c 0.12s user 1.90s system 40% cpu 4.954 total
                   ^^^^^^^^^^^^             ^^^^^^^^^^^

However, this was tested on a GENERIC kernel, and perhaps all of that per-syscall overhead is due to INVARIANTS/WITNESS. The larger buffer just lowers the number of syscalls invoked. NODEBUG kernels will see a similar syscall reduction, but with less overhead per call it may be less noticeable. Note that I also bumped up the input size from dd by a factor of ten until there was an observable difference.

> You wrote that your patch "leaks memory". How did you measure this? I see that
> the memory usage goes up from 2MB to 6MB, but stays at this value.

No need to measure; it is a fact. The heap-allocated buffer "buf" is never released. This can be observed by invoking my patched wc against many input files: each input file gets a buffer allocated and never freed. For a single input file (or stdin) it doesn't make a difference, as process exit happens right after where the free() would have been. Good enough for a PoC, but it needs to be fixed before commit.
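One simple way to fix the leak described above (a hypothetical sketch, not the committed fix) is to allocate the buffer once and reuse it for every file cnt() processes:

```c
#include <stdlib.h>

#define SKETCH_BUF_SIZE	(4 * 1024 * 1024)	/* same hypothetical 4 MB */

/*
 * Lazily allocate one shared read buffer and return the same pointer
 * on every call, so processing many input files no longer leaks one
 * buffer per file.  (The alternative is a free(buf) on every exit
 * path of cnt(), which is more churn for the same effect.)
 */
static unsigned char *
get_read_buf(void)
{
	static unsigned char *buf = NULL;

	if (buf == NULL)
		buf = malloc(SKETCH_BUF_SIZE);
	return (buf);	/* NULL only if malloc(3) failed */
}
```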
A commit references this bug:

Author: cem
Date: Sat Dec 9 21:55:20 UTC 2017
New revision: 326736
URL: https://svnweb.freebsd.org/changeset/base/326736

Log:
  wc(1): Extend non-controversial optimizations to '-c' mode

  wc(1)'s slow path for counting words or multibyte characters requires
  conversion of the 8-bit input stream to wide characters. However, a
  faster path can be used for counting only lines ('-l' -- newlines have
  the same representation in all supported encodings) or bytes ('-c').

  The existing line count optimization was not used if the input was the
  implicit stdin. Additionally, it wasn't used if only byte counting was
  requested. This change expands the fast path to both of these
  scenarios.

  Expanding the buffer size from 64 kB helps reduce the number of
  read(2) calls needed, but exactly what impact that change has and what
  size to expand the buffer to are still under discussion.

  PR:           224160
  Tested by:    wosch (earlier version)
  Sponsored by: Dell EMC Isilon

Changes:
  head/usr.bin/wc/wc.c
I ran a test and I see no differences for the buffer size. dd(1) writes at 5 GB/s, and wc(1) can now read at 5 GB/s. The kernel is 100% busy with system time, 50% for each process. I don't see much room for improvement here.

My test results for 64k, 256k, 1m and 4m buffer sizes, for 10, 20, 50, 100 and 500 GB of data:

$ make -f Makefile.test clean all test
10000 wc-64k:     1.61 real  0.01 user  0.85 sys
10000 wc-256k:    1.62 real  0.03 user  0.85 sys
10000 wc-1m:      1.66 real  0.08 user  0.81 sys
10000 wc-4m:      1.66 real  0.03 user  0.84 sys
20000 wc-64k:     3.24 real  0.11 user  1.62 sys
20000 wc-256k:    3.30 real  0.16 user  1.58 sys
20000 wc-1m:      3.29 real  0.12 user  1.64 sys
20000 wc-4m:      3.27 real  0.05 user  1.70 sys
50000 wc-64k:     8.04 real  0.23 user  4.07 sys
50000 wc-256k:    8.38 real  0.22 user  4.22 sys
50000 wc-1m:      8.23 real  0.20 user  4.19 sys
50000 wc-4m:      8.13 real  0.30 user  4.07 sys
100000 wc-64k:   16.34 real  0.34 user  8.44 sys
100000 wc-256k:  16.33 real  0.39 user  8.42 sys
100000 wc-1m:    16.31 real  0.43 user  8.41 sys
100000 wc-4m:    16.37 real  0.43 user  8.28 sys
500000 wc-64k:   81.72 real  2.30 user 41.53 sys
500000 wc-256k:  83.30 real  2.25 user 42.07 sys
500000 wc-1m:    82.31 real  2.27 user 42.11 sys
500000 wc-4m:    80.62 real  2.46 user 40.72 sys
Created attachment 188685 [details] makefile for testing buffer size
PS: I ran the tests on FreeBSD 12.0-CURRENT with a GENERIC kernel (no debug).
A commit references this bug:

Author: cem
Date: Sun Dec 10 17:56:04 UTC 2017
New revision: 326749
URL: https://svnweb.freebsd.org/changeset/base/326749

Log:
  wc(1): Restore regular file char count fast path

  fstat(2) is going to be a lot faster than reading all of the bytes in
  a file, if we just need a character count for a regular file. This
  fast path was accidentally broken in r326736.

  PR:           224160
  Reported by:  bde
  Sponsored by: Dell EMC Isilon

Changes:
  head/usr.bin/wc/wc.c
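The restored fast path boils down to: if only a byte count is wanted and the descriptor refers to a regular file, take st_size from fstat(2) instead of reading any data. A minimal sketch (illustrative; not the committed wc.c code):

```c
#include <sys/stat.h>

/*
 * '-c' fast-path sketch: if fd is a regular file, its byte count is
 * available from fstat(2) without reading anything.  Returns the size,
 * or -1 if the caller must fall back to the read(2) loop (pipes,
 * sockets, character devices, ...).
 */
static long long
regular_file_size(int fd)
{
	struct stat sb;

	if (fstat(fd, &sb) != 0 || !S_ISREG(sb.st_mode))
		return (-1);
	return ((long long)sb.st_size);
}
```

This is why wc -c on a large regular file returns instantly, while the pbzip2 pipeline in this PR always has to take the read(2) path.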
(In reply to Wolfram Schneider from comment #6)

Ok, thanks. It must be the DEBUG stuff that is making such a difference for me. If we don't need to extend the buffer size, I think we're all set. We could still move the buffer off the stack, but I don't feel the need to do so now.

If you agree this is resolved, I'll close this as fixed.
Here are the updated test results for the patched wc. As you can see, the FreeBSD wc now performs slightly better than the GNU wc (less user and system time).

# GNU wc
$ nice -5 pbzip2 -dc planet-latest.osm.bz2 | time gwc -c
 927854543741
     1854.25 real         9.72 user       353.53 sys

# FreeBSD wc
$ nice -5 pbzip2 -dc planet-latest.osm.bz2 | time fwc -c
 927854543741
     1896.93 real         8.01 user       288.43 sys