Created attachment 223801 [details] access_log.txt and exclude_ips.txt files With 13.0-RC4 I notice that I get very slow speed using grep with the follow parameters. With 12.2-RELEASE-p4 it is fast. Files used for the test are included in the attachment. ------------- FreeBSD 13.0-RC4: /usr/bin/time -l cat access_log.txt | grep -F -v -f exclude_ips.txt > output.log 358.92 real 0.00 user 0.00 sys 2200 maximum resident set size 8 average shared memory size 4 average unshared data size 128 average unshared stack size 148 page reclaims 0 page faults 0 swaps 0 block input operations 0 block output operations 1104 messages sent 0 messages received 0 signals received 138 voluntary context switches 0 involuntary context switches ------------- FreeBSD 12.2-RELEASE-p4: /usr/bin/time -l cat access_log.txt | grep -F -v -f exclude_ips.txt > output.log 0.18 real 0.00 user 0.00 sys 0 maximum resident set size 0 average shared memory size 0 average unshared data size 0 average unshared stack size 137 page reclaims 0 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 70 voluntary context switches 0 involuntary context switches
Another case of too slow a search is with -i option. Should I file a separate PR? On FreeBSD 13-STABLE, grep 2.6.0 search of /usr/ports/INDEX-13 with -i option is very slow compared to without ... # \time egrep -i --color '^yubi.?oath' INDEX-13 ... 47.53 real 44.65 user 0.41 sys # \time egrep --color '^yubi.?oath' INDEX-13 ... 0.14 real 0.01 user 0.12 sys ... option --color does not affect the time as there was only 1 result. Perl 5.32.1 (from Ports) does not color but the time output is similar to the second case of grep search above, but with case insensitive search ... # \time perl -n -E 'm{^yubi.?oath}i and say' INDEX-13
(In reply to parv from comment #1) Are you being careful to avoid the first grep from loading the file contents into OS buffers so the next grep does not have to read the media (as much)? For example, rebooting between tests? Or using files that are much larger than the RAM? One option is to reboot after the sequence and redo the sequence in a different order, such as after what you wrote doing the order: # \time egrep --color '^yubi.?oath' INDEX-13 # \time egrep -i --color '^yubi.?oath' INDEX-13 Variations in the result might give clues about the various contributions to the time taken.
(In reply to parv from comment #1) No, please. All of the slowness will have the same root cause for the most part.
(In reply to Mark Millard from comment #2) (A tolerable search would finish within ~1-2 second.) Re: caching of file resulting in faster search ... I created a 12 GB file filled with 0; searched for single letter via fgrep; caused swap to increase to ~1 GB (swap size is 8 GB); lots of crunchy noise ensued. When I had enough, I ran egrep sans -i option ... 1.16 real 0.07 user 0.30 sys ... and result after loading RAM as mentioned above for "egrep -i" ... 49.13 real 45.28 user 0.63 sys Relative time difference between two type of searches is rather consistent. There is ~4-5 second improvement (time result hover around 43-45 second) for "egrep -i" if it is run just after "egrep" or "egrep -i". FWIW, configuration here ... - FreeBSD 13-STABLE as guest of Windows 10 host in VirtualBox 5.2.44; - /src-build/ports/INDEX-13 file is on separate pool of ZFS than the pool of /usr/bin/egrep; - file systems are file-backed, allocated as SATA disks; - real memory: 8663 MB, avail memory: 7809 MB (of 16 GB RAM); - 2 "CPU"s have been allocated of Intel i5 6300U (Thinkpad X260).
(In reply to parv from comment #4) I should have also mentioned that above egrep were tested with 2 pairs of locale configuration (time result is same enough) ... # root user. LANG=C.UTF-8 # Not root user. LANG=en_US.UTF-8 LC_COLLATE=C LC_CTYPE=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 User CPU time of "egrep -i" improves with C locale ("env LANG=C LC_ALL=C egrep -i ...") ... 38.24 real 36.44 user 0.61 sys # Just after loading RAM as mentioned in #4. 44.32 real 38.10 user 0.59 sys # Successive runs ... 39.88 real 39.00 user 0.59 sys # ... after few minutes.
(In reply to parv from comment #4) Intersting. Another comparison/contrast would be to try a fairly large file but one that would fit in RAM buffers/caching, say 4 GB based on your description, again both orders, with a PC reboot between. Going in a different direction . . . My vague memories lead to looking up: VBoxManage storagectl "VM name" --name <controllername> --hostiocache off Have you tried both hostiocache off and on? If I remember right on was a default. There might be more virtualbox settings that might contribute to how the I/O is handled. I do not remember much about such things. But such settings might be appropriate to explore for if they make any notable differences.
(In reply to Mark Millard from comment #6) For the first part of comment #6: I think I misinterpreted your test notes last night. Ignore it. Sorry. The "--hostiocache off" part of the note would still apply. (Possibly other configuration settings as well.)
If you compile grep with gnuregex (like FreeBSD 12) it's okay. With gnuregex; (edit grep.h to include gnuregex.h instead of gnuregex.h) [root@SuperBSD /tmp/test]# cc -o grep *.c -lgnuregex -L/usr/local/lib -I/usr/local/include [root@SuperBSD /tmp/test]# time ./grep -i 'bla' /usr/ports/INDEX-13 | wc -l 1040 real 0m0.388s user 0m0.385s sys 0m0.008s [root@SuperBSD /tmp/test]# time grep -i 'bla' /usr/ports/INDEX-13 | wc -l 1040 real 0m3.428s user 0m3.430s sys 0m0.008s
For not being noted earlier here, I will paraphrase Kyle Envans kaevans91 on Twitter ... - root issue is known in regex(3) library (see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223553 also); - grep(1) looks at each byte & for new lines; - replacing regex(3) is a huge task, requires massive time & motivation. ... I would be happy to be corrected if I had misstated. *In the mean time*, one could use instead ... - ack, textproc/p5-ack (Perl), https://github.com/beyondgrep/ack3; - ag, textproc/the_silver_searcher (C), https://github.com/ggreer/the_silver_searcher; - GNU grep, textproc/gnugrep (C), https://www.gnu.org/software/grep/; - ripgrep, textproc/ripgrep (Rust), https://github.com/BurntSushi/ripgrep. Other workaround option is to build system grep with one of external regex libraries. For that matter, one could also install GNU grep at /usr/bin.
Same issue here. grep -vE -f whitelist.re /path/to/nginx-access.log Whitelist.re is a huge compiled exclusion list. In the form: ^(128\.177\.243\.|128\.177\.244\.100 | etc. etc.) This was done in about 5 seconds with a pretty large access.log in FreeBSD 11.4 Still not finished after 30 minutes on FreeBSD 13.0.
I was about to open a new PR, but I realized the problem has already been reported, so I'm commenting here. My particular issue: fgrep -i is terribly slow on 13.0. More in detail: Searching for a string though a one day's worth of mail log (100 MB, postfix) takes unreasonably long time with BSD grep when option -i (case-insensitive search) is specified: $ time /usr/bin/fgrep -i fwzhkqwfoherbfqo mail.log real 59.664s, user 58.907s, sys 0.222s $ time /usr/bin/fgrep fwzhkqwfoherbfqo mail.log real 0.276s, user 0.097s, sys 0.173s With option -i it takes 200-times longer than without -i. Compared to pcregrep or GNU grep, it takes 500-times longer. fgrep and grep and egrep are the same in this respect. An interesting observation: as the length of a search string increases, so does the run time. It would be worthwhile investing some time in finding and fixing this particular hotspot in BSD grep. (the machine above was a slightly older 13.0-RELEASE-p6 amd64, with SSD disks and ZFS, the file is cached (cache warmed up)) ---------------------- The rest below is some more benchmarking, this time on a faster machine with nvme disks, ZFS, same log file. Absolute times are shorter, but the ratio is about the same. $ uname -a FreeBSD xxx 13.0-RELEASE-p6 FreeBSD ... amd64/sys/GENERIC amd64 $ ls -l mail.log xxx 99999998 Jan 18 22:49 mail.log BSD grep: ========= $ time /usr/bin/fgrep -i fwzh mail.log real 0m8.733s, user 0m8.664s, sys 0m0.061s $ time /usr/bin/fgrep -i fwzhkqwf mail.log real 0m12.759s, user 0m12.666s, sys 0m0.057s $ time /usr/bin/fgrep -i fwzhkqwfoherbfqo mail.log real 0m18.922s, user 0m18.813s, sys 0m0.064s $ time /usr/bin/fgrep -i fwzhkqwfoherbfqojhkqnsazmzlwknhg mail.log real 0m32.593s, user 0m32.438s, sys 0m0.056s BSD grep without -i: $ time /usr/bin/fgrep fwzhkqwfoherbfqojhkqnsazmzlwknhg mail.log real 0m0.112s, user 0m0.033s, sys 0m0.079s GNU grep (textutils/gnugrep): ============================= $ time /usr/local/bin/fgrep -i fwzh mail.log real 0m0.204s, user 0m0.073s, sys 0m0.022s $ time /usr/local/bin/fgrep -i fwzhkqwf mail.log real 0m0.203s, user 0m0.060s, sys 0m0.015s $ time /usr/local/bin/fgrep -i fwzhkqwfoherbfqo mail.log real 0m0.175s, user 0m0.036s, sys 0m0.017s $ time /usr/local/bin/fgrep -i fwzhkqwfoherbfqojhkqnsazmzlwknhg mail.log real 0m0.217s, user 0m0.083s, sys 0m0.006s pcregrep: ========= $ time pcregrep -i fwzh mail.log real 0m0.249s, user 0m0.114s, sys 0m0.026s $ time pcregrep -i fwzhkqwf mail.log real 0m0.227s, user 0m0.094s, sys 0m0.013s $ time pcregrep -i fwzhkqwfoherbfqo mail.log real 0m0.128s, user 0m0.073s, sys 0m0.055s $ time pcregrep -i fwzhkqwfoherbfqojhkqnsazmzlwknhg mail.log real 0m0.126s, user 0m0.079s, sys 0m0.047s
Improving the speed of grep(1) is largely rototilling it (probably a 20+ hour job, easily, once you've made sure you didn't break some weird corner cases); this isn't something I can do in my spare time, and I've not found anyone willing to fund it.
(In reply to Kyle Evans from comment #12) So far as I know, the one area that I've see the behavior's consequences somewhat systematically is for poudriere bulk activity were it tries to go through the log to identify the "reason" for stoppage --when the log happens to be huge, as it can be for some port builds. /usr/local/share/poudriere/processonelog.sh and its use of bzgrep has a couple of examples of using -qi . Lots more use of -qE and -q . One -v . One plain grep. May be someday this issue will lead to some funding for spending less time on the kind of activity.
(In reply to Mark Millard from comment #13) Indicating a wording correction/intended-interpretation: "plain grep": no "bz". Actually: grep -qE
Is this a duplicate of bug #223553 ?