Bug 195763 - bsdgrep, empty matches and -o
Summary: bsdgrep, empty matches and -o
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: 10.1-RELEASE
Hardware: Any Any
: --- Affects Many People
Assignee: Kyle Evans
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2014-12-06 23:35 UTC by dt71
Modified: 2017-08-24 14:18 UTC (History)
3 users (show)

See Also:


Attachments
Proposed patch to address zero length match issues with bsdgrep (2.56 KB, patch)
2017-01-20 06:27 UTC, Kyle Evans
no flags Details | Diff
Proposed patch to address line matching issues (3.56 KB, patch)
2017-01-21 02:43 UTC, Kyle Evans
no flags Details | Diff
Proposed patch to address line matching issues (3.64 KB, patch)
2017-01-21 05:32 UTC, Kyle Evans
no flags Details | Diff
Proposed patch to address line matching issues (3.67 KB, patch)
2017-01-21 06:26 UTC, Kyle Evans
no flags Details | Diff
Proposed patch to address line matching issues (3.67 KB, patch)
2017-01-22 20:21 UTC, Kyle Evans
no flags Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description dt71 2014-12-06 23:35:11 UTC
bsdgrep(1) behaves weirdly when, apparently, an empty (sub)string is matched.

=== a session on FreeBSD (10.1-RELEASE) begins ===
$ echo '01:1:01' | grep -Eo '(^|:)0*'
0
$ # hey, what gives?
$ echo '01:1:01' | bsdgrep -Eo '(^|:)0*'
0
:
:0
$ # ah, old bugz :}
$ echo '1:1:01' | grep -Eo '(^|:)0*'
$ # surely, this is the same bug
$ echo '1:1:01' | bsdgrep -Eo '(^|:)0*'

$ # wtf?
=== a session on FreeBSD (10.1-RELEASE) ends ===

OK, let's see Linux.

=== a session on a GNU/Linux distrubution begins ===
$ echo '01:1:01' | grep -Eo '(^|:)0*'
0
:
:0
$ # OK, just like BSD-grep
$ echo '1:1:01' | grep -Eo '(^|:)0*'
:
:0
$ # correct!
=== a session on a GNU/Linux distrubution ends ===

More testing:

=== a session on FreeBSD (10.1-RELEASE) begins ===
$ echo 'bla bla' | bsdgrep -Eo '[[:alnum:]]*'
bla

$ # ???
$ printf 'bla\nbla\n' | bsdgrep -Eo ''
bla

bla

$ # ...
=== a session on FreeBSD (10.1-RELEASE) ends ===

=== a session on a GNU/Linux distrubution ends ===
$ echo 'bla bla' | bsdgrep -Eo '[[:alnum:]]*'
bla
bla
$ # most sensible
$ printf 'bla\nbla\n' | bsdgrep -Eo ''
$ # also sensible
=== a session on a GNU/Linux distrubution ends ===
Comment 1 Kyle Evans freebsd_committer freebsd_triage 2017-01-20 06:27:46 UTC
Created attachment 179126 [details]
Proposed patch to address zero length match issues with bsdgrep

Hi,

Attached is a patch that seems to fix the behavior of zero-length matches. procline() was previously using something along the lines of 'next start (st) == match start' to indicate failure, but this isn't right when zero-length matches are possible. Therefore, we make a couple of critical changes:

1.) Keep track of the number of total matches we came across
2.) Don't add zero-length matches to our output
3.) If we had any matches, keep going. If st didn't advance beyond our first match, we only had zero-length matches

To fix the empty expression case, I changed the argv pattern matching bits in main() to actually skip invalid expressions. 

Your test expressions now yield the following results:
root@ghost:/usr/obj/usr/src/usr.bin/grep# echo '01:1:01' | ./bsdgrep -Eo '(^|:)0*'
0
:
:0
# Good
root@ghost:/usr/obj/usr/src/usr.bin/grep# echo '1:1:01' | ./bsdgrep -Eo '(^|:)0*'
:
:0
# Better
root@ghost:/usr/obj/usr/src/usr.bin/grep# echo 'bla bla' | ./bsdgrep -Eo '[[:alnum:]]*'
bla
bla
# Good
root@ghost:/usr/obj/usr/src/usr.bin/grep# printf 'bla\nbla\n' | ./bsdgrep -Eo ''
# Good

I think this approach is mostly right, except for perhaps the bits in grep.c:main() to make sure we don't recognize invalid patterns as input patterns.
Comment 2 Kyle Evans freebsd_committer freebsd_triage 2017-01-20 14:03:46 UTC
(In reply to Kyle Evans from comment #1)

The 'ignore invalid patterns' change my previous patch makes seems definitively wrong -- I'll have to revise this part and I'm just going to leave this here:

$ echo 'test' | grep ''
test
$ echo 'test' | grep -o ''
$ echo 'test' | bsdgrep ''
test
$ echo 'test' | bsdgrep -o ''

The regex bits declare this as a 4-character match, not sure off-hand how to treat this but I don't believe ignoring the pattern is the right thing to do.
Comment 3 Kyle Evans freebsd_committer freebsd_triage 2017-01-21 02:43:10 UTC
Created attachment 179171 [details]
Proposed patch to address line matching issues

emaste@ - I believe this to be my final iteration of this patch. Because this and bug #209116 are both somewhat major issues with procline(), this patch addresses both PRs and obsoletes both previous patches.

I've added to printline() a bail-out if -o and matchall (empty pattern) are set -- this coincides with GNU grep's behavior.

Assuming I ran the tests correctly (make && make install && kyua test -k /usr/tests/usr.bin/grep/Kyuafile), I haven't yet introduced any regressions there. One of my next steps might be to review the unit test coverage and make sure it's sufficient, though, since it was passing before despite having all kinds of problems.
Comment 4 Kyle Evans freebsd_committer freebsd_triage 2017-01-21 05:32:30 UTC
Created attachment 179173 [details]
Proposed patch to address line matching issues

The failure in bug #181263 case 1 also ended up being fairly straightforward to address -- pmatch gets clobbered by a non-matching case and ended up with bogus values, so now we look at the last *actual match* (if st isn't advancing) and see if that ends up being a zero-length match.

This catches corner cases like the one found where .* can infinitely match but the 'a' pattern comes second. It shouldn't affect any legitimate cases, since those will cause an advancement in nst prior to this.
Comment 5 Kyle Evans freebsd_committer freebsd_triage 2017-01-21 06:26:00 UTC
Created attachment 179175 [details]
Proposed patch to address line matching issues

Ugh, sorry -- this really should be the final version. Shortly after the last update, I caught a mailing list entry that mentioned broken beginning of line matching behavior and figured it would be wise to catch this as well.

REG_NOTBOL needs to be manually set if we're past the beginning of the string because we also use REG_STARTEND to limit the scope of our matching, which makes the regex functions think we're beginning of line all the time.

We do not need to do the same with EOL since we limit the scope by adjusting rm_so, resetting rm_eo to the end of the line. It might not be a bad idea to later set it anyways, just in case we change our mind, but for now that's out of scope.
Comment 6 Kyle Evans freebsd_committer freebsd_triage 2017-01-22 20:21:03 UTC
Created attachment 179227 [details]
Proposed patch to address line matching issues

Address one last nit that I found in some more extensive testing -- if we're going to replace the last match made (due to overlap), it should be an earlier or strictly longer match than the current one.

Test case: echo "abcdef" | grep -o -e "ab" -e "bc"
Swap the order of -e arguments, "ab" and "bc" would previously yield different results when they should both yield "ab".
Comment 7 commit-hook freebsd_committer freebsd_triage 2017-04-03 23:17:28 UTC
A commit references this bug:

Author: emaste
Date: Mon Apr  3 23:16:51 UTC 2017
New revision: 316477
URL: https://svnweb.freebsd.org/changeset/base/316477

Log:
  bsdgrep: fix matching behaviour

  - Set REG_NOTBOL if we've already matched beginning of line and we're
    examining later parts

  - For each pattern we examine, apply it to the remaining bits of the
    line rather than (potentially) smaller subsets

  - Check for REG_NOSUB after we've looked at all patterns initially
    matching the line

  - Keep track of the last match we made to later determine if we're
    simply not matching any longer or if we need to proceed another byte
    because we hit a zero-length match

  - Match the earliest and longest bit of each line before moving the
    beginning of what we match to further in the line, past the end of the
    longest match; this generally matches how gnugrep(1) seems to behave,
    and seems like pretty good behavior to me

  - Finally, bail out of printing any matches if we were set to print all
    (empty pattern) but -o (output matches) was set

  PR:		195763, 180990, 197555, 197531, 181263, 209116
  Submitted by:	"Kyle Evans" <kevans91@ksu.edu>
  Reviewed by:	cem
  MFC after:	1 month
  Relnotes:	Yes
  Differential Revision:	https://reviews.freebsd.org/D10104

Changes:
  head/usr.bin/grep/util.c
Comment 8 commit-hook freebsd_committer freebsd_triage 2017-04-05 18:42:08 UTC
A commit references this bug:

Author: emaste
Date: Wed Apr  5 18:41:46 UTC 2017
New revision: 316536
URL: https://svnweb.freebsd.org/changeset/base/316536

Log:
  bsdgrep: create additional tests for coverage on recent fixes

  Create additional tests to cover regressions that were discovered by
  PRs linked to reviews D10098, D10102, and D10104.

  It is worth noting that neither bsdgrep(1) nor gnugrep(1) in the base
  system currently pass all of these tests, and gnugrep(1) not quite being
  up to snuff was also noted in at least one of the PRs.

  PR:		175314 202022 195763 180990 197555 197531 181263 209116
  Submitted by:	Kyle Evans <kevans91@ksu.edu>
  Reviewed by:	cem, ngie, emaste
  MFC after:	1 month
  Differential Revision:	https://reviews.freebsd.org/D10112

Changes:
  head/contrib/netbsd-tests/usr.bin/grep/d_color_a.in
  head/contrib/netbsd-tests/usr.bin/grep/d_color_a.out
  head/contrib/netbsd-tests/usr.bin/grep/d_color_b.in
  head/contrib/netbsd-tests/usr.bin/grep/d_color_b.out
  head/contrib/netbsd-tests/usr.bin/grep/d_color_c.out
  head/contrib/netbsd-tests/usr.bin/grep/d_escmap.in
  head/contrib/netbsd-tests/usr.bin/grep/d_f_file_empty.in
  head/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_a.in
  head/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_a.out
  head/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_b.in
  head/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_b.out
  head/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_c.in
  head/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_c.out
  head/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_d.in
  head/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_e.in
  head/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_e.out
  head/contrib/netbsd-tests/usr.bin/grep/t_grep.sh
  head/usr.bin/grep/tests/Makefile
Comment 9 commit-hook freebsd_committer freebsd_triage 2017-08-16 00:12:41 UTC
A commit references this bug:

Author: kevans
Date: Wed Aug 16 00:12:25 UTC 2017
New revision: 322555
URL: https://svnweb.freebsd.org/changeset/base/322555

Log:
  bsdgrep: Fix matching behavior and add regression tests

  MFC r316477: bsdgrep: fix matching behaviour

  - Set REG_NOTBOL if we've already matched beginning of line and we're
    examining later parts

  - For each pattern we examine, apply it to the remaining bits of the
    line rather than (potentially) smaller subsets

  - Check for REG_NOSUB after we've looked at all patterns initially
    matching the line

  - Keep track of the last match we made to later determine if we're
    simply not matching any longer or if we need to proceed another byte
    because we hit a zero-length match

  - Match the earliest and longest bit of each line before moving the
    beginning of what we match to further in the line, past the end of the
    longest match; this generally matches how gnugrep(1) seems to behave,
    and seems like pretty good behavior to me

  - Finally, bail out of printing any matches if we were set to print all
    (empty pattern) but -o (output matches) was set

  MFC r316489: bsdgrep: Initialize vars to avoid a false positive GCC warning

  MFC r316491: bsdgrep: revert color changes from r316477

  r316477 changed the color output to match exactly the in-tree GNU grep,
  but introduces unnecessary escape sequences.

  MFC r316536: bsdgrep: create additional tests for coverage on recent fixes

  Create additional tests to cover regressions that were discovered by
  PRs linked to reviews D10098, D10102, and D10104.

  It is worth noting that neither bsdgrep(1) nor gnugrep(1) in the base
  system currently pass all of these tests, and gnugrep(1) not quite being
  up to snuff was also noted in at least one of the PRs.

  MFC r317052: bsdgrep: fix zero-length matches without the -o flag

  r316477 broke zero-length matches when not using the -o flag, by
  skipping over them entirely.

  Add a regression test so that it doesn't break again in the future.

  PR:		175314, 180990, 181263, 195763, 197531, 197555, 202022, 209116
  Approved by:	emaste (mentor, blanket MFC)
  Relnotes:	yes

Changes:
_U  stable/11/
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_color_a.in
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_color_a.out
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_color_b.in
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_color_b.out
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_color_c.out
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_escmap.in
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_f_file_empty.in
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_a.in
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_a.out
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_b.in
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_b.out
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_c.in
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_c.out
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_d.in
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_e.in
  stable/11/contrib/netbsd-tests/usr.bin/grep/d_oflag_zerolen_e.out
  stable/11/contrib/netbsd-tests/usr.bin/grep/t_grep.sh
  stable/11/usr.bin/grep/tests/Makefile
  stable/11/usr.bin/grep/util.c
Comment 10 Mark Linimon freebsd_committer freebsd_triage 2017-08-24 14:18:12 UTC
MFCed Wed Aug 16 00:12:25 UTC 2017.