Bug 166861 - bsdgrep(1)/sed(1): bsdgrep -E and sed handle invalid {} constructs strangely
Summary: bsdgrep(1)/sed(1): bsdgrep -E and sed handle invalid {} constructs strangely
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: Unspecified
Hardware: Any Any
: Normal Affects Only Me
Assignee: freebsd-bugs (Nobody)
Depends on: 218495
  Show dependency treegraph
Reported: 2012-04-12 00:00 UTC by Jim Pryor
Modified: 2019-01-20 01:19 UTC (History)
3 users (show)

See Also:


Note You need to log in before you can comment on or make changes to this bug.
Description Jim Pryor 2012-04-12 00:00:28 UTC
grep version line:
/* $FreeBSD: src/usr.bin/grep/grep.c,v 2011/10/20 16:08:11 gabor Exp $

sed version line:
$FreeBSD: src/usr.bin/sed/main.c,v 2011/09/23 00:51:37 kensmith Exp $

(1) FreeBSD grep without -E will reject unmatched '\{' as an invalid pattern, but treat unmatched '\}' as a literal '}'. So far, so good. This is also how Gnu grep and BusyBox grep handle these; POSIX-2008 doesn't specify what to do here.

BusyBox's egrep sticks to the same pattern. But FreeBSD's egrep diverges: it treats unmatched { and unmatched } both as literals. These are perverse patterns and no one should be relying on this behavior; however, FreeBSD's change of behavior here seems unmotivated. Admittedly, Gnu egrep does the same as FreeBSD.

(2) FreeBSD grep without -E follows the other greps in rejecting 'a\{1,2,3\}b' as an invalid pattern. The other egreps likewise reject 'a{1,2,3}b'. However, FreeBSD grep accepts 'a{1,2,3}b', and moreover will match it against the text "a{1,2,3}b"; however, the match is zero-length. Again, a perverse pattern whose interpretation no one should be relying on. However, FreeBSD's handling of it seems strange.

(3) The pattern among other sed implementations is:
     without -r: reject unmatched \{ as error, accept unmatched \} as literal
                 reject \{\}, \{2,1\}, and \{1,2,3\}
        with -r: reject unmatched { as error, accept unmatched } as literal
                 reject {}, {2,1}, and {1,2,3}

However, FreeBSD sed without -r diverges from the pattern in rejecting unmatched \} as error.

(4) Also, FreeBSD sed with -r diverges from the pattern in accepting {} as those two literal characters.

How-To-Repeat: See above.
Comment 1 Jim Pryor 2012-04-12 00:45:57 UTC
(5) FreeBSD sed without -r, as well as the other sed implementations,
reject unmatched \( and unmatched \) both as invalid patterns. So too do
all these grep implementations without -E.

Similarly, gnu sed with -r rejects unmatched ( and unmatched ) both as
invalid patterns. And so too do all these egrep implementations.

Like everyone else, FreeBSD sed with -r rejects unmatched ( as invalid.
But it diverges in treating unmatched ) as literal.
Comment 2 Kyle Evans freebsd_committer 2017-04-07 04:05:01 UTC
(In reply to dubiousjim from comment #1)

Summary of work needed at the bottom, feel free to skip ahead and only look back for intermediate results/notes.

Some relevant notes:

As of GNU grep 2.27, GNU SED 4.3 on Debian, and BSD grep @ r316566-ish:

(1) and (2) behavior between the two seem to match
$ echo "a{1,2,3}b" | sed -r "s/{/_/"
$ echo "a{1,2,3}b" | sed -r "s/}/_/"

$ echo "a{1,2,3}b" | sed -r "s/{/_/"
# Error, invalid preceding expression
# Whoops
$ echo "a{1,2,3}b" | sed -r "s/a{/_/"
# Error, unmatched \{
$ echo "a{1,2,3}b" | sed -r "s/}/_/"

We do have a test case for this at lib/libc/regex/grot/tests:205 where { is explicitly meant to be a literal match in both BREs and EREs. We have no case expression } being a literal match.

$ echo "a{1,2,3}b" | sed "s/\}/_/"
# Error, parentheses not balanced

$ echo "a{1,2,3}b" | sed "s/\}/_/"
# Ah, also prefer GNU behavior

This one, it's worth noting, has no test either. It does have the obvious test for the other side, \{ alone, but no \}.

$ echo "a{1,2,3}b" | sed -r "s/{}/_/"

$ echo "a{1,2,3}b" | sed -r "s/{}/_/"
# Error, invalid preceding expression
# Whoops
$ echo "a{1,2,3}b" | sed -r "s/a{}/_/"
# Error, invalid content
# Reasonable

This one is .... technically correct behavior. Technically, according to re_format(7), the following "}" is *not* a digit, and therefore this is not a bounds statement. I think this is really not correct, though. Letting {} take a literal interpretation leaves us too much room for error getting in if a digit was expected by the pattern-creator, and I would prefer the GNU approach on this matter.

We'll probably want to update re_format(7) to be more explicit in this matter, as well as add a corresponding test case.

$ echo "a{1,2,3}b" | sed -r "s/)/_/"
$ echo "a{1,2,3}b" | sed "s/\)/_/"
# Error, parentheses not balanced

This is clearly covered in tests:54 (silenced, though) and with slight anger expressed in the context around it. I lean towards taking the GNU/sane approach on this one and making this work as one probably expects nowadays.

===== Summary of work needed

Problem: { in ERE uses literal interpretation
Needed: { throw error
Needed: Fix test case at tests:205 to separate out BRE and ERE cases and adjust ERE case to meet expectations

Problem: \} in BRE throws an error
Needed: \} match literal

Problem: {} in ERE uses literal interpretation
Needed: {} throw error
Needed: Consider re_format(7) update to explicitly note {} as illegal
Needed: Test case

Problem: ) in ERE should throw error
Needed: ) throw error
Needed: Adjust test cases (tests:54)

I think that sums it up -- I'll take a look at these things in the next week or so.
Comment 3 Kyle Evans freebsd_committer 2017-04-07 13:50:17 UTC
I can't actually find any evidence that grot/ bits are still used/functional, so ignore those parts of last comment. Will look at the netbsd-tests instead.
Comment 4 commit-hook freebsd_committer 2017-08-08 04:11:48 UTC
A commit references this bug:

Author: kevans
Date: Tue Aug  8 04:10:47 UTC 2017
New revision: 322211
URL: https://svnweb.freebsd.org/changeset/base/322211

  regex(3): Handle invalid {} constructs consistently and adjust tests

  Currently, regex(3) exhibits the following wrong behavior as demonstrated
  with sed:

   - echo "a{1,2,3}b" | sed -r "s/{/_/"     (1)
   - echo "a{1,2,3}b" | sed "s/\}/_/"       (2)
   - echo "a{1,2,3}b" | sed -r "s/{}/_/"    (3)

  Cases (1) and (3) should throw errors but they actually succeed, and (2)
  throws an error when it should match the literal '}'. The correct behavior
  was decided by comparing to the behavior with the equivalent BRE (1)(3) or
  ERE (2) and consulting POSIX, along with some reasonable evaluation.

  Tests were also adjusted/added accordingly.

  PR:		166861
  Reviewed by:	emaste, ngie, pfg
  Approved by:	emaste (mentor)
  MFC after:	never
  Differential Revision:	https://reviews.freebsd.org/D10315

Comment 5 Eitan Adler freebsd_committer freebsd_triage 2018-05-20 23:59:43 UTC
For bugs matching the following conditions:
- Status == In Progress
- Assignee == "bugs@FreeBSD.org"
- Last Modified Year <= 2017

- Set Status to "Open"
Comment 6 Oleksandr Tymoshenko freebsd_committer freebsd_triage 2019-01-20 01:19:42 UTC
There was a commit referencing this bug, but it's still not closed and has been inactive for some time. Closing as fixed. Please re-open it if the issue hasn't been completely resolved.