Bug 130504 - [libc] Serious bug in regular expression library (regex) affected sed
Summary: [libc] Serious bug in regular expression library (regex) affected sed
Status: Closed Unable to Reproduce
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: 6.3-PRERELEASE
Hardware: Any Any
: Normal Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-01-13 14:00 UTC by Chris Kuklewicz
Modified: 2014-06-23 20:32 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Kuklewicz 2009-01-13 14:00:01 UTC
I have tracked the source of a bug originally seen on OS X back to the same bug in FreeBSD.

The bug is in the libc implementation of regular expressions, used via the API in "regex.h" specifically regexec.

Programs which use this library are affected, including "sed" which I will use to illustrate the bug below.

The bug effects both the whole text matched and the choice of text reported as matching parenthesized subexpressions.  The bug violates the leftmost longest matching rule for POSIX regular expressions.  The bug violates the principle that the order of subexpressions around the "|" operator should not affect the whole text matched.

A terminal sessions using three sed commands with extended regular expressions:
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/"
a[b]
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/"
[ab]
[chrisk@m-net ~]$ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/'
[XababaYababaZ] => [X][ab][aba][Y][b][Z]

The first two commands differ only in the order of "()" and "." around the "|" operator.  According to the POSIX specification this must not affect the whole match, which is report by "&" in sed's replacement specification.  The leftmost longest match is "ab" and is correctly reported by the second command.  The first command incorrect does not match starting with the "a" and matches just the "b".

This establishes that the bug violates the POSIX specification and can affect a program that only cares about the whole text matched (ignoring captured parenthesized subexpressions).

The third command shows an internal inconsistency in the output.  The (aba|b|ab)* operator should capture the text matched by the last repetition of the (aba|b|ab) subexpression and report that in \6. In particular \6 should match "aba" just like \3 does.

As shown above the FreeBSD output is [XababaYababaZ] => [X][ab][aba][Y][b][Z]
The correct output should have been  [XababaYababaZ] => [X][ab][aba][Y][abc][Z]

Could there be a way to explain the FreeBSD output? Consider the following questions:

If the last repetition of (aba|b|ab)* matched \6 to "b" then how is the text between that "b" skipped so that "(Z)" matched \7 against the final "Z" in the text?
If the "b" reported by \6 is the last "b" in the text then there is no part of the pattern that can match the next "a".
If the "b" reported by \6 is the next-to-last "b" in the text then there is no part of the pattern that can match the preceding "a".
Checking the indices returned by regex shows that \6 was matched against the last "b" in the text.
Thus reporting \6 as matching "b" is contradictory, there is no consistent interpretation of what regexec returned.

This establishes that the bug is can violate not only the POSIX specification but also violate any rational and consistent alternate specification.

I found the first example by using Haskell's QuickCheck to randomly generate test cases.  The second example was found by using Glenn Fowler's "AT&T Labs Research regex(3) regression tests" from http://www.research.att.com/~gsf/testregex/
I tested FreeBSD 6.3-PRERELEASE thanks to the free shell account at m-net.arbornet.org and FreeBSD 8.0-CURRENT was tested by "methi" on the #freebsd channel at irc.freenode.net

Fix: 

I do not have the time to be able to fix the regex.h part of libc.
How-To-Repeat: [chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/"
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/"
[chrisk@m-net ~]$ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)

If there is no bug then the first two both output "[ab]" and the last output "[XababaYababaZ] => [X][ab][aba][Y][abc][Z]".

Currently the bug causes the first the output "a[b]" and the last to output "[XababaYababaZ] => [X][ab][aba][Y][b][Z]".
Comment 1 Diomidis Spinellis freebsd_committer freebsd_triage 2009-09-15 21:26:25 UTC
State Changed
From-To: open->analyzed

Reproduced the bug and created test case 


Comment 2 Diomidis Spinellis freebsd_committer freebsd_triage 2009-09-15 21:26:25 UTC
Responsible Changed
From-To: freebsd-bugs->dds

I think I can fix it.
Comment 3 dfilter service freebsd_committer freebsd_triage 2009-09-15 22:16:37 UTC
Author: dds
Date: Tue Sep 15 21:15:29 2009
New Revision: 197234
URL: http://svn.freebsd.org/changeset/base/197234

Log:
  Add two test cases from PR 130504.
  An additional one coming from http://www.research.att.com/~gsf/testregex/
  was not added; at some point the entire AT&T regression test harness
  should be imported here.
  But that would also mean commitment to fix the uncovered errors.
  
  PR:		130504
  Submitted by:	Chris Kuklewicz

Modified:
  head/lib/libc/regex/grot/tests

Modified: head/lib/libc/regex/grot/tests
==============================================================================
--- head/lib/libc/regex/grot/tests	Tue Sep 15 20:28:29 2009	(r197233)
+++ head/lib/libc/regex/grot/tests	Tue Sep 15 21:15:29 2009	(r197234)
@@ -472,3 +472,6 @@ abcdefghijklmnop	i	abcdefghijklmnop	abcd
 abcdefghijklmnopqrstuv	i	abcdefghijklmnopqrstuv	abcdefghijklmnopqrstuv
 (ALAK)|(ALT[AB])|(CC[123]1)|(CM[123]1)|(GAMC)|(LC[23][EO ])|(SEM[1234])|(SL[ES][12])|(SLWW)|(SLF )|(SLDT)|(VWH[12])|(WH[34][EW])|(WP1[ESN])	-	CC11	CC11
 CC[13]1|a{21}[23][EO][123][Es][12]a{15}aa[34][EW]aaaaaaa[X]a	-	CC11	CC11
+# PR 130504
+(.|())(b)	-	ab	ab
+(()|.)(b)	-	ab	ab
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
Comment 4 dfilter service freebsd_committer freebsd_triage 2009-09-16 07:32:37 UTC
Author: dds
Date: Wed Sep 16 06:32:23 2009
New Revision: 197246
URL: http://svn.freebsd.org/changeset/base/197246

Log:
  Fix an off-by-one error in the marking of the O_CH operator
  following an OOR2 operator.
  
  PR:		130504
  MFC after:	2 weeks

Modified:
  head/lib/libc/regex/engine.c

Modified: head/lib/libc/regex/engine.c
==============================================================================
--- head/lib/libc/regex/engine.c	Wed Sep 16 06:29:23 2009	(r197245)
+++ head/lib/libc/regex/engine.c	Wed Sep 16 06:32:23 2009	(r197246)
@@ -1075,7 +1075,7 @@ step(struct re_guts *g,
 						OP(s = g->strip[pc+look]) != O_CH;
 						look += OPND(s))
 					assert(OP(s) == OOR2);
-				FWD(aft, aft, look);
+				FWD(aft, aft, look + 1);
 			}
 			break;
 		case OOR2:		/* propagate OCH_'s marking */
_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
Comment 5 Chris Kuklewicz 2009-09-21 13:09:29 UTC
The patch looks like a fix for the first bug.  I assume that it is not a
fix for the bug that produces:

> $ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/'
> [XababaYababaZ] => [X][ab][aba][Y][b][Z]

> $ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)*(Y)(aba|b|ab)(aba|b|ab)(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/'
> [XababaYababaZ] => [X][b][Y][ab][aba][Z]
> 

I have not read the code, but I have developed a theory that explains
mistakes like the one above.

The (aba|b|ab)* can fit the ababa in more than one way, and this is an
ambiguous repetition.  Handling subgroup captures for ambiguous
repetitions is really tricky.  The bug above makes me suspect the code
does this:

1) Form a finite automaton to find the leftmost-longest match, ignoring
substring captures (at least for the ambiguous repetition)

2) Do something to report the non-ambiguous repetitions correctly, e.g.
all but the \6 in the first or \2 in the second example.  Since this
works, I can deduce little about how it does it.

3) Start at the beginning of the ambiguous repetitions and perform
left-to-right greedy matching.  Thus the "aba" is matched against
"ababa" first, leaving "ba" which is matched with "b", leaving "a".
This reports the last greedy iteration as the captured substring (here "b").

4a) If (3) covers the whole substring (here "ababa") then (3) is
correct.  This is equivalent to "does (3) terminate at the correct final
character/index?"

4b) If (3) does not cover the whole substring then the algorithm has
failed to find the correct final iteration.  One way to recover from
this would be to backup and try other possible captures; this search
will work if done in the right order, but is potentially exponentially slow.

So the problem with the current code seems to be a failure to recognize
that (3) has failed and to then do something about it.  Note that I have
not read the actual code or studied the actual algorithm.

There is another alternative: implement a non-exponential algorithm to
get the correct substring captures.  Such algorithms exists, and NetBSD
is looking to include one such: http://laurikari.net/tre/ is the home
page.  A few years ago I made a variant (in Haskell) of this algorithm
and have achieved linear performance and POSIX substring captures for
ambiguous repetitions.  I found the the (perhaps now old) "tre" library
did NOT find the correct POSIX substring captures for ambiguous
repetitions; I have not re-tested the newest "tre" version.

-- 
Chris Kuklewicz
Comment 6 Mark Linimon freebsd_committer freebsd_triage 2010-09-22 15:54:53 UTC
Responsible Changed
From-To: dds->freesd-bugs

dds' bit has been returned for safekeeping.
Comment 7 Mark Linimon freebsd_committer freebsd_triage 2010-09-27 12:44:56 UTC
Responsible Changed
From-To: freesd-bugs->freebsd-bugs

Canonicalize assignment.
Comment 8 Eitan Adler freebsd_committer freebsd_triage 2012-07-01 17:10:26 UTC
State Changed
From-To: analyzed->open

unowned PRs must not be in analyzed state
Comment 9 Pedro F. Giffuni freebsd_committer freebsd_triage 2014-06-23 20:32:59 UTC
Cannot reproduce:
$ uname -a
FreeBSD kakumen 10.0-STABLE FreeBSD 10.0-STABLE #4 r267707: Sat Jun 21 19:40:06 COT 2014     pfg@kakumen:/usr/obj/usr/src/sys/GENERIC  amd64

$ echo "ab" | sed -E "s/(()|.)(b)/[&]/"
[ab]
$ echo "ab" | sed -E "s/(.|())(b)/[&]/"
[ab]
$ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/'
[XababaYababaZ] => [X][ab][aba][Y][b][Z]