I have tracked the source of a bug originally seen on OS X back to the same bug in FreeBSD. The bug is in the libc implementation of regular expressions, used via the API in "regex.h" specifically regexec. Programs which use this library are affected, including "sed" which I will use to illustrate the bug below. The bug effects both the whole text matched and the choice of text reported as matching parenthesized subexpressions. The bug violates the leftmost longest matching rule for POSIX regular expressions. The bug violates the principle that the order of subexpressions around the "|" operator should not affect the whole text matched. A terminal sessions using three sed commands with extended regular expressions: [chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/" a[b] [chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/" [ab] [chrisk@m-net ~]$ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/' [XababaYababaZ] => [X][ab][aba][Y][b][Z] The first two commands differ only in the order of "()" and "." around the "|" operator. According to the POSIX specification this must not affect the whole match, which is report by "&" in sed's replacement specification. The leftmost longest match is "ab" and is correctly reported by the second command. The first command incorrect does not match starting with the "a" and matches just the "b". This establishes that the bug violates the POSIX specification and can affect a program that only cares about the whole text matched (ignoring captured parenthesized subexpressions). The third command shows an internal inconsistency in the output. The (aba|b|ab)* operator should capture the text matched by the last repetition of the (aba|b|ab) subexpression and report that in \6. In particular \6 should match "aba" just like \3 does. As shown above the FreeBSD output is [XababaYababaZ] => [X][ab][aba][Y][b][Z] The correct output should have been [XababaYababaZ] => [X][ab][aba][Y][abc][Z] Could there be a way to explain the FreeBSD output? Consider the following questions: If the last repetition of (aba|b|ab)* matched \6 to "b" then how is the text between that "b" skipped so that "(Z)" matched \7 against the final "Z" in the text? If the "b" reported by \6 is the last "b" in the text then there is no part of the pattern that can match the next "a". If the "b" reported by \6 is the next-to-last "b" in the text then there is no part of the pattern that can match the preceding "a". Checking the indices returned by regex shows that \6 was matched against the last "b" in the text. Thus reporting \6 as matching "b" is contradictory, there is no consistent interpretation of what regexec returned. This establishes that the bug is can violate not only the POSIX specification but also violate any rational and consistent alternate specification. I found the first example by using Haskell's QuickCheck to randomly generate test cases. The second example was found by using Glenn Fowler's "AT&T Labs Research regex(3) regression tests" from http://www.research.att.com/~gsf/testregex/ I tested FreeBSD 6.3-PRERELEASE thanks to the free shell account at m-net.arbornet.org and FreeBSD 8.0-CURRENT was tested by "methi" on the #freebsd channel at irc.freenode.net Fix: I do not have the time to be able to fix the regex.h part of libc. How-To-Repeat: [chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/" [chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/" [chrisk@m-net ~]$ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab) If there is no bug then the first two both output "[ab]" and the last output "[XababaYababaZ] => [X][ab][aba][Y][abc][Z]". Currently the bug causes the first the output "a[b]" and the last to output "[XababaYababaZ] => [X][ab][aba][Y][b][Z]".
State Changed From-To: open->analyzed Reproduced the bug and created test case
Responsible Changed From-To: freebsd-bugs->dds I think I can fix it.
Author: dds Date: Tue Sep 15 21:15:29 2009 New Revision: 197234 URL: http://svn.freebsd.org/changeset/base/197234 Log: Add two test cases from PR 130504. An additional one coming from http://www.research.att.com/~gsf/testregex/ was not added; at some point the entire AT&T regression test harness should be imported here. But that would also mean commitment to fix the uncovered errors. PR: 130504 Submitted by: Chris Kuklewicz Modified: head/lib/libc/regex/grot/tests Modified: head/lib/libc/regex/grot/tests ============================================================================== --- head/lib/libc/regex/grot/tests Tue Sep 15 20:28:29 2009 (r197233) +++ head/lib/libc/regex/grot/tests Tue Sep 15 21:15:29 2009 (r197234) @@ -472,3 +472,6 @@ abcdefghijklmnop i abcdefghijklmnop abcd abcdefghijklmnopqrstuv i abcdefghijklmnopqrstuv abcdefghijklmnopqrstuv (ALAK)|(ALT[AB])|(CC[123]1)|(CM[123]1)|(GAMC)|(LC[23][EO ])|(SEM[1234])|(SL[ES][12])|(SLWW)|(SLF )|(SLDT)|(VWH[12])|(WH[34][EW])|(WP1[ESN]) - CC11 CC11 CC[13]1|a{21}[23][EO][123][Es][12]a{15}aa[34][EW]aaaaaaa[X]a - CC11 CC11 +# PR 130504 +(.|())(b) - ab ab +(()|.)(b) - ab ab _______________________________________________ svn-src-all@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-all To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
Author: dds Date: Wed Sep 16 06:32:23 2009 New Revision: 197246 URL: http://svn.freebsd.org/changeset/base/197246 Log: Fix an off-by-one error in the marking of the O_CH operator following an OOR2 operator. PR: 130504 MFC after: 2 weeks Modified: head/lib/libc/regex/engine.c Modified: head/lib/libc/regex/engine.c ============================================================================== --- head/lib/libc/regex/engine.c Wed Sep 16 06:29:23 2009 (r197245) +++ head/lib/libc/regex/engine.c Wed Sep 16 06:32:23 2009 (r197246) @@ -1075,7 +1075,7 @@ step(struct re_guts *g, OP(s = g->strip[pc+look]) != O_CH; look += OPND(s)) assert(OP(s) == OOR2); - FWD(aft, aft, look); + FWD(aft, aft, look + 1); } break; case OOR2: /* propagate OCH_'s marking */ _______________________________________________ svn-src-all@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-all To unsubscribe, send any mail to "svn-src-all-unsubscribe@freebsd.org"
The patch looks like a fix for the first bug. I assume that it is not a fix for the bug that produces: > $ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/' > [XababaYababaZ] => [X][ab][aba][Y][b][Z] > $ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)*(Y)(aba|b|ab)(aba|b|ab)(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/' > [XababaYababaZ] => [X][b][Y][ab][aba][Z] > I have not read the code, but I have developed a theory that explains mistakes like the one above. The (aba|b|ab)* can fit the ababa in more than one way, and this is an ambiguous repetition. Handling subgroup captures for ambiguous repetitions is really tricky. The bug above makes me suspect the code does this: 1) Form a finite automaton to find the leftmost-longest match, ignoring substring captures (at least for the ambiguous repetition) 2) Do something to report the non-ambiguous repetitions correctly, e.g. all but the \6 in the first or \2 in the second example. Since this works, I can deduce little about how it does it. 3) Start at the beginning of the ambiguous repetitions and perform left-to-right greedy matching. Thus the "aba" is matched against "ababa" first, leaving "ba" which is matched with "b", leaving "a". This reports the last greedy iteration as the captured substring (here "b"). 4a) If (3) covers the whole substring (here "ababa") then (3) is correct. This is equivalent to "does (3) terminate at the correct final character/index?" 4b) If (3) does not cover the whole substring then the algorithm has failed to find the correct final iteration. One way to recover from this would be to backup and try other possible captures; this search will work if done in the right order, but is potentially exponentially slow. So the problem with the current code seems to be a failure to recognize that (3) has failed and to then do something about it. Note that I have not read the actual code or studied the actual algorithm. There is another alternative: implement a non-exponential algorithm to get the correct substring captures. Such algorithms exists, and NetBSD is looking to include one such: http://laurikari.net/tre/ is the home page. A few years ago I made a variant (in Haskell) of this algorithm and have achieved linear performance and POSIX substring captures for ambiguous repetitions. I found the the (perhaps now old) "tre" library did NOT find the correct POSIX substring captures for ambiguous repetitions; I have not re-tested the newest "tre" version. -- Chris Kuklewicz
Responsible Changed From-To: dds->freesd-bugs dds' bit has been returned for safekeeping.
Responsible Changed From-To: freesd-bugs->freebsd-bugs Canonicalize assignment.
State Changed From-To: analyzed->open unowned PRs must not be in analyzed state
Cannot reproduce: $ uname -a FreeBSD kakumen 10.0-STABLE FreeBSD 10.0-STABLE #4 r267707: Sat Jun 21 19:40:06 COT 2014 pfg@kakumen:/usr/obj/usr/src/sys/GENERIC amd64 $ echo "ab" | sed -E "s/(()|.)(b)/[&]/" [ab] $ echo "ab" | sed -E "s/(.|())(b)/[&]/" [ab] $ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/' [XababaYababaZ] => [X][ab][aba][Y][b][Z]