WG15 Defect Report Ref: 9945-2-43
Topic: Regular Expressions - imprecise specification


This is an approved interpretation of 9945-2:1993.

.

Last update: 1997-05-20


								9945-2-43

 _____________________________________________________________________________


	Topic:			Regular Expressions - imprecise specification
	Relevant Sections:	2.8


Defect Report:
-----------------------

(from Andrew Hume Doug McIlroy)

Issue D

     These questions address areas  of	imprecision  in	 the
specifications	of REs.	With regard to question	[13], 9945-2
uses various  terms roughly synonymously  (leftmost,  first,
earliest)   (subpattern,  subexpression)  for  describing  a
left-to-right order across a line of text.  It would be	well
if the text (and rationale) used one term consistently.
	  ________________________________________

 [12] Can a duplicated subexpression match the null  string?
     If	 so,  will  the	 duplication  be  repeated until the
     expression	does match the null string?

Proposed Solution:

     A subexpression that can match a null string shall	 not
be duplicated.

Rationale:

     Although adjacent duplication symbols are illegal	(for
no  apparent  reason,  particularly for	EREs),	\(x*\)*	is a
legal expression, in which the	*s are	not  adjacent,	that
raises	the  question:	how  many  times  is the null string
matched?

     Suppose that  \(x*\)* were	allowed.  Does	matching  it
to  xxx	yield  \1 =  xxx or  \1	= null?	 The latter alterna-
tive is	consistent with	the  rule  that	 a  null  string  is
longer	than no	match.	By extending  xxx with a null string
instead	of nothing, one	gets  a	 longer	 match.	  More	null
strings	make even longer matches.

     To	avert metaphysical questions about the last  element
of an infinite sequence, or an element following an infinite
sequence, one could forbid null	iterations.  But this has an
unsatisfying corollary that  \(x*\)* wouldn't match the	null
string.	 Compromise positions might forbid  null  iterations
after the first	iteration or after the first null iteration.
Such special pleading and the resultant	implementation	com-
plexity	is not worth the return.

     Lord and McIlroy  have  implemented  a  no-null-unless-
first  policy;	Spencer	 has  implemented repeat-until-null.
Spencer's interpretation seems perhaps less strained,  until
you realize that it makes references ( \1,  \2,	etc) to	con-
tents of duplications useless because they  will  always  be
null or	undefined.

     The proposed solution also	outlaws	EREs like  (a?|b)*,
(a?)\{2,2\},  and  .   ([a-z]*(|^))* The set of	regular	lan-
guages is unchanged, however.  There would  still  be  legal
equivalents to these and all other outlawed expressions.


     An	alternative, to	 leave	the  meaning  undefined,  is
unacceptable.	Users will be unaware of the exact bounds of
portability, and their biggest	intellectual  investments  -
the  hardest  expressions and sed scripts - are	liable to be
unportable.
	  ________________________________________

 [13] What is the meaning of the BRE  \(\)?  [2.8.5.2]

Proposed Solution:

     Forbid this expression.

     It	would be also acceptable, but  rather  less  so,  to
define that the	pattern	 \(\) matches the null string.

Rationale:

     The pattern has no	analog in EREs,	no utility,  and  to
our  knowledge	no  basis  in  history.	  It  should thus be
banned.	Otherwise, at least define what	it means.
	  ________________________________________

 [14] On line 2792, what is left-to-right order	in a match?

Proposed Solution:

     Insert the	following text at the end of line 2792:

     A nested subpattern is to the left	of the	subpat-
     tern  that	contains it.  An earlier iteration of a
     duplicated	subpattern is to the left  of  a  later
     iteration.	  Neither  side	of an alternation is to
     the left of the other.  Length ties in an alterna-
     tion are broken in	favor of the left side.


Rationale:

     Consider the pattern  \(\(...\)*\(.....\)*.\).* matched
against	 the string  xxxxxx.  Is the leftmost subpattern the
first complete subpattern, namely  ...,	or the pattern	that
lexically  starts  first,  namely  \1?	Call these two cases
``first-finished'' and ``first-begun''.	 The proposed  solu-
tion adopts first-finished.

     A first-finished match yields  \1 =  xxxx,	 \2 =	xxx,
and    \3  unmatched.	A  first-begun	match  yields  \1 =
xxxxxx,	 \2 unmatched, and  \3 =  xxxxx.  Lord	and  Spencer
both do	first-finished matching, although Spencer argues for
first-begun.  In many cases, including the present  example,
first-finished	matching can be	done with less backtracking.
McIlroy	implemented both and  found  first-begun  much	more
awkward	 than  first-finished.	 Historical  practice favors
first-finished.

     The left-associativity of concatenation (line 3256) and
the  transparency  of parentheses (line	2977) together imply
that   \(...\)*\(.....\)*.*  and    \(\(...\)*\(.....\)*\).*
should	match  the same	way.  Under first-begun	matching to
xxxxx, in the former case subpattern  \(...\)* matches	xxx;
in  the	latter it matches the null string.  Thus first-begun
matching entails a contradiction.
	  ________________________________________

 [15] To which match is	 a  backreference  to  a  duplicated
     subexpression bound? [2.8.3.3(4)]

Proposed Solution:

     A backreference  to  a  subexpression  contained  in  a
duplication  is	bound to the (possibly nonexistent) match to
the subexpression in the most recent iteration of the dupli-
cation.	 Thus  \(\(.\)\2\)* matches  xxyyzz, with  \2 refer-
ring to	 x,  y,	and  z respectively in the three  iterations
of   the  outer	 subexpression.	  However,    \(\(.\)*\2#\)*
matches	only the first six characters of   xx#yy##,  because
in  a  third  iteration	of the outer subexpression,  . would
match nothing (as distinct from	matching a null	string)	 and
hence  \2 would	match nothing.

Rationale:

     Current implementations agree on  this  interpretation,
which  is  a  natural generalization of	binding	in  regmatch
structures by  regexec().  [B.5.2]


WG15 response for 9945-2:1993 
-----------------------------------

Q12

        The standard does not require the successful
        matching of a null string after a successful match on a non null
        string for duplication symbols.  However, the
        standard does not clearly say that you are not allowed to match
        null strings after a successful match.  Therefore the standard
        in ambiguous.

        The definition of a back reference expression in section 2.8.3.3
        does not specify which match of a subexpression followed by a
        duplication symbol is to be returned.  We note however that the
        definition regcomp() defined in section B5.1 pg 728 344-346
        indicate that the back reference expression will match the last
        string matched by the sub expression.

        The standard is unclear on this issue, and no conformance
        distinction can be made between alternative implementations
        based on this.  This is being referred to the sponsor.    

Q13

        Pg 89 line 3263 requires this form to be accepted.  But, the
        text 2.8.3 does not describe its meaning.

        Concerns about the wording of this part of the standard have
        been forwarded to the sponsor.


This section New to this revision
----------------------------------

Part 14:
The text on page 82, lines 2975-2979, and page 77, lines 2791-2796 are in
conflict and as such the standard is unclear on this issue, and no 
conformance distinction can be made between alternative implementations 
based on this.  This is being referred to the sponsor.

Part 15:
The standard does not speak to this issue, and as such no conformance 
distinction can be made between alternative implementations based on this.  
This is being referred to the sponsor.



Rationale:
	None.
 
Proposed resolution forwarded: Aug 11 1995
Finalized: Sept 12 1995