WG15 Defect Report Ref: 9945-2-60
Topic: regular expressions, 9945-2-92/INT #2 Regular expressions


This is an approved interpretation of 9945-2:1993.


Last update: 1997-05-20


								9945-2-60

 _____________________________________________________________________________


	Topic:			regular expressions, 9945-2-92/INT #2 
	Relevant Sections:	B.5.2
	Classification: defect


Defect Report:
-----------------------

A recent interpretation, 9945-2-92/INT #2, appears to
change the meaning of the words in POSIX.2 incorrectly.
Could ISO/IEC clarify the situation?

>WG15 response for 9945-2:1993  
>-----------------------------------
>The subexpression representing the entire RE is to be included in the
>count represented in the re_nsub member. No change in wording is
>necessary.

POSIX.2 is clear on line 285 on Page 727 that re_nsub contains the 
number of PARENTHESIZED subexpressions, which is different
from the total number of subexpressions because pattern itself counts as a
subexpression (see line 337 on Page 728).  

The interpretation given adds one to the value stored in re_nsub to cover 
the subexpression which encompasses the whole expression but which is 
not parenthesized. We do not believe that this is correct.

The original interpretation request was as follows:

>	Topic:			Regular expressions
>	Relevant Sections:	B.5.2
>
>Interpretation Request:
>-----------------------
>          In Section B.5.2 - Description {of  C  Binding  for  Regular 
>          Expression Matching}, the standard states that  the  re_nsub 
>          member of the regex_t structure  represents  the  number  of 
>          parenthesized subexpressions found in pattern.  [Draft 12 of 
>          ISO/IEC 9945-2:1993 (July 1992), p. 766, lines 329-331] 
> 
>          The standard then states that the pmatch argument 
> 
>               shall point to  an  array  with  at  least  nmatch 
>               elements, and regexec() shall fill in the elements 
>               of that array with offsets of  the  substrings  of 
>               string  that  correspond  to   the   parenthesized 
>               subexpressions of pattern:  pmatch[i].rm_so  shall 
>               be  the  byte  offset   of   the   beginning   and 
>               pmatch[i].rm_eo shall be one greater than the byte 
>               offset of the end of substring i.   (Subexpression 
>               i begins at  the  ith  matched  open  parenthesis, 
>               counting from  1.)   Offsets  in  pmatch[0]  shall 
>               identify the substring  that  corresponds  to  the 
>               entire regular expression. 
> 
>          [Ibid., p. 766-767, lines 339-346] 
> 
>          Thus, if pmatch[] contains nmatch elements, it can only hold 
>          nmatch-1  parenthesized  subexpressions  of  string,   since 
>          pmatch[0] represents the entire regular expression. 
> 
>          The standard also states  that  ``if  there  are  more  than 
>          nmatch subexpressions in pattern (pattern itself counts as a 
>          subexpression), then regexec() [...] shall record  only  the 
>          first nmatch substrings.'' [Ibid., p. 767, lines 347-350] 
> 
>          Lines 347-350 appear to contradict lines 339-346; the latter 
>          talks about parenthesized subexpressions, while  the  former 
>          mentions  plain  subexpressions.   Is  the  intent  of   the 
>          standard  to  allow  the  re_nsub  member  to  include   the 
>          subexpression representing the entire regular expression  in 
>          the count (since it is considered a  subexpression  on  page 
>          767, lines  347-350),  or  does  it  only  count  explicitly 
>          parenthesized  subexpressions?   We  believe  this  is   the 
>          easiest way to rectify the ambiguity. 
 
There is no contradiction.  The two paragraphs are discussing two different 
functions--regcomp and regexec.
 
It is VERY clear that the value for re_nsub as set by regcomp is the
number of actual groupings present in the RE.
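
As an illustration of this reading, the following minimal sketch reports
re_nsub for a given pattern.  The helper name count_groups and the choice of
REG_EXTENDED are assumptions made for the example; they do not come from
the standard text.

```c
#include <regex.h>

/* Hypothetical helper: return re_nsub for an ERE pattern,
 * or -1 if regcomp() rejects the pattern. */
long count_groups(const char *pattern)
{
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* re_nsub counts parenthesized subexpressions only; the
     * pattern as a whole is not included in the count. */
    long n = (long)re.re_nsub;

    regfree(&re);
    return n;
}
```

Under this reading, a pattern with no parentheses such as "abc" yields 0,
and "a(b)c" yields 1.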
 
In the second paragraph (discussing regexec), the standard is merely making
it clear that pmatch[0] describes the entire RE matched, and that nmatch must
take that fact into account.  For example, if there are two parenthesized REs,
then one needs to have at least three regmatch_t's to have all the
submatches recorded.
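
The sizing rule can be sketched as follows.  The helper name
match_all_groups is hypothetical, and REG_EXTENDED is an assumption of the
example; the point is only that the array holds re_nsub+1 elements, with
element 0 reserved for the entire RE.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: compile pattern as an ERE, match it against
 * string, and return the number of pmatch elements filled in
 * (re_nsub + 1), or -1 on any failure.  On success *out points to a
 * malloc'd array that the caller must free. */
int match_all_groups(const char *pattern, const char *string,
                     regmatch_t **out)
{
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* One element per parenthesized subexpression, plus element 0
     * for the substring matched by the entire RE. */
    size_t nmatch = re.re_nsub + 1;
    regmatch_t *pmatch = malloc(nmatch * sizeof *pmatch);

    if (pmatch == NULL || regexec(&re, string, nmatch, pmatch, 0) != 0) {
        free(pmatch);
        regfree(&re);
        return -1;
    }

    regfree(&re);
    *out = pmatch;
    return (int)nmatch;
}
```

For the pattern "(a+)(b+)", which has two parenthesized REs, this sketch
allocates three regmatch_t's: pmatch[0] for the whole match, and pmatch[1]
and pmatch[2] for the two subexpressions.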
 
Let's assume for the moment that the entire RE counted as a parenthesized
RE.  Then re_nsub would be one higher than today.  It is still the case that
the entire RE's match is recorded in pmatch[0], but it would ALSO have to
be recorded in pmatch[1]!  This is because the first subexpression has number
1, and it must be placed in pmatch[1].
 


A new problem has also been noticed.  On looking at the POSIX.2
rationale, I note that Page 1040 lines 11926 and 11927 suggest that nmatch
should not be larger than re_nsub.  This statement seems to be inaccurate,
since nmatch should equal re_nsub+1 if all subexpression data is to be
captured.  It may, however, have influenced the interpretation.
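
The effect of passing nmatch equal to re_nsub can be sketched as below.
The helper name last_group_recorded and the sentinel value -2 are
assumptions of this example, and the check relies on regexec() leaving
elements beyond the first nmatch untouched, which is how the "shall record
only the first nmatch substrings" wording is read here.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: call regexec() with nmatch = re_nsub and report
 * whether the last parenthesized subexpression was recorded.
 * Returns 1 if recorded, 0 if not, -1 on error or no match. */
int last_group_recorded(const char *pattern, const char *string)
{
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Allocate the full re_nsub + 1 elements so the last one can be
     * inspected safely afterwards. */
    size_t full = re.re_nsub + 1;
    regmatch_t *pmatch = malloc(full * sizeof *pmatch);
    if (pmatch == NULL) {
        regfree(&re);
        return -1;
    }

    /* Sentinel (an assumption of this sketch) marking elements that
     * regexec() never wrote. */
    for (size_t i = 0; i < full; i++)
        pmatch[i].rm_so = pmatch[i].rm_eo = -2;

    /* Deliberately pass re_nsub instead of re_nsub + 1. */
    int rc = regexec(&re, string, re.re_nsub, pmatch, 0);
    int recorded = (rc == 0) ? (pmatch[full - 1].rm_so != -2) : -1;

    free(pmatch);
    regfree(&re);
    return recorded;
}
```

With nmatch set to re_nsub, element 0 still takes the entire match, so the
final parenthesized subexpression has no element left to be recorded in.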


WG15 response for 9945-2:1993 
-----------------------------------

The interpretation 9945-2-92/INT #2 is incorrect, as noted above,
and has been withdrawn.

There is an error in the rationale (page 1040 lines 11926-11927);
a future revision should change "re_nsub" to "re_nsub+1".

Rationale for Interpretation:
-----------------------------

This is a "defect" situation and the previous interpretation has been
withdrawn. 

It is expected that a future revision of the standard will address
the problem in the rationale.



 _____________________________________________________________________________