ISO/IEC JTC1/SC22/WG14 N898

WG14/N898       DR212

C99 Defect Report

Author: Clive Feather <clive@demon.net>
Date: 1999-10-20

Subject: binding of multibyte conversion state objects


Summary
-------

At present an mbstate_t object can only ever be used to make one
conversion. This is not desirable, and changes are proposed in this area.


Discussion
----------

Clause 7.24.6 paragraph 3 reads, in part:

    If an mbstate_t object has been altered by any of the functions
    described in this subclause, and is then used with a different
    multibyte character sequence, or in the other conversion direction,
    or with a different LC_CTYPE category setting than on earlier
    function calls, the behavior is undefined.

Put another way, each mbstate_t object is initially "unbound" (if it is
initialized to zero) and then becomes "bound" by any call to a function
such as mbrtowc or wcrtomb. When "bound" it can only be used in the
same direction with the same string as originally bound, and only when the
LC_CTYPE category is that in effect when it was bound.

With ordinary mbstate_t objects this is an annoyance; one implication is
that a new object must be created every single time a new string is to be
converted (the Standard does not provide any way to "unbind" the object).
With the mbstate_t object inside a FILE structure it is even worse, because
it makes it impossible to (for example) write to a file, rewind it, and
then read the same file. Similarly, the internal mbstate_t objects used
when the mbstate_t pointer argument is set to NULL can be used for only
one string in the entire program!

Users of mbstate_t objects (including those in FILE structures) expect
to be able to use them for more than a single purpose.

Proposed solution
-----------------

The changes introduce the concept that an mbstate_t object is either
"unbound" or "bound". When set to an all-zero value (which can be at
initialization or explicitly later on) it is unbound. As soon as the
object is used for a conversion it becomes bound to that string, locale,
and direction. Returning to the initial state does not unbind the
object (in other words, while all unbound objects are in the initial
state the converse is not necessarily true).
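
As an illustration only, the following sketch shows the usage the proposed
wording is meant to permit: a single mbstate_t object is used for two
unrelated strings, and is set to all-zero before each use. The helper names
count_chars and count_both are hypothetical and are not part of the proposal;
the sketch assumes that zeroing the object with memset gives it the zero
value the proposed text refers to.

```c
#include <string.h>
#include <wchar.h>

/* Count the multibyte characters in s, using *ps as the conversion state.
   Returns (size_t)-1 on an encoding error or an incomplete sequence. */
static size_t count_chars(const char *s, mbstate_t *ps)
{
    size_t count = 0;
    size_t left = strlen(s);

    while (left > 0) {
        size_t n = mbrtowc(NULL, s, left, ps);  /* pwc may be null */
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;
        if (n == 0)                             /* null character: stop */
            break;
        s += n;
        left -= n;
        count++;
    }
    return count;
}

/* Convert two unrelated strings with one state object (hypothetical helper). */
static size_t count_both(const char *a, const char *b)
{
    mbstate_t st;
    size_t na, nb;

    memset(&st, 0, sizeof st);   /* all-zero value: unbound */
    na = count_chars(a, &st);    /* st becomes bound to string a */

    memset(&st, 0, sizeof st);   /* assumed to unbind st under the proposal */
    nb = count_chars(b, &st);    /* st becomes bound to string b */

    if (na == (size_t)-1 || nb == (size_t)-1)
        return (size_t)-1;
    return na + nb;
}
```

Under the existing wording of 7.24.6 paragraph 3, the second call to
count_chars is not clearly sanctioned; under the proposed wording it is.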

The special cases of mbrtowc and wcrtomb are defined to always result
in an unbound state. This both provides more consistent behaviour (the
special case resets everything to a known state) and also allows the
internal mbstate_t objects associated with these functions to be unbound.
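
A minimal sketch of the special case follows. The helper names reset_state
and demo_reset are hypothetical. Note that the Standard offers no function
that reports whether an object is bound; mbsinit reports only whether it
describes the initial conversion state, so that is all the sketch can check.

```c
#include <string.h>
#include <wchar.h>

/* Reset *ps through the null-s special case of mbrtowc.
   Returns 1 if *ps then describes the initial conversion state, else 0. */
static int reset_state(mbstate_t *ps)
{
    /* Equivalent to mbrtowc(NULL, "", 1, ps); pwc and n are ignored. */
    (void)mbrtowc(NULL, NULL, 0, ps);
    return mbsinit(ps) != 0;
}

/* Use a state object for one conversion, then reset it (hypothetical). */
static int demo_reset(void)
{
    mbstate_t st;
    wchar_t wc;

    memset(&st, 0, sizeof st);
    if (mbrtowc(&wc, "x", 1, &st) != 1)  /* st is now bound to "x" */
        return 0;
    return reset_state(&st);             /* proposal: st is unbound again */
}
```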

The mbstate_t object hidden in a file is returned to the unbound state
whenever end of file is reached on input, and by any call to fseek
(these choices were made to correspond with the requirements of 7.19.5.3
paragraph 6 for changing I/O direction).
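
The stream case can be sketched as follows, assuming a temporary file is
available from tmpfile. The helper name round_trip is hypothetical, and the
text passed in is assumed to be shorter than 64 wide characters and to
contain no new-line character. The sketch writes wide characters, calls
fseek, and then reads from the same stream, which is the sequence the
proposed wording is intended to make well defined.

```c
#include <stdio.h>
#include <wchar.h>

/* Write text to a temporary file, seek back, and read it again.
   Returns 1 if the text read matches, 0 if it does not,
   and -1 if any I/O operation failed. */
static int round_trip(const wchar_t *text)
{
    FILE *fp = tmpfile();
    wchar_t buf[64];
    int ok;

    if (fp == NULL)
        return -1;

    /* First wide operation: the stream becomes wide-oriented and its
       hidden mbstate_t object is used in the output direction. */
    if (fputws(text, fp) == EOF) {
        fclose(fp);
        return -1;
    }

    /* Proposal: a successful fseek leaves the hidden object unbound. */
    if (fseek(fp, 0L, SEEK_SET) != 0) {
        fclose(fp);
        return -1;
    }

    /* The hidden object is now used in the input direction. */
    if (fgetws(buf, 64, fp) == NULL) {
        fclose(fp);
        return -1;
    }

    ok = (wcscmp(buf, text) == 0);
    fclose(fp);
    return ok;
}
```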

The internal mbstate_t objects associated with the mbrlen, mbrtowc,
wcrtomb, mbsrtowcs, and wcsrtombs functions can only be used with the
locale they initially bind to. Other changes deal with the first three;
a previously impossible case is used for the last two to force the
object to the unbound state.


Suggested Technical Corrigendum
-------------------------------

(Changes concerning explicit mbstate_t objects.)

Change 7.24.6 paragraph 3 to:

    [#3]  The  initial  conversion  state  corresponds,  for   a
    conversion  in  either  direction, to the beginning of a new
    multibyte character in the initial shift state. An mbstate_t
    object may be "unbound" or "bound". A zero-valued mbstate_t
    object is (at least) one way to describe an unbound object,
    and if an mbstate_t object is assigned such a value it
    becomes unbound. All unbound mbstate_t objects are in the
    initial conversion state (but the converse is not necessarily
    true).

    [#3a] An unbound object can be used to initiate conversion
    involving any multibyte character sequence, in any LC_CTYPE
    category setting, in either direction; once used for a conversion,
    it becomes bound to that sequence, category setting, and direction.
    If a bound mbstate_t object is used with a different multibyte
    character sequence, a different LC_CTYPE category setting, or in
    the opposite conversion direction to the one it is bound to, the
    behavior is undefined.290)

Append to footnote 290:

    Furthermore, provided that the object is unbound, and thus in
    the initial conversion state, it can then be used in converting
    a new string, in a new locale, or in the other direction.
 
Change 7.24.6.3 paragraph 1 and 7.24.6.4 paragraph 1 from:

    [...] which is initialized at program startup to the initial
    conversion state. [...]
to:
    [...] which is initialized at program startup to the unbound
    state. [...]

Change 7.24.6.3.2 paragraph 2 to:

    [#2]  If  s  is  a  null  pointer,  the  mbrtowc function is
    equivalent to the call:

                    mbrtowc(NULL, "", 1, ps)

++  except that the resulting state described is unbound even if
++  an encoding error occurred.

    In this case, the values of the parameters  pwc  and  n  are
    ignored.

Change 7.24.6.3.3 paragraph 2 to:

    [#2] If s  is  a  null  pointer,  the  wcrtomb  function  is
    equivalent to the call

                    wcrtomb(buf, L'\0', ps)

    where buf is an internal buffer
++  except that the resulting state described is always unbound even
++  if an encoding error occurred 291a; the value of wc is ignored.

++  291a The effect is reliably to make *ps unbound.

Append to 7.24.6.4 paragraph 2:

    As a special case, if src is a null pointer then the normal
    behaviour of the function is ignored and instead ps becomes
    unbound irrespective of its previous state; an unspecified
    value is returned.


(Changes associated with streams.)

Append to 7.19.2 paragraph 6:

    If a wide character input function encounters end-of-file, or
    after a successful call to the fseek function, the mbstate_t
    object associated with the stream is unbound.

Append to the last sentence of 7.19.9.2 paragraph 5:

    and if the stream is wide-oriented the associated mbstate_t
    object shall be unbound.

In 7.24.3.1 paragraph 2, change:

to:
    [...] If the stream
    is at end-of-file, the end-of-file indicator for the  stream
++  is set, the mbstate_t object associated with the stream is unbound,
    and fgetwc returns WEOF. [...]