musl - Re: Re: First feedback on new C locale problems

Message-ID: <20150928185837.GD17773@brightrain.aerifal.cx>
Date: Mon, 28 Sep 2015 14:58:37 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Re: First feedback on new C locale problems

On Sun, Sep 27, 2015 at 12:59:25PM -0400, Rich Felker wrote:
> On Sun, Sep 27, 2015 at 03:49:02PM +0200, Felix Janda wrote:
> > Rich Felker wrote:
> > > On Sun, Sep 27, 2015 at 08:17:38AM +0200, Felix Janda wrote:
> > > > Rich Felker wrote:
> > > > > On Sat, Sep 26, 2015 at 06:58:36AM +0200, Felix Janda wrote:
> > > > > > On 2015-09-09 05:56:48 GMT, Rich Felker wrote:
> > > > > > > On Tue, Sep 01, 2015 at 02:32:35AM -0400, Rich Felker wrote:
> > > > > > > > What I'd like to do to fix it is just always return "UTF-8" for
> > > > > > > > nl_langinfo(CODESET) regardless of locale (rather than returning
> > > > > > > > "UTF-8-CODE-UNITS" when in C locale). POSIX places no requirements on
> > > > > > > > nl_langinfo that would preclude this, and it seems like it would
> > > > > > > > restore the desired properties and fix all the regressions.
> > > > > > >
> > > > > > > Committed.
> > > > > > >
> > > > > > > Rich
> > > > > > 
> > > > > > GNU sed seems to care about the output from nl_langinfo:
> > > > > > 
> > > > > > https://bugs.gentoo.org/show_bug.cgi?id=560728
> > > > > > 
> > > > > > More specifically, so does lib/localecharset.c, which is used in
> > > > > > the replacement of re_compile_pattern.
> > > > > 
> > > > > I was able to reproduce this (with slightly different output, "a© a'")
> > > > > on Alpine. Clearly this is some sort of bug in the gnulib code or sed
> > > > > itself, since it's producing corrupt output. I think we should explore
> > > > > why that's happening and whether it's possible to fix there. But if
> > > > > there remain other reasons that returning "UTF-8" in the C locale is
> > > > > not practical then perhaps we could resort to returning "ASCII".
> > > > 
> > > > A possible fix is
> > > > 
> > > > --- ./a/sed-4.2.1/lib/regcomp.c
> > > > +++ ./a/sed-4.2.1/lib/regcomp.c
> > > > @@ -824,7 +824,7 @@ re_compile_internal (regex_t *preg, cons
> > > >  
> > > >  #ifdef RE_ENABLE_I18N
> > > >    /* If possible, do searching in single byte encoding to speed things up.  */
> > > > -  if (dfa->is_utf8 && dfa->mb_cur_max != 1 && !(syntax & RE_ICASE) && preg->translate == NULL)
> > > > +  if (dfa->is_utf8 && !(syntax & RE_ICASE) && preg->translate == NULL)
> > > >      optimize_utf8 (dfa);
> > > >  #endif
> > > >  
> > > > 
> > > > In our case is_utf8 is 1 and mb_cur_max is also 1. The function
> > > > optimize_utf8() would change "." to match UTF-8 characters instead of
> > > > bytes. For some reason I have not investigated further, "©" (or any
> > > > other non-ASCII character) is then not matched, but in the C locale we
> > > > want "." to match invalid UTF-8 sequences as well anyway.
> > > 
> > > I think this fix is misplaced; it looks like it would make GNU regex
> > > do UTF-8 character matching rather than byte matching in the C locale.
> > > Rather one of the other places that has an is_utf8 check also needs to
> > > have the mb_cur_max!=1 check added, I think.
> > 
> > Oh, sorry for the confusion. The patch is inverted...
> 
> Ah, ok. But in that case, it's probably best not to detect is_utf8 to
> begin with if MB_CUR_MAX==1.
> 
> I should probably read the code and try to get a better understanding
> of what it's doing.

I think the actual error is here:

http://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/regcomp.c#n903

In the _LIBC code path, they check MB_CUR_MAX==6 (glibc's nonstandard
value they use for UTF-8), perhaps just as an optimization for the
non-UTF-8 case, but they don't check it for !_LIBC; they just rely on
the CODESET name matching.

I'm still somewhat concerned that returning "UTF-8" is problematic
here, but I think gnulib also has a bug; trusting their interpretation
of the string returned by nl_langinfo(CODESET) seems to be leading to
corrupt results.
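
As a rough sketch of the kind of guard discussed above, the UTF-8
detection could be made conditional on MB_CUR_MAX being greater than 1,
so that a byte-based C locale is never treated as multibyte UTF-8 even
if nl_langinfo(CODESET) returns "UTF-8". The helper names below are
hypothetical and this is not gnulib's actual code; the exact codeset
comparison gnulib performs may differ.

```c
#include <langinfo.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: decide whether to enable UTF-8 multibyte
   handling from a codeset name and the locale's MB_CUR_MAX value.
   A single-byte locale (mb_cur_max == 1) is never treated as UTF-8,
   whatever name it reports. */
static int charset_is_multibyte_utf8(const char *codeset, size_t mb_cur_max)
{
    return mb_cur_max > 1 && strcmp(codeset, "UTF-8") == 0;
}

/* Hypothetical wrapper that queries the current locale. */
static int locale_is_multibyte_utf8(void)
{
    return charset_is_multibyte_utf8(nl_langinfo(CODESET), MB_CUR_MAX);
}
```

With such a check, a C locale that reports "UTF-8" but has
MB_CUR_MAX==1 would keep byte matching, which is the behavior wanted
here.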

Rich