|
Message-ID: <20230626122833.7f0e73a8@inria.fr>
Date: Mon, 26 Jun 2023 12:28:33 +0200
From: Jₑₙₛ Gustedt <jens.gustedt@...ia.fr>
To: musl@...ts.openwall.com
Subject: Re: [C23 string conversion 1/2] C23: implement the c8rtomb
and mbrtoc8 functions
Rich,
on Sun, 25 Jun 2023 13:26:20 -0400 you (Rich Felker <dalias@...c.org>)
wrote:
> Following up on some specifics of code I hadn't reviewed in enough
> detail yet...
>
> On Wed, May 31, 2023 at 04:01:43PM +0200, Jens Gustedt wrote:
> > The implementations each separate two cases: a fast path that
> > corresponds to calls with state variables that are in the initial
> > state and a leading input character that is ASCII; the other cases
> > are handled by a function for the slow path. These functions are
> > implemented such that the fast versions should not need stack
> > variables of their own, and thus can tail call into the slow path
> > with a jmp instruction if necessary. The fast versions could
> > typically be also used as shallow inline wrapper, if we like to go
> > that way.
> >
> > Only the implementation of mbrtoc8 is a bit tricky. This is because
> > of the weird conventions for dripping the output bytes one call
> > after the other. This is handled by using the state variable as a
> > stack for the up to three characters that are still to be sent to
> > the output. ---
> > include/uchar.h | 6 +++++
> > src/multibyte/c8rtomb.c | 31 +++++++++++++++++++++++++
> > src/multibyte/mbrtoc8.c | 51
> > +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 88
> > insertions(+) create mode 100644 src/multibyte/c8rtomb.c
> > create mode 100644 src/multibyte/mbrtoc8.c
> >
> > diff --git a/include/uchar.h b/include/uchar.h
> > index 7e5c4d40..9f6a3706 100644
> > --- a/include/uchar.h
> > +++ b/include/uchar.h
> > @@ -9,6 +9,9 @@ extern "C" {
> > typedef unsigned short char16_t;
> > typedef unsigned char32_t;
> > #endif
> > +#if __cplusplus < 201811L
> > +typedef unsigned char char8_t;
> > +#endif
> >
> > #define __NEED_mbstate_t
> > #define __NEED_size_t
> > @@ -16,6 +19,9 @@ typedef unsigned char32_t;
> > #include <features.h>
> > #include <bits/alltypes.h>
> >
> > +size_t c8rtomb(char *__restrict, char8_t, mbstate_t *__restrict);
> > +size_t mbrtoc8(char8_t *__restrict, const char *__restrict,
> > size_t, mbstate_t *__restrict); +
> > size_t c16rtomb(char *__restrict, char16_t, mbstate_t *__restrict);
> > size_t mbrtoc16(char16_t *__restrict, const char *__restrict,
> > size_t, mbstate_t *__restrict);
> > diff --git a/src/multibyte/c8rtomb.c b/src/multibyte/c8rtomb.c
> > new file mode 100644
> > index 00000000..8645e112
> > --- /dev/null
> > +++ b/src/multibyte/c8rtomb.c
> > @@ -0,0 +1,31 @@
> > +#include <uchar.h>
> > +#include <wchar.h>
> > +
> > +__attribute__((__noinline__))
>
> Where we use this so far is conditioned on #ifdef __GNUC__. I'm not
> sure if it's actually helping, but I think it'd make sense to be
> consistent.
ok, no pb
> > +static size_t c8rtomb_slow(char *__restrict s, unsigned char c8,
> > mbstate_t *__restrict st) +{
> > + // We need an internal state different from mbrtowc.
> > + static mbstate_t internal_state;
> > + if (!st) st = &internal_state;
> > + wchar_t wc;
> > +
> > + // mbrtowc has return values -2, -1, 0, 1.
> > + size_t res = mbrtowc(&wc, (char const*)&c8, 1, st);
> > + switch (res) {
> > + // our implementation of wcrtomb ignores the state
> > + case 1: res = wcrtomb(s, wc, 0); break;
> > + case 0: res = 1; if (s) *s = 0; break;
> > + }
> > + return res;
> > +}
>
> At first I was confused by the conversion back and forth, but indeed
> it absolutely makes sense to handle the weird buffering this interface
> imposes a requirement for.
it is weird, indeed
> However, it seems to be malfunctioning in the C locale, where
> arbitrary bytes of c8 input would be accepted (despite not being valid
> UTF-8, as the interface requires).
>
> Logically, the right thing to do is to set the calling thread's locale
> to __c_dot_utf8_locale for the duration of the mbrtowc only. However,
> I think there may be simpler solutions depending on what you interpret
> the interface requirement as being.
>
> If c8rtomb is required to accept valid UTF-8 up to the penultimate
> byte then error only on storing it due to the locale not being able to
> represent it, I don't see a good solution except what I just said.
>
> But if it's permitted to preemptively say "sorry, nope, not gonna be
> able to represent that" as soon as it sees the first byte, then when
> MB_CUR_MAX==1, having c8rtomb_slow fail unconditionally with EILSEQ
> seems like the best and easiest approach.
Good question. We should just be consistent and do the same whatever
we do when we pass a surrogate into `c16rtomb`. If I read that
correctly this does fail with `EILSEQ`.
The intent (IIRC what I did some weeks ago ;-) of this
> > + size_t res = mbrtowc(&wc, (char const*)&c8, 1, st);
was to fail the same if `c8` is not start or continue of a valid
mb-character.
> > +__attribute__((__noinline__))
> > +static size_t mbrtoc8_slow(unsigned char *__restrict pc8, const
> > char *__restrict src, size_t n, mbstate_t *__restrict st) +{
> > + static unsigned internal_state;
> > + wchar_t wc;
> > + unsigned* pending = st ? (void*)st : &internal_state;
> > +
> > + // The high bit is set if there is still missing input. If
> > it
> > + // is not set and the value is not zero, a previous call
> > has
> > + // stacked the missing output bytes withing the state.
> > + if ((-*pending) > INT_MIN) {
> > + if (pc8) *pc8 = *pending;
> > + (*pending) >>= 8;
> > + return -3;
> > + }
> > +
> > + // mbrtowc has return values -2, -1, 0, 1, ..., 4.
> > + size_t res = mbrtowc(&wc, src, n, (void*)pending);
> > + if (res <= 4) {
> > + _Alignas(unsigned) unsigned char s[8] = { 0 };
> > + // Write the result bytes to s over the word
> > boundary
> > + // as we need it. Our wcrtomb implementation
> > ignores
> > + // the state variable, there will be no errors and
> > the
> > + // return value can be ignored.
> > + wcrtomb((void*)(s+3), wc, 0);
> > + // Depending on the endianess this maybe a single
> > load
> > + // instruction. We want to be sure that the bytes
> > are
> > + // in this order, and that the high-order byte is
> > 0.
> > + *pending = (0u|s[4])<<000 | (0u|s[5])<<010 |
> > (0u|s[6])<<020 | (0u|s[7])<<030;
> > + if (pc8) *pc8 = s[3];
> > + }
> > + return res;
> > +}
>
> Does _Alignas work the way you want it on an array like this?
It is supposed to, yes.
> I was under the impression you would need a structure to do that in
> a way that actually meshes with the standard semantics for _Alignas
> and not the GCC attribute's special behaviors.
In the contrary, I think the support for this feature in `struct` had
first been forgotten in C11 (was it) and then added in C17.
> A cleaner approach might be a union where the presence of the unsigned
> union member yields the alignment. This would also be good since the
> code is written in C99 which doesn't have _Alignas.
This may indeed be better readable and portable to oldish C. I'll look
into this.
> Overall it seems like a lot of complexity effort to set things up for
> an efficient store, but if anyone's going to use this interface it's
> probably worthwhile.
Great.
Yes, this is supposed to be used to process streams of input.
I observed quite nice optimization on my platform with that
trick.
> > +weak_alias(__mbrtoc8, mbrtoc8);
>
> I think we already discussed this in another place, but in case I'm
> misremembering, these aliases aren't needed unless you need a
> namespace-safe way to call the function internally.
Yes, this is probably a left-over, sorry.
Thanks
Jₑₙₛ
--
:: ICube :::::::::::::::::::::::::::::: deputy director ::
:: Université de Strasbourg :::::::::::::::::::::: ICPS ::
:: INRIA Nancy Grand Est :::::::::::::::::::::::: Camus ::
:: :::::::::::::::::::::::::::::::::::: ☎ +33 368854536 ::
:: https://icube-icps.unistra.fr/index.php/Jens_Gustedt ::
Content of type "application/pgp-signature" skipped
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.