Message-ID: <20151230155848.GK238@brightrain.aerifal.cx>
Date: Wed, 30 Dec 2015 10:58:48 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Patches: Timezone in %c and POT file

On Wed, Dec 30, 2015 at 11:56:33AM +0100, Markus Wichmann wrote:
> Hi all,
> 
> Now I have subscribed, so CC'ing me is no longer necessary.
> 
> Today I worked on two things: Firstly, I put the timezone into
> strftime's %c output. The reason is that glibc's strftime() does the
> same. That means, that an application dev currently can't depend on
> either behavior (so strftime("%c %Z") will give me the timezone twice on
> glibc, but only once on musl, and my app won't be able to tell without
> inspecting the resulting string).
> 
> No biggie, changing that one is easy. Of course, a heated argument can
> be had over whether or not we want it one way or the other. And it'll
> come down to personal taste, because as far as I'm aware, POSIX isn't
> mandating anything about this.

The format for %c in the C locale is strictly specified by ISO C as
"%a %b %e %T %Y"; see 7.27.3.5 ¶ 7. If glibc does not match this it's
a bug in glibc. POSIX is of course aligned with ISO C and says the
same thing. In other locales it's permitted to differ.
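
A minimal sketch of how one might check this on a given libc: format the
same struct tm with "%c" and with the spelled-out format quoted above, and
compare the results. The helper name c_format_matches_iso is hypothetical,
and the comparison is only expected to hold while the C locale is in effect.

```c
#include <string.h>
#include <time.h>

/* Hypothetical helper: returns 1 if "%c" and the explicit ISO C format
 * produce identical text for *tm in the current locale, 0 otherwise
 * (including when either result does not fit in the buffer). */
int c_format_matches_iso(const struct tm *tm)
{
    char via_c[128], via_explicit[128];

    size_t n1 = strftime(via_c, sizeof via_c, "%c", tm);
    size_t n2 = strftime(via_explicit, sizeof via_explicit,
                         "%a %b %e %T %Y", tm);
    if (n1 == 0 || n2 == 0)
        return 0;

    return strcmp(via_c, via_explicit) == 0;
}
```

A program that never calls setlocale() starts in the C locale, so calling
this helper with any valid struct tm is enough to exercise the check.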

> Then I noticed, that for quite some time now, musl has been supporting
> .mo files, but no infrastructure is in place for them (i.e. no POT file
> nor any PO file is shipped). I tried searching around for POT or PO
> file, but I couldn't find any. So I added a handwritten POT file and a
> German PO file (I'm not proficient enough in any languages besides
> English and German to want to create that file for any other languages.
> And an English PO file would be kind of redundant.)
> 
> I filled the POT file with all the strings I could find, that would ever
> be plugged into __lctrans(). That gives me strerror(), strsignal(),
> gai_strerror(), hstrerror(), and __getopt_msg() strings.

Have you read the thread "Call for locales maintainer & contributors"
from when locale support launched? Here's a link to the start of it:

http://www.openwall.com/lists/musl/2014/07/24/14

It might have some useful ideas. The main one I'd like to point out is
the idea to develop and maintain locales as a separate repo outside of
the source repo. Unlike glibc, we don't have a lot of messages that
should be expected to change frequently, so I think the issues with
keeping sync are minimal, while there are several advantages:

- Not having translation progress stalled-by/tied-to code release
  cycles.

- Saving users who don't want locales from having to download them.

- No need to have locale patches go through me.

> Unfortunately this design is running into some problems: At the moment
> several strings are empty in the C locale (which is fine), but they
> could translate to something else in some other locale (nl_langinfo()'s
> ERA* and THOUSEP come to mind).

Yes. Those are unsupported right now, along with a lot of related
functionality. There's also no way to set the fields of localeconv(),
which come mostly (entirely, I think) from LC_MONETARY and LC_NUMERIC.
Depending on how we end up representing that data in the locale file,
it might make sense to use some sort of preprocessing script to
generate this part of the .po file, but I'd like to have the format
just be simple and natural to do in .po if that doesn't impose heavy
code or runtime overhead.
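
For reference, a small sketch of what an application sees for one of these
empty items through the standard interfaces. The helper name
grouping_data_is_empty is hypothetical; the only assumption is that the C
locale is in effect, where ISO C defines thousands_sep as the empty string.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if both views of the thousands
 * separator (localeconv() and nl_langinfo()) are the empty string in
 * the current locale. */
int grouping_data_is_empty(void)
{
    const struct lconv *lc = localeconv();

    return strcmp(lc->thousands_sep, "") == 0
        && strcmp(nl_langinfo(THOUSEP), "") == 0;
}
```

An empty string like this gives a msgid-keyed lookup nothing to key on,
which is the problem described in the quoted text.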

> Some strings in the C locale are the
> exact same and might translate to something else in some language (the
> long and short forms of "May" for instance). I think glibc solves that
> problem with another file format for libc's locales, which is a headache
> I don't want to think about this year anymore.

"May" is a good example. Yes, I've never much liked the gettext model
of keying by untranslated/English string, but for translation it's the
only one that's translator-friendly, and for musl it was the only
choice that saved us from having to develop a new file format and code
to handle it.

The easiest solution I've come up with is prefixing and doing
something like __lctrans("<prefix>string")+prefixlen, ideally with a
prefixlen of 1, e.g. __lctrans("\5May")+1. This would just add 1 byte
to each string in the built-in C locale data and one inc/add
instruction, not a significant cost. Do you have any other good ideas?

BTW "\5" was for "month 5" but I'm not sure there's any usefulness to
such a convention. All that's really needed is a way to identify it as
the abbreviated name or the full name.
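
Here is a rough sketch of the prefix idea, using a toy lookup in place of
__lctrans. Everything in it is made up for illustration: toy_lctrans, the
catalog contents, and the tag bytes \1 (abbreviated) and \2 (full). It also
assumes that each translated string carries a prefix of the same length, so
the +1 is valid whether or not a translation was found.

```c
#include <stddef.h>
#include <string.h>

/* Toy catalog entry: msgid and msgstr both carry a 1-byte tag. */
struct toy_entry {
    const char *msgid;
    const char *msgstr;
};

/* Illustrative data only: the two "May" strings now have distinct keys. */
static const struct toy_entry toy_catalog[] = {
    { "\1May", "\1touko" },     /* abbreviated month name */
    { "\2May", "\2toukokuu" },  /* full month name */
};

/* Stand-in for __lctrans: returns the translation if the catalog is
 * enabled and has the key, otherwise the untranslated input. */
const char *toy_lctrans(const char *msg, int use_catalog)
{
    size_t n = sizeof toy_catalog / sizeof toy_catalog[0];

    if (use_catalog) {
        for (size_t i = 0; i < n; i++) {
            if (strcmp(toy_catalog[i].msgid, msg) == 0)
                return toy_catalog[i].msgstr;
        }
    }
    return msg;
}

/* Callers skip the tag byte on whatever string comes back. */
const char *abbreviated_may(int use_catalog)
{
    return toy_lctrans("\1May", use_catalog) + 1;
}

const char *full_may(int use_catalog)
{
    return toy_lctrans("\2May", use_catalog) + 1;
}
```

With the catalog disabled both helpers return "May", matching the built-in C
locale data; with it enabled they can return different strings.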

> The second patch might be applicable to musl without the first by
> accident: I left the format for ERA_D_T_FMT untouched, so looking up the
> unchanged D_T_FMT would give that as a result. (Though I did translate
> ERA_D_T_FMT.)
> 
> To make it clear to people who don't like locale: I changed nothing
> about the build system. Locale MOs still aren't built and installed
> automatically. That wouldn't even be possible without mandating a
> default value for MUSL_LOCPATH.
> 
> However, I tried it out (installed it somewhere, put that location into
> MUSL_LOCPATH and called strftime("%c")) and that seems to work. If we
> got this far, then that means the file is found and can be mapped and
> the lookup works. So the only thing that might be broken now is bad
> translations or typos in the PO file. (Well, or anything unforeseen, as
> usual.)

Thanks for your work on this! I'm glad to see some interest in making
use of the locale support code. Indeed, I got it to work initially
with some dummy translation data, but didn't go far with it.

> Also, as usual, criticisms and comments are welcome.

One more below in the patch itself:

> [...]
> +
> +msgid "."
> +msgstr ","

musl explicitly does not support changing the radix point; there's an
old thread on this topic I can dig up if you'd like to read it. It
looks to me like nl_langinfo(RADIXCHAR) will return a replacement if
the locale file defines one, but then you get inconsistent results
since it won't be used (e.g. by printf or strtod). Probably
nl_langinfo should avoid passing the "." to __lctrans at all so that
this inconsistency can't arise. This would also allow us to support
"mon_decimal_point" (which would otherwise be a duplicate untranslated
string) if desired, I think.
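
To make the inconsistency concrete, here is a small sketch that compares
what nl_langinfo(RADIXCHAR) reports with the radix character snprintf
actually emits. The helper name radix_is_consistent is hypothetical, and
this only checks the printf side, not strtod.

```c
#include <langinfo.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the radix string reported by
 * nl_langinfo(RADIXCHAR) is the one snprintf uses when formatting 1.5
 * in the current locale, 0 otherwise. */
int radix_is_consistent(void)
{
    char buf[32];
    const char *radix = nl_langinfo(RADIXCHAR);
    size_t len = strlen(radix);

    /* Expected layout of buf: "1", then the radix string, then "5". */
    snprintf(buf, sizeof buf, "%.1f", 1.5);

    return len > 0
        && buf[0] == '1'
        && strncmp(buf + 1, radix, len) == 0
        && strcmp(buf + 1 + len, "5") == 0;
}
```

In the C locale this should return 1. With a locale file that translates
"." as in the patch, the expectation from the description above is that it
would return 0, since printf keeps using ".".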

Rich
