Message-ID: <20140722184932.GA4914@brightrain.aerifal.cx>
Date: Tue, 22 Jul 2014 14:49:32 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Locale bikeshed time

I've got the next phase of the locale work pretty much ready to
commit, but since it needs some policy for how to load locales, I want
to continue the discussion first rather than having commits that
change the behavior back and forth as we discuss this.

Overall, my plan at this point is to disallow any absolute/relative
pathnames in the LC_* vars and restrict them purely to locale names,
and have the path in a separate variable outside the scope of the
standard. This is basically how glibc does it, and the idea is that
you can allow locale names from an untrusted source (e.g. for suid,
for remote apps acting on behalf of a user such as web apps or
gitolite, or for apps that process mixed-locale data with uselocale
and have locale names in their data) as long as the locale path does
not contain malicious locales.
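
As a minimal sketch of the "locale names only" rule described above, a
check along the following lines would reject anything that could be
read as a pathname. The helper name locale_name_ok and the length
limit are assumptions made for this illustration, not existing musl
interfaces.

```c
#include <stddef.h>
#include <string.h>

/* Assumed upper bound on a locale name, for this sketch only. */
#define LOCALE_NAME_MAX 23

/* Hypothetical helper: returns 1 if `name` is a bare locale name,
 * 0 if it is empty, too long, or could be interpreted as a path. */
int locale_name_ok(const char *name)
{
    if (!name || !*name) return 0;                 /* missing or empty */
    if (name[0] == '.') return 0;                  /* ".", "..", dotfiles */
    if (strlen(name) > LOCALE_NAME_MAX) return 0;  /* too long */
    if (strchr(name, '/')) return 0;               /* any path separator */
    return 1;
}
```

With a check like this, a name such as "en_US.UTF-8" is accepted while
"/tmp/evil" or "../x" is refused before any file is opened, so only
files under the trusted locale path can ever be loaded.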

So, the first bikeshed decision to be made is what environment
variable to use for the locale path, and what the fallback should be if
it's not set. Glibc uses $LOCPATH. On the one hand it would be nice to
use the same var (since apps are already aware of the need to treat it
specially), but on the other it's undesirable to have them tied
together (e.g. if you're using musl as a non-root installation and
can't write to /usr/lib). Also, to avoid clashing with glibc's files we
would need to choose a subdirectory under $LOCPATH rather than using
it directly. All of these aspects make sharing the variable a lot less
attractive.
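
Whatever variable ends up being chosen, the lookup itself could be
sketched roughly as follows, assuming a colon-separated list of
directories. The function name locale_candidate is hypothetical. The
caller would pass in the value of the chosen environment variable, or
a built-in default when it is unset; both of those are left open here.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the idx-th candidate "<dir>/<name>"
 * from the colon-separated list `path` into buf. Empty entries are
 * skipped and not counted. Returns 0 on success, -1 if there is no
 * idx-th entry or the result does not fit in buf. */
int locale_candidate(const char *path, size_t idx, const char *name,
                     char *buf, size_t buflen)
{
    const char *p = path, *end;
    size_t len;

    for (;;) {
        end = strchr(p, ':');
        len = end ? (size_t)(end - p) : strlen(p);
        if (len) {
            if (idx == 0) {
                int r = snprintf(buf, buflen, "%.*s/%s",
                                 (int)len, p, name);
                return (r < 0 || (size_t)r >= buflen) ? -1 : 0;
            }
            idx--;
        }
        if (!end) return -1;   /* ran out of path entries */
        p = end + 1;
    }
}
```

A loader would call this with idx = 0, 1, 2, ... and try to open each
candidate until one succeeds or the function returns -1. Since the
name is assumed to have been validated already, every candidate stays
inside one of the listed directories.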

The second issue is how locale categories are split up. Glibc has each
category in a separate file, except for the "locale-archive" file
which stores everything in one file for easy mapping. My leaning so
far is to put the whole locale -- time formats and their translations,
message translations, ... -- in a single file. This avoids the need for
multiple mappings (and syscall overhead, and vma overhead, ...) if
you're using the same value for all categories. But on the other hand,
if you wanted to have lots of subtle variants of a locale, you might
end up with largely-duplicate files on disk. Fortunately I think
they'll all be very small anyway so this may not matter.

Of course making this work is contingent on finding a good way to
encode LC_MONETARY and LC_COLLATE data in a .mo file, since if the
whole locale is unified into one file, it would be a .mo file. My
leaning is to simply use "int_cur_symbol", etc. as gettext keys for
the string fields of LC_MONETARY and then put all the numeric fields
of lconv into a single string that could be parsed with scanf or a
tiny integer parser in localeconv() on the first usage. While not the
most efficient, it avoids needing nasty special tools to generate
locale files; a po-to-mo converter is all you need. For LC_COLLATE,
obviously one solution would be to have keys for each collation
element and use gettext to convert collation elements to the symbols
strxfrm is supposed to output. I'm not sure if the efficiency of this
method is tolerable however. We could go with it for now and later add
something more advanced if needed (e.g. mapping to a DFA represented
as a byte array that does the conversions).
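
To illustrate the "tiny integer parser" idea for the numeric lconv
fields, here is a rough sketch. It assumes the translated string holds
eight space-separated decimal values in a fixed order; that order, and
the names parse_lconv_nums and fill_monetary, are made up for this
example and are not a proposed file format.

```c
#include <limits.h>
#include <locale.h>
#include <stddef.h>

/* Parses exactly `want` space-separated non-negative decimal integers
 * from s into out. Values above CHAR_MAX are clamped to CHAR_MAX,
 * which lconv uses to mean "not available".
 * Returns 0 on success, -1 on malformed input or a wrong count. */
int parse_lconv_nums(const char *s, char *out, size_t want)
{
    size_t n = 0;
    for (;;) {
        while (*s == ' ') s++;
        if (!*s) break;
        if (*s < '0' || *s > '9' || n == want) return -1;
        int v = 0;
        while (*s >= '0' && *s <= '9') {
            v = v * 10 + (*s++ - '0');
            if (v > CHAR_MAX) v = CHAR_MAX;  /* clamp, avoids overflow */
        }
        out[n++] = (char)v;
    }
    return n == want ? 0 : -1;
}

/* Assumed field order for this sketch only. Fields are assigned by
 * name because the layout of struct lconv is implementation-defined. */
int fill_monetary(struct lconv *lc, const char *s)
{
    char v[8];
    if (parse_lconv_nums(s, v, 8)) return -1;
    lc->int_frac_digits = v[0];
    lc->frac_digits     = v[1];
    lc->p_cs_precedes   = v[2];
    lc->p_sep_by_space  = v[3];
    lc->n_cs_precedes   = v[4];
    lc->n_sep_by_space  = v[5];
    lc->p_sign_posn     = v[6];
    lc->n_sign_posn     = v[7];
    return 0;
}
```

In this scheme the string would be looked up once under some key in
the .mo file and parsed on the first localeconv() call. A plain
po-to-mo converter would then be enough to produce it.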

I probably have some more issues to discuss with this too but I'll
just go ahead and send now to get discussion started, and hopefully
get back to adding some more code first.

Rich
