musl - Re: validation of utf-8 strings passed as system call arguments


Message-ID: <20131213064923.GC24286@brightrain.aerifal.cx>
Date: Fri, 13 Dec 2013 01:49:23 -0500
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: validation of utf-8 strings passed as system call
 arguments

On Fri, Dec 13, 2013 at 07:36:51AM +0100, Szabolcs Nagy wrote:
> * Rich Felker <dalias@...ifal.cx> [2013-12-12 23:39:41 -0500]:
> > that filenames can contain arbitrary byte sequences. And Linus in
> > particular is opposed to changing this, though there's been some
> > indication (I don't have references right off) that he might be open
> > to optional restrictions at the kernel level.
> 
> he didn't look very persuadable some time ago
> http://yarchive.net/comp/linux/utf8.html

Yes, that was a long time ago though. I forget where I saw an
indication that this could change (perhaps the Austin Group list? in
the thread about newlines...) but the general idea, if I recall, was
that restrictions would take place in the framework of a generic layer
for restricting malicious content in filenames that's not UTF-8
specific.

> (i actually like the kernel that way: what would you do when
> mounting a filesystem with invalid filenames? would you also
> reject surrogate pairs, pua codes or do unicode normalization?)

"Surrogate pairs" aren't even a question; surrogates aren't encodable
at all in UTF-8. So they would automatically be gone just by mandating
well-formed UTF-8.
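
A minimal sketch of this point, assuming Python's strict UTF-8 decoder
as a stand-in for a well-formedness check. The helper name
`is_well_formed_utf8` and the sample byte strings are made up for
illustration; nothing here is a kernel or musl interface.

```python
def is_well_formed_utf8(name: bytes) -> bool:
    """Return True if `name` decodes as strict, well-formed UTF-8.

    Hypothetical helper: a stand-in for whatever check a filename
    filter might apply.
    """
    try:
        name.decode("utf-8")  # error handling defaults to "strict"
    except UnicodeDecodeError:
        return False
    return True


# U+D800 (a surrogate) pushed through the 3-byte bit pattern.
# Well-formed UTF-8 has no encoding for it, so this is rejected.
surrogate_bytes = b"\xed\xa0\x80"

# Over-long 2-byte form of "/" (0x2F); also rejected.
overlong_slash = b"\xc0\xaf"

ordinary_name = "na\u00efve.txt".encode("utf-8")

print(is_well_formed_utf8(ordinary_name))
print(is_well_formed_utf8(surrogate_bytes))
print(is_well_formed_utf8(overlong_slash))
```

The same check that rejects the surrogate also rejects the over-long
form, so no surrogate-specific rule is needed.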

Normalization (which Apple does) is absolutely wrong and
non-conforming to POSIX; it causes multiple distinct names to refer to
the same file (despite having a link count of 1, BTW), which is just
as dangerous as issues like "over-long sequence" decoding and
URL-escaped dots and slashes. The only "correct" way to do
normalization at the FS level is disallowing non-normalized filenames.
But normalization is actually just broken and harmful anyway, since
there are languages for which bugs in Unicode have made the normalized
form contrary to the actual semantic ordering of characters in the
language (characters were incorrectly assigned combining classes such
that letters reorder contrary to their actual semantic order, and due
to stability policy this can't be fixed, so the only solution is to
forget about using normalization).
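
To make the aliasing concrete, here is a small sketch using Python's
`unicodedata` module. The names `nfc_name`, `nfd_name` and `is_nfc` are
hypothetical, and NFC is an arbitrary choice of form for the example.
It only shows that two distinct byte strings collapse to one name under
normalization, and what a "reject instead of rewrite" policy could look
like.

```python
import unicodedata

# Two spellings of the same visible name, as distinct code point sequences.
nfc_name = "caf\u00e9"    # precomposed e-acute (U+00E9)
nfd_name = "cafe\u0301"   # "e" followed by combining acute (U+0301)

# As byte strings (what the filesystem sees) they differ...
assert nfc_name.encode("utf-8") != nfd_name.encode("utf-8")

# ...but a normalizing filesystem would map both to one file.
assert unicodedata.normalize("NFC", nfd_name) == nfc_name


def is_nfc(name: str) -> bool:
    """Hypothetical policy check: accept only names already in NFC.

    This rejects non-normalized names rather than silently rewriting
    them, so no two distinct accepted names alias each other.
    """
    return unicodedata.normalize("NFC", name) == name


print(is_nfc(nfc_name))
print(is_nfc(nfd_name))
```

Such a rejecting check still inherits the combining-class ordering
problem described above; it only avoids the aliasing.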

As for PUA, it wouldn't be forbidden by enforcing UTF-8. Per the
definition, a "UTF" is a bijective mapping between the Unicode scalar
values (0 through 0xD7FF and 0xE000 through 0x10FFFF) and legal
sequences of code units. Whether a character identity is assigned to a
scalar value is irrelevant to UTFs.
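
A short sketch of the scalar-value range and of a PUA code point
round-tripping through UTF-8. `is_scalar_value` is a hypothetical
helper, and Python's built-in codec is assumed as the reference
encoder.

```python
def is_scalar_value(cp: int) -> bool:
    """True for Unicode scalar values: all code points except surrogates."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


# U+E000 is the first Private Use Area code point. It has no assigned
# character identity, but it is a scalar value, so UTF-8 encodes it.
pua = "\ue000"
encoded = pua.encode("utf-8")

assert is_scalar_value(ord(pua))
assert encoded.decode("utf-8") == pua   # round-trips exactly

# Surrogates and anything past U+10FFFF are outside the domain of a UTF.
assert not is_scalar_value(0xD800)
assert not is_scalar_value(0xDFFF)
assert not is_scalar_value(0x110000)

print(encoded.hex())
```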

Rich
