musl - Re: [PATCH v2] IDNA support in name lookups

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170423150747.GN17319@brightrain.aerifal.cx>
Date: Sun, 23 Apr 2017 11:07:47 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH v2] IDNA support in name lookups

On Sun, Apr 23, 2017 at 10:14:24AM +0200, Joakim Sindholt wrote:
> > > There might still be bugs and character ranges that need to be rejected.
> > 
> > As far as I can tell, no normalization is done. This might be
> > problematic for strings where the natural way users would type it does
> > not match the normalized form required in IDN's, but it would also be
> > expensive to handle. I think it's okay to punt on this until it proves
> > to actually be a problem.
> 
> It does ASCII normalization (tolower) but not unicode nomalization. I
> did a quick search for it and came up with RFC5892[1], which specifies a
> rather intricate set of characters to be allowed.
> Maybe I'll go through it one day but it will most certainly be in
> another patch.

AFAIK, the allowed set is not relevant for us to check; it's up to
domain registrars to enforce it, and there's no need to enforce it on
subdomains if the owner of the higher-level domain doesn't care to.

On the other hand, normalization is about transformation to NFKC (as
well as case folding, IIRC), and is relevant if users might enter
domain names in a decomposed form, or (and this is nasty) for content
that has no composed form, in a form where the order of combining
characters is different that what NF[K]C specifies due to bugs in
Unicode's assignments of combining classes. For example if the normal
way to write X composed of base B and combining C1 and C2 is B C1 C2,
but combining classes are wrongly assigned such that NFC is B C2 C1,
the name entered in the way it's normally written will fail to
resolve.

I'm not sure if there are real-world cases where this happens (yet) so
we can probably just wait and see if it does...

> > > @@ -61,12 +230,25 @@ static int name_from_hosts(struct address buf[static MAXADDRS], char canon[stati
> > >  		return EAI_SYSTEM;
> > >  	}
> > >  	while (fgets(line, sizeof line, f) && cnt < MAXADDRS) {
> > > -		char *p, *z;
> > > +		char idna[256];
> > > +		ssize_t r;
> > > +		char *p, *z, c;
> > >  
> > >  		if ((p=strchr(line, '#'))) *p++='\n', *p=0;
> > > -		for(p=line+1; (p=strstr(p, name)) &&
> > > -			(!isspace(p[-1]) || !isspace(p[l])); p++);
> > > -		if (!p) continue;
> > > +		/* skip ip address and canonicalize names */
> > > +		for (p=line; *p && !isspace(*p); p++);
> > > +		while (*p) {
> > > +			for (; *p && isspace(*p); p++);
> > > +			for (z=p; *z && !isspace(*z); z++);
> > > +			c = *z;
> > > +			*z = 0;
> > > +			r = idnaenc(idna, p);
> > > +			*z = c;
> > > +			if (r == l && memcmp(idna, name, l) == 0)
> > > +				break;
> > > +			p = z;
> > > +		}
> > > +		if (!*p) continue;
> > >  
> > >  		/* Isolate IP address to parse */
> > >  		for (p=line; *p && !isspace(*p); p++);
> > > @@ -86,7 +268,7 @@ static int name_from_hosts(struct address buf[static MAXADDRS], char canon[stati
> > >  		for (; *p && isspace(*p); p++);
> > >  		for (z=p; *z && !isspace(*z); z++);
> > >  		*z = 0;
> > > -		if (is_valid_hostname(p)) memcpy(canon, p, z-p+1);
> > > +		if ((r = idnaenc(idna, p)) > 0) memcpy(canon, idna, r);
> > >  	}
> > >  	__fclose_ca(f);
> > >  	return cnt ? cnt : badfam;
> > > @@ -285,15 +467,17 @@ static int addrcmp(const void *_a, const void *_b)
> > 
> > Is there any reason this needs to be done, or should be done, for
> > lookups from the hosts file? IDN/punycode is a hack for transporting
> > unicode names on top of DNS protocol. For hosts file you can just put
> > the proper unicode strings directly in the file.
> 
> My logic was that some people might have/want to have punycode in their
> hosts file, and some might even (accidentally or otherwise) have mixed
> punycode-unicode names written down. In any case I wanted it to Just
> Work™ so decoding the host from punycode before comparing seemed to be
> the easiest way to ensure it catches everything.
> This was prompted by a paper wedding invitaion I received where the
> couple had listed their gift registry in punycode form. This says to me
> that people just dont know or care about this, and since the hosts file
> is used extensively by non-developers as well I would personally prefer
> if this worked regardless of what deranged things people might put in
> there.
> In fact I could imagine a lot of people shoving the punycode form in
> there under the assumption that that would work better.
> Also, suppose you have callers from all over the system dialing out to a
> server but some of them call xn--foo-bar.com and others dial the unicode
> version. Is it really reasonable that you should need to list this
> domain twice in the hosts file for it to work?

If you want the punycode to resolve to the unicode name in hosts file,
I would do the opposite: decode it to proper utf-8 and match that.
Nobody should have to deal with wacky punycode encodings to make
non-latin hostnames work. They should just work transparently and
punycode should be an implementation detail inside layers that
users/programmers don't see.

If I'm not mistaken, your patch as-is actually _breaks_ support for
literal utf-8 hostnames in the hosts file.

> > >  int __lookup_name(struct address buf[static MAXADDRS], char canon[static 256], const char *name, int family, int flags)
> > >  {
> > > +	char _name[256];
> > >  	int cnt = 0, i, j;
> > >  
> > >  	*canon = 0;
> > >  	if (name) {
> > > -		/* reject empty name and check len so it fits into temp bufs */
> > > -		size_t l = strnlen(name, 255);
> > > -		if (l-1 >= 254)
> > > +		/* convert unicode name to RFC3492 punycode */
> > > +		ssize_t l;
> > > +		if ((l = idnaenc(_name, name)) <= 0)
> > >  			return EAI_NONAME;
> > > -		memcpy(canon, name, l+1);
> > > +		memcpy(canon, _name, l+1);
> > > +		name = _name;
> > >  	}
> > 
> > If it's not needed for hosts backend, this code probably belongs
> > localized to the dns lookup, rather than at the top of __lookup_name.
> > 
> > BTW there's perhaps also a need for the opposite-direction
> > translation, both for ai_canonname (when a CNAME points to IDN) and
> > for getnameinfo reverse lookups. But that can be added as a second
> > patch I think.
> 
> I have already written the code for decoding as well if need be :)

Yay!

> The only problem as I see it is that a unicode name can be a hair under
> 4 times larger (in bytes) than the punycode equivalent. Select any 4
> byte UTF-8 character and make labels exclusively containing that. All
> subsequent characters to the first will be encoded as an 'a'.
> 
> This, by the way, also means that we should probably mess with the
> buffering when reading the hosts file.

You mean increase it? Since HOST_NAME_MAX/NI_MAXHOST is 255 and this
is ABI, I don't think we can return names longer than 255 bytes. Such
names probably just need to be left as punycode when coming from dns,
and not supported in hosts files.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.