Message-ID: <20170423165614.GP17319@brightrain.aerifal.cx>
Date: Sun, 23 Apr 2017 12:56:14 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH v2] IDNA support in name lookups

On Sun, Apr 23, 2017 at 06:38:24PM +0200, Joakim Sindholt wrote:
> > If I'm not mistaken, your patch as-is actually _breaks_ support for
> > literal utf-8 hostnames in the hosts file.
>
> No, it supports all combinations. It encodes every hostname found to the
> punycode equivalent and compares only punycode. The encoding is done on
> a per-label basis, so for ønskeskyen.dk, "xn--nskeskyen-k8a" and "dk" are
> treated entirely separately.
> Suppose they had a subdomain, ønsker.ønskeskyen.dk, then this approach
> would work even if someone put "ønsker.xn--nskeskyen-k8a.dk" in their
> hosts file, even if you looked it up as xn--nsker-uua.ønskeskyen.dk.

OK, that makes more sense and it's non-harmful.

> There are a couple of reasons that I chose to do it this way, with the
> primary being that it meant I only had to include the encoding function
> and the secondary being that the encoded name will always fit in 254
> characters, whereas the decoded equivalent might be larger (see below).
>
> I don't have any ideological opposition to doing the comparison entirely
> in unicode though. This was just the more practical solution for the
> time being.

It's probably okay the way you're doing it, then. I failed to understand
that it accepts either. The main other consideration might be whether
it's noticeably slow on large hosts files, but I don't think the existing
code was terribly fast anyway.
> > > > > int __lookup_name(struct address buf[static MAXADDRS], char canon[static 256], const char *name, int family, int flags)
> > > > > {
> > > > > +	char _name[256];
> > > > > 	int cnt = 0, i, j;
> > > > >
> > > > > 	*canon = 0;
> > > > > 	if (name) {
> > > > > -		/* reject empty name and check len so it fits into temp bufs */
> > > > > -		size_t l = strnlen(name, 255);
> > > > > -		if (l-1 >= 254)
> > > > > +		/* convert unicode name to RFC3492 punycode */
> > > > > +		ssize_t l;
> > > > > +		if ((l = idnaenc(_name, name)) <= 0)
> > > > > 			return EAI_NONAME;
> > > > > -		memcpy(canon, name, l+1);
> > > > > +		memcpy(canon, _name, l+1);
> > > > > +		name = _name;
> > > > > 	}
> > > >
> > > > If it's not needed for hosts backend, this code probably belongs
> > > > localized to the dns lookup, rather than at the top of __lookup_name.
> > > >
> > > > BTW there's perhaps also a need for the opposite-direction
> > > > translation, both for ai_canonname (when a CNAME points to IDN) and
> > > > for getnameinfo reverse lookups. But that can be added as a second
> > > > patch I think.
> > >
> > > I have already written the code for decoding as well if need be :)
> >
> > Yay!
> >
> > > The only problem as I see it is that a unicode name can be a hair under
> > > 4 times larger (in bytes) than the punycode equivalent. Select any 4
> > > byte UTF-8 character and make labels exclusively containing that. All
> > > subsequent characters to the first will be encoded as an 'a'.
> > >
> > > This, by the way, also means that we should probably mess with the
> > > buffering when reading the hosts file.
> >
> > You mean increase it? Since HOST_NAME_MAX/NI_MAXHOST is 255 and this
> > is ABI, I don't think we can return names longer than 255 bytes. Such
> > names probably just need to be left as punycode when coming from dns,
> > and not supported in hosts files.
>
> I'm not sure that's the correct interpretation.
> For example when doing
> HTTP requests the browser will fill in the HTTP Host header with the
> punycode name. I think the intent is to do everything with the punycoded
> name and ditch utf8 permanently :(

That's ugly and backwards but fits with the header encoding being
ASCII. However it's not universal. In the most important other place
domain names are used, in certificates, they're proper unicode;
punycode is not used there.

> What I meant here though was when reading from /etc/hosts the line
> buffer used is 512 bytes which isn't necessarily enough for one full
> domain in utf8. But it is enough assuming we honor HOST_NAME_MAX.
> Take any cuneiform character of your choice, repeat it
> 56 times and you get a label of 63 chars that looks something like
> xn--4c4daaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> but has a utf8 size of 224 bytes. You can have almost 4 of those in an
> otherwise completely valid FQDN. The last one would have to lose 2 'a's
> (8 bytes) making the total 63+1+63+1+63+1+61+1=254, decoding to a total
> of 224+1+224+1+224+1+216+1=892 bytes.
>
> Now I will freely admit that I don't know for sure if this is legal but
> I believe that it is, and that the maximum length of 254 applies solely
> to the encoded name.
> I can at least say that dig will happily accept that massive domain
> name and complain on a domain with one more cuneiform in the last label.
> It will also complain if any one label exceeds 63 bytes post-encoding.
>
> For the record, glibc has #define NI_MAXHOST 1025, if it matters at all.

Yeah, so far I've considered that a mistake. For non-DNS names it's
rather up to the implementation what they support, but I don't think
there's significant merit in supporting gratuitously/maliciously long
hostnames.

Rich