Message-ID: <20230313163558.GP4163@brightrain.aerifal.cx>
Date: Mon, 13 Mar 2023 12:35:58 -0400
From: Rich Felker <dalias@...c.org>
To: Briner Cédric (DI) <Cedric.Briner@...t.ge.ch>
Cc: "musl@...ts.openwall.com" <musl@...ts.openwall.com>
Subject: Re: resolv.conf and ndots:5

On Mon, Mar 13, 2023 at 03:30:44PM +0000, Briner Cédric (DI) wrote:
> Hi,
>
> We do run kubernetes. And some of the applications are shipped with
> the musl libc library brought by alpine images. We have discovered a
> bad behaviour on how the system "nss" is interpreting our
> /etc/resolv.conf

I followed up as you were leaving IRC:

<dalias> cedric-briner, ok let me go thru what seems to be happening...
<dalias> because of ndots:5, ge.ch is first looked up as ge.ch.ceti.etat-ge.ch, yes?
<dalias> i just performed those queries and it should work fine, provided your nameserver is giving truthful answers
<dalias> if you're in a kubernetes environment, maybe there's some (coredns or otherwise?) nameserver in between that's messing things up and replying NODATA in place of NxDomain
<dalias> i.e. saying "ge.ch.ceti.etat-ge.ch exists but doesn't have an address" instead of "ge.ch.ceti.etat-ge.ch does not exist"
<dalias> that would lead to the behavior you're seeing

And indeed based on your strace that's what's happening:

> cat /etc/resolv.conf
> # search app-5580-capitastra-rec-01.svc.cluster.local svc.cluster.local cluster.local ceti.etat-ge.ch
> # nameserver 10.177.0.10
> # options ndots:5
>
> For checking the behaviour of libc, we have used the command "getent hosts". Below are the commands that we have tested and their results:
>
> getent hosts ge.ch
> # <no response>
>
> getent hosts ge.ch.
> # 160.53.144.68 ge.ch ge.ch ge.ch.
>
> As you can see, the fully qualified ge.ch. (with the trailing dot) is resolving. And the ge.ch (without the trailing dot) is not resolving.
>
> We have done some tests on the following domains to check further, and you can see on the right whether each one succeeded or not
> - ge.ch.       success
> - ge.ch        failed
> - etat-ge.ch.  success
> - etat-ge.ch   failed
> - ge.fr        success
> - tsr.ch       success
>
> We have looked at the calls with the help of strace for these different commands. The ones that failed show this pattern in the strace
> - try to solve ge.ch.app-5580-capitastra-rec-01.svc.cluster.local
> - try to solve ge.ch.svc.cluster.local
> - try to solve ge.ch.cluster.local
> - try to solve ge.ch.ceti.etat-ge.ch
> - Do not try to solve ge.ch
>
> And the ones that work well have the same trace, but they also try to solve the bare name last, and it resolves.
> - try to solve ge.fr.app-5580-capitastra-rec-01.svc.cluster.local
> - try to solve ge.fr.svc.cluster.local
> - try to solve ge.fr.cluster.local
> - try to solve ge.fr.ceti.etat-ge.ch
> - try to solve ge.fr -> success
>
> A bit puzzled by this, we imagined that the problem came from the
> libc and the fact that the last search domain "ceti.etat-ge.ch"
> ends with the domain (ge.ch) we want to resolve.
>
> That seems a bit strange.. but.. we saw the same when trying to
> solve "etat-ge.ch"
>
> Furthermore, we have discovered that "getent hosts ge.ch" is working
> well if we change the line to "options ndots:1" in the resolv.conf
>
> Below the signature are the captures made with strace. The first is
> the sparse one with the elements you should look at; the second one
> (full) shows all the calls
>
> Regards,
> Cédric BRINER
>
>
> # Strace capture
> ## sparse and commented
> ### ge.ch
>
> As said, the ge.ch doesn't resolve, and this is shown in the strace of `getent hosts ge.ch`
>
> ```
> ..
> ge.ch.app-5580-capitastra-rec-01.svc.cluster.local
> [pid 1228780] recvfrom(3, "\273\226\205\3\0\1\0\0\0\1\0\1\2ge\2ch\32app-5580-capi"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.177.0.10")}, [16]) = 172
> ..
> ge.ch.app-5580-capitastra-rec-01.svc.cluster.local
> [pid 1228780] recvfrom(3, "\377\16\205\3\0\1\0\0\0\1\0\1\2ge\2ch\3svc\7cluster\5l"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.177.0.10")}, [16]) = 145
> ..
> ge.ch.app-5580-capitastra-rec-01.svc.cluster.local
> [pid 1228780] recvfrom(3, "\310\v\205\3\0\1\0\0\0\1\0\1\2ge\2ch\7cluster\5local"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.177.0.10")}, [16]) = 141
> ..
> ge.ch.app-5580-capitastra-rec-01.svc.cluster.local
> [pid 1228780] sendto(3, "\2567\1\0\0\1\0\0\0\0\0\0\2ge\2ch\4ceti\7etat-ge\2"..., 39, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.177.0.10")}, 16) = 39
> ..
> ge.ch.app-5580-capitastra-rec-01.svc.cluster.local
> [pid 1228780] recvfrom(3, "\2567\205\200\0\1\0\0\0\1\0\1\2ge\2ch\4ceti\7etat-ge\2"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.177.0.10")}, [16]) = 140
                                  ^^^^^^^^

The low 4 bits of the underlined part of the response packet form
RCODE=0, success. This is 10.177.0.10 saying the queried name does
exist and just does not have an A record. (It might have AAAA records,
MX records, whatever -- we can't know.)

Since normal recursive nameservers on the public internet correctly
report RCODE=3 (NxDomain) for ge.ch.ceti.etat-ge.ch, something in your
k8s cluster must be rewriting the answer to give an incorrect result.

glibc (like most traditional stub resolvers) handles this case
sloppily and just treats NODATA and NxDomain the same, continuing
search. This makes the results potentially unstable depending on
whether the caller requested both v4 and v6 results or just one or the
other, and semantically mismatches what's in DNS. musl handles this
very intentionally with the aim of delivering only consistent results.

Rich
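
The RCODE reading above can be checked by hand. What follows is a minimal Python sketch, assuming only the fixed 12-byte DNS header layout from RFC 1035 and the packet prefixes exactly as strace printed them. `parse_header` and `classify` are hypothetical helper names, and "NODATA" is taken here to mean RCODE=0 with an empty answer section.

```python
import struct


def parse_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte DNS header at the start of a packet."""
    # Six big-endian 16-bit fields: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!6H", packet[:12]
    )
    return {
        "id": ident,
        "qr": flags >> 15,        # 1 means this packet is a response
        "rcode": flags & 0x000F,  # low 4 bits of the flags field
        "ancount": ancount,       # number of records in the answer section
    }


def classify(packet: bytes) -> str:
    """Label a response (hypothetical helper, for illustration only)."""
    header = parse_header(packet)
    if header["rcode"] == 3:
        return "NXDOMAIN"  # the queried name does not exist
    if header["rcode"] == 0 and header["ancount"] == 0:
        return "NODATA"    # the name exists but has no record of this type
    if header["rcode"] == 0:
        return "ANSWER"
    return "OTHER"


# Header bytes copied from the strace output (octal escapes as printed).
FIRST_SEARCH_REPLY = b"\273\226\205\3\0\1\0\0\0\1\0\1"  # ge.ch.app-5580-...
CETI_REPLY = b"\2567\205\200\0\1\0\0\0\1\0\1"           # ge.ch.ceti.etat-ge.ch

if __name__ == "__main__":
    for label, packet in (
        ("first search domain", FIRST_SEARCH_REPLY),
        ("ceti.etat-ge.ch", CETI_REPLY),
    ):
        header = parse_header(packet)
        print(label, "rcode =", header["rcode"], "->", classify(packet))
```

Run against the two prefixes, the reply for the first search domain decodes to RCODE=3 (NxDomain) and the reply for ge.ch.ceti.etat-ge.ch decodes to RCODE=0 with no answers, the NODATA case discussed above.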