Message-ID: <20220214182952.GI7074@brightrain.aerifal.cx>
Date: Mon, 14 Feb 2022 13:29:52 -0500
From: Rich Felker <dalias@...ifal.cx>
To: Satadru Pramanik <satadru@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: Re: musl getaddr info breakage on older kernels

On Mon, Feb 14, 2022 at 12:24:30PM -0500, Satadru Pramanik wrote:
> After multiple rebuilds of various versions, I'm hitting a wall. There is
> no traffic when built against newer versions of musl, except when run under
> strace. Strace (or ltrace) fixes the problem every single time. The program
> also doesn't create any network traffic when run in gdb either.
>
> (Apparently, I'm not the first to discover the "medicinal effects" of
> strace though, as per this abomination:
> https://github.com/strace/strace/issues/14 )
>
> I've even tried building and running this in a docker container inside a
> VirtualBox VM running CentOS 7 so as to get the 3.10 kernel involved, and
> that works too!
>
> [image: Screenshot 2022-02-14 at 12.22.33.png]
> The screenshot is from a run on the actual hardware, in a crosh window. It
> is not in a VM.
>
> Is there anything else I should try?

Are you sure the "running under strace" has no differences in how the
test program is invoked aside from just using strace? Rather than
running it via the ruby machinery, can you just test under plain
manual execution from the same shell instance for both?

Can you report what Docker version you're using, and try executing
with Docker's seccomp sandboxing disabled? This shouldn't happen, but
it's plausible that your old kernel has bugs where seccomp filtering
gets bypassed when the process is running under strace, thereby
working around a buggy seccomp filter in Docker.

If you know how to use gdb, you could also try setting some
breakpoints to see what code is or isn't reached.

> On Mon, Feb 7, 2022 at 4:02 PM Rich Felker <dalias@...ifal.cx> wrote:
>
> > On Mon, Feb 07, 2022 at 02:19:05PM -0500, Satadru Pramanik wrote:
> > > The test programs are being run from...
> > > glibc 2.23 -> bash (crosh shell)
> > > crosh shell -> invokes ruby -> invokes bash to run the test programs.
> > >
> > > tcpdump on the router shows no network activity at all when running
> > > the test program with tcpdump -i any -vvv host (IP address)
> >
> > There's reliably no network traffic when you run the test program not
> > under strace? Is there any difference in how you're invoking it other
> > than strace not being there? I'm running out of possible explanations
> > unless there's some hidden details we don't know about.
> >
> > > When I run the test program with strace though I see this:
> > > 14:06:24.617860 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 56)
> > >     192.168.0.121.46846 > office.lan.53: [udp sum ok] 16051+ A? google.com. (28)
> > > 14:06:24.622352 IP (tos 0x0, ttl 64, id 15884, offset 0, flags [DF], proto UDP (17), length 72)
> > >     office.lan.53 > 192.168.0.121.46846: [bad udp cksum 0x8210 -> 0x7bc1!] 16051 q: A? google.com. 1/0/0 google.com. [1m32s] A 142.251.40.110 (44)
> > > 14:06:24.688610 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 56)
> > >     192.168.0.121.42267 > office.lan.53: [udp sum ok] 35406+ A? google.com. (28)
> > > 14:06:24.688931 IP (tos 0x0, ttl 64, id 15887, offset 0, flags [DF], proto UDP (17), length 72)
> > >     office.lan.53 > 192.168.0.121.42267: [bad udp cksum 0x8210 -> 0x4209!] 35406 q: A? google.com. 1/0/0 google.com. [1m32s] A 142.251.40.110 (44)
> > > 14:06:24.689018 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 56)
> > >     192.168.0.121.42267 > office.lan.53: [udp sum ok] 13657+ AAAA? google.com. (28)
> > > 14:06:24.689186 IP (tos 0x0, ttl 64, id 15888, offset 0, flags [DF], proto UDP (17), length 84)
> > >     office.lan.53 > 192.168.0.121.42267: [bad udp cksum 0x821c -> 0xc77e!] 13657 q: AAAA? google.com. 1/0/0 google.com. [20s] AAAA 2607:f8b0:4006:80b::200e (56)
> > >
> > > On Sun, Feb 6, 2022 at 9:40 PM Rich Felker <dalias@...ifal.cx> wrote:
> > >
> > > > On Sun, Feb 06, 2022 at 08:29:16PM -0500, Satadru Pramanik wrote:
> > > > > Here are illustrative logs of output and strace logs.
> > > > >
> > > > > Note that while the musl toolchain is built in a container on a much more
> > > > > powerful machine, this "musl_getaddrinfo_test" app is built locally on the
> > > > > machine with the 3.8 kernel.
> > > > >
> > > > > I ran the following to get the output on the smaller i686 machine
> > > > > immediately after the app is built.
> > > > > Apologies for the ruby code wrapping the shell commands.
> > > > >
> > > > > @musl_ver = `#{CREW_MUSL_PREFIX}/lib/libc.so 2>&1 >/dev/null | head -2 | tail -1 | awk '{print $2}'`.chomp
> > > > > puts 'Testing the musl resolver to see if it can resolve google.com: '.lightblue
> > > > > system "./musl_getaddrinfo_test google.com set_ai_family 2>&1 |tee -a /tmp/musl_#{@...l_ver}_getaddrinfo_test_google.com_set_ai_family.txt"
> > > > > system "./musl_getaddrinfo_test google.com 2>&1 |tee -a /tmp/musl_#{@...l_ver}_getaddrinfo_test_google.com.txt"
> > > > > system "strace -o /tmp/musl_#{@...l_ver}_getaddrinfo_test_google.com_set_ai_family_STRACE.txt ../musl_getaddrinfo_test google.com set_ai_family"
> > > > > system "strace -o /tmp/musl_#{@...l_ver}_getaddrinfo_test_google.com_STRACE.txt ../musl_getaddrinfo_test google.com"
> > > > >
> > > > > And here is the output for each run before running again via strace. Note
> > > > > how IPv6 addresses show up sporadically, and for 1.2.2 nothing at all shows
> > > > > up, but everything works fine according to the strace logs. (Strace is
> > > > > built against glibc 2.23.)
> > > > >
> > > > > ==> musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > > > AF_INET: 142.251.40.110
> > > > >
> > > > > ==> musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com.txt <==
> > > > > AF_INET: 142.251.40.110
> > > > >
> > > > > ==> musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > > > AF_INET: 142.251.40.142
> > > > >
> > > > > ==> musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com.txt <==
> > > > > getaddrinfo: Try again
> > > > >
> > > > > ==> musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > > > AF_INET: 142.251.40.206
> > > > >
> > > > > ==> musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com.txt <==
> > > > > AF_INET6: 2607:f8b0:4006:81f::200e
> > > > > AF_INET: 142.251.40.206
> > > > >
> > > > > ==> musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > > > AF_INET: 142.250.65.206
> > > > >
> > > > > ==> musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com.txt <==
> > > > > AF_INET: 142.250.65.206
> > > > >
> > > > > ==> musl_1.2.1_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > > > AF_INET: 142.251.40.110
> > > > >
> > > > > ==> musl_1.2.1_getaddrinfo_test_google.com.txt <==
> > > > > getaddrinfo: Try again
> > > > >
> > > > > ==> musl_1.2.2_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > > > getaddrinfo: Try again
> > > > >
> > > > > ==> musl_1.2.2_getaddrinfo_test_google.com.txt <==
> > > > > getaddrinfo: Try again
> > > > >
> > > > > Regards,
> > > >
> > > > OK, I don't see anything in the strace suggesting a cause. The kernel
> > > > version (or whether a container was used) present on the system where
> > > > you built musl or the test programs should make no difference
> > > > whatsoever; musl has no build dependencies on the host kernel or
> > > > kernel headers or anything like that (and doesn't even need to be
> > > > built on a Linux host).
> > > >
> > > > A couple questions:
> > > >
> > > > Are the test programs on the i686 machine running under Docker or any
> > > > other container environment?
> > > >
> > > > Can you tcpdump the traffic between the test program and the dnsmasq
> > > > during a failing run, with verbose display of the packet contents
> > > > (-vvv or something like that)?
> > > >
> > > > I don't see any plausible explanation for the result varying between
> > > > runs and with timing like this unless dnsmasq is doing something
> > > > odd/wrong. I thought it might be related to something blocking time64
> > > > syscalls but that doesn't seem to be the case -- according to the
> > > > strace logs they're getting ENOSYS as expected with fallback to the
> > > > legacy 32-bit clock_gettime etc. which is fine.