Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAFrh3J-tsT=uftRao_0CjKg=7LhCjrxanEDBvB6LEPkK+bW8vA@mail.gmail.com>
Date: Thu, 17 Feb 2022 08:17:36 -0500
From: Satadru Pramanik <satadru@...il.com>
To: Rich Felker <dalias@...ifal.cx>
Cc: musl@...ts.openwall.com
Subject: Re: Re: musl getaddr info breakage on older kernels

Looks like reverting that commit works, but interestingly, only
stochastically.

chronos@...alhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
getaddrinfo: Try again
chronos@...alhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
AF_INET6: 2607:f8b0:4006:81c::200e
AF_INET: 142.250.80.46
chronos@...alhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
AF_INET6: 2607:f8b0:4006:81c::200e
AF_INET: 142.250.80.46
chronos@...alhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
getaddrinfo: Try again
chronos@...alhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
AF_INET6: 2607:f8b0:4006:81c::200e
AF_INET: 142.250.80.46
chronos@...alhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
AF_INET6: 2607:f8b0:4006:81c::200e
AF_INET: 142.250.80.46
chronos@...alhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
getaddrinfo: Try again
chronos@...alhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
AF_INET6: 2607:f8b0:4006:81c::200e
AF_INET: 142.250.80.46
chronos@...alhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
getaddrinfo: Try again
chronos@...alhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
getaddrinfo: Try again
chronos@...alhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
AF_INET6: 2607:f8b0:4006:81c::200e
AF_INET: 142.250.80.46

On Wed, Feb 16, 2022 at 11:14 PM Satadru Pramanik <satadru@...il.com> wrote:

> Oops. Looks like I reversed the patch wrong. Rebuilding again...
>
> On Wed, Feb 16, 2022 at 8:44 PM Satadru Pramanik <satadru@...il.com>
> wrote:
>
>> Looks like I need to do more than just reverse that commit to get this to
>> build...
>>
>> ../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro
>> ‘__socketcall_cp’    47 | #define socketcall_cp(nm,a,b,c,d,e,f)
>> __syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f))
>>                                                             |
>>                                        ^~~~~~~~~~~~~~~
>>  ../src_musl/src/network/accept.c:6:16: note: in expansion of macro
>> ‘socketcall_cp’
>>     6 |         return socketcall_cp(accept, fd, addr, len, 0, 0, 0);
>>       |                ^~~~~~~~~~~~~
>> ../src_musl/src/internal/syscall.h:62:54: note: each undeclared
>> identifier is reported only once for each function it appears in
>>                                        62 | #define
>> __socketcall_cp(nm,a,b,c,d,e,f) __syscall_cp(SYS_##nm, a, b, c, d, e, f)
>>   |                                                      ^~~~
>>          ../src_musl/src/internal/syscall.h:55:53: note: in definition of
>> macro ‘__syscall_cp6’     55 | #define __syscall_cp6(n,a,b,c,d,e,f)
>> (__syscall_cp)(n,__scc(a),__scc(b),__scc(c),__scc(d),__scc(e),__scc(f))
>>                                                               |
>>                                          ^
>>  ../src_musl/src/internal/syscall.h:57:27: note: in expansion of macro
>> ‘__SYSCALL_DISP’     57 | #define __syscall_cp(...)
>> __SYSCALL_DISP(__syscall_cp,__VA_ARGS__)                    |
>>              ^~~~~~~~~~~~~~
>> ../src_musl/src/internal/syscall.h:62:41: note: in expansion of macro
>> ‘__syscall_cp’       62 | #define __socketcall_cp(nm,a,b,c,d,e,f)
>> __syscall_cp(SYS_##nm, a, b, c, d, e, f)      |
>>              ^~~~~~~~~~~~
>> ../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro
>> ‘__socketcall_cp’
>>    47 | #define socketcall_cp(nm,a,b,c,d,e,f)
>> __syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f))
>>                                                             |
>>                                        ^~~~~~~~~~~~~~~
>>  ../src_musl/src/network/accept.c:6:16: note: in expansion of macro
>> ‘socketcall_cp’          6 |         return socketcall_cp(accept, fd, addr,
>> len, 0, 0, 0);                         |                ^~~~~~~~~~~~~
>> ../src_musl/src/network/accept.c:7:1: warning: control reaches end of
>> non-void function [-Wreturn-type]
>>                                    7 | }
>>       | ^
>>                make[2]: *** [Makefile:159: obj/src/network/accept.lo] Error
>> 1
>> make[2]: Leaving directory
>> '/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl/obj_musl'
>>                                                 make[1]: *** [Makefile:249:
>> obj_musl/.lc_built] Error 2                                 make[1]:
>> Leaving directory
>> '/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl'
>>                                                          make: ***
>> [Makefile:194: all] Error 2
>>
>> On Wed, Feb 16, 2022, 4:53 PM Satadru Pramanik <satadru@...il.com> wrote:
>>
>>> I was looking at that commit too. I've started a build with that
>>> reverted and should be able to check back on that tomorrow.
>>>
>>> On Wed, Feb 16, 2022 at 4:33 PM Rich Felker <dalias@...ifal.cx> wrote:
>>>
>>>> On Wed, Feb 16, 2022 at 01:44:35PM -0500, Satadru Pramanik wrote:
>>>> > The only change to socket.c I'm seeing is use __socketcall to simplify
>>>> > socket()
>>>> > <
>>>> https://git.musl-libc.org/cgit/musl/commit/?id=7063c459e7dbd63c2c94e04413743abab5272001
>>>> >,
>>>> > so maybe it would make sense for me to try building with that
>>>> reversed?
>>>>
>>>> That should not be a functional change, but you may be overlooking
>>>> commit c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91, which was: using the
>>>> new (added in 4.3) individual socket syscalls instead of the legacy
>>>> multiplexed SYS_socketcall. It's supposed to fall back to using the
>>>> old ones, but perhaps something goes wrong on your kernel that's
>>>> preventing it. I'm not sure what the mechanism by which it works when
>>>> straced/single-stepped could be, though, but if it's a weird kernel
>>>> bug anything is possible.
>>>>
>>>> Reverting that commit should be entirely safe, if it turns out to be
>>>> what's triggering your problem, but I'd like to get to the root cause
>>>> and see if there's anything we can do to ensure this doesn't come up
>>>> again.
>>>>
>>>>
>>>> > On Wed, Feb 16, 2022 at 1:37 PM Satadru Pramanik <satadru@...il.com>
>>>> wrote:
>>>> >
>>>> > >
>>>> > >>
>>>> > >> - Whether any network traffic occurs when it fails (in the real
>>>> > >>   environment not a replicated one elsewhere).
>>>> > >>
>>>> > >>
>>>> > > There is no network traffic in the real environment.
>>>> > >
>>>> > >
>>>> > >> - Whether it fails or succeeds under strace (in the real
>>>> > >>   environment not a replicated one elsewhere).
>>>> > >>
>>>> > >> It succeeds in strace (in the real environment)
>>>> > >
>>>> > >
>>>> > >
>>>> > >> - Whether the real environment involves Docker or not.
>>>> > >>
>>>> > >> The real environment does not involve docker.
>>>> > >
>>>> > >
>>>> > >
>>>> > >> - What's in resolv.conf (in the real environment not a replicated
>>>> one
>>>> > >>   elsewhere) and what nameserver software (if known) is running on
>>>> the
>>>> > >>   nameserver(s) listed in there.
>>>> > >>
>>>> > >> The nameserver is picked up from dhcp. The contents of the file
>>>> are as
>>>> > > follows:
>>>> > > nameserver 192.168.0.1
>>>> > > search lan.
>>>> > > options single-request timeout:1 attempts:5
>>>> > >
>>>> > >
>>>> > >> - Anything else that might be relevant.
>>>> > >>
>>>> > >> DNS server is dnsmasq running on a current OpenWRT device.
>>>> > >
>>>> > >
>>>> > >> It's really hard to offer any productive advice when the problem is
>>>> > >> unclear.
>>>> > >>
>>>> > >> Apologies for the confusion.
>>>> > > I'm really just trying to debug this getaddrinfo breakage on this
>>>> older
>>>> > > hardware. The docker containers setup is something we use to build
>>>> packages
>>>> > > for this hardware, and our frustration is that the software works
>>>> perfectly
>>>> > > fine in the docker containers, but not on the hardware.
>>>> > >
>>>> > > > Any other suggestions on how to track down this issue?
>>>> > >>
>>>> > >> Rather than stepping through, I would put a single breakpoint at a
>>>> > >> place you want to see whether execution reaches before running the
>>>> > >> test program, then start it and see if the breakpoint fires or not.
>>>> > >> Then remove the breakpoint, add a different one, and repeat. For
>>>> > >> example, see if __res_msend is ever called, and if so, whether
>>>> > >> particular lines of it are reached (or just put breakpoints on
>>>> some of
>>>> > >> the functions it calls, like socket, bind, recvfrom, poll, etc. to
>>>> see
>>>> > >> if they're called).
>>>> > >>
>>>> > >> It might also be useful to put a breakpoint on clock_gettime and
>>>> then
>>>> > >> 'finish' to see what it returns (in case the problem is something
>>>> > >> time64-related).
>>>> > >>
>>>> > >>
>>>> > > The only breakpoint which fixed the execution was for line 20 (which
>>>> > > invokes getaddrinfo). Stepping through the __kernel_vsyscall and
>>>> then
>>>> > > continuing is the only way it does not result in failure.
>>>> > >
>>>> > > Any later breakpoints fail.
>>>> > >
>>>> > > I went though the other breakpoints as requested.
>>>> > > clock_gettime did not fire.
>>>> > >
>>>> > > Breakpoint 1 at 0x5c2f7: file
>>>> ../src_musl/compat/time32/clock_gettime32.c,
>>>> > > line 9.
>>>> > > __res_msend, setsockopt also did not fire.
>>>> > > The ones that did fire were: socket, bind, recvfrom, poll,
>>>> __res_msend_rc,
>>>> > > memset, sendto, __get_resolv_conf, pthread_setcancelstate,
>>>> > > __pthread_setcancelstate, __lookup_serv, __lookup_name, memcpy
>>>> > >
>>>> > > When breaking on socket, stepping through the __kernel_vsyscall
>>>> call after
>>>> > > socket and then continuing succeeds.
>>>> > >
>>>> > > Is it possible that the socket is not waiting long enough for a
>>>> response
>>>> > > from __kernel_vsyscall? Has that changed?
>>>> > > Breaking, stepping, and continuing on every other function above
>>>> fails.
>>>> > >
>>>> > > The gdb log is attached.
>>>> > >
>>>> > > Regards,
>>>> > >
>>>> > > Satadru
>>>> > >
>>>> > >
>>>>
>>>

Content of type "text/html" skipped

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.