Message-ID: <CAFrh3J_7NP0z2eyusSOmMnKc-5o-PjXG-E3OmncUUYB951Fvfg@mail.gmail.com>
Date: Wed, 16 Feb 2022 16:53:31 -0500
From: Satadru Pramanik <satadru@...il.com>
To: Rich Felker <dalias@...ifal.cx>
Cc: musl@...ts.openwall.com
Subject: Re: Re: musl getaddr info breakage on older kernels
I was looking at that commit too. I've started a build with that reverted
and should be able to check back on that tomorrow.
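
For reference, a minimal sketch of the fallback pattern described in the quoted message below: try the individual socket syscall first, and use the legacy multiplexed SYS_socketcall only if the kernel reports ENOSYS. This is not musl's actual code. The helper name socket_with_fallback is hypothetical, and it goes through the libc syscall() wrapper purely for illustration. On architectures that never had SYS_socketcall (x86_64, for example) the fallback branch compiles away.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper, NOT musl's implementation: create a socket using the
 * individual syscall when available, else the multiplexed socketcall. */
static int socket_with_fallback(int domain, int type, int protocol)
{
    long r = -1;
    errno = ENOSYS;

#ifdef SYS_socket
    /* Individual socket syscall. */
    r = syscall(SYS_socket, domain, type, protocol);
    if (r >= 0 || errno != ENOSYS)
        return (int)r;          /* success, or a real error other than ENOSYS */
#endif

#ifdef SYS_socketcall
    {
        /* Legacy path: call number 1 is SYS_SOCKET in <linux/net.h>;
         * the arguments are passed as an array of longs. */
        long args[3] = { domain, type, protocol };
        r = syscall(SYS_socketcall, 1L, args);
    }
#endif

    return (int)r;
}
```

Calling socket_with_fallback(AF_INET, SOCK_DGRAM, 0) in a small test program on the affected hardware, outside of gdb and strace, might show whether the first attempt returns something other than ENOSYS there.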
On Wed, Feb 16, 2022 at 4:33 PM Rich Felker <dalias@...ifal.cx> wrote:
> On Wed, Feb 16, 2022 at 01:44:35PM -0500, Satadru Pramanik wrote:
> > The only change to socket.c I'm seeing is "use __socketcall to simplify
> > socket()"
> > <https://git.musl-libc.org/cgit/musl/commit/?id=7063c459e7dbd63c2c94e04413743abab5272001>,
> > so maybe it would make sense for me to try building with that reverted?
>
> That should not be a functional change, but you may be overlooking
> commit c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91, which was: using the
> new (added in 4.3) individual socket syscalls instead of the legacy
> multiplexed SYS_socketcall. It's supposed to fall back to using the
> old ones, but perhaps something goes wrong on your kernel that's
> preventing it. I'm not sure what the mechanism by which it works when
> straced/single-stepped could be, though, but if it's a weird kernel
> bug anything is possible.
>
> Reverting that commit should be entirely safe, if it turns out to be
> what's triggering your problem, but I'd like to get to the root cause
> and see if there's anything we can do to ensure this doesn't come up
> again.
>
>
> > On Wed, Feb 16, 2022 at 1:37 PM Satadru Pramanik <satadru@...il.com> wrote:
> >
> > >> - Whether any network traffic occurs when it fails (in the real
> > >> environment not a replicated one elsewhere).
> > >
> > > There is no network traffic in the real environment.
> > >
> > >> - Whether it fails or succeeds under strace (in the real
> > >> environment not a replicated one elsewhere).
> > >
> > > It succeeds under strace (in the real environment).
> > >
> > >> - Whether the real environment involves Docker or not.
> > >
> > > The real environment does not involve Docker.
> > >
> > >> - What's in resolv.conf (in the real environment not a replicated one
> > >> elsewhere) and what nameserver software (if known) is running on the
> > >> nameserver(s) listed in there.
> > >
> > > The nameserver is picked up from DHCP. The contents of the file are as
> > > follows:
> > > nameserver 192.168.0.1
> > > search lan.
> > > options single-request timeout:1 attempts:5
> > >
> > >> - Anything else that might be relevant.
> > >
> > > The DNS server is dnsmasq running on a current OpenWRT device.
> > >
> > >> It's really hard to offer any productive advice when the problem is
> > >> unclear.
> > >
> > > Apologies for the confusion.
> > > I'm really just trying to debug this getaddrinfo breakage on this older
> > > hardware. The Docker container setup is something we use to build packages
> > > for this hardware, and our frustration is that the software works perfectly
> > > fine in the Docker containers, but not on the hardware.
> > >
> > > > Any other suggestions on how to track down this issue?
> > >>
> > >> Rather than stepping through, I would put a single breakpoint at a
> > >> place you want to see whether execution reaches before running the
> > >> test program, then start it and see if the breakpoint fires or not.
> > >> Then remove the breakpoint, add a different one, and repeat. For
> > >> example, see if __res_msend is ever called, and if so, whether
> > >> particular lines of it are reached (or just put breakpoints on some of
> > >> the functions it calls, like socket, bind, recvfrom, poll, etc. to see
> > >> if they're called).
> > >>
> > >> It might also be useful to put a breakpoint on clock_gettime and then
> > >> 'finish' to see what it returns (in case the problem is something
> > >> time64-related).
> > >>
> > >>
> > > The only breakpoint after which execution succeeded was the one on line 20
> > > (which invokes getaddrinfo). Stepping through the __kernel_vsyscall call and
> > > then continuing is the only sequence that does not result in failure.
> > >
> > > Any later breakpoints result in failure.
> > >
> > > I went through the other breakpoints as requested.
> > > clock_gettime did not fire.
> > >
> > > Breakpoint 1 at 0x5c2f7: file ../src_musl/compat/time32/clock_gettime32.c,
> > > line 9.
> > >
> > > __res_msend and setsockopt also did not fire.
> > > The ones that did fire were: socket, bind, recvfrom, poll, __res_msend_rc,
> > > memset, sendto, __get_resolv_conf, pthread_setcancelstate,
> > > __pthread_setcancelstate, __lookup_serv, __lookup_name, memcpy
> > >
> > > When breaking on socket, stepping through the __kernel_vsyscall call after
> > > socket and then continuing succeeds.
> > >
> > > Is it possible that the socket is not waiting long enough for a response
> > > from __kernel_vsyscall? Has that changed?
> > > Breaking, stepping, and continuing on every other function above fails.
> > >
> > > The gdb log is attached.
> > >
> > > Regards,
> > >
> > > Satadru
> > >
> > >
>
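
As a complement to the breakpoint experiments quoted above, here is a rough sketch of a lookup helper that records how long getaddrinfo takes without a debugger attached. It is only an illustration: timed_lookup is a hypothetical name, not the test program from this thread. With "timeout:1 attempts:5" in resolv.conf, a failure that returns almost immediately would suggest the resolver is giving up before it ever waits for a reply, while a failure after several seconds would point at lost or unanswered queries.

```c
#define _POSIX_C_SOURCE 200809L
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: run getaddrinfo(), store the elapsed wall time in
 * milliseconds in *elapsed_ms, and return getaddrinfo()'s result code. */
static int timed_lookup(const char *host, const char *service, long *elapsed_ms)
{
    struct addrinfo hints, *res = NULL;
    struct timespec t0, t1;
    int rc;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* accept IPv4 or IPv6 results */
    hints.ai_socktype = SOCK_STREAM;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    rc = getaddrinfo(host, service, &hints, &res);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *elapsed_ms = (long)(t1.tv_sec - t0.tv_sec) * 1000L
                + (t1.tv_nsec - t0.tv_nsec) / 1000000L;

    if (rc != 0)
        fprintf(stderr, "getaddrinfo(%s) failed after %ld ms: %s\n",
                host, *elapsed_ms, gai_strerror(rc));
    else
        freeaddrinfo(res);            /* only free on success */

    return rc;
}
```

Running it once against a numeric address (which needs no DNS query) and once against a real hostname would separate resolver problems from anything else in the program.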