musl - Re: Re: musl getaddr info breakage on older kernels

Message-ID: <20220217181705.GT7074@brightrain.aerifal.cx>
Date: Thu, 17 Feb 2022 13:17:08 -0500
From: Rich Felker <dalias@...ifal.cx>
To: Satadru Pramanik <satadru@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: Re: musl getaddr info breakage on older kernels

On Thu, Feb 17, 2022 at 11:36:31AM -0500, Satadru Pramanik wrote:
> This machine is an EOL Samsung Series 5 Chromebook
> <https://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices/samsung-series-5-chromebook/>,
> code named Alex
> <https://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices/#:~:text=Series%205%20Chromebook-,Alex,-x86%2Dalex%20%26%20x86>.
> It is the target device for our i686 builds for Chromebrew.
> 
> It is running a 3.8.11 kernel, and I believe the kernel source for that is
> here:
> https://chromium.googlesource.com/chromiumos/third_party/kernel/+/refs/heads/chromeos-3.8
> 
> Getting a signed kernel update for an EOL kernel for an EOL machine is
> close to impossible from Google, so we're just trying to work around these

If these are machines you're in control of, you may be able to load a
module to patch it. If this is something you're deploying to users
stuck on that kernel who don't want to fix their systems, then of
course that's not a practical option.

> issues in userspace to maintain some functionality for any users who may
> still be using the device.
> 
> The simplest workaround possible would be ideal.

If you're shipping binaries specifically for these devices, the
simplest fix is to emulate in userspace the failure that should happen
in the kernel, using the attached patch. DO NOT deploy this patch in
binaries meant to be used on modern systems, since they will break
when Y2038 rolls around. (Your old Chromebooks will break then too.)
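
As a rough illustration of what "emulating the failure in userspace"
can look like (this is a sketch, not the contents of the attached
patch), the idea is to refuse syscall numbers the old kernel cannot
know about before ever entering the kernel. Every name below is
hypothetical. The cutoff of 403 is an assumption taken from the
SYS_clock_gettime64 number mentioned in this thread; a real workaround
would need the actual highest syscall number the kernel implements.
The "kernel" here is a stand-in function so the logic can be exercised
anywhere.

```c
#include <errno.h>

/* Assumption: treat 403 (SYS_clock_gettime64 on i386, per this thread)
 * as the first syscall number the old kernel does not implement. */
#define FIRST_UNSUPPORTED_NR 403

typedef long (*raw_syscall_fn)(long nr, long a, long b);

/* Stand-in for the broken kernel entry path: known syscalls "succeed"
 * with 0, unknown ones echo the syscall number instead of -ENOSYS. */
static long echoing_kernel(long nr, long a, long b)
{
	(void)a;
	(void)b;
	return nr >= FIRST_UNSUPPORTED_NR ? nr : 0;
}

/* Hypothetical guard: produce the -ENOSYS the kernel should have
 * returned, without making the syscall at all. */
static long guarded_syscall(raw_syscall_fn enter_kernel, long nr, long a, long b)
{
	if (nr >= FIRST_UNSUPPORTED_NR)
		return -ENOSYS;
	return enter_kernel(nr, a, b);
}
```

With the guard in place, guarded_syscall(echoing_kernel, 403, 0, 0)
yields -ENOSYS, so a caller's normal fallback path runs. The same
guard on a modern kernel would permanently hide the time64 syscalls,
which is why such a hack must not ship in general-purpose binaries.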

> It is interesting though
> that the sample program works fine when built against near-stock glibc
> 2.23, no?

No. If your kernel has a bug that makes something behave wildly wrong,
whether you do or don't see that as visible breakage with a particular
piece of software is not particularly interesting.

I'm pretty sure, however, that you just haven't tested enough to see
any failures. glibc 2.23 is from 2016, so any functionality in it
using syscalls added after 2013 (3.8 kernel) is going to blow up
badly: it will think the syscall succeeded and returned some positive
value when actually the kernel lacks it.

In the particular case of clock_gettime, it's just that your glibc
2.23 has a hard Y2038 EOL and does not use/support the missing time64
syscalls.
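
To make the failure mode concrete, here is a simplified model (not
musl's or glibc's source; all names are hypothetical) of a libc that
tries a time64 syscall first and falls back to the legacy one only on
-ENOSYS. The three "syscall" functions are stand-ins: one for a
healthy old kernel, one for a kernel that echoes the syscall number,
and one for the legacy call that works on both. A libc that never
tries the time64 call never hits the bogus return.

```c
#include <errno.h>

/* Simplified model of a time64-first clock read with a legacy fallback.
 * All names are hypothetical. */
struct model_ts { long long sec; long nsec; };
typedef long (*clock_sys_fn)(struct model_ts *ts);

/* Healthy old kernel: the unknown syscall fails with -ENOSYS. */
static long time64_healthy_old(struct model_ts *ts)
{
	(void)ts;
	return -ENOSYS;
}

/* Broken kernel: echoes the syscall number (403) and writes nothing. */
static long time64_broken(struct model_ts *ts)
{
	(void)ts;
	return 403;
}

/* Legacy syscall: works on both kernels; the values are arbitrary. */
static long legacy_ok(struct model_ts *ts)
{
	ts->sec = 1645121828;
	ts->nsec = 0;
	return 0;
}

static long model_clock_gettime(clock_sys_fn time64, clock_sys_fn legacy,
                                struct model_ts *ts)
{
	long r = time64(ts);
	if (r != -ENOSYS)
		return r;       /* taken as final: success or a genuine error */
	return legacy(ts);      /* only reached when the kernel says ENOSYS */
}
```

On the healthy model the fallback runs and the timespec is filled in.
On the broken model the caller gets 403 back and the timespec is left
untouched, so anything computing a timeout from it sees garbage.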


> On Thu, Feb 17, 2022 at 11:05 AM Rich Felker <dalias@...ifal.cx> wrote:
> 
> > On Thu, Feb 17, 2022 at 10:53:52AM -0500, Rich Felker wrote:
> > > On Thu, Feb 17, 2022 at 09:49:45AM -0500, Satadru Pramanik wrote:
> > > > Apologies for not being as familiar with gdb as I ought to be.
> > > > I used the __clock_gettime64 breakpoint and did a backtrace and finish
> > > > repeatedly.
> > > > I couldn't figure out how to best get the timespec struct info.
> > > >
> > > > Alternately if you want to throw out a sample test program for me to
> > > > build and run, and what gdb commands to run to get the right info,
> > > > happy to do that too.
> > > >
> > > > gdb output is attached.
> > >
> > > If gdb reported it correctly, clock_gettime returned 403, which should
> > > be impossible. It can only return 0 or -1. Incidentally, 403 is the
> > > syscall number for SYS_clock_gettime64, which suggests your kernel is
> > > simply *returning the syscall number* instead of -ENOSYS for syscalls
> > > that don't exist on it. Is this a stock kernel (3.8 IIRC) or does it
> > > have any sort of weird vendor patching? Any LSMs loaded?
> > >
> > > If you'd like to run a test just to make sure we're accurately seeing
> > > what's happening, the attached should work. It should print 0 followed
> > > by the current time in seconds and nanoseconds.
> >
> > It looks like you hit the bug introduced in commit
> > 554086d85e71f30abe46fc014fea31929a7c6a8a and fixed in commit
> > 8142b215501f8b291a108a202b3a053a265b03dd. It looks like, since the
> > former was a CVE fix, somebody backported it to the kernel you're
> > using, but they failed to backport the fix-for-the-fix, so you have a
> > kernel that operates dangerously incorrectly for syscall numbers it's
> > unaware of.
> >
> > This really needs to be fixed in the kernel if you can. On our side
> > (musl) we probably need to find out if such kernels are actually out
> > in the wild, and if so, whether there's any reasonable way to detect
> > the false success and treat it as failure.
> >
> > > > On Thu, Feb 17, 2022 at 8:46 AM Rich Felker <dalias@...ifal.cx> wrote:
> > > >
> > > > > On Thu, Feb 17, 2022 at 08:30:47AM -0500, Satadru Pramanik wrote:
> > > > > > *This is a failure:*
> > > > > > tcpdump -i any -vvv host 192.168.0.115
> > > > > > tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
> > > > > > 08:29:38.043849 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 56)
> > > > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ A? google.com. (28)
> > > > > > 08:29:38.044237 IP (tos 0x0, ttl 64, id 11463, offset 0, flags [DF], proto UDP (17), length 72)
> > > > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x820a -> 0x5c7d!] 0 q: A? google.com. 1/0/0 google.com. [2m15s] A 142.250.80.110 (44)
> > > > > > 08:29:38.047754 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 56)
> > > > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ AAAA? google.com. (28)
> > > > > > 08:29:38.048078 IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF], proto UDP (17), length 84)
> > > > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x8216 -> 0xb42f!] 0 q: AAAA? google.com. 1/0/0 google.com. [4m26s] AAAA 2607:f8b0:4006:80d::200e (56)
> > > > > > 08:29:38.048955 IP (tos 0xc0, ttl 64, id 59728, offset 0, flags [none], proto ICMP (1), length 112)
> > > > > >     192.168.0.115 > office.lan: ICMP 192.168.0.115 udp port 60625 unreachable, length 92
> > > > >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > >
> > > > > OK, this shows that the client has requested both answers and the
> > > > > nameserver replied almost immediately (about 0.5ms later), but when
> > > > > the second reply arrives (to the AAAA), the client has already closed
> > > > > the listening port, despite only a few ms having passed. The only way
> > > > > I see this could happen is by "timing out". This suggests that
> > > > > something is wrong with telling time.
> > > > >
> > > > > Can you either put a breakpoint in __clock_gettime64 (this is the
> > > > > name you have to use for a breakpoint -- sorry I messed it up last
> > > > > time) and then see what it returns when you "finish" it and what's
> > > > > in the timespec struct after that? Or just write a test program to
> > > > > call clock_gettime(CLOCK_REALTIME, &ts) (note: you do NOT need or
> > > > > want to use the time64 symbol name here) and print the results
> > > > > (return value and contents of the timespec struct).
> > > > >
> > > > >
> > > > >
> > > > > >         IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF], proto UDP (17), length 84)
> > > > > >     office.lan.53 > 192.168.0.115.60625: [udp sum ok] 0 q: AAAA? google.com. 1/0/0 google.com. [4m26s] AAAA 2607:f8b0:4006:80d::200e (56)
> > > > > > 08:29:39.476101 IP (tos 0x0, ttl 64, id 12690, offset 0, flags [DF], proto TCP (6), length 52)
> > > > > >     192.168.0.115.51204 > lga34s35-in-f3.1e100.net.80: Flags [.], cksum 0xa666 (correct), seq 1466707759, ack 3358943837, win 115, options [nop,nop,TS val 198422160 ecr 2351261566], length 0
> > > > > > 08:29:39.478914 IP (tos 0x80, ttl 122, id 6227, offset 0, flags [none], proto TCP (6), length 52)
> > > > > >     lga34s35-in-f3.1e100.net.80 > 192.168.0.115.51204: Flags [.], cksum 0xa5b7 (correct), seq 1, ack 1, win 282, options [nop,nop,TS val 2351306585 ecr 198377148], length 0
> > > > > > ^C
> > > > > > 7 packets captured
> > > > > > 7 packets received by filter
> > > > > > 0 packets dropped by kernel
> > > > >
> > >
> > >
> >
> > > #include <time.h>
> > > #include <stdio.h>
> > > int main()
> > > {
> > >       struct timespec ts;
> > >       printf("%d", clock_gettime(CLOCK_REALTIME, &ts));
> > >       printf(" %lld %.9ld\n", (long long)ts.tv_sec, ts.tv_nsec);
> > > }
> >
> >

View attachment "broken_chromeos_kernel_hack.diff" of type "text/plain" (2380 bytes)