|
Message-ID: <81f188cf-3fb8-2899-5c24-ec72d38ad300@gmail.com> Date: Mon, 30 Jan 2017 13:30:00 -0800 From: Eric Hassold <hassold@...il.com> To: musl@...ts.openwall.com Subject: Re: Fix pthread_create on some devices failing to initialize guard area On 1/20/17 2:42 PM, Eric Hassold wrote: > > > On 1/20/17 1:29 PM, Rich Felker wrote: >> On Fri, Jan 20, 2017 at 01:04:28PM -0800, Eric Hassold wrote: >>> On 1/20/17 11:56 AM, Rich Felker wrote: >>>> On Fri, Jan 20, 2017 at 11:45:09AM -0800, Eric Hassold wrote: >>>>> Hi All, >>>>> >>>>> While deploying test static executable across farm of different >>>>> embedded systems, found out that pthread_create() is failing >>>>> systematically on some (very few) arm-linux devices whenever non >>>>> null stack guard is enabled (that is, also when calling >>>>> pthread_create with default - i.e. null - attributes since default >>>>> is a one page of guard). One of those device is for example a >>>>> Marvell Armada 375 running Linux 3.10.39. Same test code, built with >>>>> alternative libc implementations (glibc, uClibc) works as expected >>>>> on those devices. >>>>> >>>>> >>>>> Issue >>>>> >>>>> This occurs because of call to mprotect() in pthread_create fails. >>>>> In current implementation, if guard size is non null, memory for >>>>> (guard + stack + ...) is first allocated (mmap'ed) with no >>>>> accessibility (PROT_NONE), then mprotect() is called to re-enable >>>>> read/write access to (memory + guardsize). Since call to mprotect() >>>>> systematically fails in this scenario (returning error code EINVAL), >>>>> it is impossible to create thread. >>>> Failure is ignored and the memory is assumed to be writable in this >>>> case, since EINVAL is assumed to imply no MMU. Is this assumption >>>> wrong in your case, and if so, can you explain why? >>> In my case, devices exhibiting issue are not MMU-less, they are >>> Cortex-A9 devices with valid mmu / page protection working as >>> expected otherwise. Note that current Musl code assumes ENOSYS means >>> no MMU and handles it by assuming the system has no page protection >>> at all. For the case I observe, it is EINVAL which is returned, this >>> is not ignored, so memory is unmap'ed and pthread_create() fails. >> In that case I think this is a kernel bug. Do you know why EINVAL is >> happening? If there's an MMU, Linux should be able to replace the >> anon PROT_NONE pages with anon RW pages. > > Agree. Unfortunately, those are devices we don't built the kernel for, > so have been hardly able to track issue deeper. The point is however > that such devices with this issue in kernel might not be that > uncommon, and it concretely means impossibility at that moment to > deploy on them a functional static executable built with musl. >> >>>>> In proposed patch (attached below), memory for (guard + stack + ...) >>>>> is first mmap'ed with read/write accessibility, then guard area is >>>>> protected by calling mprotect() with PROT_NONE on guardsize first >>>>> bytes of returned memory. This call to mprotect() to remove all >>>>> accessibility on guard area, with guard area being at beginning of >>>>> previously mmap'ed memory, works correctly on those platforms having >>>>> issue with current implementation. Incidentally, this makes the >>>>> logic more concise to handle both cases (with or without guard) is a >>>>> more consistent way, and handle systems with partial/invalid page >>>>> protection implementation (e.g. mprotect() returning ENOSYS) more >>>>> gracefully since the stack is explicitly created with read/write >>>>> access. >>>> This doesn't work correctly on normal systems with mmu, because the >>>> size of the guard pages is accounted against commit charge. Linux >>>> should, but AFAIK doesn't, subtract it from commit charge once it's >>>> changed to PROT_NONE without having been dirtied, but even if this bug >>>> is fixed on the kernel side, there would still be a moment where >>>> excess commit charge is consumed and thus where pthread_create might >>>> spuriously fail or cause allocations in other processes/threads to >>>> fail. >>>> >>>> If the kernel is not allocating actually-usable address ranges for >>>> PROT_NONE on all nommu systems, I think the only solution is to handle >>>> EINVAL from mprotect by going back and re-doing the mmap with >>>> PROT_READ|PROT_WRITE. Do you have any better ideas? >>>> >>>> Rich >>> Had this "deja vu" feeling... reminds me conversation you had in >>> this thread some time ago elsewhere... >>> https://sourceware.org/ml/libc-alpha/2015-09/msg00447.html >>> >>> Your proposition seems reasonable on nommu system, but again, the >>> issue observed here is on legit systems with mmu, with mprotect >>> failing with EINVAL (and not ENOSYS), for some other reason than >>> system not supporting page protection. Catching EINVAL error >>> returned by mprotect and falling back to re-doing the mmap would >>> mean actually silently running without stack guard on system >>> supporting it, so I believe it is actually legitimate to fail and >>> return error in that case. But that's difference use case than the >>> issue I'm observing. >>> >>> I took note of Balazs's suggestion (in the thread referenced above) >>> to switch to a pattern similar to Musl's current one >>> (mmap(PROT_NONE) + mprotect(stack, READ|WRITE)) in order to avoid >>> those guard pages to actually occupy resources. But I can indeed >>> observe that this approach fails on some devices (which have valid >>> mmu), while I'm not sure I'm seeing the issue with first mapping >>> PROT_READ|PROT_WRITE then mprotect(PROT_NONE) guard area. Latter >>> approach (as implemented by patch) is, at least, consistent with all >>> the other implementations out there (I checked glibc's >>> allocatestack.c, but also e.g. bionic), and couldn't find report of >>> those failures you are envisaging. >> Consider the case of guard_size=128M stack_size=128k with >> Commit_Limit=128M. This will fail with your approach but works >> perfectly well now. > Right, makes sense. Though I would point out that such uncommon > scenario for an application would fail when linked with anything but > musl, since "my approach" is the widespread one across all other libc > implementations (glibc, bionic, ...). But I understand both approaches > aim at working around issue in some corner case while creating issue > in other rare one, and so at the end it's all about deciding which > edge case is more critical than the other, and whether > consistency/compatibility with existing 3rd parties solutions matters > or not. > > But in order to get closer to "having the cake and eat it too", please > find attached another patch implementing the "unmaping and re-doing" > strategy you initially suggested, i.e. starting with current approach, > then giving it a second chance using "my" approach if and only if > issue with mprotect() is detected. Should be consistent with current > behavior on any system working correctly today, and just provide a > plan B falling back to more common approach only when needed. Does > that sound more acceptable? > > Thanks, > > Eric > > ---- > [PATCH] set up guard protection after stack is mapped > > calling mprotect beyond the guard page of PROT_NONE mmap-ed stack > fails on > some devices with buggy kernel, making it impossible to create > thread. if such condition is met, fall back to allocating memory > for stack and guard first, with read and write permission, then > protect guard > area. > --- > src/thread/pthread_create.c | 12 +++++++++++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/src/thread/pthread_create.c b/src/thread/pthread_create.c > index 49f2f72..d3c030b 100644 > --- a/src/thread/pthread_create.c > +++ b/src/thread/pthread_create.c > @@ -242,7 +242,17 @@ int __pthread_create(pthread_t *restrict res, > const pthread_attr_t *restrict att > if (__mprotect(map+guard, size-guard, PROT_READ|PROT_WRITE) > && errno != ENOSYS) { > __munmap(map, size); > - goto fail; > + if (errno == EINVAL) { > + map = __mmap(0, size, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANON, -1, 0); > + if (map == MAP_FAILED) goto fail; > + if (__mprotect(map, guard, PROT_NONE) > + && errno != ENOSYS) { > + __munmap(map, size); > + goto fail; > + } > + } else { > + goto fail; > + } > } > } else { > map = __mmap(0, size, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANON, -1, 0); Hi Rich, Pinging... any comment, feedback or concern about latest version of the patch, attached above, keeping current behavior but falling back to (mmap(PROT_READ|PROT_WRITE) && mprotect(guard, none)) if and only if current approach detected to fail) ? Eric
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.