Message-ID: <20151029021014.GA1768@brightrain.aerifal.cx>
Date: Wed, 28 Oct 2015 22:10:14 -0400
From: Rich Felker <dalias@...c.org>
To: nommu@...mu.org
Cc: musl@...ts.openwall.com
Subject: Behavior of mmap on Linux/nommu, & musl dynamic linker

Presently musl's dynamic linker is not behaving entirely correctly on
nommu systems, and I think I understand the issues, but 

Most of this applies only to non-FDPIC (plain ELF with constant
displacement between segments) loading, so I'll start with that part:

What we generally do to map libraries in this case on systems with MMU
is start out with one large mmap, starting at the beginning of the
lowest-address PT_LOAD segment, using the permissions of that segment,
but whose length is the total amount of address space that needs to be
reserved. (Note that some of this mapping may be past the end of the
file, in which case access may SIGBUS, but we never intend to access
it so that doesn't matter.) After that, we mmap additional segments
over top of parts of that address range using MAP_FIXED. This yields a
minimum number of mmap calls/changes to the vm layout of the process
and thus very efficient loading for small libraries where syscall time
dominates relocation time.

Unfortunately, MAP_FIXED is not accepted at all on Linux/nommu; it
unconditionally produces EINVAL. In principle it should be possible to
use MAP_FIXED like this to replace parts of private mappings, but it's
no more efficient than simply using read/memcpy to replace the data.
So what musl is doing right now is handling the failure of MAP_FIXED
with EINVAL by using read to load the file contents at the appropriate
addresses within the range obtained by the first mmap call.

Now here's where we hit the next problem: for non-writable private
mappings, Linux/nommu sets the VM_MAYSHARE flag (indicated in
/proc/%d/maps with a lowercase 's' instead of 'p') and in principle
the file operations backend is allowed to assign an address range
that's actually shared with other processes mapping the file (or the
actual rom/cache/whatever copy of the file).

This behavior actually justifies the choice to disallow MAP_FIXED; if
the address range obtained for a non-writable private map is actually
shared memory that other processes may be using for their own
non-writable private or shared maps of the same file, then using
MAP_FIXED to reassign the address range to a different use is not
possible.

What's also likely not valid is the way musl is using read to fill in
the additional segments' contents. Since the first segment is
generally text (non-writable), the map returned by mmap could
potentially be a shared map, in which case we would clobber memory
shared with other processes. In practice, this isn't happening, but
I'm not sure why. The following comment in mm/nommu.c may suggest a
reason:

	/* if we want to share, we need to check for regions created by other
	 * mmap() calls that overlap with our proposed mapping
	 * - we can only share with a superset match on most regular files
	 * - shared mappings on character devices and memory backed files are
	 *   permitted to overlap inexactly as far as we are concerned for in
	 *   these cases, sharing is handled in the driver or filesystem rather
	 *   than here
	 */

For libraries without debug info or with large bss, the total mapping
length is likely to be larger than the total file length, in which
case the tests for shareability may always fail -- I haven't actually
checked this because the logic is complex and hard to follow, but it
seems plausible. However, something obviously needs to change.

What I think I should do is detect the failure of MAP_FIXED the first
time it fails, unmap the (now-useless) initial map, and switch to
using private anonymous maps followed by read for setting up any
address ranges that need writable subranges. uClibc uses a slightly
simpler approach without first trying MAP_FIXED and falling back (just
allocating anonymous memory to begin with), but musl supports both
mmu-ful and nommu runtime environments, and dropping support for
shared text and COW on mmu-ful runtime environments is not a
reasonable option.

Only one small part of all this applies to musl's FDPIC ELF loader:
bss allocation. When we map a writable PT_LOAD segment with bss
(p_memsz>p_filesz), the initial mmap is potentially larger than the
file, and a second mmap with MAP_FIXED is used to replace the bss part
of the mapping with anonymous zero pages. This operation fails on
nommu right now, and we fall back to memset, which would potentially
SIGBUS in a runtime environment with mmu. So there is an assumption
encoded here that nommu does not SIGBUS, i.e. that the full mapping
length requested actually has memory underlying it. This seems reasonable,
but I'd welcome feedback if anyone has good reason to disagree.

Rich
