Message-ID: <CAOS_Y6TPOFmCJNGYaJmwFNF5+Nq74FEUOKcRUHGKA+N9aJTPdw@mail.gmail.com>
Date: Tue, 2 Jun 2015 01:09:47 -0500
From: Rob Landley <rob@...dley.net>
To: Rich Felker <dalias@...ifal.cx>
Cc: musl@...ts.openwall.com
Subject: Re: Moving forward with sh2/nommu

On Mon, Jun 1, 2015 at 11:04 AM, Rich Felker <dalias@...ifal.cx> wrote:
> On Mon, Jun 01, 2015 at 01:19:32AM -0500, Rob Landley wrote:
>> FYI, Jeff's response.
>>
>> We REALLY need to get this on the mailing list.
>>
>> Rob
>
> OK, done.

Actually I was hoping you and Jeff could repost your respective bits, but eh.

>> > 2. Kernel insists on having a stack size set in the PT_GNU_STACK
>> >    program header; if it's 0 (the default ld produces) then execve
>> >    fails. It should just provide a default, probably 128k (equal to
>> >    MMU-ful Linux).

MMU-ful Linux preallocates 0k and then demand faults in pages. MMU-ful almost never has to worry about memory fragmentation because it can remap _and_ move physical pages around (if nothing else, evict through swap). This is dedicated contiguous allocation even if it's wasted, in a system that's _very_ prone to fragmentation, meaning things like "true" can fail well before you're in OOM killer territory. It's not the same at all.

>> Nooooo. 8k. uClinux programs cannot depend on a huge stack, because that
>> means each instance needs to kmalloc() a huge block of memory. That is
>> bad, and it leads to failure to load because of fragmentation (not being
>> able to find contiguous memory blocks for all those stacks).
>
> My view here was just that the default, when none was specified while
> building the program, should be something "safe". Failed execve
> ("oops, need to use the right -Wl,-z,stack-size=XXX") is a lot easier
> to diagnose than a stack overflow that clobbers the program code with
> stack objects. Right now the default is "always fails to load" because
> the kernel explicitly rejects any request for a default.
I note that Rich was probably saying he wants the default at 128k for ELF, not for FDPIC. That said, I'm not sure you can have a big enough warning sign about vanilla ELF being crappy in that case.

There are two things to balance here: if it doesn't "just work" then people are frustrated getting their programs to run, but if the defaults are horrible for scalability people will go "nommu is crap, we can't use it" without ever learning what's actually _wrong_. (It's a lot easier to get people to fix obvious breakage than to performance tune something that becomes untenable after the fact. How big the performance hit has to be before requiring --no-i-really-mean-it on the command line is an open question, but this is up there.)

Of the two ("just works" but is horrible, breaks for trivial things until you understand why), if they can't get "hello world" to work we can probably get them to read a very small HOWTO, so they at least know fixed stack size is an _issue_ in this context. (We're already in "fork does not work" territory. We are not in Kansas anymore; if you try to fake Kansas _too_ hard you're doing developers a disservice.)

That said, annotating every package in the build is silly. Probably there should be an environment variable or something that can set this default for entire builds, and something to actually _measure_ stack usage after a run would be awesome. (The kernel has checkstack.pl, for example.) And the elf2flt command line option for setting stack size was just _awkward_; no idea what you've done for your fdpic binaries, but the traditional UI for this is horrible.

>> > Unfortunately I suspect fixing this might be controversial since
>> > there may be existing binaries using brk that can't fall back to
>> > mmap.
>>
>> No, look at what I did in uClibc for brk(). I think it fails, and everything
>> in the past depends on that.
>
> OK. So do you think it would be safe/acceptable to make brk always
> fail in the kernel?
Fork does; this is _less_ intrusive than that.

> As long as making it fail is left to userspace,
> the userspace code has to know at runtime whether it's running on
> nommu or not so it can make brk fail. (I'm assuming my goal of having
> binaries that can run on both/either.)

Gotta make a certain amount of historical usage work if we're to wean people off their weird bespoke builds. Not sure what the right answer is here. Then again, a patch to make future kernels do this right, and then depending on that, is a problem that will solve itself in time. (We no longer care about 2.4. I've generally used 7 years as a rule of thumb for "that's too old to care about without a reason", and that gets us back to around 2.6.25 at the moment...)

>> > 4. Syscall trap numbers differ on SH2 vs SH3/4. Presumably the reason
>> >    is that these two SH2A hardware traps overlap with the syscall
>> >    range used by SH3/4 ABI:
>> >
>> >    # define TRAP_DIVZERO_ERROR 17
>> >    # define TRAP_DIVOVF_ERROR 18
>>
>> No, 2A is actually the -newest- SH. This is just gratuitous breakage, and it’s
>> really unfortunate.
>>
>> > The path forward I'd like to see is deprecating everything but trap
>> > numbers 22 and 38, which, as far as I can tell, are safe for both
>> > the SH2 and SH3/4 kernel to treat as syscalls.
>>
>> Kawasaki-san? Thoughts?
>>
>> > These numbers
>> > indicate "6 arguments"; there is no good reason to encode the
>> > number of arguments in the trap number, so we might as well just
>> > always use the "6 argument" code which is what the variadic
>> > syscall() has to use anyway. User code should then aim to use the
>> > correct value (22 or 38) for the model it's running on (SH3/4 or
>> > SH2) for compatibility with old kernels, but will still run safely
>> > on new kernels if it detects wrong.
>>
>> I say drop backward compatibility.
>
> That generally goes against kernel stability principles and seems
> unlikely to be acceptable upstream.
I think Jeff means software-level backward compatibility with sh2 wouldn't be missed. (Linux wasn't big on that architecture in the first place, and buildroot dropped support for it a release or two back.) Losing backward compatibility with sh4 would make us look really bad, especially since musl already supports it, but I don't think he's suggesting breaking sh4.

If we make sh2 binaries accept sh4 syscalls, we can call it a new architecture variant. (Which we actually _are_ with sh2j.) The kernel patch to make that work would be tiny and the hardware doesn't change.

Given that the qemu guys implemented "qemu-system-sh4" instead of "qemu-system-sh", the current perception is that superh _is_ sh4. Not being compatible with sh4 probably has higher overhead than not being compatible with historical Linux for sh2.

>> > Toolchain issues:
>> >
>> > 1. We need static-PIE (with or without TEXTRELs) which gcc does not
>> >    support out of the box. I have complex command lines that produce
>> >    static-PIE, and I have specfile based recipes to convert a normal
>> >    toolchain to produce (either optionally or by default) static-PIE,
>> >    but these recipes conflict with using the same toolchain to build
>> >    the kernel. If static-PIE were integrated properly upstream that
>> >    would not be an issue.
>>
>> I’d like to see -exactly- what the position independence code generator
>> is doing in all cases (there are some interesting ones). Embedded systems
>> really do need to count every cycle, and while I’m good with making things
>> as rational and standard as possible, if the function overhead is 50 cycles
>> and blasts the iCache taking a trip through the trampoline, I think we
>> need to reconsider. Some serious benchmarking is also in order. bFLT does
>> not have any overhead at run time, which is why people still use it over
>> FD-PIC on a lot of platforms...
>
> There's no trampoline (I assume you mean PLT?) for static-PIE.
> Since the linker resolves all symbolic references at ld-time, it never
> generates PLT thunks; instead, all the calls using relative @PLT
> addresses turn into what you would get with @PCREL addresses. So in
> terms of calls, the main difference versus non-PIC is that you get a
> braf/bsrf with a relative address instead of a jmp/jsr with an
> absolute address, and in principle this leads to fewer relocations
> ('fixups') at runtime.
>
> However, if PIC is too expensive for other reasons, there's no
> fundamental reason it has to be used. This is what I meant above by
> "with or without TEXTRELs". You're free to compile without -fPIE or
> -fPIC, then link as [static-]PIE, and what you'll end up with is
> runtime relocations in the .text segment. Unlike on systems with MMU,
> this is not a problem because (1) it's not shareable anyway, so you're
> not preventing sharing, and (2) there's no memory protection against
> writes to .text anyway, so you're not sacrificing memory protection.
> I believe doing it this way gets you the same results (even in terms
> of number/type of relocations) that you would get with non-shareable
> bFLT.

Poke me on irc and I'll see what I can scrounge up. (Frantically preparing for Thursday's talk about how we open sourced our code and VHDL, by beating our stuff into uploadable shape and documenting everything. :)

>> > 2. Neither binutils nor gcc accepts "sh2eb-linux" as a target. Trying
>> >    to hack it in got me a little-endian toolchain. I'm currently just
>> >    using "sheb" and -m2 to use sh2 instructions that aren't in sh1.
>>
>> That is why you really want to configure --target=sh2-uclinux (or make
>> sh2-linux do the right thing). SH2 I think is always Big...
>
> Well GCC seems to consider the plain sh2-linux target as
> little-endian.

They're crazy and broken about a lot of stuff. Sega Saturn was big endian. (Saturn was to sh2 what dreamcast was to sh4.)
I also note that gcc 4.2.1 and binutils 2.17 parsed sh2eb. Current stuff not doing so is a regression.

> I think a lot of the issue here is that, to you, sh2 means a particular
> hardware architecture, whereas to the GCC developers and to me, sh2
> means just the ISA, and from a non-kernel/non-baremetal perspective,
> just the userspace part of the ISA, which is a subset of the sh3/4
> ISAs.

Extra bit of fun: when Renesas was implementing the ELF spec, they used a version that had been translated into Japanese. The translation program switched codepages, meaning _ and . got swapped. (This is why the superh prefixes are borked; the developers accurately implemented the documentation they had.)

That said, "sh2eb-elf" and "sh2eb-unknown-linux" are different targets with different ELF prefixes, and our ROM bootloader code was written for the ELF one. (I looked at porting it and it's really painful and intrusive, and non-linux code tends to be written for -elf toolchains instead of -linux toolchains in general.) Meaning I have to build _both_ toolchains and use one for the hardware build (which includes the ROM bootloader code) and one for the kernel and userspace builds. (We boot vmlinux, and the bootloader parsing the ELF cares about the prefixes; yes, the bootloader that has to build with one set of prefixes expects to parse code built with the _other_ set, don't get me started on that.)

So when you say "what gcc developers think" the answer is "they don't". It's an inconsistent mix of random historical crap and we have to make the best of it. I'd like to try to be the least amount of crazy we can going forward, please.

> So when gcc is generating code for "sh2-linux", it's treating it
> as using all the usual linux-sh conventions (default endianness,
> psABI, etc.) but restricted to the sh2 instruction set (no later
> instructions).
There really _aren't_ usual "linux-sh" conventions, there's the perception that all the world's a <strike>vax</strike> sh4, and everybody who doesn't think that is largely still using gcc 3.4 because that never stopped working and the new stuff breaks every third release. (Chronic problem in the embedded world: getting people to upgrade off of what they first got working, let alone interact with upstream via anything other than an initial smash-and-grab, hightailing it to the hideout, and then staying silent until the statute of limitations runs out. And that's _without_ factoring a language barrier into it.)

>> > 3. The complex math functions cause ICE in all gcc versions I've tried
>> >    targeting SH2. For now we can just remove src/complex from musl,
>> >    but that's a hack. The cause of this bug needs to be found and
>> >    fixed in GCC.

Can I get a test program I can build and try with Aboriginal's toolchain?

>> Does it happen in 4.5.2 or the Code Sorcery chain? We need complex.
>
> I'm not sure.

He's referring to http://sourcery.mentor.com/public/gnu_toolchain/sh-linux-gnu/renesas-2011.03-36-sh-uclinux.src.tar.bz2 and renesas-2011.03-36-sh-uclinux-i686-pc-linux-gnu.tar.bz2 built from that, both of which were there last month but seem to have gone down. Grrr. And of course Mentor Graphics put a robots.txt in place to block archive.org, because they're SUCH an open source company it exudes from their pores.

Right, I threw both on landley.net for the moment, probably take 'em down again this weekend. (It's GPL code and that's the corresponding source, there you go.)

Anyway, buildroot used to use this stuff to build toolchains (ala http://git.busybox.net/buildroot/commit/?id=29efac3c23df9431375f26d1b240627f604f42ca) but there was serious whack-a-mole in tracking where they moved it this week (http://git.buildroot.net/buildroot/commit/?id=27404dad33a8f9068faa8be72916ed47f905b5e6) so... Building from source with vanilla is _so_ much nicer...

>> > musl issues:
>> >
>> > 1.
>> > We need runtime detection for the right trap number to use for
>> >    syscalls. Right now I've got the trap numbers hard-coded for SH2 in
>> >    my local tree.
>>
>> I don’t agree. Just rationalise it. Why can SH3 and above not use the
>> same traps as SH2?
>
> Because the kernel syscall interface is a stable API. Even if not for
> that, unilaterally deciding to change the interface does not instill
> confidence in the architecture as a stable target.

And because the perception out there in linux-land is that sh4 was a real (if stale) processor and sh2 wasn't, so breaking sh4 to suit sh2 _before_ we've established our new open hardware thing as actually viable will get the door slammed on us so hard...

> OTOH if we could change to using the SH2 trap range as the default and
> just keep the old SH3/4 range as a 'backwards compatibility' thing on
> SH3/4 hardware, I think that might be an acceptable solution too.
> Existing SH3/4 binaries are never going to run on SH2 anyway.

QEMU supports sh4 right now. If sh2 supported sh4 traps we _might_ be able to run some sh2 code on qemu-sh4 and/or qemu-system-sh4. (But then I dunno what the issues there are; I need to sit down and fight with it now that elf2flt isn't blocking me.)

>> Don’t understand why we want to change personality? More info?
>
> There's one unexpected place where the kernel has to know whether
> you're doing FDPIC or not. The sigaction syscall takes a function
> pointer for the signal handler, and on FDPIC, this is a pointer to the
> function descriptor containing the GOT pointer and actual code
> address. So if the kernel has loaded non-FDPIC ELF via the FDPIC ELF
> loader, it will have switched personality to FDPIC to treat the signal
> handler pointer specially. And then when you give it an actual
> function address instead of a function descriptor, it reads a GOT
> address and code address from the first 8 bytes of the function code,
> and blows up.
>
> If we patch the FDPIC ELF loader to support normal ELF files (this
> should be roughly a 10-line patch) then it would never set the FDPIC
> personality for them to begin with, and no hacks to set it back would
> be needed.

Code/rodata segment sharing is actually really _nice_ for nommu systems. It would be nice if we could get that to work at some point. And then there's that XIP stuff that two different ELC presentations used for Cortex-M, the videos of which are now up at http://elinux.org/ELC_2015_Presentations (I refer to the talks from Jim Huang and Vitaly Wool.)

My point is, running contorted but technically valid ELF on nommu is just a starting point; we eventually want to go beyond that to take advantage of stuff only FDPIC can do.

>> > 5. The brk workaround I'm doing now can't be upstreamed without a
>> >    reliable runtime way to distinguish nommu. To put it in malloc.c
>> >    this would have to be a cross-arch solution. What might make more
>> >    sense is putting it in syscall_arch.h for sh
>>
>> No, this is a general nommu problem. It will also appear on ARM,
>> ColdFire, and BlackFin, which are important targets for MUSL.
>
> Right, but I don't want to hard-code it for these archs either. In
> principle it should be possible to run an i386 binary on a nommu i386
> setup, and if the (hypothetical) kernel had a dangerous/broken brk
> there too, it also needs to be blacklisted. So what I'm looking for,
> if the kernel can't/won't just remove brk support on nommu, is a way
> to detect nommu/broken-brk and prevent it from being used. One
> simple/stupid way is:
>
>     if ((size_t)&local_var - (size_t)cur_brk < MIN_DIST_TO_STACK)
>         // turn off brk support
>
> This would ban use of brk on any system where there's a risk of brk
> extending up into the stack.

Or you can check your ELF header flags and see if you've got the fdpic bit set.
>> > where we already
>> > have to check for SH2 to determine the right trap number; the
>> > inline syscall code can just do if (nr==SYS_brk&&IS_SH2) return 0;
>>
>> I think a look at uClibc is in order. I made it always return failure,
>> after playing with having it return results from malloc(). It’s one of
>> two things that we don’t do (or do poorly) with nommu, the other is the
>> clone() and fork() family being highly restricted.
>
> It's so broken I think it should just be fixed on the kernel side.

I don't know what you mean by "fixed" here. linux/nommu.c currently has:

/*
 * sys_brk() for the most part doesn't need the global kernel
 * lock, except when an application is doing something nasty
 * like trying to un-brk an area that has already been mapped
 * to a regular file. in this case, the unmapping will need
 * to invoke file system routines that need the global lock.
 */
SYSCALL_DEFINE1(brk, unsigned long, brk)
{
	struct mm_struct *mm = current->mm;

	if (brk < mm->start_brk || brk > mm->context.end_brk)
		return mm->brk;

	if (mm->brk == brk)
		return mm->brk;

	/*
	 * Always allow shrinking brk
	 */
	if (brk <= mm->brk) {
		mm->brk = brk;
		return brk;
	}

	/*
	 * Ok, looks good - let it rip.
	 */
	flush_icache_range(mm->brk, brk);
	return mm->brk = brk;
}

If that should be replaced with "return -ENOSYS" we can submit a patch...

Rob