Message-ID: <CALCETrUuG0-tGNQ5iAEO2_gaK1eUq7AoALoBeQKcOP8cvxr=eA@mail.gmail.com>
Date: Wed, 22 Jun 2016 18:22:17 -0700
From: Andy Lutomirski <luto@...capital.net>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Andy Lutomirski <luto@...nel.org>,
	"the arch/x86 maintainers" <x86@...nel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
	Borislav Petkov <bp@...en8.de>,
	Nadav Amit <nadav.amit@...il.com>,
	Kees Cook <keescook@...omium.org>,
	Brian Gerst <brgerst@...il.com>,
	"kernel-hardening@...ts.openwall.com" <kernel-hardening@...ts.openwall.com>,
	Josh Poimboeuf <jpoimboe@...hat.com>,
	Jann Horn <jann@...jh.net>,
	Heiko Carstens <heiko.carstens@...ibm.com>
Subject: Re: [PATCH v3 00/13] Virtually mapped stacks with guard pages (x86, core)

On Mon, Jun 20, 2016 at 9:01 PM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
> On Mon, Jun 20, 2016 at 4:43 PM, Andy Lutomirski <luto@...nel.org> wrote:
>>
>> On my laptop, this adds about 1.5µs of overhead to task creation,
>> which seems to be mainly caused by vmalloc inefficiently allocating
>> individual pages even when a higher-order page is available on the
>> freelist.
>
> I really think that problem needs to be fixed before this should be merged.
>
> The easy fix may be to just have a very limited re-use of these stacks
> in generic code, rather than try to do anything fancy with multi-page
> allocations. Just a few of these allocations held in reserve (perhaps
> make the allocations percpu to avoid new locks).

I implemented a percpu cache, and it's useless.

When a task goes away, one reference is held until the next RCU grace
period so that the task_struct can be used under RCU (look for
delayed_put_task_struct). This means that free_task gets called in
giant batches under heavy clone() load (the only time any of this
matters), so we only get to refill the cache once per RCU batch, which
means there's very little benefit.

Once thread_info stops living in the stack, we could, in principle,
exempt the stack itself from RCU protection, thus saving a bit of
memory under load and making the cache work. I've started working on
(optionally, per-arch) getting rid of the on-stack thread_info, but
that's not ready yet.

FWIW, the same issue quite possibly hurts non-vmap-stack performance
as well, as it makes it much less likely that a cache-hot stack gets
immediately reused under heavy fork load.

So may I skip this for now? I think the performance hit is unlikely to
matter on most workloads, and I also expect the speedup from not using
higher-order allocations to be a decent win on some workloads.

--Andy
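
For concreteness, a minimal sketch of the kind of percpu stack cache
discussed above, not the actual implementation from the thread; the
names NR_CACHED_STACKS and cached_stacks are assumptions, and the
fallback path assumes the __vmalloc_node_range() allocation that the
vmap-stack series uses for thread stacks:

	/*
	 * Sketch only: a tiny per-CPU cache of virtually mapped thread
	 * stacks, refilled when stacks are freed and drained when new
	 * tasks are created.
	 */
	#include <linux/percpu.h>
	#include <linux/vmalloc.h>
	#include <linux/sched.h>
	#include <linux/gfp.h>

	#define NR_CACHED_STACKS 2

	static DEFINE_PER_CPU(void *, cached_stacks[NR_CACHED_STACKS]);

	static void *alloc_thread_stack_node(int node)
	{
		int i;

		/* Try to reuse a recently freed stack from this CPU's cache. */
		for (i = 0; i < NR_CACHED_STACKS; i++) {
			void *stack = this_cpu_xchg(cached_stacks[i], NULL);

			if (stack)
				return stack;
		}

		/* Cache miss: allocate a fresh virtually mapped stack. */
		return __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
					    VMALLOC_START, VMALLOC_END,
					    GFP_KERNEL | __GFP_HIGHMEM,
					    PAGE_KERNEL, 0, node,
					    __builtin_return_address(0));
	}

	static void free_thread_stack(void *stack)
	{
		int i;

		/* Park the stack in an empty per-CPU slot if one is free. */
		for (i = 0; i < NR_CACHED_STACKS; i++) {
			if (this_cpu_cmpxchg(cached_stacks[i], NULL, stack) == NULL)
				return;
		}

		vfree(stack);
	}

Because free_thread_stack() only runs once the RCU-delayed put drops
the last task_struct reference, the slots refill in bursts rather than
in step with clone(), which is why the cache sees so few hits under
fork-heavy load.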