Message-ID: <2236FBA76BA1254E88B949DDB74E612BA4BBA73C@IRSMSX102.ger.corp.intel.com>
Date: Mon, 11 Feb 2019 06:39:04 +0000
From: "Reshetova, Elena" <elena.reshetova@...el.com>
To: Andy Lutomirski <luto@...nel.org>, Jann Horn <jannh@...gle.com>,
	"Perla, Enrico" <enrico.perla@...el.com>
CC: Peter Zijlstra <peterz@...radead.org>,
	"kernel-hardening@...ts.openwall.com" <kernel-hardening@...ts.openwall.com>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"mingo@...hat.com" <mingo@...hat.com>,
	"bp@...en8.de" <bp@...en8.de>,
	"keescook@...omium.org" <keescook@...omium.org>,
	"tytso@....edu" <tytso@....edu>
Subject: RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon
 system call

> On Sat, Feb 9, 2019 at 3:13 AM Reshetova, Elena
> <elena.reshetova@...el.com> wrote:
> >
> > > On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> > > > >
> > > > > Why can't we change the stack offset periodically from an interrupt or
> > > > > so, and then have every later entry use that.
> > > >
> > > > Hm... This sounds more complex conceptually - we cannot touch
> > > > stack when it is in use, so we have to periodically probe for a
> > > > good time (when process is in userspace I guess) to change it from an
> > > > interrupt?
> > > > IMO trampoline stack provides such a good clean place for doing it and we
> > > > have stackleak there doing stack cleanup, so would make sense to keep
> > > > these features operating together.
> > >
> > > The idea was to just change a per-cpu (possible per-task if you ctxsw
> > > it) offset that is used on entry to offset the stack.
> > > So only entries after the change will have the updated offset, any
> > > in-progress syscalls will continue with their current offset and will be
> > > unaffected.
> > >
> > Let me try to write this into simple steps to make sure I understand your
> > approach:
> >
> > - create a new per-stack value (and potentially its per-cpu "shadow") called
> >   stack_offset = 0
> > - periodically issue an interrupt, and inside it walk the process tree and
> >   update stack_offset randomly for each process
> > - when a process makes a new syscall, it subtracts stack_offset value from
> >   top_of_stack()
> >   and that becomes its new top_of_stack() for that system call.
> >
> > Smth like this?
>
> I'm proposing something that is conceptually different.

OK, looks like I fully misunderstand what you meant indeed.
The reason I didn’t reply to your earlier answer is that I started to look
into unwinder code & logic to get at least a slight clue on how things
can be done since I haven't looked into it almost at all before (I wasn't
changing anything with regards to it, so I didn't have to). So, I meant to
come back with a more rigid answer than just "let me study this first"...

> You are,
> conceptually, changing the location of the stack. I'm suggesting that
> you leave the stack alone and, instead, randomize how you use the
> stack.

So, yes, instead of having:

allocated_stack_top
random_offset
actual_stack_top
pt_regs
...
and so on

We will have smth like:

allocated_stack_top = actual_stack_top
pt_regs
random_offset
...

So, conceptually we have the same amount of randomization with
both approaches, but it is applied very differently.

Security-wise I will have to think more if second approach has any negative
consequences, in addition to positive ones. As a paranoid security person,
you might want to merge both approaches and randomize both places (before and
after pt_regs) with different offsets, but I guess this would be out of question, right?

I am not that experienced with exploits, but we have been talking now with Jann
and Enrico on this, so I think it is best that they comment directly here.
I am just wondering if having pt_regs in a fixed place can be an advantage
for an attacker under any scenario...

> In plain C, this would consist of adding roughly this snippet
> in do_syscall_64() and possibly other entry functions:
>
>     if (randomize_stack()) {
>         void *dummy = alloca(rdrand() & 0x7f8);
>
>         /* Make sure the compiler doesn't optimize out the alloca. */
>         asm volatile ("" :: "=rm" (dummy));
>     }
>
>     ... do the actual syscall work here.
>
> This has a few problems, namely that the generated code might be awful
> and that alloca is more or less banned in the kernel. I suppose
> alloca could be unbanned in the entry C code, but this could also be
> done fairly easily in the asm code. You'd just need to use a register
> to store whatever is needed to put RSP back in the exit code.

Yes, I was actually thinking now on doing it in assembly since I think it would
look smaller and clearer in it. Just need to get details slowly in place.

> The obvious way would be to use RBP, but it's plausible that using a
> different callee-saved register would make the unwinder interactions
> easier to get right.

This is what I started looking into, bear with me please, all this stuff is new for my
eyes, so I am slow...

>
> With this approach, you don't modify any of the top_of_stack()
> functions or macros at all -- the top of stack isn't changed.

Yes, understood.

>
> >
> > I think it is close to what Andy has proposed
> > in his reply, but the main difference is that you propose to do this via an interrupt.
> > And the main reasoning for doing this via interrupt would be not to affect
> > syscall performance, right?
> >
> > The problem I see with interrupt approach is how often that should be done?
> > Because we don't want to end up with situation when we issue it too often, since
> > it is not going to be very light-weight operation (update all processes), and we
> > don't want it to be too rarely done that we end up with processes that execute
> > many syscalls with the same offset. So, we might have a situation when some
> > processes will execute a number of syscalls with same offset and some will change
> > their offset more than once without even making a single syscall.
>
> I bet that any attacker worth their salt could learn the offset by
> doing a couple of careful syscalls and looking for cache and/or TLB
> effects. This might make the whole exercise mostly useless. Isn't
> RDRAND supposed to be extremely fast, though?
>
> I usually benchmark like this:
>
> $ ./timing_test_64 10M sys_enosys
> 10000000 loops in 2.53484s = 253.48 nsec / loop
>
> using https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/

Thank you for the pointer! With everyone's suggestions I am now having much
better set of tools to do my next measurements.

Best Regards,
Elena.
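
For readers following the thread, the alloca() idea quoted above can be tried out in user space. What follows is only a minimal sketch under stated assumptions, not the kernel patch: randomized_entry(), syscall_body() and STACK_OFFSET_MASK are hypothetical names made up for this illustration, the random value is passed in by the caller instead of coming from rdrand(), and it assumes GCC or Clang with glibc's <alloca.h> on a platform where the stack grows downward (such as x86-64). The empty asm statement takes dummy as a plain input operand ("r"), which is one way to keep the compiler from dropping the allocation.

```c
#include <alloca.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same mask as in the quoted snippet: 8-byte-aligned offsets from 0 to 2040. */
#define STACK_OFFSET_MASK 0x7f8u

/*
 * Hypothetical stand-in for "the actual syscall work": it only reports
 * the address of one of its own local variables.
 */
static __attribute__((noinline)) uintptr_t syscall_body(void)
{
    volatile char local = 0;

    return (uintptr_t)&local;
}

/*
 * Hypothetical user-space analogue of the entry function. The top of the
 * stack is left alone; a random amount of stack is consumed before the
 * real work runs, so the callee's frame lands at a shifted address.
 * Returns the offset that was requested from alloca() and stores the
 * address observed by syscall_body() in *addr.
 */
__attribute__((noinline)) size_t randomized_entry(unsigned int rnd, uintptr_t *addr)
{
    size_t offset = rnd & STACK_OFFSET_MASK;
    void *dummy = alloca(offset);

    /* The mask clears the low three bits, so the offset is 8-byte aligned. */
    assert((offset & 7) == 0);

    /* Make sure the compiler does not optimize out the alloca. */
    __asm__ volatile ("" :: "r" (dummy) : "memory");

    *addr = syscall_body();
    return offset;
}
```

Calling randomized_entry() once with rnd = 0 and once with rnd = 0x7f8 and comparing the two reported addresses shows the local variable of syscall_body() moving lower on the stack, while the caller's own frame stays where it was. The exact distance depends on how the compiler rounds the alloca() size, so it should not be expected to equal the offset byte for byte.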