kernel-hardening - Re: [RFC] syscalls: Restore address limit after a syscall

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALCETrVS6J2_GPWBoxUFUhjdyce0TBdroA+FjfdNywr9_k6hew@mail.gmail.com>
Date: Thu, 9 Feb 2017 18:42:34 -0800
From: Andy Lutomirski <luto@...capital.net>
To: Thomas Garnier <thgarnie@...gle.com>
Cc: Kees Cook <keescook@...omium.org>, Dave Hansen <dave.hansen@...el.com>, 
	Arnd Bergmann <arnd@...db.de>, René Nyffenegger <mail@...enyffenegger.ch>, 
	Stephen Bates <stephen.bates@...s.com>, Jeff Moyer <jmoyer@...hat.com>, 
	Milosz Tanski <milosz@...in.com>, Linux API <linux-api@...r.kernel.org>, 
	LKML <linux-kernel@...r.kernel.org>, 
	"kernel-hardening@...ts.openwall.com" <kernel-hardening@...ts.openwall.com>, Will Deacon <will.deacon@....com>, 
	"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>, 
	"linux-s390@...r.kernel.org" <linux-s390@...r.kernel.org>
Subject: Re: [RFC] syscalls: Restore address limit after a syscall

On Thu, Feb 9, 2017 at 3:41 PM, Thomas Garnier <thgarnie@...gle.com> wrote:
> On Thu, Feb 9, 2017 at 3:05 PM, Andy Lutomirski <luto@...capital.net> wrote:
>> On Thu, Feb 9, 2017 at 11:31 AM, Kees Cook <keescook@...omium.org> wrote:
>>> On Thu, Feb 9, 2017 at 10:33 AM, Thomas Garnier <thgarnie@...gle.com> wrote:
>>>> This patch prevents a syscall to modify the address limit of the
>>>> caller. The address limit is kept by the syscall wrapper and restored
>>>> just after the syscall ends.
>>>>
>>>> For example, it would mitigation this bug:
>>>>
>>>> - https://bugs.chromium.org/p/project-zero/issues/detail?id=990
>>>>
>>>> Signed-off-by: Thomas Garnier <thgarnie@...gle.com>
>>>> ---
>>>> Based on next-20170209
>>>> ---
>>>>  include/linux/syscalls.h | 5 ++++-
>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>>>> index 91a740f6b884..a1b6a62a9849 100644
>>>> --- a/include/linux/syscalls.h
>>>> +++ b/include/linux/syscalls.h
>>>> @@ -198,7 +198,10 @@ extern struct trace_event_functions exit_syscall_print_funcs;
>>>>         asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__));      \
>>>>         asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__))       \
>>>>         {                                                               \
>>>> -               long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__));  \
>>>> +               long ret;                                               \
>>>> +               mm_segment_t fs = get_fs();                             \
>>>> +               ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__));       \
>>>> +               set_fs(fs);                                             \
>>>>                 __MAP(x,__SC_TEST,__VA_ARGS__);                         \
>>>>                 __PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__));       \
>>>>                 return ret;                                             \
>>>> --
>>>> 2.11.0.483.g087da7b7c-goog
>>>>
>>>
>>> I have a memory of Andy looking at this before, and there was some
>>> problem with how a bunch of compat code would set fs and then re-call
>>> the syscall... but I can't quite find the conversation. Andy, do you
>>> remember the details?
>>>
>>> This seems like an entirely reasonable thing to enforce for syscalls,
>>> though I'm sure there's a gotcha somewhere. :)
>>
>> This sounds vaguely familiar, but that's about all.
>>
>> Anyway, it seems reasonable that the SyS_foobar wrappers are genuinely
>> only used for syscalls and not for other things, so the code should
>> *work*.  That being said, I think there's room for several
>> improvements.
>>
>> 1. Why save the old "fs" value?  For that matter, why restore it?
>> IOW, I'd rather see BUG_ON(get_fs() != USER_DS) at the end.
>>
>
> I guess that make sense in the wrapper.
>
>> 2. I'd rather see the mechanism be more general.  If we had, effectively:
>>
>> asmlinkage long SyS_foo(...) {
>>   sys_foo();
>>   verify_pre_usermode_state();
>> }
>>
>> and let verify_pre_usermode_state() potentially do more things, we'd
>> get a more flexible mechanism.  On arches like x86_32, we could save a
>> decent amount of code size by moving verify_pre_usermode_state() into
>> prepare_exit_to_usermode(), but that would have to be a per-arch
>> opt-in.  x86_64 probably would *not* select this due to the fast path
>> (or it would do it in asm.  hmm.).
>>
>
> I will look into that. I like this design better.
>
>> 3. If this thing gets factored out, then arch code can call it for
>> non-syscall entries, too.
>>
>
> Yes, it makes sense.
>
>> 4. Can we make this configurable?
>>
>>
>> For x86, a nice implementation might be:
>>
>> select ARCH_NO_SYSCALL_VERIFY_PRE_USERMODE_STATE
>>
>> ... in prepare_exit_to_usermode():
>>
>> verify_pre_usermode_state();  // right at the beginning
>>
>> ... in the asm syscall fast path:
>>
>> #ifdef CONFIG_VERIFY_PRE_USERMODE_STATE
>> call verify_pre_usermode_staet
>> #endif
>>
>> (or just inline the interesting bit)
>>
>
> So by default it is in the wrapper. If selected, an architecture can
> disable the wrapper put it in the best places. Understood correctly?

Sounds good to me.

Presumably the result should go through -mm.  Want to cc: akpm and
linux-arch@ on the next version?

I've also cc'd arm and s390 folks -- those are the other arches that
try to be on top of hardening.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.