kernel-hardening - Re: afaef01c00 ("x86/entry: Add STACKLEAK erasing the kernel stack .."): double fault: 0000 [#1]

Message-ID: <e902f017-43f3-2532-69ad-3cdfca6d75c8@linux.com>
Date: Sat, 10 Nov 2018 02:09:09 +0300
From: Alexander Popov <alex.popov@...ux.com>
To: Andy Lutomirski <luto@...capital.net>, Jann Horn <jannh@...gle.com>,
 Joerg Roedel <joro@...tes.org>
Cc: Andy Lutomirski <luto@...nel.org>, Kees Cook <keescook@...omium.org>,
 Ingo Molnar <mingo@...nel.org>, Thomas Gleixner <tglx@...utronix.de>,
 lkp@...org, lkp@...el.com,
 Kernel Hardening <kernel-hardening@...ts.openwall.com>,
 linux-doc@...r.kernel.org, kernel list <linux-kernel@...r.kernel.org>,
 Dave Hansen <dave.hansen@...ux.intel.com>
Subject: Re: afaef01c00 ("x86/entry: Add STACKLEAK erasing the kernel stack
 .."): double fault: 0000 [#1]


On 09.11.2018 23:46, Andy Lutomirski wrote:
>> On Nov 9, 2018, at 12:06 PM, Jann Horn <jannh@...gle.com> wrote:
>>
>> +Andy, Thomas, Ingo
>>
>>> On Fri, Nov 9, 2018 at 2:24 PM kernel test robot <lkp@...el.com> wrote:
>>> 0day kernel testing robot got the below dmesg and the first bad commit is
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>>>
>>> commit afaef01c001537fa97a25092d7f54d764dc7d8c1
>>> Author:     Alexander Popov <alex.popov@...ux.com>
>>> AuthorDate: Fri Aug 17 01:16:58 2018 +0300
>>> Commit:     Kees Cook <keescook@...omium.org>
>>> CommitDate: Tue Sep 4 10:35:47 2018 -0700
>>>
>>>    x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls
>> [...]
>>> [  127.808225] double fault: 0000 [#1]
>>> [  127.808695] CPU: 0 PID: 414 Comm: trinity-main Tainted: G                T 4.19.0-rc2-00001-gafaef01 #1
>>> [  127.809799] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
>>> [  127.810760] RIP: 0010:ftrace_ops_test+0x27/0xa0
>>> [  127.811289] Code: eb 9a 90 41 54 55 49 89 f4 53 48 89 d3 48 89 fd 48 81 ec b0 00 00 00 65 48 8b 04 25 28 00 00 00 48 89 84 24 a8 00 00 00 31 c0 <e8> 54 df ff ff 48 85 db 74 57 e8 4a df ff ff 48 8b 85 d0 00 00 00
>>> [  127.813385] RSP: 0018:fffffe0000001fb8 EFLAGS: 00010046
>> [...]
>>> [  127.819762] CR2: fffffe0000001fa8 CR3: 000000001579a000 CR4: 00000000000006b0
>> [...]
>>> [  127.822234] Call Trace:
>>> [  127.822530]  <ENTRY_TRAMPOLINE>
>>> [  127.822914]  ? __ia32_sys_rseq+0x2f0/0x2f0
>>> [  127.823395]  ftrace_ops_list_func+0xa5/0x1b0
>>> [  127.823922]  ftrace_call+0x5/0x34
>>> [  127.824318]  ? stackleak_erase+0x5/0xf0
>>> [  127.824789]  ? stackleak_erase+0x43/0xf0
>>> [  127.825260]  stackleak_erase+0x5/0xf0
>>> [  127.825699]  syscall_return_via_sysret+0x61/0x81
>>> [  127.826238] WARNING: stack recursion on stack type 4
>>> [  127.826243] WARNING: can't dereference registers at (____ptrval____) for ip syscall_return_via_sysret+0x61/0x81
>>> [  127.826246]  </ENTRY_TRAMPOLINE>
>>> [  127.828342] ---[ end trace e9f96d3f45575499 ]---
>>> [  127.828911] RIP: 0010:ftrace_ops_test+0x27/0xa0
>>
>> CR2: fffffe0000001fa8, RSP: 0018:fffffe0000001fb8; this is a pagefault
>> on the stack. fffffe0000000000 is CPU_ENTRY_AREA_RO_IDT;
>> fffffe0000001000 is CPU_ENTRY_AREA_PER_CPU; so fffffe0000002000 is the
>> page with the entry stack for cpu 0, and you overflowed from that into
>> the readonly gdt at fffffe0000001000, which doubles as a guard page
>> for the entry stack:
>>
>> struct cpu_entry_area {
>>        char gdt[PAGE_SIZE];
>>
>>        /*
>>         * The GDT is just below entry_stack and thus serves (on x86_64) as
>>         * a read-only guard page.
>>         */
>>        struct entry_stack_page entry_stack_page;
>> [...]
>> };
>>
>> In other words: You're calling C code on the entry trampoline stack;
>> this C code can call into ftrace; and the entry trampoline stack isn't
>> big enough for ftrace shenanigans. I think you probably shouldn't be
>> calling C code on the entry stack, but maybe one of the X86 folks has
>> a different opinion?
>
> My opinion was that, on x86_32, the entry stack ought to be fairly large so
> that NMIs could execute on the entry stack.  I don’t remember what the code
> actually does, though.
> 
> But stackleak_erase should probably not run on the entry stack. That seems
> like it’s just asking for trouble.

Hello Jann and Andy,


The stackleak_erase() function is called on the trampoline stack at the end of
a syscall. It erases the used part of the kernel thread stack after the syscall
has been handled.


I've reproduced such a double fault with function tracing for stackleak_erase():

  # mount -t tracefs nodev /sys/kernel/tracing
  # echo 'p:myprobe stackleak_erase' > /sys/kernel/debug/tracing/kprobe_events
  # echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable


I think we should simply not allow function tracing for stackleak_*() functions:

diff --git a/kernel/Makefile b/kernel/Makefile
index 7343b3a..0906f6d 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_MULTIUSER) += groups.o
 ifdef CONFIG_FUNCTION_TRACER
 # Do not trace internal ftrace files
 CFLAGS_REMOVE_irq_work.o = $(CC_FLAGS_FTRACE)
+CFLAGS_REMOVE_stackleak.o = $(CC_FLAGS_FTRACE)
 endif


With this patch, setting a kprobe event for stackleak_erase() is no longer
allowed. This is the corresponding dmesg output:
  [   75.660478] trace_kprobe: Could not probe notrace function stackleak_erase


If you agree, I'll prepare the patch for LKML.

Best regards,
Alexander