Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <xhhkp3aknwwlmrmmqmr532yfq3ioqh6jbkrxfqf6ovlj2szsai@f3sjwakyq323>
Date: Wed, 5 Jul 2023 20:12:01 +0800
From: Ruihan Li <lrh2000@....edu.cn>
To: oss-security@...ts.openwall.com
Cc: Ruihan Li <lrh2000@....edu.cn>
Subject: StackRot (CVE-2023-3269): Linux kernel privilege escalation
 vulnerability

Hi all,

**This serves as an "early" disclosure of the StackRot vulnerability, in
compliance with [the policy of the linux-distros list][po]. While all the
essential vulnerability details have been provided here, the complete exploit
code and a comprehensive write-up will be made publicly available no later than
the end of July. This oss-security thread will be notified, and any
updates will be reflected in [the GitHub repository][gh].**

 [po]: https://oss-security.openwall.org/wiki/mailing-lists/distros
 [gh]: https://github.com/lrh2000/StackRot

A flaw was found in the handling of stack expansion in the Linux kernel 6.1
through 6.4, aka "Stack Rot". The maple tree, responsible for managing virtual
memory areas, can undergo node replacement without properly acquiring the MM
write lock, leading to use-after-free issues. An unprivileged local user could
use this flaw to compromise the kernel and escalate their privileges.

As StackRot is a Linux kernel vulnerability found in the memory management
subsystem, it affects almost all kernel configurations and requires minimal
capabilities to trigger. However, it should be noted that maple nodes are freed
using RCU callbacks, delaying the actual memory deallocation until after the
RCU grace period. Consequently, exploiting this vulnerability is considered
challenging.

To the best of my knowledge, there are currently no publicly available exploits
targeting use-after-free-by-RCU (UAFBR) bugs. This marks the first instance
where UAFBR bugs have been proven to be exploitable, even without the presence
of CONFIG_PREEMPT or CONFIG_SLAB_MERGE_DEFAULT settings. Notably, this exploit
has been successfully demonstrated in the environment provided by [Google kCTF
VRP][ctf] ([bzImage_upstream_6.1.25][img], [config][cfg]).

 [ctf]: https://google.github.io/kctf/vrp.html
 [img]: https://storage.googleapis.com/kctf-vrp-public-files/bzImage_upstream_6.1.25
 [cfg]: https://storage.googleapis.com/kctf-vrp-public-files/bzImage_upstream_6.1.25_config

The StackRot vulnerability has been present in the Linux kernel since version
6.1 when the VMA tree structure was [changed][ch] from red-black trees to maple
trees.

 [ch]: https://lore.kernel.org/lkml/20220906194824.2110408-1-Liam.Howlett@oracle.com/

## Background

Whenever the `mmap()` system call is utilized to establish a memory mapping,
the kernel generates a structure called `vm_area_struct` to represent the
corresponding virtual memory area (VMA). This structure stores various
information including flags, properties, and other pertinent details related to
the mapping.

```c
struct vm_area_struct {
        long unsigned int          vm_start;             /*     0     8 */
        long unsigned int          vm_end;               /*     8     8 */
        struct mm_struct *         vm_mm;                /*    16     8 */
        pgprot_t                   vm_page_prot;         /*    24     8 */
        long unsigned int          vm_flags;             /*    32     8 */
        union {
                struct {
                        struct rb_node rb __attribute__((__aligned__(8))); /*    40    24 */
                        /* --- cacheline 1 boundary (64 bytes) --- */
                        long unsigned int rb_subtree_last; /*    64     8 */
                } __attribute__((__aligned__(8))) shared __attribute__((__aligned__(8))); /*    40    32 */
                struct anon_vma_name * anon_name;        /*    40     8 */
        } __attribute__((__aligned__(8)));               /*    40    32 */
        /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
        struct list_head           anon_vma_chain;       /*    72    16 */
        struct anon_vma *          anon_vma;             /*    88     8 */
        const struct vm_operations_struct  * vm_ops;     /*    96     8 */
        long unsigned int          vm_pgoff;             /*   104     8 */
        struct file *              vm_file;              /*   112     8 */
        void *                     vm_private_data;      /*   120     8 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        atomic_long_t              swap_readahead_info;  /*   128     8 */
        struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   136     0 */

        /* size: 136, cachelines: 3, members: 14 */
        /* forced alignments: 1 */
        /* last cacheline: 8 bytes */
} __attribute__((__aligned__(8)));
```

Subsequently, when the kernel encounters page faults or other memory-related
system calls, it requires fast lookup of the VMA solely based on the address.
Previously, the VMAs were managed using red-black trees. However, starting from
Linux kernel version 6.1, the migration to maple trees took place. [Maple
trees][mt] are RCU-safe B-tree data structures optimized for storing
non-overlapping ranges. Nonetheless, their intricate nature adds complexity to
the codebase and introduces the StackRot vulnerability.

 [mt]: https://docs.kernel.org/6.4/core-api/maple_tree.html

In essence, the maple tree is composed of maple nodes. Throughout this article,
it is assumed that the maple tree has only a single root node, which can
contain a maximum of 16 intervals. Each interval can either represent a gap or
point to a VMA. Thus, no gaps are allowed between two intervals.

```c
struct maple_range_64 {
        struct maple_pnode *       parent;               /*     0     8 */
        long unsigned int          pivot[15];            /*     8   120 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        union {
                void *             slot[16];             /*   128   128 */
                struct {
                        void *     pad[15];              /*   128   120 */
                        /* --- cacheline 3 boundary (192 bytes) was 56 bytes ago --- */
                        struct maple_metadata meta;      /*   248     2 */
                };                                       /*   128   128 */
        };                                               /*   128   128 */

        /* size: 256, cachelines: 4, members: 3 */
};
```

The structure `maple_range_64` represents a maple node in the following manner.
The pivots indicate the endpoints of 16 intervals, while the slots are used to
reference the VMA structure when the node is considered a leaf node. The layout
of pivots and slots can be visualized as shown below:

```
 Slots -> | 0 | 1 | 2 | ... | 12 | 13 | 14 | 15 |
          ┬   ┬   ┬   ┬     ┬    ┬    ┬    ┬    ┬
          │   │   │   │     │    │    │    │    └─ Implied maximum
          │   │   │   │     │    │    │    └─ Pivot 14
          │   │   │   │     │    │    └─ Pivot 13
          │   │   │   │     │    └─ Pivot 12
          │   │   │   │     └─ Pivot 11
          │   │   │   └─ Pivot 2
          │   │   └─ Pivot 1
          │   └─ Pivot 0
          └─  Implied minimum
```

Regarding concurrent modification, the maple tree imposes restrictions,
requiring an exclusive lock to be held by writers. In the case of the VMA tree,
the exclusive lock corresponds to the MM write lock. As for readers, two
options are available. The first option involves holding the MM read lock,
which results in the writer being blocked by the MM read-write lock.
Alternatively, the second option is to enter the RCU critical section. By doing
so, the writer is not blocked, and readers can continue their operations since
the maple tree is RCU-safe. While most existing VMA accesses opt for the first
option, the second option is employed in a few performance-critical scenarios,
such as lockless page faults.

However, there is an additional aspect that requires particular attention,
which pertains to stack expansion. The stack represents a memory area that is
mapped with the MAP_GROWSDOWN flag, indicating automatic expansion when an
address below the region is accessed. In such cases, the start address of the
corresponding VMA is adjusted, as well as the associated interval within the
maple tree. Notably, these adjustments are made without holding the MM write
lock.

```c
static inline
void do_user_addr_fault(struct pt_regs *regs,
                        unsigned long error_code,
                        unsigned long address)
{
	// ...

	if (unlikely(!mmap_read_trylock(mm))) {
		// ...
	}
	// ...
	if (unlikely(expand_stack(vma, address))) {
		// ...
	}

	// ...
}
```

Typically, a gap exists between the stack VMA and its neighboring VMA, as the
kernel enforces a stack guard. In this scenario, when expanding the stack, only
the pivot value in the maple node needs updating, a process that can be
performed atomically. However, if the neighboring VMA also possesses the
MAP_GROWSDOWN flag, no stack guard is enforced. Consequently, the stack
expansion can eliminate the gap. In such situations, the gap interval within
the maple node must be removed. As the maple tree is RCU-safe, overwriting the
node in-place is not possible. Instead, a new node is created, triggering node
replacement, and the old node is subsequently destroyed using an RCU callback.

```c
int expand_downwards(struct vm_area_struct *vma, unsigned long address)
{
	// ...

	if (prev) {
		if (!(prev->vm_flags & VM_GROWSDOWN) &&
		    vma_is_accessible(prev) &&
		    (address - prev->vm_end < stack_guard_gap))
			return -ENOMEM;
	}

	// ...
}
```

The RCU callback is invoked only after all pre-existing RCU critical sections
have concluded. However, the issue arises when accessing VMAs, as only the MM
read lock is held, and it does not enter the RCU critical section.
Consequently, in theory, the callback could be invoked at any time, resulting
in the freeing of the old maple node. However, pointers to the old node may
have already been fetched, leading to a use-after-free bug when attempting
subsequent access to it.

The backtrace where use-after-free (UAF) occurs is shown below:

```
  - CPU 0 -                                        - CPU 1 -

  mm_read_lock()                                    mm_read_lock()
  expand_stack()                                    find_vma_prev()
    expand_downwards()                                mas_walk()
      mas_store_prealloc()                              mas_state_walk()
        mas_wr_story_entry()                              mas_start()
          mas_wr_modify()                                   mas_root()
            mas_wr_store_node()                               node = rcu_dereference_check()
              mas_replace()                                   [ The node pointer is recorded ]
                mas_free()
                  ma_free_rcu()
                    call_rcu(&mt_free_rcu)
                    [ The node is dead ]
  mm_read_unlock()

  [ Wait for the next RCU grace period.. ]
  rcu_do_batch()                                      mas_prev()
    mt_free_rcu()                                       mas_prev_entry()
      kmem_cache_free()                                   mas_prev_nentry()
      [ The node is freed ]                                 mas_slot()
                                                              mt_slot()
                                                                rcu_dereference_check(node->..)
                                                                [ UAF occurs here ]
                                                    mm_read_unlock()
```

## Fix

I reported this vulnerability to the Linux kernel security team on June 15th.
Following that, the process of addressing this bug was led by Linus Torvalds.
Given its complexity, it took nearly two weeks to develop a set of patches that
received consensus.

On June 28th, during the merge window for Linux kernel 5.5, the fix was merged
into Linus' tree. Linus provided a [comprehensive merge message][fix] to
elucidate the patch series from a technical perspective.

 [fix]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9471f1f2f50282b9e8f59198ec6bb738b4ccc009

These patches were subsequently backported to stable kernels ([6.1.37][6.1],
[6.3.11][6.3], and [6.4.1][6.4]), effectively resolving the "Stack Rot" bug on
July 1st.

 [6.1]: https://lore.kernel.org/stable/2023070133-create-stainless-9a8c@gregkh/T/
 [6.3]: https://lore.kernel.org/stable/2023070146-endearing-bounding-d21a@gregkh/T/
 [6.4]: https://lore.kernel.org/stable/2023070140-eldercare-landlord-133c@gregkh/T/

## Exploit

**The complete exploit code and a comprehensive write-up will be made publicly
available no later than the end of July.**

Thanks,
Ruihan Li

Powered by blists - more mailing lists

Please check out the Open Source Software Security Wiki, which is counterpart to this mailing list.

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.