Message-ID: <2025-11-05-remember-remember-the-fifth-of-november-3EtRdS@cyphar.com>
Date: Wed, 5 Nov 2025 20:53:08 +1100
From: Aleksa Sarai <cyphar@...har.com>
To: oss-security@...ts.openwall.com, fulldisclosure@...lists.org
Subject: runc container breakouts via procfs writes: CVE-2025-31133, CVE-2025-52565, and CVE-2025-52881

| NOTE: This advisory was sent to <security-announce@...ncontainers.org>
| on 2025-10-16. If you ship any Open Container Initiative software, we
| highly recommend that you subscribe to our security-announce list in
| order to receive more timely disclosures of future security issues.
| The procedure for subscribing to security-announce is outlined here:
| <https://github.com/opencontainers/.github/blob/main/SECURITY.md#disclosure-distribution-list>

Hello,

This is a notification to vendors that use or ship runc about THREE (3) high-severity vulnerabilities (CVE-2025-31133, CVE-2025-52565, and CVE-2025-52881). All three vulnerabilities ultimately allow (through different methods) for full container breakouts by bypassing runc's restrictions for writing to arbitrary /proc files.

Today we have released the following runc releases which include more than 20 patches to resolve these issues:

 * runc v1.4.0-rc.3 <https://github.com/opencontainers/runc/releases/tag/v1.4.0-rc.3>
 * runc v1.3.3 <https://github.com/opencontainers/runc/releases/tag/v1.3.3>
 * runc v1.2.8 <https://github.com/opencontainers/runc/releases/tag/v1.2.8>

We strongly recommend you update as soon as possible.

For your own reference I have attached a tarball of the patches (which apply cleanly on top of runc v1.2.7, v1.3.2 and v1.4.0-rc.2). Unfortunately the patches are quite large as they required a lot of development work in github.com/cyphar/filepath-securejoin along with quite deep changes to runc. I would recommend just going with the released versions.

Note that these patches have not been split into per-CVE patches, as the resolutions for each issue overlap and so some patches help resolve more than one CVE on the list. We strongly recommend simply applying all of the provided patches (we have included a squashed single-patch version for your convenience -- see v1.[234].patch).

| **NOTE**:
| Some vendors were given a pre-release version of this release.
| These public releases include two extra patches to fix regressions
| discovered very late during the embargo period and were thus not
| included in the pre-release versions. Please update to this version.
| The above tarball includes these extra patches as well.

/*** Vulnerabilities ***/

Below is a break-down of the key points of each issue. Once these vulnerabilities are made public on the embargo date, the linked advisory pages will contain some more information about the issues.

Please note that while these issues are generally related, the available mitigations (if any) vary from issue to issue. However, all of these attacks rely on starting containers with custom mount configurations -- if you do not run untrusted container images from unknown or unverified sources then these attacks would not be possible to exploit. Note that Dockerfiles support custom mount configurations (with RUN --mount=...) and so these issues are also exploitable from Dockerfiles.

Also please note that the below CVSS scores are based on the threat model from *runc's point of view*. If you were to analyse the same vulnerability from the perspective of network-enabled systems like Docker or Kubernetes you would likely end up with a much higher severity.

/* CVE-2025-31133 */
"container escape via 'masked path' abuse due to mount race conditions"
CVSS:4.0/AV:L/AC:L/AT:P/PR:L/UI:A/VC:H/VI:H/VA:H/SC:H/SI:H/SA:H (7.3)
<https://github.com/opencontainers/runc/security/advisories/GHSA-9493-h29p-rfm2>

CVE-2025-31133 exploits an issue with how masked paths are implemented in runc.
When masking files, runc will bind-mount the container's /dev/null inode on top of the file. However, if an attacker can replace /dev/null with a symlink to some other procfs file, runc will instead bind-mount the symlink target read-write. This issue affects all known runc versions.

This stage happens after pivot_root(2) and so cannot be used to bind-mount host files directly. However, paths like /proc/sys/kernel/core_pattern can be used to break out of a container entirely (coredump helpers are spawned as upcalls, which are not namespaced and have full host privileges). /proc/sysrq-trigger can also be used by an attacker to cause the host system to crash or halt. (This is "Attack 1".)

While developing a fix for this issue, we also discovered that if the attacker instead deleted /dev/null, runc would purposefully ignore the error and thus make maskedPath a no-op. This is slightly less serious, but it would permit some information disclosure through masked files like /proc/kcore and /proc/timer_list. (This is "Attack 2".)

Potential mitigations for this issue include:

 * Using user namespaces, with the host root user not mapped into the container's namespace. procfs file permissions are managed using Unix DAC and thus user namespaces stop a container process from being able to write to them.
 * Not running as a root user in the container (this includes disabling setuid binaries with noNewPrivileges). As above, procfs file permissions are managed using Unix DAC and thus non-root users cannot write to them.
 * Depending on the maskedPath configuration (the default configuration only masks paths in /proc and /sys), using an AppArmor profile that blocks unexpected writes to any maskedPaths (as is the case with the default profile used by Docker and Podman) will block attempts to exploit this issue. However, CVE-2025-52881 allows an attacker to bypass LSM labels, and so this mitigation is not helpful when considered in combination with CVE-2025-52881.
 * Based on our analysis, SELinux will NOT help mitigate this issue -- the /dev/null bind-mount used for maskedPaths gets re-labeled to the container context and thus the container will have access to them.

Thanks to Lei Wang (@ssst0n3 from Huawei) for finding and reporting the original vulnerability (Attack 1), and Li Fubang (@lifubang from acmcoder.com, CIIC) for discovering another attack vector (Attack 2) based on @ssst0n3's initial findings.

/* CVE-2025-52565 */
"container escape with malicious config due to /dev/console mount and related races"
CVSS:4.0/AV:L/AC:L/AT:P/PR:L/UI:A/VC:H/VI:H/VA:H/SC:H/SI:H/SA:H (7.3)
<https://github.com/opencontainers/runc/security/advisories/GHSA-qw9x-cqr3-wc7r>

CVE-2025-52565 is very similar in concept and application to CVE-2025-31133, except that it exploits a flaw in /dev/console bind-mounts. When creating the /dev/console bind-mount (to /dev/pts/$n), if an attacker replaces /dev/pts/$n with a symlink then runc will bind-mount the symlink target over /dev/console. This issue affects all versions of runc >= 1.0.0-rc3.

As with CVE-2025-31133, this happens after pivot_root(2) and so cannot be used to bind-mount host files directly, but an attacker can trick runc into creating a read-write bind-mount of /proc/sys/kernel/core_pattern or /proc/sysrq-trigger, leading to a complete container breakout (as with CVE-2025-31133).

While developing a fix for this issue, we also found some potentially concerning issues with os.Create usage (which may have allowed for host files to be truncated by an attacker) -- though we deemed these issues to not be exploitable, we have provided fixes for them. In addition, some previously known issues with /dev/pts/$n race conditions were re-analysed and we have included mitigations for them too (even though we still feel these are mostly hypothetical issues).

Potential mitigations for this issue include:

 * Using user namespaces, with the host root user not mapped into the container's namespace.
   procfs file permissions are managed using Unix DAC and thus user namespaces stop a container process from being able to write to them.
 * Not running as a root user in the container (this includes disabling setuid binaries with noNewPrivileges). As above, procfs file permissions are managed using Unix DAC and thus non-root users cannot write to them.
 * The default SELinux policy should mitigate this issue, as the /dev/console bind-mount does not re-label the mount and so the container process should not be able to write to unsafe procfs files. However, CVE-2025-52881 allows an attacker to bypass LSM labels, and so this mitigation is not helpful when considered in combination with CVE-2025-52881.
 * The default AppArmor profile used by most runtimes will NOT help mitigate this issue, as /dev/console access is permitted. You could create a custom profile that blocks access to /dev/console, but such a profile might break regular containers. In addition, CVE-2025-52881 allows an attacker to bypass LSM labels, and so that mitigation is not helpful when considered in combination with CVE-2025-52881.

Known Issues:

 * We are aware of an issue with our mitigation for this attack and certain configurations

Thanks to Lei Wang (@ssst0n3 from Huawei) and Li Fubang (@lifubang from acmcoder.com, CIIC) for discovering and reporting the main /dev/console bind-mount vulnerability, as well as Aleksa Sarai (@cyphar from SUSE) for discovering the related issues mentioned above as well as the original research into these classes of issues several years ago.

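
Both CVE-2025-31133 and CVE-2025-52565 come down to bind-mounting a path without first confirming what it resolves to. As a rough illustration only (this is not runc's actual patch, and openVerifiedNull is a hypothetical helper name), the sketch below opens a candidate /dev/null without following a trailing symlink and then checks on the open file descriptor that it is the character device with major 1, minor 3. The same idea could be applied to /dev/pts/$n. Note that O_NOFOLLOW only protects the final path component, so a real fix also needs scoped path resolution and fd-based mounts.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// Linux device numbers for the null device (major 1, minor 3).
const (
	nullMajor = 1
	nullMinor = 3
)

// major and minor decode a Linux dev_t using the glibc bit layout.
func major(dev uint64) uint32 {
	return uint32((dev&0x00000000000fff00)>>8) | uint32((dev&0xfffff00000000000)>>32)
}

func minor(dev uint64) uint32 {
	return uint32(dev&0x00000000000000ff) | uint32((dev&0x00000ffffff00000)>>12)
}

// openVerifiedNull (hypothetical helper) opens path without following a
// trailing symlink and verifies, on the open descriptor rather than on the
// path, that it really is the null character device.
func openVerifiedNull(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_RDONLY|syscall.O_NOFOLLOW, 0)
	if err != nil {
		return nil, fmt.Errorf("open %s: %w", path, err)
	}
	fi, err := f.Stat()
	if err != nil {
		f.Close()
		return nil, err
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok || fi.Mode()&os.ModeCharDevice == 0 ||
		major(uint64(st.Rdev)) != nullMajor || minor(uint64(st.Rdev)) != nullMinor {
		f.Close()
		return nil, fmt.Errorf("%s is not the null device", path)
	}
	return f, nil
}

func main() {
	// /proc/self/exe is a symlink, so it stands in for a swapped-out /dev/null.
	for _, p := range []string{"/dev/null", "/dev/zero", "/proc/self/exe"} {
		f, err := openVerifiedNull(p)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("ok:", p)
		f.Close()
	}
}
```

A caller would then mount from the verified descriptor rather than re-resolving the path, since re-resolving would reopen the race window.
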
/* CVE-2025-52881 */
"container escape and denial of service due to arbitrary write gadgets and procfs write redirects"
CVSS:4.0/AV:L/AC:L/AT:P/PR:L/UI:A/VC:H/VI:H/VA:H/SC:H/SI:H/SA:H (7.3)
<https://github.com/opencontainers/runc/security/advisories/GHSA-cgrx-mc8f-2prm>

This attack is a more sophisticated variant of CVE-2019-19921 (and of the closely related CVE-2019-16884), which was a flaw that allowed an attacker to trick runc into writing the LSM process labels for a container process into a dummy tmpfs file and thus not apply the correct LSM labels to the container process. The mitigation we applied for CVE-2019-19921 was fairly limited and effectively only caused runc to verify, when writing LSM labels, that the label files are actual procfs files. This issue affects all known runc versions.

Rather than using a fake tmpfs file for /proc/self/attr/<label>, an attacker could instead (through various means) make /proc/self/attr/<label> reference a real procfs file, but one that would still be a no-op (such as /proc/self/sched). This would have the same effect but would pass the "is a procfs file" check. We were aware that this kind of attack would be possible (even going so far as to discuss this publicly as "future work" at conferences), and we were working on a far more comprehensive mitigation of this attack, but this security issue was disclosed before we could complete this work.

This attack pairs well with CVE-2025-31133 and CVE-2025-52565, as the most basic version described above acts as an LSM bypass that makes it easy for an attacker to write to procfs files and break out of a container.

However, rather than just making the write a no-op, the attacker could instead redirect the write to a more malicious target (such as /proc/sysrq-trigger to crash the host machine). In addition, sysctl writes could be similarly redirected, so it is plausible an attacker would be able to provide a custom payload to write, allowing for a /proc/sys/kernel/core_pattern-based full container breakout.

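
To make the limitation of an "is a procfs file" check concrete, here is a minimal sketch of such a check, assuming it is done with fstatfs(2) and PROC_SUPER_MAGIC. The helper name isProcfsFile is hypothetical and this is not runc's code. The check only asks which filesystem the opened file lives on, not which procfs file it is.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// procSuperMagic is PROC_SUPER_MAGIC from <linux/magic.h>.
const procSuperMagic = 0x9fa0

// isProcfsFile (hypothetical helper) reports whether path, once opened,
// lives on a procfs filesystem. It says nothing about WHICH procfs file
// was opened.
func isProcfsFile(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	var st syscall.Statfs_t
	if err := syscall.Fstatfs(int(f.Fd()), &st); err != nil {
		return false, err
	}
	return int64(st.Type) == procSuperMagic, nil
}

func main() {
	// Some of these paths may not exist on every kernel configuration;
	// in that case the error is printed instead.
	for _, p := range []string{"/proc/self/attr/current", "/proc/self/sched", "/"} {
		ok, err := isProcfsFile(p)
		fmt.Println(p, ok, err)
	}
}
```

On a typical Linux system both /proc/self/attr/current and /proc/self/sched satisfy this check, which is why a redirected write can still get past it.
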
This led us to do a complete audit of all write operations in runc, as any write operation could potentially be redirected in a similar way -- we did not find any more problematic writes in our analysis but we are still investigating the possibility of using lints or static analysis to detect this kind of issue.

Potential mitigations for this issue include:

 * Using rootless containers, as doing so will block most of the inadvertent writes (runc would run with reduced privileges, making attempts to write to procfs files ineffective).
 * Based on our analysis, neither AppArmor nor SELinux can protect against the full version of the redirected write attack. The container runtime is generally privileged enough to write to arbitrary procfs files, which is more than sufficient to cause a container breakout. With SELinux, it is *possible* that the container_runtime_t label applied to runc will restrict how much runc can do with the no-op variant of the attack, but it seems to us that the /proc/sysrq-trigger host crash and /proc/sys/kernel/core_pattern container breakout attacks would still work.

Thanks to Li Fubang (@lifubang from acmcoder.com, CIIC) and Tõnis Tiigi (@tonistiigi from Docker) for both independently discovering this vulnerability, as well as Aleksa Sarai (@cyphar from SUSE) for the original research into this class of security issues and solutions over the past few years.

/*** Other Container Runtimes ***/

These issues are all very easy-to-make logic flaws, and as such we contacted several other container runtimes to alert them of these issues and provide them our analysis. Our current understanding is that youki and crun have similar flaws and are working on patches to be released in co-ordination with this advisory. LXC appears to have some similar bugs but their security policy is (understandably) that non-user-namespaced containers are fundamentally insecure and thus such exploits are not security issues.

If you use a container runtime other than runc, please check whether upstream has released a security update addressing these (or similar) issues once this issue becomes public.

If you are a container runtime author that we did not contact, please get in touch with me at <cyphar@...har.com> to get added to the cross-runtime security group. Please note that this group is intended for *low-level* container runtime *upstream maintainers* only.

/*** Extra Patches ***/

There were three issues with these patches which we became aware of quite late in the embargo process. We have included new patches in the released versions linked above to address two of them, but these patches were not included in the pre-release tarballs provided to vendors:

 * *00*-openat2-improve-resilience-on-busy-systems.patch
 * *00*-rootfs-re-allow-dangling-symlinks-in-mount-targets.patch

Note that these are *NOT* security issues, they are usability regressions that may affect some users depending on what images they use and what kind of systems they run their containers on.

Below is the description provided to vendors, for your own reference, but the issues listed have been fixed (with the exception of the last issue, which is still being investigated).

/* openat2 EAGAIN Retry Failures */

openat2 will return -EAGAIN if there was a racing rename or mount when trying to walk into ".." during a scoped lookup. On systems with heavy load, this can happen fairly frequently. In the version of the patches we merged, runc would retry every openat2 operation up to 32 times before failing with an error in order to mitigate this while also avoiding denial-of-service attacks.
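
The bounded-retry behaviour described above can be sketched roughly as follows. This is an illustration, not the code from runc or filepath-securejoin: retryEAGAIN, flaky, and errPossibleAttack are hypothetical names, and the operation being retried is simulated rather than a real openat2 call.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// maxRetries mirrors the 32-attempt limit described in the text.
const maxRetries = 32

// errPossibleAttack is a hypothetical sentinel error for this sketch.
var errPossibleAttack = errors.New("possible attack detected")

// retryEAGAIN calls op until it returns something other than EAGAIN, giving
// up after maxRetries attempts so that an attacker cannot keep the caller
// spinning forever. It returns the number of attempts made.
func retryEAGAIN(op func() error) (int, error) {
	for attempt := 1; attempt <= maxRetries; attempt++ {
		err := op()
		if !errors.Is(err, syscall.EAGAIN) {
			return attempt, err
		}
	}
	return maxRetries, errPossibleAttack
}

// flaky simulates a lookup that hits a racing rename n times before succeeding.
func flaky(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return syscall.EAGAIN
		}
		return nil
	}
}

func main() {
	attempts, err := retryEAGAIN(flaky(3))
	fmt.Println(attempts, err) // 4 <nil>

	attempts, err = retryEAGAIN(flaky(1000))
	fmt.Println(attempts, err) // 32 possible attack detected
}
```

The fixed cap is what bounds the cost of a hostile rename loop, and it is also what turns sustained legitimate contention into a hard failure.
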
Unfortunately, it seems this number was too conservative and some vendors have reported seeing this error:

  runc run failed: unable to start container process: error during container init: error mounting "$source" to rootfs at "$destination": create mountpoint for $destination mount: lookup mountpoint target: securejoin.OpenInRoot $destination: openat2 $destination: possible attack detected

Based on my testing, the worst-case failure rate for this is probably around 3% (this is based on figures from me running very aggressive rename loops on all 16 cores of my laptop). It is probably lower for production deployments that have less aggressive rename and mount churn, but it was a detectable regression for some downstreams.

*00*-openat2-improve-resilience-on-busy-systems.patch is a patch that resolves this issue. The simplest mitigation is to just bump the retry number (which this patch does), but I have also included some additional retries with a time-based deadline that in my testing should be virtually impossible to hit even in very high load scenarios (I was unable to hit the error even after running >50k tests in a tight loop). Some vendors have reported that this reduced the failure rate to effectively 0 after 3-4 days of heavy load testing.

/* Dangling Symlink Mount Targets */

Due to the hardening work done for mounts in the provided patchsets, it was necessary to block certain configurations that could not be done safely in a reasonable way. One of these configurations is mount targets that contain symlinks to non-existent paths (otherwise known as "dangling symlinks").
With these patches, such configurations will result in the following error:

  runc create failed: unable to start container process: error during container init: error mounting "$source" to rootfs at "$destination": create mountpoint for $destination mount: make mountpoint "$destination": file exists

The workaround is to either change the symlink to point to a real path or create the target of the dangling symlink (previously, runc would do this for you). A survey of public images indicates that this pattern is incredibly rare (the one example I've been given is of a broken /etc/resolv.conf symlink), and in addition these kinds of symlinks are quite hard to deal with in a sane and safe manner.

This change in behaviour was intentional, but after receiving reports from more than one downstream, I took another look and wrote a hotfix that should allow us to continue to support these broken symlinks. *00*-rootfs-re-allow-dangling-symlinks-in-mount-targets.patch is that patch. However, we still strongly suggest users refrain from creating images with such broken symlinks.

/* Issues with "-v /dev:/dev" */

At SUSE, we found an example of a developer tool creating a bind-mount of the host /dev into the container. For reasons that are not entirely clear to me yet, this setup appears to have worked previously but can now lead to permission issues with rootless containers with our mitigating patches, with typical errors looking like:

  exec failed: unable to start container process: reopen ptmx to get new pty pair: reopen fd 11: permission denied

I have not yet been able to root-cause this issue (I suspect that ptmxmode=000 has some part to play here), but I would argue that such setups are not particularly safe nor recommended, and users should instead be doing --mount type=devpts,... if they have a strong need to configure the /dev/pts mount (which is what our tool was trying to do and had already been patched in newer versions to do properly).
If you have seen this issue or have any other information, feel free to open a bug report.

/*** Credits ***/

Thanks again to the following researchers for helping discover and report these vulnerabilities:

 * Lei Wang (@ssst0n3 from Huawei)
 * Li Fubang (@lifubang from acmcoder.com, CIIC)
 * Tõnis Tiigi (@tonistiigi from Docker)
 * Aleksa Sarai (@cyphar from SUSE)

Additional thanks go to Tõnis Tiigi for showing that Dockerfiles can be used to exploit these issues, and thus providing us with some very useful exploit templates for these kinds of race attacks.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

Download attachment "runc-patches-2025-11-05.tar.xz" of type "application/x-xz" (109568 bytes)
Download attachment "signature.asc" of type "application/pgp-signature" (266 bytes)