Message-ID: <cf57fb34-460c-3211-840f-8a5e3d88811a@linux.com>
Date: Tue, 16 Nov 2021 10:52:39 +0300
From: Alexander Popov <alex.popov@...ux.com>
To: Gabriele Paoloni <gpaoloni@...hat.com>,
    Lukas Bulwahn <lukas.bulwahn@...il.com>,
    Robert Krutsch <krutsch@...il.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
    Jonathan Corbet <corbet@....net>,
    Paul McKenney <paulmck@...nel.org>,
    Andrew Morton <akpm@...ux-foundation.org>,
    Thomas Gleixner <tglx@...utronix.de>,
    Peter Zijlstra <peterz@...radead.org>,
    Joerg Roedel <jroedel@...e.de>,
    Maciej Rozycki <macro@...am.me.uk>,
    Muchun Song <songmuchun@...edance.com>,
    Viresh Kumar <viresh.kumar@...aro.org>,
    Robin Murphy <robin.murphy@....com>,
    Randy Dunlap <rdunlap@...radead.org>,
    Lu Baolu <baolu.lu@...ux.intel.com>,
    Petr Mladek <pmladek@...e.com>,
    Kees Cook <keescook@...omium.org>,
    Luis Chamberlain <mcgrof@...nel.org>,
    Wei Liu <wl@....org>,
    John Ogness <john.ogness@...utronix.de>,
    Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
    Alexey Kardashevskiy <aik@...abs.ru>,
    Christophe Leroy <christophe.leroy@...roup.eu>,
    Jann Horn <jannh@...gle.com>,
    Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
    Mark Rutland <mark.rutland@....com>,
    Andy Lutomirski <luto@...nel.org>,
    Dave Hansen <dave.hansen@...ux.intel.com>,
    Steven Rostedt <rostedt@...dmis.org>,
    Will Deacon <will@...nel.org>,
    Ard Biesheuvel <ardb@...nel.org>,
    Laura Abbott <labbott@...nel.org>,
    David S Miller <davem@...emloft.net>,
    Borislav Petkov <bp@...en8.de>,
    Arnd Bergmann <arnd@...db.de>,
    Andrew Scull <ascull@...gle.com>,
    Marc Zyngier <maz@...nel.org>,
    Jessica Yu <jeyu@...nel.org>,
    Iurii Zaikin <yzaikin@...gle.com>,
    Rasmus Villemoes <linux@...musvillemoes.dk>,
    Wang Qing <wangqing@...o.com>,
    Mel Gorman <mgorman@...e.de>,
    Mauro Carvalho Chehab <mchehab+huawei@...nel.org>,
    Andrew Klychkov <andrew.a.klychkov@...il.com>,
    Mathieu Chouquet-Stringer <me@...hieu.digital>,
    Daniel Borkmann <daniel@...earbox.net>,
    Stephen Kitt <steve@....org>,
    Stephen Boyd <sboyd@...nel.org>,
    Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
    Mike Rapoport <rppt@...nel.org>,
    Bjorn Andersson <bjorn.andersson@...aro.org>,
    Kernel Hardening <kernel-hardening@...ts.openwall.com>,
    linux-hardening@...r.kernel.org,
    "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
    linux-arch <linux-arch@...r.kernel.org>,
    Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
    linux-fsdevel <linux-fsdevel@...r.kernel.org>,
    notify@...nel.org,
    main@...ts.elisa.tech,
    safety-architecture@...ts.elisa.tech,
    devel@...ts.elisa.tech,
    Shuah Khan <shuah@...nel.org>
Subject: Re: [ELISA Safety Architecture WG] [PATCH v2 0/2] Introduce the pkill_on_warn parameter

On 15.11.2021 18:51, Gabriele Paoloni wrote:
>
>
> On 15/11/2021 14:59, Lukas Bulwahn wrote:
>> On Sat, Nov 13, 2021 at 7:14 PM Alexander Popov <alex.popov@...ux.com> wrote:
>>>
>>> On 13.11.2021 00:26, Linus Torvalds wrote:
>>>> On Fri, Nov 12, 2021 at 10:52 AM Alexander Popov <alex.popov@...ux.com> wrote:
>>>>>
>>>>> Hello everyone!
>>>>> Friendly ping for your feedback.
>>>>
>>>> I still haven't heard a compelling _reason_ for this all, and why
>>>> anybody should ever use this or care?
>>>
>>> Ok, to sum up:
>>>
>>> Killing the process that hit a kernel warning complies with the Fail-Fast
>>> principle [1]. pkill_on_warn sysctl allows the kernel to stop the process when
>>> the **first signs** of wrong behavior are detected.
>>>
>>> By default, the Linux kernel ignores a warning and proceeds the execution from
>>> the flawed state. That is opposite to the Fail-Fast principle.
>>> A kernel warning may be followed by memory corruption or other negative effects,
>>> like in CVE-2019-18683 exploit [2] or many other cases detected by the SyzScope
>>> project [3]. pkill_on_warn would prevent the system from the errors going after
>>> a warning in the process context.
>>>
>>> At the same time, pkill_on_warn does not kill the entire system like
>>> panic_on_warn. That is the middle way of handling kernel warnings.
>>> Linus, it's similar to your BUG_ON() policy [4]. The process hitting BUG_ON() is
>>> killed, and the system proceeds to work. pkill_on_warn just brings a similar
>>> policy to WARN_ON() handling.
>>>
>>> I believe that many Linux distros (which don't hit WARN_ON() here and there)
>>> will enable pkill_on_warn because it's reasonable from the safety and security
>>> points of view.
>>>
>>> And I'm sure that the ELISA project by the Linux Foundation (Enabling Linux In
>>> Safety Applications [5]) would support the pkill_on_warn sysctl.
>>> [Adding people from this project to CC]
>>>
>>> I hope that I managed to show the rationale.
>>>
>>
>> Alex, officially and formally, I cannot talk for the ELISA project
>> (Enabling Linux In Safety Applications) by the Linux Foundation and I
>> do not think there is anyone that can confidently do so on such a
>> detailed technical aspect that you are raising here, and as the
>> various participants in the ELISA Project have not really agreed on
>> such a technical aspect being one way or the other and I would not see
>> that happening quickly. However, I have spent quite some years on the
>> topic on "what is the right and important topics for using Linux in
>> safety applications"; so here are my five cents:
>>
>> One of the general assumptions about safety applications and safety
>> systems is that the malfunction of a function within a system is more
>> critical, i.e., more likely to cause harm to people, directly or
>> indirectly, than the unavailability of the system. So, before
>> "something potentially unexpected happens"---which can have arbitrary
>> effects and hence effects difficult to foresee and control---, it is
>> better to just shutdown/silence the system, i.e., design a fail-safe
>> or fail-silent system, as the effect of shutdown is pretty easily
>> foreseeable during the overall system design and you could think about
>> what the overall system does, when the kernel crashes the usual way.
>>
>> So, that brings us to what a user would expect from the kernel in a
>> safety-critical system: Shutdown on any event that is unexpected.
>>
>> Here, I currently see panic_on_warn as the closest existing feature to
>> indicate any event that is unexpected and to shutdown the system. That
>> requires two things for the kernel development:
>>
>> 1. Allow a reasonably configured kernel to boot and run with
>> panic_on_warn set. Warnings should only be raised when something is
>> not configured as the developers expect it or the kernel is put into a
>> state that generally is _unexpected_ and has been exposed little to
>> the critical thought of the developer, to testing efforts and use in
>> other systems in the wild. Warnings should not be used for something
>> informative, which still allows the kernel to continue running in a
>> proper way in a generally expected environment. Up to my knowledge,
>> there are some kernels in production that run with panic_on_warn; so,
>> IMHO, this requirement is generally accepted (we might of course
>> discuss the one or other use of warn) and is not too much to ask for.
>>
>> 2. Really ensure that the system shuts down when it hits warn and
>> panic. That requires that the execution path for warn() and panic() is
>> not overly complicated (stuffed with various bells and whistles).
>> Otherwise, warn() and panic() could fail in various complex ways and
>> potentially keep the system running, although it should be shut down.
>> Some people in the ELISA Project looked a bit into why they believe
>> panic() shuts down a system but I have not seen a good system analysis
>> and argument why any third person could be convinced that panic()
>> works under all circumstances where it is invoked or that at least,
>> the circumstances under which panic really works is properly
>> documented. That is a central aspect for using Linux in a
>> reasonably-designed safety-critical system.
>> That is possibly also relevant for security, as you might see an
>> attacker obtain information because it was possible to "block" the
>> kernel shutting down after invoking panic() and hence, the attacker
>> could obtain certain information that was only possible because 1. the
>> system got into an inconsistent state, 2. it was detected by some check
>> leading to warn() or panic(), and 3. the system's security engineers
>> assumed that the system must have been shutting down at that point, as
>> panic() was invoked, and hence, this would be disallowing a lot of
>> further operations or some specific operations that the attacker would
>> need to trigger in that inconsistent state to obtain information.
>>
>> To your feature, Alex, I do not see the need to have any refined
>> handling of killing a specific process when the kernel warns; stopping
>> the whole system is the better and more predictable thing to do. I
>> would prefer if systems, which have those high-integrity requirements,
>> e.g., in a highly secure---where stopping any unintended information
>> flow matters more than availability---or in fail-silent environments
>> in safety systems, can use panic_on_warn. That should address your
>> concern above of handling certain CVEs as well.
>>
>> In summary, I am not supporting pkill_on_warn. I would support the
>> other points I mentioned above, i.e., a good enforced policy for use
>> of warn() and any investigation to understand the complexity of
>> panic() and reducing its complexity if triggered by such an
>> investigation.
>
> Hi Alex
>
> I also agree with the summary that Lukas gave here. From my experience
> the safety system are always guarded by an external flow monitor (e.g.
> a watchdog) that triggers in case the safety relevant workloads slows down
> or block (for any reason); given this condition of use, a system that
> goes into the panic state is always safe, since the watchdog would
> trigger and drive the system automatically into safe state.
> So I also don't see a clear advantage of having pkill_on_warn();
> actually on the flip side it seems to me that such feature could
> introduce more risk, as it kills only the threads of the process that
> caused the kernel warning whereas the other processes are trusted to
> run on a weaker Kernel (does killing the threads of the process that
> caused the kernel warning always fix the Kernel condition that lead to
> the warning?)

Lukas, Gabriele, Robert,

Thanks for showing this from the safety point of view.

The part about believing in panic() functionality is amazing :)
Yes, safety critical systems depend on the robust ability to restart.

Best regards,
Alexander