kernel-hardening - RE: Linux guest kernel threat model for Confidential Computing

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <DM8PR11MB5750F939C0B70939AD3CBC37E7D39@DM8PR11MB5750.namprd11.prod.outlook.com>
Date: Mon, 30 Jan 2023 07:42:30 +0000
From: "Reshetova, Elena" <elena.reshetova@...el.com>
To: "jejb@...ux.ibm.com" <jejb@...ux.ibm.com>, Leon Romanovsky
	<leon@...nel.org>
CC: Greg Kroah-Hartman <gregkh@...uxfoundation.org>, "Shishkin, Alexander"
	<alexander.shishkin@...el.com>, "Shutemov, Kirill"
	<kirill.shutemov@...el.com>, "Kuppuswamy, Sathyanarayanan"
	<sathyanarayanan.kuppuswamy@...el.com>, "Kleen, Andi" <andi.kleen@...el.com>,
	"Hansen, Dave" <dave.hansen@...el.com>, Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <peterz@...radead.org>, "Wunner, Lukas"
	<lukas.wunner@...el.com>, Mika Westerberg <mika.westerberg@...ux.intel.com>,
	"Michael S. Tsirkin" <mst@...hat.com>, Jason Wang <jasowang@...hat.com>,
	"Poimboe, Josh" <jpoimboe@...hat.com>, "aarcange@...hat.com"
	<aarcange@...hat.com>, Cfir Cohen <cfir@...gle.com>, Marc Orr
	<marcorr@...gle.com>, "jbachmann@...gle.com" <jbachmann@...gle.com>,
	"pgonda@...gle.com" <pgonda@...gle.com>, "keescook@...omium.org"
	<keescook@...omium.org>, James Morris <jmorris@...ei.org>, Michael Kelley
	<mikelley@...rosoft.com>, "Lange, Jon" <jlange@...rosoft.com>,
	"linux-coco@...ts.linux.dev" <linux-coco@...ts.linux.dev>, "Linux Kernel
 Mailing List" <linux-kernel@...r.kernel.org>, Kernel Hardening
	<kernel-hardening@...ts.openwall.com>
Subject: RE: Linux guest kernel threat model for Confidential Computing

 On Thu, 2023-01-26 at 13:28 +0000, Reshetova, Elena wrote:
> > > On Thu, Jan 26, 2023 at 11:29:20AM +0000, Reshetova, Elena wrote:
> > > > > On Wed, Jan 25, 2023 at 03:29:07PM +0000, Reshetova, Elena
> > > > > wrote:
> > > > > > Replying only to the not-so-far addressed points.
> > > > > >
> > > > > > > On Wed, Jan 25, 2023 at 12:28:13PM +0000, Reshetova, Elena
> > > > > > > wrote:
> > > > > > > > Hi Greg,
> > > > >
> > > > > <...>
> > > > >
> > > > > > > > 3) All the tools are open-source and everyone can start
> > > > > > > > using them right away even without any special HW (readme
> > > > > > > > has description of what is needed).
> > > > > > > > Tools and documentation is here:
> > > > > > > > https://github.com/intel/ccc-linux-guest-hardening
> > > > > > >
> > > > > > > Again, as our documentation states, when you submit patches
> > > > > > > based on these tools, you HAVE TO document that.  Otherwise
> > > > > > > we think you all are crazy and will get your patches
> > > > > > > rejected.  You all know this, why ignore it?
> > > > > >
> > > > > > Sorry, I didn’t know that for every bug that is found in
> > > > > > linux kernel when we are submitting a fix that we have to
> > > > > > list the way how it has been found. We will fix this in the
> > > > > > future submissions, but some bugs we have are found by
> > > > > > plain code audit, so 'human' is the tool.
> > > > > My problem with that statement is that by applying different
> > > > > threat model you "invent" bugs which didn't exist in a first
> > > > > place.
> > > > >
> > > > > For example, in this [1] latest submission, authors labeled
> > > > > correct behaviour as "bug".
> > > > >
> > > > > [1] https://lore.kernel.org/all/20230119170633.40944-1-
> > > > > alexander.shishkin@...ux.intel.com/
> > > >
> > > > Hm.. Does everyone think that when kernel dies with unhandled
> > > > page fault (such as in that case) or detection of a KASAN out of
> > > > bounds violation (as it is in some other cases we already have
> > > > fixes or investigating) it represents a correct behavior even if
> > > > you expect that all your pci HW devices are trusted?
> > >
> > > This is exactly what I said. You presented me the cases which exist
> > > in your invented world. Mentioned unhandled page fault doesn't
> > > exist in real world. If PCI device doesn't work, it needs to be
> > > replaced/blocked and not left to be operable and accessible from
> > > the kernel/user.
> >
> > Can we really assure correct operation of *all* pci devices out
> > there? How would such an audit be performed given a huge set of them
> > available? Isnt it better instead to make a small fix in the kernel
> > behavior that would guard us from such potentially not correctly
> > operating devices?
> 
> I think this is really the wrong question from the confidential
> computing (CC) point of view.  The question shouldn't be about assuring
> that the PCI device is operating completely correctly all the time (for
> some value of correct).  It's if it were programmed to be malicious
> what could it do to us?  

Sure, but Leon didn’t agree with CC threat model to begin with, so 
I was trying to argue here how this fix can be useful for non-CC threat 
model case. But obviously my argument for non-CC case wasn't good (
especially reading Ted's reply here 
https://lore.kernel.org/all/Y9Lonw9HzlosUPnS@mit.edu/ ), so I better
stick to CC threat model case indeed. 

>If we take all DoS and Crash outcomes off the
> table (annoying but harmless if they don't reveal the confidential
> contents), we're left with it trying to extract secrets from the
> confidential environment.

Yes, this is the ultimate end goal. 

> 
> The big threat from most devices (including the thunderbolt classes) is
> that they can DMA all over memory.  However, this isn't really a threat
> in CC (well until PCI becomes able to do encrypted DMA) because the
> device has specific unencrypted buffers set aside for the expected DMA.
> If it writes outside that CC integrity will detect it and if it reads
> outside that it gets unintelligible ciphertext.  So we're left with the
> device trying to trick secrets out of us by returning unexpected data.

Yes, by supplying the input that hasn’t been expected. This is exactly 
the case we were trying to fix here for example:
https://lore.kernel.org/all/20230119170633.40944-2-alexander.shishkin@linux.intel.com/
I do agree that this case is less severe when others where memory
corruption/buffer overrun can happen, like here:
https://lore.kernel.org/all/20230119135721.83345-6-alexander.shishkin@linux.intel.com/
But we are trying to fix all issues we see now (prioritizing the second ones
though). 

> 
> If I set this as the problem, verifying device correct operation is a
> possible solution (albeit hugely expensive) but there are likely many
> other cheaper ways to defeat or detect a device trying to trick us into
> revealing something.

What do you have in mind here for the actual devices we need to enable for CC cases?
We have been using here a combination of extensive fuzzing and static code analysis.

Best Regards,
Elena.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.