|   | 
| 
 | 
Message-ID: <8b282de8-a324-6d9e-8d04-b370196ad0be@redhat.com>
Date: Tue, 20 Oct 2020 22:28:13 +0200
From: David Hildenbrand <david@...hat.com>
To: "Guilherme G. Piccoli" <gpiccoli@...onical.com>,
 Michal Hocko <mhocko@...e.com>, Mike Kravetz <mike.kravetz@...cle.com>
Cc: linux-mm@...ck.org, kernel-hardening@...ts.openwall.com,
 linux-hardening@...r.kernel.org, linux-security-module@...r.kernel.org,
 kernel@...ccoli.net, cascardo@...onical.com,
 Alexander Potapenko <glider@...gle.com>,
 James Morris <jamorris@...ux.microsoft.com>,
 Kees Cook <keescook@...omium.org>
Subject: Re: [PATCH] mm, hugetlb: Avoid double clearing for hugetlb pages
On 20.10.20 22:07, David Hildenbrand wrote:
> On 20.10.20 21:19, Guilherme G. Piccoli wrote:
>> Hi Michal, thanks a lot for your thorough response. I'll address the
>> comments inline, below. Thanks also David and Mike - in fact, I almost
>> don't need to respond here after Mike, he was right to the point I'm
>> going to discuss heh...
>>
>>
>> On 20/10/2020 05:20, Michal Hocko wrote:
>>>
>>> Yes zeroying is quite costly and that is to be expected when the feature
>>> is enabled. Hugetlb like other allocator users perform their own
>>> initialization rather than go through __GFP_ZERO path. More on that
>>> below.
>>>
>>> Could you be more specific about why this is a problem. Hugetlb pool is
>>> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages
>>> is non trivial amount of time but it doens't look like a major disaster
>>> either. If the pool is allocated later it can take much more time due to
>>> memory fragmentation.
>>>
>>> I definitely do not want to downplay this but I would like to hear about
>>> the real life examples of the problem.
>>
>> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was
>> just my simple test in a guest, the real case is much worse! It aligns
>> with Mike's comment, we have complains of minute-like delays, due to a
>> very big pool of hugepages being allocated.
>>
>> Users have their own methodology for allocating pages, some would prefer
>> do that "later" for a variety of reasons, so early boot time allocations
>> are not always used, that shouldn't be the only focus of the discussion
>> here.
>> In the specific report I had, the user complains about more than 3
>> minutes to allocate ~542G of 2M hugetlb pages.
>>
>> Now, you'll ask why in the heck they are using init_on_alloc then -
>> right? So, the Kconfig option "CONFIG_INIT_ON_ALLOC_DEFAULT_ON" is set
>> by default in Ubuntu, for hardening reasons. So, the workaround for the
>> users complaining of delays in allocating hugetlb pages currently is to
>> set "init_on_alloc" to 0. It's a bit lame to ask users to disable such
>> hardening thing just because we have a double initialization in hugetlb...
>>
>>
>>>
>>>
>>> This has been discussed already (http://lkml.kernel.org/r/20190514143537.10435-4-glider@google.com.
>>> Previously it has been brought up in SLUB context AFAIR. Your numbers
>>> are quite clear here but do we really need a gfp flag with all the
>>> problems we tend to grow in with them?
>>>
>>> One potential way around this specifically for hugetlb would be to use
>>> __GFP_ZERO when allocating from the allocator and marking the fact in
>>> the struct page while it is sitting in the pool. Page fault handler
>>> could then skip the zeroying phase. Not an act of beauty TBH but it
>>> fits into the existing model of the full control over initialization.
>>> Btw. it would allow to implement init_on_free semantic as well. I
>>> haven't implemented the actual two main methods
>>> hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page because
>>> I am not entirely sure about the current state of hugetlb struct page in
>>> the pool. But there should be a lot of room in there (or in tail pages).
>>> Mike will certainly know much better. But the skeleton of the patch
>>> would look like something like this (not even compile tested).
>>> [code...]
>>
>> Thanks a lot for pointing the previous discussion for me! I should have
>> done my homework properly and read all versions of the patchset...my
>> bad! I'm glad to see this problem was discussed and considered early in
>> the patch submission, I guess it only missed more real-world numbers.
>>
>> Your approach seems interesting, but as per Mike's response (which seems
>> to have anticipated all my arguments heheh) your approach is a bit
>> reversed, solving a ""non-existent"" problem (of zeroing hugetlb pages
>> in fault time), whereas the big problem hereby tentatively fixed is the
>> massive delay on allocation time of the hugetlb pages.
>>
>> I understand that your suggestion has no burden of introducing more GFP
>> flags, and I agree that those are potentially dangerous if misused (and
>> I totally agree with David that __GFP_NOINIT_ON_ALLOC is heinous, I'd
>> rather go with the originally proposed __GFP_NO_AUTOINIT), but...
>> wouldn't it be letting the code just drive a design decision? Like "oh,
>> adding a flag is so bad..better just let this bug/perf issue to stay".
> 
> The main problem I have is that page alloc code does some internal page
> allocator things ("init_on_alloc" - "Fill newly allocated pages and heap
> objects with zeroes"), and we're allowing users of page alloc code *that
> really shouldn't have to care* to override that behavior, exposing
> unnecessary complexity. Mainly: other allocators.
> 
> "__GFP_NOINIT_ON_ALLOC" - what exactly does it do?
> "__GFP_NO_AUTOINIT" - what exactly does it do?
> 
> __GFP_ZERO set: page always zero.
> __GFP_ZERO not set: page zero with init_on_alloc, page not necessarily
>                     zero without init_on_alloc. Users can find out by 	
>                     looking at init_on_alloc.
> 
> IMHO, even something like __GFP_DONT_ZERO would be clearer. But I still
> somewhat don't like letting users of the buddy override configured
> behavior. Yes, it could be used by other alloactors (like hugetlb) to
> optimize.
> 
> But it could also be used by any driver wanting to optimize the
> "init_on_alloc" case, eventually introducing security issues because the
> code tries to be smart.
> 
BTW, there might be other users for something like __GFP_DONT_ZERO.
Especially, memory ballooning drivers (and virtio-mem), whereby the
hypervisor is (WHP) going to zap the page either way after allocation.
You just cannot assume that when freeing such a page again, that it's
actually zero.
But then, somebody told the system to suffer ("alloc_on_init"), so there
isn't too much motivation to optimize such corner cases.
-- 
Thanks,
David / dhildenb
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.