Message-ID: <CAHktk4h4p5_qWh5dEGGEEaRnPbnw2dJ6Dai_rkFpBz89nJ3HGw@mail.gmail.com>
Date: Mon, 23 Mar 2026 17:58:55 -0700
From: Charles Munger <clm@...gle.com>
To: Alejandro Colomar <une+c@...jandro-colomar.es>
Cc: sc22wg14@...n-std.org, Rich Felker <dalias@...c.org>, musl@...ts.openwall.com
Subject: Re: [SC22WG14.35974] N3849 - Memory allocation with size feedback v2,
request for feedback
On Thu, Mar 19, 2026 at 4:16 PM Alejandro Colomar <une+c@...jandro-colomar.es> wrote:
> [CC += Rich, musl@]
>
> Hi Charles,
>
> On 2026-03-16T22:50:40-0700, Charles Munger wrote:
> > This paper is the second revision of the original N3813,
>
> <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3813.pdf>
>
> > updated to address
> > feedback on this mailing list; thank you Alejandro for providing feedback
> > on the initial version.
> >
> > This new revision includes more discussion of design choices, as well as
> > fixes for typos in example code, and no longer special casing allocation
> > failure vs allocation success with regard to the returned size. In
> > addition, two examples I have personal experience with in C codebases are
> > included, as challenges dealing with these motivated this standards
> > proposal.
> >
> > I look forward to your feedback.
>
> I'll quote the paper to reply.
> TL;DR: I don't see evidence that this is necessary.
>
> | Use Cases
> | Dynamic Array
> | A common data structure is a resizable array
> | that grows on insertions by reallocating.
> | To achieve amortized linear time,
> | implementations allocate larger blocks of memory
> | than they actually need to hold the current set of items
> | - if they could use additional space at low/no cost,
> | fewer resizes would be needed,
> | minimizing copying and allocator churn.
>
> I expect most dynamic arrays --just like every dynamic data structure--
> to allocate sizes with the usual 2x algorithm: double the size
> whenever we need to grow.
That's often the first algorithm people reach for, but I and others have
empirically found growth factors around 1.5 (which minimize overshoot) to
work better:
https://github.com/facebook/folly/blob/main/folly/docs/FBVector.md#memory-handling
Reading this link, I note that they also collaborate with the allocator on
size classes via some of the extensions referenced in the paper.
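For what it's worth, here is roughly how a growable array would consume the
returned size. This is only a sketch using the names from the paper
(alloc_result_t, alloc_at_least); since the function doesn't exist yet, a
malloc shim that returns exactly the requested size stands in for it:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Shim for the proposed API so this sketch compiles today; a conforming
 * implementation is free to behave exactly like this and return only the
 * requested size. The names come from the paper, not an existing libc. */
typedef struct { void *ptr; size_t size; } alloc_result_t;

static alloc_result_t alloc_at_least(size_t size) {
    return (alloc_result_t){ malloc(size), size };
}

typedef struct {
    int *data;
    size_t len, cap;   /* cap in elements, derived from the real block size */
} vec_int;

/* Grow by ~1.5x, then round the capacity up to whatever the allocator
 * actually returned, so the slack is usable instead of wasted. */
static int vec_push(vec_int *v, int x) {
    if (v->len == v->cap) {
        size_t want = v->cap ? v->cap + v->cap / 2 + 1 : 4;
        alloc_result_t r = alloc_at_least(want * sizeof(int));
        if (!r.ptr)
            return -1;
        if (v->len)
            memcpy(r.ptr, v->data, v->len * sizeof(int));
        free(v->data);
        v->data = r.ptr;
        v->cap = r.size / sizeof(int);
    }
    v->data[v->len++] = x;
    return 0;
}
```

With a real size-class allocator, cap would often land on the class
boundary, so some pushes that would otherwise trigger a reallocation become
free.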
> If you do that, then
>
> - "Fewer resizes" is something you don't care about. The number of
> resizes is logarithmic with respect to the size. We're not talking
> about hundreds of realloc(3) calls.
>
I think that very much depends on the use case. The code I work on
certainly cares about this: we see page faults and cache misses in
production profiles of Android applications, and improvements from removing
or minimizing them.
>
> - Considering allocators with different arenas for different sizes,
> you'd only be able to use a small percentage of the space, but
> the extra size is statistically not going to be meaningful. And when
> the allocation is larger, the percentage is smaller (you'll get at
> most the page size).
>
That had been my assumption, but when testing upb's arena allocator I
actually saw substantial improvements, including:

name                                    old peak-mem(Bytes)/op  new peak-mem(Bytes)/op  delta
BM_Parse_Upb_FileDesc<UseArena, Alias>  43.7k ± 0%              30.4k ± 0%              -30.32%  (p=0.000 n=60+60)
This is because the size of allocated blocks increases as the total amount
of memory in the arena increases (to better amortize the cost of small
allocations and improve locality), so the last block where the parse stops
is always larger than the previous blocks (up to a max size), and is also
the least full. That also means the slack reclaimed in the last allocation
is a large saving relative to the total.
>
> You should quantify the gains, with some experiments in real code,
> measuring how many realloc(3) calls you save if you use those proposed
> APIs compared to not using them. I expect this will be at most on the
> order of 5~10%; possibly less.
>
> And then, measure the performance difference too. Because if you
> achieve reduction of maybe 7% of realloc(3) calls, but the calls to this
> API have a 7% performance penalty, then it makes no sense.
>
In fact, we have exactly that data from SQLite. I reached out to the author
of SQLite about whether their use of malloc_usable_size() was safe; they
profiled their workloads and determined that the availability of the extra
memory did make SQLite faster, but that the pointer-to-size lookups
themselves were expensive. That's why this new API is careful to impose
essentially no
constraints on the allocator that would force performance costs, and
there's a discussion of that in the current revision. We don't require the
allocator to do anything, and we ask for additional information at the time
when they already have it available. The user of the API doesn't have to
worry that they're calling something with additional expense that may not
give them extra memory in return. The only performance overhead for the
allocator is whatever extra work (for hardening or other reasons) scales
with the size of the memory returned; for example, zeroing, tagging for
MTE, or filling with some specific garbage pattern.
>
> But my comments above would assume you also take advantage of realloc(3)
> wasted space, which your proposal doesn't. You only take advantage in
> the initial allocation, which makes the savings even less meaningful.
>
We could certainly add realloc_at_least; for one of my motivating use cases
(serialization of protocol buffers) it wouldn't help though, because
realloc places the contents of existing memory at the beginning of the new
allocation, and my use case needs it to place it at the end. I realize
that's atypical. But realloc is in general dramatically less useful here,
as discussed in the paper: a size-class-based allocator will never be able
to enlarge in place an allocation obtained from alloc_at_least, because
that allocation already fills its size class. realloc never got an
aligned_alloc variant either.
>
> I'd like to see numbers showing that this is really meaningful, compared
> to requesting a reasonable round number initially (which is likely to be
> quite optimized for space).
>
For the UPB parsing case, we're allocating in-memory structs based on a
wire format; it's hard to pick a reasonable size before doing the actual
parsing work.
>
> | Arena/Bump allocators
> | Repeated small allocations with the same lifetime
> | can be grouped together by using malloc
> | to obtain a large block of memory,
> | and then adding the size of each small allocation
> | to the returned pointer until it is full.
> | If needed, allocate another block and repeat.
> | However, chosen block sizes can waste substantial memory
> | if the requested sizes are a poor fit
> | for underlying size classes used by malloc.
>
> How large is the usual waste? Have you measured numbers? I expect that
> as long as you use round numbers, you're not going to waste much, if at
> all. I expect this is not a real problem.
>
For UPB the results were quite significant: a substantial reduction in peak
memory, because the block sizes grow, and the block-sizing heuristics that
perform best are not simply powers of 2.
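To make that pattern concrete, here's a bump-allocator sketch in the shape
the quoted paper text describes, with each block's capacity taken from the
allocator's actual returned size. The alloc_at_least shim over plain malloc
is mine, and block sizing and alignment are simplified; this is not upb's
implementation:

```c
#include <assert.h>
#include <stdlib.h>

/* Shim for the proposed API (names from the paper, not an existing libc). */
typedef struct { void *ptr; size_t size; } alloc_result_t;
static alloc_result_t alloc_at_least(size_t size) {
    return (alloc_result_t){ malloc(size), size };
}

typedef struct block {
    struct block *next;
    size_t used, cap;          /* payload bytes used / available */
} block;

typedef struct {
    block *head;
    size_t next_block_size;    /* grows with the arena, as described above */
} arena;

static void *arena_alloc(arena *a, size_t n) {
    n = (n + 7) & ~(size_t)7;  /* 8-byte alignment; simplified */
    if (!a->head || a->head->cap - a->head->used < n) {
        size_t want = sizeof(block) +
                      (n > a->next_block_size ? n : a->next_block_size);
        alloc_result_t r = alloc_at_least(want);
        if (!r.ptr)
            return NULL;
        block *b = r.ptr;
        b->next = a->head;
        b->used = 0;
        b->cap = r.size - sizeof(block);  /* keep the allocator's slack */
        a->head = b;
        a->next_block_size *= 2;          /* amortize small allocations */
    }
    void *p = (char *)(a->head + 1) + a->head->used;
    a->head->used += n;
    return p;
}

static void arena_free(arena *a) {
    for (block *b = a->head; b != NULL; ) {
        block *next = b->next;
        free(b);
        b = next;
    }
    a->head = NULL;
}
```

The one-line difference from a malloc-based arena is `b->cap = r.size -
sizeof(block)`: the arena serves requests out of bytes it would otherwise
have paid for and never seen.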
>
> | I/O Buffers
> | When streaming data to/from
> | a file, the network, or some other system,
> | we often don’t know how much data we’ll move in advance.
> | Larger buffers use more memory
> | but often increase throughput
> | and decrease syscall and device overhead;
> | if we can obtain a larger buffer than required
> | at no additional cost,
> | we want to do so.
>
> Again, you need to show numbers. The size extensions aren't going to be
> an order of magnitude. More likely, it will be a small % of what you
> have, which will not affect in a meaningful way the performance. You
> should probably optimize for your system calls and network.
>
> | Examples
> | In SQLite, [...]
> | In upb, [...]
>
> It would be interesting (actually, necessary) to see numbers of how much
> memory those projects are saving, or how much performance those projects
> are gaining, thanks to the use of these APIs.
>
> Apart from the lack of numbers that would show whether this is a real
> problem or not, I don't like the proposed APIs.
>
> | Choice of Return Type
> | The proposed API returns a struct,
> | which is atypical although not unheard of in C’s standard library,
> | div_t being a prior example.
>
> div(3) is quite weird.
>
> | The following alternative structures were considered:
> | void* alloc_at_least(size_t* size);
> | The former is somewhat similar to
> | atomic_compare_exchange_strong’s expected parameter.
> | Like that parameter,
> | taking an address precludes
> | passing a literal or expression for the size,
>
> Yes. But is that a problem?
It has been a minor annoyance for me in the past when writing lock free
code, which is why I mentioned it.
>
> size = 42;
> p = alloc_at_least(&size);
>
> | and the developer has to pass an address,
> | usually of a local variable, instead.
> | If the original value is still needed later,
> | it has to be stored somewhere else.
>
> Why would you want to store the guess? You want to know the amount of
> memory used, and the amount of memory available, but the requested
> amount of memory seems not very useful.
>
Presumably the memory you're requesting may be the amount of memory you
strictly need, which would translate to the used capacity, which you'd want
to retain.
>
> Since you have mentioned code bases that do something like this, you
> should be able to show whether the guess is something that needs to be
> stored often, and how each of these designs would affect their source
> code. It would be good to see real code using this, rather than
> hypotheses.
>
> | alloc_result_t r = alloc_at_least(offsetof(T, data[4]));
> | if (r.ptr) {
> |     T* result = r.ptr;
> |     result->count = (r.size - offsetof(T, data[0])) / sizeof(uint64_t);
> | }
>
> This seems quite unreadable, IMO.
>
> | size_t size = offsetof(T, data[4]);
> | T* result = alloc_at_least(&size);
> | if (result)
> |     result->count = (size - offsetof(T, data[0])) / sizeof(uint64_t);
>
> Much better, IMO.
>
> | size_t actual_size;
> | T* result = alloc_at_least(offsetof(T, data[4]), &actual_size);
> | if (result)
> |     result->count = (actual_size - offsetof(T, data[0])) / sizeof(uint64_t);
>
> The former seems more idiomatic.
>
> | Each of these is basically the same amount of code,
> | and offer similar opportunities for programmer error.
>
> I tend to disagree. The second is much more robust than the others, by
> having less code.
>
I think this is pretty subjective; the latter examples get to one fewer
line of code by omitting braces, which is a stylistic choice with both
costs and benefits, and may or may not be applicable depending on the
surrounding code.
>
> | However, there are some practical concerns.
> | With the out-param style,
> | an ABI is effectively required to load and store.
>
> Are you really concerned about a "load and store"? We're talking of
> malloc(3) here. That's a monster, and the least of our problems is a
> load and a store.
>
malloc(3) is not necessarily a monster, but the issue is the model for
people using the code. In order for a developer to feel comfortable using
an API in place of malloc that the implementation is free to treat just
like malloc, it's desirable to ensure that any costs relative to malloc are
minimized. That way, even if the code runs in an environment where the
allocator always returns the requested size, they are not making any real
tradeoff by using alloc_at_least.
>
> | When returning a struct by value,
> | some ABIs will return the small struct on the stack
> | at cost similar to
> | putting a size_t on the stack and passing its address.
> | However,
> | implementors may pass the struct’s members in registers,
> | and some do - notably including
> | aarch64, x86_64 using the System V ABI, and Risc-V.
> | WebAssembly/Wasm can also take advantage of multiple return values.
>
> You're micro optimizing performance in malloc(3). It makes no sense.
> Please show numbers to see that this performance difference is
> measurable at all. We should be optimizing for safety, readability, and
> usability when designing an API, unless there are strong reasons to care
> about performance.
> | As far as developer ergonomics,
> | there are costs and benefits to each approach.
> | The functions that return a pointer
> | (and use an out-param for the size)
> | can assign the return value directly
> | to where it’s likely to be consumed,
> | to a non-`void` pointer.
>
> This is what makes it importantly safer.
>
> alx@...uan:~/tmp$ cat m.c
> #include <stdlib.h>
>
> #define typeas(T) typeof((T){})
> #define mallocarray(n, z) reallocarray(NULL, (n)?:1, z)
> #define malloc_T(n, T) malloc_T_(n, typeas(T))
> #define malloc_T_(n, T) ((void)0, (T *){mallocarray(n, sizeof(T))})
>
> int
> main(void)
> {
> long *p;
>
> p = malloc_T(42, long); // ok
> free(p);
>
> p = malloc_T(42, int); // -Wincompatible-pointer-types
> free(p);
> }
> alx@...uan:~/tmp$ gcc m.c
> m.c: In function ‘main’:
> m.c:16:11: error: assignment to ‘long int *’ from incompatible
> pointer type ‘int *’ [-Wincompatible-pointer-types]
>    16 |         p = malloc_T(42, int); // -Wincompatible-pointer-types
>       |           ^
>
> Your variant is inherently unsafe, and can't be made type-safe in any
> way. This is a non-starter, IMO.
>
If you're going to wrap it that way, you can wrap it in a `static inline`
function that uses whichever of the alternative structures you prefer,
which then permits this same macro for typing. My preference for returning
a struct is that it's the calling convention that affords the best
performance for the part of the ABI the programmer doesn't control.
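Such an adapter is only a couple of lines. A sketch (the adapter's name is
mine, and alloc_at_least is shimmed over malloc so this compiles today):

```c
#include <stdlib.h>

/* Shim for the proposed struct-returning API (names from the paper). */
typedef struct { void *ptr; size_t size; } alloc_result_t;
static alloc_result_t alloc_at_least(size_t size) {
    return (alloc_result_t){ malloc(size), size };
}

/* Out-param adapter (hypothetical name): recovers the pointer-returning
 * shape, so macro tricks that type the result still apply. */
static inline void *alloc_at_least_p(size_t *size) {
    alloc_result_t r = alloc_at_least(*size);
    if (r.ptr)
        *size = r.size;   /* only updated on success */
    return r.ptr;
}
```

The reverse adaptation (out-param API wrapped to return a struct) cannot
recover the register-return calling convention, which is why the struct
form is the one worth standardizing.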
>
> | However,
> | they also generally have to declare a variable
> | to pass the address for the actual size;
> | often the allocated size will be stored
> | inside the just-allocated block.
>
> Count lines of code in the examples you provided above. I count 4 LoC
> in all examples, so you're not really gaining in this regard with your
> design.
>
> | Interaction with Sized Free
>
> These interactions seem bad. free_sized() is losing its only argument,
> which was to mitigate some double-free(3) bugs. This proposal is
> mitigating the mitigation.
>
free_sized provides both hardening and performance improvements, depending
on the allocator's implementation choices. The behavior matches the
existing standard guidance and makes the hardening work better, because it
permits implementations to omit functionality like malloc_usable_size (or
make it slower) and thus to enforce stricter size checking for allocations not
obtained from alloc_at_least. Right now, if you call ptr = malloc(999) and
then free_sized(ptr, 1024), any allocator that supports malloc_usable_size
must either record that malloc_usable_size was called (and thus that
free_sized should accept something in the range of the size class) or emit
an error in response to someone passing the size they got from
malloc_usable_size. With alloc_at_least, there is no tradeoff; a debug or
hardened allocator can strictly enforce sizes for different calls.
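To make the pairing rule concrete: the size passed back to free_sized is
the size the allocator actually returned, never the size requested. A
sketch (both functions are shimmed here, since the real check lives inside
a hardened allocator; the helper's name is mine):

```c
#include <stdlib.h>
#include <string.h>

/* Shims so the sketch compiles today (names from the paper / C23). */
typedef struct { void *ptr; size_t size; } alloc_result_t;
static alloc_result_t alloc_at_least(size_t size) {
    return (alloc_result_t){ malloc(size), size };
}
static void free_sized_shim(void *p, size_t size) {
    (void)size;  /* a hardened allocator would verify this against its metadata */
    free(p);
}

/* Allocate a copy of s, reporting the true capacity; the caller must
 * hand exactly that capacity back at free time. */
static char *dup_with_slack(const char *s, size_t *out_cap) {
    size_t need = strlen(s) + 1;
    alloc_result_t r = alloc_at_least(need);
    if (!r.ptr)
        return NULL;
    memcpy(r.ptr, s, need);
    *out_cap = r.size;   /* r.size, not need: this is the free_sized size */
    return r.ptr;
}
```

Because r.size is the only valid size to free with, the allocator never has
to tolerate a range of plausible sizes, which is exactly what lets a debug
build check strictly.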
>
> | Why not realloc alone?
> | realloc must determine from the allocator’s metadata
> | the true size of the block.
> | Even if paired with extensions like malloc_usable_size
> | to resize to the precise, actual size,
> | these pointer-to-size lookups are costly.
>
> Do you have actual numbers?
>
I spotted it in benchmarks and the SQLite folks pointed it out as an issue
in their profiling. I can try and pull some for you.
>
> | Avoiding this lookup
> | was a motivation behind C++14’s sized delete
> | and C23’s free_sized features.
>
> Is the performance of free(3) vs free_sized() really a concern?
> The security argument is slightly more compelling (although I'm still
> not convinced either), but optimizing free(3) surprises me.
> Is free(3) a performance issue to any program? I expect the chances of
> passing an incorrect size to be worse than calling free(3).
>
> The paper that added free_sized() (n2699) claims a 30% performance gain.
> But is a 30% meaningful? I expect free(3) to be quite cheap,
> relatively. Also, I'm a bit confused: if free_sized() is about safety,
> it means it must really check the actual size of the allocation. And if
> it checks the actual size, where does the performance gain come from?
> I'd like to question those numbers, and see an actual experiment.
> <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2699.htm>
>
That function was already standardized, but freeing memory (into a
CPU-local cache of fixed size classes via rseq) has gotten extremely cheap
for some allocators, such that the pointer-to-size lookup cost is notable.
>
> | When a program can make use of the added space,
> | the best time to determine it is at allocation time
> | when the allocator has all of the relevant metadata available.
>
>
> Have a lovely night!
> Alex
>
> --
> <https://www.alejandro-colomar.es>
>