Message-ID: <20170704214554.GS1627@brightrain.aerifal.cx>
Date: Tue, 4 Jul 2017 17:45:54 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] optimize malloc0

On Tue, Jun 27, 2017 at 12:43:39AM +0300, Alexander Monakov wrote:
> Implementation of __malloc0 in malloc.c takes care to preserve zero
> pages by overwriting only non-zero data. However, malloc must have
> already modified auxiliary heap data just before and beyond the
> allocated region, so we know that edge pages need not be preserved.
>
> For allocations smaller than one page, pass them immediately to memset.
> Otherwise, use memset to handle partial pages at the head and tail of
> the allocation, and scan complete pages in the interior. Optimize the
> scanning loop by processing 16 bytes per iteration and handling rest of
> page via memset as soon as a non-zero byte is found.
> ---
> A followup to a recent IRC discussion. Code size cost on x86 is about
> just 80 bytes (note e.g. how mal0_clear uses memset for two purposes
> simultaneously, handling the partial page at the end, and clearing
> interior non-zero pages).
>
> On a Sandy Bridge CPU, speed improvement for the potentially-zero-page
> scanning loop is almost 2x on 64-bit, almost 3x on 32-bit.
>
> Note that existing implementation can over-clear by as much as
> sizeof(size_t)-1 beyond the allocation, the new implementation never
> does that. This may expose application bugs that were hidden before.
>
> Alexander
>
>  src/malloc/malloc.c | 25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/src/malloc/malloc.c b/src/malloc/malloc.c
> index d5ee4280..720fa696 100644
> --- a/src/malloc/malloc.c
> +++ b/src/malloc/malloc.c
> @@ -366,15 +366,28 @@ void *malloc(size_t n)
>  	return CHUNK_TO_MEM(c);
>  }
>
> +static size_t mal0_clear(char *p, size_t pagesz, size_t n)
> +{
> +	typedef unsigned long long T;
> +	char *pp = p + n;
> +	size_t i = (uintptr_t)pp & (pagesz - 1);
> +	for (;;) {
> +		pp = memset(pp - i, 0, i);
> +		if (pp - p < pagesz) return pp - p;
> +		for (i = pagesz; i; i -= 2*sizeof(T), pp -= 2*sizeof(T))
> +			if (((T *)pp)[-1] | ((T *)pp)[-2])
> +				break;
> +	}
> +}
> +
>  void *__malloc0(size_t n)
>  {
>  	void *p = malloc(n);
> -	if (p && !IS_MMAPPED(MEM_TO_CHUNK(p))) {
> -		size_t *z;
> -		n = (n + sizeof *z - 1)/sizeof *z;
> -		for (z=p; n; n--, z++) if (*z) *z=0;
> -	}
> -	return p;
> +	if (!p || IS_MMAPPED(MEM_TO_CHUNK(p)))
> +		return p;
> +	if (n >= PAGE_SIZE)
> +		n = mal0_clear(p, PAGE_SIZE, n);
> +	return memset(p, 0, n);
>  }
>
>  void *realloc(void *p, size_t n)
> --
> 2.11.0

Overall I like this. Reviewing what was discussed on IRC, I called the
loop logic clever and nsz said maybe a bit too clever. On further
reading I think he's right.

One additional concern was that the reverse-scanning may be bad for
performance. I suggested it might work just as well to restructure the
loop as "for each word, if nonzero, memset to end of page and advance
to that point".

A cheap way to avoid the scanning logic for the first and last partial
page, while not complicating the loop logic, would be just writing a
nonzero value to the first byte of each before the loop.

Rich
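
To make the suggested restructure concrete, here is one way the
forward-scanning loop might look. This is an untested editor's sketch,
not code from the thread: the function name mal0_clear_fwd, the
in-place interface (clearing everything itself rather than returning a
residual length for the caller's final memset), the short-tail guard,
and the byte-planting details are all assumptions about what the reply
above describes. It assumes p is at least 8-byte aligned and pagesz is
a power of two, both of which hold for musl's malloc.

#include <stdint.h>
#include <string.h>

static void mal0_clear_fwd(char *p, size_t pagesz, size_t n)
{
	typedef unsigned long long T;
	char *pp = p, *end = p + n;

	/* Plant a nonzero byte at the start of the partial first and
	 * last pages, so the main loop memsets them like any other
	 * dirty page; no separate head/tail handling is needed. */
	if ((uintptr_t)p & (pagesz-1))
		*p = 1;
	if ((uintptr_t)end & (pagesz-1)) {
		char *lastpage = (char *)((uintptr_t)end & -(uintptr_t)pagesz);
		if (lastpage >= p) *lastpage = 1;
	}

	while (pp < end) {
		if ((size_t)(end - pp) < sizeof(T)) {
			/* Tail too small for a word read; clear it
			 * directly rather than reading past end. */
			memset(pp, 0, end - pp);
			break;
		}
		if (*(T *)pp) {
			/* Nonzero word: memset from here to the end of
			 * this page (or of the allocation), then resume
			 * scanning at that point. */
			char *pe = (char *)(((uintptr_t)pp | (pagesz-1)) + 1);
			if (pe > end) pe = end;
			memset(pp, 0, pe - pp);
			pp = pe;
		} else {
			pp += sizeof(T);
		}
	}
}

A caller shaped like the patch's __malloc0 would then invoke
mal0_clear_fwd(p, PAGE_SIZE, n) for n >= PAGE_SIZE and skip the final
memset. Because the planted bytes dispatch each partial edge page to
memset as soon as the scan reaches it, the word loop only ever walks
page-aligned zero-candidate spans; and although planting dirties the
edge pages, the premise of the patch is that malloc's own bookkeeping
has dirtied them already.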