john-dev - Re: Interleaving of intrinsics



Message-ID: <cc0e38f49fb7a974602f2cc209e6c49b@smtp.hushmail.com>
Date: Tue, 02 Jun 2015 16:57:03 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

On 2015-06-02 13:01, Solar Designer wrote:
> magnum -
>
> On Mon, Jun 01, 2015 at 01:37:14PM +0200, magnum wrote:
>> Or perhaps as soon as we use interleaving, things like tmp[SIMD_PARA]
>> end up being stack arrays? That should hurt a lot.
>
> This is quite possible.  In general, one of the things limiting the
> interleaving factor is register pressure - and the compiler might in
> fact do a worse job at register allocation when we use arrays.
>
>> Actually, here's a bug we have: Using the wide loops as in SHA2, we
>> don't need to use "tmp[i]" at all - we do fine with just "tmp".
>
> Huh?  Doesn't this defeat interleaving, replacing it with sequential
> processing, because our source code sort of hints to the compiler to
> reuse the same register across instances?  Or are we hoping that the
> compiler or the CPU will recognize that we're reusing the variable, and
> actually allocate a new register or a new rename register, respectively?
> The compiler might and a CPU capable of register renaming at all
> probably will, but didn't we intend to reduce rather than increase our
> reliance on luck?
>
> I just took a look at commit cde0fb470f35ef6dc5949d3b11137dd27ca2672b,
> and it does look as problematic as I had thought from reading your
> message. :-(

I see what you mean, and maybe we never got proper interleaving anyway.
But MD4 and MD5 are faster than before at the same x3 interleave factor.
Anyway, reverting to tmp arrays is easy.
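
For reference, the two styles under discussion might look roughly like the sketch below. This is not the code from the commit: the function names are hypothetical, vtype is a scalar stand-in for the real vector type, and the round function is an MD5-style F(x,y,z) = z ^ (x & (y ^ z)) chosen only for illustration. Both variants compute the same values; the open question is only how the compiler allocates registers for tmp.

```c
#include <assert.h>
#include <stdint.h>

#define SIMD_PARA 3        /* interleave factor; x3 is the figure quoted above */

typedef uint32_t vtype;    /* assumption: scalar stand-in for a SIMD vector type */

/* Variant A: one temporary per instance, one narrow loop per operation. */
static void f_tmp_array(vtype out[SIMD_PARA], const vtype x[SIMD_PARA],
                        const vtype y[SIMD_PARA], const vtype z[SIMD_PARA])
{
    vtype tmp[SIMD_PARA];
    int i;

    for (i = 0; i < SIMD_PARA; i++) tmp[i] = y[i] ^ z[i];
    for (i = 0; i < SIMD_PARA; i++) tmp[i] &= x[i];
    for (i = 0; i < SIMD_PARA; i++) out[i] = tmp[i] ^ z[i];
}

/* Variant B: one wide loop and a single tmp shared by all instances.
 * The source now names only one temporary, so interleaving depends on
 * the compiler unrolling the loop and giving each iteration its own
 * register (or on the CPU renaming it). */
static void f_tmp_shared(vtype out[SIMD_PARA], const vtype x[SIMD_PARA],
                         const vtype y[SIMD_PARA], const vtype z[SIMD_PARA])
{
    vtype tmp;
    int i;

    for (i = 0; i < SIMD_PARA; i++) {
        tmp = y[i] ^ z[i];
        tmp &= x[i];
        out[i] = tmp ^ z[i];
    }
}

/* Hypothetical self-check: both variants must agree on every instance.
 * Returns the number of instances compared. */
static int check_tmp_variants(void)
{
    const vtype x[SIMD_PARA] = { 0xFFFFFFFFu, 0x00000000u, 0x0F0F0F0Fu };
    const vtype y[SIMD_PARA] = { 0x12345678u, 0x12345678u, 0xAAAAAAAAu };
    const vtype z[SIMD_PARA] = { 0x9ABCDEF0u, 0x9ABCDEF0u, 0x55555555u };
    vtype a[SIMD_PARA], b[SIMD_PARA];
    int i;

    f_tmp_array(a, x, y, z);
    f_tmp_shared(b, x, y, z);
    for (i = 0; i < SIMD_PARA; i++)
        assert(a[i] == b[i]);
    return SIMD_PARA;
}
```

Equal results say nothing about speed, of course; that can only be judged from benchmarks or from the generated code.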

> Would it be reasonable for us to try my usual approach, with separate
> variables at the outer scope (inside the hashing function, but not
> inside the individual steps)?  And if those are in fact separate
> variables rather than array elements, this implies manual or cpp level
> loop unrolling.

I did try hard-coding SHA1 at x2 without using for loops, and with tmp0 
and tmp1 instead of tmp[i]. It made no difference :-(
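
A minimal sketch of that kind of hand-unrolled x2 code, with the cpp-level variant suggested above, could look like this. It is not the code I tested: the macro and function names are made up for illustration, plain uint32_t stands in for the vector type, and only the SHA-1 round function F(b,c,d) = d ^ (b & (c ^ d)) is shown rather than a full step.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cpp-level unrolling: token pasting gives each interleaved
 * instance its own named variables (tmp0, tmp1) instead of tmp[i]. */
#define F_XOR(n)  tmp##n = c##n ^ d##n
#define F_AND(n)  tmp##n &= b##n
#define F_OUT(n)  *out##n = tmp##n ^ d##n

/* SHA-1 F for two independent instances, interleaved statement by
 * statement so that neither dependency chain has to wait on the other. */
static void sha1_f_x2(uint32_t *out0, uint32_t *out1,
                      uint32_t b0, uint32_t c0, uint32_t d0,
                      uint32_t b1, uint32_t c1, uint32_t d1)
{
    uint32_t tmp0, tmp1;

    F_XOR(0); F_XOR(1);
    F_AND(0); F_AND(1);
    F_OUT(0); F_OUT(1);
}

/* Hypothetical self-check. With b all ones F selects c, with b zero it
 * selects d. Returns the number of instances checked. */
static int check_sha1_f_x2(void)
{
    uint32_t r0, r1;

    sha1_f_x2(&r0, &r1,
              0xFFFFFFFFu, 0x12345678u, 0x9ABCDEF0u,
              0x00000000u, 0x12345678u, 0x9ABCDEF0u);
    assert(r0 == 0x12345678u);
    assert(r1 == 0x9ABCDEF0u);
    return 2;
}
```

Going from x2 to x3 in this style means adding a third column of macro invocations and a tmp2, which is the manual cost compared with changing SIMD_PARA.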

The real deal would be looking at the generated assembly, like you 
suggested to Lei. I have tried this in the past but it's very hard to 
follow. Last time I did, I stripped everything but the actual digest 
part of one hash function. Still, I couldn't tell a thing from the code 
other than counting loads/stores. That alone is not too bad though.

magnum
