Message-ID: <20130126232544.GB2363@openwall.com>
Date: Sun, 27 Jan 2013 03:25:44 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Proposed optimizations to pwsafe

On Sat, Jan 26, 2013 at 11:58:29PM +0100, magnum wrote:
> On 26 Jan, 2013, at 23:51 , Milen Rangelov <gat3way@...il.com> wrote:
> 
> > Hm, I guess the compiler got smarter and was able to generate the bfi_int when not explicitly doing bitselect(). This was not the case some time ago and that's good news. Need to do some experiments and check the ISA generated.
> 
> I think we should do it anyway (for now, at least).

I agree.

> What is much more annoying is that bitselect() on NVIDIA hurts performance (last time I checked). That is just weird. I mean, OK if they do not have a hardware instruction but they should definitely not end up slower than using the spelled out syntax. Or is it harder than I imagine?

What I saw happen when optimizing Sayantan's older mscash2-opencl code -
before we moved the redundant portion of the HMAC out of the loop at
source code level - was that switching from simple bitwise ops to
rotate() and bitselect() precluded the compiler from optimizing out and
moving out of the loop some of the redundant code.  This was on HD 7970
with Catalyst 12.4 (not sure of this last detail, though).  So changing
to rotate() and bitselect() slowed the code down - making it up to 2x
slower (when the compiler would no longer move the redundant SHA-1
computations out of the loop).  However, after the code was properly
optimized, changing to rotate() and bitselect() actually sped it up some
more.  I think something similar may be happening on NVIDIA as well,
with the more complex ops being too opaque for the compiler to "see
through" when optimizing nearby code.  For example, the compiler is
almost certainly aware of the associative and commutative properties of
ADD and OR, which may allow for certain optimizations, but it might not
be aware of any/enough properties of rotate() and bitselect().  This
means that we ought to optimize our code fully ourselves (of course),
and only then expect to see the full advantage of rotate() and
bitselect().
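
To make the comparison concrete, here is a minimal sketch in plain C
(not OpenCL) of what "spelled out" versus rotate()/bitselect() means for
a SHA-1 style round.  The helper names rotl32, bitsel32, ch_plain and
ch_select are hypothetical and exist only for this illustration; they
mimic the documented semantics of the OpenCL built-ins on 32-bit words
and are not taken from the mscash2-opencl or pwsafe kernels.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for OpenCL rotate(x, n) on a 32-bit word:
 * rotate left by n bits, written with plain shifts and OR. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Hypothetical stand-in for OpenCL bitselect(a, b, c):
 * each result bit comes from b where c has a 1, otherwise from a. */
static uint32_t bitsel32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* SHA-1 "choose" function spelled out with simple bitwise ops. */
static uint32_t ch_plain(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (~x & z);
}

/* The same function expressed as one select: x picks between y and z. */
static uint32_t ch_select(uint32_t x, uint32_t y, uint32_t z)
{
    return bitsel32(z, y, x);
}

/* Small sanity check that both spellings agree on a few inputs. */
static void self_check(void)
{
    const uint32_t v[] = { 0u, 0xFFFFFFFFu, 0xF0F0F0F0u, 0x12345678u };
    for (unsigned i = 0; i < 4; i++)
        for (unsigned j = 0; j < 4; j++)
            for (unsigned k = 0; k < 4; k++)
                assert(ch_plain(v[i], v[j], v[k]) == ch_select(v[i], v[j], v[k]));
    assert(rotl32(0x80000001u, 1) == 0x00000003u);
}
```

Both spellings compute the same values; the difference is only in what
the compiler is handed.  The plain form exposes AND/OR/NOT that it can
reassociate, while the select form maps more directly onto a single
instruction where the hardware has one.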

I don't oppose introducing rotate() and bitselect() early on.  On the
contrary, I am for it.  I merely said that it's normal not to see
much/any advantage from them when the rest of the code is not yet fully
optimized.
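
As an illustration of optimizing at source code level, the sketch below
moves a loop-invariant, key-only computation out of the loop by hand
instead of relying on the compiler to do it.  It is a toy in plain C:
expensive_key_setup(), iterate_naive() and iterate_hoisted() are
hypothetical names, and the mixing function is an arbitrary placeholder,
not SHA-1 or the actual HMAC code from the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Toy placeholder for an expensive computation that depends only on
 * the key (in the thread, the redundant portion of the HMAC). */
static uint32_t expensive_key_setup(uint32_t key)
{
    uint32_t s = key ^ 0x36363636u;
    for (int r = 0; r < 64; r++)
        s = ((s << 5) | (s >> 27)) + 0x5A827999u;
    return s;
}

/* Naive form: the key-only work is recomputed on every iteration,
 * unless the compiler manages to hoist it on its own. */
static uint32_t iterate_naive(uint32_t key, uint32_t data, int iterations)
{
    for (int i = 0; i < iterations; i++)
        data = expensive_key_setup(key) + ((data << 1) | (data >> 31));
    return data;
}

/* Hand-hoisted form: the invariant is computed once, before the loop,
 * so the result no longer depends on what the compiler can see through. */
static uint32_t iterate_hoisted(uint32_t key, uint32_t data, int iterations)
{
    const uint32_t ks = expensive_key_setup(key);
    for (int i = 0; i < iterations; i++)
        data = ks + ((data << 1) | (data >> 31));
    return data;
}

/* Both forms must produce identical results for the same inputs. */
static void check_hoisting(void)
{
    assert(iterate_naive(0x12345678u, 0xDEADBEEFu, 100) ==
           iterate_hoisted(0x12345678u, 0xDEADBEEFu, 100));
}
```

With the invariant hoisted by hand, replacing the shift/OR pair inside
the loop with a rotate primitive changes only the loop body and cannot
interfere with hoisting.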

Alexander
