john-dev - Re: optimizing bcrypt cracking on x86

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20170326185245.GA7140@openwall.com>
Date: Sun, 26 Mar 2017 20:52:45 +0200
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: optimizing bcrypt cracking on x86

On Wed, Jun 24, 2015 at 09:43:47AM +0300, Solar Designer wrote:
> On Wed, Jun 24, 2015 at 07:10:07AM +0300, Solar Designer wrote:
> > Now, the bottleneck does indeed appear to be different.  It appears to
> > be taking too many instructions and uops to extract all of the S-box
> > indices and compute the effective addresses to keep the L1 data cache
> > read ports 100% busy at all times.  (We only keep them about 60% busy.)
> 
> BTW, optimizing the effective address calculation could make a
> difference.  In my testing on Haswell, not of bcrypt code but in
> general, there is performance impact from using larger than 1-byte
> displacement, as well as from using an index register (even if the
> scaling factor is set to 1).  Loads that use only a base register with
> either no displacement or a 1-byte displacement (sufficient e.g. to skip
> P-boxes and get to the first S-box) run faster (or rather, have less
> impact on issue rate of adjacent instructions).

FWIW, I just found this note:

http://www.7-cpu.com/cpu/Haswell.html

    L1 Data Cache Latency = 4 cycles for simple access via pointer
    L1 Data Cache Latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).

So maybe (just maybe) the "performance impact from using larger than
1-byte displacement, as well as from using an index register" was
actually from an extra cycle of latency on the load itself (the effect
of which on issue of other instructions may vary).

> To avoid needing index scaling, we could use 64-bit S-box elements,
> using bits 33 to 2 for data.  Extracting the lowest pre-scaled index is
> then quick.  Extracting the remaining 3, not so much - can't just use
> BEXTR, need an AND.  Maybe the cost of an AND can be amortized, if
> applied to two non-adjacent pre-scaled indices at once, and then BEXTR
> could be used.  BMI's 3-op ANDN instruction may be used to avoid MOVs.
> 
> Unfortunately, when using only a base register, we can't avoid either
> needing large displacements or ADDing them to that register first.  Yet
> in my testing performance impact from using both a large displacement
> and an index at once is worse than from using either of these alone.
> So maybe large_displacement+base is better than small+base+index*4.
> ... Just tested on artificial code: yes, this is so, especially when the
> instruction is a load-op rather than a pure load,
> 
> Unfortunately, there's also performance impact from larger than 1-byte
> immediate values in instructions such as AND.  So we'd need to minimize
> the number of different masks and preload them into registers.  This
> means that this approach is possibly only worthwhile on CPUs with HT,
> where we don't need more than 2x interleaving.

Alexander

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.