Message-ID: <20150701100505.GA9071@openwall.com>
Date: Wed, 1 Jul 2015 13:05:05 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: optimizing bcrypt cracking on x86

On Thu, Jun 25, 2015 at 07:33:21AM +0300, Solar Designer wrote:
> Regarding the 2x2 MMX2 code on i7-4770K:
> 
> On Wed, Jun 24, 2015 at 07:10:07AM +0300, Solar Designer wrote:
> > On 64-bit builds, though, I only got this to run at cumulative speeds
> > like 780*8 = 6240 c/s, which is worse than 6595 c/s previously seen with
> > OpenMP (and even worse than the slightly better speeds that can be seen
> > with separate independent processes).
> 
> I managed to improve this to 796*8 = 6368 c/s by removing some of the
> large displacements on loads, and instead keeping them in base registers
> (using the extra GPRs that we have in 64-bit mode for this).  For the
> 288 bytes of P, an offset into the middle of this range may be put into
> a register, and then 256 out of the 288 bytes may be accessed via 1-byte
> displacements (or alternatively 248 out of 288, but then we can also
> access the first S-box via the same base register with 0x78 in the
> 1-byte displacement).  Also, remembering that R13 is special just like
> RBP (no without-displacement encoding) can sometimes be helpful.
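
The displacement arithmetic quoted above can be checked with a small sketch.
It assumes a signed 1-byte displacement range of -128..127, the 288 bytes of P
from the message, and (for the second variant) that the first S-box starts
immediately after P.  The function names are hypothetical and are not taken
from the John the Ripper sources.

```c
#include <assert.h>

/* 288 bytes of P (per the message) and the signed 8-bit displacement range */
enum { P_SIZE = 288, DISP8_MIN = -128, DISP8_MAX = 127 };

/*
 * Number of bytes of a region [0, size) reachable with a 1-byte
 * displacement when the base register points at byte offset `base`.
 */
static int disp8_reachable(int base, int size)
{
    assert(size > 0 && base >= 0 && base <= size);
    int lo = base + DISP8_MIN;
    int hi = base + DISP8_MAX;
    if (lo < 0)
        lo = 0;
    if (hi > size - 1)
        hi = size - 1;
    return hi - lo + 1;
}

/*
 * Displacement needed to reach whatever immediately follows the region
 * (assumed here to be the first S-box).
 */
static int disp_to_next(int base, int size)
{
    return size - base;
}
```

With the base in the middle (e.g. offset 144), disp8_reachable(144, P_SIZE)
gives 256.  With the base at offset 288 - 0x78 = 168, disp8_reachable(168,
P_SIZE) gives 248 and disp_to_next(168, P_SIZE) gives 0x78, which still fits
in a 1-byte displacement.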

Another related trick, which I haven't tried yet, is to interleave pairs
of S-boxes (from the same bcrypt instance or from different instances).
Then the same base register could be used to access both S-boxes of a
pair, with "4" in the displacement field (fits in 1 byte, obviously) for
the second S-box in a pair.  The index scaling would be by 8 rather than
by 4, but that costs the same.  This way, only 4 base registers would be
needed to access everything for 2 bcrypt instances, with at most 1-byte
displacements (thus avoiding 4-byte displacements, which appear to cost
extra on Haswell).
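
A minimal C sketch of the interleaved layout, for illustration only.  It pairs
S-boxes 0/1 and 2/3 of a single instance (one of the two options mentioned
above) and uses the standard Blowfish F function.  The type and function names
are hypothetical, this is untested as an optimization, and whether a compiler
actually emits a scale-by-8 address with a displacement of 4 for the second
member is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry holds element i of two S-boxes: byte offsets 8*i and 8*i + 4 */
typedef struct {
    uint32_t first;   /* displacement 0 from the pair's base */
    uint32_t second;  /* displacement 4 from the pair's base */
} sbox_pair;

_Static_assert(sizeof(sbox_pair) == 8, "index scaling must be by 8");
_Static_assert(offsetof(sbox_pair, second) == 4, "second S-box at +4");

/* Build an interleaved pair from two ordinary 256-entry S-boxes */
static void interleave(sbox_pair out[256],
                       const uint32_t a[256], const uint32_t b[256])
{
    for (int i = 0; i < 256; i++) {
        out[i].first = a[i];
        out[i].second = b[i];
    }
}

/* Blowfish F with separate S-boxes, for reference */
static uint32_t bf_f_plain(const uint32_t s0[256], const uint32_t s1[256],
                           const uint32_t s2[256], const uint32_t s3[256],
                           uint32_t x)
{
    uint8_t a = x >> 24, b = x >> 16, c = x >> 8, d = x;
    return ((s0[a] + s1[b]) ^ s2[c]) + s3[d];
}

/* Same F, but only two base pointers are needed for the four S-boxes */
static uint32_t bf_f_interleaved(const sbox_pair s01[256],
                                 const sbox_pair s23[256], uint32_t x)
{
    uint8_t a = x >> 24, b = x >> 16, c = x >> 8, d = x;
    return ((s01[a].first + s01[b].second) ^ s23[c].first) + s23[d].second;
}
```

With two base pointers per instance, two instances need 4 base registers in
total, which matches the count given above.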

Another advantage of such interleaving is that we're guaranteed to have
no cache bank conflict between lookups from these two S-boxes then.  Per
my testing, this is irrelevant for Haswell, but it might be relevant on
other CPUs (and not only CPUs).
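
The no-conflict property can be illustrated under a simplified, hypothetical
bank model: 4-byte-wide banks selected by the address bits starting at bit 2.
This model is an assumption for the sketch, not a description of any specific
CPU.  In an interleaved pair, lookups into the first S-box always have address
bit 2 clear and lookups into the second always have it set, so the two can
never select the same bank.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offsets of element i / j within an interleaved pair of S-boxes */
static uint32_t off_first(unsigned i)  { return 8u * i; }
static uint32_t off_second(unsigned j) { return 8u * j + 4u; }

/* Hypothetical model: bank number = address bits [2, 2 + bank_bits) */
static unsigned bank_of(uint32_t addr, unsigned bank_bits)
{
    assert(bank_bits >= 1 && bank_bits <= 16);
    return (addr >> 2) & ((1u << bank_bits) - 1u);
}

/* Returns 1 if any pair of indices (i, j) maps to the same bank, else 0 */
static int any_conflict(unsigned bank_bits)
{
    for (unsigned i = 0; i < 256; i++)
        for (unsigned j = 0; j < 256; j++)
            if (bank_of(off_first(i), bank_bits) ==
                bank_of(off_second(j), bank_bits))
                return 1;
    return 0;
}
```

Under this model any_conflict() returns 0 for every bank count, since the
lowest bank-select bit always differs between the two S-boxes.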

> This is still not good enough, though.

Alexander
