john-dev - Re: optimizing bcrypt cracking on x86


Message-ID: <20150624210302.GD29169@openwall.com>
Date: Thu, 25 Jun 2015 00:03:02 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: optimizing bcrypt cracking on x86

On Wed, Jun 24, 2015 at 11:56:43AM -0400, Alain Espinosa wrote:
> One thing worth trying is to interleave scalar instructions with AVX2.

I didn't try exactly that - interleaving within the same thread - but I
did try running a mix of AVX2 and scalar bcrypt threads on the i7-4770K
back in 2013.  No luck.  On your HT-less CPU, you will actually have to
go for the effort of mixing instructions within one thread, and please
feel free to try, but based on that old result I don't expect you'd have
any luck.

> ...shldl $16,tmp1,tmp2
> (Latency 2, throughput 1)

Per these files:

GenuineIntel00306C3_HaswellXeon_InstLatX64.txt
GenuineIntel00306C3_Haswell_InstLatX64.txt
GenuineIntel00306C3_Haswell_InstLatX64.txt-2013

from http://users.atw.hu/instlatx64/

"SHLD r32, r32, imm8" is latency 1, throughput 0.5.

("-2013" is my own older saved version, downloaded in 2013.  In fact, my
copies of the *.txt might also be older than what's currently on the
website.)

> Faster is:
> mov tmp2, tmp1
> Shl tmp2, 16
> (Latency 2, throughput 0.75)

You mean SHR, not SHL.  Yes, I was using SHLD, but the equivalent
sequence will use SHR.

I've just tried this as well.  I got speedup for 1 thread/core, but
significant slowdown for 2 threads/core.  I also tried moving the two
instructions apart, which didn't affect the speeds much.
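
A minimal C model of why the two sequences are interchangeable here.
It assumes that only the low 16 bits of tmp2 are used afterwards (for
example as byte indices), which the messages above do not state
explicitly.  The function names shld16, mov_shr16, and check_low16 are
made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Models "shldl $16,tmp1,tmp2" (AT&T operand order): tmp2 is shifted
 * left by 16 and its low 16 bits are filled from the top of tmp1. */
uint32_t shld16(uint32_t tmp1, uint32_t tmp2)
{
    return (tmp2 << 16) | (tmp1 >> 16);
}

/* Models "mov tmp2, tmp1" followed by "shr tmp2, 16" (Intel order). */
uint32_t mov_shr16(uint32_t tmp1)
{
    uint32_t tmp2 = tmp1;
    return tmp2 >> 16;
}

/* Assumption: only the low 16 bits of the result are consumed later,
 * so the differing upper halves do not matter. */
void check_low16(uint32_t tmp1, uint32_t old_tmp2)
{
    assert((shld16(tmp1, old_tmp2) & 0xffff) ==
           (mov_shr16(tmp1) & 0xffff));
}
```

The upper 16 bits differ: SHLD keeps the old low half of tmp2 there,
while MOV + SHR leaves zeroes.  MOV + SHR also drops the dependency on
the previous value of tmp2.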

> Note that for Haswell, shld is slower in throughput than on older architectures.

I am not seeing that.

> ...bextr %r14d,La,tmp1;
> 
> I get the BMI speedup using shrx. I think bextr is similar to shld,
> and it is faster to use shrx followed by and. Unfortunately, I don't
> have latency/throughput figures for the BMI instructions. There are
> none in the "Intel 64 and IA-32 Architectures Optimization Reference
> Manual", June 2013 version (too old?).

http://users.atw.hu/instlatx64/ has BMI instructions.

BEXTR is latency 2, throughput 1.  SHRX is latency 1, throughput 0.5.

Yes, SHRX + AND might be faster, depending on context I guess.
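
A rough C model of the two ways to extract a bit field, to show that
they produce the same value.  The control word 0x0810 (start 16,
length 8) and the mask 0xff are assumptions chosen for illustration;
the actual contents of %r14d are not given above.  The function names
bextr32 and shrx_and are hypothetical.

```c
#include <stdint.h>

/* Models 32-bit BEXTR: control bits 7:0 give the start bit and bits
 * 15:8 give the field length; the field is zero-extended. */
uint32_t bextr32(uint32_t src, uint32_t control)
{
    uint32_t start = control & 0xff;
    uint32_t len = (control >> 8) & 0xff;
    uint32_t shifted;

    if (start >= 32)
        return 0;               /* field lies entirely above bit 31 */
    shifted = src >> start;
    if (len >= 32)
        return shifted;         /* no masking needed */
    return shifted & ((1u << len) - 1);
}

/* Models SHRX followed by AND with a precomputed mask.  SHRX uses
 * only the low 5 bits of the count for 32-bit operands. */
uint32_t shrx_and(uint32_t src, uint32_t count, uint32_t mask)
{
    return (src >> (count & 31)) & mask;
}
```

For src = 0x12345678, bextr32(src, 0x0810) and shrx_and(src, 16, 0xff)
both give 0x34.  Whether the two-instruction form is actually faster
depends on the surrounding code, as noted above.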

Alexander