Message-ID: <20150624210302.GD29169@openwall.com>
Date: Thu, 25 Jun 2015 00:03:02 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: optimizing bcrypt cracking on x86
On Wed, Jun 24, 2015 at 11:56:43AM -0400, Alain Espinosa wrote:
> One thing worth trying is to interleave scalar instructions with AVX2.
I didn't try exactly that - interleaving within the same thread - but I
did try running a mix of AVX2 and scalar bcrypt threads on the i7-4770K
back in 2013. No luck. On your HT-less CPU, you will actually have to
go for the effort of mixing instructions within one thread, and please
feel free to try, but based on that old result I don't expect you'd have
any luck.
> ...shldl $16,tmp1,tmp2
> (Latency 2, throughput 1)
Per these files:
GenuineIntel00306C3_HaswellXeon_InstLatX64.txt
GenuineIntel00306C3_Haswell_InstLatX64.txt
GenuineIntel00306C3_Haswell_InstLatX64.txt-2013
from http://users.atw.hu/instlatx64/
"SHLD r32, r32, imm8" is latency 1, throughput 0.5.
("-2013" is my own older saved version, downloaded in 2013. In fact, my
copies of the *.txt might also be older than what's currently on the
website.)
> Faster is:
> mov tmp2, tmp1
> Shl tmp2, 16
> (Latency 2, throughput 0.75)
You mean SHR, not SHL. Yes, I was using SHLD, but the equivalent
sequence will use SHR.
I've just tried this as well. I got speedup for 1 thread/core, but
significant slowdown for 2 threads/core. I also tried moving the two
instructions apart, which didn't affect the speeds much.
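For reference, here is a rough portable C model of the two forms being compared. The function names are made up for illustration and this is not the actual JtR code. It assumes only the low 16 bits of the result are consumed afterwards: SHLD also keeps the old low half of the destination in the upper 16 bits, so the two sequences differ there.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "shldl $16, src, dst" (AT&T operand order): dst is shifted left
 * by 16 and its vacated low bits are filled with the top 16 bits of src. */
static uint32_t model_shld16(uint32_t dst, uint32_t src)
{
    return (dst << 16) | (src >> 16);
}

/* Model of the two-instruction replacement: mov dst, src; shr dst, 16. */
static uint32_t model_mov_shr16(uint32_t src)
{
    return src >> 16;
}

/* The two forms agree in the low 16 bits whatever dst held before. */
static void check_low16(uint32_t old_dst, uint32_t src)
{
    assert((model_shld16(old_dst, src) & 0xffff) == model_mov_shr16(src));
}
```

This only shows that the results match where it matters; it says nothing about which form is faster, which is what the measurements above are for.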
> Note that for Haswell, shld is slower in throughput than older architectures.
I am not seeing that.
> ...bextr %r14d,La,tmp1;
>
> I get the BMI speedup using shrx. I think bextr is similar to shld, and that it is faster to use shrx followed by and. Unfortunately I don't have latency/throughput figures for the BMI instructions. There are none in "Intel 64 and IA-32 Architectures Optimization Reference Manual", version 2013, June (too old?).
http://users.atw.hu/instlatx64/ has BMI instructions.
BEXTR is latency 2, throughput 1. SHRX is latency 1, throughput 0.5.
Yes, SHRX + AND might be faster, depending on context I guess.
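To make the substitution concrete, here is a sketch in portable C of what the two variants compute. The helper names and the control value 0x0810 (start bit 16, length 8) are my assumptions for illustration; the quoted code does not show what %r14d actually holds.

```c
#include <assert.h>
#include <stdint.h>

/* Model of 32-bit BEXTR: control bits 7:0 give the start bit, bits 15:8
 * give the field length; the extracted field is zero-extended. */
static uint32_t model_bextr32(uint32_t src, uint32_t control)
{
    uint32_t start = control & 0xff;
    uint32_t len = (control >> 8) & 0xff;

    if (start >= 32)
        return 0;
    src >>= start;
    if (len >= 32)
        return src;
    return src & ((1u << len) - 1);
}

/* Model of SHRX followed by AND: shift by a count taken modulo 32,
 * then apply an explicit mask. */
static uint32_t model_shrx_and(uint32_t src, uint32_t count, uint32_t mask)
{
    return (src >> (count & 31)) & mask;
}

/* Both variants extract the same byte for a start of 16 and length of 8. */
static void check_byte2(uint32_t src)
{
    assert(model_bextr32(src, 0x0810) == model_shrx_and(src, 16, 0xff));
}
```

The SHRX + AND form is two instructions instead of one, so whether it wins would depend on the surrounding code, as noted above.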
Alexander