Message-ID: <20150624210302.GD29169@openwall.com>
Date: Thu, 25 Jun 2015 00:03:02 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: optimizing bcrypt cracking on x86

On Wed, Jun 24, 2015 at 11:56:43AM -0400, Alain Espinosa wrote:
> One thing worth trying is to interleave scalar instructions with AVX2.

I didn't try exactly that - interleaving within the same thread - but I did try running a mix of AVX2 and scalar bcrypt threads on the i7-4770K back in 2013. No luck. On your HT-less CPU, you will actually have to go for the effort of mixing instructions within one thread, and please feel free to try, but based on that old result I don't expect you'd have any luck.

> ...shldl $16,tmp1,tmp2
> (Latency 2, throughput 1)

Per these files:

GenuineIntel00306C3_HaswellXeon_InstLatX64.txt
GenuineIntel00306C3_Haswell_InstLatX64.txt
GenuineIntel00306C3_Haswell_InstLatX64.txt-2013

from http://users.atw.hu/instlatx64/ "SHLD r32, r32, imm8" is latency 1, throughput 0.5. ("-2013" is my own older saved version, downloaded in 2013. In fact, my copies of the *.txt might also be older than what's currently on the website.)

> Faster is:
> mov tmp2, tmp1
> Shl tmp2, 16
> (Latency 2, throughput 0.75)

You mean SHR, not SHL. Yes, I was using SHLD, but the equivalent sequence will use SHR.

I've just tried this as well. I got speedup for 1 thread/core, but significant slowdown for 2 threads/core. I also tried moving the two instructions apart, which didn't affect the speeds much.

> Note that for Haswell, shld is slower in throughput than older architectures.

I am not seeing that.

> ...bextr %r14d,La,tmp1;
>
> I get the BMI speedup using shrx. I think bextr is similar to shld, and is faster to use shrx followed by and. Unfortunately I don't had BMI instructions latency/throughput. There are none in "Intel 64 and IA-32 Architectures Optimization Reference Manual", version 2013, June (too old?).

http://users.atw.hu/instlatx64/ has BMI instructions. BEXTR is latency 2, throughput 1. SHRX is latency 1, throughput 0.5. Yes, SHRX + AND might be faster, depending on context I guess.

Alexander
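
For reference, here is a minimal C sketch of the two instruction substitutions discussed above, modeling only what the instructions compute, not how fast they run. The function names (shld32, mov_shr32, bextr32, shrx_and32, check_equivalences) and the test values are hypothetical and are not taken from the JtR sources. The BEXTR control value used here (start 16, length 8) is also an assumption, since the thread does not say what %r14d holds.

```c
#include <assert.h>
#include <stdint.h>

/* Model of SHLD r32, r32, imm8 (AT&T: shldl $n,src,dst) for 0 < n < 32:
 * dst is shifted left by n and its vacated low bits are filled with the
 * top n bits of src. */
static inline uint32_t shld32(uint32_t dst, uint32_t src, unsigned n)
{
	return (dst << n) | (src >> (32 - n));
}

/* Model of the MOV + SHR replacement: copy src, then shift right by n. */
static inline uint32_t mov_shr32(uint32_t src, unsigned n)
{
	return src >> n;
}

/* Model of 32-bit BEXTR: ctl bits 7:0 give the start bit, bits 15:8 the
 * field length; the extracted field is zero-extended. */
static inline uint32_t bextr32(uint32_t src, uint32_t ctl)
{
	unsigned start = ctl & 0xff;
	unsigned len = (ctl >> 8) & 0xff;

	if (start >= 32)
		return 0;
	src >>= start;
	return len >= 32 ? src : src & ((1u << len) - 1);
}

/* Model of SHRX followed by AND with a precomputed mask. */
static inline uint32_t shrx_and32(uint32_t src, unsigned start, uint32_t mask)
{
	return (src >> (start & 31)) & mask;
}

/* Hypothetical self-check with arbitrary example values. */
static inline void check_equivalences(void)
{
	uint32_t L = 0x12345678u;

	/* With n = 16, the low 16 bits of the SHLD result equal L >> 16,
	 * whatever dst held before; the upper 16 bits differ. */
	assert((shld32(0xdeadbeefu, L, 16) & 0xffffu) == mov_shr32(L, 16));

	/* Assumed control value: start = 16, length = 8. */
	assert(bextr32(L, (8u << 8) | 16u) == shrx_and32(L, 16, 0xffu));
}
```

Note that SHLD and MOV + SHR agree only in the low 16 bits of the result, so the replacement is safe only where the consumer ignores the upper half. Which sequence is faster is a separate question and, per the numbers above, depends on the surrounding code and on the threads/core setting.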