Message-ID: <20131101214255.GA11091@openwall.com>
Date: Sat, 2 Nov 2013 01:42:55 +0400
From: Solar Designer <solar@...nwall.com>
To: "Sc00bz64@...oo.com" <sc00bz64@...oo.com>, john-dev@...ts.openwall.com
Subject: Re: bcrypt on AVX2 (was: [john-users] Anyone want to benchmark AVX2 code for bcrypt)

On Fri, Nov 01, 2013 at 08:09:39PM +0400, Solar Designer wrote:
> Now let's try computing only 7 instances of bcrypt, to ensure there's
> room left in L1 data cache for other uses (such as for P-boxes, etc.):
>
> solar@...l:~/j/bcrypt/bcryptavx2-hack$ ./bcryptbench64
> Some tests failed, but this may be OK since 7 AVX2 code outputs are correct
> AVX2 bcrypt failed to produce valid hashes.
> Benchmarking...
> AVX2: 7*256 took 1.9343 sec (926.5 h/s) 0
> Normal: 1024 took 1.5486 sec (661.2 h/s) 0

I've also tried with icc 14.0.0, after adding "volatile" before "union"
in three places (to work around what I think is an icc bug, which I also
faced with another program before).  Got similar speeds:

Some tests failed, but this may be OK since 7 AVX2 code outputs are correct
AVX2 bcrypt failed to produce valid hashes.
Benchmarking...
AVX2: 7*256 took 1.9314 sec (927.8 h/s) 0
Normal: 1024 took 1.9710 sec (519.5 h/s) 0

Well, somehow "Normal" was hit by the move from gcc to icc, but this
does not matter much - we know it can be much faster anyway with 2x
interleaving.  The AVX2 speed is still worse than interleaved scalar
code's speed.

I also took a look at the generated assembly code from both gcc and icc.
Both look sane, although there are lots of vmovdqa's due to the way
vpgatherdd is defined.  Maybe half of those vmovdqa's could be avoided
by programming in assembly rather than with intrinsics, since they are
there only to substitute the default value for the masked lanes.  We
don't care what this default value is (any garbage would do), but
_mm256_mask_i32gather_epi32() forces us to specify it.

I don't think avoiding those vmovdqa's would make enough of a difference.
Possibly it'd make no difference at all: I think MOVs are generally free
starting with Ivy Bridge.

On the other hand, I got almost 2x speedup when going from AVX to AVX2 in
php_mt_seed, as tested on this same 4770K CPU.  php_mt_seed's AVX code
speed on this CPU was worse (per MHz) than it used to be on older CPUs,
though, so going to AVX2 was in part a workaround.

Alexander