Message-ID: <20130809145345.GA493@openwall.com>
Date: Fri, 9 Aug 2013 18:53:45 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Intel OpenCL on CPU and MIC

On Fri, Aug 09, 2013 at 04:03:56PM +0200, magnum wrote:
> On 8 Aug, 2013, at 23:32 , Solar Designer <solar@...nwall.com> wrote:
> > -cl-strict-aliasing was not understood by (and resulted in errors
> > from) Intel's OpenCL compiler.  (Should we make this change
> > standard?)
>
> This is a violation of OpenCL, that option is mandatory ever since
> 1.0 afaik. We can drop it for now but we should probably use it
> selectively - some compilers may produce faster code with this
> option, right?

Maybe.  I'm not familiar with the OpenCL option, but from my experience
gcc's -fstrict-aliasing almost never results in faster code, although
it theoretically might.

> > I am getting roughly the same cumulative speed by building the
> > whole JtR right for Xeon Phi and running it there.  No OpenMP that
> > way for a subtle reason that I can explain separately, but I did
> > run with --fork=228, for a cumulative speed at bcrypt of around
> > 6000 c/s.
>
> A good speed would be at least 10x that, no?

It could be.  With vectorized code, the limiting factor would be L1
cache size, and moreover tricks would probably need to be used to avoid
exceeding that size.  (I'm not sure, but I think MIC supports
conditional processing of each vector element.  The newer AVX-512
certainly does.)

> What could be the reason for such a low figure?

Lack of vectorization.

> What speed do you get from one core?

Cross-compile with icc from host, running directly on the MIC card:

Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... DONE
Raw:	67.1 c/s real, 67.1 c/s virtual

This is 1 process, on one of the 228 logical CPUs.  1.1 GHz.  I'd say
this is a bit low - but only a little bit.  Maybe my ancient Pentium
assembly code will run faster, since Xeon Phi reuses P54C cores (with
some updates, including to 64-bitness - but 64-bitness is not helpful
for bcrypt).

Like I said, the cumulative speed of 228 such processes is about 6000
c/s, so roughly the same as we get with OpenCL.

More single-process speeds, without intrinsics, 1-second benchmarks:

Benchmarking: descrypt, traditional crypt(3) [DES 64/64]... DONE
Many salts:	185411 c/s real, 187247 c/s virtual
Only one salt:	180720 c/s real, 180720 c/s virtual

Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [DES 64/64]... DONE
Many salts:	6400 c/s real, 6400 c/s virtual
Only one salt:	6337 c/s real, 6337 c/s virtual

Benchmarking: md5crypt [MD5 32/64 X2]... DONE
Raw:	1407 c/s real, 1407 c/s virtual

Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... DONE
Raw:	66.6 c/s real, 66.6 c/s virtual

Benchmarking: LM [DES 64/64]... DONE
Raw:	3231K c/s real, 3231K c/s virtual

Benchmarking: AFS, Kerberos AFS [DES 48/64 4K]... DONE
Short:	40300 c/s real, 40300 c/s virtual
Long:	133069 c/s real, 133069 c/s virtual

Benchmarking: tripcode [DES 64/64]... DONE
Raw:	169353 c/s real, 169353 c/s virtual

Benchmarking: dummy [N/A]... DONE
Raw:	8578K c/s real, 8578K c/s virtual

Benchmarking: crypt, generic crypt(3) [?/64]... DONE
Many salts:	30400 c/s real, 30400 c/s virtual
Only one salt:	30494 c/s real, 30494 c/s virtual

With 512-bit MIC intrinsics (except for key setup since I did not
bother encoding the shift counts into vectors yet):

Benchmarking: descrypt, traditional crypt(3) [512/512]... DONE
Many salts:	619419 c/s real, 619419 c/s virtual
Only one salt:	561679 c/s real, 561679 c/s virtual

Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [512/512]... DONE
Many salts:	21374 c/s real, 21374 c/s virtual
Only one salt:	20380 c/s real, 20380 c/s virtual

Benchmarking: LM [512/512]... DONE
Raw:	6816K c/s real, 6816K c/s virtual

Benchmarking: tripcode [512/512]... DONE
Raw:	481882 c/s real, 481882 c/s virtual

Somehow got faster bcrypt in this build:

Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... DONE
Raw:	70.5 c/s real, 70.5 c/s virtual

Like I mentioned, the cumulative speed at descrypt for 228 processes is
around 67M c/s, although in a longer run I observed almost 70M c/s.

For md5crypt, the cumulative speed is around 150k c/s.  Not good, but
that's without SIMD yet - we should add MIC SIMD intrinsics to
sse-intrinsics.c.

> I'm hoping I will find time to start experimenting on 'well' within
> a few weeks.

Note that there's no Xeon Phi in "well" yet.  We still haven't managed
to solve the motherboard incompatibility issue - might need to set up a
new machine for the Xeon Phi card.  The above tests are on Microway's
Xeon Phi system, which they kindly provided me with access to (for a
week, so I won't be able to do much more on it).

Alexander
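
The single-process and cumulative figures quoted in this message can be compared with a little arithmetic. The sketch below is not part of JtR; it only divides each cumulative speed by a naive linear extrapolation (single-process speed times 228). The helper name scaling_ratio is hypothetical, and the pairing of each cumulative figure with a particular build (for example, descrypt with the 512-bit intrinsics "many salts" result) is an assumption made for illustration.

```python
# Figures quoted in the message: (single-process c/s, approximate cumulative c/s).
# Which build each cumulative figure belongs to is an assumption.
LOGICAL_CPUS = 228  # one forked process per logical CPU (--fork=228)

FIGURES = {
    "bcrypt": (67.1, 6000.0),
    "descrypt (512-bit intrinsics, many salts)": (619419.0, 67e6),
    "md5crypt (no SIMD)": (1407.0, 150e3),
}


def scaling_ratio(single_cps, cumulative_cps, processes=LOGICAL_CPUS):
    """Observed cumulative speed divided by naive linear extrapolation."""
    return cumulative_cps / (single_cps * processes)


for name, (single, cumulative) in FIGURES.items():
    ratio = scaling_ratio(single, cumulative)
    linear = single * LOGICAL_CPUS
    print(f"{name}: {ratio:.2f} of the linear extrapolation ({linear:.0f} c/s)")
```

All three ratios come out well below 1. One plausible reading is that the 228 logical CPUs share physical cores, so a lone process gets more of a core than each process does when all logical CPUs are busy. The message itself does not state this, so treat it as an interpretation rather than a measured result.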