john-dev - Re: Intel OpenCL on CPU and MIC

Message-ID: <20130809145345.GA493@openwall.com>
Date: Fri, 9 Aug 2013 18:53:45 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Intel OpenCL on CPU and MIC

On Fri, Aug 09, 2013 at 04:03:56PM +0200, magnum wrote:
> On 8 Aug, 2013, at 23:32 , Solar Designer <solar@...nwall.com> wrote:
> > -cl-strict-aliasing was not understood by (and resulted in errors from) Intel's OpenCL compiler.  (Should we make this change standard?)
> 
> This is a violation of OpenCL, that option is mandatory ever since 1.0 afaik. We can drop it for now but we should probably use it selectively - some compilers may produce faster code with this option, right?

Maybe.  I'm not familiar with the OpenCL option, but from my experience
gcc's -fstrict-aliasing almost never results in faster code, although it
theoretically might.

> > I am getting roughly the same cumulative speed by building the whole JtR right for Xeon Phi and running it there.  No OpenMP that way for a subtle reason that I can explain separately, but I did run with --fork=228, for a cumulative speed at bcrypt of around 6000 c/s.
> 
> A good speed would be at least 10x that, no?

It could be.  With vectorized code, the limiting factor would be L1
cache size, and moreover tricks would probably need to be used to avoid
exceeding that size.  (I'm not sure, but I think MIC supports
conditional processing of each vector element.  The newer AVX-512
certainly does.)
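
To put rough numbers on the L1 cache concern, here is a back-of-the-envelope sketch in Python. It assumes the standard Blowfish state layout (four S-boxes of 256 32-bit words plus an 18-word P-array per bcrypt instance), one bcrypt instance per 32-bit lane of a 512-bit vector, and a 32 KiB L1 data cache per core. The cache size is an assumption about the hardware, not something measured in this thread, and the helper names are illustrative only.

```python
# Rough L1 footprint estimate for vectorized bcrypt (illustrative sketch).

SBOX_BYTES = 4 * 256 * 4       # four S-boxes, 256 entries, 4 bytes each
PARRAY_BYTES = 18 * 4          # P-array: 18 subkeys of 32 bits
STATE_BYTES = SBOX_BYTES + PARRAY_BYTES  # per-instance Blowfish state

VECTOR_BITS = 512
LANES = VECTOR_BITS // 32      # one bcrypt instance per 32-bit lane

L1D_BYTES = 32 * 1024          # ASSUMPTION: 32 KiB L1 data cache per core


def footprint(instances):
    """Bytes of Blowfish state needed for this many concurrent instances."""
    return instances * STATE_BYTES


def max_instances(cache_bytes):
    """How many full instances fit into a cache of the given size."""
    return cache_bytes // STATE_BYTES


print("per-instance state:", STATE_BYTES, "bytes")
print("lanes per vector:  ", LANES)
print("full-vector state: ", footprint(LANES), "bytes")
print("instances in L1:   ", max_instances(L1D_BYTES))
```

Under these assumptions, a fully populated vector needs about 65 KiB of S-box state, roughly twice the assumed L1 size, and that is before the hardware threads sharing the core are taken into account.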

> What could be the reason for such a low figure?

Lack of vectorization.

> What speed do you get from one core?

Cross-compiled with icc on the host, running directly on the MIC card:

Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... DONE
Raw:    67.1 c/s real, 67.1 c/s virtual

This is 1 process, on one of the 228 logical CPUs.  1.1 GHz.

I'd say this is a bit low - but only a little bit.  Maybe my ancient
Pentium assembly code will run faster, since Xeon Phi reuses P54C cores
(with some updates, including to 64-bitness - but 64-bitness is not
helpful for bcrypt).

Like I said, the cumulative speed of 228 such processes is about 6000 c/s,
so roughly the same as we get with OpenCL.

More single-process speeds, without intrinsics, 1-second benchmarks:

Benchmarking: descrypt, traditional crypt(3) [DES 64/64]... DONE
Many salts:     185411 c/s real, 187247 c/s virtual
Only one salt:  180720 c/s real, 180720 c/s virtual

Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [DES 64/64]...  DONE
Many salts:     6400 c/s real, 6400 c/s virtual
Only one salt:  6337 c/s real, 6337 c/s virtual

Benchmarking: md5crypt [MD5 32/64 X2]... DONE
Raw:    1407 c/s real, 1407 c/s virtual

Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... DONE
Raw:    66.6 c/s real, 66.6 c/s virtual

Benchmarking: LM [DES 64/64]... DONE
Raw:    3231K c/s real, 3231K c/s virtual

Benchmarking: AFS, Kerberos AFS [DES 48/64 4K]... DONE
Short:  40300 c/s real, 40300 c/s virtual
Long:   133069 c/s real, 133069 c/s virtual

Benchmarking: tripcode [DES 64/64]... DONE
Raw:    169353 c/s real, 169353 c/s virtual

Benchmarking: dummy [N/A]... DONE
Raw:    8578K c/s real, 8578K c/s virtual

Benchmarking: crypt, generic crypt(3) [?/64]... DONE
Many salts:     30400 c/s real, 30400 c/s virtual
Only one salt:  30494 c/s real, 30494 c/s virtual

With 512-bit MIC intrinsics (except for key setup, since I have not yet
encoded the shift counts into vectors):

Benchmarking: descrypt, traditional crypt(3) [512/512]... DONE
Many salts:     619419 c/s real, 619419 c/s virtual
Only one salt:  561679 c/s real, 561679 c/s virtual

Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [512/512]... DONE
Many salts:     21374 c/s real, 21374 c/s virtual
Only one salt:  20380 c/s real, 20380 c/s virtual

Benchmarking: LM [512/512]... DONE
Raw:    6816K c/s real, 6816K c/s virtual

Benchmarking: tripcode [512/512]... DONE
Raw:    481882 c/s real, 481882 c/s virtual
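
For a quick comparison of the two builds, the sketch below computes the speedup of the 512-bit intrinsics build over the 64-bit build for each DES-based hash. The c/s values are copied from the benchmark output above (the "real" column). The dictionary keys are just labels chosen for this example.

```python
# Speedup of the 512-bit intrinsics build over the plain 64-bit build,
# using the single-process c/s figures reported above.

scalar = {
    "descrypt many salts": 185411,
    "descrypt one salt": 180720,
    "bsdicrypt many salts": 6400,
    "bsdicrypt one salt": 6337,
    "LM": 3231000,        # reported as 3231K
    "tripcode": 169353,
}

simd = {
    "descrypt many salts": 619419,
    "descrypt one salt": 561679,
    "bsdicrypt many salts": 21374,
    "bsdicrypt one salt": 20380,
    "LM": 6816000,        # reported as 6816K
    "tripcode": 481882,
}

speedup = {name: simd[name] / scalar[name] for name in scalar}

for name, ratio in speedup.items():
    print(f"{name:22s} {ratio:.2f}x")
```

The bitslice width goes from 64 to 512 bits, so 8x would be the naive upper bound. The observed ratios are well below that, which is at least consistent with key setup not being vectorized yet, although other factors may contribute.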

Somehow got faster bcrypt in this build:

Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... DONE
Raw:    70.5 c/s real, 70.5 c/s virtual

Like I mentioned, the cumulative speed at descrypt for 228 processes is
around 67M c/s, although in a longer run I observed almost 70M c/s.

For md5crypt, the cumulative speed is around 150k c/s.  Not good, but
that's without SIMD yet - we should add MIC SIMD intrinsics to
sse-intrinsics.c.
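
The cumulative figures can be related to the single-process ones with a little arithmetic. The sketch below assumes that the 228 logical CPUs are 57 cores with 4 hardware threads each, and that the cumulative descrypt figure comes from the intrinsics build (so it is compared against the 619419 c/s many-salts result). Both are assumptions made for this illustration, and the function names are hypothetical.

```python
# How well do 228 processes scale relative to one process running alone?

PROCESSES = 228
THREADS_PER_CORE = 4                      # ASSUMPTION about the hardware
CORES = PROCESSES // THREADS_PER_CORE

# hash name -> (single-process c/s, cumulative c/s for 228 processes)
results = {
    "bcrypt": (67.1, 6000),
    "descrypt": (619419, 67_000_000),     # ASSUMPTION: intrinsics build
    "md5crypt": (1407, 150_000),
}


def efficiency(single, cumulative, processes=PROCESSES):
    """Cumulative speed as a fraction of perfect linear scaling."""
    return cumulative / (processes * single)


def per_core_gain(single, cumulative, cores=CORES):
    """Per-core throughput relative to one process running alone."""
    return (cumulative / cores) / single


for name, (single, cumulative) in results.items():
    print(f"{name:9s} efficiency {efficiency(single, cumulative):.2f}, "
          f"per-core gain {per_core_gain(single, cumulative):.2f}x")
```

Read this way, each process under full load runs at well under half of its standalone speed, while a core with all of its hardware threads busy still delivers more than one standalone process does.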

> I'm hoping I will find time to start experimenting on 'well' within a few weeks.

Note that there's no Xeon Phi in "well" yet.  We still haven't managed
to solve the motherboard incompatibility issue - we might need to set up
a new machine for the Xeon Phi card.

The above tests are on Microway's Xeon Phi system, which they kindly
provided me with access to (for a week, so I won't be able to do much
more on it).

Alexander