Message-ID: <49f7790a70e66cdeb803dba4101806ff@smtp.hushmail.com>
Date: Fri, 18 Oct 2013 16:40:06 +0200
From: magnum <john.magnum@...hmail.com>
To: "john-dev@...ts.openwall.com" <john-dev@...ts.openwall.com>
Subject: OpenCL vectorizing how-to.

Not sure anyone listens but anyway,

I take pride in aiming to write device-agnostic OpenCL code. Ideally 
it should support ridiculously weak devices as well as ones much more 
powerful than today's flagships, and ideally with no configuration or 
tweaking at runtime (or worse, at build time). I can't beat Hashcat in 
raw performance on MD4 hashes, but I can try to beat it with support for 
devices not even blueprinted yet.

This week I've been concentrating on vectorizing OpenCL kernels. Until 
now I had made some sporadic experiments with optional vectorizing (with 
width 4, enabled or not at the runtime build of the kernel), but I never 
figured out a safe way to decide when to actually enable it, so it 
remained disabled.

Now I suddenly stumbled over a detail I had totally missed until now: you 
can ask a device about its "preferred vector size" for a given type 
(e.g. int). Empirical tests showed that this was indeed the missing piece 
I needed.
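
For reference, here's roughly what the host-side query looks like (a 
minimal sketch, not verbatim from the tree; the helper name is made up 
for the example):

/*
 * Query a device's native and preferred vector widths for 'int' and
 * 'long' via clGetDeviceInfo(). Error handling omitted for brevity.
 */
#include <stdio.h>
#include <CL/cl.h>

static cl_uint vec_width(cl_device_id dev, cl_device_info param)
{
	cl_uint width = 0;

	clGetDeviceInfo(dev, param, sizeof(width), &width, NULL);
	return width;
}

static void print_vector_widths(cl_device_id dev)
{
	printf("Native vector widths:   int %u, long %u\n",
	       vec_width(dev, CL_DEVICE_NATIVE_VECTOR_WIDTH_INT),
	       vec_width(dev, CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG));
	printf("Preferred vector width: int %u, long %u\n",
	       vec_width(dev, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT),
	       vec_width(dev, CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG));
}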

The first thing I did was add this information to --list=opencl-devices. 
Examples:


         Device #2 (2) name:     Juniper
         Board name:             AMD Radeon HD 6770 Green Edition
(...)
         Native vector widths:   char 16, short 8, int 4, long 2
         Preferred vector width: char 16, short 8, int 4, long 2


         Device #0 (0) name:     GeForce GTX 570
(...)
         Native vector widths:   char 1, short 1, int 1, long 1
         Preferred vector width: char 1, short 1, int 1, long 1


Interestingly enough this finally settled not only *why* CPU devices 
sometimes suffer from vectorizing and sometimes gain a lot from it (I 
did have a pretty good understanding of that), but also how to *know* 
that beforehand and act accordingly:


Platform #0 name: AMD Accelerated Parallel Processing
Platform version: OpenCL 1.2 AMD-APP (1214.3)
(...)
         Device #3 (3) name:     Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
         Native vector widths:   char 16, short 8, int 4, long 2
         Preferred vector width: char 16, short 8, int 4, long 2


Platform #1 name: Intel(R) OpenCL
Platform version: OpenCL 1.2 LINUX
(...)
         Device #0 (4) name:     Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
         Native vector widths:   char 32, short 16, int 8, long 8
         Preferred vector width: char 1, short 1, int 1, long 1


At least with these versions of the drivers, it seems that AMD's CPU 
driver wants us to supply vectorized code, while Intel's driver prefers 
to get scalar code (and will try to auto-vectorize it). Now both will get 
what they need automatically. On a side note, the AMD driver doesn't 
seem to be AVX2-aware yet, while the native figures for Intel are 
somewhat confusing for 'long' but probably indicate AVX2 support.

Anyway, I went ahead and made the vectorizing formats obey this 
information, not only "vectorizing or not" but also using the actual 
recommended *width*: SHA-512 formats should obviously use ulong2 for 
SSE2/AVX while most other formats use uint4. Obvious, but I overlooked 
it in my earlier experiments and used ulong4. Also, future drivers may 
ask for wider vectors when using AVX2 or future technologies - and my 
formats will support that already. The code for achieving all this is 
trivial: the host code supplies a V_WIDTH macro and the kernels build 
accordingly using #ifdefs and macros.
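
To illustrate the pattern (just a sketch; the type and macro names here 
are made up for the example, the real formats use their own):

/* The host passes e.g. "-DV_WIDTH=4" among the kernel build options. */
#if V_WIDTH == 1
typedef uint vtype;
#define VLOAD(p, i)	(p)[(i)]
#elif V_WIDTH == 2
typedef uint2 vtype;
#define VLOAD(p, i)	vload2((i), (p))
#elif V_WIDTH == 4
typedef uint4 vtype;
#define VLOAD(p, i)	vload4((i), (p))
#elif V_WIDTH == 8
typedef uint8 vtype;
#define VLOAD(p, i)	vload8((i), (p))
#elif V_WIDTH == 16
typedef uint16 vtype;
#define VLOAD(p, i)	vload16((i), (p))
#endif

/* The actual hashing code is then written once in terms of vtype, so
   each work-item processes V_WIDTH candidates side by side. */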

The rest of the week I've been adding vectorizing support to more 
formats and improving code/macros to support all widths (except 3 - that 
width is a pig to code, it's weird for our use, and not likely to give 
us anything anyway). In almost all cases, the Juniper and AMD's CPU 
device (as well as my Apple CPU device) get a fine boost from 
vectorizing. And the automagic seems to work like a champ, even if there 
might be cases where vectorizing on e.g. Juniper gives suboptimal results 
due to register spilling. The new option --force-scalar, or the john.conf 
entry "ForceScalar = Y", will disable vectorizing in that case. BTW there 
is also a new "--force-vector-width=N" option, mostly for testing/debugging.
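
Host-side, the width selection boils down to something like this (again 
just a sketch; variable and function names are illustrative, not the 
actual john code):

#include <CL/cl.h>

/*
 * forced_width corresponds to --force-vector-width=N, force_scalar to
 * --force-scalar or the "ForceScalar = Y" john.conf entry.
 */
static cl_uint pick_vector_width(cl_device_id dev, cl_uint forced_width,
                                 int force_scalar)
{
	cl_uint width = 0;

	if (force_scalar)
		return 1;
	if (forced_width)
		return forced_width;

	/* This (hypothetical) format hashes 32-bit words, so ask about
	   'int'; a SHA-512 format would ask about 'long' instead. */
	clGetDeviceInfo(dev, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT,
	                sizeof(width), &width, NULL);

	return width ? width : 1;
}

/* The chosen width then ends up in the build options as -DV_WIDTH=<width>. */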

BTW on Well's Haswell CPU, Intel's auto-vectorizing beats AMD's driver 
running pre-vectorized code every time. Part of that may be that Intel's 
driver supports AVX2 while AMD's does not.

Formats that support vectorizing right now:
krb5pa-sha1-opencl        Kerberos 5 AS-REQ Pre-Auth etype 17/18
ntlmv2-opencl             NTLMv2 C/R
office2007-opencl         Office 2007
office2010-opencl         Office 2010
office2013-opencl         Office 2013
RAKP-opencl               IPMI 2.0 RAKP (RMCP+)
WPAPSK-opencl             WPA/WPA2 PSK


Err... well that was it I think. Have a nice day. BTW note to self after 
wasting a lot of time: There is NO WAY to vectorize RC4, lol. Sometimes 
I'm really thick %-)

magnum
