Message-ID: <49f7790a70e66cdeb803dba4101806ff@smtp.hushmail.com>
Date: Fri, 18 Oct 2013 16:40:06 +0200
From: magnum <john.magnum@...hmail.com>
To: "john-dev@...ts.openwall.com" <john-dev@...ts.openwall.com>
Subject: OpenCL vectorizing how-to.

Not sure anyone listens, but anyway: I take pride in aiming to write device-agnostic OpenCL code. Ideally it should support ridiculously weak devices as well as ones much more powerful than today's flagships, and ideally with no configuration or tweaking at runtime (or worse, at build time). I can't beat Hashcat in raw performance on MD4 hashes, but I can try to beat it with support for devices not even blueprinted yet.

This week I've been concentrating on vectorizing OpenCL kernels. Until now I had made some sporadic experiments with optional vectorizing (with width 4, enabled or not at runtime build of the kernel), but I never figured out a safe way to decide when to actually enable it, so it remained disabled. Then I suddenly stumbled over a detail I had totally missed until now: you can ask a device about its "preferred vector size" for a given type (e.g. int). Empirical tests showed that this was indeed the missing piece I needed.

The first thing I did was add this information to --list=opencl-devices. Examples:

    Device #2 (2) name: Juniper
        Board name: AMD Radeon HD 6770 Green Edition
        (...)
        Native vector widths:   char 16, short 8, int 4, long 2
        Preferred vector width: char 16, short 8, int 4, long 2

    Device #0 (0) name: GeForce GTX 570
        (...)
        Native vector widths:   char 1, short 1, int 1, long 1
        Preferred vector width: char 1, short 1, int 1, long 1

Interestingly enough, this finally settled not only *why* CPU devices sometimes suffer from vectorizing and sometimes gain a lot from it (I did have a pretty good understanding of that), but also how to *know* that beforehand and act accordingly:

    Platform #0 name: AMD Accelerated Parallel Processing
        Platform version: OpenCL 1.2 AMD-APP (1214.3)
        (...)
    Device #3 (3) name: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
        Native vector widths:   char 16, short 8, int 4, long 2
        Preferred vector width: char 16, short 8, int 4, long 2

    Platform #1 name: Intel(R) OpenCL
        Platform version: OpenCL 1.2 LINUX
        (...)
    Device #0 (4) name: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
        Native vector widths:   char 32, short 16, int 8, long 8
        Preferred vector width: char 1, short 1, int 1, long 1

At least with these driver versions, it seems AMD's CPU driver wants us to supply vectorized code, while Intel's driver prefers to get scalar code (which it will then try to auto-vectorize). Now both get what they need automatically. On a side note, the AMD driver doesn't seem to be AVX2-aware yet, while Intel's native figures are somewhat confusing for 'long' but probably indicate AVX2 support.

Anyway, I went ahead and made the vectorizing formats obey this information: not only "vectorize or not", but also the actual recommended *width*. SHA-512 formats should obviously use ulong2 for SSE2/AVX while most other formats use uint4. Obvious, but I overlooked it in my earlier experiments and used ulong4. Also, future drivers may ask for wider vectors when using AVX2 or later technologies, and my formats will already support that. The code for achieving all this is trivial: the host code supplies a V_WIDTH macro and the kernels build accordingly using #ifdefs and macros.

The rest of the week I've been adding vectorizing support to more formats and improving code/macros to support all widths (except 3; that width is a pig to code, it's weird for our use, and not likely to give anything anyway). In almost all cases, the Juniper and AMD's CPU device (as well as my Apple CPU device) get a fine boost from vectorizing. And the automagics seem to work like a champ, even though there might be cases where vectorizing on e.g. the Juniper gives suboptimal results due to register spilling.
A new option --force-scalar, or the john.conf entry "ForceScalar = Y", will disable vectorizing in that case. BTW, there is also a new "--force-vector-width=N" option, mostly for testing/debugging.

BTW, on Well's Haswell CPU, Intel's auto-vectorizing beats AMD running pre-vectorized code every time. Part of that may be that Intel's driver supports AVX2 while AMD's does not.

Formats that support vectorizing right now:

    krb5pa-sha1-opencl  Kerberos 5 AS-REQ Pre-Auth etype 17/18
    ntlmv2-opencl       NTLMv2 C/R
    office2007-opencl   Office 2007
    office2010-opencl   Office 2010
    office2013-opencl   Office 2013
    RAKP-opencl         IPMI 2.0 RAKP (RMCP+)
    WPAPSK-opencl       WPA/WPA2 PSK

Err... well, that was it, I think. Have a nice day.

BTW, note to self after wasting a lot of time: there is NO WAY to vectorize RC4, lol. Sometimes I'm really thick %-)

magnum