Message-ID: <CABeUhwuqY24EzkfZ2hywHa949U+PjHorV42E_fbAQ0OQSmaefA@mail.gmail.com>
Date: Fri, 29 Jun 2012 21:56:02 +0200
From: newangels newangels <contact.newangels@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: John the Ripper 1.7.9-jumbo-6

Hi,

Very nice news and so many improvements! Thanks a lot to all of you for
the effort and time.

I just tried to compile a GPU-enabled build on Mac OS X Lion, but
unfortunately got an error.

I do:

make macosx-x86-64-opencl

and I get:

make[1]: *** [common_opencl_pbkdf2.o] Error 1
make: *** [macosx-x86-64-opencl] Error 2

System information: MacBook Pro 17" / ATI 6750M - 1 GB / SSD / OS X Lion

Can some of you help me with this issue?

Thanks a lot in advance,

Regards,
Donovan

2012/6/29, Solar Designer <solar@...nwall.com>:
> Hi,
>
> We've released John the Ripper 1.7.9-jumbo-6 earlier today.  This is a
> "community-enhanced" version, which includes many contributions from JtR
> community members - in fact, that's what it primarily consists of.  It's
> been half a year since 1.7.9-jumbo-5, which is a lot of time, and a lot
> has been added to jumbo since then.  Even though it's just a one-digit
> change in the version number, this is in fact the biggest single jumbo
> update we've made so far.  It appears that between -5 and -6 the source
> code grew by over 1 MB, or by over 40,000 lines of code (and that's not
> counting lines that were changed as opposed to added).  The biggest new
> thing is integrated GPU support, both CUDA and OpenCL - although only
> for a subset of the hash and non-hash types, not for all that are
> supported on CPU.  (Also, it is currently efficient only for so-called
> "slow" hashes and for the "non-hashes" that we chose to support on GPU.
> For "fast" hashes, it is just a development milestone, albeit a
> desirable one as well.)  The other biggest new thing is the addition of
> support for many more "non-hashes" and hashes (see below).
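[Editor's note: the "slow" vs. "fast" hash distinction above comes down to how
much work each candidate password costs. A minimal Python sketch of the idea,
in the spirit of WPA-PSK's PBKDF2-HMAC-SHA-1 with 4096 iterations (the
password and salt strings here are made up; real WPA-PSK uses the SSID as the
salt and derives a 256-bit key):]

```python
import hashlib

# "Fast" hash: a single primitive invocation per candidate password,
# so a GPU must be fed candidates extremely quickly to stay busy.
fast = hashlib.md5(b"password").hexdigest()

# "Slow" hash (sketch): key stretching multiplies the work per candidate.
# PBKDF2-HMAC-SHA-1 with 4096 iterations means thousands of HMAC
# computations per guess - ideal for keeping a GPU fully occupied.
slow = hashlib.pbkdf2_hmac("sha1", b"password", b"example-ssid", 4096)

print(fast)        # one MD5 compression per guess...
print(slow.hex())  # ...versus ~4096 HMAC-SHA-1 invocations per guess
```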
>
> You may download John the Ripper 1.7.9-jumbo-6 at the usual place:
>
> http://www.openwall.com/john/
>
> With so many changes, even pushing this release out was difficult.
> Despite the statement that "jumbo is buggy by definition", we did try
> to eliminate as many bugs as we reasonably could - but after a week of
> mad testing and bug-fixing, I chose to release the tree as-is, only
> documenting the remaining known bugs (below and in doc/BUGS).  Still, we
> ended up posting over 1200 messages to john-dev in June - even though in
> prior months we did not even hit 500.  Indeed, we did run plenty of
> tests and fix plenty of bugs, which you won't see in this release.
>
> I've included a lengthy description of some of the changes below, and
> below that I'll add some benchmark results that I find curious (such as
> for bcrypt on CPU vs. GPU).
>
> Direct code contributors to 1.7.9-jumbo-6 (since 1.7.9-jumbo-5), by
> commit count:
>
> magnum
> Dhiru Kholia
> Frank Dittrich
> JimF (Jim Fougeron)
> myrice (Dongdong Li)
> Claudio Andre
> Lukas Odzioba
> Solar Designer
> Sayantan Datta
> Samuele Giovanni Tonon
> Tavis Ormandy
> bartavelle (Simon Marechal)
> Sergey V
> bizonix
> Robert Veznaver
> Andras
>
> New non-hashes:
> * Mac OS X keychains [OpenMP] (Dhiru)
>   - based on research from extractkeychain.py by Matt Johnston
> * KeePass 1.x files [OpenMP] (Dhiru)
>   - keepass2john is based on ideas from kppy by Karsten-Kai Koenig
>     http://gitorious.org/kppy/kppy
> * Password Safe [OpenMP, CUDA, OpenCL] (Dhiru, Lukas)
> * ODF files [OpenMP] (Dhiru)
> * Office 2007/2010 documents [OpenMP] (Dhiru)
>   - office2john is based on test-dump-msole.c by Jody Goldberg and
>     OoXmlCrypto.cs by Lyquidity Solutions Limited
> * Mozilla Firefox, Thunderbird, SeaMonkey master passwords [OpenMP] (Dhiru)
>   - based on FireMaster and FireMasterLinux
>     http://code.google.com/p/rainbowsandpwnies/wiki/FiremasterLinux
> * RAR -p mode encrypted archives (magnum)
>   - RAR -hp mode was supported previously, now both modes are
>
> New challenge/responses, MACs:
> * WPA-PSK [OpenMP, CUDA, OpenCL] (Lukas, Solar)
>   - CPU code is loosely based on Aircrack-ng
>     http://www.aircrack-ng.org
>     http://openwall.info/wiki/john/WPA-PSK
> * VNC challenge/response authentication [OpenMP] (Dhiru)
>   - based on VNCcrack by Jack Lloyd
>     http://www.randombit.net/code/vnccrack/
> * SIP challenge/response authentication [OpenMP] (Dhiru)
>   - based on SIPcrack by Martin J. Muench
> * HMAC-SHA-1, HMAC-SHA-224, HMAC-SHA-256, HMAC-SHA-384, HMAC-SHA-512
>   (magnum)
>
> New hashes:
> * IBM RACF [OpenMP] (Dhiru)
>   - thanks to Nigel Pentland (author of CRACF) and Main Framed for
>     providing algorithm details, sample code, a sample RACF binary
>     database, and test vectors
> * sha512crypt (SHA-crypt) [OpenMP, CUDA, OpenCL] (magnum, Lukas, Claudio)
>   - previously supported in 1.7.6+ only via the "generic crypt(3)" interface
> * sha256crypt (SHA-crypt) [OpenMP, CUDA] (magnum, Lukas)
>   - previously supported in 1.7.6+ only via the "generic crypt(3)" interface
> * DragonFly BSD SHA-256 and SHA-512 based hashes [OpenMP] (magnum)
> * Django 1.4 [OpenMP] (Dhiru)
> * Drupal 7 $S$ phpass-like (based on SHA-512) [OpenMP] (magnum)
> * WoltLab Burning Board 3 [OpenMP] (Dhiru)
> * New EPiServer default (based on SHA-256) [OpenMP] (Dhiru)
> * GOST R 34.11-94 [OpenMP] (Dhiru, Sergey V, JimF)
> * MD4 support in "dynamic" hashes (user-configurable) (JimF)
>   - previously, only MD5 and SHA-1 were supported in "dynamic"
> * Raw-SHA1-LinkedIn (raw SHA-1 with the first 20 bits zeroed) (JimF)
>
> Alternate implementations for previously supported hashes:
> * Faster raw SHA-1 (raw-sha1-ng, password length up to 15) (Tavis)
>
> OpenMP support in new formats:
> * Mac OS X keychains (Dhiru)
> * KeePass 1.x files (Dhiru)
> * Password Safe (Lukas)
> * ODF files (Dhiru)
> * Office 2007/2010 documents (Dhiru)
> * Mozilla Firefox, Thunderbird, SeaMonkey master passwords (Dhiru)
> * WPA-PSK (Solar)
> * VNC challenge/response authentication (Dhiru)
> * SIP challenge/response authentication (Dhiru)
> * IBM RACF (Dhiru)
> * DragonFly BSD SHA-256 and SHA-512 based hashes (magnum)
> * Django 1.4 (Dhiru)
> * Drupal 7 $S$ phpass-like (based on SHA-512) (magnum)
> * WoltLab Burning Board 3 (Dhiru)
> * New EPiServer default (based on SHA-256) (Dhiru)
> * GOST R 34.11-94 (Dhiru, JimF)
>
> OpenMP support for previously supported hashes that lacked it:
> * Mac OS X 10.4 - 10.6 salted SHA-1 (magnum)
> * DES-based tripcodes (Solar)
> * Invision Power Board 2.x salted MD5 (magnum)
> * HTTP Digest access authentication MD5 (magnum)
> * MySQL (old) (Solar)
>
> CUDA support for:
> * phpass MD5-based "portable hashes" (Lukas)
> * md5crypt (FreeBSD-style MD5-based crypt(3) hashes) (Lukas)
> * sha512crypt (glibc 2.7+ SHA-crypt) (Lukas)
> * sha256crypt (glibc 2.7+ SHA-crypt) (Lukas)
> * Password Safe (Lukas)
> * WPA-PSK (Lukas)
> * Raw SHA-224, raw SHA-256 [inefficient] (Lukas)
> * MSCash (DCC) [not working reliably yet] (Lukas)
> * MSCash2 (DCC2) [not working reliably yet] (Lukas)
> * Raw SHA-512 [not working reliably yet] (myrice)
> * Mac OS X 10.7 salted SHA-512 [not working reliably yet] (myrice)
>   - we have already identified the problem with the above two, and a
>     post-1.7.9-jumbo-6 fix should be available shortly - please ask on
>     john-users if interested in trying it out
>
> OpenCL support for:
> * phpass MD5-based "portable hashes" (Lukas)
> * md5crypt (FreeBSD-style MD5-based crypt(3) hashes) (Lukas)
> * sha512crypt (glibc 2.7+ SHA-crypt) (Claudio)
>   - suitable for NVIDIA cards, faster than the CUDA implementation above
>     http://openwall.info/wiki/john/OpenCL-SHA-512
> * bcrypt (OpenBSD-style Blowfish-based crypt(3) hashes) (Sayantan)
>   - pre-configured for the AMD Radeon HD 7970, will likely fail on
>     others unless WORK_GROUP_SIZE is adjusted in opencl_bf_std.h and
>     opencl/bf_kernel.cl; the achieved level of performance is CPU-like
>     (bcrypt is known to be somewhat GPU-unfriendly - a lot more so than
>     SHA-512)
>     http://openwall.info/wiki/john/GPU/bcrypt
> * MSCash2 (DCC2) (Sayantan)
>   - with optional and experimental multi-GPU support as a compile-time
>     hack (even an AMD+NVIDIA mix), by editing init() in opencl_mscash2_fmt.c
> * Password Safe (Lukas)
> * WPA-PSK (Lukas)
> * RAR (magnum)
> * MySQL 4.1 double-SHA-1 [inefficient] (Samuele)
> * Netscape LDAP salted SHA-1 (SSHA) [inefficient] (Samuele)
> * NTLM [inefficient] (Samuele)
> * Raw MD5 [inefficient] (Dhiru, Samuele)
> * Raw SHA-1 [inefficient] (Samuele)
> * Raw SHA-512 [not working properly yet] (myrice)
> * Mac OS X 10.7 salted SHA-512 [not working properly yet] (myrice)
>   - we have already identified the problem with the above two, and a
>     post-1.7.9-jumbo-6 fix should be available shortly - please ask on
>     john-users if interested in trying it out
>
> Several of these require byte-addressable store (any NVIDIA card, but
> only the 5000 series or newer if AMD/ATI).  Also, OpenCL kernels for
> "slow" hashes/non-hashes (e.g. RAR) may cause an "ASIC hang" on certain
> AMD/ATI cards with recent driver versions.  We'll try to address these
> issues in a future version.
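[Editor's note: the HMAC-SHA-* formats listed above, as well as the SIP and
VNC challenge/response formats, build on the standard HMAC construction from
RFC 2104. A minimal sketch of the computation such a format verifies, using
Python's standard library (the key and message here are made up):]

```python
import hashlib
import hmac

key = b"secret-key"       # hypothetical key
msg = b"challenge-data"   # hypothetical message

# The stdlib one-liner: HMAC-SHA-256 of msg under key.
tag = hmac.new(key, msg, hashlib.sha256).hexdigest()

# The same value built manually from the RFC 2104 definition:
#   HMAC(K, m) = H((K ^ opad) || H((K ^ ipad) || m))
block = key.ljust(64, b"\x00")            # SHA-256 block size is 64 bytes
ipad = bytes(b ^ 0x36 for b in block)
opad = bytes(b ^ 0x5C for b in block)
manual = hashlib.sha256(opad + hashlib.sha256(ipad + msg).digest()).hexdigest()

print(tag == manual)  # True
```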
>
> AMD XOP (Bulldozer) support added for:
> * Many hashes based on MD4, MD5, SHA-1 (Solar)
>
> Uses of SIMD (MMX assembly, SSE2/AVX/XOP intrinsics) added for:
> * Mac OS X 10.4 - 10.6 salted SHA-1 (magnum)
> * Invision Power Board 2.x salted MD5 (magnum)
> * HTTP Digest access authentication MD5 (magnum)
> * SAP CODVN B (BCODE) MD5 (magnum)
> * SAP CODVN F/G (PASSCODE) SHA-1 (magnum)
> * Oracle 11 (magnum)
>
> Other optimizations:
> * Reduced memory usage for raw-md4, raw-md5, raw-sha1, and nt2 (magnum)
> * Prefer CommonCrypto over OpenSSL on Mac OS X 10.7 (Dhiru)
> * New SSE2 intrinsics code for SHA-1 (JimF, magnum)
> * Smarter use of SSE2 and SSSE3 intrinsics (the latter only if enabled
>   in the compiler at build time) to implement some bit rotates for MD5
>   and SHA-1 (Solar)
> * Assorted optimizations for raw SHA-1 and HMAC-MD5 (magnum)
> * In the RAR format, added inline storing of RAR data in the JtR input
>   file when the original file is small enough (magnum)
> * Added use of the bitslice DES implementation for tripcodes (Solar)
> * Raw-MD5-unicode made "thick" again (that is, not building upon
>   "dynamic"), using much faster code (magnum)
> * Assorted performance tweaks in "salted-sha1" (SSHA) (magnum)
> * Added functions for larger hash tables to several formats (magnum, Solar)
>
> Other assorted enhancements:
> * linux-*-gpu (both CUDA and OpenCL at once), linux-*-cuda,
>   linux-*-opencl, and macosx-x86-64-opencl make targets (magnum et al.)
> * linux-*-native make targets (pass -march=native to gcc) (magnum)
> * New option: --dupe-suppression (for wordlist mode) (magnum)
> * New option: --loopback[=FILE] (implies --dupe-suppression) (magnum)
> * New option: --max-run-time=N for graceful exit after N seconds (magnum)
> * New option: --log-stderr (magnum)
> * New option: --regenerate-lost-salts=N for cracking hashes where we do
>   not have the salt and essentially need to crack it as well (JimF)
> * New unlisted option: --list (for bash completion, GUIs, etc.) (magnum)
>   - --list=[encodings|opencl-devices] (magnum)
>   - --list=cuda-devices (Lukas)
>   - --list=format-details (Frank)
>   - --list=subformats (magnum)
> * New unlisted option: --length=N for reducing the maximum plaintext
>   length of a format, mostly for testing purposes (magnum)
> * Enhanced parameter syntax for --markov: may refer to a configuration
>   file section, may specify the start and/or end in percent of the
>   total (Frank)
> * Make incremental mode restore ETA figures (JimF)
> * In "dynamic", support NUL octets in constants (JimF)
> * In "salted-sha1" (SSHA), support any salt length (magnum)
> * Use the comment and home directory fields from PWDUMP-style input (magnum)
> * Sort the format names list in "john" usage output alphabetically (magnum)
> * New john.conf options subsection "MPI" (magnum)
> * New john.conf config item CrackStatus under Options:Jumbo (magnum)
> * \xNN escape sequence to specify arbitrary characters in rules (JimF)
> * New rule command _N to reject a word unless it is of length N (JimF)
> * Extra wordlist rule sections: Extra, Single-Extra, Jumbo (magnum)
> * Enhanced "Double" external mode sample (JimF)
> * Source $JOHN/john.local.conf by default (magnum)
> * Many format and algorithm names have been changed for consistency (Solar)
> * When intrinsics are in use, the reported algorithm name now tells
>   which ones (SSE2, AVX, or XOP) (Solar)
> * benchmark-unify: a Perl script to unify benchmark output of different
>   versions of JtR for use with relbench (Frank)
> * Per-benchmark speed ratio output added to relbench (Frank)
> * bash completion for JtR (to install: "sudo make bash-completion") (Frank)
> * New program: raw2dyna (a helper to convert raw hashes to "dynamic") (JimF)
> * New program: pass_gen.pl (generates hashes from plaintexts) (JimF, magnum)
> * Many code changes made, many bugs fixed, many new bugs introduced (all)
>
> Now the promised benchmarks.
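[Editor's note: before the benchmarks, a quick illustration of the new rule
features mentioned above. The \xNN escape and the _N length-reject command
could be combined in a john.conf wordlist-rule section like this (the section
name and rules are illustrative, not shipped defaults):]

```ini
# Hypothetical custom rules section in john.conf (or $JOHN/john.local.conf,
# which jumbo-6 now sources by default).
[List.Rules:Length8Bang]
# Reject candidates unless they are exactly 8 characters long (_8),
# then append '!' written via the new \xNN escape (\x21 = '!').
_8 $\x21
```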
> Here's the 1.7.9-jumbo-5 to 1.7.9-jumbo-6 overall speed change on one
> core of an FX-8120 (should be 4.0 GHz turbo), after running through
> benchmark-unify and relbench (about 50 of the new version's benchmark
> results could not be directly compared against results of the previous
> version, and thus are excluded):
>
> Number of benchmarks: 151
> Minimum: 0.84668 real, 0.84668 virtual
> Maximum: 10.92416 real, 10.92416 virtual
> Median: 1.10800 real, 1.10800 virtual
> Median absolute deviation: 0.12531 real, 0.12369 virtual
> Geometric mean: 1.26217 real, 1.26284 virtual
> Geometric standard deviation: 1.47239 real, 1.47274 virtual
>
> Ditto for OpenMP-enabled builds (8 threads, should be 3.1 GHz):
>
> Number of benchmarks: 151
> Minimum: 0.94616 real, 0.48341 virtual
> Maximum: 24.19709 real, 4.29610 virtual
> Median: 1.17609 real, 1.05964 virtual
> Median absolute deviation: 0.17436 real, 0.11465 virtual
> Geometric mean: 1.35493 real, 1.17097 virtual
> Geometric standard deviation: 1.71505 real, 1.36577 virtual
>
> These show that overall we do indeed have a speedup, and that's without
> any GPU stuff.
>
> Also curious is the speedup due to OpenMP in 1.7.9-jumbo-6 (same
> version in both cases), on the same CPU (8 threads):
>
> Number of benchmarks: 202
> Minimum: 0.76235 real, 0.09553 virtual
> Maximum: 30.51791 real, 3.81904 virtual
> Median: 1.01479 real, 0.98287 virtual
> Median absolute deviation: 0.02747 real, 0.03514 virtual
> Geometric mean: 1.71441 real, 0.77454 virtual
> Geometric standard deviation: 2.08823 real, 1.50966 virtual
>
> The 30x maximum speedup (with only 8 threads) is indeed abnormal; it is
> for:
>
> Ratio: 30.51791 real, 3.81904 virtual SIP MD5:Raw
>
> We'll correct the non-OpenMP performance for SIP in the next version.
> For the rest, the maximum speedup is 6.13x for SSH, which is great
> (considering that the CPU clock rate reduces with more threads running,
> and that this is a 4-module CPU rather than a true 8-core).
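[Editor's note: the summary statistics relbench prints above can be reproduced
with a few lines of Python. The ratios below are illustrative, not the actual
benchmark data:]

```python
import math
import statistics

# Hypothetical per-benchmark speedup ratios (new c/s divided by old c/s).
ratios = [0.85, 0.95, 1.05, 1.11, 1.20, 1.45, 2.10, 10.9]

# Geometric mean: the natural "overall" speedup figure for ratios,
# since per-test speedups combine multiplicatively.
gmean = math.exp(statistics.fmean(math.log(r) for r in ratios))

# Median and median absolute deviation: a robust center and spread,
# largely insensitive to a single extreme outlier (like the 30x SIP case).
med = statistics.median(ratios)
mad = statistics.median(abs(r - med) for r in ratios)

print(f"geometric mean {gmean:.5f}, median {med:.5f}, MAD {mad:.5f}")
```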
> Here are the top 10 OpenMP performers (excluding SIP):
>
> Ratio: 6.13093 real, 0.77210 virtual SSH RSA/DSA (one 2048-bit RSA and one 1024-bit DSA key):Raw
> Ratio: 6.05882 real, 0.75737 virtual NTLMv2 C/R MD4 HMAC-MD5:Many salts
> Ratio: 6.04342 real, 0.75548 virtual LMv2 C/R MD4 HMAC-MD5:Many salts
> Ratio: 5.92830 real, 0.74108 virtual GOST R 34.11-94:Raw
> Ratio: 5.81605 real, 0.73986 virtual sha256crypt (rounds=5000):Raw
> Ratio: 5.65289 real, 0.70523 virtual sha512crypt (rounds=5000):Raw
> Ratio: 5.63333 real, 0.72034 virtual Drupal 7 $S$ SHA-512 (x16385):Raw
> Ratio: 5.56435 real, 0.69937 virtual OpenBSD Blowfish (x32):Raw
> Ratio: 5.50484 real, 0.69682 virtual Password Safe SHA-256:Raw
> Ratio: 5.49613 real, 0.68814 virtual Sybase ASE salted SHA-256:Many salts
>
> The worst regression is for:
>
> Ratio: 0.76235 real, 0.09553 virtual LM DES:Raw
>
> It is known that our current LM hash code does not scale well, and it
> is very fast even with one thread (close to the bottleneck of the
> current interface).  It is in fact better not to use OpenMP for LM
> hashes yet, or to keep the thread count low (e.g., 4 behaves better
> than 8).  The low median and mean speedups are because many hashes
> still lack OpenMP support - mostly the "fast" ones, where we'd bump
> into the bottleneck anyway.  We might deal with this later.  For "slow"
> hashes, the speedup with OpenMP is close to perfect (5x to 6x for this
> CPU).
>
> Now to the new stuff.  The effect of XOP (make linux-x86-64-xop):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=md5
> Benchmarking: FreeBSD MD5 [128/128 XOP intrinsics 8x]... (8xOMP) DONE
> Raw:    204600 c/s real, 25625 c/s virtual
>
> -5 achieved at most:
>
> user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=md5
> Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
> Raw:    158208 c/s real, 19751 c/s virtual
>
> with "make linux-x86-64i" (icc pre-compiled SSE2 intrinsics), and only:
>
> user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=md5
> Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
> Raw:    141312 c/s real, 17664 c/s virtual
>
> with "make linux-x86-64-xop", because -5 did not yet use XOP for MD5
> (nor for MD4 and SHA-1), only knowing how to use it for DES (which it
> did).
>
> So we got an over 20% speedup due to XOP here.
>
> Similarly, the best raw SHA-1 result with -5 was:
>
> user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=raw-sha1
> Benchmarking: Raw SHA-1 [SSE2i 8x]... DONE
> Raw:    13067K c/s real, 13067K c/s virtual
>
> whereas -6, with JimF's and magnum's optimizations and with XOP,
> achieves:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=raw-sha1
> Benchmarking: Raw SHA-1 [128/128 XOP intrinsics 8x]... DONE
> Raw:    23461K c/s real, 23698K c/s virtual
>
> and, with Tavis' contribution:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=raw-sha1-ng
> Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 XOP intrinsics 4x]... DONE
> Raw:    28024K c/s real, 28024K c/s virtual
>
> So that's an over 2x speedup if we can accept the length-15 limit, or
> an almost 80% speedup otherwise.
>
> Note: all of the raw SHA-1 benchmarks above are for one CPU core, not
> for the entire chip (no OpenMP for fast hashes like this yet, but
> there's MPI, and there are always separate process invocations...)
>
> On to more important stuff: sha512crypt on CPU vs. GPU.
>
> For reference, here's what we would get with the previous version,
> using the glibc implementation of SHA-crypt:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=crypt -sub=sha512crypt
> Benchmarking: generic crypt(3) SHA-512 rounds=5000 [?/64]... (8xOMP) DONE
> Many salts:     1518 c/s real, 189 c/s virtual
> Only one salt:  1515 c/s real, 189 c/s virtual
>
> Now we also have a builtin implementation, although it still uses
> OpenSSL for the SHA-512 primitive (it doesn't have its own SHA-512 yet -
> adding that and making use of SIMD would provide much additional
> speedup; this is a to-do item for us):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt
> Benchmarking: sha512crypt (rounds=5000) [64/64]... (8xOMP) DONE
> Raw:    2045 c/s real, 256 c/s virtual
>
> So it is about 35% faster.  Let's try GPUs, first a GTX 570 at 1600 MHz
> (a card that is vendor-overclocked to that frequency):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-cuda
> Benchmarking: sha512crypt (rounds=5000) [CUDA]... DONE
> Raw:    3833 c/s real, 3833 c/s virtual
>
> Another 2x speedup here, but that's still not it.  Let's see:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl
> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
> Using device 0: GeForce GTX 570
> Building the kernel, this could take a while
> Local work size (LWS) 512, global work size (GWS) 7680
> Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
> Raw:    11405 c/s real, 11349 c/s virtual
>
> And now this is it - Claudio's OpenCL code is really good on NVIDIA,
> giving us a 5.5x speedup over the CPU.  (SHA-512 is not as GPU-friendly
> as e.g. MD5, but it is friendly enough for some decent speedup.)
>
> Let's also try the AMD Radeon HD 7970 (normally a faster card), at
> stock clocks:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Building the kernel, this could take a while
> Elapsed time: 17 seconds
> Local work size (LWS) 32, global work size (GWS) 16384
> Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
> Raw:    5144 c/s real, 3276K c/s virtual
>
> Not as much luck here yet.
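[Editor's note: the "rounds=5000" figure above means the SHA-512 primitive is
applied thousands of times per candidate password. A much-simplified Python
sketch of such iterated stretching - this is NOT the real sha512crypt
algorithm, which interleaves password and salt bytes in a more elaborate
schedule per Ulrich Drepper's SHA-crypt specification; it only models the
cost:]

```python
import hashlib

def iterated_sha512(password: bytes, salt: bytes, rounds: int = 5000) -> str:
    """Toy key stretching: not real sha512crypt, just its cost model.

    Each round chains another SHA-512 over the previous digest, so a
    candidate costs ~5000x as much as a single raw SHA-512 - which is
    exactly what makes this a "slow" hash worth running on a GPU.
    """
    digest = hashlib.sha512(password + salt).digest()
    for _ in range(rounds - 1):
        digest = hashlib.sha512(digest + password).digest()
    return digest.hex()

h = iterated_sha512(b"password", b"saltsalt")
print(len(h))  # 128 hex characters (512 bits)
```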
> Finally, for comparison and to show how any one of the three OpenCL
> devices may be accessed from john's command line with the --platform
> and --device options, here's the same OpenCL code on the CPU:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl -pla=1 -dev=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 1: AMD FX(tm)-8120 Eight-Core Processor
> Local work size (LWS) 1, global work size (GWS) 1024
> Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
> Raw:    1850 c/s real, 233 c/s virtual
>
> This shows that the code is indeed pretty efficient - almost reaching
> the speed of OpenSSL's specialized code.
>
> Now to bcrypt.  This CPU is pretty good at it:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf
> Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... (8xOMP) DONE
> Raw:    5300 c/s real, 664 c/s virtual
>
> (FWIW, with overclocking I was able to get this to about 5650 c/s, but
> not more - bumping into the 125 W TDP.  The above is at stock clocks.)
>
> This is for "$2a$05", or only 32 iterations, which is used as the
> baseline for benchmarks for historical reasons.  Actual systems often
> use "$2a$08" (8 times slower) to "$2a$10" (32 times slower) these days.
>
> Anyway, the reference cracking speed for bcrypt above is higher than
> the speed for sha512crypt on the same CPU (with the current code at
> least, which admittedly can be optimized much further).  Can we make it
> even higher on a GPU?  Maybe, but not yet, not with the current code:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> ****Please see 'opencl_bf_std.h' for device specific optimizations****
> Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
> Raw:    4143 c/s real, 238933 c/s virtual
>
> user@...l:~/john-1.7.9-jumbo-6/run$ DISPLAY=:0 aticonfig --od-enable --od-setclocks=1225,1375
> AMD Overdrive(TM) enabled
>
> Default Adapter - AMD Radeon HD 7900 Series
> New Core Peak : 1225
> New Memory Peak : 1375
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> ****Please see 'opencl_bf_std.h' for device specific optimizations****
> Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
> Raw:    5471 c/s real, 358400 c/s virtual
>
> It's only with a 30% overclock that the high-end GPU reaches the same
> level of performance as the 2-3 times cheaper CPU.  BTW, the GPU stays
> cool with this overclock (73 C with stock cooling when running
> bf-opencl for a while), precisely because we have to heavily
> under-utilize it: it does not have enough local memory to accommodate
> as many parallel bcrypt computations as we'd need for full occupancy
> and for hiding memory access latencies.
>
> More optimal code may achieve better results, though.
>
> The NVIDIA card also has no luck competing with the CPU at bcrypt yet:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl
> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
> Using device 0: GeForce GTX 570
> ****Please see 'opencl_bf_std.h' for device specific optimizations****
> Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
> Raw:    1137 c/s real, 1137 c/s virtual
>
> Some tuning could provide better numbers, but they stay a lot lower
> than the CPU's and the HD 7970's anyway (for the current code).
>
> Some other GPU benchmarks where I think we achieve decent performance
> (not exactly the best, but on par with competing tools that have had
> GPU support for longer):
>
> GTX 570 1600 MHz:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=phpass-cuda
> Benchmarking: phpass MD5 ($P$9 lengths 1 to 15) [CUDA]... DONE
> Raw:    510171 c/s real, 507581 c/s virtual
>
> HD 7970 925 MHz (stock):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=mscash2-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Optimal Work Group Size:256
> Kernel Execution Speed (Higher is better):1.403044
> Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
> Raw:    92467 c/s real, 92142 c/s virtual
>
> 1225 MHz:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=mscash2-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Optimal Work Group Size:128
> Kernel Execution Speed (Higher is better):1.856949
> Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
> Raw:    121644 c/s real, 121644 c/s virtual
>
> (Would it overheat if actually used?  This is not bcrypt anymore.)
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=rar
> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
> Using device 0: GeForce GTX 570
> Optimal keys per crypt 32768
> (to avoid this test on next run, put "rar_GWS = 32768" in john.conf,
> section [Options:OpenCL])
> Local worksize (LWS) 64, Global worksize (GWS) 32768
> Benchmarking: RAR3 SHA-1 AES (6 characters) [OpenCL]... (8xOMP) DONE
> Raw:    4380 c/s real, 4334 c/s virtual
>
> The HD 7970 card is back to stock clocks here:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=rar -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Optimal keys per crypt 65536
> (to avoid this test on next run, put "rar_GWS = 65536" in john.conf,
> section [Options:OpenCL])
> Local worksize (LWS) 64, Global worksize (GWS) 65536
> Benchmarking: RAR3 SHA-1 AES (6 characters) [OpenCL]... (8xOMP) DONE
> Raw:    7162 c/s real, 468114 c/s virtual
>
> WPA-PSK, on the CPU:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk
> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [32/64]... (8xOMP) DONE
> Raw:    1980 c/s real, 247 c/s virtual
>
> (No SIMD yet; it could run several times faster.)  CUDA:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk-cuda
> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [CUDA]... (8xOMP) DONE
> Raw:    32385 c/s real, 16695 c/s virtual
>
> OpenCL on the faster card (stock clock):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Max local work size 256
> Optimal local work size = 256
> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [OpenCL]... (8xOMP) DONE
> Raw:    55138 c/s real, 42442 c/s virtual
>
> That's a 27x speedup over the CPU here, although presumably the CPU
> code is further from optimal.
>
> ...Hey, what are you doing here?  That message was way too long, you
> couldn't possibly have read this far.  I'll just presume you scrolled
> to the end.  There's good stuff you have missed above, so please scroll
> up. ;-)
>
> As usual, feedback is welcome on the john-users list.  I realize that
> we're currently missing usage instructions for much of the new stuff,
> so please just ask on john-users - and try to make your questions
> specific.  That way, code contributors will also be prompted/forced to
> contribute documentation, and we'll get it under doc/ and on the wiki -
> in fact, you can contribute to that too.
>
> Alexander