Message-ID: <CABeUhwvfqVE7dsarVXs4dFWfDS2Ka2n3-j1BZQUcakv+qun4MQ@mail.gmail.com>
Date: Fri, 29 Jun 2012 22:07:53 +0200
From: newangels newangels <contact.newangels@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: John the Ripper 1.7.9-jumbo-6

Hi Again,

After my GPU build attempt, I tried a classic macosx-x86-64 build;
unfortunately, that crashed again!

On OS X Lion (10.7.4) - MacBook Pro 17"

I ran:

make macosx-x86-64

and got:

ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status
make[1]: *** [../run/john] Error 1
make: *** [macosx-x86-64] Error 2

Any help is welcome.

Thanks,

Regards,

Donovan

2012/6/29, newangels newangels <contact.newangels@...il.com>:
> Hi,
>
> Very nice news & so many improvements! Thanks a lot to all of you for
> your efforts and time.
>
> I just tried to compile a GPU-enabled build on Mac OS X Lion, but
> unfortunately got an error.
>
> I ran:
>
> make macosx-x86-64-opencl
>
> and got:
>
> make[1]: *** [common_opencl_pbkdf2.o] Error 1
> make: *** [macosx-x86-64-opencl] Error 2
>
> System information:
>
> MacBook Pro 17" / ATI 6750M - 1 GB / SSD - OS X Lion
>
> Can any of you help me with this issue?
>
> Thanks a lot in advance,
>
> Regards,
>
> Donovan
>
> 2012/6/29, Solar Designer <solar@...nwall.com>:
>> Hi,
>>
>> We've released John the Ripper 1.7.9-jumbo-6 earlier today.  This is a
>> "community-enhanced" version, which includes many contributions from JtR
>> community members - in fact, that's what it primarily consists of.  It's
>> been half a year since 1.7.9-jumbo-5, which is a lot of time, and a lot
>> has been added to jumbo since then.  Even though it's just a one digit
>> change in the version number, this is in fact the biggest single jumbo
>> update we've made so far.  It appears that between -5 and -6 the source
>> code grew by over 1 MB, or by over 40,000 lines of code (and that's not
>> including lines that were changed as opposed to added).  The biggest new
>> thing is integrated GPU support, both CUDA and OpenCL - although for a
>> subset of the hash and non-hash types only, not for all that are
>> supported on CPU.  (Also, it is efficient only for so-called "slow"
>> hashes now, and for the "non-hashes" that we chose to support on GPU.
>> For "fast" hashes, it is just a development milestone, albeit a
>> desirable one as well.)  The other biggest new thing is the addition of
>> support for many more "non-hashes" and hashes (see below).
>>
>> You may download John the Ripper 1.7.9-jumbo-6 at the usual place:
>>
>> http://www.openwall.com/john/
>>
>> With so many changes, even pushing this release out was difficult.
>> Despite the statement that "jumbo is buggy by definition", we did try
>> to eliminate as many bugs as we reasonably could - but after a week of
>> mad testing and bug-fixing, I chose to release the tree as-is, only
>> documenting the remaining known bugs (below and in doc/BUGS).  Still, we
>> ended up posting over 1200 messages to john-dev in June - even though in
>> prior months we did not even hit 500.  Indeed, we did run plenty of
>> tests and fix plenty of bugs, which you won't see in this release.
>>
>> I've included a lengthy description of some of the changes below, and
>> below that I'll add some benchmark results that I find curious (such as
>> for bcrypt on CPU vs. GPU).
>>
>> Direct code contributors to 1.7.9-jumbo-6 (since 1.7.9-jumbo-5), by
>> commit count:
>>
>> magnum
>> Dhiru Kholia
>> Frank Dittrich
>> JimF (Jim Fougeron)
>> myrice (Dongdong Li)
>> Claudio Andre
>> Lukas Odzioba
>> Solar Designer
>> Sayantan Datta
>> Samuele Giovanni Tonon
>> Tavis Ormandy
>> bartavelle (Simon Marechal)
>> Sergey V
>> bizonix
>> Robert Veznaver
>> Andras
>>
>> New non-hashes:
>> * Mac OS X keychains [OpenMP]  (Dhiru)
>>   - based on research from extractkeychain.py by Matt Johnston
>> * KeePass 1.x files [OpenMP]  (Dhiru)
>>   - keepass2john is based on ideas from kppy by Karsten-Kai Koenig
>>     http://gitorious.org/kppy/kppy
>> * Password Safe [OpenMP, CUDA, OpenCL]  (Dhiru, Lukas)
>> * ODF files [OpenMP]  (Dhiru)
>> * Office 2007/2010 documents [OpenMP]  (Dhiru)
>>   - office2john is based on test-dump-msole.c by Jody Goldberg and
>>   OoXmlCrypto.cs by Lyquidity Solutions Limited
>> * Mozilla Firefox, Thunderbird, SeaMonkey master passwords [OpenMP]  (Dhiru)
>>   - based on FireMaster and FireMasterLinux
>>     http://code.google.com/p/rainbowsandpwnies/wiki/FiremasterLinux
>> * RAR -p mode encrypted archives  (magnum)
>>   - RAR -hp mode was supported previously, now both modes are
>>
>> New challenge/responses, MACs:
>> * WPA-PSK [OpenMP, CUDA, OpenCL]  (Lukas, Solar)
>>   - CPU code is loosely based on Aircrack-ng
>>     http://www.aircrack-ng.org
>>     http://openwall.info/wiki/john/WPA-PSK
>> * VNC challenge/response authentication [OpenMP]  (Dhiru)
>>   - based on VNCcrack by Jack Lloyd
>>     http://www.randombit.net/code/vnccrack/
>> * SIP challenge/response authentication [OpenMP]  (Dhiru)
>>   - based on SIPcrack by Martin J. Muench
>> * HMAC-SHA-1, HMAC-SHA-224, HMAC-SHA-256, HMAC-SHA-384, HMAC-SHA-512  (magnum)
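WPA-PSK's key derivation (the expensive step these formats attack) is plain PBKDF2-HMAC-SHA-1 over the passphrase with the SSID as salt; a minimal sketch using only the Python standard library:

```python
import hashlib

def wpa_psk_pmk(passphrase: str, ssid: str) -> bytes:
    """Derive the 256-bit WPA-PSK pairwise master key (PMK).

    WPA-PSK defines PMK = PBKDF2(HMAC-SHA-1, passphrase, ssid,
    4096 iterations, 32 bytes); those 4096 HMAC-SHA-1 rounds per
    candidate are what makes this a "slow" format well suited to GPUs.
    """
    return hashlib.pbkdf2_hmac("sha1",
                               passphrase.encode("utf-8"),
                               ssid.encode("utf-8"),
                               4096, dklen=32)

pmk = wpa_psk_pmk("password", "IEEE")
print(pmk.hex())
```

The passphrase/SSID values here are arbitrary illustrations, not test vectors from the release.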
>>
>> New hashes:
>> * IBM RACF [OpenMP]  (Dhiru)
>>   - thanks to Nigel Pentland (author of CRACF) and Main Framed for
>>     providing algorithm details, sample code, sample RACF binary
>>     database, test vectors
>> * sha512crypt (SHA-crypt) [OpenMP, CUDA, OpenCL]  (magnum, Lukas, Claudio)
>>   - previously supported in 1.7.6+ only via "generic crypt(3)" interface
>> * sha256crypt (SHA-crypt) [OpenMP, CUDA]  (magnum, Lukas)
>>   - previously supported in 1.7.6+ only via "generic crypt(3)" interface
>> * DragonFly BSD SHA-256 and SHA-512 based hashes [OpenMP]  (magnum)
>> * Django 1.4 [OpenMP]  (Dhiru)
>> * Drupal 7 $S$ phpass-like (based on SHA-512) [OpenMP]  (magnum)
>> * WoltLab Burning Board 3 [OpenMP]  (Dhiru)
>> * New EPiServer default (based on SHA-256) [OpenMP]  (Dhiru)
>> * GOST R 34.11-94 [OpenMP]  (Dhiru, Sergey V, JimF)
>> * MD4 support in "dynamic" hashes (user-configurable)  (JimF)
>>   - previously, only MD5 and SHA-1 were supported in "dynamic"
>> * Raw-SHA1-LinkedIn (raw SHA-1 with first 20 bits zeroed)  (JimF)
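The LinkedIn leak zeroed out the first 20 bits of many of the raw SHA-1 hashes; a sketch of how such hashes can still be matched (mask off the first five hex digits on both sides), assuming hex-encoded digests:

```python
import hashlib

def matches_linkedin_hash(candidate: str, leaked_hex: str) -> bool:
    """Compare a candidate password against a SHA-1 hash whose first
    20 bits (five hex digits) may have been zeroed out."""
    digest = hashlib.sha1(candidate.encode("utf-8")).hexdigest()
    # Ignore the first five hex digits (20 bits) on both sides.
    return digest[5:] == leaked_hex.lower()[5:]

# SHA-1("password") = 5baa61e4...; still matched with the prefix zeroed:
print(matches_linkedin_hash("password",
                            "000001e4c9b93f3f0682250b6cf8331b7ee68fd8"))
```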
>>
>> Alternate implementations for previously supported hashes:
>> * Faster raw SHA-1 (raw-sha1-ng, password length up to 15)  (Tavis)
>>
>> OpenMP support in new formats:
>> * Mac OS X keychains  (Dhiru)
>> * KeePass 1.x files  (Dhiru)
>> * Password Safe  (Lukas)
>> * ODF files  (Dhiru)
>> * Office 2007/2010 documents  (Dhiru)
>> * Mozilla Firefox, Thunderbird, SeaMonkey master passwords  (Dhiru)
>> * WPA-PSK  (Solar)
>> * VNC challenge/response authentication  (Dhiru)
>> * SIP challenge/response authentication  (Dhiru)
>> * IBM RACF  (Dhiru)
>> * DragonFly BSD SHA-256 and SHA-512 based hashes  (magnum)
>> * Django 1.4  (Dhiru)
>> * Drupal 7 $S$ phpass-like (based on SHA-512)  (magnum)
>> * WoltLab Burning Board 3  (Dhiru)
>> * New EPiServer default (based on SHA-256)  (Dhiru)
>> * GOST R 34.11-94  (Dhiru, JimF)
>>
>> OpenMP support for previously supported hashes that lacked it:
>> * Mac OS X 10.4 - 10.6 salted SHA-1  (magnum)
>> * DES-based tripcodes  (Solar)
>> * Invision Power Board 2.x salted MD5  (magnum)
>> * HTTP Digest access authentication MD5  (magnum)
>> * MySQL (old)  (Solar)
>>
>> CUDA support for:
>> * phpass MD5-based "portable hashes"  (Lukas)
>> * md5crypt (FreeBSD-style MD5-based crypt(3) hashes)  (Lukas)
>> * sha512crypt (glibc 2.7+ SHA-crypt)  (Lukas)
>> * sha256crypt (glibc 2.7+ SHA-crypt)  (Lukas)
>> * Password Safe  (Lukas)
>> * WPA-PSK  (Lukas)
>> * Raw SHA-224, raw SHA-256 [inefficient]  (Lukas)
>> * MSCash (DCC) [not working reliably yet]  (Lukas)
>> * MSCash2 (DCC2) [not working reliably yet]  (Lukas)
>> * Raw SHA-512 [not working reliably yet]  (myrice)
>> * Mac OS X 10.7 salted SHA-512 [not working reliably yet]  (myrice)
>>   - we have already identified the problem with the above two, and a
>>     post-1.7.9-jumbo-6 fix should be available shortly - please ask
>>     on john-users if interested in trying it out
>>
>> OpenCL support for:
>> * phpass MD5-based "portable hashes"  (Lukas)
>> * md5crypt (FreeBSD-style MD5-based crypt(3) hashes)  (Lukas)
>> * sha512crypt (glibc 2.7+ SHA-crypt)  (Claudio)
>>   - suitable for NVIDIA cards, faster than the CUDA implementation above
>>   http://openwall.info/wiki/john/OpenCL-SHA-512
>> * bcrypt (OpenBSD-style Blowfish-based crypt(3) hashes)  (Sayantan)
>>   - pre-configured for AMD Radeon HD 7970, will likely fail on others
>>     unless WORK_GROUP_SIZE is adjusted in opencl_bf_std.h and
>>     opencl/bf_kernel.cl; the achieved level of performance is CPU-like
>>     (bcrypt is known to be somewhat GPU-unfriendly - a lot more than
>>     SHA-512)
>>   http://openwall.info/wiki/john/GPU/bcrypt
>> * MSCash2 (DCC2)  (Sayantan)
>>   - with optional and experimental multi-GPU support as a
>>     compile-time hack (even AMD+NVIDIA mix), by editing init() in
>>     opencl_mscash2_fmt.c
>> * Password Safe  (Lukas)
>> * WPA-PSK  (Lukas)
>> * RAR  (magnum)
>> * MySQL 4.1 double-SHA-1 [inefficient]  (Samuele)
>> * Netscape LDAP salted SHA-1 (SSHA) [inefficient]  (Samuele)
>> * NTLM [inefficient]  (Samuele)
>> * Raw MD5 [inefficient]  (Dhiru, Samuele)
>> * Raw SHA-1 [inefficient]  (Samuele)
>> * Raw SHA-512 [not working properly yet]  (myrice)
>> * Mac OS X 10.7 salted SHA-512 [not working properly yet]  (myrice)
>>   - we have already identified the problem with the above two, and a
>>     post-1.7.9-jumbo-6 fix should be available shortly - please ask
>>     on john-users if interested in trying it out
>>
>> Several of these require byte-addressable store (any NVIDIA card, but
>> only 5000 series or newer if AMD/ATI).  Also, OpenCL kernels for "slow"
>> hashes/non-hashes (e.g. RAR) may cause "ASIC hang" on certain AMD/ATI
>> cards with recent driver versions.  We'll try to address these issues in
>> a future version.
>>
>> AMD XOP (Bulldozer) support added for:
>> * Many hashes based on MD4, MD5, SHA-1  (Solar)
>>
>> Uses of SIMD (MMX assembly, SSE2/AVX/XOP intrinsics) added for:
>> * Mac OS X 10.4 - 10.6 salted SHA-1  (magnum)
>> * Invision Power Board 2.x salted MD5  (magnum)
>> * HTTP Digest access authentication MD5  (magnum)
>> * SAP CODVN B (BCODE) MD5  (magnum)
>> * SAP CODVN F/G (PASSCODE) SHA-1  (magnum)
>> * Oracle 11  (magnum)
>>
>> Other optimizations:
>> * Reduced memory usage for raw-md4, raw-md5, raw-sha1, and nt2  (magnum)
>> * Prefer CommonCrypto over OpenSSL on Mac OS X 10.7  (Dhiru)
>> * New SSE2 intrinsics code for SHA-1  (JimF, magnum)
>> * Smarter use of SSE2 and SSSE3 intrinsics (the latter only if enabled
>>   in the compiler at build time) to implement some bit rotates for
>>   MD5, SHA-1  (Solar)
>> * Assorted optimizations for raw SHA-1 and HMAC-MD5  (magnum)
>> * In RAR format, added inline storing of RAR data in JtR input file
>>   when the original file is small enough  (magnum)
>> * Added use of the bitslice DES implementation for tripcodes  (Solar)
>> * Raw-MD5-unicode made "thick" again (that is, not building upon
>>   "dynamic"), using much faster code  (magnum)
>> * Assorted performance tweaks in "salted-sha1" (SSHA)  (magnum)
>> * Added functions for larger hash tables to several formats  (magnum, Solar)
>>
>> Other assorted enhancements:
>> * linux-*-gpu (both CUDA and OpenCL at once), linux-*-cuda,
>>   linux-*-opencl, macosx-x86-64-opencl make targets  (magnum et al.)
>> * linux-*-native make targets (pass -march=native to gcc)  (magnum)
>> * New option: --dupe-suppression (for wordlist mode)  (magnum)
>> * New option: --loopback[=FILE] (implies --dupe-suppression)  (magnum)
>> * New option: --max-run-time=N for graceful exit after N seconds  (magnum)
>> * New option: --log-stderr  (magnum)
>> * New option: --regenerate-lost-salts=N for cracking hashes where we
>>   do not have the salt and essentially need to crack it as well  (JimF)
>> * New unlisted option: --list (for bash completion, GUI, etc.)  (magnum)
>> * --list=[encodings|opencl-devices]  (magnum)
>> * --list=cuda-devices  (Lukas)
>> * --list=format-details  (Frank)
>> * --list=subformats  (magnum)
>> * New unlisted option: --length=N for reducing maximum plaintext
>>   length of a format, mostly for testing purposes  (magnum)
>> * Enhanced parameter syntax for --markov: may refer to a
>>   configuration file section, may specify the start and/or end in
>>   percent of total  (Frank)
>> * Make incremental mode restore ETA figures  (JimF)
>> * In "dynamic", support NUL octets in constants  (JimF)
>> * In "salted-sha1" (SSHA), support any salt length  (magnum)
>> * Use comment and home directory fields from PWDUMP-style input  (magnum)
>> * Sort the format names list in "john" usage output alphabetically  (magnum)
>> * New john.conf options subsection "MPI"  (magnum)
>> * New john.conf config item CrackStatus under Options:Jumbo  (magnum)
>> * \xNN escape sequence to specify arbitrary characters in rules  (JimF)
>> * New rule command _N to reject a word unless it is of length N  (JimF)
>> * Extra wordlist rule sections: Extra, Single-Extra, Jumbo  (magnum)
>> * Enhanced "Double" external mode sample  (JimF)
>> * Source $JOHN/john.local.conf by default  (magnum)
>> * Many format and algorithm names have been changed for consistency  (Solar)
>> * When intrinsics are in use, the reported algorithm name now tells
>>   which ones (SSE2, AVX, or XOP)  (Solar)
>> * benchmark-unify: a Perl script to unify benchmark output of
>>   different versions of JtR for use with relbench  (Frank)
>> * Per-benchmark speed ratio output added to relbench  (Frank)
>> * bash completion for JtR (to install: "sudo make bash-completion")  (Frank)
>> * New program: raw2dyna (helper to convert raw hashes to "dynamic")  (JimF)
>> * New program: pass_gen.pl (generates hashes from plaintexts)  (JimF, magnum)
>> * Many code changes made, many bugs fixed, many new bugs introduced  (all)
>>
>> Now the promised benchmarks.  Here's 1.7.9-jumbo-5 to 1.7.9-jumbo-6
>> overall speed change on one core in FX-8120 (should be 4.0 GHz turbo),
>> after running through benchmark-unify and relbench (yet about 50 of the
>> new version's benchmark results could not be directly compared against
>> results of the previous version, and thus are excluded):
>>
>> Number of benchmarks:           151
>> Minimum:                        0.84668 real, 0.84668 virtual
>> Maximum:                        10.92416 real, 10.92416 virtual
>> Median:                         1.10800 real, 1.10800 virtual
>> Median absolute deviation:      0.12531 real, 0.12369 virtual
>> Geometric mean:                 1.26217 real, 1.26284 virtual
>> Geometric standard deviation:   1.47239 real, 1.47274 virtual
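relbench itself is a Perl script, but the aggregation it reports is straightforward; a sketch of the same statistics over a list of per-benchmark speed ratios (the values below are made up for illustration, not the actual 151 results):

```python
import statistics

# Hypothetical new/old c/s ratios, one per benchmark.
ratios = [0.95, 1.05, 1.10, 1.26, 1.47, 2.10]

print("Median:        ", statistics.median(ratios))
# The geometric mean is the right average for ratios: it is the single
# factor r such that scaling every old result by r gives the same
# overall change, and speedups and slowdowns weigh in symmetrically.
print("Geometric mean:", statistics.geometric_mean(ratios))
```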
>>
>> Ditto for OpenMP-enabled builds (8 threads, should be 3.1 GHz):
>>
>> Number of benchmarks:           151
>> Minimum:                        0.94616 real, 0.48341 virtual
>> Maximum:                        24.19709 real, 4.29610 virtual
>> Median:                         1.17609 real, 1.05964 virtual
>> Median absolute deviation:      0.17436 real, 0.11465 virtual
>> Geometric mean:                 1.35493 real, 1.17097 virtual
>> Geometric standard deviation:   1.71505 real, 1.36577 virtual
>>
>> These show that overall we do indeed have a speedup, and that's without
>> any GPU stuff.
>>
>> Also curious is speedup due to OpenMP in 1.7.9-jumbo-6 (same version in
>> both cases), on the same CPU (8 threads):
>>
>> Number of benchmarks:           202
>> Minimum:                        0.76235 real, 0.09553 virtual
>> Maximum:                        30.51791 real, 3.81904 virtual
>> Median:                         1.01479 real, 0.98287 virtual
>> Median absolute deviation:      0.02747 real, 0.03514 virtual
>> Geometric mean:                 1.71441 real, 0.77454 virtual
>> Geometric standard deviation:   2.08823 real, 1.50966 virtual
>>
>> The 30x maximum speedup (with only 8 threads) is indeed abnormal; it
>> is for:
>>
>> Ratio:  30.51791 real, 3.81904 virtual  SIP MD5:Raw
>>
>> We'll correct the non-OpenMP performance for SIP in the next version.
>> For the rest, the maximum speedup is 6.13x for SSH, which is great
>> (considering that the CPU clock rate reduces with more threads running,
>> and that this is a 4-module CPU rather than a true 8-core).  Here are
>> the top 10 OpenMP performers (excluding SIP):
>>
>> Ratio:  6.13093 real, 0.77210 virtual   SSH RSA/DSA (one 2048-bit RSA and one 1024-bit DSA key):Raw
>> Ratio:  6.05882 real, 0.75737 virtual   NTLMv2 C/R MD4 HMAC-MD5:Many salts
>> Ratio:  6.04342 real, 0.75548 virtual   LMv2 C/R MD4 HMAC-MD5:Many salts
>> Ratio:  5.92830 real, 0.74108 virtual   GOST R 34.11-94:Raw
>> Ratio:  5.81605 real, 0.73986 virtual   sha256crypt (rounds=5000):Raw
>> Ratio:  5.65289 real, 0.70523 virtual   sha512crypt (rounds=5000):Raw
>> Ratio:  5.63333 real, 0.72034 virtual   Drupal 7 $S$ SHA-512 (x16385):Raw
>> Ratio:  5.56435 real, 0.69937 virtual   OpenBSD Blowfish (x32):Raw
>> Ratio:  5.50484 real, 0.69682 virtual   Password Safe SHA-256:Raw
>> Ratio:  5.49613 real, 0.68814 virtual   Sybase ASE salted SHA-256:Many salts
>>
>> The worst regression is for:
>>
>> Ratio:  0.76235 real, 0.09553 virtual   LM DES:Raw
>>
>> It is known that our current LM hash code does not scale well, and is
>> very fast even with one thread (close to the bottleneck of the current
>> interface).  It is in fact better not to use OpenMP for LM hashes yet,
>> or to keep the thread count low (e.g., 4 would behave better than 8).
>> The low median and mean speedup are because many hashes still lack
>> OpenMP support - mostly the "fast" ones, where we'd bump into the
>> bottleneck anyway.  We might deal with this later.  For "slow" hashes,
>> the speedup with OpenMP is close to perfect (5x to 6x for this CPU).
>>
>> Now to the new stuff.  The effect of XOP (make linux-x86-64-xop):
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=md5
>> Benchmarking: FreeBSD MD5 [128/128 XOP intrinsics 8x]... (8xOMP) DONE
>> Raw:    204600 c/s real, 25625 c/s virtual
>>
>> -5 achieved at most:
>>
>> user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=md5
>> Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
>> Raw:    158208 c/s real, 19751 c/s virtual
>>
>> with "make linux-x86-64i" (icc precompiled SSE2 intrinsics), and only:
>>
>> user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=md5
>> Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
>> Raw:    141312 c/s real, 17664 c/s virtual
>>
>> with "make linux-x86-64-xop" because it did not yet use XOP for MD5 (nor
>> for MD4 and SHA-1), only knowing how to use it for DES (which it did).
>>
>> So we got an over 20% speedup due to XOP here.
>>
>> Similarly, for raw SHA-1 best result with -5:
>>
>> user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=raw-sha1
>> Benchmarking: Raw SHA-1 [SSE2i 8x]... DONE
>> Raw:    13067K c/s real, 13067K c/s virtual
>>
>> whereas -6 does, with JimF's and magnum's optimizations and with XOP:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=raw-sha1
>> Benchmarking: Raw SHA-1 [128/128 XOP intrinsics 8x]... DONE
>> Raw:    23461K c/s real, 23698K c/s virtual
>>
>> and with Tavis' contribution:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=raw-sha1-ng
>> Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 XOP intrinsics 4x]... DONE
>> Raw:    28024K c/s real, 28024K c/s virtual
>>
>> So that's an over 2x speedup if we can accept the length 15 limit, or
>> an almost 80% speedup otherwise.
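Both quoted speedup figures follow directly from the c/s numbers above; a quick check of the arithmetic:

```python
# c/s figures from the raw SHA-1 benchmarks above (one CPU core).
jumbo5 = 13067       # -5, SSE2i 8x
jumbo6 = 23461       # -6, XOP intrinsics + optimizations
jumbo6_ng = 28024    # -6, raw-sha1-ng (password length <= 15)

print(f"raw-sha1-ng vs -5: {jumbo6_ng / jumbo5:.2f}x")  # "over 2x" in the text
print(f"raw-sha1 vs -5: {jumbo6 / jumbo5 - 1:.0%}")     # "almost 80%" in the text
```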
>>
>> Note: all of the raw SHA-1 benchmarks above are for one CPU core, not
>> for the entire chip (no OpenMP for fast hashes like this yet, but
>> there's MPI and there are always separate process invocations...)
>>
>> To more important stuff, sha512crypt on CPU vs. GPU:
>>
>> For reference, here's what we would get with the previous version, using
>> the glibc implementation of SHA-crypt:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=crypt -sub=sha512crypt
>> Benchmarking: generic crypt(3) SHA-512 rounds=5000 [?/64]... (8xOMP) DONE
>> Many salts:     1518 c/s real, 189 c/s virtual
>> Only one salt:  1515 c/s real, 189 c/s virtual
>>
>> Now we also have a builtin implementation, although it nevertheless uses
>> OpenSSL for the SHA-512 primitive (it doesn't have its own SHA-512 yet -
>> adding that and making use of SIMD would provide much additional
>> speedup, this is a to-do item for us):
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt
>> Benchmarking: sha512crypt (rounds=5000) [64/64]... (8xOMP) DONE
>> Raw:    2045 c/s real, 256 c/s virtual
>>
>> So it is about 35% faster.  Let's try GPUs, first GTX 570 1600 MHz
>> (a card that is vendor-overclocked to that frequency):
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-cuda
>> Benchmarking: sha512crypt (rounds=5000) [CUDA]... DONE
>> Raw:    3833 c/s real, 3833 c/s virtual
>>
>> Another 2x speedup here, but that's still not it.  Let's see:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl
>> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
>> Using device 0: GeForce GTX 570
>> Building the kernel, this could take a while
>> Local work size (LWS) 512, global work size (GWS) 7680
>> Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
>> Raw:    11405 c/s real, 11349 c/s virtual
>>
>> And now this is it - Claudio's OpenCL code is really good on NVIDIA,
>> giving us a 5.5x speedup over CPU.  (SHA-512 is not as GPU-friendly as
>> e.g. MD5, but is friendly enough for some decent speedup.)
>>
>> Let's also try AMD Radeon HD 7970 (normally a faster card), at stock
>> clocks:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl -pla=1
>> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
>> Using device 0: Tahiti
>> Building the kernel, this could take a while
>> Elapsed time: 17 seconds
>> Local work size (LWS) 32, global work size (GWS) 16384
>> Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
>> Raw:    5144 c/s real, 3276K c/s virtual
>>
>> Not as much luck here yet.  Finally, for comparison and to show how any
>> one of the three OpenCL devices may be accessed from john's command-line
>> with --platform and --device options, the same OpenCL code on the CPU:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl -pla=1 -dev=1
>> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
>> Using device 1: AMD FX(tm)-8120 Eight-Core Processor
>> Local work size (LWS) 1, global work size (GWS) 1024
>> Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
>> Raw:    1850 c/s real, 233 c/s virtual
>>
>> This shows that the code is indeed pretty efficient - almost reaching
>> OpenSSL's specialized code speed.
>>
>> Now to bcrypt.  This CPU is pretty good at it:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf
>> Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... (8xOMP) DONE
>> Raw:    5300 c/s real, 664 c/s virtual
>>
>> (FWIW, with overclocking I was able to get this to about 5650 c/s, but
>> not more - bumping into 125 W TDP.  The above is at stock clocks.)
>>
>> This is for "$2a$05" or only 32 iterations, which is used as baseline
>> for benchmarks for historical reasons.  Actual systems often use
>> "$2a$08" (8 times slower) to "$2a$10" (32 times slower) these days.
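The "$2a$NN" cost field is a base-2 logarithm of bcrypt's iteration count, so each increment doubles the work; the relation used above, as a quick check:

```python
def bcrypt_iterations(cost: int) -> int:
    """bcrypt's "$2a$NN" cost is log2 of the Blowfish key-setup
    iteration count, so each +1 doubles the work."""
    return 2 ** cost

baseline = bcrypt_iterations(5)            # "$2a$05" -> 32 iterations
print(baseline)                            # 32
print(bcrypt_iterations(8) // baseline)    # "$2a$08": 8 times slower
print(bcrypt_iterations(10) // baseline)   # "$2a$10": 32 times slower
```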
>>
>> Anyway, the reference cracking speed for bcrypt above is higher than the
>> speed for sha512crypt on the same CPU (with the current code at least,
>> which admittedly can be optimized much further).  Can we make it even
>> higher on a GPU?  Maybe, but not yet, not with the current code:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
>> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
>> Using device 0: Tahiti
>> ****Please see 'opencl_bf_std.h' for device specific optimizations****
>> Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
>> Raw:    4143 c/s real, 238933 c/s virtual
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ DISPLAY=:0 aticonfig --od-enable --od-setclocks=1225,1375
>> AMD Overdrive(TM) enabled
>>
>> Default Adapter - AMD Radeon HD 7900 Series
>>                   New Core Peak   : 1225
>>                   New Memory Peak : 1375
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
>> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
>> Using device 0: Tahiti
>> ****Please see 'opencl_bf_std.h' for device specific optimizations****
>> Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
>> Raw:    5471 c/s real, 358400 c/s virtual
>>
>> It's only with a 30% overclock that the high-end GPU gets to the same
>> level of performance as the 2-3 times cheaper CPU.  BTW, the GPU stays
>> cool with this overclock (73 C with stock cooling when running bf-opencl
>> for a while), precisely because we have to heavily under-utilize it due
>> to it not having enough local memory to accommodate as many parallel
>> bcrypt computations as we'd need for full occupancy and to hide memory
>> access latencies.
>>
>> Maybe more optimal code will achieve better results, though.
>>
>> The NVIDIA card also has no luck competing with the CPU at bcrypt yet:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl
>> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
>> Using device 0: GeForce GTX 570
>> ****Please see 'opencl_bf_std.h' for device specific optimizations****
>> Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
>> Raw:    1137 c/s real, 1137 c/s virtual
>>
>> Some tuning could provide better numbers, but they stay a lot lower than
>> the CPU's and HD 7970's anyway (for the current code).
>>
>> Some other GPU benchmarks where I think we achieve decent performance
>> (not exactly the best, but on par with competing tools that had GPU
>> support for longer):
>>
>> GTX 570 1600 MHz:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=phpass-cuda
>> Benchmarking: phpass MD5 ($P$9 lengths 1 to 15) [CUDA]... DONE
>> Raw:    510171 c/s real, 507581 c/s virtual
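The "$P$9" in that benchmark name encodes the work factor: phpass stores log2 of its MD5 iteration count as an index into its base-64 alphabet. A sketch of decoding it:

```python
ITOA64 = "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def phpass_iterations(hash_prefix: str) -> int:
    """Decode the iteration count from a phpass "$P$" hash prefix.

    The character right after "$P$" indexes ITOA64; the count is 2 to
    that power, so "$P$9" means 2**11 = 2048 chained MD5 computations.
    """
    return 2 ** ITOA64.index(hash_prefix[3])

print(phpass_iterations("$P$9"))   # 2048
```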
>>
>> HD 7970 925 MHz (stock):
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=mscash2-opencl -pla=1
>> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
>> Using device 0: Tahiti
>> Optimal Work Group Size:256
>> Kernel Execution Speed (Higher is better):1.403044
>> Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
>> Raw:    92467 c/s real, 92142 c/s virtual
>>
>> 1225 MHz:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=mscash2-opencl -pla=1
>> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
>> Using device 0: Tahiti
>> Optimal Work Group Size:128
>> Kernel Execution Speed (Higher is better):1.856949
>> Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
>> Raw:    121644 c/s real, 121644 c/s virtual
>>
>> (would overheat if actually used? this is not bcrypt anymore)
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=rar
>> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
>> Using device 0: GeForce GTX 570
>> Optimal keys per crypt 32768
>> (to avoid this test on next run, put "rar_GWS = 32768" in john.conf,
>>  section [Options:OpenCL])
>> Local worksize (LWS) 64, Global worksize (GWS) 32768
>> Benchmarking: RAR3 SHA-1 AES (6 characters) [OpenCL]... (8xOMP) DONE
>> Raw:    4380 c/s real, 4334 c/s virtual
>>
>> The HD 7970 card is back to stock clocks here:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=rar -pla=1
>> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
>> Using device 0: Tahiti
>> Optimal keys per crypt 65536
>> (to avoid this test on next run, put "rar_GWS = 65536" in john.conf,
>>  section [Options:OpenCL])
>> Local worksize (LWS) 64, Global worksize (GWS) 65536
>> Benchmarking: RAR3 SHA-1 AES (6 characters) [OpenCL]... (8xOMP) DONE
>> Raw:    7162 c/s real, 468114 c/s virtual
>>
>> WPA-PSK, on CPU:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk
>> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [32/64]... (8xOMP) DONE
>> Raw:    1980 c/s real, 247 c/s virtual
>>
>> (no SIMD yet; could do several times faster).  CUDA:
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk-cuda
>> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [CUDA]... (8xOMP) DONE
>> Raw:    32385 c/s real, 16695 c/s virtual
>>
>> OpenCL on the faster card (stock clock):
>>
>> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk-opencl -pla=1
>> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
>> Using device 0: Tahiti
>> Max local work size 256
>> Optimal local work size = 256
>> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [OpenCL]... (8xOMP) DONE
>> Raw:    55138 c/s real, 42442 c/s virtual
>>
>> 27x speedup over CPU here, although presumably the CPU code is further
>> from optimal.
>>
>> ...Hey, what are you doing here?  That message was way too long, you
>> couldn't possibly read this far.  I'll just presume you scrolled to the
>> end.  There's good stuff you have missed above, so please scroll up. ;-)
>>
>> As usual, feedback is welcome on the john-users list.  I realize that
>> we're currently missing usage instructions for much of the new stuff, so
>> please just ask on john-users - and try to make your questions specific.
>> That way, code contributors will also be prompted/forced to contribute
>> documentation, and we'll get it under doc/ and on the wiki - in fact,
>> you can contribute to that too.
>>
>> Alexander
>>
>
