Message-ID: <20120304002658.GA8272@openwall.com>
Date: Sun, 4 Mar 2012 04:26:58 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: CUDA & OpenCL status

Hi,

I just built a new machine for my JtR development and testing - not cost-effective, but letting me run certain tests and benchmarks that I previously could not. Brief specs: FX-8120 CPU (Bulldozer) with unlocked multiplier, GTX-570 vendor-overclocked to 1600 MHz, Radeon HD 7970 at stock clocks. This is on an ASRock 970 Extreme4 motherboard, which has 3 PCIe 16x slots, but only 2 are usable for dual-slot-wide cards in the midi-tower case I currently put it in; the slots operate as 8x+8x+4x (so I got 8x+8x for the two cards now), unless only one is in use (then it's a real 16x), but I think it won't matter much (folks even use 1x slots for this purpose).

For the OS, I simply installed Ubuntu 12.04 Beta1, plus NVidia drivers via Ubuntu's proprietary driver installer mechanism, plus the following:

cudatoolkit_4.1.28_linux_64_ubuntu11.04.run
amd-driver-installer-8.921-x86.x86_64.run
AMD-APP-SDK-v2.6-lnx64.tgz

Of the AMD stuff, I happened to install the driver first, then read in the SDK's README that it is preferable to install the SDK first and why. So I tarred up /usr/lib*/lib{amdocl,OpenCL}*.so, installed the SDK, then restored those files.
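In shell terms, that workaround amounts to roughly the following (a sketch only; the archive path is arbitrary, and the SDK install step is just a placeholder for whatever its README prescribes):

  # back up the OpenCL libraries that the AMD driver installed
  sudo tar -czf /root/amd-ocl-libs.tar.gz /usr/lib*/lib{amdocl,OpenCL}*.so
  # ... install AMD-APP-SDK-v2.6-lnx64 as its README describes ...
  # restore the driver's libraries over whatever the SDK put in place
  sudo tar -xzf /root/amd-ocl-libs.tar.gz -C /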
Speaking of the CPU, I briefly tried -xop builds with and without OpenMP, which obviously worked fine. I also tried overclocking from the standard 3.1 GHz base and 4.0 GHz turbo up to a maximum of 5.1 GHz (no turbo then), but of course the CPU was unusable at that frequency (I could still navigate the BIOS menus and see the CPU temperature grow slowly, but not do much more). ;-) So I just went back to the standard frequencies for my CUDA and OpenCL tests.

Here are some results.

For a linux-x86-64-cuda build, I ran:

user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te=0
Warning: doing quick benchmarking - the performance numbers will be inaccurate
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts: 3916K c/s real, 3916K c/s virtual
Only one salt: 3648K c/s real, 3648K c/s virtual
Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts: 140800 c/s real, 140800 c/s virtual
Only one salt: 70400 c/s real, 140800 c/s virtual
Benchmarking: FreeBSD MD5 [SSE2i 8x]... Segmentation fault (core dumped)

Oops. Somehow in the CUDA build, some CPU-only formats fail. This is seen in a -gpu build as well (CUDA and NVidia OpenCL at once), where some CPU-only formats' self-tests fail (IIRC, I was getting not segfaults, but test failures in that build). Has anyone else seen that as well? Another try:

user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te=1
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts: 4079K c/s real, 4079K c/s virtual
Only one salt: 3866K c/s real, 3866K c/s virtual
Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts: 137344 c/s real, 137344 c/s virtual
Only one salt: 134272 c/s real, 134272 c/s virtual
Benchmarking: FreeBSD MD5 [SSE2i 8x]... Segmentation fault (core dumped)

So it's reproducible. OK, let's try the CUDA formats specifically, then:

user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=cryptmd5-cuda
Benchmarking: cryptmd5-cuda [MD5-based CRYPT]... DONE
Raw: 637952 c/s real, 637952 c/s virtual
user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=cryptsha256-cuda
Benchmarking: cryptsha256-cuda [SHA256-based CRYPT]... DONE
Raw: 6892 c/s real, 6833 c/s virtual
user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=cryptsha512-cuda
Benchmarking: cryptsha512-cuda [SHA512-based CRYPT]... DONE
Raw: 3840 c/s real, 3840 c/s virtual
user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=mscash-cuda
Benchmarking: mscash-cuda len(pass)=8, len(salt)=13 []... DONE
Raw: 17708K c/s real, 17533K c/s virtual
user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=mscash2-cuda
Benchmarking: mscash2-cuda [GPU]... DONE
Raw: 7859 c/s real, 7929 c/s virtual
user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=phpass-cuda
Benchmarking: phpass-cuda [PORTABLE-MD5]... DONE
Raw: 633061 c/s real, 633061 c/s virtual
user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=raw-sha224-cuda
Benchmarking: raw-sha224-cuda [SHA224]... DONE
Raw: 7864K c/s real, 7864K c/s virtual
user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=raw-sha256-cuda
Benchmarking: raw-sha256-cuda [SHA256]... DONE
Raw: 7798K c/s real, 7798K c/s virtual

That's reasonable performance for "slow" hashes and poor performance for "fast" hashes, as expected. However, I think there's room for improvement for SHA-224 and SHA-256 even within the current formats interface. The phpass performance almost exactly matches the number for oclHashcat-plus given on the hashcat website ("653.6k c/s" for "PC2: Windows 7, 64 bit ForceWare 285.38 1x NVidia gtx570 1600Mhz core clock"), which is great. The cryptmd5-cuda number is almost two times lower than hashcat's, and mscash2-cuda is 4 times lower. (I am using the published numbers for hashcat for this comparison.)
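The eight CUDA benchmarks above can be repeated with a simple shell loop over the same format names (just a convenience sketch, using exactly the invocation shown in the transcripts):

  for f in cryptmd5-cuda cryptsha256-cuda cryptsha512-cuda mscash-cuda \
           mscash2-cuda phpass-cuda raw-sha224-cuda raw-sha256-cuda; do
      # same command as above, one format at a time
      LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=$f
  done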
With AMD stuff, things get worse:

user@...l:~/john/magnum-jumbo/src$ ../run/john -device=0 -te=1 -fo=phpass-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>>
2 device(s), using device: <<<Tahiti>>>
Optimal Group work Size = 256
Benchmarking: PHPASS-OPENCL [PORTABLE-MD5]... DONE
Raw: 958464 c/s real, 1409K c/s virtual

That's much better speed (although I think hashcat would do more like 1800k c/s on this card), but it's also the only time I was able to get this test to pass. Then it just kept failing:

user@...l:~/john/magnum-jumbo/src$ ../run/john -device=0 -te=1 -fo=phpass-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>>
2 device(s), using device: <<<Tahiti>>>
Optimal Group work Size = 256
Benchmarking: PHPASS-OPENCL [PORTABLE-MD5]... ../../../thread/semaphore.cpp:87: sem_wait() failed
Aborted (core dumped)
user@...l:~/john/magnum-jumbo/src$ ../run/john -device=0 -te=1 -fo=phpass-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>>
2 device(s), using device: <<<Tahiti>>>
Optimal Group work Size = 256
Benchmarking: PHPASS-OPENCL [PORTABLE-MD5]... ../../../thread/semaphore.cpp:87: sem_wait() failed
Aborted (core dumped)

cryptmd5-opencl never worked (although I did not try it after a clean reboot, with no prior attempt to use OpenCL):

user@...l:~/john/magnum-jumbo/src$ ../run/john -device=0 -te=1 -fo=cryptmd5-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>>
2 device(s), using device: <<<Tahiti>>>
Benchmarking: CRYPTMD5-OPENCL [MD5-based CRYPT]... FAILED (get_hash[0](0))
user@...l:~/john/magnum-jumbo/src$ ../run/john -device=0 -te=1 -fo=cryptmd5-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>>
2 device(s), using device: <<<Tahiti>>>
Benchmarking: CRYPTMD5-OPENCL [MD5-based CRYPT]... FAILED (get_hash[0](0))

Trying to use the CPU:

user@...l:~/john/magnum-jumbo/src$ ../run/john -device=1 -te=1 -fo=phpass-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>>
2 device(s), using device: <<<AMD FX(tm)-8120 Eight-Core Processor >>>
Optimal Group work Size = 2
Benchmarking: PHPASS-OPENCL [PORTABLE-MD5]... ../../../thread/semaphore.cpp:87: sem_wait() failed
Aborted (core dumped)
user@...l:~/john/magnum-jumbo/src$ ../run/john -device=1 -te=1 -fo=phpass-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>>
2 device(s), using device: <<<AMD FX(tm)-8120 Eight-Core Processor >>>
Optimal Group work Size = 2
Benchmarking: PHPASS-OPENCL [PORTABLE-MD5]... ../../../thread/semaphore.cpp:87: sem_wait() failed
Aborted (core dumped)

No luck. Another format:

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=nt-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>>
2 device(s), using device: <<<Tahiti>>>
Optimal Local work size 64
Benchmarking: NT MD4 [OpenCL 1.0]... DONE
Raw: 26473K c/s real, 28445K c/s virtual

Works, but is inefficient (as expected for a "fast" hash currently).

...but fails on the CPU:

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=nt-opencl -dev=1
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>>
2 device(s), using device: <<<AMD FX(tm)-8120 Eight-Core Processor >>>
Optimal Local work size 512
Benchmarking: NT MD4 [OpenCL 1.0]... ../../../thread/semaphore.cpp:87: sem_wait() failed
Aborted (core dumped)

Let's try both again:

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=nt-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>>
2 device(s), using device: <<<Tahiti>>>
Optimal Local work size 64
Benchmarking: NT MD4 [OpenCL 1.0]... DONE
Raw: 25954K c/s real, 27306K c/s virtual
user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=nt-opencl -dev=1
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>>
2 device(s), using device: <<<AMD FX(tm)-8120 Eight-Core Processor >>>
Optimal Local work size 512
Benchmarking: NT MD4 [OpenCL 1.0]... DONE
Raw: 20360K c/s real, 4854K c/s virtual

Both work now, but the performance is still unreasonable (the simple "NT" and "NT2" formats for this same hash type are slightly faster).
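For context on the builds being compared here: the CUDA-only binary comes from the linux-x86-64-cuda target named above, the -gpu build has CUDA and NVidia OpenCL at once, and a separate NVidia OpenCL build is used next. Roughly as follows (a sketch; only the -cuda target name appears above, the -opencl and -gpu target names are assumed and may be spelled differently in the tree):

  cd ~/john/magnum-jumbo/src
  make clean; make linux-x86-64-cuda      # CUDA formats only
  make clean; make linux-x86-64-opencl    # OpenCL formats only (assumed target name)
  make clean; make linux-x86-64-gpu       # CUDA + OpenCL in one binary (assumed target name)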
... NVidia OpenCL build does not crash on CPU formats when run with -te=0 (unlike -cuda and -gpu builds), so that's what I did, and here's the final portion of output:

[...]
Benchmarking: dummy [N/A]... DONE
Raw: 115379K c/s real, 115379K c/s virtual
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>>
1 device(s), using device: <<<GeForce GTX 570>>>
Max Group Work Size 1024
Optimal local work size 64 (to avoid this test on next run do export LWS=64)
Local work size (LWS) 64, Keys per crypt (KPC) 2097152
Benchmarking: Netscape LDAP SSHA OPENCL [salted SHA-1]... DONE
Many salts: 52428K c/s real, 69905K c/s virtual
Only one salt: 41943K c/s real, 41943K c/s virtual
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>>
1 device(s), using device: <<<GeForce GTX 570>>>
Max Group Work Size 1024
Optimal local work size 128 (to avoid this test on next run do export LWS=128)
Local work size (LWS) 128, Keys per crypt (KPC) 2097152
Benchmarking: Raw MD5 [raw-md5-opencl]... DONE
Raw: 52428K c/s real, 41943K c/s virtual
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>>
1 device(s), using device: <<<GeForce GTX 570>>>
Optimal Local work size 128
Benchmarking: NT MD4 [OpenCL 1.0]... DONE
Raw: 52428K c/s real, 26214K c/s virtual
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>>
1 device(s), using device: <<<GeForce GTX 570>>>
Max Group Work Size 1024
Optimal local work size 64 (to avoid this test on next run do export LWS=64)
Local work size (LWS) 64, Keys per crypt (KPC) 2097152
Benchmarking: Raw SHA-1 OpenCL [raw-sha1-opencl]... DONE
Raw: 52428K c/s real, 52428K c/s virtual
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>>
1 device(s), using device: <<<GeForce GTX 570>>>
OpenCL error (CL_OUT_OF_RESOURCES) in file (cryptmd5_opencl_fmt.c) at line (194) - (Set ND range)

Oops, cryptmd5-opencl in that build always fails like that:

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=cryptmd5-opencl
../run/john: /usr/lib/nvidia-current-updates/libOpenCL.so.1: no version information available (required by ../run/john)
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>>
1 device(s), using device: <<<GeForce GTX 570>>>
OpenCL error (CL_OUT_OF_RESOURCES) in file (cryptmd5_opencl_fmt.c) at line (194) - (Set ND range)
user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=mysql-sha1-opencl
../run/john: /usr/lib/nvidia-current-updates/libOpenCL.so.1: no version information available (required by ../run/john)
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>>
1 device(s), using device: <<<GeForce GTX 570>>>
Max Group Work Size 1024
Error -5
Optimal local work size 64 (to avoid this test on next run do export LWS=64)
Local work size (LWS) 64, Keys per crypt (KPC) 2097152
Benchmarking: MySQL 4.1 double-SHA-1 [mysql-sha1-opencl]... DONE
Many salts: 23741K c/s real, 23967K c/s virtual
Only one salt: 23967K c/s real, 23741K c/s virtual
user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=phpass-opencl
../run/john: /usr/lib/nvidia-current-updates/libOpenCL.so.1: no version information available (required by ../run/john)
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>>
1 device(s), using device: <<<GeForce GTX 570>>>
OpenCL error (CL_OUT_OF_RESOURCES) in file (phpass_opencl_fmt.c) at line (162) - (Run kernel)
user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=raw-sha1-opencl
../run/john: /usr/lib/nvidia-current-updates/libOpenCL.so.1: no version information available (required by ../run/john)
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>>
1 device(s), using device: <<<GeForce GTX 570>>>
Max Group Work Size 1024
Optimal local work size 64 (to avoid this test on next run do export LWS=64)
Local work size (LWS) 64, Keys per crypt (KPC) 2097152
Benchmarking: Raw SHA-1 OpenCL [raw-sha1-opencl]... DONE
Raw: 46829K c/s real, 47288K c/s virtual
user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=ssha-opencl
../run/john: /usr/lib/nvidia-current-updates/libOpenCL.so.1: no version information available (required by ../run/john)
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>>
1 device(s), using device: <<<GeForce GTX 570>>>
Max Group Work Size 1024
Optimal local work size 64 (to avoid this test on next run do export LWS=64)
Local work size (LWS) 64, Keys per crypt (KPC) 2097152
Benchmarking: Netscape LDAP SSHA OPENCL [salted SHA-1]... DONE
Many salts: 67108K c/s real, 65793K c/s virtual
Only one salt: 45680K c/s real, 45680K c/s virtual

So some of them work (or at least pass the test), some don't. Some deliver reasonable performance, most don't. This is fine as a development milestone, but not surprisingly a lot of further work is needed after this point.

Thanks for reading this far.

Alexander