|
Message-ID: <20150801124152.GA8811@openwall.com>
Date: Sat, 1 Aug 2015 15:41:52 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Benchmark result

All -

Viktor is joining team john-users for the upcoming CMIYC contest, as per
Aleksey's invitation:

http://www.openwall.com/lists/john-users/2015/07/22/7

Hi Viktor,

First of all, thank you for running and posting this benchmark.

It is not great that you break the thread each time you post, though.
You should be using the "reply" feature of your mail client to add stuff
to the same thread (only if relevant to the thread, of course), so that
the In-Reply-To header (normally hidden) is set to point to the thread.

As you have probably figured out, this benchmark is mostly of your two
CPUs, and only some of the formats - the "-cuda" suffixed ones - use one
of your GPUs (leaving the remaining 7 idle).

Moreover, as far as JtR's GPU support goes, we focus on OpenCL rather
than on CUDA.  The few CUDA formats just happen to exist in the tree,
but we normally don't use them at all, and hence don't optimize them.

You don't have any "-opencl" suffixed formats reported, which suggests
that your build of JtR lacks OpenCL support.  Please go back to
./configure and see why it isn't detecting OpenCL - e.g., a header file
might not be in the header search path, or similar.

And no, the fact that your GPU cards are from NVIDIA does not mean that
you're stuck with CUDA.  They will work with OpenCL just fine, and much
faster too, given the optimization focus we had in JtR development.
NVIDIA's CUDA SDK provides OpenCL support as well.  So you did in fact
need to install it, but you should then be using OpenCL rather than
CUDA.

Also, you should prefer the "--fork" option over OpenMP.  Just like with
use of multiple GPUs, you won't be able to combine it with "--test", but
it will work for actual cracking just fine, delivering greater
cumulative performance (than OpenMP) across the multiple processes.
Another reason to prefer "--fork" is that OpenMP is very sensitive to
other load on the system, and you will have plenty of "other load" when
you use your CPUs and GPUs simultaneously.

I'd recommend running something like "--fork=24" for a CPU-using john
instance, thus leaving another 8 logical CPUs (out of the total of 32
that you have) for keeping your 8 GPUs busy.  You don't strictly have to
do it that way, though.  Running more than 32 total concurrent child
processes (across multiple instances of john) is also OK, and may
sometimes be convenient (e.g., for starting short-lived jobs while a
longer one is running and is already using all of the logical CPUs).
With "--fork", exceeding the logical CPU count is tolerable.  With
OpenMP, it is not: the threads might all crawl to a halt if you try that
with OpenMP, and while they might still appear to be busily running,
they would actually waste time busy-waiting for other threads.

If you do choose to use "--fork" instead of OpenMP, I suggest that you
go a step further and exclude OpenMP support (and thus its remaining
associated overhead) from your build, for a slight extra performance
boost.  You do this by building with:

./configure --disable-openmp

To summarize, if you're serious about this and want a race car with the
muffler removed - and you know how to drive it (as above) - build
without OpenMP, but with OpenCL.  Use "--fork" for your CPU runs, and
use "-opencl" suffixed formats on your GPUs.
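In practice, assuming NVIDIA's SDK put its OpenCL headers and libraries
under /usr/local/cuda (that path is just a guess on my part - adjust it
to wherever your install actually is), such a build could look like:

./configure --disable-openmp \
    CPPFLAGS=-I/usr/local/cuda/include LDFLAGS=-L/usr/local/cuda/lib64
make -s clean && make -s

Afterwards, "./john --list=formats | grep -i opencl" should confirm
that the "-opencl" formats actually made it into the build.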
Forget about the "-cuda" suffixed formats (you may even "./configure
--disable-cuda" so that they don't confuse you anymore).

You may also use e.g. "--fork=8 --dev=0,1,2,3,4,5,6,7" to use all of
your 8 GPUs at once (of course, with a specific OpenCL format specified,
e.g. "--format=phpass-opencl").  Or you may run separate jobs on them,
or group them in any way you like.

Since you will be running multiple instances of john (at least two: one
for the CPUs and another for the GPUs, and probably many more), you will
need to use the "--session" option to give each of them a unique session
name, so that they don't fight over the same .rec (crash recovery)
files.  (See the example invocations further below.)

On Fri, Jul 31, 2015 at 02:42:15PM +0200, Viktor Gazdag wrote:
> 8x GPU NVIDIA Corporation GK110GL [Tesla K20Xm] (rev a1)

These are pretty good (even if badly over-priced for this use case), but
the speeds you got for one of them with the "-cuda" formats are poor.
I am interested in what speeds you'd get with "-opencl".

> Model name: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz

These CPUs are also pretty good.  With two of them, you have a total of
16 cores and 32 logical CPUs.  2.2 GHz is the base non-turbo clock rate;
the actual clock rate is higher.  Per this table:

https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#Xeon_E5-26xx_.28dual-processor.29

it's up to 2.7 GHz with all cores in use, and 3.0 GHz with few cores in
use.

I previously benchmarked a system with CPUs exactly like this, although
I only ran the 1.8.0 release on it rather than jumbo.  I've just added
that benchmark's results (which I had in a text file) to the table at:

http://openwall.info/wiki/john/benchmarks

> Linux 3.16.0-4-amd64

Some older kernels, such as those included in RHEL6 (and CentOS 6,
etc.), wouldn't automatically enable turbo boost on Xeon E5-2600 series
CPUs.  Yours appears recent enough that it would, and your benchmark
results show that it did.  So you're fine in that respect.

> Benchmarking: descrypt, traditional crypt(3) [DES 128/128 AVX-16]... (32xOMP) DONE
> Many salts:    53726K c/s real, 1725K c/s virtual
> Only one salt: 36175K c/s real, 1129K c/s virtual

Somehow the first one of these results is noticeably worse than what I
had: as you can see from the wiki table above, I got 67764K c/s for the
"many salts" case.  This might indicate that you need to tune OpenMP
parameters, or it might simply be caused by this being the very first
benchmark in the run, with CPU clock frequency scaling not yet having
had a chance to fully kick in.  You may want to run:

./john --test --format=descrypt; ./john --test --format=descrypt

That is, run the same benchmark twice in a row, with no delay.  If it's
CPU clock frequency scaling, then you'd generally see higher "many
salts" speeds on the second of these two invocations.

Anyway, I guess you'll be disabling OpenMP and using "--fork" per my
advice above, making it unimportant how efficient OpenMP is at this.

> Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 AVX 4x3]... (32xOMP) DONE
> Raw: 573338 c/s real, 18079 c/s virtual
>
> Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... (32xOMP) DONE
> Speed for cost 1 (iteration count) of 32
> Raw: 14827 c/s real, 461 c/s virtual

These speeds are consistent with what I'd expect for your CPUs with
turbo enabled.  In fact, the latter matches my result exactly. :-)

> Benchmarking: LM [DES 128/128 AVX-16]... (32xOMP) DONE
> Raw: 72220K c/s real, 2255K c/s virtual

LM is totally ridiculous with OpenMP.  It would be an order of magnitude
faster with "--fork".  Ditto for other "fast hashes".  (For "slow
hashes", the difference is much smaller - like 10% or so - except for
the issue of sensitivity to other load with OpenMP.)
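For example, to keep the CPUs and all 8 GPUs busy at the same time, a
pair of invocations could look like this (the hash file name and the
format choices here are made up - substitute whatever you're actually
cracking):

./john --session=cpu --fork=24 --format=descrypt hashes.txt
./john --session=gpu --fork=8 --dev=0,1,2,3,4,5,6,7 \
    --format=phpass-opencl hashes.txt

You'd then be able to interrupt and restore these independently, with
"./john --restore=cpu" and "./john --restore=gpu".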
(For "slow hashes", the difference is much smaller, like 10% or so, except for the issue with sensitivity to other load with OpenMP.) > Benchmarking: sha256crypt-cuda, crypt(3) $5$ (rounds=5000) [SHA256 > CUDA (inefficient, please use sha256crypt-opencl instead)]... FAILED > (cmp_all(7)) We could want to investigate why this one fails for you, but anyway you should be using sha256crypt-opencl instead. Alexander