Date: Sat, 1 Aug 2015 15:41:52 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Benchmark result

All -

Viktor is joining team john-users for the upcoming CMIYC contest, as per
Aleksey's invitation:

http://www.openwall.com/lists/john-users/2015/07/22/7

Hi Viktor,

First of all, thank you for running and posting this benchmark.  It is
not great that you break the thread each time you post, though.  You
ought to use the "reply" feature of your mail client to add stuff to
the same thread (only if relevant to the thread, of course), so that
the In-Reply-To header (normally hidden) is set to point to the message
you're replying to.

As you probably have figured out, this benchmark is mostly of your two
CPUs, and only some of the formats - the "-cuda" suffixed ones - use one
of your GPUs (leaving the remaining 7 idle).  Moreover, as it relates to
JtR's GPU support, we focus on OpenCL rather than on CUDA.  The few CUDA
formats just happen to exist in the tree, but we normally don't use them
at all, and hence don't optimize them.

You don't have any "-opencl" suffixed formats reported, which suggests
that your build of JtR lacks OpenCL support.  Please go back to
./configure and see why it isn't detecting OpenCL - e.g., a header file
might not be in the header search path, or similar.
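To see what the OpenCL checks found, you may e.g. re-run configure and
grep its output (config.log will have the details of any failed test):

./configure 2>&1 | grep -i opencl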

And no, the fact that your GPU cards are from NVIDIA does not mean that
you're stuck with CUDA.  They will work with OpenCL just fine, and much
faster too - given the optimization focus we had in JtR development.
NVIDIA's CUDA SDK provides OpenCL support, too.  So you did in fact need
to install it, but you should then be using OpenCL rather than CUDA.
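Once you have an OpenCL-enabled build, you may verify that the NVIDIA
OpenCL platform and all 8 of your devices are visible with:

./john --list=opencl-devices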

Also, you should prefer the "--fork" option over OpenMP.  Just like
with the use of multiple GPUs, you won't be able to combine it with
"--test", but it will work just fine for actual cracking, delivering
greater cumulative performance (than OpenMP) across the multiple
processes.
Another reason to prefer "--fork" is that OpenMP is very sensitive to
other load on the system, and you will have plenty of "other load" when
you use your CPUs and GPUs simultaneously.  I'd recommend running
something like "--fork=24" for a CPU-using john instance, and thus leaving
another 8 logical CPUs (out of a total of 32 that you have) for keeping
your 8 GPUs busy.  You don't strictly have to do it that way, though.
Running more than 32 total concurrent child processes (across multiple
instances of john) is also OK, and may sometimes be convenient (e.g.,
for starting short-lived jobs while a longer one is running and is
already using all of the logical CPUs).  With "--fork", exceeding the
logical CPU count is tolerable.  With OpenMP, it is not: the threads
might all slow to a crawl if you try that, still appearing to run
busily while actually wasting time busy-waiting for each other.
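
For example, a CPU-only run per the above would look like this (with
hashes.txt standing in for your actual password hash file):

./john --fork=24 --format=descrypt hashes.txt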

If you do choose to use "--fork" instead of OpenMP, I suggest that you
go a step further and exclude OpenMP support (and thus its remaining
associated overhead) from your build, for a slight extra performance
boost.  You do this by building with:

./configure --disable-openmp

To summarize, if you're serious about this and want a race car with the
muffler removed and you know how to drive it (as above), build without
OpenMP, but with OpenCL.  Use "--fork" for your CPU runs, and use
"-opencl" suffixed formats on your GPUs.  Forget about the "-cuda"
suffixed formats (you may even "./configure --disable-cuda" so that they
don't confuse you anymore).  You may also use e.g. "--fork=8
--dev=0,1,2,3,4,5,6,7" to use all of your 8 GPUs at once (of course,
with a specific OpenCL format specified, e.g. "--format=phpass-opencl").
Or you may run separate jobs on them, or group them in any way you like.
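
Putting this together, the rebuild and a GPU run could look like this
(the make invocation is one usual way to rebuild jumbo, and hashes.txt
is a placeholder again):

./configure --disable-openmp --disable-cuda && make -s clean && make -sj4

./john --fork=8 --dev=0,1,2,3,4,5,6,7 --format=phpass-opencl hashes.txt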

Since you will be running multiple instances of john (at least two: one
for CPUs and another for GPUs, and probably many more), you will need to
use the "--session" option to specify unique session names to them, so
that they don't fight for the same .rec (crash recovery) files.
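For instance, with the two hypothetical session names cpu and gpu:

./john --fork=24 --format=descrypt --session=cpu hashes.txt
./john --fork=8 --dev=0,1,2,3,4,5,6,7 --format=phpass-opencl --session=gpu hashes.txt

You'd then restore an interrupted session with e.g. "./john
--restore=cpu".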

On Fri, Jul 31, 2015 at 02:42:15PM +0200, Viktor Gazdag wrote:
> 8x GPU NVIDIA Corporation GK110GL [Tesla K20Xm] (rev a1)

These are pretty good (even if badly over-priced for this use case), but
the speeds you got for one of them with the "-cuda" formats are poor.

I am interested in what speeds you'd get with "-opencl".

> Model name: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz

These CPUs are also pretty good.  With two of them, you have a total of
16 cores and 32 logical CPUs.  2.2 GHz is the base non-turbo clock rate;
actual clock rate is higher.  Per this table:

https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#Xeon_E5-26xx_.28dual-processor.29

it's up to 2.7 GHz with all cores in use, 3.0 GHz with few cores in use.

I previously benchmarked a system with CPUs exactly like this, although
I only ran the 1.8.0 release on it rather than jumbo.  I've just added
that benchmark's results (which I had in a text file) to the table at:

http://openwall.info/wiki/john/benchmarks

> Linux 3.16.0-4-amd64

Some older kernels, such as those included in RHEL6 (and CentOS 6, etc.),
wouldn't automatically enable turbo boost on Xeon E5-2600 series CPUs.
Yours appears recent enough that it would, and your benchmark results
show that it did.  So you're fine in that respect.
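If you want to double-check, and assuming your kernel uses the
intel_pstate driver for these CPUs (it may not), a 0 here means turbo
is enabled:

cat /sys/devices/system/cpu/intel_pstate/no_turbo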

> Benchmarking: descrypt, traditional crypt(3) [DES 128/128 AVX-16]... (32xOMP) DONE
> Many salts:    53726K c/s real, 1725K c/s virtual
> Only one salt:    36175K c/s real, 1129K c/s virtual

Somehow the first one of these results is noticeably worse than what I
had: as you can see in that table, I got 67764K c/s for the "many
salts" benchmark.  This might indicate that you need to tune OpenMP
parameters, or it might simply be because this was the very first
benchmark run, so CPU clock frequency scaling had not yet had a chance
to fully kick in.  You may want to run:

./john --test --format=descrypt; ./john --test --format=descrypt

That is, run the same benchmark twice in a row, with no delay.  If CPU
clock frequency scaling is the cause, you'd generally see a higher
"many salts" speed on the second of the two invocations.

Anyway, I guess you'd be disabling OpenMP and using "--fork" per my
advice above, making it unimportant how efficient OpenMP is at this.

> Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 AVX 4x3]... (32xOMP) DONE
> Raw:    573338 c/s real, 18079 c/s virtual
> 
> Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... (32xOMP) DONE
> Speed for cost 1 (iteration count) of 32
> Raw:    14827 c/s real, 461 c/s virtual

These speeds are consistent with what I'd expect for your CPUs with
turbo enabled.  In fact, the latter matches my result exactly. :-)

> Benchmarking: LM [DES 128/128 AVX-16]... (32xOMP) DONE
> Raw:    72220K c/s real, 2255K c/s virtual

LM is totally ridiculous with OpenMP.  It would be an order of
magnitude faster with "--fork".  Ditto for other "fast hashes".  (For
"slow hashes", the difference is much smaller, around 10% or so, except
for OpenMP's sensitivity to other load, mentioned above.)
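
So for LM you'd want something like (hashes.txt being a placeholder):

./john --fork=24 --format=lm hashes.txt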

> Benchmarking: sha256crypt-cuda, crypt(3) $5$ (rounds=5000) [SHA256
> CUDA (inefficient, please use sha256crypt-opencl instead)]... FAILED
> (cmp_all(7))

We may want to investigate why this one fails for you, but either way
you should be using sha256crypt-opencl instead.
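Once your OpenCL-enabled build works, you could benchmark this format
on one of your GPUs with:

./john --test --format=sha256crypt-opencl --dev=0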

Alexander
