john-dev - Re: [GSoC] John the Ripper support for PHC finalists

Message-ID: <20150425214925.GA23292@openwall.com>
Date: Sun, 26 Apr 2015 00:49:26 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

On Sat, Apr 25, 2015 at 10:35:51PM +0200, Frank Dittrich wrote:
> After pulling the latest changes up to commit a988a38c, I did reset
> BENCHMARK_LENGTH back to 0 to get these numbers.
> 
> On one of my systems (i5-4570 CPU, 4 physical, 4 logical cores, AVX2
> build), the difference for --costs=2:2,2:2 is hard to notice.
> For --costs=0:0,0:0, it is 1.3%, for --costs=2:2,0:0, it is about 0.5%,
> for --costs=0:0,2:2 it is about 0.25%.
> 
> On my laptop with Core(TM) i7-2820QM CPU (4 physical, 8 logical cores,
> SSE2 build), I get a 13% difference for --costs=0:0,0:0.
> The difference is about 9% for --costs=2:2,0:0, or --costs=0:0,2:2, and
> for --costs=2:2,2:2 it is about 8%.
> On this laptop, I tried to use short --test times, to avoid throttling.
> So I used several runs of --test=1, but I'm afraid that throttling
> interfered nevertheless.

Any idea why the 8-thread CPU and build is impacted much more?  Are we
possibly exceeding L1 data cache size?  It's the same for both CPUs
(32 KB), but each thread gets only half of it when you have 2
threads/core.

The core appears to allocate 16 buffers of (PLAINTEXT_LENGTH + 1) bytes
and 16 of BINARY_SIZE bytes per thread.  That's:

16*((125+1)+257) = 6128

Hmm, looks like it should fit either way.  The overhead on top of that
shouldn't be that much.  Still, you could try halving OMP_SCALE to test.
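
As a rough check of the arithmetic above, here is a minimal C sketch.
The constants are the ones quoted in this message (16 keys per thread,
PLAINTEXT_LENGTH 125, BINARY_SIZE 257, 32 KB L1 data cache).  The
function names are hypothetical, and splitting the cache evenly between
the hardware threads of a core is a simplifying assumption, not how the
CPU actually partitions it.

```c
#include <assert.h>
#include <stddef.h>

/* Values quoted in the message; names mirror the format's macros. */
#define PLAINTEXT_LENGTH 125
#define BINARY_SIZE 257
#define KEYS_PER_THREAD 16        /* "16 of ... per thread" */
#define L1D_SIZE (32 * 1024)      /* 32 KB L1 data cache per core */

/* Bytes of key and hash buffers used by one thread. */
size_t per_thread_bytes(void)
{
	return (size_t)KEYS_PER_THREAD *
	    ((PLAINTEXT_LENGTH + 1) + BINARY_SIZE);
}

/* L1 data cache share of one thread, assuming an even split
 * between hardware threads (a simplification). */
size_t l1d_share(unsigned int threads_per_core)
{
	assert(threads_per_core > 0);
	return L1D_SIZE / threads_per_core;
}

/* Returns 1 if the buffers fit in the thread's share, else 0. */
int fits_in_l1d(unsigned int threads_per_core)
{
	return per_thread_bytes() <= l1d_share(threads_per_core);
}
```

With these numbers, per_thread_bytes() is 6128, which is below both the
full 32768 bytes (1 thread/core) and the halved 16384 bytes (2
threads/core), so the buffers alone do not explain the difference.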

> After resetting BENCHMARK_LENGTH to -1, I get results for "Raw" that are
> similar to the "Many salts" case in my previous tests with
> BENCHMARK_LENGTH 0.

On second thought, I think we should keep it at 0.  I was wrong.

At low cost settings, this format isn't slow enough for the difference
to be negligible.  On a related note, maybe we should change
md5crypt-opencl's BENCHMARK_LENGTH to 0 too, since it gets to pretty
high speeds on GPU, and the set_key() overhead on CPU and the transfer
of keys may play a role.  In fact, it includes "if (new_keys)" there
specifically to optimize the "many salts" case.

Oh, and pomelo-opencl needs this optimization too.  Right now, it
does a partial transfer of keys in set_key() instead, followed by
transfer of the remaining keys in crypt_all().  This was reasonable in
the fast and saltless opencl_mysqlsha1_fmt_plug.c, which pomelo-opencl
is based on, but since POMELO is salted, we need to check whether the
keys have changed before transferring the remaining keys in crypt_all().
Given the existing code, maybe this is as simple as setting "key_offset
= key_idx;" right after the keys have been transferred in crypt_all(),
so that subsequent crypt_all() calls will skip that.
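
The idea can be illustrated with a host-only C sketch.  This is not the
pomelo-opencl code: only the names key_idx, key_offset, set_key() and
crypt_all() come from the message above.  The buffers, the flush
threshold, the transfer() helper (a memcpy standing in for an OpenCL
buffer write) and the bytes_transferred counter are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 4096
#define FLUSH_THRESHOLD 64       /* hypothetical partial-transfer size */

char host_keys[BUF_SIZE];        /* packed candidate keys on the host */
char device_keys[BUF_SIZE];      /* stand-in for the GPU buffer */
size_t key_idx;                  /* end of the keys stored so far */
size_t key_offset;               /* end of the keys already transferred */
size_t bytes_transferred;        /* instrumentation only */

/* Stand-in for an OpenCL buffer write. */
static void transfer(size_t from, size_t to)
{
	memcpy(device_keys + from, host_keys + from, to - from);
	bytes_transferred += to - from;
}

void clear_keys(void)
{
	key_idx = key_offset = 0;
}

void set_key(const char *key)
{
	size_t len = strlen(key) + 1;

	assert(key_idx + len <= BUF_SIZE);
	memcpy(host_keys + key_idx, key, len);
	key_idx += len;
	/* Partial transfer while the host is still producing keys. */
	if (key_idx - key_offset >= FLUSH_THRESHOLD) {
		transfer(key_offset, key_idx);
		key_offset = key_idx;
	}
}

void crypt_all(void)
{
	/* Transfer only the keys that have not reached the device. */
	if (key_idx > key_offset) {
		transfer(key_offset, key_idx);
		key_offset = key_idx;  /* later calls skip the transfer */
	}
	/* ... the per-salt kernel launch would go here ... */
}
```

In this model, calling crypt_all() once per salt with unchanged keys
transfers the remaining keys only on the first call; later calls find
key_offset equal to key_idx and go straight to the kernel launch.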

Alexander
