Message-ID: <ffb9f1549895ceba9e473299ddfe1d79@smtp.hushmail.com>
Date: Fri, 19 Apr 2013 19:22:19 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Re: minor raw-sha1-ng pull request

On 19 Apr, 2013, at 19:12 , magnum <john.magnum@...hmail.com> wrote:
>> $ ../run/john -test -fo=raw-sha1-ng
>> Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 SSE4.1 intrinsics
>> 4x]...(8xOMP) DONE
>> Raw:	23232K c/s real, 3338K c/s virtual
>> 
>> I don't understand why real is so different than virtual, compared to
>> without omp:
>> 
>> $ ../run/john -test -fo=raw-sha1-ng
>> Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 SSE4.1 intrinsics 4x]... DONE
>> Raw:	9251K c/s real, 9251K c/s virtual
>> 
>> What am I doing wrong? (I already batch crypts, so I figured I could just
>> split the work across threads if available, maybe this was naive).
> 
> This is expected. The real figures are hashes/wall-clock-time and the virtual ones are hashes/CPU-time. If you could get it to scale well, the virtual figure would be near a non-OMP one.
> 
> So for 8x OMP you only get ~2.5x speed. As long as you don't get lower speeds than for one core, we can commit it for sure. I think you need to run much larger batches under OMP (OMP_SCALE in rawSHA256_ng_fmt.c) for hiding the overhead. I got nt2 to scale fairly well on Intel with an OMP_SCALE of 1536. That is, it runs 1536*MMX_COEF*MD4_PARA crypts per call, per core. Or put another way, the for loop will submit 1536 normal batches to each thread.
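
To illustrate the "for loop submits many normal batches to each thread" idea quoted above, here is a minimal sketch in C. It is not the actual nt2 or raw-sha1-ng code: BATCH, crypt_batch() and crypt_all_sketch() are hypothetical names, and the "hash" is a placeholder multiply so the example stays self-contained. The only real point is the loop shape: one call covers many SIMD-sized batches, so the cost of starting the parallel region is spread over more work.

```c
#include <stddef.h>

/* Hypothetical stand-in for the SIMD batch width
 * (something like MMX_COEF * MD4_PARA in the real formats). */
#define BATCH 4

/* Placeholder for "hash one SIMD batch". NOT a real hash function. */
void crypt_batch(const unsigned *in, unsigned *out)
{
    int j;

    for (j = 0; j < BATCH; j++)
        out[j] = in[j] * 2654435761u + 1u;
}

/* Process `count` keys in one call; count is assumed to be a multiple
 * of BATCH. With OpenMP enabled, the batches are divided among the
 * threads; without it, the same loop simply runs serially. */
void crypt_all_sketch(const unsigned *in, unsigned *out, int count)
{
    int i;

#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (i = 0; i < count; i += BATCH)
        crypt_batch(in + i, out + i);
}
```

The larger count is relative to BATCH, the more batches each thread gets per call, which is what a bigger OMP_SCALE buys.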

OK, I get this for non-OMP build:
Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 AVX intrinsics 4x]... DONE
Raw:	23784K c/s real, 23784K c/s virtual

And this for OMP-build but running 1 core:
Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 AVX intrinsics 4x]... DONE
Raw:	23553K c/s real, 23553K c/s virtual

That's fine. But trying to use more cores does not work well:
Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 AVX intrinsics 4x]... (4xOMP) DONE
Raw:	16872K c/s real, 9373K c/s virtual


I see you already have a SHA1_PARALLEL_HASH of 512. Look at init() in raw-sha256-ng and try to mimic that. You probably want to use an OMP_SCALE of 3, and the number of keys per call would then be (number of cores actually in use) * OMP_SCALE * SHA1_PARALLEL_HASH. I bet this will give much better results. But it means you need to allocate the buffers dynamically, since the key count is no longer known at compile time.
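
A rough sketch of that sizing and allocation, in C. This is not copied from raw-sha256-ng: struct sha1_buffers, keys_per_crypt() and the alloc/free helpers are hypothetical names, and the per-key sizes (4 words of key data for pwlen <= 15, 5 words of SHA-1 digest) are assumptions for illustration. The thread count is passed in as a parameter here; in a real init() it would presumably come from omp_get_max_threads().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SHA1_PARALLEL_HASH 512 /* batch size mentioned above */
#define OMP_SCALE          3   /* suggested starting point; tune by benchmarking */

/* Hypothetical buffer set; field names are illustrative only. */
struct sha1_buffers {
    size_t    max_keys; /* keys handled per crypt call */
    uint32_t *keys;     /* assumed 4 words per key (pwlen <= 15) */
    uint32_t *digests;  /* 5 words per SHA-1 digest */
};

/* keys per call = threads in use * OMP_SCALE * SHA1_PARALLEL_HASH.
 * Returns 0 for a nonsensical thread count. */
size_t keys_per_crypt(int threads)
{
    if (threads < 1)
        return 0;
    return (size_t)threads * OMP_SCALE * SHA1_PARALLEL_HASH;
}

/* Allocate zeroed buffers sized for `threads`. Returns 0 on success,
 * -1 on failure (in which case nothing is left allocated). */
int sha1_buffers_alloc(struct sha1_buffers *b, int threads)
{
    b->max_keys = keys_per_crypt(threads);
    b->keys = NULL;
    b->digests = NULL;
    if (b->max_keys == 0)
        return -1;

    b->keys = calloc(b->max_keys * 4, sizeof(uint32_t));
    b->digests = calloc(b->max_keys * 5, sizeof(uint32_t));
    if (!b->keys || !b->digests) {
        free(b->keys);
        free(b->digests);
        b->keys = NULL;
        b->digests = NULL;
        return -1;
    }
    return 0;
}

/* Release the buffers and reset the pointers. */
void sha1_buffers_free(struct sha1_buffers *b)
{
    free(b->keys);
    free(b->digests);
    b->keys = NULL;
    b->digests = NULL;
    b->max_keys = 0;
}
```

With these values one thread gets 3 * 512 = 1536 keys per call, and 4 threads get 6144.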

magnum
