Message-ID: <20150425192015.GC21470@openwall.com>
Date: Sat, 25 Apr 2015 22:20:15 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

Agnieszka,

On Wed, Apr 15, 2015 at 04:41:33PM +0200, Agnieszka Bielec wrote:
> I've modified my previous algorithm and now it computes one hash on 4 GPU units.
> it's better coalescing now

"4 GPU units" sounded weird and confusing to me.

I finally took a look at commit
d259f3be880e9f55e2424d9819cf56e2987302fe.  It appears that you're trying
to simulate 4x SIMD by separate work-items.  Does this work reliably?
I think you'd need a barrier before you can combine the 4 intermediate
results into one hash output, and you don't appear to have that.  I am
actually surprised this works at all for you.  Am I missing something?
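
To illustrate the ordering concern, here is a minimal host-side sketch in plain C, not code from the commit.  It emulates four work-items that each write an intermediate result to shared storage, with one of them then combining the results.  The names LANES, lane_partial and combine_lanes are hypothetical, and the per-lane computation is a placeholder.  On a GPU the two phases are not automatically ordered, so a call such as barrier(CLK_LOCAL_MEM_FENCE) would be expected at the marked point.

```c
#include <stdint.h>

#define LANES 4  /* work-items cooperating on one hash (assumed) */

/* Hypothetical per-lane work; a placeholder for the real computation. */
static uint64_t lane_partial(uint64_t seed, unsigned lane)
{
    return seed + lane;
}

/* Sequential emulation of one group of LANES work-items. */
uint64_t combine_lanes(uint64_t seed)
{
    uint64_t shared[LANES];  /* would be __local memory in a kernel */
    uint64_t out = 0;
    unsigned lane;

    /* Phase 1: every work-item stores its own intermediate result. */
    for (lane = 0; lane < LANES; lane++)
        shared[lane] = lane_partial(seed, lane);

    /*
     * In an OpenCL kernel, barrier(CLK_LOCAL_MEM_FENCE) belongs here.
     * Without it, phase 2 may read slots that other work-items have
     * not written yet.  The sequential loops above hide that hazard.
     */

    /* Phase 2: one work-item reads all slots and forms the output. */
    for (lane = 0; lane < LANES; lane++)
        out += shared[lane];

    return out;
}
```

Because the loops run one after another on the host, this version is always correct; it only shows where the synchronization point sits between the two phases.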

If you put a hash of a known password in a text file, and put the
password somewhere down a wordlist, and run "john" with pomelo-opencl
against that, will it crack the hash?  I expect it won't, since the code
looks broken to me.  You don't actually have any test vectors in
opencl_pomelo_fmt_plug.c, so when you --test you don't actually test,
right? :-(  You've got to fix that, or you might have major bugs go
unnoticed for months while you "optimize".

I think you should use a vector data type instead of the separate
work-items, much like the AVX2 code on CPU uses __m256i and doesn't
rely on the compiler's auto-vectorization.  Granted, auto-vectorization
is much more common with OpenCL than with C, but you can help the
compiler by doing a part of it explicitly anyway (and then it'd be the
compiler's job to combine these narrow SIMD portions into possibly wider
SIMD that the hardware might need).  I think you need ulong4 there:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/vectorDataTypes.html
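
As a rough host-side sketch of what explicit 4-wide vectors look like, the C below uses the GCC/Clang vector_size extension as a stand-in for OpenCL's ulong4.  The type name ulong4_t, the function add_rotl4 and the add-then-rotate step are all hypothetical; this is not POMELO's state update, only an illustration of one expression acting on four 64-bit lanes at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Four 64-bit lanes; assumed stand-in for OpenCL's ulong4.
 * Requires GCC or Clang (vector_size is a compiler extension). */
typedef uint64_t ulong4_t __attribute__((vector_size(32)));

/*
 * Hypothetical mixing step: add *k to each state word, then rotate
 * every 64-bit lane left by r bits.  r must be in the range 1..63.
 */
void add_rotl4(ulong4_t *state, size_t n, const ulong4_t *k, unsigned r)
{
    const ulong4_t left = { r, r, r, r };
    const ulong4_t right = { 64 - r, 64 - r, 64 - r, 64 - r };
    size_t i;

    for (i = 0; i < n; i++) {
        ulong4_t s = state[i] + *k;             /* 4 additions at once */
        state[i] = (s << left) | (s >> right);  /* 4 rotations at once */
    }
}
```

In an OpenCL kernel the same body could be written directly on ulong4 values, with each work-item handling all four lanes of one hash, so no cross-work-item synchronization would be needed to combine them.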

Also, I think you're over-using "unsigned long", such as for loop
variables.  This may have a performance impact on GPUs.  You should
prefer "uint" except where a wider type is actually needed.

You may also try "#pragma unroll" and "#pragma unroll N" before loops,
see e.g. cryptsha512_kernel_GPU.cl for examples.
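
A small sketch of both suggestions, written as host C so it can be compiled and checked.  The function sum_words is hypothetical, uint32_t plays the role of OpenCL's "uint", and the unroll factor of 4 is an arbitrary choice.  Note the assumption that the compiler honors "#pragma unroll": OpenCL compilers and Clang generally do, while a host GCC may simply ignore it.

```c
#include <stdint.h>

/* Sums n 32-bit words using a 32-bit loop counter. */
uint32_t sum_words(const uint32_t *w, uint32_t n)
{
    uint32_t acc = 0;
    uint32_t i;  /* 32-bit counter; a 64-bit one is not needed here */

    /* Hint to unroll by 4; may be ignored by some host compilers. */
#pragma unroll 4
    for (i = 0; i < n; i++)
        acc += w[i];

    return acc;
}
```

Whether unrolling helps, and by how much, depends on the device and compiler, so it is worth benchmarking with and without the pragma.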

Alexander
