john-dev - Re: PHC: Parallel in OpenCL

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CABob6irwDj=CMKA729AkxDLwDx-_zLZyXKLDMKAFv4i3htq=gQ@mail.gmail.com>
Date: Tue, 2 Jun 2015 02:38:22 +0200
From: Lukas Odzioba <lukas.odzioba@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Parallel in OpenCL

2015-05-31 13:39 GMT+02:00 Agnieszka Bielec <bielecagnieszka8@...il.com>:
> [a@...er run]$ ./john --test --format=parallel-opencl --dev=1
> Local worksize (LWS) 64, global worksize (GWS) 16384
> Benchmarking: Parallel-opencl [SHA-512 OpenCL]...
> Many salts:     27536 c/s real, 3276K c/s virtual


diff --git a/src/opencl/parallel_kernel.cl b/src/opencl/parallel_kernel.cl
index 1a18e14..b50c1e2 100755
--- a/src/opencl/parallel_kernel.cl
+++ b/src/opencl/parallel_kernel.cl
@@ -512,19 +512,18 @@ inline void hash(void *message, unsigned int
length, void *out, unsigned int out
        m_messageLengthLo += (LENGTH); \
 }

-#define SIMPLE(tmpJ,key) { \
-       for(k=0;k<8;k++) \
-               ((unsigned char*)m_block)[k]=((unsigned char*) (tmpJ))[k];\
-\
-       for(k=0;k<(HASH_LENGTH);k++) \
-               ((unsigned char*)m_block)[8+k]=((unsigned char*) (key))[k];\
-\
-       ((unsigned char*) m_block)[72] = 0x80; \
-       for(i=0;i<55;i++) \
-               ((unsigned char*) m_block)[73+i]=0; \
-\
-       m_block[15] = 576; \
-       sha512Block_Z(m_block, m_state);    \
+#define SIMPLE(j,key) { \
+       m_block[0]=SWAP_ENDIAN_64(j);\
+       for(k=0;k<(HASH_LENGTH/8);k++) \
+               m_block[1+k]=key[k];\
+       m_block[9]=0x80;\
+       m_block[10]=0;\
+       m_block[11]=0;\
+       m_block[12]=0;\
+       m_block[13]=0;\
+       m_block[14]=0;\
+       m_block[15] = 576;\
+       sha512Block_Z(m_block, m_state);\
 }
 //to do: why change 55 to 10, makes slower speed from 32k to 24 k?

@@ -645,9 +644,8 @@ __kernel void parallel_kernel_loop(__global const
uchar * in,
        for (j = 0; j < parallelLoops; j++)
        {
                // work ^= hash(WRITE_BIG_ENDIAN_64(j) || key)
-               tmpJ = SWAP_ENDIAN_64(j);

-               SIMPLE(&tmpJ, key);
+               SIMPLE(j, key);
                for (k = 0; k < HASH_LENGTH / sizeof(unsigned long); k++)
                {
                        work[k] ^= SWAP_ENDIAN_64(m_state[k]);//to do:
test on another GPUs
~

27536->28743

Free 1k c/s, ideally we would get rid of constants like m_block[i] = 0
and assume such values in SHA512_Z body, which means less memory ops
and less alu ops.
Copying key[] seems to be unnecesary also, just introduce references
to key in SHA512 and forget about garbage in m_block table, iirc it
will be overwritten later with the proper value.

Slowly step by step make it simple not every change will give you
thousands of c/s right away, but each will likely add something to the
end result. Compiler can be good at optimizing unoptimal code, but it
will not do the job for us. Please keep attention to the details.

Thanks,
Lukas

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.