Message-ID: <CABob6ioq3McY5qsYt-6J0ydkKSacJ8uiN2hsCi32KQRUWsSSXA@mail.gmail.com>
Date: Sun, 9 Jun 2013 00:07:04 +0200
From: Lukas Odzioba <lukas.odzioba@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: sha3-opencl

2013/6/6 Dániel Bali <balijanosdaniel@...il.com>:
> daniel@...l:~/bleeding-jumbo/JohnTheRipper/src$ ../run/john -test
> --format=raw-keccak256-opencl
> Device 0: GeForce GTX 570
> Local worksize (LWS) 128, global worksize (GWS) 524288
> Benchmarking: raw-keccak256-opencl, Raw Keccak256 [OpenCL (inefficient,
> development use only)]... DONE
> Raw: 27525K c/s real, 27525K c/s virtual

Daniel, please share the current speeds (we have added longer test
vectors) on the 7970, the 570 and bull's CPU (using the OpenCL
implementation).

For comparison, this is the result on an AMD FX(tm)-8120 using AVX:

Benchmarking: raw-keccak-256, Keccak 256 [AVX]... DONE
Raw: 2056K c/s real, 2076K c/s virtual

Because this code was not easy to move to the GPU in the limited time,
we decided to change the base implementation to Matt Mahoney's code,
available here:
http://encode.ru/threads/1613-SHA3-winner-announced/page2
As far as I remember, on bull's CPU we were getting ~1200K c/s using his code.

Some more comments on the code:

1) This loop is generating duplicates:

    // (Hack) clear output buffers first
    for (i = 0; i < 32; ++i) {
        hashes[(i/4) * num_keys + gid] = 0;
    }

Let's assume:

    int num_keys=1024;
    int gid=5;
    for (i = 0; i < 32; ++i) {
        printf("%d ",(i/4) * num_keys + gid);
    }

And we get (each index is written four times):

5 5 5 5 1029 1029 1029 1029 2053 2053 2053 2053 3077 3077 3077 3077
4101 4101 4101 4101 5125 5125 5125 5125 6149 6149 6149 6149 7173 7173
7173 7173

2) We can add #pragma unroll N where N is a constant, for example:

    // keccak::get()
    #pragma unroll 32
    for (i = 0; i < 32; ++i) {

We're not getting any c/s improvement from this (for now), but for
cleanliness it is good to do it now so that we do not have to care
about it later.

3) I am curious what ISA code is generated by these macros:

    #define GETCHAR(buf, index) ((uchar)(buf >> index * 8) & 0xff)
    #define PUTCHAR(buf, index, val) (buf |= (val << index * 8))

Usually it is better to make reads/writes to global memory 32 bits wide.
If you have some free time tomorrow, you can check this out. Next week
we will move on to more serious tasks.

4) You can try running the OpenCL profiler and share some of the
results of that analysis here. I have some intuition, but it is an
exercise for you.

How did you like this task?

Lukas
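
As a rough illustration of points 1) and 3), here is a small host-side C sketch, not the kernel code itself. It assumes that each 256-bit hash is stored as eight 32-bit words interleaved across keys (word w of key gid at index w * num_keys + gid), which is what the (i/4) * num_keys + gid indexing suggests. The names HASH_WORDS, clear_hash, distinct_indices_in_byte_loop, GETBYTE, PUTBYTE and pack_word are made up for this example. The macros are parenthesized variants of GETCHAR/PUTCHAR, since the originals would misbehave if index were an expression such as i + 1.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: a 256-bit hash occupies eight 32-bit words per key. */
enum { HASH_WORDS = 8 };

/* Clear one key's output words, touching each location exactly once. */
void clear_hash(uint32_t *hashes, size_t num_keys, size_t gid)
{
    for (size_t w = 0; w < HASH_WORDS; ++w)
        hashes[w * num_keys + gid] = 0;
}

/* Count how many different indices the original 32-iteration loop hits.
 * The indices never decrease, so counting changes counts distinct values. */
size_t distinct_indices_in_byte_loop(size_t num_keys, size_t gid)
{
    size_t distinct = 0, prev = 0;
    for (size_t i = 0; i < 32; ++i) {
        size_t idx = (i / 4) * num_keys + gid;
        if (i == 0 || idx != prev)
            ++distinct;
        prev = idx;
    }
    return distinct;
}

/* Hypothetical, fully parenthesized variants of GETCHAR/PUTCHAR. */
#define GETBYTE(buf, index) ((uint8_t)(((buf) >> ((index) * 8)) & 0xffu))
#define PUTBYTE(buf, index, val) ((buf) |= ((uint32_t)(val) << ((index) * 8)))

/* Build a word in a local variable so the caller can issue one
 * 32-bit store instead of four read-modify-write byte updates. */
uint32_t pack_word(const uint8_t bytes[4])
{
    uint32_t word = 0; /* PUTBYTE only ORs bits in, so start from zero */
    for (unsigned i = 0; i < 4; ++i)
        PUTBYTE(word, i, bytes[i]);
    return word;
}
```

With num_keys = 1024 and gid = 5, distinct_indices_in_byte_loop reports 8, matching the eight different values in the printed list above. Whether packing words locally actually helps on a given GPU is something the profiler would have to confirm.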