Message-ID: <CABob6iq_=KK+xpCQ-eS7a68bBmsJDbqLY3r=HkG4+ayDgkSOvw@mail.gmail.com>
Date: Sat, 30 May 2015 02:20:49 +0200
From: Lukas Odzioba <lukas.odzioba@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC candidates project - irc minutes

2015-05-25 22:51 GMT+02:00 Lukas Odzioba <lukas.odzioba@...il.com>:
> Just the major points of the discussion are published here.

Next minutes below.

----------------#5 28.05.2015------------

A: Implemented the first version of the simplified parallel loop in the
kernel; noted a performance regression on the 7990 (--dev=1) and Titan
(--dev=5), but an improvement on the GeForce 960M:
   Titan: 37882 c/s (previously ~38k)
   7990:  17617 c/s (previously ~45k)
   960M:  32k c/s   (previously ~28k)
A: Reported a self-test problem on the 7990 (but not on NVIDIA); also
mentioned that she knows the reason and that it is expected.
L: Tested performance on the 7970: 31k c/s on bleeding-jumbo and 7k c/s on
the parallel_opt branch (commits fbbe01e4d4 and 656f9c55a2, respectively).
L: After a brief review of the current code, suggested taking a look at the
pbkdf2-sha512-opencl format to reuse its ideas on code simplification:
unrolling the SHA-512 loop, dropping additions of known-zero words, and
avoiding endianness swapping where possible (sketches in the P.S. below).
A&L: Discussed the BENCHMARK_LENGTH setting and decided to leave it at 0
for now, maybe changing it to -1 for parallel in the future.

----------------#6 29.05.2015------------

A: Tested various code versions; did not find a clear path to optimization.
A: Concerned about code size affecting performance (after removing the
add-zero instructions).
L: Tried to run the latest CodeXL locally, without success, and suggested
running CodeXL remotely on super.
A: Found a problem with library dependencies when trying to run CodeXL on
super and reported it to Solar.
L: Helped find a way to estimate generated code size for GCN using
Daniel's notes on the wiki.
A: Checked code size for the parallel_opt and bleeding-jumbo branches:
33 kB and 183 kB, respectively (the L1 code cache on GCN is 32 kB).
L: Suggested getting rid of #pragma unroll, but it did not help.
A: Reduced code size by using #pragma unroll 1 (sketch in the P.S. below).
A: Reported a self-test problem on GCN with #pragma unroll 1.
L: Noticed a huge difference between the speeds reported on the 7970 and
the 7990: 45k c/s on the 7990 (bleeding-jumbo fbbe01e4d44a79) vs.
31.5k c/s on the 7970 (maybe a different driver).
A: Reported that the code gets slower after adding the SHA-512 variant
optimized for zero words, and decided to try a split kernel (host-side
sketch in the P.S. below).
L: Suggested dropping the code snippets that speed up only NV or only GCN,
focusing on a general version for now, and introducing
architecture-dependent code/optimizations later.

Benchmarks with #pragma unroll 1 and the zero-word optimizations:
960M: 28k
Titan: 18.7k
7970: self-test failing

Stay tuned,
Lukas
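
P.S. A few minimal sketches of the techniques mentioned above, for anyone
following along. None of this is the actual parallel_opt code; all
identifiers and the input layout are made up for illustration.

#pragma unroll 1 asks the OpenCL compiler not to unroll a loop, so the 64
message-schedule steps of SHA-512 stay rolled and the generated code has a
chance of fitting in GCN's 32 kB L1 code cache:

#define ror64(x, n)  rotate((ulong)(x), (ulong)(64 - (n)))
#define sigma0(x)    (ror64(x, 1) ^ ror64(x, 8) ^ ((x) >> 7))
#define sigma1(x)    (ror64(x, 19) ^ ror64(x, 61) ^ ((x) >> 6))

__kernel void w_expand_demo(__global ulong *block)
{
	ulong w[80];
	int i;

	for (i = 0; i < 16; i++)
		w[i] = block[i];

	/* Without the pragma the compiler may fully unroll these 64 steps;
	 * unroll 1 keeps them as a rolled loop, trading some speed for
	 * much smaller code. */
#pragma unroll 1
	for (i = 16; i < 80; i++)
		w[i] = sigma1(w[i - 2]) + w[i - 7] +
		       sigma0(w[i - 15]) + w[i - 16];

	/* Keep a result alive so the loop is not optimized out. */
	block[0] = w[79];
}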
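
The "getting rid of 0's" idea: with a fixed, short input layout, some of
the 16 input words are known to be zero, so the early schedule steps can
drop the terms that would add them. For example, if w[9] and w[14] happen
to be zero for a given (assumed) layout:

	/* generic step 16 of the message schedule */
	w[16] = sigma1(w[14]) + w[9] + sigma0(w[1]) + w[0];
	/* the same step when w[9] == w[14] == 0 */
	w[16] = sigma0(w[1]) + w[0];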
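
As for endianness: SHA-512 is defined on big-endian words, so on
little-endian hardware each input word is normally byte-swapped, e.g. with
the usual OpenCL idiom below. If the host stores candidates already
byte-swapped, the per-word swaps in the kernel disappear:

	#define SWAP64(x) as_ulong(as_uchar8(x).s76543210)

	w[i] = SWAP64(block[i]); /* needed when block[] is little-endian */
	w[i] = block[i];         /* enough if the host pre-swapped block[] */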
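
Finally, the split-kernel idea: instead of one big kernel whose fully
inlined cost loop compiles to huge code, enqueue a smaller kernel
repeatedly from the host, with the kernel keeping its state in global
memory between calls. A hypothetical host-side helper (names invented,
error handling trimmed):

#include <CL/cl.h>

/* Cover total_iters iterations of the cost loop, iters_per_call of them
 * per enqueue of split_kernel. */
static cl_int run_split(cl_command_queue queue, cl_kernel split_kernel,
                        size_t gws, size_t lws,
                        unsigned total_iters, unsigned iters_per_call)
{
	cl_int ret = CL_SUCCESS;
	unsigned done;

	for (done = 0; done < total_iters && ret == CL_SUCCESS;
	     done += iters_per_call)
		ret = clEnqueueNDRangeKernel(queue, split_kernel, 1, NULL,
		                             &gws, &lws, 0, NULL, NULL);
	if (ret == CL_SUCCESS)
		ret = clFinish(queue);
	return ret;
}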