Message-ID: <CABob6iq_=KK+xpCQ-eS7a68bBmsJDbqLY3r=HkG4+ayDgkSOvw@mail.gmail.com>
Date: Sat, 30 May 2015 02:20:49 +0200
From: Lukas Odzioba <lukas.odzioba@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC candidates project - irc minutes

2015-05-25 22:51 GMT+02:00 Lukas Odzioba <lukas.odzioba@...il.com>:
> Just the major points of the discussion are published here.

Next minutes below.

----------------#5 28.05.2015------------

A: Implemented the first version of the simplified parallel loop in the
kernel; noted a performance regression on the 7990 (--dev=1) and Titan
(--dev=5), but an improvement on the GeForce 960M:
   Titan: 37882 c/s (previously ~38k)
   7990:  17617 c/s (previously ~45k)
   960M:  32k c/s   (previously ~28k)
A: Reported a self-test problem on the 7990 (but not on NVIDIA); also
mentioned that she knows the reason and that it is expected.
L: Tested performance on the 7970: 31k c/s on bleeding-jumbo and 7k c/s on
the parallel_opt branch (commits fbbe01e4d4 and 656f9c55a2, respectively).
L: After a brief review of the current code, suggested taking a look at the
pbkdf2-sha512-opencl format to reuse its ideas on code simplification:
unrolling the SHA-512 loop, dropping additions of known-zero words, and
avoiding endianness swapping where possible (sketches in the P.S. below).
A&L: Discussed the BENCHMARK_LENGTH setting and decided to leave it at 0
for now, maybe changing it to -1 for parallel in the future.

----------------#6 29.05.2015------------

A: Tested various code versions; did not find a clear path to optimization.
A: Concerned about code size affecting performance (after removing the
add-zero instructions).
L: Tried to run the latest CodeXL locally, without success, and suggested
running CodeXL remotely on super.
A: Found a problem with library dependencies when trying to run CodeXL on
super and reported it to Solar.
L: Helped find a way to estimate generated code size for GCN using
Daniel's notes on the wiki.
A: Checked code size for the parallel_opt and bleeding-jumbo branches:
33 kB and 183 kB, respectively (the L1 code cache on GCN is 32 kB).
L: Suggested getting rid of #pragma unroll, but it did not help.
A: Reduced code size by using #pragma unroll 1 (sketch in the P.S. below).
A: Reported a self-test problem on GCN with #pragma unroll 1.
L: Noticed a huge difference between the speeds reported on the 7970 and
the 7990: 45k c/s on the 7990 (bleeding-jumbo fbbe01e4d44a79) vs.
31.5k c/s on the 7970 (maybe a different driver).
A: Reported that the code gets slower after adding the SHA-512 variant
optimized for zero words, and decided to try a split kernel (host-side
sketch in the P.S. below).
L: Suggested dropping the code snippets that speed up only NV or only GCN,
focusing on a general version for now, and introducing
architecture-dependent code/optimizations later.

Benchmarks with #pragma unroll 1 and the zero-word optimizations:
960M: 28k
Titan: 18.7k
7970: self-test failing

Stay tuned,
Lukas
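
P.S. A few minimal sketches of the techniques mentioned above, for anyone
following along. None of this is the actual parallel_opt code; all
identifiers and the input layout are made up for illustration.

#pragma unroll 1 asks the OpenCL compiler not to unroll a loop, so the 64
message-schedule steps of SHA-512 stay rolled and the generated code has a
chance of fitting in GCN's 32 kB L1 code cache:

#define ror64(x, n)  rotate((ulong)(x), (ulong)(64 - (n)))
#define sigma0(x)    (ror64(x, 1) ^ ror64(x, 8) ^ ((x) >> 7))
#define sigma1(x)    (ror64(x, 19) ^ ror64(x, 61) ^ ((x) >> 6))

__kernel void w_expand_demo(__global ulong *block)
{
	ulong w[80];
	int i;

	for (i = 0; i < 16; i++)
		w[i] = block[i];

	/* Without the pragma the compiler may fully unroll these 64 steps;
	 * unroll 1 keeps them as a rolled loop, trading some speed for
	 * much smaller code. */
#pragma unroll 1
	for (i = 16; i < 80; i++)
		w[i] = sigma1(w[i - 2]) + w[i - 7] +
		       sigma0(w[i - 15]) + w[i - 16];

	/* Keep a result alive so the loop is not optimized out. */
	block[0] = w[79];
}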
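
The "getting rid of 0's" idea: with a fixed, short input layout, some of
the 16 input words are known to be zero, so the early schedule steps can
drop the terms that would add them. For example, if w[9] and w[14] happen
to be zero for a given (assumed) layout:

	/* generic step 16 of the message schedule */
	w[16] = sigma1(w[14]) + w[9] + sigma0(w[1]) + w[0];
	/* the same step when w[9] == w[14] == 0 */
	w[16] = sigma0(w[1]) + w[0];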
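
As for endianness: SHA-512 is defined on big-endian words, so on
little-endian hardware each input word is normally byte-swapped, e.g. with
the usual OpenCL idiom below. If the host stores candidates already
byte-swapped, the per-word swaps in the kernel disappear:

	#define SWAP64(x) as_ulong(as_uchar8(x).s76543210)

	w[i] = SWAP64(block[i]); /* needed when block[] is little-endian */
	w[i] = block[i];         /* enough if the host pre-swapped block[] */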
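
Finally, the split-kernel idea: instead of one big kernel whose fully
inlined cost loop compiles to huge code, enqueue a smaller kernel
repeatedly from the host, with the kernel keeping its state in global
memory between calls. A hypothetical host-side helper (names invented,
error handling trimmed):

#include <CL/cl.h>

/* Cover total_iters iterations of the cost loop, iters_per_call of them
 * per enqueue of split_kernel. */
static cl_int run_split(cl_command_queue queue, cl_kernel split_kernel,
                        size_t gws, size_t lws,
                        unsigned total_iters, unsigned iters_per_call)
{
	cl_int ret = CL_SUCCESS;
	unsigned done;

	for (done = 0; done < total_iters && ret == CL_SUCCESS;
	     done += iters_per_call)
		ret = clEnqueueNDRangeKernel(queue, split_kernel, 1, NULL,
		                             &gws, &lws, 0, NULL, NULL);
	if (ret == CL_SUCCESS)
		ret = clFinish(queue);
	return ret;
}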