Message-ID: <b1b3b182303b7930a45c176c49eb5611@smtp.hushmail.com>
Date: Tue, 13 Oct 2015 20:37:26 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: LOP3.LUT

On 2015-10-13 10:43, magnum wrote:
> Most formats now have LOP3.LUT alternatives and seem to work fine.
> Some don't get any boost (just meaning the toolchain did a good job
> already) but I think md5crypt is the only one getting a definite
> performance regression (and still has it disabled). We should get to the
> bottom of that. BTW it would be very nice having CUDA 7.5 on super.

Comparison of md5crypt kernel compiled with bitselect vs. with explicit 
LOP3.LUT for the function primitives:

Bitselect:
ptxas info    : 0 bytes gmem, 54 bytes cmem[3]
ptxas info    : Compiling entry function 'cryptmd5' for 'sm_52'
ptxas info    : Function properties for cryptmd5
ptxas         .     592 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 38 registers, 344 bytes cmem[0], 268 bytes cmem[2]

Explicit LOP3.LUT:
ptxas info    : 0 bytes gmem, 54 bytes cmem[3]
ptxas info    : Compiling entry function 'cryptmd5' for 'sm_52'
ptxas info    : Function properties for cryptmd5
ptxas         .     592 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 37 registers, 344 bytes cmem[0], 260 bytes cmem[2]

                 explicit  bitselect
PTX #lines      4293      4375
ISA #lines      4214      4177
DEPBAR          56        62
LOP32I          31        33
LOP3            372       372
.reuse          235       349
LOP3 w/ .reuse  95        103
IADD32          420       400
IADD3           381       383

Fewer DEPBAR instructions should be a good thing, but I think the much
lower ".reuse" count (235 vs. 349) is not, and this may be the main
problem. But we can't specify which registers to use! Perhaps register
allocation when using inline PTX lop3 will improve over time. After
reading some forum posts about register slots I actually tried using
alternate lop3 immediates, shuffling x, y and z around. I could only
conclude that it *does* sometimes matter... but the chance of actually
controlling the situation appears pretty slim to me.

LOP3 immediates used:
explicit:  0x39, 0x96, 0xca, 0xe4 (just the ones used in my functions).
bitselect: 0x4b, 0x96, 0xac, 0xb8, 0xca.

For reference, the natural truth table for just a bitselect is 0xd8 and 
alternatives when shuffling x, y and z around are 0xac, 0xb8, 0xca, 0xe2 
and 0xe4. And 0x96 is (x ^ y ^ z) in any order. That leaves 0x4b to 
investigate. I think I located the corresponding section in the PTX and
in the ISA, but I don't understand what is happening there, so I gave up
at that point.
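
For anyone wanting to re-check the immediates above, here is a minimal Python sketch. It assumes the usual construction from the PTX documentation for lop3: evaluate the expression on the masks 0xF0, 0xCC and 0xAA and keep the low byte. The helper names (lut, variants, bitselect, md5_i) are made up for this illustration and are not taken from the kernel source. One observation from the truth tables alone, not confirmed against the generated code: 0x39 is what MD5's I function, y ^ (x | ~z), gives in natural operand order, and 0x4b is among the values the same function gives with its operands reordered.

```python
from itertools import permutations

# Truth-table columns for the three lop3 operands a, b, c.
TA, TB, TC = 0xF0, 0xCC, 0xAA


def lut(f):
    """Return the 8-bit lop3 immediate for a bitwise function f(a, b, c)."""
    return f(TA, TB, TC) & 0xFF


def variants(f):
    """Immediates for f under every ordering of its three operands."""
    return sorted({f(*p) & 0xFF for p in permutations((TA, TB, TC))})


def bitselect(a, b, c):
    # OpenCL semantics: bits of b where c is set, bits of a elsewhere.
    return (a & ~c) | (b & c)


def md5_i(x, y, z):
    # MD5 round-4 function I(x, y, z) = y ^ (x | ~z).
    return y ^ (x | ~z)


if __name__ == "__main__":
    print(hex(lut(bitselect)))                    # 0xd8
    print([hex(v) for v in variants(bitselect)])  # the six bitselect orderings
    print(hex(lut(lambda a, b, c: a ^ b ^ c)))    # 0x96
    print([hex(v) for v in variants(md5_i)])      # includes 0x39 and 0x4b
```

If that reading is right, the compiler may simply have picked a different operand order for I than the explicit version did, which would fit the earlier remark that operand order sometimes matters.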

On another note, I find it strange that the difference in 2-op adds
(IADD32: 420 vs. 400) doesn't match the difference in 3-op adds
(IADD3: 381 vs. 383) at all.
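
To put numbers on that mismatch, a small Python sketch that only redoes the arithmetic on the two add rows of the table above. The dictionary layout and the delta helper are assumptions made for this illustration.

```python
# Instruction counts copied from the table: (explicit, bitselect).
counts = {
    "IADD32": (420, 400),
    "IADD3": (381, 383),
}


def delta(name):
    """Explicit count minus bitselect count for one instruction."""
    explicit, bitsel = counts[name]
    return explicit - bitsel


if __name__ == "__main__":
    for name in counts:
        print(f"{name}: {delta(name):+d}")   # IADD32: +20, IADD3: -2
    total_explicit = sum(e for e, _ in counts.values())
    total_bitsel = sum(b for _, b in counts.values())
    print(total_explicit, "vs", total_bitsel)  # 801 vs 783
```

So the explicit build has 18 more add instructions overall, rather than trading 2-op adds for 3-op adds one for one.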

magnum
