Message-ID: <BANLkTik35RwWJcV6vjA0Dko8Sef95ruuJQ@mail.gmail.com>
Date: Tue, 12 Apr 2011 00:11:14 +0200
From: Łukasz Odzioba <lukas.odzioba@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: sha256 format patches

Of course I will upload the pictures, results, and patch to the wiki.

>Yes, but you're not hitting the PCIe bandwidth limit yet, although with
>synchronous transfers some time is in fact wasted until the transfer is
>complete and before actual processing starts.

On my GPU it is in fact 5%, but it is easy to buy a 10x faster GPU, so I am aware that PCIe transfer time matters. Of course I could send less data over the bus, but for now that is not the main optimization target (it will be). In fact, I need to check how much time it takes to build the 512-bit SHA input block from a "string". Moving that step to the GPU may decrease the "GPU idle" time. It should also give nice main memory savings.

>Why not do it in init() and keep the allocation until the "john" process terminates?

I did that; it worked well and gives the mentioned +8% boost. I wasn't sure that leaving cleanup to the OS is a good idea, so it's not included in the patch, but it can be done with a 4-line modification. I didn't implement async copy for the same reason. It's not a big deal, so it will be done before Friday.

>Is this just because you haven't implemented unrolling for the slow hashes case yet?

Yes. I compared slow vs. fast yesterday, but did the unrolling on fast SHA only today. It is easy, so I will post results next time.

>Now how about implementing SHA-crypt? You'll also need to implement SHA-512 for that, which is trickier (64-bit integers).

I'll try and see what I can do. CUDA offers 32-bit and 24-bit integer operations. The trick is that 24-bit operations are almost 8 times faster (but NVIDIA says this might change in the future), so maybe it is worth implementing 64-bit operations on both types and comparing their efficiency.

>That "certain change" I had mentioned was switching to 64-bit partial
>hashes. So 4 times less data to transfer from the GPU. In my hack of
>the code, I did not deal with potential false positives in any way, but
>a proper implementation will need to do it, likely by having cmp_exact()
>invoke an on-CPU implementation (it doesn't need to be fast). Then even
>32-bit partial hashes would work, for further speedup and memory savings.

I do not understand what a partial hash is and how it affects speed. Could you please give me more details on how to do it?

Thanks for the benchmark. There is still work to do in optimizing this code. I think the results can be improved by finding optimal threads/blocks/registers settings. As I mentioned, it is important to develop a self-configuration script to get maximum occupancy on every card, which I will do soon. I'll try to get access to a faster GPU before I am able to buy a Fermi-based card.

Thanks for all.
Lukas
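
On the SHA-512 point above, here is a minimal sketch of how the 64-bit operations could be built from 32-bit halves. It is plain C rather than CUDA, so it can be checked on the CPU against native uint64_t. The u64x2 type and all function names are hypothetical and are not taken from the patch; the sigma0 definition follows the SHA-512 specification (ROTR 1, ROTR 8, SHR 7).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: one 64-bit word stored as two 32-bit halves,
   as a device with only 32-bit integer units would hold it. */
typedef struct { uint32_t hi, lo; } u64x2;

static u64x2 u64x2_split(uint64_t v)
{
    u64x2 r = { (uint32_t)(v >> 32), (uint32_t)v };
    return r;
}

static uint64_t u64x2_join(u64x2 v)
{
    return ((uint64_t)v.hi << 32) | v.lo;
}

/* Addition modulo 2^64: add the low halves, then carry into the high half. */
static u64x2 u64x2_add(u64x2 a, u64x2 b)
{
    u64x2 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo); /* unsigned wrap-around signals a carry */
    return r;
}

static u64x2 u64x2_xor(u64x2 a, u64x2 b)
{
    u64x2 r = { a.hi ^ b.hi, a.lo ^ b.lo };
    return r;
}

/* Rotate right by n bits (n is reduced modulo 64). */
static u64x2 u64x2_rotr(u64x2 a, unsigned n)
{
    u64x2 r;
    n &= 63;
    if (n >= 32) { /* a rotation by 32 is just a swap of the halves */
        uint32_t t = a.hi; a.hi = a.lo; a.lo = t;
        n -= 32;
    }
    if (n == 0)
        return a;  /* avoids an undefined shift by 32 below */
    r.lo = (a.lo >> n) | (a.hi << (32 - n));
    r.hi = (a.hi >> n) | (a.lo << (32 - n));
    return r;
}

/* Logical shift right by n bits, 0 <= n < 64. */
static u64x2 u64x2_shr(u64x2 a, unsigned n)
{
    u64x2 r;
    if (n == 0)
        return a;
    if (n >= 32) {
        r.hi = 0;
        r.lo = a.hi >> (n - 32);
        return r;
    }
    r.lo = (a.lo >> n) | (a.hi << (32 - n));
    r.hi = a.hi >> n;
    return r;
}

/* SHA-512 small sigma0: ROTR(x,1) ^ ROTR(x,8) ^ SHR(x,7). */
static u64x2 sha512_sigma0(u64x2 x)
{
    return u64x2_xor(u64x2_xor(u64x2_rotr(x, 1), u64x2_rotr(x, 8)),
                     u64x2_shr(x, 7));
}

/* Compare the split implementation with native 64-bit arithmetic. */
void check_against_native(uint64_t x, uint64_t y)
{
    uint64_t s0 = ((x >> 1) | (x << 63)) ^ ((x >> 8) | (x << 56)) ^ (x >> 7);
    assert(u64x2_join(u64x2_add(u64x2_split(x), u64x2_split(y))) == x + y);
    assert(u64x2_join(sha512_sigma0(u64x2_split(x))) == s0);
}
```

Calling check_against_native() with a few values is a cheap way to validate the split arithmetic before porting it to a kernel. SHA-512 itself needs only additions, rotations, shifts and bitwise operations on these words.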