Message-ID: <20150731083517.GB31035@openwall.com>
Date: Fri, 31 Jul 2015 11:35:17 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: JtR on ARM (NEON)

On Fri, Jul 31, 2015 at 03:58:27PM +0800, Lei Zhang wrote:
> A schoolmate of mine got a ARM board in his lab and gave me access to
> it. It's some model of Nvidia Tegra, with 4-cores and NEON support,
> though I don't know which specific model it is.

You should check /proc/cpuinfo under Linux.

> (OpenMP is disabled in this test. PBKDF2-HMAC-SHA512 failed somehow,
> so I chose sha512crypt here.)

You'll need to investigate why PBKDF2-HMAC-SHA512 fails. This might
provide a clue as to why sha512crypt became slower.

> Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 64/32 OpenSSL]... DONE

BTW, the 64/32 here is wrong. Should be 32/32. Just because an
algorithm uses 64-bit integers logically doesn't mean we should report
it as using 64 out of 32 physical bits, since it can't. magnum?

> From the figures above, MD4 and MD5 get 2x speedup; SHA1 and SHA256
> have no speedup; SHA512 gets a lot slower.

Yes. That's weird. I assume you haven't started playing with
interleaving factors yet?

> In my currently implementation, most pseudo-intrinsics are directly
> mapped to NEON intrinsics. The only exceptions are vcmov and vroti,
> which have to be emulated.

As I told you before, no, vcmov must not be emulated - we have it on
NEON natively. Please see how it's done in DES_bs_b.c.

As to vroti, yes, although there's a 2-instruction way to emulate it,
see page 4 in:

https://cryptojedi.org/papers/neoncrypto-20120320.pdf

Maybe it'd work faster at high interleaving factors (and slower at low
interleaving factors, since it's higher latency than the
straightforward 3-instruction approach).
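
To make the two points above concrete, here is a minimal sketch in C. It is not the JtR code (DES_bs_b.c is the place to look for that), and all names in it (select_neon, ROTL_NEON_2, ROTL_NEON_3, and the scalar *_ref helpers) are made up for illustration. The argument order of the select helper is also an assumption. The NEON part only compiles on ARM; the scalar part models the same operations on one 32-bit lane so the logic can be checked anywhere.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Bitwise select: take bits of y where mask is 1, bits of z elsewhere.
 * NEON provides this natively (VBSL), so no and/xor emulation is needed. */
static inline uint32x4_t select_neon(uint32x4_t y, uint32x4_t z,
                                     uint32x4_t mask)
{
    return vbslq_u32(mask, y, z);
}

/* 3-instruction rotate left: two independent shifts, then OR them.
 * n must be a compile-time constant in 1..31; x is evaluated twice. */
#define ROTL_NEON_3(x, n) \
    vorrq_u32(vshlq_n_u32((x), (n)), vshrq_n_u32((x), 32 - (n)))

/* 2-instruction rotate left: shift left, then shift-right-and-insert
 * (VSRI). The second instruction depends on the first, so the chain is
 * longer even though the instruction count is lower. */
#define ROTL_NEON_2(x, n) \
    vsriq_n_u32(vshlq_n_u32((x), (n)), (x), 32 - (n))
#endif

/* Scalar reference for the select, one 32-bit lane. */
static inline uint32_t select_ref(uint32_t y, uint32_t z, uint32_t mask)
{
    return (y & mask) | (z & ~mask);
}

/* Scalar reference rotate left, valid for 1 <= n <= 31. */
static inline uint32_t rotl_ref(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Scalar model of VSRI on one lane, 1 <= n <= 31:
 * keep the top n bits of a, fill the remaining bits with b >> n. */
static inline uint32_t sri_ref(uint32_t a, uint32_t b, unsigned n)
{
    uint32_t keep = ~(UINT32_MAX >> n);
    return (a & keep) | (b >> n);
}

/* Rotate left built from shift-left plus shift-right-and-insert. */
static inline uint32_t rotl_sri_ref(uint32_t x, unsigned n)
{
    return sri_ref(x << n, x, 32 - n);
}

/* Self-check: returns 1 if all variants agree (asserts otherwise). */
int check_rotate_and_select(void)
{
    const uint32_t x = 0x80000001u;
    unsigned n;

    for (n = 1; n < 32; n++)
        assert(rotl_sri_ref(x, n) == rotl_ref(x, n));
    assert(select_ref(0xAAAAAAAAu, 0x55555555u, 0xFFFF0000u)
           == 0xAAAA5555u);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    {
        uint32_t out[4];
        uint32x4_t v = vdupq_n_u32(x);

        vst1q_u32(out, ROTL_NEON_3(v, 7));
        assert(out[0] == rotl_ref(x, 7));
        vst1q_u32(out, ROTL_NEON_2(v, 7));
        assert(out[3] == rotl_ref(x, 7));
        vst1q_u32(out, select_neon(vdupq_n_u32(0xAAAAAAAAu),
                                   vdupq_n_u32(0x55555555u),
                                   vdupq_n_u32(0xFFFF0000u)));
        assert(out[0] == 0xAAAA5555u);
    }
#endif
    return 1;
}
```

Calling check_rotate_and_select() from a test program exercises both rotate forms. Which form is faster on a given core, and at which interleaving factor, is something only a benchmark can tell.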

BTW, when you emulate a rotate with two shifts, you may sometimes see
better results when you combine them with a XOR rather than an OR,
because crypto code tends to use XORs nearby, so the compiler will be
able to re-order the XORs if it sees an opportunity to hide latencies
that way. With an OR and a XOR, it won't be easy for the compiler to
see that the OR is equivalent to a XOR in this particular case.

> But I don't think they're the excuses for the poor performance, since
> they're also emulated in a AVX build.

Yes, there must be something else as well. Maybe unaligned accesses.

Alexander
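
The remark above about XOR versus OR relies on the two shifted halves of a rotate never having a set bit in the same position, so OR and XOR give the same result. The scalar C sketch below checks that equivalence. The helper names and the mixing step (acc ^ rotl(x, 13) ^ k) are hypothetical and are not taken from JtR or from any particular hash.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left with the halves combined by OR; valid for 1 <= n <= 31. */
static inline uint32_t rotl_or(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Same rotate with the halves combined by XOR. The low n bits of
 * (x << n) are zero and the high 32 - n bits of (x >> (32 - n)) are
 * zero, so no bit position is set in both halves and XOR equals OR. */
static inline uint32_t rotl_xor(uint32_t x, unsigned n)
{
    return (x << n) ^ (x >> (32 - n));
}

/* Hypothetical mixing step written with the OR form: the OR sits in
 * the middle of the XORs and blocks re-association. */
static inline uint32_t mix_or(uint32_t acc, uint32_t x, uint32_t k)
{
    return acc ^ rotl_or(x, 13) ^ k;
}

/* Same step with the XOR form: the whole expression is one chain,
 * acc ^ (x << 13) ^ (x >> 19) ^ k, which may be evaluated in any
 * order, for example (acc ^ k) first while the shifts are in flight. */
static inline uint32_t mix_xor(uint32_t acc, uint32_t x, uint32_t k)
{
    return acc ^ rotl_xor(x, 13) ^ k;
}

/* Self-check: returns 1 if both forms agree (asserts otherwise). */
int check_rotate_xor_equals_or(void)
{
    uint32_t x = 0x9E3779B9u;
    unsigned i, n;

    for (i = 0; i < 1000; i++) {
        for (n = 1; n < 32; n++)
            assert(rotl_xor(x, n) == rotl_or(x, n));
        assert(mix_xor(i, x, ~x) == mix_or(i, x, ~x));
        x = x * 1664525u + 1013904223u; /* simple LCG for test inputs */
    }
    return 1;
}
```

Whether a given compiler takes advantage of the XOR form is not guaranteed, so it is worth benchmarking both variants.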