Message-Id: <4FE77D30-F08A-4F44-948F-3FF1FA5ED8A5@gmail.com>
Date: Mon, 3 Aug 2015 16:53:39 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: JtR on ARM (NEON)

> On Jul 31, 2015, at 4:35 PM, Solar Designer <solar@...nwall.com> wrote:
>
> On Fri, Jul 31, 2015 at 03:58:27PM +0800, Lei Zhang wrote:
>> A schoolmate of mine got an ARM board in his lab and gave me access to it. It's some model of Nvidia Tegra, with 4 cores and NEON support, though I don't know which specific model it is.
>
> You should check /proc/cpuinfo under Linux.

Core-0:
    processor        : 0
    model name       : ARMv7 Processor rev 3 (v7l)
    Features         : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
    CPU implementer  : 0x41
    CPU architecture : 7
    CPU variant      : 0x3
    CPU part         : 0xc0f
    CPU revision     : 3

It looks like Tegra 3 or 4.

>> From the figures above, MD4 and MD5 get 2x speedup; SHA1 and SHA256 have no speedup; SHA512 gets a lot slower.
>
> Yes. That's weird.
>
> I assume you haven't started playing with interleaving factors yet?

Not yet.

>> In my current implementation, most pseudo-intrinsics are directly mapped to NEON intrinsics. The only exceptions are vcmov and vroti, which have to be emulated.
>
> As I told you before, no, vcmov must not be emulated - we have it on
> NEON natively. Please see how it's done in DES_bs_b.c.
>
> As to vroti, yes, although there's a 2-instruction way to emulate it,
> see page 4 in:
>
> https://cryptojedi.org/papers/neoncrypto-20120320.pdf
>
> Maybe it'd work faster at high interleaving factors (and slower at low
> interleaving factors, since it's higher latency than the straightforward
> 3-instruction approach).

I get your point. NEON's vbsl works just like vcmov.

With the 2-instruction emulation of vroti, PBKDF2-HMAC-SHA256 got a boost from 644 c/s real to 976 c/s real. Other formats saw no significant performance change.

However, I have a problem with the emulation of vroti.
Literally, it should be defined this way:

    #define vroti_epi32(x, i) \
        ((i) > 0 ? vsliq_n_u32(vshrq_n_u32(x, 32 - (i)), x, i) : \
                   vsriq_n_u32(vshlq_n_u32(x, 32 + (i)), x, -(i)))

Somehow it fails to compile, but only when building rawSHA1_ng_fmt_plug.o, and gives a cryptic error message:

    /tmp/ccgVmq2d.s: Assembler messages:
    /tmp/ccgVmq2d.s:4397: Error: co-processor offset out of range

Manually changing -O2 to -O0 for rawSHA1_ng_fmt_plug.o makes the error go away. I googled it and found other people hitting the same error under various circumstances, so I think it's quite possibly a compiler bug. Here's a related bug report:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47246

The version of gcc used is 4.6.4. Unfortunately, there's currently no easy way of upgrading gcc on my mate's board. Solar, do you have a newer gcc on your ARM board?

Lei
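
As a sanity check on the vroti definition above, here is a minimal scalar sketch in portable C that models a single 32-bit lane of the shift and shift-insert operations and compares the result against a plain rotate. The helper names (shr_lane, shl_lane, sli_lane, sri_lane, vroti_model, vroti_model_selfcheck) are hypothetical and not part of JtR or the NEON API. The lane semantics are written from my understanding of VSHR/VSHL/VSLI/VSRI, so treat this as an illustration of why the 2-instruction form is a rotate, not as a replacement for testing on real hardware. It also assumes i is a nonzero constant in -31..31, since i = 0 would need a left shift by 32.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of a logical right shift by n (1..31 is all we need here). */
static uint32_t shr_lane(uint32_t x, unsigned n) { return x >> n; }

/* One lane of a left shift by n (1..31 is all we need here). */
static uint32_t shl_lane(uint32_t x, unsigned n) { return x << n; }

/* Shift-left-and-insert: b << n, keeping the low n bits of a. */
static uint32_t sli_lane(uint32_t a, uint32_t b, unsigned n)
{
    uint32_t keep = UINT32_MAX >> (32 - n);   /* low n bits, n in 1..31 */
    return (b << n) | (a & keep);
}

/* Shift-right-and-insert: b >> n, keeping the high n bits of a. */
static uint32_t sri_lane(uint32_t a, uint32_t b, unsigned n)
{
    uint32_t keep = ~(UINT32_MAX >> n);       /* high n bits, n in 1..31 */
    return (b >> n) | (a & keep);
}

/* Scalar model of vroti_epi32(x, i): i > 0 rotates left by i,
 * i < 0 rotates right by -i. Assumes i != 0 and -32 < i < 32. */
static uint32_t vroti_model(uint32_t x, int i)
{
    assert(i != 0 && i > -32 && i < 32);
    if (i > 0)
        return sli_lane(shr_lane(x, 32 - i), x, (unsigned)i);
    return sri_lane(shl_lane(x, 32 + i), x, (unsigned)-i);
}

/* Compare the model with a straightforward rotate for every valid count.
 * Returns 1 if all cases agree, 0 otherwise. */
static int vroti_model_selfcheck(void)
{
    const uint32_t samples[] = {
        0x00000001u, 0x80000000u, 0xdeadbeefu, 0xffffffffu, 0x12345678u
    };
    for (unsigned s = 0; s < sizeof samples / sizeof samples[0]; s++) {
        uint32_t x = samples[s];
        for (int r = 1; r < 32; r++) {
            uint32_t rotl = (x << r) | (x >> (32 - r));
            uint32_t rotr = (x >> r) | (x << (32 - r));
            if (vroti_model(x, r) != rotl || vroti_model(x, -r) != rotr)
                return 0;
        }
    }
    return 1;
}
```

Calling vroti_model_selfcheck() from a small test program exercises every rotate count from 1 to 31 in both directions. This only checks the arithmetic of the macro, so it says nothing about the assembler error, which shows up at -O2 in one object file only.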