Message-Id: <97419E41-BD34-4D85-8FD0-AABC74012AEF@gmail.com>
Date: Fri, 31 Jul 2015 15:58:27 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: JtR on ARM (NEON)

Hi,

A schoolmate of mine got an ARM board in his lab and gave me access to it. It's some model of Nvidia Tegra, with 4 cores and NEON support, though I don't know which specific model it is.

So I just added NEON intrinsics to JtR's pseudo-intrinsics family to see how it performs on this board. Here are some figures:
(OpenMP is disabled in this test. PBKDF2-HMAC-SHA512 failed somehow, so I chose sha512crypt here.)

[Without NEON]
Benchmarking: PBKDF2-HMAC-MD4 [PBKDF2-MD4 32/32]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	2165 c/s real, 2353 c/s virtual

Benchmarking: PBKDF2-HMAC-MD5 [PBKDF2-MD5 32/32]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	1659 c/s real, 1803 c/s virtual

Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 32/32]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	1242 c/s real, 1350 c/s virtual

Benchmarking: PBKDF2-HMAC-SHA256 [PBKDF2-SHA256 32/32 OpenSSL]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	761 c/s real, 827 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 64/32 OpenSSL]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	179 c/s real, 192 c/s virtual

[With NEON]
Benchmarking: PBKDF2-HMAC-MD4 [PBKDF2-MD4 128/128 NEON 4x]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	4504 c/s real, 4741 c/s virtual

Benchmarking: PBKDF2-HMAC-MD5 [PBKDF2-MD5 128/128 NEON 4x]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	3053 c/s real, 3280 c/s virtual

Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 128/128 NEON 4x]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	1280 c/s real, 1347 c/s virtual

Benchmarking: PBKDF2-HMAC-SHA256 [PBKDF2-SHA256 128/128 NEON 4x]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	644 c/s real, 685 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 NEON 2x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	36.1 c/s real, 38.3 c/s virtual

From the figures above, MD4 and MD5 get about a 2x speedup; SHA1 and SHA256 get no speedup; SHA512 gets a lot slower.

In my current implementation, most pseudo-intrinsics are directly mapped to NEON intrinsics. The only exceptions are vcmov and vroti, which have to be emulated. But I don't think they explain the poor performance, since they're also emulated in an AVX build.

Thoughts?

Lei
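
For readers unfamiliar with the two emulated operations, here is a minimal sketch of what such an emulation typically looks like. It is only an illustration under stated assumptions, not the code from the patch discussed above: the names vec4x32, vcmov_emu, vroti_emu and check_emulation, the argument order, and the use of a plain struct in place of a real 128-bit register are all hypothetical. On NEON the per-lane operations would presumably map to veorq_u32, vandq_u32, vorrq_u32, vshlq_n_u32 and vshrq_n_u32.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit vector holding four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } vec4x32;

/*
 * Emulated conditional move (bitwise select): for every bit, take the bit
 * from a where mask is 1 and from b where mask is 0.
 * Written as xor/and/xor, i.e. three logic operations per vector.
 * The argument order is an assumption made for this sketch.
 */
static vec4x32 vcmov_emu(vec4x32 a, vec4x32 b, vec4x32 mask)
{
    vec4x32 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = b.lane[i] ^ (mask.lane[i] & (a.lane[i] ^ b.lane[i]));
    return r;
}

/*
 * Emulated rotate-left of each 32-bit lane by n bits:
 * two shifts and one or per vector. Valid only for 0 < n < 32.
 */
static vec4x32 vroti_emu(vec4x32 x, unsigned n)
{
    vec4x32 r;
    assert(n > 0 && n < 32);
    for (int i = 0; i < 4; i++)
        r.lane[i] = (x.lane[i] << n) | (x.lane[i] >> (32 - n));
    return r;
}

/* Compares the emulations against the textbook definitions; returns 1 if they agree. */
int check_emulation(void)
{
    vec4x32 a = {{0xFFFFFFFFu, 0x12345678u, 0x00000000u, 0xAAAAAAAAu}};
    vec4x32 b = {{0x00000000u, 0x9ABCDEF0u, 0xFFFFFFFFu, 0x55555555u}};
    vec4x32 m = {{0xFFFF0000u, 0x00000000u, 0xFFFFFFFFu, 0x0F0F0F0Fu}};
    vec4x32 sel = vcmov_emu(a, b, m);
    vec4x32 rot = vroti_emu(a, 8);

    for (int i = 0; i < 4; i++) {
        /* Reference select: (a AND mask) OR (b AND NOT mask). */
        uint32_t want_sel = (a.lane[i] & m.lane[i]) | (b.lane[i] & ~m.lane[i]);
        /* Reference rotate: shift left in a 64-bit value, fold the overflow back in. */
        uint64_t wide = (uint64_t)a.lane[i] << 8;
        uint32_t want_rot = (uint32_t)wide | (uint32_t)(wide >> 32);
        if (sel.lane[i] != want_sel || rot.lane[i] != want_rot)
            return 0;
    }
    return 1;
}
```

Each emulated operation costs three vector instructions instead of one, which matters most in hashes that rotate heavily, but as noted above the same overhead exists in builds for other instruction sets that lack native rotate and select.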