john-dev - JtR on ARM (NEON)

Message-Id: <97419E41-BD34-4D85-8FD0-AABC74012AEF@gmail.com>
Date: Fri, 31 Jul 2015 15:58:27 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: JtR on ARM (NEON)

Hi,

A schoolmate of mine got an ARM board in his lab and gave me access to it. It's some model of Nvidia Tegra, with 4 cores and NEON support, though I don't know which specific model it is.

So I just added NEON intrinsics to JtR's pseudo-intrinsics family to see how it performs on this board. Here are some figures:

(OpenMP is disabled in this test. PBKDF2-HMAC-SHA512 failed somehow, so I chose sha512crypt here.)

[Without NEON]
Benchmarking: PBKDF2-HMAC-MD4 [PBKDF2-MD4 32/32]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	2165 c/s real, 2353 c/s virtual

Benchmarking: PBKDF2-HMAC-MD5 [PBKDF2-MD5 32/32]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	1659 c/s real, 1803 c/s virtual

Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 32/32]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	1242 c/s real, 1350 c/s virtual

Benchmarking: PBKDF2-HMAC-SHA256 [PBKDF2-SHA256 32/32 OpenSSL]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	761 c/s real, 827 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 64/32 OpenSSL]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	179 c/s real, 192 c/s virtual

[With NEON]
Benchmarking: PBKDF2-HMAC-MD4 [PBKDF2-MD4 128/128 NEON 4x]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	4504 c/s real, 4741 c/s virtual

Benchmarking: PBKDF2-HMAC-MD5 [PBKDF2-MD5 128/128 NEON 4x]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	3053 c/s real, 3280 c/s virtual

Benchmarking: PBKDF2-HMAC-SHA1 [PBKDF2-SHA1 128/128 NEON 4x]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	1280 c/s real, 1347 c/s virtual

Benchmarking: PBKDF2-HMAC-SHA256 [PBKDF2-SHA256 128/128 NEON 4x]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:	644 c/s real, 685 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 NEON 2x]... DONE
Speed for cost 1 (iteration count) of 5000
Raw:	36.1 c/s real, 38.3 c/s virtual

From the figures above, MD4 and MD5 get roughly a 2x speedup (about 2.1x and 1.8x); SHA1 shows no speedup; SHA256 is somewhat slower (about 15%); SHA512 gets a lot slower (about 5x).
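
For reference, the ratios behind that summary can be recomputed from the "real" c/s values in the two benchmark runs. Below is a minimal C sketch; the names speedup() and print_speedups() are hypothetical helpers made up for this illustration and are not part of JtR.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of NEON speed to scalar speed; a value above 1.0 means NEON is faster. */
double speedup(double scalar_cps, double neon_cps)
{
    assert(scalar_cps > 0.0);
    return neon_cps / scalar_cps;
}

/* Print the ratio for each pair of "real" c/s figures quoted in this message. */
void print_speedups(void)
{
    static const struct { const char *name; double scalar, neon; } runs[] = {
        { "PBKDF2-HMAC-MD4",    2165.0, 4504.0 },
        { "PBKDF2-HMAC-MD5",    1659.0, 3053.0 },
        { "PBKDF2-HMAC-SHA1",   1242.0, 1280.0 },
        { "PBKDF2-HMAC-SHA256",  761.0,  644.0 },
        { "sha512crypt",         179.0,   36.1 },
    };

    for (size_t i = 0; i < sizeof runs / sizeof runs[0]; i++)
        printf("%-20s %.2fx\n", runs[i].name,
               speedup(runs[i].scalar, runs[i].neon));
}
```

Calling print_speedups() from a main() prints one ratio per format; anything below 1.00x is a slowdown relative to the non-NEON build.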

In my current implementation, most pseudo-intrinsics are directly mapped to NEON intrinsics. The only exceptions are vcmov and vroti, which have to be emulated. But I don't think they're the cause of the poor performance, since they're also emulated in an AVX build.
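
To make "emulated" concrete, here is a minimal sketch in plain C (not real NEON intrinsics, so it builds anywhere) of how a rotate and a conditional move are commonly built from simpler operations. It assumes vroti is a rotate-left by a constant and that vcmov(y, z, x) selects bits of y where the mask x is set and bits of z elsewhere; that argument order, the vec32x4 type, and the emu_* names are assumptions for illustration, not the actual JtR macros.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4  /* a 128-bit register holds four 32-bit lanes */

/* Hypothetical stand-in for a 128-bit vector of 32-bit words. */
typedef struct { uint32_t v[LANES]; } vec32x4;

/* Emulated rotate-left by n bits: two shifts and an OR per lane. */
vec32x4 emu_vroti(vec32x4 x, unsigned n)
{
    vec32x4 r;

    assert(n > 0 && n < 32);  /* avoid an undefined shift by 32 */
    for (int i = 0; i < LANES; i++)
        r.v[i] = (x.v[i] << n) | (x.v[i] >> (32 - n));
    return r;
}

/* Emulated conditional move: per bit, take y where x is 1, else z.
 * Written as z ^ (x & (y ^ z)), which needs three logical operations. */
vec32x4 emu_vcmov(vec32x4 y, vec32x4 z, vec32x4 x)
{
    vec32x4 r;

    for (int i = 0; i < LANES; i++)
        r.v[i] = z.v[i] ^ (x.v[i] & (y.v[i] ^ z.v[i]));
    return r;
}
```

In this form each emulated rotate costs three vector operations and each emulated select costs three as well, compared with one if the instruction set had a native equivalent.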

Thoughts?

Lei
