john-users - ARM NEON

Message-ID: <20130603003646.GA23062@openwall.com>
Date: Mon, 3 Jun 2013 04:36:46 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: ARM NEON

Hi,

I've just created and uploaded a patch adding two make targets for
Linux on ARM - with and without NEON:

http://download.openwall.net/pub/projects/john/1.8.0/

The filename is john-1.8.0-arm.diff.gz (and there's a detached signature
file nearby).  For newbies, instructions on applying patches:

http://openwall.info/wiki/john/how-to-extract-tarballs-and-apply-patches

The two added make targets are:

$ make | fgrep -w ARM
linux-arm32le-neon       Linux, ARM 32-bit little-endian w/NEON (best)
linux-arm32le            Linux, ARM 32-bit little-endian

I am testing on ZedBoard, which has two Cortex-A9 cores inside a Xilinx
Zynq 7020 chip, running at 666 MHz in this case, compiling with "gcc
version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)".  Performance with the
linux-arm32le target, without OpenMP (to have a baseline to compare NEON
and OpenMP against):

solar@...aro-ubuntu-desktop:~/john-1.8.0-arm/src$ ../run/john -te=1
Benchmarking: descrypt, traditional crypt(3) [DES 32/32]... DONE
Many salts:     99488 c/s real, 99488 c/s virtual
Only one salt:  95936 c/s real, 95936 c/s virtual

Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [DES 32/32]...  DONE
Many salts:     3328 c/s real, 3328 c/s virtual
Only one salt:  3263 c/s real, 3263 c/s virtual

Benchmarking: md5crypt [MD5 32/32 X2]... DONE
Raw:    1479 c/s real, 1479 c/s virtual

Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/32]... DONE
Raw:    84.0 c/s real, 84.0 c/s virtual

Benchmarking: LM [DES 32/32]... DONE
Raw:    1675K c/s real, 1675K c/s virtual

Benchmarking: AFS, Kerberos AFS [DES 24/32 4K]... DONE
Short:  40960 c/s real, 40960 c/s virtual
Long:   97280 c/s real, 97280 c/s virtual

Benchmarking: tripcode [DES 32/32]... DONE
Raw:    91247 c/s real, 91247 c/s virtual

Benchmarking: dummy [N/A]... DONE
Raw:    12771K c/s real, 12899K c/s virtual

Benchmarking: crypt, generic crypt(3) [?/32]... DONE
Many salts:     24096 c/s real, 24096 c/s virtual
Only one salt:  24096 c/s real, 24096 c/s virtual

This was not fast, indeed.  Now with NEON for bitslice DES (still no
OpenMP, so using one of the two cores):

Benchmarking: descrypt, traditional crypt(3) [DES 128/128 NEON]... DONE
Many salts:     134144 c/s real, 134144 c/s virtual
Only one salt:  131200 c/s real, 131200 c/s virtual

Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [DES 128/128 NEON]... DONE
Many salts:     4736 c/s real, 4736 c/s virtual
Only one salt:  4643 c/s real, 4643 c/s virtual

Benchmarking: LM [DES 128/128 NEON]... DONE
Raw:    2292K c/s real, 2292K c/s virtual

Benchmarking: tripcode [DES 128/128 NEON]... DONE
Raw:    120470 c/s real, 120470 c/s virtual

(The rest of the formats don't make use of NEON yet, so I omitted their
same-speed benchmarks from here.)

That's only 35% faster, which is quite disappointing.  It turns out that
Cortex-A9 is actually documented to have high instruction latencies for
almost all NEON instructions:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409h/Babfjcjb.html

Trying to overcome this by using wider virtual vectors (256-bit, with
two instructions per operation) did not help - presumably because of too
few registers for that (in bitslice DES context; the same optimization
could work fine for some other hash/cipher).
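To illustrate what "wider virtual vectors" means here, below is a rough sketch in Python, not the actual John the Ripper C code.  Python integers stand in for 128-bit NEON registers, and the helper names (xor128, xor256, virtual_registers) are made up for this example.  Each 256-bit operation becomes two independent 128-bit operations, which gives the CPU a chance to overlap their latencies, but every 256-bit value then takes up two hardware registers.  The figure of sixteen 128-bit registers is the ARMv7 NEON Q register count.

```python
MASK128 = (1 << 128) - 1
NEON_Q_REGISTERS = 16  # ARMv7 NEON provides sixteen 128-bit Q registers


def xor128(a, b):
    """Model of a single 128-bit vector XOR instruction."""
    return (a ^ b) & MASK128


def xor256(a, b):
    """Model of one 256-bit 'virtual' XOR: two independent 128-bit XORs.

    a and b are (low, high) pairs of 128-bit halves (hypothetical layout).
    """
    lo = xor128(a[0], b[0])
    hi = xor128(a[1], b[1])
    return (lo, hi)


def virtual_registers(width_bits):
    """How many values of the given width fit in the Q register file."""
    return NEON_Q_REGISTERS * 128 // width_bits


if __name__ == "__main__":
    x = (0x0F0F, 0xFFFF)
    y = (0x00FF, 0x0F0F)
    print(xor256(x, y))
    print(virtual_registers(128), virtual_registers(256))
```

With 256-bit virtual vectors only half as many values fit in registers at once, which is consistent with the guess above that register pressure is what cancels out the gain for bitslice DES.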

Also, NEON has a vector bit select instruction, but that one is
documented to take 2 cycles at best, and has 1 cycle higher latency
than other NEON vector bitwise ops.  Trying to use it for bitslice DES
actually resulted in a slowdown (despite the reduced instruction
count).  The benchmarks above were made without use of this instruction.
(Its use can be easily re-enabled by changing DES_BS from 1 to 3 in
arm32le.h.)
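For those unfamiliar with bit select, here is a small Python model of the operation (a sketch only, with Python integers standing in for 128-bit vectors; the function names are made up for this example and are not taken from the John the Ripper source).  A bit select picks each result bit from one of two inputs depending on a selector mask.  Without such an instruction, the same result takes three basic bitwise ops, which is where the reduced instruction count comes from.

```python
import random

WIDTH = 128
MASK = (1 << WIDTH) - 1


def bit_select(sel, a, b):
    """Model of one bit select: bits of a where sel is 1, else bits of b."""
    return ((sel & a) | (~sel & b)) & MASK


def select_with_basic_ops(sel, a, b):
    """The same result using three plain bitwise ops: XOR, AND, XOR."""
    t = a ^ b      # bits where a and b differ
    t &= sel       # keep only the differing bits that sel chooses
    return (t ^ b) & MASK  # flip those bits in b, turning them into a's bits


if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(1000):
        sel, a, b = (rng.getrandbits(WIDTH) for _ in range(3))
        assert bit_select(sel, a, b) == select_with_basic_ops(sel, a, b)
    print("both formulations agree")
```

One instruction instead of three looks like a clear win on paper, but if that one instruction is slower to issue and has higher latency, the instruction count alone does not predict the speed, as the slowdown on Cortex-A9 shows.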

Finally, OpenMP:

solar@...aro-ubuntu-desktop:~/john-1.8.0-arm/src$ ../run/john -te=1
Will run 2 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [DES 128/128 NEON]... DONE
Many salts:     291992 c/s real, 145996 c/s virtual
Only one salt:  278368 c/s real, 139184 c/s virtual

Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [DES 128/128 NEON]... DONE
Many salts:     10645 c/s real, 5322 c/s virtual
Only one salt:  10438 c/s real, 5219 c/s virtual

Benchmarking: md5crypt [MD5 32/32 X2]... DONE
Raw:    2953 c/s real, 1476 c/s virtual

Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/32]... DONE
Raw:    166 c/s real, 83.1 c/s virtual

Benchmarking: LM [DES 128/128 NEON]... DONE
Raw:    3596K c/s real, 1798K c/s virtual

Benchmarking: AFS, Kerberos AFS [DES 24/32 4K]... DONE
Short:  40960 c/s real, 40960 c/s virtual
Long:   95232 c/s real, 95232 c/s virtual

Benchmarking: tripcode [DES 128/128 NEON]... DONE
Raw:    243326 c/s real, 124751 c/s virtual

Benchmarking: dummy [N/A]... DONE
Raw:    12987K c/s real, 12987K c/s virtual

Benchmarking: crypt, generic crypt(3) [?/32]... DONE
Many salts:     48570 c/s real, 24285 c/s virtual
Only one salt:  49152 c/s real, 24576 c/s virtual

This works pretty well (relative to the low speeds obtained before),
somehow providing more than a 2x speedup for some of the hash types
(indicating that there was some inefficiency in the non-OpenMP build).
Overall, we went from 99.5k to 292k c/s at descrypt with these two
changes (NEON and OpenMP).
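The speedups can be recomputed from the descrypt "Many salts" real c/s figures in the three benchmark runs above.  The short Python snippet below is just that arithmetic (the variable names are mine):

```python
# descrypt "Many salts" real c/s, copied from the benchmark runs above
DESCRYPT_MANY_SALTS = {
    "baseline": 99488,      # linux-arm32le, no OpenMP
    "neon": 134144,         # linux-arm32le-neon, no OpenMP
    "neon_openmp": 291992,  # linux-arm32le-neon, 2 OpenMP threads
}


def speedup(new, old):
    """Ratio of the new speed to the old speed."""
    return new / old


neon_gain = speedup(DESCRYPT_MANY_SALTS["neon"],
                    DESCRYPT_MANY_SALTS["baseline"])
openmp_gain = speedup(DESCRYPT_MANY_SALTS["neon_openmp"],
                      DESCRYPT_MANY_SALTS["neon"])
overall_gain = speedup(DESCRYPT_MANY_SALTS["neon_openmp"],
                       DESCRYPT_MANY_SALTS["baseline"])

if __name__ == "__main__":
    print("NEON over baseline:   %.2fx" % neon_gain)
    print("OpenMP over NEON:     %.2fx" % openmp_gain)
    print("Overall:              %.2fx" % overall_gain)
```

This gives roughly 1.35x from NEON, 2.18x from OpenMP on two cores, and 2.93x overall.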

I am interested in benchmarks on faster ARM processors.  Please post
them here and also add them to the wiki:

http://openwall.info/wiki/john/benchmarks

Thanks,

Alexander