Message-ID: <20130603003646.GA23062@openwall.com>
Date: Mon, 3 Jun 2013 04:36:46 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: ARM NEON

Hi,

I've just created and uploaded a patch adding two make targets for Linux
on ARM - with and without NEON:

http://download.openwall.net/pub/projects/john/1.8.0/

The filename is john-1.8.0-arm.diff.gz (and there's a detached signature
file nearby).

For newbies, instructions on applying patches:

http://openwall.info/wiki/john/how-to-extract-tarballs-and-apply-patches

The two added make targets are:

$ make | fgrep -w ARM
linux-arm32le-neon       Linux, ARM 32-bit little-endian w/NEON (best)
linux-arm32le            Linux, ARM 32-bit little-endian

I am testing on ZedBoard, which has two Cortex-A9 cores inside a Xilinx
Zynq 7020 chip, running at 666 MHz in this case, compiling with "gcc
version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)".

Performance with the linux-arm32le target, without OpenMP (to have a
baseline to compare NEON and OpenMP against):

solar@...aro-ubuntu-desktop:~/john-1.8.0-arm/src$ ../run/john -te=1
Benchmarking: descrypt, traditional crypt(3) [DES 32/32]... DONE
Many salts:	99488 c/s real, 99488 c/s virtual
Only one salt:	95936 c/s real, 95936 c/s virtual

Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [DES 32/32]... DONE
Many salts:	3328 c/s real, 3328 c/s virtual
Only one salt:	3263 c/s real, 3263 c/s virtual

Benchmarking: md5crypt [MD5 32/32 X2]... DONE
Raw:	1479 c/s real, 1479 c/s virtual

Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/32]... DONE
Raw:	84.0 c/s real, 84.0 c/s virtual

Benchmarking: LM [DES 32/32]... DONE
Raw:	1675K c/s real, 1675K c/s virtual

Benchmarking: AFS, Kerberos AFS [DES 24/32 4K]... DONE
Short:	40960 c/s real, 40960 c/s virtual
Long:	97280 c/s real, 97280 c/s virtual

Benchmarking: tripcode [DES 32/32]... DONE
Raw:	91247 c/s real, 91247 c/s virtual

Benchmarking: dummy [N/A]... DONE
Raw:	12771K c/s real, 12899K c/s virtual

Benchmarking: crypt, generic crypt(3) [?/32]... DONE
Many salts:	24096 c/s real, 24096 c/s virtual
Only one salt:	24096 c/s real, 24096 c/s virtual

This was not fast, indeed.

Now with NEON for bitslice DES (still no OpenMP, so using one of the two
cores):

Benchmarking: descrypt, traditional crypt(3) [DES 128/128 NEON]... DONE
Many salts:	134144 c/s real, 134144 c/s virtual
Only one salt:	131200 c/s real, 131200 c/s virtual

Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [DES 128/128 NEON]... DONE
Many salts:	4736 c/s real, 4736 c/s virtual
Only one salt:	4643 c/s real, 4643 c/s virtual

Benchmarking: LM [DES 128/128 NEON]... DONE
Raw:	2292K c/s real, 2292K c/s virtual

Benchmarking: tripcode [DES 128/128 NEON]... DONE
Raw:	120470 c/s real, 120470 c/s virtual

(The rest of the formats don't make use of NEON yet, so I omitted their
same-speed benchmarks from here.)

That's only 35% faster at descrypt (134144 vs. 99488 c/s), which is quite
disappointing. It turns out that Cortex-A9 is actually documented to have
high instruction latencies for almost all NEON instructions:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409h/Babfjcjb.html

Trying to overcome this by using wider virtual vectors (256-bit, with two
instructions per operation) did not help - presumably because of too few
registers for that (in bitslice DES context; the same optimization could
work fine for some other hash/cipher).

Also, NEON has a vector bit select instruction, but that one is
documented to take 2 cycles at best, and has 1 cycle higher latencies
than other NEON vector bitwise ops. Trying to use it for bitslice DES
actually resulted in a slowdown (despite the reduced instruction count).
The benchmark above was made without use of this instruction. (Its use
can be easily re-enabled by changing DES_BS from 1 to 3 in arm32le.h.)

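To illustrate why a bit select instruction reduces the instruction count, here is a minimal sketch in portable C of what such an operation computes. It is not taken from the patch: the function names are hypothetical, 32-bit words stand in for 128-bit NEON vectors, and the actual S-box expressions used by John are not shown here. On NEON the whole operation maps to a single VBSL instruction (the vbslq_u32() intrinsic from arm_neon.h), whereas without it three plain bitwise operations are needed.

```c
#include <assert.h>
#include <stdint.h>

/* Bit select the plain way: take bits of a where mask is 1 and bits of b
 * where mask is 0.  Three bitwise ops: AND, AND-NOT, OR. */
uint32_t bitsel_and_or(uint32_t mask, uint32_t a, uint32_t b)
{
	return (a & mask) | (b & ~mask);
}

/* The same result using XOR, AND, XOR (still three ops). */
uint32_t bitsel_xor(uint32_t mask, uint32_t a, uint32_t b)
{
	return b ^ ((a ^ b) & mask);
}

/* The low 8 bits of these constants enumerate all 8 possible
 * (mask, a, b) bit combinations, so agreement here implies agreement
 * at every bit position of any input. */
void bitsel_selfcheck(void)
{
	const uint32_t mask = 0xF0, a = 0xCC, b = 0xAA;

	assert(bitsel_and_or(mask, a, b) == 0xCA);
	assert(bitsel_xor(mask, a, b) == 0xCA);
}
```

Replacing three ops with one only pays off if that one instruction is not slower than the three combined, which is where the documented 2-cycle cost and higher latency on Cortex-A9 come in.
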
Finally, OpenMP:

solar@...aro-ubuntu-desktop:~/john-1.8.0-arm/src$ ../run/john -te=1
Will run 2 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [DES 128/128 NEON]... DONE
Many salts:	291992 c/s real, 145996 c/s virtual
Only one salt:	278368 c/s real, 139184 c/s virtual

Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [DES 128/128 NEON]... DONE
Many salts:	10645 c/s real, 5322 c/s virtual
Only one salt:	10438 c/s real, 5219 c/s virtual

Benchmarking: md5crypt [MD5 32/32 X2]... DONE
Raw:	2953 c/s real, 1476 c/s virtual

Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/32]... DONE
Raw:	166 c/s real, 83.1 c/s virtual

Benchmarking: LM [DES 128/128 NEON]... DONE
Raw:	3596K c/s real, 1798K c/s virtual

Benchmarking: AFS, Kerberos AFS [DES 24/32 4K]... DONE
Short:	40960 c/s real, 40960 c/s virtual
Long:	95232 c/s real, 95232 c/s virtual

Benchmarking: tripcode [DES 128/128 NEON]... DONE
Raw:	243326 c/s real, 124751 c/s virtual

Benchmarking: dummy [N/A]... DONE
Raw:	12987K c/s real, 12987K c/s virtual

Benchmarking: crypt, generic crypt(3) [?/32]... DONE
Many salts:	48570 c/s real, 24285 c/s virtual
Only one salt:	49152 c/s real, 24576 c/s virtual

This works pretty well (relative to the low speeds obtained before),
somehow providing more than a 2x speedup for some of the hash types
(indicating that there was some inefficiency in the non-OpenMP build).

Overall, we went from 99.5k to 292k c/s at descrypt with these two
changes (NEON and OpenMP).

I am interested in benchmarks on faster ARM processors. Please post them
here and also add them to the wiki:

http://openwall.info/wiki/john/benchmarks

Thanks,

Alexander