Message-ID: <20150606030121.GA21367@openwall.com>
Date: Sat, 6 Jun 2015 06:01:21 +0300
From: Solar Designer <solar@...nwall.com>
To: Alain Espinosa <alainesp@...ta.cu>
Cc: john-dev@...ts.openwall.com
Subject: Re: bitslice SHA-256

Hi Alain,

Your work on this is really cool!

On Wed, Jun 03, 2015 at 01:24:42PM -0400, Alain Espinosa wrote:
> I had some free time and tried bitslice SHA-256 in Neon. The results are as expected. Assembly output size is 19 KB, which is more than the L1 code cache of this CPU, but I do not see performance drops because of it.

Here's a guess:

IIRC, this CPU is documented to have 4 KB of L0 + 16 KB of L1
instruction cache.  (And ditto for data caches.  I don't know what
exactly they mean by having an L0 cache.)  If L1 is not inclusive of L0,
then maybe you have up to 20 KB for your code, until you start getting
L1 misses into L2?

> Benchmark configuration: Android 4.4.2, GCC 4.6, Snapdragon 801 2.45GHz, only one thread
> Performance is given in millions of keys per second
> ------------------------------------------------------------------------------------------------------------------
> 2.61 : Bitslice SHA256 implemented with hand-crafted Neon assembly (5.7% faster than normal, 35% faster than intrinsics)
> 2.47 : Normal   SHA256 implemented with hand-crafted Neon assembly
> 1.94 : Bitslice SHA256 implemented with Neon intrinsics
> 0.83 : Bitslice SHA256 implemented with 64-bit code
> 
> Attached are the Neon intrinsics and hand-crafted assembly source files. The VBSL (Neon bitselect) appears to be more costly than normal bitwise instructions. For practical speed-ups with bitslice SHA-256 we need the XOP or AVX-512 instruction sets. AVX-512 probably provides speed-ups for the SHA-1 format also. The MD5/MD4 formats use fewer rotations/shifts, so bitslicing is less useful there and probably never practical.

You're probably right.

How about targeting NVIDIA Maxwell next?  It has a ternary logic
instruction (LOP3.LUT) similar to AVX-512's, but it's already available
for purchase (and people are posting really impressive benchmarks for
password cracking on Titan X).  On the other hand, Maxwell also has a
3-input ADD instruction (32-bit only?), so perhaps (at least for
SHA-256?) you might not see much of a speedup with bitslicing relative
to a normal implementation making use of that IADD3.  So maybe target
Maxwell for bitslice SHA-512?  Of course, you'd also need to implement
SHA-256 on it, both normal and bitslice.  There's a chance that bitslice
SHA-256 will be faster on Maxwell as well, but I think a bigger chance
for bitslice SHA-512.

Here's a seemingly usable assembler for NVIDIA Maxwell:

https://github.com/NervanaSystems/maxas
https://code.google.com/p/maxas/

BTW, I updated our wiki page on GPU assemblers recently:

http://openwall.info/wiki/john/development/GPU-low-level

Alexander
