john-dev - Re: Formats using non-SIMD SHA2 implementations


Message-Id: <CD6A0397-8A77-48BB-932A-52F2EB50354B@gmail.com>
Date: Mon, 17 Aug 2015 11:07:11 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Formats using non-SIMD SHA2 implementations

On Aug 14, 2015, at 5:31 PM, magnum <john.magnum@...hmail.com> wrote:
> 
> On 2015-08-14 04:35, Lei Zhang wrote:
>> 
>> 
>> I traced the execution of 7z's encryption: the size of the hashed message can be really big, far beyond even 4 SHA2 input blocks. I think it's not possible to do the hashing with a single call to SIMDSHA256body().
>> 
>> Is there a way to repeatedly invoke SIMDSHA256body(), just like SHA256_Update()?
> 
> Sure, you just have to do the job yourself. The last (or single) block holds at most 55 bytes of input; all others can hold 64 bytes.
> 
> Say you need to do 189 bytes. You take the first 64 bytes (no 0x80, no length) and call SIMDSHA256body(). Then the next 64 bytes and call it again. Now you have 61 bytes left. You put them in the buffer, add a 0x80 and zero the rest, and call SIMDSHA256body() again. Finally, in this case, you take a block of all zeros, just add the length in bits (189*8) and make a final call.
> 
> The problem is when you have inputs of different lengths in one vector. Say one of them requires 4 limbs, another just 3, and the rest only one. This is doable (we do it in e.g. SAP F/G) but tedious, and it reduces the benefit of SIMD much like diverging threads do in OpenCL. So we usually don't do SIMD with such formats.
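
For reference, the block splitting magnum describes can be sketched in plain C as below. This only builds the padded 64-byte blocks for one input; it does not call SIMDSHA256body(), and interleaving the blocks of several inputs into the vector buffer is left out. The helper names sha256_block_count() and sha256_make_block() are made up for this illustration and are not part of the JtR tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHA256_BLOCK 64

/* Hypothetical helper: number of 64-byte blocks ("limbs") needed for a
 * message of len bytes, i.e. message + one 0x80 byte + 8-byte length,
 * rounded up to a whole block. */
static size_t sha256_block_count(size_t len)
{
    return (len + 1 + 8 + SHA256_BLOCK - 1) / SHA256_BLOCK;
}

/* Hypothetical helper: fill out[] with block number idx (0-based) of the
 * padded message. Full blocks are copied as-is; the 0x80 marker goes right
 * after the last message byte; the final block ends with the message
 * length in bits as a 64-bit big-endian integer. */
static void sha256_make_block(const uint8_t *msg, size_t len, size_t idx,
                              uint8_t out[SHA256_BLOCK])
{
    size_t off = idx * SHA256_BLOCK;
    size_t nblocks = sha256_block_count(len);
    int i;

    assert(idx < nblocks);
    memset(out, 0, SHA256_BLOCK);

    if (off < len) {
        size_t n = len - off;
        if (n > SHA256_BLOCK)
            n = SHA256_BLOCK;
        memcpy(out, msg + off, n);
    }

    /* The 0x80 byte lands in this block only if the message ends here. */
    if (len >= off && len < off + SHA256_BLOCK)
        out[len - off] = 0x80;

    /* Only the last block carries the length, in bits. */
    if (idx == nblocks - 1) {
        uint64_t bits = (uint64_t)len * 8;
        for (i = 0; i < 8; i++)
            out[SHA256_BLOCK - 1 - i] = (uint8_t)(bits >> (8 * i));
    }
}
```

For the 189-byte example, sha256_block_count(189) is 4: blocks 0 and 1 are plain data, block 2 holds the remaining 61 bytes plus 0x80, and block 3 is all zeros except for the length 189*8 = 1512 (0x05E8) in its last two bytes. Calling sha256_block_count() on each input in a vector also shows how far the lanes diverge in the mixed-length case.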

I finally got 7z to work correctly with SIMD :)

On an AVX2 machine, with OpenMP disabled:

[without SIMD]
Benchmarking: 7z, 7-Zip (512K iterations) [SHA256 AES 32/64]... DONE
Speed for cost 1 (iteration count) of 524288
Raw:	14.5 c/s real, 14.5 c/s virtual

[with SIMD]
Benchmarking: 7z, 7-Zip (512K iterations) [SHA256 AES 32/64]... DONE
Speed for cost 1 (iteration count) of 524288
Raw:	41.0 c/s real, 41.0 c/s virtual

So there's a ~3x speedup (41.0 / 14.5 is about 2.8), while the ideal speedup is 8x. As magnum mentioned, the code is really tricky to write. I'm not sure if there's room for further optimization.

And there's another minor issue: in 7z, the size of the message to be hashed is roughly plaintext_length*rounds (not accurate, just for ease of discussion), where rounds is a big number. The original plaintext_length in the scalar code is 125, which makes the entire message really big and the overhead of copying it into the vector buffer extremely high. So I set plaintext_length to a much smaller number (e.g. 28) in the SIMD code. I don't know whether this would cause problems in practical use, though.
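
To put rough numbers on this, here is a small sketch that estimates how many SHA-256 blocks one candidate needs. It rests on an assumption that goes beyond the simplified plaintext_length*rounds figure above: each round is taken to hash the password as UTF-16 (2 bytes per character) followed by an 8-byte counter, with an empty salt. The function names are invented for this example.

```c
#include <stdint.h>

/* Assumed model, not taken from the format's source: bytes hashed per
 * round = UTF-16 password (2 bytes per character) + 8-byte round counter,
 * with an empty salt. */
static uint64_t bytes_per_round(uint64_t plaintext_length)
{
    return 2 * plaintext_length + 8;
}

/* Total 64-byte SHA-256 blocks for one candidate: all rounds are
 * concatenated into a single message, then 0x80 and the 8-byte length
 * are appended and the total is rounded up to whole blocks. */
static uint64_t total_blocks(uint64_t plaintext_length, uint64_t rounds)
{
    uint64_t total = bytes_per_round(plaintext_length) * rounds;
    return (total + 1 + 8 + 63) / 64;
}
```

Under this model, with 524288 rounds, a length of 125 gives 258 bytes per round and 2113537 blocks in total, while a length of 28 gives 64 bytes per round and 524289 blocks, roughly a quarter of the data to copy into the vector buffer.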


Lei
