john-dev - Re: SHA-1 H()

Message-ID: <20150903194048.GA15176@openwall.com>
Date: Thu, 3 Sep 2015 22:40:48 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: SHA-1 H()

On Thu, Sep 03, 2015 at 09:29:37PM +0200, magnum wrote:
> On 2015-09-03 20:40, Solar Designer wrote:
> >On Thu, Sep 03, 2015 at 11:52:47AM +0200, magnum wrote:
> >>Apparently GCN has ANDN and NAND.
> >
> >I need to take a fresh look at the arch manual, but in the generated
> >code I only see scalar ANDN, and never vector ANDN (nor NAND).  They
> >defined scalar ANDN presumably because it's so useful for exec masks.
> >
> >I see you've committed this:
> >
> >+#if cpu(DEVICE_INFO) || amd_gcn(DEVICE_INFO)
> >+#define HAVE_ANDNOT 1
> >+#endif
> >
> >but I think the check for amd_gcn(DEVICE_INFO) is wrong.
> 
> We currently never run vectorized on GCN anyway, unless forced by user - 
> if format supports it at all.

That's the SIMD vs. SIMT confusion again.

When talking ISA level:

By scalar, I mean the tiny scalar unit that is normally used for control
only.  By vector, I mean the SIMD units.

Per the generated assembly code, there are no ANDN and NAND instructions
for the SIMD units at all.  Trying to Google what their likely mnemonics
would be returns no hits.  I think they just don't exist.

And it does not matter whether the kernel is vectorized or not.  It uses
those same vector instructions either way.  If vectorized, it gets
interleaved instructions, e.g. phpass-opencl:

   v_add_i32     v43, vcc, v36, v43                          // 00003D78: 4A565724
   v_add_i32     v44, vcc, v37, v44                          // 00003D7C: 4A585925
   v_add_i32     v45, vcc, v38, v45                          // 00003D80: 4A5A5B26
   v_add_i32     v46, vcc, v35, v46                          // 00003D84: 4A5C5D23
-  v_not_b32     v51, v28                                    // 00003D88: 7E666F1C
-  v_not_b32     v52, v29                                    // 00003D8C: 7E686F1D
-  v_not_b32     v53, v30                                    // 00003D90: 7E6A6F1E
-  v_not_b32     v54, v27                                    // 00003D94: 7E6C6F1B
-  v_or_b32      v51, v43, v51                               // 00003D98: 3866672B
-  v_or_b32      v52, v44, v52                               // 00003D9C: 3868692C
-  v_or_b32      v53, v45, v53                               // 00003DA0: 386A6B2D
-  v_or_b32      v54, v46, v54                               // 00003DA4: 386C6D2E
+  v_bfi_b32     v51, v28, v43, -1                           // 00003D88: D2940033 0306571C
+  v_bfi_b32     v52, v29, v44, -1                           // 00003D90: D2940034 0306591D
+  v_bfi_b32     v53, v30, v45, -1                           // 00003D98: D2940035 03065B1E
+  v_bfi_b32     v54, v27, v46, -1                           // 00003DA0: D2940036 03065D1B
   v_xor_b32     v51, v36, v51                               // 00003DA8: 3A666724
   v_xor_b32     v52, v37, v52                               // 00003DAC: 3A686925
   v_xor_b32     v53, v38, v53                               // 00003DB0: 3A6A6B26
   v_xor_b32     v54, v35, v54                               // 00003DB4: 3A6C6D23

(This also shows the effect of my MD5_I optimization.)
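The replacement in the diff above can be checked in plain C.  This is a
minimal sketch, not the actual kernel source: it assumes v_bfi_b32 computes
(S0 & S1) | (~S0 & S2), and the helper names bfi(), or_not(), or_not_bfi(),
md5_i() and self_check() are made up for illustration.  With an all-ones
third operand the bitfield insert collapses to x | ~z, which is the inner
term of MD5's I() = y ^ (x | ~z), so one instruction stands in for the
v_not_b32 + v_or_b32 pair.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed v_bfi_b32 semantics: bits of a where mask is 1, bits of b elsewhere. */
static uint32_t bfi(uint32_t mask, uint32_t a, uint32_t b)
{
	return (mask & a) | (~mask & b);
}

/* The two-instruction form from the "-" lines: v_not_b32, then v_or_b32. */
static uint32_t or_not(uint32_t x, uint32_t z)
{
	return x | ~z;
}

/* The one-instruction form from the "+" lines: bfi with -1 as third operand. */
static uint32_t or_not_bfi(uint32_t x, uint32_t z)
{
	return bfi(z, x, 0xffffffffU);
}

/* MD5 I() built on the bfi form; the final XOR is the v_xor_b32 that follows. */
static uint32_t md5_i(uint32_t x, uint32_t y, uint32_t z)
{
	return y ^ or_not_bfi(x, z);
}

/* Compare both forms on a xorshift32-generated sample of inputs. */
static void self_check(void)
{
	uint32_t s = 0x12345678U;
	int i;

	for (i = 0; i < 100000; i++) {
		uint32_t x, y, z;

		s ^= s << 13; s ^= s >> 17; s ^= s << 5; x = s;
		s ^= s << 13; s ^= s >> 17; s ^= s << 5; y = s;
		s ^= s << 13; s ^= s >> 17; s ^= s << 5; z = s;
		assert(or_not_bfi(x, z) == or_not(x, z));
		assert(md5_i(x, y, z) == (y ^ (x | ~z)));
	}
}
```

Calling self_check() runs the comparison; the identity itself follows from
(z & x) | ~z == x | ~z.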

> But perhaps it should be (amd_gcn(DEVICE_INFO) && (V_WIDTH < 2)) then?

No.
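For illustration, here is the kind of choice a HAVE_ANDNOT switch would
typically make.  This is only a sketch under assumptions: the thread does
not show how the macro is consumed, and ch_andnot(), ch_xor(), ch() and
self_check() are hypothetical names, not code from the tree.  The and-not
form needs a separate NOT when the SIMD units lack a vector ANDN, whereas
the XOR form needs no negation at all, so the macro should follow vector
ANDN availability rather than V_WIDTH.

```c
#include <assert.h>
#include <stdint.h>

#ifndef HAVE_ANDNOT
#define HAVE_ANDNOT 0	/* assumption: unset when there is no vector ANDN */
#endif

/* Select form that benefits from an and-not instruction. */
static uint32_t ch_andnot(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) ^ (~x & z);
}

/* Equivalent form without any negation: three plain logic ops. */
static uint32_t ch_xor(uint32_t x, uint32_t y, uint32_t z)
{
	return z ^ (x & (y ^ z));
}

/* Hypothetical dispatch on the macro. */
static uint32_t ch(uint32_t x, uint32_t y, uint32_t z)
{
#if HAVE_ANDNOT
	return ch_andnot(x, y, z);
#else
	return ch_xor(x, y, z);
#endif
}

/* Both forms must agree for every input; check a xorshift32 sample. */
static void self_check(void)
{
	uint32_t s = 0x9e3779b9U;
	int i;

	for (i = 0; i < 100000; i++) {
		uint32_t x, y, z;

		s ^= s << 13; s ^= s >> 17; s ^= s << 5; x = s;
		s ^= s << 13; s ^= s >> 17; s ^= s << 5; y = s;
		s ^= s << 13; s ^= s >> 17; s ^= s << 5; z = s;
		assert(ch_andnot(x, y, z) == ch_xor(x, y, z));
		assert(ch(x, y, z) == ch_xor(x, y, z));
	}
}
```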

> >And why this change? -
> >
> >-#if !gpu_nvidia(DEVICE_INFO) || nvidia_sm_5x(DEVICE_INFO)
> >+#if !gpu_nvidia(DEVICE_INFO)
> >  #define USE_BITSELECT 1
> >  #elif gpu_nvidia(DEVICE_INFO)
> >  #define OLD_NVIDIA 1
> >  #endif
> 
> I saw definite speedup for PBKDF2 and RAR iirc, and perhaps md5crypt. 
> But later I saw contradicting figures for other formats so I'm not sure 
> about this and things are in a state of flux. It might be that we should 
> revert to initially setting it (for Maxwell) in opencl_misc.h, and later 
> conditionally undefine it in certain formats.
> 
> Is bitselect() expected to always generate a LOP3.LUT? Even if it is, I 
> figure the optimizer just might be able to do better when given 
> bitselect-free code.

Yes, we should review the generated code.  It is unclear what source
code is more likely to result in optimal use of LOP3.LUT.

> Besides all this, I see I introduced a bug: Now OLD_NVIDIA is defined 
> for Maxwell and that was not the intention. I'll fix that right away.

Yes.  Thanks.

> >>BTW early tests indicate that 5916a57 made SHA-512 very slightly worse
> >>(but almost hidden by normal variations).
> >
> >On what hardware?
> 
> AVX and AVX2. My overall feeling is SHA256 got a slight boost while 
> SHA512 did not and sometimes the latter got a very slight regression. 
> But I haven't really gone systematic yet. All my tests are very 
> inconclusive as of yet, the fluctuations are larger than the 
> boosts/regressions.

That's not surprising.  I only expect much difference on XOP.

Alexander