Message-ID: <20150826082411.GA4835@openwall.com>
Date: Wed, 26 Aug 2015 11:24:11 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning

On Wed, Aug 26, 2015 at 11:06:17AM +0300, Solar Designer wrote:
> On Tue, Aug 25, 2015 at 08:36:44PM +0200, magnum wrote:
> > Worst/best 10 for Tahiti (oldoffice failing):
> 
> > Ratio:	0.83409 real, 1.02941 virtual	salted-sha1-opencl:Many salts
> 
> This one doesn't auto-tune.  Gives same speeds to me.
> 
> > Ratio:	0.85046 real, 0.85934 virtual	sxc-opencl, StarOffice .sxc:Raw
> > Ratio:	0.90867 real, 0.78394 virtual	blockchain-opencl, blockchain My Wallet:Raw
> 
> These two don't auto-tune.

I was wrong: they do.

> They use OpenMP.  With
> GOMP_CPU_AFFINITY=0-31, they give stable same speeds to me, regardless
> of recent changes (obviously).

Running these a few more times, I see that blockchain-opencl gives
unstable speeds.  There's also a weird discrepancy between c/s rates
printed during auto-tuning and the final benchmark:

Calculating best local worksize (LWS)
Testing LWS=64 GWS=524288 ... 84.860ms+
Testing LWS=128 GWS=524288 ... 84.612ms
Testing LWS=192 GWS=524160 ... 108.859ms
Testing LWS=256 GWS=524288 ... 84.863ms
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws:      2048     4410969 c/s     4410969 rounds/s 464.297us per crypt_all()!
gws:      4096     6655747 c/s     6655747 rounds/s 615.408us per crypt_all()+
gws:      8192    15338669 c/s    15338669 rounds/s 534.075us per crypt_all()!
gws:     16384     1634042 c/s     1634042 rounds/s  10.026ms per crypt_all()
gws:     32768    20423261 c/s    20423261 rounds/s   1.604ms per crypt_all()+
gws:     65536    21218723 c/s    21218723 rounds/s   3.088ms per crypt_all()+
gws:    131072    21881532 c/s    21881532 rounds/s   5.990ms per crypt_all()+
gws:    262144    22232899 c/s    22232899 rounds/s  11.790ms per crypt_all()+
gws:    524288    22359039 c/s    22359039 rounds/s  23.448ms per crypt_all()
gws:   1048576    22350637 c/s    22350637 rounds/s  46.914ms per crypt_all()
gws:   2097152    22066990 c/s    22066990 rounds/s  95.035ms per crypt_all()
gws:   4194304    22258786 c/s    22258786 rounds/s 188.433ms per crypt_all()
gws:   8388608    22233414 c/s    22233414 rounds/s 377.297ms per crypt_all()
gws:  16777216    21051147 c/s    21051147 rounds/s 796.973ms per crypt_all()
Local worksize (LWS) 64, global worksize (GWS) 262144
DONE
Raw:    6488K c/s real, 204736 c/s virtual

The auto-tuned GWS usually varies between 262144, 524288, and 1048576,
and the final speeds vary from ~5000K to ~6500K c/s even for the same GWS.
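
The GWS variation between runs looks plausible given how flat the curve
is from 262144 upward.  Below is a minimal sketch of a selection rule
that reproduces the "+" and "!" markers in the log above.  The rule
itself is an assumption inferred from this one log, not taken from the
source tree: accept a larger GWS only if it is more than 1% faster than
the best so far, and print "!" when it is more than 2x faster.  The
names pick_gws and margin are hypothetical.

```python
# (gws, c/s) pairs copied from the auto-tuning output above
RESULTS = [
    (2048, 4410969), (4096, 6655747), (8192, 15338669),
    (16384, 1634042), (32768, 20423261), (65536, 21218723),
    (131072, 21881532), (262144, 22232899), (524288, 22359039),
    (1048576, 22350637), (2097152, 22066990), (4194304, 22258786),
    (8388608, 22233414), (16777216, 21051147),
]

def pick_gws(results, margin=1.01):
    """Assumed rule: a new GWS wins only if it beats the best by > margin."""
    best_gws, best_speed = 0, 0
    marks = []
    for gws, speed in results:
        if speed > margin * best_speed:
            # "!" for a big jump (more than 2x), "+" for a smaller gain
            marks.append("!" if speed > 2 * best_speed else "+")
            best_gws, best_speed = gws, speed
        else:
            marks.append("")
    return best_gws, marks

best, marks = pick_gws(RESULTS)
print("chosen GWS:", best)
for (gws, speed), mark in zip(RESULTS, marks):
    print(f"{gws:>9} {speed:>10} {mark}")

# How flat is the plateau the tuner keeps landing on?
plateau = [s for g, s in RESULTS if g in (262144, 524288, 1048576)]
print(f"plateau spread: {(max(plateau) / min(plateau) - 1) * 100:.2f}%")
```

Under that assumed rule, the three GWS values that keep being chosen
differ by well under 1% in c/s, so ordinary timing noise would be enough
to move the pick from one to another.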

Why is there a discrepancy between ~22M c/s during auto-tuning and ~6M
c/s in the final benchmark?  Is this how a split kernel should manifest
itself here?  I think not.
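
To put a number on the gap, here is a small arithmetic check using only
figures from the log above.  It assumes that the c/s printed during
auto-tuning is simply GWS divided by the time per crypt_all(), and that
"K" in the final benchmark line means 1000; both are assumptions, though
the first one matches the printed rows to within rounding.

```python
def rate(gws, seconds):
    """Candidates per second if one crypt_all() handles gws candidates."""
    return gws / seconds

# Auto-tuning row for the GWS that was finally chosen
tuning_rate = rate(262144, 11.790e-3)   # log prints 22232899 c/s

# Final benchmark line: "Raw: 6488K c/s real" (assuming K = 1000)
final_rate = 6488e3

# Time per crypt_all() that the final rate would imply at the same GWS
implied_ms = 262144 / final_rate * 1e3

print(f"auto-tuning rate: {tuning_rate:,.0f} c/s")
print(f"final rate:       {final_rate:,.0f} c/s")
print(f"slowdown factor:  {tuning_rate / final_rate:.2f}x")
print(f"implied time per crypt_all(): {implied_ms:.1f} ms (vs 11.790 ms)")
```

So at GWS=262144 the final benchmark behaves as if each crypt_all() took
roughly 40 ms rather than the ~12 ms measured during auto-tuning.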

Looks like an issue unrelated to recent changes.

Alexander
