john-dev - Re: PHC: Lyra2 on CPU

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150602193415.GA23048@openwall.com>
Date: Tue, 2 Jun 2015 22:34:15 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Lyra2 on CPU

On Tue, Jun 02, 2015 at 07:26:52PM +0200, Agnieszka Bielec wrote:
> 2015-06-02 3:45 GMT+02:00 Solar Designer <solar@...nwall.com>:
> > On Mon, Jun 01, 2015 at 11:57:34PM +0200, Agnieszka Bielec wrote:
> >> Lyra 2 uses by default openmp threads in one hashing function.
> >
> > IIRC, their implementation uses pthreads directly, not via OpenMP.
> > Do you know otherwise?
> 
> in source code I downloaded lately is omp

OK, I'll take a look.

> >> I think that method b) is slower because we are using synchronization
> >> many times and we have barriers  for all omp threads.
> >
> > Maybe.  You can test this hypothesis by benchmarking at higher cost
> > settings and thus at lower c/s rates.  At lower c/s rates, the
> > synchronization overhead becomes relatively lower.
> 
> I choose only the biggest noticed speeds for tests:
> ; 8896/9144
>         ~0.97287839020122484689
> ; 2312/2368
>         ~0.97635135135135135135

Can you explain these numbers?  I guess these are for two code versions
and two cost settings (that differ by a factor of 4), and you show that
the overhead has stayed the same.  Correct?  Yet I'd like more specifics
on what you benchmarked here.

> > If confirmed, a way to reduce the overhead at higher c/s rates as well
> > would be via computing larger batches of hashes per parallel for loop.
> > This is what we normally call OMP_SCALE, but possibly applied at a
> > slightly lower level in this case.
> 
> lyra2 hash uses barriers in one hash computation so I'm not sure,
> maybe I don't understand your point

We have multiple hashes to compute, so even if one instance of Lyra2 has
barriers within it we can increase the amount of work done between
barriers by bringing our higher-level parallelism (multiple candidate
passwords) down to that level (to inside of Lyra2 implementation).

> "where PARAMETERS can be:
>       nCols = (number of columns, default is 256)
>       nThreads = (number of threads, default is 2)
>       nRoundsSponge = (number of Rounds performed for reduced sponge
> function [1 - 12], default is 1)
>       bSponge = (number of sponge blocks, bitrate, 8 or 10 or 12, default is 12)
>       sponge = (0, 1 or 2, default is 0) 0 means Blake2b, 1 means
> BlaMka and 2 means half-round BlaMka"

I think for now lets export/encode nCols and nThreads, but keep the rest
at their hard-coded defaults.

> > While we're at it, have you moved the memory (de)allocation out of the
> > cracking loop?  And have you done it for POMELO too, as we had
> > discussed - perhaps reusing the allocators from yescrypt?  I don't
> > recall you reporting on this, so perhaps this task slipped through the
> > cracks?  If so, can you please (re-)add it to your priorities?
> 
> not yet for both, I will do it in this week

OK.  Also please note that if these hashes are ever used in the real
world, they will most likely be used with low t_cost.  Most likely
either the lowest supported t_cost, or whatever t_cost value the
designers recommend.  This is because keeping t_cost low and maximizing
m_cost, while staying within a certain maximum running time that a
defender can afford, maximizes their security against ASIC attacks.

So benchmarking at t_cost = 8 makes relatively little sense.  While
someone may surely use such hashes, so we should support that, this
isn't considered an optimal use of Lyra2 from a typical defender's
perspective.  I suggest that for our benchmarks, we primarily use the
lowest supported t_cost - which varies by PHC finalist, but is usually
either 0 or 1.

The memory (de)allocation overhead time is relatively more significant
when t_cost is low and m_cost is high.

Higher t_cost benchmarks make sense when you desperately try to achieve
any speedup on GPU compared to CPU, though.  You could then try low
m_cost, but higher t_cost, just to see if any speedup is achievable.

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.