|
Message-ID: <20150602014518.GA13205@openwall.com> Date: Tue, 2 Jun 2015 04:45:18 +0300 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: PHC: Lyra2 on CPU Agnieszka, On Mon, Jun 01, 2015 at 11:57:34PM +0200, Agnieszka Bielec wrote: > Lyra 2 uses by default openmp threads in one hashing function. IIRC, their implementation uses pthreads directly, not via OpenMP. Do you know otherwise? > nPARALLEL option determines how many omp threads are running. and if > nPARALLEL changes, output also changes. > nPARALLEL by default equals to 2 What we need is an implementation of Lyra2 that would work for any thread-level parallelism setting _without_ necessarily creating any threads. In its threads-disabled mode, it would compute those threads' portions of work sequentially. This is much like Colin Percival's original implementation of scrypt works when called with p > 1. This will enable us to introduce our desired thread-level parallelism externally, like we normally do - at JtR format level, due to having multiple candidate passwords to test. A reason to do things in this way is that a given Lyra2 hash being cracked might not have sufficient thread-level parallelism for us to make use of all of our CPU cores and such, or their thread-level parallelism setting might not be a multiple of our logical CPU count, nor vice versa (e.g., a given hash uses up to 4 threads, but we're on a 6-core CPU without SMT). Then it will make sense to explore combinations of these settings since the locality of reference (and thus cache hit rate) might turn out to be higher when we do also use Lyra2's built-in thread-level parallelism (if a given Lyra2 hash has any). Converting to use of OpenMP may make such experiments easier, since OpenMP does not forcibly create the max number of threads that the code has parallelism for, and it includes support for nested parallel regions (and ways to control its behavior for those). BTW, it's similar for scrypt (already in jumbo), yescrypt, and Argon2 - except that these don't default to 2 threads. I think Lyra2's default of 2 threads matters only for the PHS() wrapper. We should simply have its thread-level parallelism as one of the cost parameters that we'll have encoded along with the salts, much like we do for scrypt's p now. > I implemented omp in Lyra2 in 3 different ways and I am including tests below > > a) I used nested omp > b) I am using omp only in crypt_all() function and one hash is > computed by nPARALLEL number of threads > c) I am using omp only in crypt_all() function and one hash is > computed by only one thread Oh, it sounds like you already realized that there are different approaches to try here. Great. I haven't looked at your code yet - I should. > a) > Will run 8 OpenMP threads > Benchmarking: Lyra2, Generic Lyra2 [ ]... (8xOMP) DONE > Speed for cost 1 (t) of 8, cost 2 (m) of 8 > Many salts: 4896 c/s real, 848 c/s virtual > Only one salt: 5005 c/s real, 856 c/s virtual > > b) > Will run 8 OpenMP threads > Benchmarking: Lyra2, Generic Lyra2 [ ]... (8xOMP) DONE > Speed for cost 1 (t) of 8, cost 2 (m) of 8 > Many salts: 6608 c/s real, 876 c/s virtual > Only one salt: 7120 c/s real, 935 c/s virtu > > c) > Will run 8 OpenMP threads > Benchmarking: Lyra2, Generic Lyra2 [ ]... (8xOMP) DONE > Speed for cost 1 (t) of 8, cost 2 (m) of 8 > Many salts: 7032 c/s real, 943 c/s virtual > Only one salt: 7872 c/s real, 1035 c/s virtual > > without openmp) > Benchmarking: Lyra2, Generic Lyra2 [ ]... DONE > Speed for cost 1 (t) of 8, cost 2 (m) of 8 > Many salts: 2130 c/s real, 2152 c/s virtual > Only one salt: 2160 c/s real, 2138 c/s virtual > > drawback of method b) > -running_threads%nPARALLER must be equal to 0 > drawback of method c) > -we are using memory nPARALLEL times as many as in method b) Oh, you noticed these too. Cool. > I think that method b) is slower because we are using synchronization > many times and we have barriers for all omp threads. Maybe. You can test this hypothesis by benchmarking at higher cost settings and thus at lower c/s rates. At lower c/s rates, the synchronization overhead becomes relatively lower. If confirmed, a way to reduce the overhead at higher c/s rates as well would be via computing larger batches of hashes per parallel for loop. This is what we normally call OMP_SCALE, but possibly applied at a slightly lower level in this case. > I couldn't find a way how to do it for only x threads. What do you mean by x here? > I am leaving to you to decide which method to implement to jtr. I think the order of our experiments should be as I outlined at the start of this reply. > and also I have a question. > Lyra2 has also options like nPARALLEL, which are passed during > compilation. but every of these option has default value > I don't know if should also add these options to parameters passed in > my hashing functions or should I make a function only for default of > these values For nPARALLEL, make it a runtime parameter encoded with the hashes. What other options "like this" are there? I haven't looked at your Lyra2 code yet. Is it committed? While we're at it, have you moved the memory (de)allocation out of the cracking loop? And have you done it for POMELO too, as we had discussed - perhaps reusing the allocators from yescrypt? I don't recall you reporting on this, so perhaps this task slipped through the cracks? If so, can you please (re-)add it to your priorities? Thanks, Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.