Message-ID: <20150725122204.GA1742@openwall.com>
Date: Sat, 25 Jul 2015 14:22:04 +0200
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: yescrypt on GPU

On Sat, Jul 25, 2015 at 12:08:29AM +0200, Agnieszka Bielec wrote:
> I had in my code
>
> for()
> {
>     copy to private
>     some operations on private
>     copy to global
> }
>
> I changed this code to
>
> memset(this private array,0,size of private array)//because I noticed
> when I was working on parallel that kernel can slow down after using
> uninitialized array
> for()
> {
>     some operations on private
> }
>
> and ran with --skip-self-test and speed was the same, even without
> this memset.

OK. Why would you incur any accesses to an uninitialized array, though?
yescrypt fully initializes its S-boxes with non-zero data before the
very first invocation of pwxform, which uses them.

> this is big array 8KB but I have in another place copying
> 64 B and this also decreases speed even when copying 8KB is turned off

Moving a 64-byte array from global to private decreases speed? That's
surprising if so. Is this 64-byte array frequently accessed? Which
one is it? The current sub-block buffer in pwxform? You should keep it
in private, I think.

The S-boxes should likely be in local on AMD and in private on NVIDIA,
although you do need to test with them in global as well - in fact,
ideally you'd have this tri-state choice auto-tuned at runtime, since
the optimal one will likely vary across GPUs (even similar ones).

yescrypt pwxform S-boxes are similar to bcrypt's, but are twice as
large (8 KB rather than bcrypt's 4 KB), use wider lookups (128-bit
rather than bcrypt's 32-bit), and there are only 2 of them (bcrypt has
4), which reduces parallelism, but OTOH 4 such pwxform lanes are
computed in parallel, which increases parallelism. This is with
yescrypt's current default pwxform settings.
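The tri-state choice described above (S-boxes in global, local, or private memory, picked at runtime) could be auto-tuned along the following lines. This is only a minimal host-side sketch in C, not code from JtR or yescrypt: the enum, the `bench_fn` callback type, and the `table_bench` stand-in are all hypothetical names. The callback is assumed to build and time the kernel variant for one placement and to return a negative value when that variant cannot be built or run (for example, when there is not enough local memory).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the three candidate S-box address spaces. */
enum sbox_placement {
    SBOX_GLOBAL = 0,
    SBOX_LOCAL,
    SBOX_PRIVATE,
    SBOX_NUM_PLACEMENTS
};

/*
 * Hypothetical callback: run the kernel variant that keeps the S-boxes
 * in placement p and return the elapsed time in seconds, or a negative
 * value if that variant is unusable on this device.
 */
typedef double (*bench_fn)(enum sbox_placement p, void *ctx);

/*
 * Time every placement `runs` times, keep the fastest run of each, and
 * return the placement with the lowest such time. Falls back to
 * SBOX_GLOBAL if no variant could be timed.
 */
static enum sbox_placement
autotune_sbox_placement(bench_fn bench, void *ctx, int runs)
{
    enum sbox_placement best = SBOX_GLOBAL;
    double best_time = -1.0; /* negative means "nothing measured yet" */

    assert(bench != NULL);
    assert(runs > 0);

    for (int p = 0; p < SBOX_NUM_PLACEMENTS; p++) {
        double fastest = -1.0;

        for (int r = 0; r < runs; r++) {
            double t = bench((enum sbox_placement)p, ctx);
            if (t < 0.0) {       /* variant unusable: skip it entirely */
                fastest = -1.0;
                break;
            }
            if (fastest < 0.0 || t < fastest)
                fastest = t;
        }

        if (fastest >= 0.0 && (best_time < 0.0 || fastest < best_time)) {
            best_time = fastest;
            best = (enum sbox_placement)p;
        }
    }
    return best;
}

/*
 * Stand-in for a real benchmark: ctx points to an array of
 * SBOX_NUM_PLACEMENTS pre-recorded times, indexed by placement.
 */
static double table_bench(enum sbox_placement p, void *ctx)
{
    const double *times = ctx;
    return times[p];
}
```

In real use the callback would enqueue the corresponding kernel build and measure it on the target device; taking the fastest of several runs is one simple way to reduce noise from the first (cold) launch.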
We previously found that it's more optimal to keep bcrypt's S-boxes in
local or private (depending on GPU) rather than in global, but the
differences of pwxform (with particular settings) vs. bcrypt might
change this on some GPUs. Also, yescrypt's other uses of global memory
(for its large V array) might make keeping the S-boxes in global memory
as well the better choice: those other global memory accesses might
limit the overall latency reduction possible with moving the S-boxes to
local or private memory, thereby skewing the balance towards keeping
them in global.

Alexander