Message-ID: <CAKGDhHXN__sipQuMY9TZ86dOH2utVEFtX71SVFDsh2ZofOfO+A@mail.gmail.com>
Date: Sat, 25 Jul 2015 18:17:06 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: yescrypt on GPU

2015-07-25 14:22 GMT+02:00 Solar Designer <solar@...nwall.com>:
> On Sat, Jul 25, 2015 at 12:08:29AM +0200, Agnieszka Bielec wrote:
>> I had in my code
>>
>> for()
>> {
>>     copy to private
>>     some operations on private
>>     copy to global
>> }
>>
>> I changed this code to
>>
>> memset(this private array,0,size of private array)//because I noticed
>> when I was working on parallel that a kernel can slow down after using
>> an uninitialized array
>> for()
>> {
>>     some operations on private
>> }
>>
>> and ran it with --skip-self-test, and the speed was the same, even
>> without this memset.
>
> OK.  Why would you incur any accesses to an uninitialized array, though?
>
> yescrypt fully initializes its S-boxes with non-zero data before the
> very first invocation of pwxform, which uses them.

This was only an experiment; I wanted to know the speed without copying
data from global memory.

>> This is a big array, 8 KB, but I also copy 64 B in another place, and
>> that also decreases speed even when the 8 KB copy is turned off.
>
> Moving a 64-byte array from global to private decreases speed?  That's
> surprising if so.  Is this 64-byte array frequently accessed?  Which
> one is it?  The current sub-block buffer in pwxform?  You should keep it
> in private, I think.

It is in pwxform: 64 bytes, and they are used only once.

> The S-boxes should likely be in local on AMD and in private on NVIDIA,
> although you do in fact need to test with them in global as well - in
> fact, ideally you'd have this tri-state choice auto-tuned at runtime,
> since the optimal one will likely vary across GPUs (even similar ones).
>
> yescrypt pwxform S-boxes are similar to bcrypt's, but are twice as large
> (8 KB rather than bcrypt's 4 KB), use wider lookups (128-bit rather than
> bcrypt's 32-bit), and there are only 2 of them (bcrypt has 4), which
> reduces parallelism, but OTOH 4 such pwxform lanes are computed in
> parallel, which increases parallelism.  This is with yescrypt's current
> default pwxform settings.  We previously found that it's better to
> keep bcrypt's S-boxes in local or private (depending on GPU) rather than
> in global, but the differences of pwxform (with particular settings) vs.
> bcrypt might change this on some GPUs.  Also, yescrypt's other uses of
> global memory (for its large V array) might make keeping the S-boxes in
> global memory as well the better choice, since those other accesses to
> global memory might limit the overall latency reduction possible with
> moving the S-boxes to local or private memory, thereby skewing the
> balance towards keeping them in global.

Today I was removing smaller and smaller parts of the code to track down
the slow part, and it is indeed in pwxform. When I have

    x0 += p0[0];
    x0 ^= p1[0];
    [...]
    x1 += p0[1];
    x1 ^= p1[1];

commented out, the speed is the same with copying and without (but I now
have another version of pwxform that uses vectors).
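
For readers following the thread, here is a minimal plain-C sketch of the kind of S-box lookup step the commented-out lines belong to. It is an illustration under stated assumptions, not the code from the kernel being discussed. The names `sboxes_t`, `pwx_lane_round`, `S_ENTRIES` and `S_SIMPLE` are hypothetical. The sizes are picked only to match the 8 KB total and 128-bit lookups mentioned above, and the index derivation is simplified, so it should not be taken as yescrypt's exact masking.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parameters, chosen to match the figures quoted in the thread:
 * 2 S-boxes x 256 entries x 16 bytes per entry = 8 KB in total. */
#define S_ENTRIES 256u  /* entries per S-box (assumption) */
#define S_SIMPLE  2u    /* 64-bit words per entry, i.e. one 128-bit lookup */

typedef struct {
    uint64_t s0[S_ENTRIES * S_SIMPLE];  /* first S-box, 4 KB */
    uint64_t s1[S_ENTRIES * S_SIMPLE];  /* second S-box, 4 KB */
} sboxes_t;

/* One round on one lane: pick one entry from each S-box based on the
 * low and high 32-bit halves of x[0], then for every 64-bit word do a
 * 32x32->64 multiply, add the s0 word and XOR the s1 word. */
static void pwx_lane_round(uint64_t x[S_SIMPLE], const sboxes_t *s)
{
    assert(x != 0 && s != 0);

    /* Simplified indexing (assumption): low bits of each half select
     * the entry. Both pointers are fixed before x[] is modified. */
    const uint32_t lo = (uint32_t)x[0];
    const uint32_t hi = (uint32_t)(x[0] >> 32);
    const uint64_t *p0 = &s->s0[(lo % S_ENTRIES) * S_SIMPLE];
    const uint64_t *p1 = &s->s1[(hi % S_ENTRIES) * S_SIMPLE];

    for (unsigned k = 0; k < S_SIMPLE; k++) {
        const uint64_t xl = (uint32_t)x[k];  /* low 32 bits */
        const uint64_t xh = x[k] >> 32;      /* high 32 bits */
        x[k] = xh * xl;   /* 32x32->64 multiply */
        x[k] += p0[k];    /* corresponds to "x0 += p0[0]" etc. */
        x[k] ^= p1[k];    /* corresponds to "x0 ^= p1[0]" etc. */
    }
}
```

In an OpenCL kernel, `p0` and `p1` would point into private, local, or global memory, which is the tri-state choice discussed above. The two reads per word are the only S-box accesses in the round, so it is plausible that they dominate once everything else is commented out.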