Message-ID: <CAKGDhHXN__sipQuMY9TZ86dOH2utVEFtX71SVFDsh2ZofOfO+A@mail.gmail.com>
Date: Sat, 25 Jul 2015 18:17:06 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: yescrypt on GPU
2015-07-25 14:22 GMT+02:00 Solar Designer <solar@...nwall.com>:
> On Sat, Jul 25, 2015 at 12:08:29AM +0200, Agnieszka Bielec wrote:
>> I had in my code
>>
>> for()
>> {
>> copy to private
>> some operations on private
>> copy to global
>> }
>>
>> I changed this code to
>>
>> memset(this private array,0,size of private array)//because I noticed,
>> when I was working on parallel, that a kernel can slow down after using
>> an uninitialized array
>> for()
>> {
>> some operations on private
>> }
>>
>> and ran it with --skip-self-test and the speed was the same, even without
>> this memset.
>
> OK. Why would you incur any accesses to an uninitialized array, though?
>
> yescrypt fully initializes its S-boxes with non-zero data before the
> very first invocation of pwxform, which uses them.
This was only an experiment; I wanted to know the speed without copying
data from global memory.
>
>> This is a big 8 KB array, but in another place I copy
>> 64 B, and this also decreases speed even when the 8 KB copy is turned off.
>
> Moving a 64 bytes array from global to private decreases speed? That's
> surprising if so. Is this 64 bytes array frequently accessed? Which
> one is it? The current sub-block buffer in pwxform? You should keep it
> in private, I think.
It is in pwxform; those 64 bytes are used only once.
>
> The S-boxes should likely be in local on AMD and in private on NVIDIA,
> although you do in fact need to test with them in global as well - in
> fact, ideally you'd have this tri-state choice auto-tuned at runtime,
> since the optimal one will likely vary across GPUs (even similar ones).
>
> yescrypt pwxform S-boxes are similar to bcrypt's, but are twice larger
> (8 KB rather than bcrypt's 4 KB), use wider lookups (128-bit rather than
> bcrypt's 32-bit), and there are only 2 of them (bcrypt has 4), which
> reduces parallelism, but OTOH 4 such pwxform lanes are computed in
> parallel, which increases parallelism. This is with yescrypt's current
> default pwxform settings. We previously found that it's more optimal to
> keep bcrypt's S-boxes in local or private (depending on GPU) rather than
> in global, but the differences of pwxform (with particular settings) vs.
> bcrypt might change this on some GPUs. Also, yescrypt's other uses of
> global memory (for its large V array) might make use of global memory for
> the S-boxes as well more optimal, since those other accesses to global
> memory might limit the overall latency reduction possible with moving the
> S-boxes to local or private memory, thereby skewing the balance towards
> keeping them in global.
Today I was removing smaller and smaller parts of the code to
track down the slow part,
and it is indeed in pwxform. When I have
x0 += p0[0];
x0 ^= p1[0];
[...]
x1 += p0[1];
x1 ^= p1[1];
commented out, the speed is the same with and without the copying
(but I now have another version of pwxform that uses vectors).
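
For reference, here is a minimal plain-C sketch of the lookup pattern discussed above. It is not the kernel code from this thread. The sizes follow the settings quoted above (two S-boxes totalling 8 KB, 128-bit lookups, 4 lanes); the multiply step and the mask value are my understanding of pwxform and should be checked against the yescrypt specification. The names pwx_round, pwx_round_private, S_WORDS and S_MASK are hypothetical. The second function only mimics the "copy to private" experiment with a stack copy; on a GPU the copy would go from __global to __private or __local memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PWX_SIMPLE 2      /* 64-bit words per lane (one 128-bit lookup) */
#define PWX_GATHER 4      /* lanes per 64-byte sub-block */
#define S_WORDS    512    /* assumed: 4 KB per S-box, two S-boxes = 8 KB */
#define S_MASK     0xff0u /* assumed: 16-byte-aligned byte offset within 4 KB */

/* One pwxform-style round over one 64-byte sub-block (sketch only). */
static void pwx_round(uint64_t b[PWX_GATHER][PWX_SIMPLE],
                      const uint64_t *s0, const uint64_t *s1)
{
    assert(s0 != NULL && s1 != NULL);
    for (int j = 0; j < PWX_GATHER; j++) {
        /* The two 32-bit halves of the lane's first word select the entries. */
        uint32_t lo = (uint32_t)b[j][0];
        uint32_t hi = (uint32_t)(b[j][0] >> 32);
        const uint64_t *p0 = s0 + (lo & S_MASK) / 8;
        const uint64_t *p1 = s1 + (hi & S_MASK) / 8;
        for (int k = 0; k < PWX_SIMPLE; k++) {
            uint64_t x = b[j][k];
            x = (uint64_t)(uint32_t)(x >> 32) * (uint32_t)x; /* hi * lo */
            x += p0[k]; /* corresponds to x0 += p0[0]; x1 += p0[1]; */
            x ^= p1[k]; /* corresponds to x0 ^= p1[0]; x1 ^= p1[1]; */
            b[j][k] = x;
        }
    }
}

/* Same round, but the S-boxes are first copied into a local buffer,
 * standing in for the global-to-private copy being timed. */
static void pwx_round_private(uint64_t b[PWX_GATHER][PWX_SIMPLE],
                              const uint64_t *s0, const uint64_t *s1)
{
    uint64_t priv0[S_WORDS], priv1[S_WORDS];
    memcpy(priv0, s0, sizeof priv0);
    memcpy(priv1, s1, sizeof priv1);
    pwx_round(b, priv0, priv1);
}
```

Both variants compute the same result, so any speed difference between them comes only from where the S-boxes live and from the cost of the copy itself. Commenting out the two lookup lines removes every access to the S-boxes, which would explain why the copy then makes no measurable difference.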