Message-ID: <20150819165135.GA17072@openwall.com>
Date: Wed, 19 Aug 2015 19:51:35 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

Agnieszka,

On Wed, Aug 19, 2015 at 07:40:57AM +0300, Solar Designer wrote:
> On Wed, Aug 19, 2015 at 07:10:42AM +0300, Solar Designer wrote:
> > I think the modulo division operations are causing a lot of latency:
> >
> > [solar@...er opencl]$ fgrep % argon2*.cl
> > argon2d_kernel.cl:    reference_block_offset = (phi % r);
> > argon2i_kernel.cl:    uint reference_block_index = addresses[0] % r;
> > argon2i_kernel.cl:    uint reference_block_index = addresses[i] % r;
>
> You might also achieve a speedup by moving these operations up in the
> code, to be performed as soon as their input data is available.  Maybe
> the compiler already does this for you, or maybe not.

Moreover, you may also prefetch the data pointed to by the index from
global memory sooner.  You have limited local or private memory to
prefetch into, but you probably do have it allocated for one block
anyway, and you can start fetching it sooner.  Or you can prefetch()
into the global memory cache:

https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/prefetch.html

If you continue to use ulong2, then in 2d you may prefetch after 9 out
of 16 BLAKE2's.  With ulong8, you may prefetch after 12 out of 16.
With ulong16, you can't... yet it might be optimal for other reasons
(twice the parallelism of ulong8, yet with the same total concurrent
instance count).

In 2i, you may prefetch whenever you like (which should be whatever you
determine to be the optimal time to prefetch, so that the prefetched
data is available in time for its use yet isn't thrown out of the
global memory cache before it's used), regardless of how much
parallelism in ComputeBlock you exploit.

> For 2i, there's no way those 256 modulo operations would be run
> concurrently from one work-item.  And besides, to run them concurrently
> you'd need to provide storage for the results (you already have that
> addresses[] array) and then it's no better than copying this data e.g.
> from a larger array (holding all precomputed indices) in global memory.

You could probably pack more instances of 2i per GPU by reducing the
size of this addresses[] array, fetching smaller groups of indices from
global memory at a time (than are being computed at a time now).
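
Here's a rough, untested sketch of what I mean for 2d with ulong2.  The
names (blake2_round, memory, BLOCK_QWORDS) and the assumption that
state[0].x is where the index-feeding qword lives are placeholders, not
what argon2d_kernel.cl actually uses; only the ordering matters:

/*
 * Sketch only: issue the modulo and the prefetch as soon as the value
 * feeding the next reference index is final, before the remaining
 * BLAKE2 applications finish.
 */
#define BLOCK_QWORDS 128	/* 1 KiB block = 128 x 64-bit qwords */

void blake2_round(ulong2 *state, uint i);	/* placeholder helper */

uint compute_block_sketch(__global ulong *memory, ulong2 *state, uint r)
{
	uint i;

	/* First 9 of 16 BLAKE2 applications (ulong2 width): after these,
	   the qword determining the next reference index is final. */
	for (i = 0; i < 9; i++)
		blake2_round(state, i);

	/* Start the slow modulo as soon as its input is available... */
	uint next_reference_offset = (uint)(state[0].x % r);

	/* ...and hint that 1 KiB block into the global memory cache while
	   the remaining applications are still in flight. */
	prefetch(&memory[(size_t)next_reference_offset * BLOCK_QWORDS],
	    BLOCK_QWORDS);

	/* Remaining 7 of 16 BLAKE2 applications. */
	for (i = 9; i < 16; i++)
		blake2_round(state, i);

	/* Hopefully the referenced block is already cached by the time
	   the next ComputeBlock reads it. */
	return next_reference_offset;
}

The same shape applies with ulong8, just after 12 applications instead
of 9.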
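
And for shrinking the 2i addresses[] array, something of this shape
(again untested; global_addresses, ADDR_CHUNK, SEGMENT_BLOCKS, and
process_block() are made-up names, and I'm glossing over r varying by
position):

/*
 * Sketch only: keep just a small window of precomputed indices in
 * private memory and refill it from a full precomputed array in
 * global memory.
 */
#define SEGMENT_BLOCKS 256
#define ADDR_CHUNK 16		/* window size: tune per GPU */

void process_block(uint reference_block_index);	/* placeholder helper */

void segment_sketch(__global const ulong *global_addresses, uint r)
{
	uint addr_chunk[ADDR_CHUNK];	/* much smaller than addresses[256] */
	uint base, j;

	for (base = 0; base < SEGMENT_BLOCKS; base += ADDR_CHUNK) {
		/* Refill the small window from global memory. */
		for (j = 0; j < ADDR_CHUNK; j++)
			addr_chunk[j] = (uint)(global_addresses[base + j] % r);

		/* Use the window; the next refill could itself be
		   prefetch()ed while this one is being consumed. */
		for (j = 0; j < ADDR_CHUNK; j++)
			process_block(addr_chunk[j]);
	}
}

The right window size would be whatever lets you pack more instances
per GPU without stalling on the refills.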

Alexander