Message-ID: <20150823061511.GA15246@openwall.com>
Date: Sun, 23 Aug 2015 09:15:11 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

Agnieszka,

Your current Argon2 kernels use global and private memory only.  They
don't use local memory.  While private memory might be larger and
faster on specific devices, I think that making no use of local memory
is wasteful.  By using both private and local memory at once, we should
be able to pack more concurrent Argon2 instances per GPU and thereby
hide more of the various latencies.  You should try moving some of the
arrays from private to local memory (I've appended a rough sketch at
the end of this message).

Here's a related finding:

http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#shared-memory-bandwidth

"[...] shared memory bandwidth in SMX is twice that of Fermi's SM.
This bandwidth increase is exposed to the application through a
configurable new 8-byte shared memory bank mode.  When this mode is
enabled, 64-bit (8-byte) shared memory accesses (such as loading a
double-precision floating point number from shared memory) achieve
twice the effective bandwidth of 32-bit (4-byte) accesses.
Applications that are sensitive to shared memory bandwidth can benefit
from enabling this mode as long as their kernels' accesses to shared
memory are for 8-byte entities wherever possible."

Argon2's accesses are wider than 8 bytes and are a multiple of 8 bytes
in size, so I think we need to enable this mode.  Please try to find
out how to enable it, and whether it possibly gets enabled
automatically, e.g. when the kernel uses specific data types (the
CUDA-side control is also sketched below, as a reference point for
what to look for).  I think it could benefit many more of our kernels,
so this is important to figure out and learn to use regardless of
Argon2.

Yet another relevant finding is that, per the tuning guides, Kepler
and Maxwell do not use L1 caches for global memory (they only use L2),
but there's a compiler option to change this behavior and enable use
of both L1 and L2 caches for global memory.  We could give this a try
(if we find out how to do it for OpenCL) and see whether it improves
or hurts performance - especially if we end up not using local memory
anyway (for whatever reason) and have no or few register spills (where
the L1 cache for local memory could have been helpful).  I don't
expect this to be of much help, though: most likely the default is
actually optimal for us, unless we don't use local memory at all (not
even implicitly via spills).

Alexander
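
P.S.  Here's what I mean by the local memory change, as a minimal
OpenCL sketch.  The kernel name, the array size, and the work-group
size are made up for illustration; the real kernels would partition
whatever per-instance state actually fits.

/*
 * Hypothetical sketch: move a per-instance scratch array from private
 * to local memory.  BLOCK_WORDS, WORKGROUP_SIZE, and the kernel
 * signature are made up; the point is the __local qualifier and the
 * per-work-item partitioning.  Using ulong (8-byte) elements keeps
 * every access 8 bytes wide, which is what Kepler's 8-byte bank mode
 * wants.
 */
#define BLOCK_WORDS 128     /* one Argon2 block: 1 KiB = 128 x 8 bytes */
#define WORKGROUP_SIZE 32   /* would be passed via -D in a real build;
                               32 x 1 KiB = 32 KiB, within Kepler's
                               48 KiB of shared memory per block */

__kernel void argon2_sketch(__global ulong *memory)
{
	/* Before: a private array, living in registers or spilled:
	 * ulong block[BLOCK_WORDS]; */

	/* After: one slice of a shared local buffer per work-item. */
	__local ulong scratch[WORKGROUP_SIZE * BLOCK_WORDS];
	__local ulong *block = &scratch[get_local_id(0) * BLOCK_WORDS];

	block[0] = memory[get_global_id(0)];  /* ... real mixing here ... */
	barrier(CLK_LOCAL_MEM_FENCE);  /* only if work-items share data */

	memory[get_global_id(0)] = block[0];
}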
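
On the bank mode: I don't see an OpenCL-level switch for it, which is
why this needs investigating.  For reference, here's what the control
looks like on the CUDA runtime side (this is CUDA host code, not
something our OpenCL host code can call directly; it just shows which
hardware knob we're after):

#include <cuda_runtime.h>

/* Kepler shared memory bank width, via the CUDA runtime API.  The
 * question for us is whether NVIDIA's OpenCL driver exposes an
 * equivalent, or flips it automatically when a kernel's local memory
 * accesses are 8-byte. */
int main(void)
{
	/* Device-wide: make shared memory banks 8 bytes wide. */
	cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

	/* It can also be set per kernel function:
	 * cudaFuncSetSharedMemConfig(my_kernel,
	 *                            cudaSharedMemBankSizeEightByte);
	 */
	return 0;
}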
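
As for L1 caching of global memory, in CUDA this is nvcc's
"-Xptxas -dlcm=ca" flag from the tuning guides.  I don't know whether
NVIDIA's OpenCL compiler accepts a ptxas flag through clBuildProgram()
at all, so treat the commented-out options string below as a guess to
experiment with, not as something I've verified:

#include <CL/cl.h>

/* Experiment: pass compiler options to NVIDIA's OpenCL compiler.
 * "-cl-nv-verbose" is a documented NVIDIA extension option; whether
 * any spelling of the L1 flag gets through is exactly what needs
 * testing. */
static cl_int build_program(cl_program program, cl_device_id device)
{
	const char *options = "-cl-nv-verbose";
	/* Candidate to experiment with (unverified for OpenCL):
	 * const char *options = "-Xptxas -dlcm=ca";
	 */
	return clBuildProgram(program, 1, &device, options, NULL, NULL);
}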