Message-ID: <20150902163508.GA25383@openwall.com>
Date: Wed, 2 Sep 2015 19:35:08 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: interleaving on GPUs

On Sun, Aug 23, 2015 at 08:08:06AM +0300, Solar Designer wrote:
> I just read this about NVIDIA's Kepler (such as the old GTX TITAN that
> we have in super):
>
> http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#device-utilization-and-occupancy
>
> "Also note that Kepler GPUs can utilize ILP in place of
> thread/warp-level parallelism (TLP) more readily than Fermi GPUs can.
> Furthermore, some degree of ILP in conjunction with TLP is required by
> Kepler GPUs in order to approach peak single-precision performance,
> since SMX's warp scheduler issues one or two independent instructions
> from each of four warps per clock. ILP can be increased by means of, for
> example, processing several data items concurrently per thread or
> unrolling loops in the device code, though note that either of these
> approaches may also increase register pressure."
>
> Note that they explicitly mention "processing several data items
> concurrently per thread". So it appears that when targeting Kepler, up
> to 2x interleaving at OpenCL kernel source level could make sense.

[...]

> On Maxwell (such as the newer GTX Titan X), interleaving shouldn't be
> needed anymore:
>
> http://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html#smm-occupancy
>
> "The power-of-two number of CUDA Cores per partition simplifies
> scheduling, as each of SMM's warp schedulers issue to a dedicated set of
> CUDA Cores equal to the warp width. Each warp scheduler still has the
> flexibility to dual-issue (such as issuing a math operation to a CUDA
> Core in the same cycle as a memory operation to a load/store unit), but
> single-issue is now sufficient to fully utilize all CUDA Cores."
>
> I interpret the above as meaning that 2x interleaving can still benefit
> Maxwell at low occupancy, but we can simply increase the occupancy and
> achieve the same effect instead.

Also relevant, about GCN:

http://bartwronski.com/2014/03/27/gcn-two-ways-of-latency-hiding-and-wave-occupancy/

"We have two ways of latency hiding:

* By issuing multiple ALU operations on different registers before
waiting for load of specific value into given register. Waiting for
results of a texture fetch obviously increases the register count, as
increases the lifetime of a register.

* By issuing multiple wavefronts on a CU - while one wave is stalled on
s_waitcnt, other waves can do both vector and scalar ALU. For this one
we need multiple waves active on a CU."

This blog post doesn't exactly recommend interleaving (when it's
possible memory-wise, we could as well issue more wavefronts instead),
but it does recommend having some instruction-level parallelism.

Alexander
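
To make "processing several data items concurrently per thread" concrete,
here is a minimal sketch of 2x interleaving. It is plain C rather than
OpenCL so that it compiles anywhere, and it is only an illustration: the
round function mix_round(), its constants, and the names mix_one() and
mix_two_interleaved() are hypothetical and are not taken from John the
Ripper or from the tuning guides quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical round function, for illustration only.  Every step
 * depends on the previous one, so a single item forms one long
 * dependency chain with little ILP. */
static inline uint32_t mix_round(uint32_t x, uint32_t k)
{
    x += k;
    x ^= x >> 15;
    x *= 0x9E3779B1u;
    return (x << 13) | (x >> 19); /* rotate left by 13 */
}

/* Non-interleaved: one data item per "thread". */
static uint32_t mix_one(uint32_t x, unsigned int rounds)
{
    for (unsigned int i = 0; i < rounds; i++)
        x = mix_round(x, i);
    return x;
}

/* 2x interleaved: two independent items advance in the same loop.
 * The x and y chains do not depend on each other, so the hardware may
 * issue their instructions back to back.  The cost is roughly twice
 * the live registers. */
static void mix_two_interleaved(uint32_t *a, uint32_t *b,
                                unsigned int rounds)
{
    assert(a != NULL && b != NULL);
    uint32_t x = *a, y = *b;
    for (unsigned int i = 0; i < rounds; i++) {
        x = mix_round(x, i);
        y = mix_round(y, i);
    }
    *a = x;
    *b = y;
}
```

Calling mix_two_interleaved(&a, &b, n) should leave a and b with the same
values that mix_one() gives for each input separately; only the
instruction schedule differs. In an OpenCL kernel the same shape would
have each work-item handle two candidates, and whether that beats simply
raising occupancy depends on register pressure on the target GPU.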