Message-ID: <CAKtfLcus0P=yiU+8qu+yg2yR5zXO4kOhQFfno=aE9UQoiSMYCg@mail.gmail.com>
Date: Mon, 16 Sep 2013 10:53:15 -0700
From: Alain Espinosa <alainesp@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: GTX TITAN (was: new dev box wishes)

On 9/15/13, Milen Rangelov <gat3way@...il.com> wrote:
> I have almost never seen any benefits from vectorizing on 7970 (well except
> one or two cases).
>
> There are occasions where the same code, vectorized, gives better
> performance on 7970, but then if you can transform the scalar code so that
> more work is being done in the kernel (e.g by doing 4 consecutive
> operations in a loop as compared to using 4x vectors) you'd eventually come
> up with a faster, scalar solution given the same amount of work (global
> work size / vector size). In case global work size is the same, the
> vectorized solution may seem faster just because overall you have more
> kernel launches per second with the scalar code as compared to vector code
> and kernel launch latency and host-device transfers then come into play. I
> think AMD APP profiler can be very helpful to figure out what's happening
> in such cases.

That is not the case in my test (this applies to almost everything you
wrote). For clarity, here is the general shape of the kernel before:

__kernel void ntlm_crack(...)
{
    // Initialization code
    ...
    for(uint i=0;i<100;i++)
    {
        uint a,b,c,d;
        // NTLM code
        ...
    }
}

and after:

__kernel void ntlm_crack(...)
{
    // Initialization code
    ...
    for(uint i=0;i<100;i+=3)
    {
        uint3 a,b,c,d;
        // NTLM code
        ...
    }
}

This gives a 15-20% performance improvement on an HD 7970 (and on a GT 630).
That is in my particular case, but I think it would hold in other cases too.
Note that all parameters are the same for the two solutions, and yes, the
second solution does some wasted calculation at the end of the loop: with
i+=3 the last iteration starts at i=99 and also covers 100 and 101, which are
past the limit of 100.

Regards,
alain
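To make the "wasted calculation" remark concrete, here is a rough host-side sketch in plain C (not OpenCL, and not the actual kernel). The function mix() is a made-up stand-in for the elided NTLM code, and run_scalar() and run_wide() are hypothetical names used only for this illustration. It assumes the 3-wide loop simply computes every lane and ignores results past the limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-candidate NTLM work; NOT the real hash. */
static uint32_t mix(uint32_t x)
{
    return x * 2654435761u + 0x9e3779b9u;
}

/* Scalar form: one candidate per loop iteration (the "before" kernel). */
uint32_t run_scalar(uint32_t total)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < total; i++)
        acc ^= mix(i);
    return acc;
}

/* Wide form: `width` candidates per loop iteration, mimicking uint3 when
 * width is 3 (the "after" kernel). Lanes at or beyond `total` are still
 * computed, but their results are discarded and counted as wasted. */
uint32_t run_wide(uint32_t total, uint32_t width, uint32_t *wasted)
{
    assert(width > 0);
    uint32_t acc = 0;
    uint32_t extra = 0;
    for (uint32_t i = 0; i < total; i += width) {
        for (uint32_t lane = 0; lane < width; lane++) {
            uint32_t v = mix(i + lane);   /* computed for every lane */
            if (i + lane < total)
                acc ^= v;                 /* useful lane */
            else
                extra++;                  /* lane past the limit */
        }
    }
    if (wasted)
        *wasted = extra;
    return acc;
}
```

With total = 100 and width = 3 the wide loop runs 34 iterations and computes 102 lanes, so 2 of them are discarded, while the useful result matches the scalar loop. That overhead is small next to the speedup reported above.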