Message-ID: <9dbcf725f7a4c4cd49aaf19ff0f62d8f@smtp.hushmail.com>
Date: Sun, 27 Jan 2013 00:17:34 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Proposed optimizations to pwsafe

I've seen that too. As I recall, a year ago you could hand-write a rotate that was faster than rotate() on nvidia, but today I believe plain rotate() is fine on both AMD and nvidia. I'm not so sure about 64-bit rotates, though.
I can understand how they (like any compiler) could fail to optimize every odd spelled-out rotate into the most efficient code, but failing with the rotate() function itself? That is just pure CRAP.
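
For reference, the ADD and OR forms of the spelled-out rotate quoted further down always give the same result: the left-shifted and right-shifted halves never have a set bit in the same position, so adding them can never carry. Here is a minimal plain-C sketch of that equivalence (not OpenCL and not taken from any actual kernel; the names rotl_or, rotl_add and check_rotates_agree are made up for illustration, and the shift count is assumed to be in the range 1..31):

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left, combining the two halves with OR. Assumes 0 < n < 32. */
static uint32_t rotl_or(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Same rotate, combining the halves with ADD. The halves occupy
 * disjoint bit positions, so the addition never produces a carry. */
static uint32_t rotl_add(uint32_t x, unsigned n)
{
    return (x << n) + (x >> (32 - n));
}

/* Hypothetical self-check: both forms agree on a few sample values
 * for every shift count from 1 to 31. */
static void check_rotates_agree(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 0x80000000u, 0x80000001u, 0xdeadbeefu, 0xffffffffu
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        for (unsigned n = 1; n < 32; n++)
            assert(rotl_or(samples[i], n) == rotl_add(samples[i], n));
}
```

Calling check_rotates_agree() from a test program should pass silently. This only shows that the two forms compute the same value; it says nothing about which one a given driver turns into faster code.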

magnum


On 27 Jan, 2013, at 0:07 , Milen Rangelov <gat3way@...il.com> wrote:

> Yes, you are right about that. I run into those problems when porting OpenCL kernels to nvidia because I am lazy and don't bother to change them. Some day I need to go through all the nv kernels and fix that. I investigated it a while ago by looking at some of the generated PTX. It contained some movs for no real reason, and that was the cause of the worse performance; otherwise the bitwise operations were the same. Things might have changed since then, and I am using an old nvidia driver. BTW, another (this time weird) thing is that on nvidia, doing a rotate like this:
> 
> #define rotate(a,b) ((a<<b)+(a>>(32-b)))
> 
> is faster than doing it the usual way:
> 
> #define rotate(a,b) ((a<<b)|(a>>(32-b)))
> 
> and the generated PTX is the same except for the ADD/OR instruction. My theory is that using addition somehow lets the hardware use its integer fused multiply-add instruction, but at least at the PTX level this is not visible.
> 
> 
> On Sun, Jan 27, 2013 at 12:58 AM, magnum <john.magnum@...hmail.com> wrote:
> On 26 Jan, 2013, at 23:51 , Milen Rangelov <gat3way@...il.com> wrote:
> 
> > Hm, I guess the compiler got smarter and was able to generate bfi_int without an explicit bitselect(). This was not the case some time ago, and that's good news. I need to do some experiments and check the generated ISA.
> 
> I think we should do it anyway (for now, at least).
> 
> What is much more annoying is that bitselects on nvidia hurt performance (last time I checked). That is just weird. I mean, OK if they do not have a hardware instruction, but they should definitely not end up slower than the spelled-out syntax. Or is it harder than I imagine?
> 
> magnum
> 
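
To make the bitselect() discussion above concrete: OpenCL's bitselect(a, b, c) takes each result bit from b where the corresponding bit of c is set, and from a otherwise. Here is a minimal plain-C sketch of that definition next to two "spelled out" forms of a SHA-style Ch function. The names bitselect32, ch_spelled, ch_xor and ch_select are made up for illustration and are not from any actual kernel; Ch is chosen only as a typical candidate for this rewrite.

```c
#include <stdint.h>

/* Plain-C model of OpenCL bitselect(a, b, c):
 * result bit = bit of b where c has a 1, bit of a where c has a 0. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* SHA-style Ch(x, y, z), spelled out the textbook way. */
static uint32_t ch_spelled(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (~x & z);
}

/* The same function in the common XOR form (one operation fewer). */
static uint32_t ch_xor(uint32_t x, uint32_t y, uint32_t z)
{
    return z ^ (x & (y ^ z));
}

/* The same function expressed as a bit select: x picks between y and z. */
static uint32_t ch_select(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect32(z, y, x);
}
```

All three Ch variants return the same value for any inputs, so which one is fastest depends entirely on what the compiler emits for it on a given device.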

