Message-ID: <CABh=JRH9ozL1fvTHbm7fQnwfUhTgJjVgqkqMNc31qJeOVTGf6A@mail.gmail.com>
Date: Sun, 27 Jan 2013 01:22:21 +0200
From: Milen Rangelov <gat3way@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Proposed optimizations to pwsafe

No, 64-bit rotate() is just bad (at least on AMD, never checked for NV).
Basically OpenCL generates horrible code when using 64-bit operations.
Really horrible. For VLIW it's kinda better, for GCN it can be a disaster :(
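
For illustration only, here is one way a 64-bit rotate can be written so that the actual work uses nothing but 32-bit shifts and ORs on the two halves. This is a plain C sketch, not code from any kernel discussed in this thread; the names rotl64_ref, rotl64_halves and check_rotl64 are made up for the example. The 64-bit unpack/pack at the top and bottom is only there so the result can be compared with the reference; a kernel would presumably keep the value as two 32-bit words throughout. Whether this beats the built-in rotate() on a given GPU and driver is something to measure, not assume.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: plain 64-bit rotate left, valid for n in 1..63. */
static uint64_t rotl64_ref(uint64_t x, unsigned n)
{
    return (x << n) | (x >> (64 - n));
}

/* Same rotate, but the rotation itself only touches 32-bit halves. */
static uint64_t rotl64_halves(uint64_t x, unsigned n)
{
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t lo = (uint32_t)x;
    uint32_t t;

    if (n >= 32) {          /* rotating by 32 just swaps the halves */
        t = hi; hi = lo; lo = t;
        n -= 32;
    }
    if (n != 0) {           /* skip n == 0 to avoid a 32-bit shift by 32 */
        t  = (hi << n) | (lo >> (32 - n));
        lo = (lo << n) | (hi >> (32 - n));
        hi = t;
    }
    return ((uint64_t)hi << 32) | lo;
}

/* Compare both versions for every rotate count from 1 to 63. */
static void check_rotl64(void)
{
    const uint64_t x = 0x0123456789abcdefULL;
    for (unsigned n = 1; n < 64; n++)
        assert(rotl64_halves(x, n) == rotl64_ref(x, n));
}
```

Calling check_rotl64() from a test program exercises every rotate count against the reference.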

On Sun, Jan 27, 2013 at 1:17 AM, magnum <john.magnum@...hmail.com> wrote:

> I've seen that too. I think I recall you could produce a faster rotate()
> for nvidia yourself a year ago but today I believe just using rotate() will
> be fine for both AMD and nvidia. I'm not quite sure about 64-bit rotates
> though.
> I can understand how they (like any compiler) could fail to optimize just
> any odd spelled-out version of rotate into the most efficient code - but
> failing at it with the rotate() function itself? That is just pure CRAP.
>
> magnum
>
>
> On 27 Jan, 2013, at 0:07 , Milen Rangelov <gat3way@...il.com> wrote:
>
> Yes, you are right about that. I have those problems when porting OpenCL
> kernels to nvidia because I am lazy and don't bother to change those. Some
> day I need to go through all the nv kernels and change those. I
> investigated that a while ago by looking at some generated PTX. They were
> doing some movs for no real reason, so that was the cause of the worse
> performance; otherwise the bitwise operations were the same. Things might
> have changed since then, and I am using an old nvidia driver. BTW another
> (this time weird) thing is that on nvidia, doing a rotate like this:
>
> #define rotate(a,b) (((a)<<(b))+((a)>>(32-(b))))
>
> is faster than doing it the usual way:
>
> #define rotate(a,b) (((a)<<(b))|((a)>>(32-(b))))
>
> and generated PTX is the same except for the ADD/OR thing. My theory is
> that using addition somehow utilizes the hardware instruction (the integer
> fused multiply-add one) but at least at PTX level, this is not visible.
>
>
> On Sun, Jan 27, 2013 at 12:58 AM, magnum <john.magnum@...hmail.com> wrote:
>
>> On 26 Jan, 2013, at 23:51 , Milen Rangelov <gat3way@...il.com> wrote:
>>
>> > Hm, I guess the compiler got smarter and was able to generate the
>> bfi_int when not explicitly doing bitselect(). This was not the case some
>> time ago and that's good news. Need to do some experiments and check the ISA
>> generated.
>>
>> I think we should do it anyway (for now, at least).
>>
>> What is much more annoying is that bitselects on nvidia hurt performance
>> (last time I checked). That is just weird. I mean, OK if they do not have a
>> hardware instruction but they should definitely not end up slower than
>> using the spelled out syntax. Or is it harder than I imagine?
>>
>> magnum
>>
>
>
>
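
A note on the two rotate macros quoted above: for a 32-bit value and a rotate count between 1 and 31, the ADD and OR forms always give the same result, because a<<b has its low b bits clear and a>>(32-b) can only have those low b bits set, so the addition never produces a carry. The plain C sketch below checks that. The function names are invented for the example, and it says nothing about which form is faster on any particular hardware.

```c
#include <assert.h>
#include <stdint.h>

/* The "usual" rotate left: combine the two shifted parts with OR. */
static uint32_t rotl32_or(uint32_t a, unsigned b)
{
    return (a << b) | (a >> (32 - b));
}

/* The variant from the thread: combine the two parts with ADD. */
static uint32_t rotl32_add(uint32_t a, unsigned b)
{
    return (a << b) + (a >> (32 - b));
}

/* The two parts never share a set bit, so OR and ADD must agree. */
static void check_rotl32(void)
{
    const uint32_t samples[] = { 0u, 1u, 0x80000000u, 0xdeadbeefu, 0xffffffffu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        for (unsigned b = 1; b < 32; b++)
            assert(rotl32_or(samples[i], b) == rotl32_add(samples[i], b));
}
```

Both forms are only defined here for counts 1 to 31; a count of 0 or 32 would shift a 32-bit value by 32, which C leaves undefined.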

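On the bitselect() point from the oldest quoted message: OpenCL defines bitselect(a, b, c) as taking each result bit from a where the matching bit of c is 0 and from b where it is 1. The "spelled out syntax" is then the usual AND/OR form, and there is also a three-operation XOR form. The plain C sketch below (function names invented for the example) checks that both agree; which one a given compiler maps to bfi_int or to the best nvidia code is exactly the open question in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Spelled-out bitselect: bits of a where c is 0, bits of b where c is 1. */
static uint32_t bitselect_spelled(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* Equivalent XOR form: flip a's bits to b's bits only where c is set. */
static uint32_t bitselect_xor(uint32_t a, uint32_t b, uint32_t c)
{
    return a ^ ((a ^ b) & c);
}

/* Check that the two forms agree on all triples of a few sample values. */
static void check_bitselect(void)
{
    const uint32_t v[] = { 0u, 0xffffffffu, 0xaaaaaaaau, 0x55555555u, 0xdeadbeefu };
    const unsigned n = sizeof v / sizeof v[0];
    for (unsigned i = 0; i < n; i++)
        for (unsigned j = 0; j < n; j++)
            for (unsigned k = 0; k < n; k++)
                assert(bitselect_spelled(v[i], v[j], v[k]) ==
                       bitselect_xor(v[i], v[j], v[k]));
}
```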