Message-ID: <CABh=JRF3L7V+2VNcRu-Ekgg01-HtRqYQH581UebJue0fDW3GKQ@mail.gmail.com>
Date: Thu, 22 Mar 2012 11:10:15 +0200
From: Milen Rangelov <gat3way@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: CUDA & OpenCL status
Hello Alexander,
> Milen - any additional info on that "fused SHL+ADD instruction" on
> Nvidia and its use for MD5 and the like? I don't immediately see how
> such an instruction would be usable there because we actually need
> rotate+ADD.
>
The problem is that NVIDIA has no rotate instruction. AMD has BITALIGN_INT,
which does it. On NVIDIA, rotate() actually compiles to (a<<s)|(a>>(32-s)).
With Fermi you get the SHL+ADD thing (which I guess is really just a 32-bit
MAD, since a<<s is a*2^s), but rotate() still does not make use of it.
What I do is something like:
#define ROTATE(a, s) (((a) << (s)) + ((a) >> (32 - (s))))
This makes the compiler emit a shift followed by a fused shift-and-add. Of
course that's not as good as having a single rotate instruction, but two
operations are still better than three. This works only on sm_2x (Fermi)
architectures, so it makes no difference on, say, a 9800 GT.
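Here is a minimal host-side C sketch of why the `+` form can replace `|`. For 0 < s < 32 the two shifted halves cover disjoint bits, so the addition never carries and produces the same result as the OR. ROTL_OR, ROTL_ADD, md5_step and rotl_forms_agree are hypothetical names used only for illustration, and this is plain C rather than OpenCL. The sketch assumes s is never 0, which holds for every MD5 shift amount. In C, `a >> 32` is undefined. In OpenCL the shift count is reduced modulo 32, so with s = 0 the ADD form would return a + a instead of a.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left written with OR, the way rotate() expands on hardware
   without a native rotate instruction. Valid for 0 < s < 32 only. */
#define ROTL_OR(a, s)  ((uint32_t)(((a) << (s)) | ((a) >> (32 - (s)))))

/* Same rotate written with ADD. The two halves share no set bits, so the
   sum equals the OR; the ADD gives a Fermi-class compiler the chance to
   fuse one shift with the add. Also valid for 0 < s < 32 only. */
#define ROTL_ADD(a, s) ((uint32_t)(((a) << (s)) + ((a) >> (32 - (s)))))

/* One MD5 step using the ADD form of the rotate:
   result = b + ((a + f + x + t) <<< s), where f is the round-function
   output, x the message word and t the additive constant. */
static uint32_t md5_step(uint32_t a, uint32_t b, uint32_t f,
                         uint32_t x, uint32_t t, unsigned s)
{
    assert(s > 0 && s < 32);   /* ROTL_ADD is wrong for s == 0 */
    uint32_t v = a + f + x + t;
    return b + ROTL_ADD(v, s);
}

/* Returns 1 if the OR and ADD forms agree for every valid shift of a. */
static int rotl_forms_agree(uint32_t a)
{
    for (unsigned s = 1; s < 32; s++) {
        if (ROTL_OR(a, s) != ROTL_ADD(a, s))
            return 0;
    }
    return 1;
}
```

For example, ROTL_ADD(0x12345678u, 8) yields 0x34567812u, which matches ROTL_OR.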
Just a side note: you don't need to use amd_bitalign from the
cl_amd_media_ops extension explicitly, because rotate() has mapped to
BITALIGN_INT since SDK 2.3 or even earlier. As of recently (Catalyst 12.2),
bitselect() is mapped to BFI_INT too, so binary patching is no longer needed
to get BFI_INT working.
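For reference, OpenCL's bitselect(a, b, c) takes each result bit from b where the matching bit of c is 1, and from a otherwise. MD5's F and G round functions (RFC 1321) have exactly this shape, so once bitselect() maps to BFI_INT, each of them can in principle be a single instruction. The host-side C below only models bitselect with ordinary bitwise operations. The names bitselect_u32, md5_f_sel, md5_g_sel and check_md5_sel are hypothetical helpers, not OpenCL built-ins. Nothing in the sketch checks what the AMD compiler actually emits.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of OpenCL bitselect(a, b, c) on 32-bit words:
   each result bit comes from b where c has a 1, otherwise from a. */
static uint32_t bitselect_u32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* MD5 round functions F and G as defined in RFC 1321. */
static uint32_t md5_f(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (~x & z);
}

static uint32_t md5_g(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & z) | (y & ~z);
}

/* The same functions expressed through bitselect:
   F picks y where x is 1 (else z); G picks x where z is 1 (else y). */
static uint32_t md5_f_sel(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect_u32(z, y, x);
}

static uint32_t md5_g_sel(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect_u32(y, x, z);
}

/* Asserts that the bitselect forms match the textbook definitions. */
static void check_md5_sel(uint32_t x, uint32_t y, uint32_t z)
{
    assert(md5_f_sel(x, y, z) == md5_f(x, y, z));
    assert(md5_g_sel(x, y, z) == md5_g(x, y, z));
}
```

In a kernel you would call the built-in directly, e.g. bitselect(z, y, x) for F. Note that the selector is the last argument.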
Regards.