|
Message-ID: <20110604231446.GA6815@openwall.com> Date: Sun, 5 Jun 2011 03:14:46 +0400 From: Solar Designer <solar@...nwall.com> To: crypt-dev@...ts.openwall.com Subject: Re: alternative approach David, Yuri - I made some rather naive comments on DES on GPUs last month. I've since discussed the matter with some people more familiar with GPUs, so here's an update. First, the best benchmark result I am currently aware of is for MTY, a tripcode finder, doing 250 Mhashes/sec on ATI 5970. This translates to 6.25 billion of DES block encryptions per second. So it is still 3+ times slower than the 22 billion on Virtex-6, but it's by far not as bad as the numbers for GPUs that we had before. Our new DES S-box expressions will improve this number (for the GPU) some further. http://en.wikipedia.org/wiki/Tripcode http://sourceforge.jp/projects/naniya/ http://harry.lu/mty/ On Thu, May 12, 2011 at 01:51:54PM +0400, Solar Designer wrote: > Disclaimer: I am not into GPU programming (yet). This is why I use > words such as "apparently" below. > > Yes, DES is somewhat unfriendly to GPUs so far, but this is somewhat > likely to change. Bitslice implementations of DES with the S-box > expressions published so far need up to approx. 20 registers, which is > apparently more than what typical GPUs have available per processing > element. Also, apparently accessing a temporary variable spilled to > memory is relatively slow (compared to accessing the L1 cache in a CPU). I've since read up on this a bit, mostly with Nvidia focus (even though AMD/ATI cards are apparently more suitable for the task). The number of available registers per stream processor is actually pretty high (thousands), but instruction latencies are so large that you have to run a large number of simultaneous threads in order to hide those latencies. This is why the number of registers actually available per thread turns out to be pretty low (tens). Memory accesses are indeed slow, although that differs by memory type a lot. On Nvidia, there's so-called shared memory, which is nearly as fast as registers, but there's very little of it (its size per stream processor may even be less than the combined size of the stream processor's registers). Other types of memory have huge latencies (hundreds of cycles). With an architecture like this, not only the temporary values (such as those computed inside DES S-box expressions), but also DES blocks and keys need to be stored in registers (or in Nvidia's shared memory, but there's not more of it). This gives us something like 64+56+20 = 140 registers needed per bitslice DES instance. But there are not this many on Nvidia cards for the otherwise optimal number of threads to run. On high-end ATI cards, according to the author of MTY, slowdown starts to occur when a thread needs more than 49 registers, but MTY exceeds this number anyway (no better choice). Yet, as stated above, its overall performance is nevertheless very good (comparable to 8 billion hashes/second reported for MD5, which is much more GPU-friendly). > Another change that may or may not happen is GPUs getting more registers > or/and faster access to temporary variables. This sounds quite > realistic to me, but I am not an expert in this area. Based on what I > read (about the implementation difficulty for DES on GPUs), I wouldn't > be surprised if a mere 2x increase in register count results in a 10x > speedup for DES on GPUs. I think I was wrong about the above. If not having enough registers results in a 10x performance hit and having 2x more registers would avoid that, then perhaps the performance hit could be reduced from 10x to 2x by simply running twice fewer threads. But I might be wrong about that as well... Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.