|
Message-ID: <20120813051935.GA6139@openwall.com> Date: Mon, 13 Aug 2012 09:19:35 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: bitslice DES on GPU On Mon, Aug 13, 2012 at 08:36:46AM +0530, Sayantan Datta wrote: > What should be the value of DES_BS_EXPAND for GPU implentation ? First, note that for LM hashes DES_BS_EXPAND is effectively always 0, regardless of what it's set to (for other hash types). DES_bs_crypt_LM() does not even include #if's for the other code version. This is because there's just one DES computation per LM hash, so each expanded key bit is accessed exactly once, and thus there's no gain from preloading those bits and writing them into another array sequentially (which is what DES_BS_EXPAND does for other hash types). Now, for DES-based crypt(3) hashes this is trickier. Due to multiple iterations of DES per hash computation, there are potential savings from going through the pointer indirection just once and then loading those expanded key bits sequentially on each iteration. However, when the word size as used in the bitslice DES implementation (e.g., 128-bit on SSE2) is larger than pointer size (e.g., 32- or 64-bit), this increases the required fast memory size that key material is being read from in the loop. With DES_BS_EXPAND = 0, we read 768 pointers and 56 data words. With DES_BS_EXPAND = 1, we read 768 data words. Luckily, the latter is still small enough for current CPUs (although it is getting close to L1 data cache size on some). On GPU, the available local memory per work-item is a scarce resource. There might not be enough of it for the keys either way, so you might end up choosing to use global memory for them - and then reading the 768 expanded key bits sequentially might work better (more data to read, but those are sequential reads) - or it might not. The partially non-sequential reads of 56 adjacent data words would likely be pretty fast as well, but then the question is where you store the 768-element array of pointers or indices - and whether it is pointers (will have to be separate for each work-item) or indices (may be shared and constant). Note that 768 times 4 bytes (since you use 32-bit integers on GPU) is already 3 KiB, and you will have some other data as well - so if you do this expansion and you try to keep the expanded keys in local memory, the occupancy would be similar to what you're getting with bf-opencl. This is likely not the way to go. Thus, it appears that if you want to keep keys in local memory, DES_BS_EXPAND must be 0. And then the question is how to implement the indirection and whether you can have the 768-element array shared between your work-items. To have it shared, it won't be an array of pointers, but rather an array of indices into the 56-bit array. And then its elements can be 8-bit, and it will be constant. This is probably the way to go _if_ you're able to keep the keys themselves in local memory without this being too much of a constraint on the number of in-flight wavefronts. To summarize, you'd need to try three approaches (and their variations): 1. Keys in global memory, expanded. 2. Keys in global memory, not expanded. 3. Keys in local memory, not expanded. For LM hashes, it's just #2 or #3 above. Maybe start with #3? Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.