Message-ID: <20120501050110.GA10300@openwall.com>
Date: Tue, 1 May 2012 09:01:10 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Sayantan :Weekly Report #2

Hi Sayantan,

On Tue, May 01, 2012 at 10:06:07AM +0530, SAYANTAN DATTA wrote:
> Accomplishments:
> 1. Implemented a function to find the optimum local work group size in
> opencl-mscash2.

Is this in magnum-jumbo? It seems not. If so, where is it?

> 2. The following are not an accomplishment in strict sense but more of a
> kind of experiment and the results should be helpful in future:
> a. rotate() function caused huge performance drop on 7970 on bull.

This is puzzling. Perhaps you can try reviewing the generated code (IL or native) to figure out the cause of the performance drop? In general, I think we (as a team) should learn to do that.

> I replaced the rotate function with a macro resulting in nearly 50%
> performance improvement. With this change the opencl-mscash2 produces
> around 72-73K real/s.
> b. other cards(4890,570) I tested don't have such issues with rotate().
> c. Vectorization of the code caused small(2-3%) performance drop on
> 7970 and 570.

OK. The code currently in magnum-jumbo only achieves 36k c/s on 7970:

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=mscash2-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Benchmarking: MSCASH2-OPENCL [PBKDF2_HMAC_SHA1]... DONE
Raw: 35754 c/s real, 50592 c/s virtual

> Priorities:
> 1. More experimentation and optimization.

OK.

One task I think you could approach slightly later is trying to implement and optimize Eksblowfish on GPU. As discussed before, we expect it to be slow, but it'd be useful to have some hard data to prove this - or maybe disprove it (unlikely), and to have some OpenCL code (and maybe CUDA as well) that we could run on future GPUs easily as they become available.
Specifically, this may be helpful for design of future password hashing methods. Additionally, this OpenCL code may happen to be readily capable of making use of AVX2's VSIB addressing with Intel's OpenCL SDK - if so, it may actually be faster (on those future CPUs) than the existing CPU code for bcrypt, until we implement proper AVX2 code more directly (perhaps with intrinsics).

I had mentioned this task in GSoC student selection context before, but it may also be approached outside of that context and with slightly different goals as above. In that way, it will actually be useful even if the implementation is indeed slower than the current CPU code on current hardware.

Thanks,

Alexander
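
For readers following the rotate() discussion quoted above: the macro that replaced the built-in is not shown in this thread, so the following is only a minimal sketch of what such a replacement commonly looks like. It is written in plain C rather than as an OpenCL kernel so that it can be compiled anywhere. The names ROTL32 and rotl32 are hypothetical and are not taken from opencl-mscash2. SHA-1, which underlies the PBKDF2-HMAC-SHA1 computation in this format, uses left rotations by the fixed amounts 1, 5 and 30, so the macro's restriction on the shift count is not a problem there.

```c
#include <stdint.h>

/*
 * Hypothetical macro form of a 32-bit left rotate.
 * Only valid for 0 < n < 32: with n == 0 the right shift would be by 32,
 * which is undefined behavior in C.
 */
#define ROTL32(x, n) \
    (((uint32_t)(x) << (n)) | ((uint32_t)(x) >> (32 - (n))))

/*
 * Function form that is also safe for n == 0 and for n >= 32,
 * because both shift counts are reduced modulo 32.
 */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}
```

Whether a given OpenCL compiler turns rotate() or a shift-and-OR macro into a single native instruction depends on the device and driver, which is why inspecting the generated IL or native code, as suggested above, is the reliable way to find out.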